Road to ChatGPT: The Math of the Model Behind It

Explore the transition from RNNs to transformers, leading up to GPT-3, a colossal language model with 175 billion parameters and a major advance over GPT-1 and GPT-2.

Since its founding in 2015, OpenAI has invested in the research and development of a class of models called generative pre-trained transformers (GPT), which have captured widespread attention as the engine behind ChatGPT.

History

GPT models are based on the transformer architecture, introduced in 2017 by Google researchers in the paper “Attention Is All You Need.”

The transformer architecture was introduced to overcome the limitations of traditional recurrent neural networks (RNNs). RNNs were first introduced in the 1980s by researchers at the Los Alamos National Laboratory, but they did not gain much attention until the 1990s. The original idea behind RNNs was to process sequential or time-series data while carrying information across time steps.

Until then, the classic artificial neural network (ANN) architecture was the feedforward ANN, in which the output of each hidden layer feeds the next layer without retaining any memory of past inputs.

What we’ll learn

To understand the idea behind the transformer, we need to start from its origins. We will discuss the following topics:

  • The structure of RNNs

  • The main limitations of RNNs

  • How those limitations have been overcome with the introduction of new architectural elements, including positional encoding, self-attention, and the feedforward layer

  • How we arrived at state-of-the-art models like GPT and ChatGPT

Let’s start with the architecture of transformers’ predecessors.

The structure of RNNs

Let’s imagine we want to predict a house price. If we only had today’s price, we could use a feedforward architecture: apply a non-linear transformation to the input via a hidden layer (with an activation function) and obtain a forecast of tomorrow’s price.
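As a minimal sketch of this idea (the toy prices, the hidden-layer size, the learning rate, and the training loop below are illustrative assumptions, not part of the lesson), a single hidden layer with a tanh activation can map today’s price to a forecast of tomorrow’s price:

```python
import numpy as np

# Feedforward network with one non-linear hidden layer: input is today's
# price, output is the forecast for tomorrow. All numbers are illustrative.

rng = np.random.default_rng(0)

# Hypothetical toy data: today's price -> tomorrow's price.
x = np.array([[100.0], [101.0], [103.0], [102.0]])  # today's prices
y = np.array([[101.0], [103.0], [102.0], [104.0]])  # tomorrow's prices

# Scale values to keep the tanh activation in a sensible range.
x_s, y_s = x / 100.0, y / 100.0

# One hidden layer with 4 units, randomly initialized.
W1 = rng.normal(0, 0.5, (1, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (4, 1)); b2 = np.zeros(1)

lr = 0.1
for _ in range(2000):
    # Forward pass: non-linear transformation via the hidden layer.
    h = np.tanh(x_s @ W1 + b1)   # hidden activations
    y_hat = h @ W2 + b2          # forecast for tomorrow
    err = y_hat - y_s            # prediction error

    # Backward pass: gradients of the mean squared error.
    dW2 = h.T @ err / len(x_s); db2 = err.mean(axis=0)
    dh = err @ W2.T * (1 - h**2)          # tanh derivative
    dW1 = x_s.T @ dh / len(x_s); db1 = dh.mean(axis=0)

    # Gradient-descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

# Forecast tomorrow's price given today's price of 104 (scaled back up).
today = np.array([[104.0 / 100.0]])
print((np.tanh(today @ W1 + b1) @ W2 + b2) * 100.0)
```

Note how the network sees only a single input at a time: nothing in the forward pass remembers yesterday’s price, which is exactly the limitation RNNs were designed to address.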
