How a "GPT" Works

What is a GPT?

In Artificial Intelligence, GPT stands for "Generative Pre-trained Transformer." This means the GPT (called the model) is able to "synthesize" an output to a user input because it was previously given data to train on, and it finds patterns hidden within that data. The word "transformer" has a little more backstory that we need to dive into.

The idea of the transformer was first introduced in 2017, in a paper titled "Attention Is All You Need" by eight Google scientists. This paper introduced a new way of training models: allowing them to dynamically focus on the important elements of the input, using what is called the attention mechanism (Vaswani et al., 2017). In a modern transformer network (at least the decoder-only kind I will explain in this article), the last token is used to determine the next token/output. This token's value is modified using the context given by the user in previous interactions and the prompt itself.

Steps in a GPT

*Credit to 3Blue1Brown; I will be using his explanation of how a decoder-only transformer works.

1) Token Assigning: A body of text is broken down into what are called "tokens" (sometimes words, parts of words, a few pixels of an image, etc.). Here's an example that we will use:


|I| |love| |a| |lot| |of| |cho| |co| |late| |.| |Th| |is| |me| |ans| |I| |li| |ke|

(Notice how "chocolate" is broken down into several chunks.)
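To see how a word can end up in chunks like this, here is a toy sketch of a tokenizer. Real GPTs use a learned subword vocabulary (typically built with Byte Pair Encoding); the tiny hand-made vocabulary below is purely hypothetical and just illustrates the greedy "longest chunk first" idea.

```python
# Toy greedy tokenizer. The vocabulary is made up for this example;
# a real model's vocabulary is learned from data (e.g. via BPE).
VOCAB = {"I", "love", "a", "lot", "of", "cho", "co", "late", ".",
         "Th", "is", "me", "ans", "li", "ke"}

def tokenize(word):
    """Split one word into the longest vocabulary chunks, left to right."""
    tokens = []
    while word:
        for end in range(len(word), 0, -1):   # try the longest prefix first
            if word[:end] in VOCAB:
                tokens.append(word[:end])
                word = word[end:]
                break
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return tokens

print(tokenize("chocolate"))  # ['cho', 'co', 'late']
print(tokenize("This"))       # ['Th', 'is']
```

Notice that "chocolate" isn't in the vocabulary, so it falls apart into the chunks that are, exactly like in the example above.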


2) Assigning Vectors: Each token is then assigned a vector. For example, the token "of" could be assigned a long list of numbers (in practice, hundreds or thousands of them).



(A vector is nothing but an arrow pointing to a location in a high-dimensional space. Yes, there are more than three dimensions.)

 Credit: Desmos 3D, used it to model a vector. 
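To make this concrete, here is a sketch of an embedding table: a lookup from each token to its vector. The numbers below are made up and only three-dimensional; a real model learns these values during training and uses far more dimensions.

```python
# Hypothetical embedding table: token -> vector. Values are invented
# for illustration; real embeddings are learned, not hand-written.
EMBEDDINGS = {
    "I":    [0.1, 0.9, -0.2],
    "love": [3.0, 6.0, 7.0],
    "like": [3.0, 4.0, 7.0],
    "of":   [0.5, -1.0, 0.3],
}

def embed(tokens):
    """Look up the vector assigned to each token."""
    return [EMBEDDINGS[t] for t in tokens]

print(embed(["I", "love"]))  # [[0.1, 0.9, -0.2], [3.0, 6.0, 7.0]]
```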



3) Visualizing Tokens (this is just for us to see what's going on): Visualizing vectors in the manner above, we can understand that words of similar meaning will, in our case, sit closer to each other. If the vector from (1,0,0) to (3,4,7) is the word "like," the vector from (1,0,0) to (3,6,7) could be the word "love."


 
Credit: Desmos 3D, used it to model vectors.
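One common way to measure this "closeness" is cosine similarity: vectors pointing in nearly the same direction score close to 1. Using the two example vectors above (taking the arrow from (1,0,0) to each endpoint as the displacement):

```python
import math

# "like" and "love" as displacement vectors from the example above.
like = [3 - 1, 4 - 0, 7 - 0]   # (2, 4, 7)
love = [3 - 1, 6 - 0, 7 - 0]   # (2, 6, 7)

def cosine_similarity(u, v):
    """1.0 means pointing the same way; 0.0 means unrelated directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(round(cosine_similarity(like, love), 2))  # about 0.98, very similar
```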



4) Doing This for All Tokens: The network assigns a vector to every token in the input.

5) Passing Through the Attention Mechanism: The tokens are passed into the "attention mechanism" (sounds familiar, right?), where the values inside each token's vector are nudged toward the tokens around it that give it context. In our case, "chocolate" gives context to what the user "love[s]," so those tokens will be focused on more than others.
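Here is a stripped-down sketch of that idea. A real transformer learns separate query, key, and value projection matrices for each attention head; this simplified version skips those and uses the raw (made-up) token vectors directly, just to show how one token's vector becomes a weighted blend of the vectors around it.

```python
import math

def softmax(xs):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector.
    Each score says how strongly one token should attend to another;
    the output is a weighted blend of the value vectors."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Hypothetical 2-D vectors for "love" attending over its context.
vecs = {"I": [0.1, 0.9], "love": [3.0, 6.0], "chocolate": [2.5, 5.5]}
context = ["I", "love", "chocolate"]
out = attention(vecs["love"], [vecs[t] for t in context],
                [vecs[t] for t in context])
print(out)  # "love"'s vector, pulled toward the tokens it attends to
```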

6) Passing Through the Multi-Layer Perceptron: After the attention step, tokens go into the "Multi-Layer Perceptron" (MLP), a process all tokens go through where they get transformed to help produce the desired output (it's really how all neural networks work to get an output). It takes each token as an input and tries to filter out discrepancies in the tokens that could later affect the output.
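In a transformer, the MLP applied to each token is typically two layers: expand the vector, apply a nonlinearity, then project it back down. The weights below are invented for illustration; in a real model they are learned during training.

```python
def relu(x):
    """The nonlinearity: keep positives, zero out negatives."""
    return max(0.0, x)

def mlp(vec, w1, b1, w2, b2):
    """A two-layer perceptron applied to one token's vector."""
    hidden = [relu(sum(v * w for v, w in zip(vec, row)) + b)
              for row, b in zip(w1, b1)]                  # expand (2 -> 4)
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(w2, b2)]                    # project (4 -> 2)

# Tiny made-up weights: a 2-dimensional token vector, 4 hidden units.
w1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4], [0.7, 0.2]]
b1 = [0.0, 0.1, -0.1, 0.0]
w2 = [[0.2, -0.1, 0.4, 0.3], [0.6, 0.2, -0.5, 0.1]]
b2 = [0.0, 0.0]

print(mlp([3.0, 6.0], w1, b1, w2, b2))  # the transformed token vector
```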

7) Probability Distribution: Steps 5 and 6 are repeated several times to refine the meaning of the tokens. At the end, the last token's vector is used to produce the "probability distribution," or the probability of each of several possible next tokens. The one with the highest probability (on a 0-1 scale) will be picked as the next token. The raw scores aren't on a 0-1 scale by themselves, so a SoftMax is applied to turn them into probabilities, and an ArgMax picks the most likely one. Here's my article about ArgMax and SoftMax.
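Here is that last step in miniature. The candidate tokens and their raw scores (logits) are invented for illustration:

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for a few candidate next tokens.
candidates = ["chocolate", "pizza", "music", "."]
logits = [4.0, 1.5, 0.5, 0.2]

probs = softmax(logits)
best = candidates[probs.index(max(probs))]   # the ArgMax step
print(best)                                  # chocolate
```

The softmax squashes the scores onto the 0-1 scale (and makes them sum to 1), and the argmax simply picks the biggest one.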

8) Repeat: Repeat this several times, appending each new token to the input, until you get the answer you are looking for. So, our response could be:

"chocolate."
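The whole loop can be sketched like this. Here `predict_next` is a hypothetical stand-in for the entire transformer stack described in steps 2-7, with canned answers so the example runs:

```python
# Sketch of the outer generation loop: predict one token, append it,
# and feed the longer sequence back in.
def predict_next(tokens):
    """Stand-in for the transformer: real models compute a probability
    distribution; this toy version just returns canned continuations."""
    canned = {"ke": "cho", "cho": "co", "co": "late", "late": "."}
    return canned.get(tokens[-1], "<end>")

tokens = ["Th", "is", "me", "ans", "I", "li", "ke"]
while tokens[-1] != "." and len(tokens) < 20:   # stop at "." (or a cap)
    tokens.append(predict_next(tokens))

print("".join(tokens[-4:]))  # chocolate.
```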

Why use them? Where are they used in the real world?

GPTs can be very useful when it comes to just getting a simple answer. Most people use ChatGPT by OpenAI to be more efficient, to find a simpler and cleaner explanation than what the internet could offer, to learn from it, and much more. Some other examples are Microsoft Copilot and Gemini by Google.

Fun Fact: Siri by Apple and Google Translate are AI Models!

In a broader sense, transformers are used to create coherent sentences in a way that previous machine learning models could not. This field is called Natural Language Processing (NLP), which attempts to make AI understand and retain what it reads from humans, and write it as well.

Therefore, transformers are typically used for tasks that require a lot of context clues and large inputs, boosting speed by focusing on the important key tokens (like humans do). In the real world, transformers can be used for machine translation (using a machine to translate text from one language to another), sentiment analysis, analyzing input data, and much more.

Why shouldn't you use them? What are the drawbacks?

When it comes to visual processing, transformers in general are not your friend. Transformers need information to be in specific, structured sequences of data. Sure, Vision Transformers can provide this, but they aren't extremely effective compared to other models. They are also not good when you only have a limited dataset and can only build a small model, much like a human (we need a bunch of "training" in order to gain context into something). Transformers always need large amounts of data to be trained on, which makes them best suited for big-scale projects.

There are several other types of networks, such as Convolutional Neural Networks (CNNs, used for image processing), Recurrent Neural Networks (RNNs, used for smaller data sizes and smaller model needs), Kolmogorov-Arnold Networks (KANs, which are pretty brand new and an alternative to Multi-Layer Perceptrons), the Neocognitron (used for pattern recognition tasks, and the inspiration for the CNN), and many more.

*Note: CNNs and RNNs can't be pinpointed to a single creator, so I linked IBM articles that talk about each one in depth.

Works Referenced/Extra Resources

These were the works I referenced while researching this article. I also included some other resources that I thought would be beneficial for everybody. All other resources that were used are already linked above.

"What are transformers in artificial intelligence?" by AWS.
"But what is a GPT? Visual intro to transformers | Chapter 5, Deep Learning" by 3Blue1Brown.
"The Attention Mechanism from Scratch" by Stefania Cristina (Machine Learning Mastery).
"What Is a Transformer Model?" by Rick Merritt (NVIDIA).
"A Comprehensive Overview of Transformer-Based Models: Encoders, Decoders, and More" by Minhajul Hoque (Medium).
"Understanding Transformer Neural Network Model in Deep Learning and NLP" by Turing.
"Transformers in Machine Learning" by GeeksforGeeks.
