Building a GPT Model from Scratch: A Step-by-Step Guide
In the rapidly evolving field of artificial intelligence, large language models (LLMs) like GPT have revolutionized how we interact with machines.
While pre-trained models are readily available, understanding their inner workings by implementing one from scratch provides invaluable insights for researchers, developers, and enthusiasts alike.
This article draws from the principles outlined in Sebastian Raschka's "Build a Large Language Model (From Scratch)", focusing on Chapter 4, where we construct a GPT-like architecture using PyTorch.
We'll walk through the main steps to implement a simplified GPT model (~124M parameters), akin to GPT-2's smallest variant.
This decoder-only transformer includes:
- Token embeddings
- Positional encodings
- Multi-head attention
- Feed-forward layers
- Layer normalization
- Residual connections
By the end, you'll have a functional model capable of generating text.
Prerequisites
Basic knowledge of Python, PyTorch, and transformer concepts.
We'll also use the tiktoken library for tokenization.
Install dependencies:
pip install torch tiktoken
Step 1: Define Model Configuration and Imports
Start by setting up the environment.
Define a configuration dictionary with hyperparameters for the model; this centralizes settings and makes scaling or modification easy.
import torch
import torch.nn as nn
import tiktoken
GPT_CONFIG_124M = {
    "vocab_size": 50257,     # Vocabulary size from GPT-2 BPE tokenizer
    "context_length": 1024,  # Maximum sequence length
    "emb_dim": 768,          # Embedding dimension
    "n_heads": 12,           # Number of attention heads
    "n_layers": 12,          # Number of transformer layers
    "drop_rate": 0.1,        # Dropout rate for regularization
    "qkv_bias": False        # No bias in query-key-value projections
}
This config mirrors GPT-2's small model.
The vocabulary size comes from the Byte Pair Encoding (BPE) tokenizer, and the context length determines positional embedding limits.
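As a sanity check, we can tally the parameters this config implies by hand. The figures below assume the layer layout built in the later steps (no QKV bias, biased attention output projection and feed-forward layers, bias-free output head); the "124M" name comes from tying the output head to the token embedding matrix, as GPT-2 does.

```python
cfg = {"vocab_size": 50257, "context_length": 1024,
       "emb_dim": 768, "n_heads": 12, "n_layers": 12}
d = cfg["emb_dim"]

tok_emb = cfg["vocab_size"] * d              # token embedding table
pos_emb = cfg["context_length"] * d          # positional embedding table
attn = 3 * d * d + (d * d + d)               # Q/K/V (no bias) + output projection
ff = (d * 4 * d + 4 * d) + (4 * d * d + d)   # expand to 4x, contract back
norms = 2 * 2 * d                            # two LayerNorms (scale + shift each)
block = attn + ff + norms

final_norm = 2 * d
out_head = d * cfg["vocab_size"]             # no bias

total = tok_emb + pos_emb + cfg["n_layers"] * block + final_norm + out_head
print(f"{total:,}")                          # 163,009,536 without weight tying
print(f"{total - out_head:,}")               # 124,412,160 with tied embeddings
```

The untied model actually has ~163M parameters; weight tying (reusing the token embeddings as the output weights) brings it down to the familiar ~124M.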
Step 2: Implement Token and Positional Embeddings
Embeddings convert token IDs into dense vectors.
Token embeddings map vocabulary items to vectors.
Positional embeddings add sequence order information (since transformers lack inherent position awareness).
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # Placeholder for later components...

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Combine embeddings
        x = self.drop_emb(x)  # Apply dropout
        # Transformer blocks, final norm, and output head follow in Step 7
        return x
Here, in_idx is a tensor of token IDs.
We add positional embeddings generated from a range of indices (0 to seq_len-1).
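To make the shapes concrete, here is a toy-sized version of the same lookup (the dimensions are illustrative, not the real config):

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
vocab_size, context_length, emb_dim = 10, 8, 4        # toy sizes for illustration
tok_emb = nn.Embedding(vocab_size, emb_dim)
pos_emb = nn.Embedding(context_length, emb_dim)

in_idx = torch.tensor([[2, 5, 1]])                    # batch of 1, seq_len 3
tok_embeds = tok_emb(in_idx)                          # (1, 3, 4)
pos_embeds = pos_emb(torch.arange(in_idx.shape[1]))   # (3, 4), broadcasts over batch
x = tok_embeds + pos_embeds
print(x.shape)  # torch.Size([1, 3, 4])
```

The positional embeddings depend only on position, so the same (seq_len, emb_dim) tensor is broadcast across every sequence in the batch.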
Step 3: Add Layer Normalization
Layer normalization stabilizes training by normalizing activations across features for each input.
It includes learnable scale and shift parameters for flexibility.
class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # Small epsilon for numerical stability
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift
This normalizes inputs to have zero mean and unit variance, then applies affine transformations.
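A quick numerical check of the normalization step (before the learnable scale and shift, which start at 1 and 0 and therefore leave the values unchanged):

```python
import torch

torch.manual_seed(123)
x = torch.randn(2, 5)  # toy activations
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + 1e-5)

print(norm_x.mean(dim=-1))                 # ~0 per row
print(norm_x.var(dim=-1, unbiased=False))  # ~1 per row
```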
Step 4: Build the Feed-Forward Network
The feed-forward (FF) sub-layer in each transformer block processes embeddings position-wise.
It expands the dimension (typically 4Γ), applies GELU, and then contracts back.
First, define GELU:
class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) *
            (x + 0.044715 * torch.pow(x, 3))
        ))
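This is the tanh approximation of GELU that GPT-2 used. PyTorch's built-in `torch.nn.functional.gelu` (exact, erf-based) agrees with it closely; a quick comparison:

```python
import torch

x = torch.linspace(-3, 3, 7)
approx = 0.5 * x * (1 + torch.tanh(
    torch.sqrt(torch.tensor(2.0 / torch.pi)) * (x + 0.044715 * x**3)
))
exact = torch.nn.functional.gelu(x)          # exact erf-based GELU
print(torch.max(torch.abs(approx - exact)))  # small, well under 1e-2
```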
Then, define the Feed-Forward module:
class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)
This MLP introduces non-linearity and extra capacity to the model.
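A quick shape check (using PyTorch's built-in `nn.GELU` in place of the custom class so the snippet stands alone): the block expands to 4x internally but preserves the input shape, which is what lets it slot into the residual stream.

```python
import torch
import torch.nn as nn

emb_dim = 8  # toy size
ff = nn.Sequential(
    nn.Linear(emb_dim, 4 * emb_dim),
    nn.GELU(approximate="tanh"),      # built-in tanh approximation of GELU
    nn.Linear(4 * emb_dim, emb_dim),
)
x = torch.randn(2, 3, emb_dim)        # (batch, seq_len, emb_dim)
print(ff(x).shape)                    # torch.Size([2, 3, 8])
```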
Step 5: Integrate Multi-Head Attention
Multi-head attention allows the model to focus on different parts of the input simultaneously.
# Assuming MultiHeadAttention is defined as per previous chapters
class MultiHeadAttention(nn.Module):
    # Implementation details: query, key, value projections, causal masking, etc.
    # For brevity, refer to standard transformer attention code.
    pass
In practice, this includes Q, K, V linear projections, scaled dot-product attention, and concatenation of heads.
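If you don't have the earlier chapters' code handy, here is a minimal causal multi-head attention sketch consistent with the constructor signature used in Step 6 below; treat it as an illustrative stand-in rather than the book's exact listing.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask hides future positions (causal attention)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project, then split into heads: (b, num_heads, num_tokens, head_dim)
        q = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        # Scaled dot-product attention with causal masking
        attn_scores = q @ k.transpose(2, 3)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, -torch.inf)
        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)
        # Merge heads back to (b, num_tokens, d_out) and project
        context = (attn_weights @ v).transpose(1, 2).contiguous().view(b, num_tokens, self.d_out)
        return self.out_proj(context)
```

Note the output shape matches the input shape, so the module drops straight into the residual stream of the transformer block.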
Step 6: Construct the Transformer Block
Each transformer block combines attention and feed-forward layers with residuals and normalization.
Residuals (skip connections) help gradients flow through deep networks.
class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"],
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"]
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Attention with residual
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x = x + shortcut
        # Feed-forward with residual
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x = x + shortcut
        return x
Pre-normalization (norm before sub-layers) is used, as in modern GPT variants.
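To see why the skip connections matter: even if a sub-layer contributes nothing, the identity path still carries the gradient straight through. A toy illustration (a contrived setup, not part of the model, with the sub-layer zeroed so only the skip path remains):

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
with torch.no_grad():    # zero the sub-layer so only the skip path remains
    layer.weight.zero_()
    layer.bias.zero_()

x = torch.randn(1, 4, requires_grad=True)
y = layer(x) + x         # residual connection
y.sum().backward()
print(x.grad)            # all ones: the gradient survives via the identity path
```

Without the `+ x` shortcut, the gradient here would be zero; in a deep stack, residuals keep gradients from vanishing the same way.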
Step 7: Assemble the Full GPT Model
Stack multiple transformer blocks, add final normalization, and an output head to produce logits over the vocabulary.
class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )
        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits
Instantiate your model:
model = GPTModel(GPT_CONFIG_124M)
Step 8: Implement Text Generation
To generate text, use autoregressive sampling:
Predict the next token, append it, repeat.
def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]  # Crop to context length
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]  # Last token's logits
        probas = torch.softmax(logits, dim=-1)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)  # Greedy decoding
        idx = torch.cat((idx, idx_next), dim=1)
    return idx
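To check the loop's mechanics without a trained network, we can feed it a dummy "model" that deterministically predicts the next token ID (a contrived stand-in, not a real language model; the function is repeated here only so the snippet stands alone):

```python
import torch
import torch.nn as nn

def generate_text_simple(model, idx, max_new_tokens, context_size):
    # Same greedy loop as above
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]
        idx_next = torch.argmax(torch.softmax(logits, dim=-1), dim=-1, keepdim=True)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

class EchoNext(nn.Module):
    """Dummy 'model': puts all probability on (token_id + 1) % vocab."""
    def forward(self, idx):
        return nn.functional.one_hot((idx + 1) % 5, num_classes=5).float()

out = generate_text_simple(EchoNext(), torch.tensor([[0]]),
                           max_new_tokens=4, context_size=5)
print(out)  # tensor([[0, 1, 2, 3, 4]])
```

Each iteration appends exactly one token, so the output sequence grows by max_new_tokens.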
Tokenize input, generate, and decode:
tokenizer = tiktoken.get_encoding("gpt2")
start_context = "Hello, I am"  # example prompt
encoded = tokenizer.encode(start_context)
idx = torch.tensor(encoded).unsqueeze(0)  # Add batch dimension
model.eval()  # Disable dropout for generation
out = generate_text_simple(model, idx, max_new_tokens=10,
                           context_size=GPT_CONFIG_124M["context_length"])
print(tokenizer.decode(out.squeeze(0).tolist()))
Since the model is untrained, the generated text will be incoherent until training.
Conclusion
This implementation yields a GPT-like model ready for training (covered in later chapters).
With ~124M parameters, it demonstrates the elegance of transformers: repetitive blocks that scale beautifully.
Experiment by training on datasets like OpenWebText, and explore optimizations like weight tying.
For the full notebook, visit rasbt/LLMs-from-scratch on GitHub.
Stay tuned for more on LLMs in upcoming newsletters; subscribe for deep dives into AI frontiers!