Ali's Newsletter

🚀 Building a GPT Model from Scratch: A Step-by-Step Guide

In the rapidly evolving field of artificial intelligence, large language models (LLMs) like GPT have revolutionized how we interact with machines 🤖✨.

While pre-trained models are readily available, understanding their inner workings by implementing one from scratch provides invaluable insights for researchers, developers, and enthusiasts alike.

This article draws from the principles outlined in Sebastian Raschka's “Build a Large Language Model (From Scratch)”, focusing on Chapter 4, where we construct a GPT-like architecture using PyTorch.

We'll walk through the main steps to implement a simplified GPT model (~124M parameters), akin to GPT-2's smallest variant.

This decoder-only transformer includes:
🔹 Token embeddings
🔹 Positional encodings
🔹 Multi-head attention
🔹 Feed-forward layers
🔹 Layer normalization
🔹 Residual connections

By the end, you'll have a functional model capable of generating text 💬

🧰 Prerequisites

Basic knowledge of Python, PyTorch, and transformer concepts.
We’ll also use the tiktoken library for tokenization.

Install dependencies:

pip install torch tiktoken

🧩 Step 1: Define Model Configuration and Imports

Start by setting up the environment.
Define a configuration dictionary with hyperparameters for the model β€” this centralizes settings and makes scaling or modification easy.

import torch
import torch.nn as nn
import tiktoken

GPT_CONFIG_124M = {
    "vocab_size": 50257,    # Vocabulary size from GPT-2 BPE tokenizer
    "context_length": 1024, # Maximum sequence length
    "emb_dim": 768,         # Embedding dimension
    "n_heads": 12,          # Number of attention heads
    "n_layers": 12,         # Number of transformer layers
    "drop_rate": 0.1,       # Dropout rate for regularization
    "qkv_bias": False       # No bias in query-key-value projections
}

This config mirrors GPT-2’s small model.
The vocabulary size comes from the Byte Pair Encoding (BPE) tokenizer, and the context length determines positional embedding limits.
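As a quick sanity check on the "124M" name, we can tally the parameters this config implies with plain arithmetic (a back-of-envelope sketch, assuming the bias settings used in the modules we build in the following steps: no QKV bias, biases in the other linear layers):

```python
V, C, D, N = 50257, 1024, 768, 12  # vocab_size, context_length, emb_dim, n_layers

tok_emb = V * D                             # token embedding table
pos_emb = C * D                             # positional embedding table
attn = 3 * D * D + (D * D + D)              # QKV projections (no bias) + output projection
ff = (D * 4 * D + 4 * D) + (4 * D * D + D)  # expand and contract linear layers
norms = 2 * (2 * D)                         # two LayerNorms, each with scale + shift
per_block = attn + ff + norms
out_head = D * V                            # output head (no bias)

total = tok_emb + pos_emb + N * per_block + 2 * D + out_head
print(f"{total:,}")             # 163,009,536 without weight tying
print(f"{total - out_head:,}")  # 124,412,160 if the output head shares the token embedding
```

The raw count is about 163M; the familiar "124M" figure assumes GPT-2-style weight tying between the token embedding and the output head.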

🧠 Step 2: Implement Token and Positional Embeddings

Embeddings convert token IDs into dense vectors.

  • Token embeddings map vocabulary items to vectors.

  • Positional embeddings add sequence order information (since transformers lack inherent position awareness).

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        # Placeholder for later components...
    
    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds  # Combine embeddings
        x = self.drop_emb(x)         # Apply dropout
        # Transformer blocks will be added here in Step 7...
        return x

Here, in_idx is a tensor of token IDs.
We add positional embeddings generated from a range of indices (0 → seq_len-1).
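To see the shapes involved, here is a tiny standalone sketch (the token IDs are arbitrary examples):

```python
import torch
import torch.nn as nn

torch.manual_seed(123)
tok_emb = nn.Embedding(50257, 768)  # vocab_size x emb_dim
pos_emb = nn.Embedding(1024, 768)   # context_length x emb_dim

in_idx = torch.tensor([[16833, 3626, 6100],
                       [40, 1107, 588]])           # shape (batch=2, seq_len=3)
x = tok_emb(in_idx) + pos_emb(torch.arange(3))     # broadcasting adds positions to each row
print(x.shape)  # torch.Size([2, 3, 768])
```

The positional embedding of shape (3, 768) broadcasts across the batch dimension, so every sequence in the batch receives the same position vectors.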

βš–οΈ Step 3: Add Layer Normalization

Layer normalization stabilizes training by normalizing activations across features for each input.
It includes learnable scale and shift parameters for flexibility.

class LayerNorm(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.eps = 1e-5  # Small epsilon for numerical stability
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        norm_x = (x - mean) / torch.sqrt(var + self.eps)
        return self.scale * norm_x + self.shift

This normalizes inputs to have zero mean and unit variance, then applies affine transformations.
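A quick numerical check of what this does, using raw tensor ops rather than the class so it runs standalone:

```python
import torch

torch.manual_seed(123)
x = torch.randn(2, 5)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, keepdim=True, unbiased=False)
norm_x = (x - mean) / torch.sqrt(var + 1e-5)

print(norm_x.mean(dim=-1))                 # each row is ~0
print(norm_x.var(dim=-1, unbiased=False))  # each row is ~1
```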

βš™οΈ Step 4: Build the Feed-Forward Network

The feed-forward (FF) sub-layer in each transformer block processes embeddings position-wise.
It expands the dimension (typically 4×), applies GELU, and then contracts back.

First, define GELU:

class GELU(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return 0.5 * x * (1 + torch.tanh(
            torch.sqrt(torch.tensor(2.0 / torch.pi)) * 
            (x + 0.044715 * torch.pow(x, 3))
        ))
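This tanh formula is the same approximation that recent PyTorch versions (1.12+) ship as `F.gelu(..., approximate="tanh")`, so we can sanity-check our version against the built-in:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
ours = 0.5 * x * (1 + torch.tanh(
    torch.sqrt(torch.tensor(2.0 / torch.pi)) *
    (x + 0.044715 * torch.pow(x, 3))
))
builtin = F.gelu(x, approximate="tanh")
print(torch.allclose(ours, builtin, atol=1e-5))  # True
```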

Then, define the Feed-Forward module:

class FeedForward(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(cfg["emb_dim"], 4 * cfg["emb_dim"]),
            GELU(),
            nn.Linear(4 * cfg["emb_dim"], cfg["emb_dim"]),
        )

    def forward(self, x):
        return self.layers(x)

This MLP introduces non-linearity and extra capacity to the model.
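The expand-then-contract shape behavior is easy to verify with built-in layers standing in for the custom GELU (a standalone sketch, not the article's class):

```python
import torch
import torch.nn as nn

ff = nn.Sequential(
    nn.Linear(768, 4 * 768),      # expand: 768 -> 3072
    nn.GELU(approximate="tanh"),  # built-in stand-in for the custom GELU above
    nn.Linear(4 * 768, 768),      # contract: 3072 -> 768
)
x = torch.rand(2, 3, 768)
print(ff(x).shape)  # torch.Size([2, 3, 768]); input shape is preserved
```

Because the output shape matches the input shape, this sub-layer can be stacked and wrapped in residual connections without any reshaping.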

πŸ•ΈοΈ Step 5: Integrate Multi-Head Attention

Multi-head attention allows the model to focus on different parts of the input simultaneously πŸ§­.

# Assuming MultiHeadAttention is defined as per previous chapters
class MultiHeadAttention(nn.Module):
    # Implementation details: query, key, value projections, causal masking, etc.
    # For brevity, refer to standard transformer attention code.
    pass

In practice, this includes Q, K, V linear projections, scaled dot-product attention, and concatenation of heads.
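Since the article leaves this module as a placeholder, here is a minimal sketch of a causal multi-head attention module following the standard pattern from the book's earlier chapters, with the same constructor arguments used in Step 6 (d_in, d_out, context_length, num_heads, dropout, qkv_bias):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, num_heads, dropout, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.d_out = d_out
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads

        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)  # combine head outputs
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask blocks attention to future tokens (causal masking)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(context_length, context_length), diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        # Project, then split into heads: (b, tokens, d_out) -> (b, heads, tokens, head_dim)
        queries = self.W_query(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        keys = self.W_key(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)
        values = self.W_value(x).view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        # Scaled dot-product attention with causal masking
        attn_scores = queries @ keys.transpose(2, 3)
        mask_bool = self.mask.bool()[:num_tokens, :num_tokens]
        attn_scores.masked_fill_(mask_bool, float("-inf"))
        attn_weights = torch.softmax(attn_scores / self.head_dim**0.5, dim=-1)
        attn_weights = self.dropout(attn_weights)

        # Merge heads back: (b, heads, tokens, head_dim) -> (b, tokens, d_out)
        context = (attn_weights @ values).transpose(1, 2).contiguous().view(b, num_tokens, self.d_out)
        return self.out_proj(context)
```

For example, `MultiHeadAttention(d_in=768, d_out=768, context_length=1024, num_heads=12, dropout=0.1)` maps a (batch, tokens, 768) tensor to the same shape.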

🧱 Step 6: Construct the Transformer Block

Each transformer block combines attention and feed-forward layers with residuals and normalization.
Residuals (skip connections) help gradients flow through deep networks.

class TransformerBlock(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.att = MultiHeadAttention(
            d_in=cfg["emb_dim"],
            d_out=cfg["emb_dim"],
            context_length=cfg["context_length"],
            num_heads=cfg["n_heads"], 
            dropout=cfg["drop_rate"],
            qkv_bias=cfg["qkv_bias"]
        )
        self.ff = FeedForward(cfg)
        self.norm1 = LayerNorm(cfg["emb_dim"])
        self.norm2 = LayerNorm(cfg["emb_dim"])
        self.drop_shortcut = nn.Dropout(cfg["drop_rate"])

    def forward(self, x):
        # Attention with residual
        shortcut = x
        x = self.norm1(x)
        x = self.att(x)
        x = self.drop_shortcut(x)
        x += shortcut

        # Feed-forward with residual
        shortcut = x
        x = self.norm2(x)
        x = self.ff(x)
        x = self.drop_shortcut(x)
        x += shortcut

        return x

🧩 Pre-normalization (norm before sub-layers) is used here, as in GPT-2 and modern GPT variants.

πŸ—οΈ Step 7: Assemble the Full GPT Model

Stack multiple transformer blocks, add final normalization, and an output head to produce logits over the vocabulary.

class GPTModel(nn.Module):
    def __init__(self, cfg):
        super().__init__()
        self.tok_emb = nn.Embedding(cfg["vocab_size"], cfg["emb_dim"])
        self.pos_emb = nn.Embedding(cfg["context_length"], cfg["emb_dim"])
        self.drop_emb = nn.Dropout(cfg["drop_rate"])
        
        self.trf_blocks = nn.Sequential(
            *[TransformerBlock(cfg) for _ in range(cfg["n_layers"])]
        )

        self.final_norm = LayerNorm(cfg["emb_dim"])
        self.out_head = nn.Linear(cfg["emb_dim"], cfg["vocab_size"], bias=False)

    def forward(self, in_idx):
        batch_size, seq_len = in_idx.shape
        tok_embeds = self.tok_emb(in_idx)
        pos_embeds = self.pos_emb(torch.arange(seq_len, device=in_idx.device))
        x = tok_embeds + pos_embeds
        x = self.drop_emb(x)
        x = self.trf_blocks(x)
        x = self.final_norm(x)
        logits = self.out_head(x)
        return logits

Instantiate your model:

model = GPTModel(GPT_CONFIG_124M)

✍️ Step 8: Implement Text Generation

To generate text, use autoregressive sampling:
Predict the next token → append it → repeat 🔁

def generate_text_simple(model, idx, max_new_tokens, context_size):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]  # Crop to context length
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]  # Last token's logits
        probas = torch.softmax(logits, dim=-1)
        idx_next = torch.argmax(probas, dim=-1, keepdim=True)
        idx = torch.cat((idx, idx_next), dim=1)
    return idx

Tokenize input, generate, and decode:

tokenizer = tiktoken.get_encoding("gpt2")

🧾 Conclusion

🎯 This implementation yields a GPT-like model ready for training (covered in later chapters).
With ~124M parameters, it demonstrates the elegance of transformers: repetitive blocks that scale beautifully.

Experiment by training on datasets like OpenWebText, and explore optimizations like weight tying.

📘 For the full notebook, visit rasbt/LLMs-from-scratch on GitHub.

💌 Stay tuned for more on LLMs in upcoming newsletters; subscribe for deep dives into AI frontiers! 🚀