Ali's Newsletter
⚙️ Training the GPT Model from Scratch: A Step-by-Step Guide (Part 2)

Welcome back to our series on building large language models (LLMs) from scratch 👋

Ali Ali
October 09, 2025

In Part 1, we focused on the architecture of a GPT-like model, inspired by Sebastian Raschka’s “Build a Large Language Model (From Scratch)” (Chapter 4).

Now, in Part 2 (based on Chapter 5), we dive into pretraining the model on unlabeled data 🧠📊

Pretraining involves optimizing the model to predict the next token in a sequence, enabling it to learn language patterns, grammar, and facts from vast text corpora.

We’ll cover:
🔹 Data preparation
🔹 Loss calculation & evaluation metrics
🔹 The training loop
🔹 Advanced text generation sampling
🔹 Saving/loading checkpoints
🔹 Integrating pretrained weights from OpenAI

By the end, you’ll have a functional pretrained GPT-2 small (124M parameters) model! 🚀

🧰 Prerequisites

PyTorch
tiktoken
The GPT model from Part 1

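Install dependencies (tensorflow and tqdm are only needed by the helper script that downloads the OpenAI weights in Step 6):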

pip install torch tiktoken tensorflow tqdm

Step 1: Prepare the Dataset and Dataloaders

Load a text corpus (e.g., a short story for demo) and split it into train/validation sets.
Then, create dataloaders to batch tokenized sequences.

import os
import urllib.request

import tiktoken
import torch
from previous_chapters import create_dataloader_v1  # From Chapter 2

# Download the sample text on the first run; afterwards read the local copy
file_path = "the-verdict.txt"
url = "https://raw.githubusercontent.com/rasbt/LLMs-from-scratch/main/ch02/01_main-chapter-code/the-verdict.txt"
if not os.path.exists(file_path):
    with urllib.request.urlopen(url) as response:
        text_data = response.read().decode('utf-8')
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text_data)
else:
    with open(file_path, "r", encoding="utf-8") as file:
        text_data = file.read()

tokenizer = tiktoken.get_encoding("gpt2")
train_ratio = 0.90
split_idx = int(train_ratio * len(text_data))
train_data = text_data[:split_idx]
val_data = text_data[split_idx:]

torch.manual_seed(123)
train_loader = create_dataloader_v1(
    train_data,
    batch_size=2,
    max_length=256,  # Context length
    stride=256,
    drop_last=True,
    shuffle=True,
    num_workers=0
)

val_loader = create_dataloader_v1(
    val_data,
    batch_size=2,
    max_length=256,
    stride=256,
    drop_last=False,
    shuffle=False,
    num_workers=0
)

Because stride equals max_length, the windows do not overlap. Each batch holds input/target pairs in which the target is the input shifted one token to the right.
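To make the windowing concrete, here is a minimal pure-Python sketch of how a stride equal to the window length produces non-overlapping input/target pairs. The function name sliding_windows and the toy token IDs are illustrative assumptions. The real create_dataloader_v1 wraps the same idea in a PyTorch Dataset and DataLoader, and its details may differ.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one token."""
    pairs = []
    # Stop early enough that the shifted target window still fits inside token_ids.
    for start in range(0, len(token_ids) - max_length, stride):
        inputs = token_ids[start : start + max_length]
        targets = token_ids[start + 1 : start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


toy_ids = list(range(10))  # stand-in for tokenizer output: [0, 1, ..., 9]
pairs = sliding_windows(toy_ids, max_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

With a stride smaller than max_length the windows would overlap, which yields more training samples from the same text at the cost of redundancy.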

📊 Step 2: Define Evaluation Metrics (Cross-Entropy and Perplexity)

Use cross-entropy loss for next-token prediction.
Perplexity (exp(loss)) measures model uncertainty — lower is better.
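As a small illustration of both metrics, the sketch below checks on toy tensors that cross-entropy over flattened logits is the average negative log-probability of the correct next tokens, and that perplexity is its exponential. The tensor sizes and random values are made up for the example and have nothing to do with GPT-2's real dimensions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 3, 5  # toy sizes, not GPT-2's
logits = torch.randn(batch_size, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# Merge the batch and sequence dimensions so every position counts as one sample.
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# By hand: average negative log-probability assigned to each correct next token.
log_probs = torch.log_softmax(logits, dim=-1)
target_log_probs = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
manual_loss = -target_log_probs.mean()

perplexity = torch.exp(loss)
print(f"loss={loss.item():.4f}  manual={manual_loss.item():.4f}  perplexity={perplexity.item():.2f}")
```

The two loss values should agree up to floating-point error, which is why a single cross_entropy call is enough in practice.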

def calc_loss_loader(data_loader, model, device, num_batches=None):
    total_loss = 0.
    if len(data_loader) == 0:
        return float("nan")
    elif num_batches is None:
        num_batches = len(data_loader)
    else:
        num_batches = min(num_batches, len(data_loader))
    for i, (inputs, targets) in enumerate(data_loader):
        if i < num_batches:
            inputs, targets = inputs.to(device), targets.to(device)
            logits = model(inputs)
            loss = torch.nn.functional.cross_entropy(logits.flatten(0, 1), targets.flatten())
            total_loss += loss.item()
        else:
            break
    return total_loss / num_batches

# Example usage (model is the GPTModel instance built in Part 1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
with torch.no_grad():
    train_loss = calc_loss_loader(train_loader, model, device)
    val_loss = calc_loss_loader(val_loader, model, device)
print("Initial Training loss:", train_loss)
print("Initial Validation loss:", val_loss)

perplexity = torch.exp(torch.tensor(train_loss))
print("Perplexity:", perplexity)

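📉 Initial losses are high (~10.9), which corresponds to a perplexity of roughly exp(10.9) ≈ 54,000. That is on the order of the 50,257-token vocabulary, as expected for an untrained model that is essentially guessing!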
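A quick sanity check for the starting loss, using only the standard library: a model that guesses uniformly over the GPT-2 vocabulary gives each token probability 1/50257, so its cross-entropy is ln(50257). The uniform model is an idealized baseline. A randomly initialized network usually lands slightly above it.

```python
import math

vocab_size = 50257  # GPT-2 BPE vocabulary size

# A uniform guess assigns 1/vocab_size to the correct token,
# so the cross-entropy is -ln(1/vocab_size) = ln(vocab_size).
uniform_loss = math.log(vocab_size)
uniform_perplexity = math.exp(uniform_loss)

print(f"uniform-guess loss: {uniform_loss:.3f}")
print(f"uniform-guess perplexity: {uniform_perplexity:.0f}")
print(f"perplexity at loss 10.9: {math.exp(10.9):.0f}")
```

If your initial loss is far above ln(vocab_size), that usually points to a bug in the data pipeline or the loss computation rather than to the model itself.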

🔁 Step 3: Implement the Training Loop

Train using AdamW optimizer, tracking losses and generating samples periodically to monitor progress 📈

def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    for epoch in range(num_epochs):
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()
            loss = calc_loss_batch(input_batch, target_batch, model, device)  # Single batch loss
            loss.backward()
            optimizer.step()
            tokens_seen += input_batch.numel()
            global_step += 1

            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        generate_and_print_sample(model, tokenizer, device, start_context)
    return train_losses, val_losses, track_tokens_seen

# Run training
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)
num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

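📉 Training loss should fall steadily over the epochs, and the generated samples drift from gibberish toward grammatical text 🧩 On a corpus this small, expect the validation loss to flatten out while the training loss keeps dropping, which signals overfitting rather than genuine generalization.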
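The training loop calls calc_loss_batch and evaluate_model, which are not shown in this post. Below is a minimal sketch of what they might look like. The helper mean_loss is a hypothetical name introduced here so the snippet runs on its own (in your project you would more likely reuse calc_loss_loader from Step 2), and the toy model and random batches are stand-ins for the GPT model and real dataloaders.

```python
import torch
import torch.nn.functional as F


def calc_loss_batch(input_batch, target_batch, model, device):
    """Cross-entropy for one batch; returns a tensor so .backward() can be called."""
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)  # shape: (batch, seq_len, vocab_size)
    return F.cross_entropy(logits.flatten(0, 1), target_batch.flatten())


def mean_loss(data_loader, model, device, num_batches):
    """Hypothetical helper: average calc_loss_batch over at most num_batches batches."""
    losses = []
    for i, (inputs, targets) in enumerate(data_loader):
        if i >= num_batches:
            break
        losses.append(calc_loss_batch(inputs, targets, model, device).item())
    return sum(losses) / len(losses) if losses else float("nan")


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    """Estimate train/val loss without gradients, then restore training mode."""
    model.eval()  # disable dropout while measuring
    with torch.no_grad():
        train_loss = mean_loss(train_loader, model, device, eval_iter)
        val_loss = mean_loss(val_loader, model, device, eval_iter)
    model.train()
    return train_loss, val_loss


# Toy check: embedding + linear head over a 10-token vocabulary, random batches.
torch.manual_seed(0)
toy_model = torch.nn.Sequential(torch.nn.Embedding(10, 8), torch.nn.Linear(8, 10))
toy_batches = [(torch.randint(0, 10, (2, 4)), torch.randint(0, 10, (2, 4))) for _ in range(3)]
train_loss, val_loss = evaluate_model(toy_model, toy_batches, toy_batches, "cpu", eval_iter=2)
print(train_loss, val_loss)
```

Switching to eval mode and disabling gradients keeps the evaluation cheap and deterministic. Switching back to train mode afterwards matters because the training loop continues right after the call.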

🎲 Step 4: Enhance Text Generation with Sampling Techniques

Add temperature and top-k sampling for more diverse and creative outputs 🎨

def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        if top_k is not None:
            top_logits, _ = torch.topk(logits, top_k)
            # Keep a trailing dimension so the comparison broadcasts per row for any batch size
            min_val = top_logits[:, -1:]
            logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)

        if temperature > 0.0:
            logits = logits / temperature
            logits = logits - logits.max(dim=-1, keepdim=True).values
            probs = torch.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)

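        # Stop early once the end-of-sequence token is produced (skipped if eos_id is None)
        if eos_id is not None and (idx_next == eos_id).all():
            break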

        idx = torch.cat((idx, idx_next), dim=1)
    return idx

# Example
token_ids = generate(
    model=model,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=256,
    top_k=50,
    temperature=1.5
)
print(token_ids_to_text(token_ids, tokenizer))
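The example relies on two small conversion helpers, text_to_token_ids and token_ids_to_text, that are not defined in this post. Here is a sketch of how they could be written, assuming a tokenizer with tiktoken-style encode and decode methods. ToyTokenizer is a hypothetical whitespace tokenizer included only so the snippet runs offline. In the real pipeline you would pass tiktoken.get_encoding("gpt2") instead.

```python
import torch


def text_to_token_ids(text, tokenizer):
    """Encode text and add a leading batch dimension: shape (1, num_tokens)."""
    encoded = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
    return torch.tensor(encoded).unsqueeze(0)


def token_ids_to_text(token_ids, tokenizer):
    """Drop the batch dimension and decode back to a string."""
    return tokenizer.decode(token_ids.squeeze(0).tolist())


class ToyTokenizer:
    """Hypothetical whitespace tokenizer, only here so the example runs offline."""

    def __init__(self, text):
        self.vocab = sorted(set(text.split()))
        self.ids = {word: i for i, word in enumerate(self.vocab)}

    def encode(self, text, allowed_special=None):
        return [self.ids[word] for word in text.split()]

    def decode(self, ids):
        return " ".join(self.vocab[i] for i in ids)


prompt = "Every effort moves you"
toy = ToyTokenizer(prompt)
ids = text_to_token_ids(prompt, toy)
print(ids.shape)  # torch.Size([1, 4])
print(token_ids_to_text(ids, toy))  # Every effort moves you
```

The extra batch dimension is what lets generate treat a single prompt like a batch of size one.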

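🔥 Temperature rescales the logits before softmax: values above 1 flatten the distribution (more variety), values below 1 sharpen it, and 0 falls back to greedy argmax in this implementation. Top-k restricts sampling to the k highest-scoring tokens for controlled creativity.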
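To see both knobs in isolation, the following sketch applies them to a hand-picked logits vector. The five scores are invented for illustration. A real model would produce one score per vocabulary entry.

```python
import torch

logits = torch.tensor([[4.0, 3.0, 2.0, 1.0, 0.0]])  # toy next-token scores, batch of 1

# Temperature: divide logits before softmax. Below 1 sharpens, above 1 flattens.
for temperature in (0.5, 1.0, 1.5):
    probs = torch.softmax(logits / temperature, dim=-1)
    print(f"T={temperature}: {[round(p, 3) for p in probs[0].tolist()]}")

# Top-k: keep the k largest logits and mask the rest to -inf so softmax gives them 0.
top_k = 2
top_logits, _ = torch.topk(logits, top_k)
min_val = top_logits[:, -1:]  # k-th largest value per row, shape (batch, 1)
masked = torch.where(logits < min_val, torch.full_like(logits, float("-inf")), logits)
topk_probs = torch.softmax(masked, dim=-1)
print("top-k probs:", [round(p, 3) for p in topk_probs[0].tolist()])
```

In the top-k output only the two best tokens keep non-zero probability, so sampling can never pick an unlikely token from the long tail, whatever the temperature.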

💾 Step 5: Save and Load Model Checkpoints

Save and resume training anytime 🧱

# Save
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, "model_and_optimizer.pth")

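# Load (model and optimizer must already be instantiated with the same configuration)
checkpoint = torch.load("model_and_optimizer.pth", map_location=device, weights_only=True)
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.train()  # back to training mode so dropout is active when training resumes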

✅ This enables interrupting and resuming training seamlessly.
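For a quick way to convince yourself the round trip works, here is a self-contained sketch that saves and restores a tiny stand-in model. The nn.Linear layer, the temporary directory, and the variable names restored_model and restored_optimizer are assumptions for the demo. The pattern is the same for the full GPT model.

```python
import os
import tempfile

import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)  # tiny stand-in for the GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

# Take one optimizer step so AdamW has moment estimates worth saving.
loss = model(torch.randn(3, 4)).sum()
loss.backward()
optimizer.step()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_and_optimizer.pth")
    torch.save(
        {"model_state_dict": model.state_dict(),
         "optimizer_state_dict": optimizer.state_dict()},
        path,
    )

    # Rebuild fresh objects first, then restore their state from disk.
    restored_model = torch.nn.Linear(4, 2)
    restored_optimizer = torch.optim.AdamW(
        restored_model.parameters(), lr=0.0004, weight_decay=0.1
    )
    checkpoint = torch.load(path, map_location="cpu", weights_only=True)
    restored_model.load_state_dict(checkpoint["model_state_dict"])
    restored_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])

weights_match = torch.equal(model.weight, restored_model.weight)
print("weights restored:", weights_match)
```

Saving the optimizer state alongside the weights matters for AdamW, because its running moment estimates would otherwise restart from zero when training resumes.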

🧠 Step 6: Load Pretrained OpenAI Weights (Optional)

Want a performance boost without full pretraining?
You can load GPT-2 weights directly from OpenAI ⚡

from gpt_download import download_and_load_gpt2  # Helper script

model_size = "124M"
models_dir = "gpt2"
settings, params = download_and_load_gpt2(model_size=model_size, models_dir=models_dir)

# Update config and instantiate
NEW_CONFIG = {
    "vocab_size": 50257,
    "context_length": 1024,
    "emb_dim": 768,
    "n_heads": 12,
    "n_layers": 12,
    "drop_rate": 0.0,
    "qkv_bias": True
}
gpt = GPTModel(NEW_CONFIG)

# Transfer weights (load_weights_into_gpt and its assign helper come from the book's Chapter 5 code and are not shown in this post)
load_weights_into_gpt(gpt, params)
gpt.to(device)
gpt.eval()

# Generate
token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=1024,
    top_k=50,
    temperature=1.5
)
print(token_ids_to_text(token_ids, tokenizer))

💡 This loads official GPT-2 small weights into your custom model — instantly giving it powerful pretrained capabilities.
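The core safeguard when copying pretrained weights is a shape check before each assignment. The sketch below shows what an assign-style helper might look like. It is modeled loosely on the idea mentioned above and is not the exact code from the book, and the small linear layer and hand-written weight values are stand-ins for GPT-2 tensors.

```python
import torch


def assign(left, right):
    """Return `right` as a trainable Parameter after checking it matches `left`'s shape."""
    right = torch.as_tensor(right, dtype=left.dtype)
    if left.shape != right.shape:
        raise ValueError(
            f"Shape mismatch. Left: {tuple(left.shape)}, Right: {tuple(right.shape)}"
        )
    return torch.nn.Parameter(right.clone().detach())


# Toy stand-ins: a small layer and a "pretrained" weight array of the same shape.
layer = torch.nn.Linear(3, 2)
pretrained_weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # shape (2, 3), like layer.weight
layer.weight = assign(layer.weight, pretrained_weight)
print(layer.weight)

# A wrong shape is rejected instead of being silently accepted.
try:
    assign(layer.weight, [[1.0, 2.0]])
    mismatch_caught = False
except ValueError:
    mismatch_caught = True
print("mismatch caught:", mismatch_caught)
```

A full loader would apply a check like this to every embedding, attention, feed-forward, and normalization tensor, so a configuration mismatch (for example a wrong emb_dim) fails loudly at load time.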

🏁 Conclusion

🎯 Pretraining equips your LLM with broad language understanding.
With this setup, you’ve now trained (or loaded) a GPT model capable of generating text — the foundation for any downstream fine-tuning.

Next steps:
👉 Finetune for specific NLP tasks (covered in future parts).

📘 For the full code, visit rasbt/LLMs-from-scratch on GitHub.

💌 Stay tuned for upcoming deep dives into AI & transformers — subscribe to never miss a beat! ⚡🤖