Introduction

Why Build a GPT From Scratch?

Large language models like GPT-4 feel like magic — until you build one yourself. Then they feel like math.

This course takes you through the complete pipeline for building a miniature GPT in pure Python, with no ML frameworks. You will implement every component from scratch: the automatic differentiation engine, the tokenizer, the linear layer, softmax, RMS normalization, cross-entropy loss, scaled dot-product attention, multi-head attention, the training loop, and the Adam optimizer.

The result is a small model that learns to generate names character by character. It has the same architecture as GPT-2, just smaller.

What is Autograd?

Most ML code hides the calculus. You call loss.backward() and gradients appear. This course shows you exactly how that works.

You will build a Value class that tracks every arithmetic operation in a computation graph. When you call backward(), it walks the graph in reverse topological order and applies the chain rule at each node — the same algorithm used by PyTorch and JAX under the hood.
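To make this concrete, here is a minimal sketch of such a Value class, supporting only addition and multiplication. The names and details are illustrative, not the course's exact implementation:

```python
class Value:
    """A scalar that records the operations producing it, for backprop."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._children = _children
        self._backward = lambda: None  # local chain-rule step, set per op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            # d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Usage: for z = x*y + x, dz/dx = y + 1 and dz/dy = x.
x, y = Value(2.0), Value(3.0)
z = x * y + x
z.backward()
print(x.grad, y.grad)  # 4.0 2.0
```

Note that gradients are accumulated with `+=` rather than assigned: a node used in several places (like `x` above) collects a contribution from each use, which is exactly what the chain rule prescribes.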

How This Course Works

Each lesson introduces one concept, provides a short explanation, and asks you to implement a single function. The starter code includes all the infrastructure you need — you only implement the function described.

By the final lesson, you will have implemented every component needed to train a transformer language model.

What You Will Learn

This course contains 15 lessons organized into 5 chapters:

  1. Autograd Engine -- Value class, arithmetic operations, backpropagation.
  2. Data & Tokens -- Tokenizer, character-level vocabulary, training pairs.
  3. Neural Net Primitives -- Linear layer, softmax, RMS normalization, cross-entropy loss.
  4. The Transformer -- Single-head attention, multi-head attention.
  5. Training -- SGD training loop, Adam optimizer.

Let's start.
