Introduction
Why Build a GPT From Scratch?
Large language models like GPT-4 feel like magic — until you build one yourself. Then they feel like math.
This course takes you through the complete pipeline for building a miniature GPT in pure Python, with no ML frameworks. You will implement every component from scratch: the automatic differentiation engine, the tokenizer, the linear layer, softmax, RMS normalization, cross-entropy loss, scaled dot-product attention, multi-head attention, the training loop, and the Adam optimizer.
The result is a small model that learns to generate names character by character. It uses the same architecture as GPT-2, just smaller.
What is Autograd?
Most ML code hides the calculus. You call loss.backward() and gradients appear. This course shows you exactly how that works.
You will build a Value class that tracks every arithmetic operation in a computation graph. When you call backward(), it walks the graph in reverse topological order and applies the chain rule at each node — the same algorithm used by PyTorch and JAX under the hood.
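To make the idea concrete, here is a minimal sketch of such a Value class, reduced to addition and multiplication. The names and structure are illustrative, not the course's exact code: each operation records its inputs and a small closure that applies the chain rule, and backward() replays those closures in reverse topological order.

```python
class Value:
    """Scalar that records the operations producing it, for reverse-mode autodiff."""

    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None  # closure that propagates grad to inputs
        self._prev = set(_children)    # nodes this value was computed from

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad   # d(a+b)/da = 1
            other.grad += out.grad  # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Build a topological order of the graph, then apply the chain rule
        # to each node in reverse.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0  # d(output)/d(output)
        for v in reversed(topo):
            v._backward()

# y = a * b + a, so dy/da = b + 1 and dy/db = a
a, b = Value(4.0), Value(2.0)
y = a * b + a
y.backward()
print(a.grad, b.grad)  # 3.0 4.0
```

The full engine adds more operations (power, exp, division, and so on), but every one follows the same pattern: compute the forward value, and store a closure encoding its local derivative.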
How This Course Works
Each lesson introduces one concept, provides a short explanation, and asks you to implement a single function. The starter code includes all the infrastructure you need — you only implement the function described.
By the final lesson, you will have implemented every component needed to train a transformer language model.
What You Will Learn
This course contains 15 lessons organized into 5 chapters:
- Autograd Engine -- Value class, arithmetic operations, backpropagation.
- Data & Tokens -- Tokenizer, character-level vocabulary, training pairs.
- Neural Net Primitives -- Linear layer, softmax, RMS normalization, cross-entropy loss.
- The Transformer -- Single-head attention, multi-head attention.
- Training -- SGD training loop, Adam optimizer.
Let's start.