Lesson 5 of 17

Backpropagation

The Backward Pass

Backpropagation applies the chain rule in reverse topological order. For each node, we multiply the gradient flowing in (from the output side) by each local gradient and accumulate the product into the corresponding child's gradient.

The Chain Rule

If z = f(a, b), and we already know dL/dz (gradient from the loss to z), then:

dL/da = (dz/da) * (dL/dz) = local_grad_a * z.grad
dL/db = (dz/db) * (dL/dz) = local_grad_b * z.grad

The backward() Method

  1. Build the topological order
  2. Set the root gradient to 1 (the loss w.r.t. itself is 1)
  3. Walk in reverse order: for each node, propagate its gradient to its children

def backward(self):
    # 1. Build the topological order (children appended before parents)
    topo = []
    visited = set()
    def build_topo(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_topo(child)
            topo.append(v)
    build_topo(self)
    # 2. Seed the root: the loss w.r.t. itself is 1
    self.grad = 1
    # 3. Walk in reverse order, propagating each node's gradient to its children
    for v in reversed(topo):
        for child, local_grad in zip(v._children, v._local_grads):
            child.grad += local_grad * v.grad

Note the +=: gradients must accumulate, because the same Value may appear multiple times in the graph (e.g., a * a).
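The accumulation case can be seen end to end with a minimal sketch of the Value class. This is an assumed skeleton matching the interface the lesson uses (_children, _local_grads, grad, backward); your own class may differ in details:

```python
class Value:
    """Minimal sketch: stores data, children, and precomputed local grads."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __mul__(self, other):
        # local gradients of a * b are (b.data, a.data)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

a = Value(3.0)
z = a * a        # z = a^2, so dz/da = 2a = 6
z.backward()
print(a.grad)    # two contributions of 3.0 each accumulate to 6.0
```

Without +=, the second contribution would overwrite the first and a.grad would come out as 3.0 instead of 6.0.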

Verifying by Hand

For z = a * b:

  • z._local_grads = (b.data, a.data)
  • After z.backward(): a.grad = b.data * 1, b.grad = a.data * 1

For w = a + b:

  • w._local_grads = (1, 1)
  • After w.backward(): a.grad = 1, b.grad = 1
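Both hand checks can be run in code. A sketch under the same assumed interface as above, adding an __add__ whose local gradients are (1, 1):

```python
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1, 1))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build_topo(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_topo(child)
                topo.append(v)
        build_topo(self)
        self.grad = 1
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

# z = a * b: each input's gradient is the partner's data
a, b = Value(2.0), Value(5.0)
z = a * b
z.backward()
print(a.grad, b.grad)   # 5.0 2.0

# w = a + b: addition routes the gradient through unchanged
c, d = Value(2.0), Value(5.0)
w = c + d
w.backward()
print(c.grad, d.grad)   # 1.0 1.0
```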

Your Task

Add backward() to the Value class.
