Lesson 12 of 15

Weight Initialization

Why Initialization Matters

All-zero weights are catastrophic: every neuron in a layer computes the same gradient, and all weights update identically — this is the symmetry problem. We break symmetry with random initialization.

But the scale of initial weights matters too:

  • Too large: activations saturate (sigmoid outputs ≈ 0 or 1), gradients vanish
  • Too small: signals shrink through layers, gradients vanish from the other direction
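To see the scale problem concretely, here is a small stdlib-only sketch (the function name `signal_scale` is illustrative, not part of the task) that pushes a random signal through a stack of linear layers and measures its magnitude:

```python
import random
import math

def signal_scale(weight_std, depth=20, width=64, seed=0):
    """RMS of activations after `depth` linear layers of size `width`."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(width)]
    for _ in range(depth):
        # Each layer: plain matrix-vector product with freshly sampled weights.
        w = [[rng.gauss(0, weight_std) for _ in range(width)] for _ in range(width)]
        x = [sum(w[i][j] * x[j] for j in range(width)) for i in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)

# Too small: the signal collapses toward zero; too large: it explodes.
print(signal_scale(0.01))
print(signal_scale(1.0))
```

Each layer multiplies the activation variance by roughly `width * weight_std**2`, so any std other than about `sqrt(1/width)` compounds exponentially with depth, which is exactly what the schemes below are designed to avoid.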

Xavier / Glorot Initialization

For a layer with n_{in} inputs and n_{out} outputs, Xavier initialization draws weights uniformly from:

w \sim \text{Uniform}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \;\sqrt{\frac{6}{n_{in} + n_{out}}}\right)

This maintains the variance of activations across layers for sigmoid/tanh networks.
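As a quick sanity check (a sketch using only the stdlib; `make_xavier` is an illustrative name, distinct from the `xavier_init` you will implement below), the variance of Uniform(-a, a) is a²/3, so the bound above yields a weight variance of 2 / (n_in + n_out):

```python
import random
import math

def make_xavier(n_in, n_out, seed=0):
    rng = random.Random(seed)
    limit = math.sqrt(6 / (n_in + n_out))  # Uniform(-limit, limit)
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]

w = make_xavier(256, 128)
flat = [v for row in w for v in row]
var = sum(v * v for v in flat) / len(flat)
print(var, 2 / (256 + 128))  # empirical variance ≈ 2 / (n_in + n_out)
```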

He Initialization

For ReLU networks, He initialization uses:

w \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{in}}}\right)

ReLU zeroes out half its inputs (the negatives), so the variance is doubled to 2/n_{in} to compensate. Note that the second argument above is the standard deviation, \sqrt{2/n_{in}}, not the variance.

When to Use Which

  • Xavier/Glorot: Best for sigmoid, tanh, and linear activations — assumes activations are symmetric around zero
  • He: Best for ReLU and its variants (Leaky ReLU, GELU, SiLU) — accounts for the fact that ReLU zeros out half the inputs

Using He initialization with ReLU networks keeps activation variance roughly constant from layer to layer. Without it, activations in a deep ReLU network can shrink toward zero, leaving units effectively inactive.

Your Task

Implement:

  • xavier_init(fan_in, fan_out, seed=42) — weight matrix of shape (fan_out, fan_in) using uniform Xavier initialization
  • he_init(fan_in, fan_out, seed=42) — weight matrix of shape (fan_out, fan_in) using He (normal) initialization: random.gauss(0, sqrt(2/fan_in))