Lesson 12 of 15

Weight Initialization

Why Initialization Matters

All-zero weights are catastrophic: every neuron in a layer computes the same gradient, and all weights update identically — this is the symmetry problem. We break symmetry with random initialization.

But the scale of initial weights matters too:

  • Too large: activations saturate (sigmoid outputs ≈ 0 or 1), gradients vanish
  • Too small: signals shrink through layers, gradients vanish from the other direction
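To see the scale problem concretely, here is a small stdlib-only sketch (the function name `signal_scale` is illustrative, not part of the task) that pushes a random signal through a stack of linear layers and measures its magnitude:

```python
import random
import math

def signal_scale(weight_std, depth=20, width=64, seed=0):
    """RMS of activations after `depth` linear layers of size `width`."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(width)]
    for _ in range(depth):
        # Each layer: plain matrix-vector product with freshly sampled weights.
        w = [[rng.gauss(0, weight_std) for _ in range(width)] for _ in range(width)]
        x = [sum(w[i][j] * x[j] for j in range(width)) for i in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)

# Too small: the signal collapses toward zero; too large: it explodes.
print(signal_scale(0.01))
print(signal_scale(1.0))
```

Each layer multiplies the activation variance by roughly `width * weight_std**2`, so any std other than about `sqrt(1/width)` compounds exponentially with depth, which is exactly what the schemes below are designed to avoid.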

Xavier / Glorot Initialization

For a layer with n_{in} inputs and n_{out} outputs, Xavier initialization draws weights uniformly from:

w \sim \text{Uniform}\left(-\sqrt{\frac{6}{n_{in} + n_{out}}}, \;\sqrt{\frac{6}{n_{in} + n_{out}}}\right)

This maintains the variance of activations across layers for sigmoid/tanh networks.
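As a quick sanity check (a sketch using only the stdlib; `make_xavier` is an illustrative name, distinct from the `xavier_init` you will implement below), the variance of Uniform(-a, a) is a²/3, so the bound above yields a weight variance of 2 / (n_in + n_out):

```python
import random
import math

def make_xavier(n_in, n_out, seed=0):
    rng = random.Random(seed)
    limit = math.sqrt(6 / (n_in + n_out))  # Uniform(-limit, limit)
    return [[rng.uniform(-limit, limit) for _ in range(n_in)]
            for _ in range(n_out)]

w = make_xavier(256, 128)
flat = [v for row in w for v in row]
var = sum(v * v for v in flat) / len(flat)
print(var, 2 / (256 + 128))  # empirical variance ≈ 2 / (n_in + n_out)
```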

He Initialization

For ReLU networks, He initialization uses:

w \sim \mathcal{N}\left(0, \sqrt{\frac{2}{n_{in}}}\right)

ReLU zeroes out half its inputs (the negatives), so the variance is doubled to 2/n_{in} to compensate. Note that the second argument above is the standard deviation, \sqrt{2/n_{in}}, not the variance.

When to Use Which

  • Xavier/Glorot: Best for sigmoid, tanh, and linear activations — assumes activations are symmetric around zero
  • He: Best for ReLU and its variants (Leaky ReLU, GELU, SiLU) — accounts for the fact that ReLU zeros out half the inputs

Using He initialization with ReLU networks keeps activation variance roughly constant from layer to layer. Without it, activations in a deep ReLU network can shrink toward zero, leaving units effectively inactive.

Your Task

Implement:

  • xavier_init(fan_in, fan_out, seed=42) — weight matrix of shape (fan_out, fan_in) using uniform Xavier initialization
  • he_init(fan_in, fan_out, seed=42) — weight matrix of shape (fan_out, fan_in) using He (normal) initialization: random.gauss(0, sqrt(2/fan_in))