Lesson 12 of 15
Weight Initialization
Why Initialization Matters
All-zero weights are catastrophic: every neuron in a layer computes the same gradient, and all weights update identically — this is the symmetry problem. We break symmetry with random initialization.
But the scale of initial weights matters too:
- Too large: activations saturate (sigmoid outputs ≈ 0 or 1), gradients vanish
- Too small: signals shrink through layers, gradients vanish from the other direction
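The too-small case can be seen directly by pushing a random signal through a stack of linear layers. This is a toy sketch (the layer width, depth, and `forward_std` helper are all made up for illustration): with tiny weights the activation scale collapses geometrically, while a variance-preserving scale of 1/sqrt(fan_in) keeps it near 1.

```python
import math
import random

def forward_std(weight_std, width=100, depth=20, seed=0):
    """Return the std of activations after `depth` random linear layers."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(width)]
    for _ in range(depth):
        # Fresh random weight matrix for each layer, all entries ~ N(0, weight_std)
        w = [[rng.gauss(0, weight_std) for _ in range(width)] for _ in range(width)]
        x = [sum(w[i][j] * x[j] for j in range(width)) for i in range(width)]
    mean = sum(x) / width
    return math.sqrt(sum((v - mean) ** 2 for v in x) / width)

tiny = forward_std(weight_std=0.01)                   # far too small
scaled = forward_std(weight_std=1 / math.sqrt(100))   # 1/sqrt(fan_in)

print(f"tiny weights   -> activation std {tiny:.2e}")   # collapses toward 0
print(f"scaled weights -> activation std {scaled:.2e}") # stays near 1
```

Each layer multiplies the activation variance by roughly width * weight_std^2, so anything other than weight_std ≈ 1/sqrt(width) compounds exponentially with depth.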
Xavier / Glorot Initialization
For a layer with fan_in inputs and fan_out outputs, Xavier initialization draws weights uniformly from:

W ~ Uniform(-sqrt(6 / (fan_in + fan_out)), +sqrt(6 / (fan_in + fan_out)))
This maintains the variance of activations across layers for sigmoid/tanh networks.
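One way to see where the limit comes from: a uniform distribution on [-a, a] has variance a^2 / 3, so a = sqrt(6 / (fan_in + fan_out)) gives weight variance 2 / (fan_in + fan_out). A quick empirical check (fan sizes chosen arbitrarily here):

```python
import math
import random

rng = random.Random(0)
fan_in, fan_out = 300, 100
limit = math.sqrt(6 / (fan_in + fan_out))

# Sample Xavier-uniform weights and estimate their variance
samples = [rng.uniform(-limit, limit) for _ in range(200_000)]
var = sum(w * w for w in samples) / len(samples)

target = 2 / (fan_in + fan_out)
print(f"empirical var {var:.5f}  target {target:.5f}")
```

The variance 2 / (fan_in + fan_out) is the compromise between preserving activation variance on the forward pass (which wants 1 / fan_in) and gradient variance on the backward pass (which wants 1 / fan_out).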
He Initialization
For ReLU networks, He initialization draws weights from a normal distribution with mean 0 and standard deviation sqrt(2 / fan_in):

W ~ Normal(0, sqrt(2 / fan_in))
ReLU kills half its inputs (the negatives), so we double the variance to compensate.
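The "half the signal" claim is easy to verify numerically: for zero-mean Gaussian inputs, applying ReLU cuts the second moment E[x^2] exactly in half, which is the factor of 2 that He initialization restores.

```python
import random

rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(200_000)]

# Second moment before and after ReLU
pre = sum(x * x for x in xs) / len(xs)              # ~1.0
post = sum(max(0.0, x) ** 2 for x in xs) / len(xs)  # ~0.5

print(f"E[x^2] before ReLU: {pre:.3f}, after: {post:.3f}")
```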
When to Use Which
- Xavier/Glorot: Best for sigmoid, tanh, and linear activations — assumes activations are symmetric around zero
- He: Best for ReLU and its variants (Leaky ReLU, GELU, SiLU) — accounts for the fact that ReLU zeros out half the inputs
Using He initialization keeps activation magnitudes stable with depth in ReLU networks: with Xavier-scaled weights, each ReLU layer discards roughly half the signal variance, so activations in deep networks shrink toward zero.
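The difference shows up dramatically in deep ReLU stacks. A toy comparison (the `relu_stack_std` helper and the width/depth values are invented for this sketch; the Xavier-style std uses 1/sqrt(fan_in), which is what the Glorot formula reduces to for square layers):

```python
import math
import random

def relu_stack_std(weight_std_fn, fan_in=200, depth=30, seed=0):
    """Std of activations after `depth` ReLU layers with the given weight std."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(fan_in)]
    for _ in range(depth):
        std = weight_std_fn(fan_in)
        # Each output unit: fresh Gaussian weights, dot product, then ReLU
        x = [max(0.0, sum(rng.gauss(0, std) * xj for xj in x)) for _ in range(fan_in)]
    return math.sqrt(sum(v * v for v in x) / fan_in)

xavier = relu_stack_std(lambda n: math.sqrt(1 / n))  # Xavier-style scale
he = relu_stack_std(lambda n: math.sqrt(2 / n))      # He scale

print(f"Xavier-style std after 30 ReLU layers: {xavier:.2e}")
print(f"He std after 30 ReLU layers:           {he:.2e}")
```

With the Xavier-style scale, each ReLU layer halves the second moment, so 30 layers shrink the signal by about 2^-30; the He scale compensates and the activation std stays on the order of 1.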
Your Task
Implement:
- xavier_init(fan_in, fan_out, seed=42) — weight matrix of shape (fan_out, fan_in) using uniform Xavier initialization
- he_init(fan_in, fan_out, seed=42) — weight matrix of shape (fan_out, fan_in) using He (normal) initialization: random.gauss(0, sqrt(2 / fan_in))