Lesson 5 of 15

Activation Derivatives

Differentiating Activations

Backpropagation requires the derivative of each activation function. These derivatives appear in every gradient computation.

Sigmoid Derivative

The sigmoid has a beautiful self-referential derivative (recall that $\sigma$ here is the sigmoid function, not the statistical standard deviation):

$$\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))$$

Derivation: Let $s = \sigma(x) = (1 + e^{-x})^{-1}$.

$$\frac{ds}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = s(1-s)$$

The maximum value is $\frac{1}{4}$, attained at $x=0$. When the sigmoid saturates near 0 or 1, the factor $s(1-s)$ approaches zero; multiplying many such small factors across layers causes the vanishing gradient problem in deep networks.
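The self-referential form can be checked numerically. A minimal sketch (the names `sigmoid` and `sigmoid_deriv` are illustrative, not part of the exercise API):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Self-referential form: sigma'(x) = sigma(x) * (1 - sigma(x))
def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the closed form against a central finite difference
h = 1e-6
x = 0.7
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(sigmoid_deriv(x) - numeric) < 1e-6)  # True

# The derivative's maximum is 1/4, at x = 0
print(sigmoid_deriv(0.0))  # 0.25
```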

ReLU Derivative

ReLU has a simple piecewise derivative:

$$\frac{d\,\text{ReLU}}{dx} = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$$

This is just the indicator function of $x > 0$. (The derivative at $x = 0$ is technically undefined; taking it to be 0 there is the standard convention.) Units with $x \leq 0$ contribute zero gradient — the "dead ReLU" problem. But for active units, the gradient flows through unchanged.
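The piecewise rule is a one-liner. A small sketch (the names `relu` and `relu_deriv` are illustrative):

```python
def relu(x):
    return x if x > 0 else 0.0

# Derivative is the indicator of x > 0, taken as 0 at x = 0 by convention
def relu_deriv(x):
    return 1.0 if x > 0 else 0.0

print(relu_deriv(3.2))   # 1.0: gradient passes through unchanged
print(relu_deriv(-1.5))  # 0.0: a "dead" unit contributes no gradient
print(relu_deriv(0.0))   # 0.0: convention at the kink
```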

Your Task

Implement:

  • sigmoid_grad(x) — derivative of sigmoid at $x$
  • relu_grad(x) — derivative of ReLU at $x$ (return 0.0 when $x \leq 0$)
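One possible sketch of the two requested functions, assuming scalar float inputs (the exercise's grader may expect different input types, e.g. arrays):

```python
import math

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # Indicator of x > 0; 0.0 at and below zero, per the task statement
    return 1.0 if x > 0 else 0.0
```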