Lesson 5 of 15

Activation Derivatives

Differentiating Activations

Backpropagation requires the derivative of each activation function. These derivatives appear in every gradient computation.

Sigmoid Derivative

The sigmoid has a beautiful self-referential derivative (recall that $\sigma$ here is the sigmoid function, not the statistical standard deviation):

$$\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))$$

Derivation: Let $s = \sigma(x) = (1 + e^{-x})^{-1}$.

$$\frac{ds}{dx} = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = s(1-s)$$

The maximum value is $\frac{1}{4}$, attained at $x=0$. When the sigmoid saturates near 0 or 1, the factor $s(1-s)$ approaches zero; multiplying many such small factors across layers causes the vanishing gradient problem in deep networks.
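The self-referential form can be checked numerically. A minimal sketch (the names `sigmoid` and `sigmoid_deriv` are illustrative, not part of the exercise API):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Self-referential form: sigma'(x) = sigma(x) * (1 - sigma(x))
def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the closed form against a central finite difference
h = 1e-6
x = 0.7
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(abs(sigmoid_deriv(x) - numeric) < 1e-6)  # True

# The derivative's maximum is 1/4, at x = 0
print(sigmoid_deriv(0.0))  # 0.25
```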

ReLU Derivative

ReLU has a simple piecewise derivative:

$$\frac{d\,\text{ReLU}}{dx} = \begin{cases} 1 & x > 0 \\ 0 & x \leq 0 \end{cases}$$

This is just the indicator function of $x > 0$. (The derivative at $x = 0$ is technically undefined; taking it to be 0 there is the standard convention.) Units with $x \leq 0$ contribute zero gradient — the "dead ReLU" problem. But for active units, the gradient flows through unchanged.
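The piecewise rule is a one-liner. A small sketch (the names `relu` and `relu_deriv` are illustrative):

```python
def relu(x):
    return x if x > 0 else 0.0

# Derivative is the indicator of x > 0, taken as 0 at x = 0 by convention
def relu_deriv(x):
    return 1.0 if x > 0 else 0.0

print(relu_deriv(3.2))   # 1.0: gradient passes through unchanged
print(relu_deriv(-1.5))  # 0.0: a "dead" unit contributes no gradient
print(relu_deriv(0.0))   # 0.0: convention at the kink
```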

Your Task

Implement:

  • sigmoid_grad(x) — derivative of sigmoid at $x$
  • relu_grad(x) — derivative of ReLU at $x$ (return 0.0 when $x \leq 0$)
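One possible sketch of the two requested functions, assuming scalar float inputs (the exercise's grader may expect different input types, e.g. arrays):

```python
import math

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x))
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    # Indicator of x > 0; 0.0 at and below zero, per the task statement
    return 1.0 if x > 0 else 0.0
```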