Lesson 2 of 15

Activation Functions

Why Activation Functions?

Without an activation function, stacking multiple layers collapses to a single linear transformation:

W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)

This is just another linear function. A deep network of linear layers is no more powerful than a single layer. Non-linearity is what lets neural networks approximate any function.
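The collapse is easy to verify numerically. The sketch below (with arbitrary layer sizes chosen for illustration) applies two linear layers in sequence and compares the result to the single collapsed linear map from the equation above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two random linear layers (sizes are arbitrary, for illustration only).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Apply the layers one after the other...
stacked = W2 @ (W1 @ x + b1) + b2

# ...and as a single collapsed linear map.
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(stacked, collapsed))  # True: the two are identical
```

Inserting any non-linearity between the layers breaks this equivalence, which is exactly the point.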

The Three Classics

Sigmoid — maps any input to (0, 1), useful for probabilities:

\sigma(x) = \frac{1}{1 + e^{-x}}

Notation note: Here \sigma denotes the sigmoid activation function. In statistics and finance, \sigma instead represents standard deviation or volatility.

ReLU (Rectified Linear Unit) — the most widely used: cheap to compute and produces sparse activations:

\text{ReLU}(x) = \max(0, x)

Tanh — maps to (-1, 1), zero-centered (often better than sigmoid for hidden layers):

\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
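One possible NumPy sketch of the three classics, with a quick check of the properties claimed above (the function names mirror this lesson's task, but treat these as reference sketches, not the only valid answers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # squashes any input into (0, 1)

def relu(x):
    return np.maximum(0.0, x)        # zeroes out negatives, passes positives

def tanh_act(x):
    return np.tanh(x)                # squashes into (-1, 1), zero-centered

print(sigmoid(0.0))                      # 0.5 — sigmoid is centered at 0.5, not 0
print(tanh_act(0.0))                     # 0.0 — tanh is zero-centered
print(relu(np.array([-3.0, 0.0, 2.0])))  # [0. 0. 2.] — never negative
```

Note the zero-centering difference at x = 0: this is the usual argument for preferring tanh over sigmoid in hidden layers.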

Modern Activations

GELU (Gaussian Error Linear Unit) — used in GPT, BERT, and most modern transformers:

\text{GELU}(x) = x \cdot \Phi(x) \approx 0.5 x \left(1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715 x^3\right)\right)\right)

Unlike ReLU's hard cutoff at 0, GELU provides a smooth, probabilistic gate — small negative inputs are attenuated rather than zeroed.
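Both forms of GELU from the formula above can be written directly with the standard library: the exact version uses the normal CDF via `math.erf`, and the tanh version is the approximation most frameworks ship. A small sketch comparing them:

```python
import math

def gelu_exact(x):
    # x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation with the 0.044715 cubic correction term.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")
```

Running this shows the smooth gating: GELU(-0.5) is a small negative number rather than ReLU's hard 0, and the two forms agree to a few decimal places over typical input ranges.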

SiLU (Sigmoid Linear Unit, a.k.a. Swish) — used in EfficientNet, LLaMA, and many vision models:

\text{SiLU}(x) = x \cdot \sigma(x) = \frac{x}{1 + e^{-x}}

SiLU is smooth and non-monotonic: it dips slightly below zero near x \approx -1.28, which can help optimization.
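The dip is easy to locate with a brute-force scan — a rough numeric sketch rather than a closed-form derivation:

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)

# Scan a fine grid for the minimum; it sits slightly below zero,
# near x ≈ -1.28 with a value of about -0.28.
xs = [i / 1000.0 for i in range(-5000, 5001)]
x_min = min(xs, key=silu)
print(f"minimum near x = {x_min:.2f}, SiLU(x) = {silu(x_min):.3f}")
```

Because the function is non-monotonic, small negative inputs are not all mapped to the same value (as ReLU would do), which preserves some gradient signal there.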

Choosing an Activation

  • Hidden layers: ReLU (and its variants) dominates modern networks — cheap to compute and helps mitigate vanishing gradients
  • Transformer hidden layers: GELU is the standard choice (GPT, BERT)
  • Output layer for binary classification: Sigmoid (output is a probability)
  • Output layer for regression: No activation (linear output)
  • Output layer for multi-class classification: Softmax (covered in later lessons)

Your Task

Implement sigmoid(x), relu(x), tanh_act(x), gelu(x), and silu(x).
