Activation Functions
Why Activation Functions?
Without an activation function, stacking multiple layers collapses to a single linear transformation:

W2(W1 x + b1) + b2 = (W2 W1) x + (W2 b1 + b2)

The result is just another linear function, with weights W2 W1 and bias W2 b1 + b2. A deep network of linear layers is therefore no more powerful than a single layer. Non-linearity is what lets neural networks approximate arbitrary continuous functions.
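The collapse is easy to verify numerically. Below is a minimal sketch using NumPy (the layer shapes and random seed are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(3)  # first linear layer
W2, b2 = rng.standard_normal((2, 3)), rng.standard_normal(2)  # second linear layer
x = rng.standard_normal(4)

# Two stacked linear layers with no activation in between...
two_layer = W2 @ (W1 @ x + b1) + b2
# ...equal one linear layer with weights W2 @ W1 and bias W2 @ b1 + b2.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layer, one_layer))  # True
```

Inserting any non-linearity between the two matrix multiplications breaks this equality, which is exactly the point of an activation function.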
The Three Classics
Sigmoid — maps any input to (0, 1), useful for probabilities:

σ(x) = 1 / (1 + e^(-x))
Notation note: Here σ denotes the sigmoid activation function. In statistics and finance, σ instead represents standard deviation or volatility.
ReLU (Rectified Linear Unit) — the most widely used, fast and sparse:

ReLU(x) = max(0, x)
Tanh — maps to (-1, 1), zero-centered (often better than sigmoid for hidden layers):

tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
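One way to write the three classics as scalar Python functions, using only the standard library (a reference sketch — the function names match the task at the end of this lesson):

```python
import math

def sigmoid(x):
    # 1 / (1 + e^-x), split into two branches to avoid overflow
    # when math.exp(-x) would be huge for very negative x
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def relu(x):
    # max(0, x): passes positives through unchanged, zeroes out negatives
    return max(0.0, x)

def tanh_act(x):
    # (e^x - e^-x) / (e^x + e^-x); math.tanh is the stable built-in
    return math.tanh(x)

print(sigmoid(0.0), relu(-2.0), tanh_act(0.0))  # 0.5 0.0 0.0
```

Note the fixed points: sigmoid crosses 0.5 at the origin, while ReLU and tanh both pass through 0.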
Modern Activations
GELU (Gaussian Error Linear Unit) — used in GPT, BERT, and most modern transformers:

GELU(x) = x · Φ(x)

where Φ is the cumulative distribution function of the standard normal distribution.
Unlike ReLU's hard cutoff at 0, GELU provides a smooth, probabilistic gate — small negative inputs are attenuated rather than zeroed.
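The smooth gate is easy to see numerically: where ReLU maps a small negative input to exactly 0, GELU only attenuates it. A sketch of the exact form x · Φ(x), using the standard library's error function:

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF,
    # written via the error function: Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# ReLU would map -1.0 to 0; GELU attenuates it instead of zeroing it.
print(round(gelu(-1.0), 4))  # -0.1587
```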
SiLU (Sigmoid Linear Unit, a.k.a. Swish) — used in EfficientNet, LLaMA, and many vision models:

SiLU(x) = x · σ(x)
SiLU is smooth and non-monotonic: it dips slightly below zero for negative inputs, reaching its minimum near x ≈ -1.28, which can help optimization.
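A sketch of SiLU, with a quick check of the non-monotonic dip (the specific inputs below are illustrative):

```python
import math

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + math.exp(-x))

# Non-monotonic: the function dips below zero for negative inputs,
# bottoms out near x = -1.28, then rises back toward 0 as x -> -inf.
print(round(silu(-1.28), 3))   # -0.278
print(silu(-5.0) > silu(-1.28))  # True
```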
Choosing an Activation
- Hidden layers: ReLU (and its variants) dominates modern networks — fast to compute and resistant to vanishing gradients
- Transformer hidden layers: GELU is the standard choice (GPT, BERT)
- Output layer for binary classification: Sigmoid (output is a probability)
- Output layer for regression: No activation (linear output)
- Output layer for multi-class: Softmax (covered in later courses)
Your Task
Implement sigmoid(x), relu(x), tanh_act(x), gelu(x), and silu(x).
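Once your implementations are written, a few property checks catch most mistakes. This is a hypothetical helper, not part of the task itself — the expected values follow directly from the definitions above:

```python
def check_activations(sigmoid, relu, tanh_act, gelu, silu):
    # Known values at and around the origin, from the definitions above
    assert abs(sigmoid(0.0) - 0.5) < 1e-9          # sigmoid(0) = 1/2
    assert relu(-3.0) == 0.0 and relu(2.0) == 2.0  # hard cutoff at 0
    assert abs(tanh_act(0.0)) < 1e-9               # tanh(0) = 0
    assert abs(gelu(0.0)) < 1e-9                   # 0 * Phi(0) = 0
    assert silu(-1.0) < 0.0 < silu(1.0)            # SiLU dips below zero
    print("all checks passed")
```

Call it as `check_activations(sigmoid, relu, tanh_act, gelu, silu)` after defining your five functions.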