Lesson 9 of 15


Principal Component Analysis (2D)

PCA finds the directions of maximum variance in the data. It is used for dimensionality reduction, visualisation, and noise removal.

Step 1 — Center the Data

Subtract the column mean from each feature so the data has zero mean:

$$\tilde{x}_{ij} = x_{ij} - \bar{x}_j$$
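One possible plain-Python implementation of this step (the function name matches the task at the end of the lesson):

```python
def center(X):
    """Subtract each column's mean from every row, giving zero-mean data."""
    n = len(X)
    d = len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]
```

After centering, each column of the result sums to zero, which is what the covariance formula in the next step assumes.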

Step 2 — Covariance Matrix (2D)

For centered 2D data with $n$ points:

$$\Sigma = \frac{1}{n-1} \begin{bmatrix} \sum \tilde{x}_1^2 & \sum \tilde{x}_1 \tilde{x}_2 \\ \sum \tilde{x}_1 \tilde{x}_2 & \sum \tilde{x}_2^2 \end{bmatrix}$$
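A direct translation of the matrix above, assuming the input has already been passed through the centering step:

```python
def covariance_2d(Xc):
    """2x2 sample covariance of centered 2D data (divides by n - 1)."""
    n = len(Xc)
    s11 = sum(x1 * x1 for x1, _ in Xc) / (n - 1)
    s22 = sum(x2 * x2 for _, x2 in Xc) / (n - 1)
    s12 = sum(x1 * x2 for x1, x2 in Xc) / (n - 1)
    return [[s11, s12], [s12, s22]]
```

Note the matrix is symmetric: the off-diagonal entry $\sum \tilde{x}_1 \tilde{x}_2 / (n-1)$ appears twice.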

Step 3 — Explained Variance Ratio

Given eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots$:

$$\text{EVR}_i = \frac{\lambda_i}{\sum_j \lambda_j}$$

If the first component's EVR is close to 1, most variance lives along a single direction.
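The ratio itself is a one-liner; this sketch again uses the function name from the task:

```python
def explained_variance_ratio(eigenvalues):
    """Each eigenvalue's share of the total variance; ratios sum to 1."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]
```

For example, eigenvalues `[3.0, 1.0]` give ratios `[0.75, 0.25]`: the first component alone explains 75% of the variance.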

The Curse of Dimensionality

PCA is one of the main defences against the curse of dimensionality — the phenomenon where high-dimensional spaces behave counter-intuitively:

  • Distances converge: In $d$ dimensions, the ratio between the nearest and farthest neighbour approaches 1 as $d \to \infty$. This makes distance-based methods (k-NN, k-means, DBSCAN) unreliable.
  • Data becomes sparse: To maintain the same density of data points, you need exponentially more samples as dimensions grow. With $n$ fixed samples, the data "spreads thin" and every point looks like an outlier.
  • Overfitting risk increases: More features relative to samples means more opportunity for the model to memorise noise.
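The distance-concentration effect from the first bullet is easy to observe empirically. This small demo (`near_far_ratio` is a hypothetical helper, not part of the task) samples uniform random points and compares the nearest and farthest squared distances from the origin:

```python
import random

def near_far_ratio(n_points, d, seed=0):
    """Ratio of nearest to farthest squared distance from the origin
    for uniform random points in [0, 1]^d; climbs toward 1 as d grows."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    dists = [sum(x * x for x in p) for p in pts]
    return min(dists) / max(dists)

# In 2D the nearest point is far closer than the farthest (small ratio);
# in 500D the ratio is close to 1, so "nearest" loses its meaning.
low_d = near_far_ratio(200, 2)
high_d = near_far_ratio(200, 500)
```
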

PCA combats this by projecting data onto the top $k$ principal components, discarding low-variance directions that are likely noise. As a rule of thumb, keep enough components to capture 90-95% of the total variance.
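The rule of thumb can be automated: accumulate explained variance ratios until the target is reached. The helper below (`k_for_variance` is an illustrative name, not part of the task) assumes the eigenvalues are already sorted in descending order:

```python
def k_for_variance(eigenvalues, target=0.95):
    """Smallest k such that the top-k components explain >= target
    of the total variance. Eigenvalues must be in descending order."""
    total = sum(eigenvalues)
    cum = 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        cum += lam / total
        if cum >= target:
            return k
    return len(eigenvalues)
```

With eigenvalues `[5.0, 3.0, 1.0, 0.5, 0.5]`, the cumulative ratios are 0.5, 0.8, 0.9, 0.95, so four of the five components are needed to hit the 95% target.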

Your Task

Implement:

  • center(X) → subtract column means from each row
  • covariance_2d(X_centered) → $2 \times 2$ covariance matrix as a list of lists
  • explained_variance_ratio(eigenvalues) → list of ratios