Lesson 9 of 15


Principal Component Analysis (2D)

PCA finds the directions of maximum variance in the data. It is used for dimensionality reduction, visualisation, and noise removal.

Step 1 — Center the Data

Subtract the column mean from each feature so the data has zero mean:

$$\tilde{x}_{ij} = x_{ij} - \bar{x}_j$$
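One possible plain-Python implementation of this step (the function name matches the task at the end of the lesson):

```python
def center(X):
    """Subtract each column's mean from every row, giving zero-mean data."""
    n = len(X)
    d = len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]
```

After centering, each column of the result sums to zero, which is what the covariance formula in the next step assumes.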

Step 2 — Covariance Matrix (2D)

For centered 2D data with $n$ points:

$$\Sigma = \frac{1}{n-1} \begin{bmatrix} \sum \tilde{x}_1^2 & \sum \tilde{x}_1 \tilde{x}_2 \\ \sum \tilde{x}_1 \tilde{x}_2 & \sum \tilde{x}_2^2 \end{bmatrix}$$
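A direct translation of the matrix above, assuming the input has already been passed through the centering step:

```python
def covariance_2d(Xc):
    """2x2 sample covariance of centered 2D data (divides by n - 1)."""
    n = len(Xc)
    s11 = sum(x1 * x1 for x1, _ in Xc) / (n - 1)
    s22 = sum(x2 * x2 for _, x2 in Xc) / (n - 1)
    s12 = sum(x1 * x2 for x1, x2 in Xc) / (n - 1)
    return [[s11, s12], [s12, s22]]
```

Note the matrix is symmetric: the off-diagonal entry $\sum \tilde{x}_1 \tilde{x}_2 / (n-1)$ appears twice.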

Step 3 — Explained Variance Ratio

Given eigenvalues $\lambda_1 \geq \lambda_2 \geq \ldots$:

$$\text{EVR}_i = \frac{\lambda_i}{\sum_j \lambda_j}$$

If the first component's EVR is close to 1, most variance lives along a single direction.
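The ratio itself is a one-liner; this sketch again uses the function name from the task:

```python
def explained_variance_ratio(eigenvalues):
    """Each eigenvalue's share of the total variance; ratios sum to 1."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]
```

For example, eigenvalues `[3.0, 1.0]` give ratios `[0.75, 0.25]`: the first component alone explains 75% of the variance.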

The Curse of Dimensionality

PCA is one of the main defences against the curse of dimensionality — the phenomenon where high-dimensional spaces behave counter-intuitively:

  • Distances converge: In $d$ dimensions, the ratio between the nearest and farthest neighbour approaches 1 as $d \to \infty$. This makes distance-based methods (k-NN, k-means, DBSCAN) unreliable.
  • Data becomes sparse: To maintain the same density of data points, you need exponentially more samples as dimensions grow. With $n$ fixed samples, the data "spreads thin" and every point looks like an outlier.
  • Overfitting risk increases: More features relative to samples means more opportunity for the model to memorise noise.
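The distance-concentration effect from the first bullet is easy to observe empirically. This small demo (`near_far_ratio` is a hypothetical helper, not part of the task) samples uniform random points and compares the nearest and farthest squared distances from the origin:

```python
import random

def near_far_ratio(n_points, d, seed=0):
    """Ratio of nearest to farthest squared distance from the origin
    for uniform random points in [0, 1]^d; climbs toward 1 as d grows."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(d)] for _ in range(n_points)]
    dists = [sum(x * x for x in p) for p in pts]
    return min(dists) / max(dists)

# In 2D the nearest point is far closer than the farthest (small ratio);
# in 500D the ratio is close to 1, so "nearest" loses its meaning.
low_d = near_far_ratio(200, 2)
high_d = near_far_ratio(200, 500)
```
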

PCA combats this by projecting data onto the top $k$ principal components, discarding low-variance directions that are likely noise. As a rule of thumb, keep enough components to capture 90-95% of the total variance.
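The rule of thumb can be automated: accumulate explained variance ratios until the target is reached. The helper below (`k_for_variance` is an illustrative name, not part of the task) assumes the eigenvalues are already sorted in descending order:

```python
def k_for_variance(eigenvalues, target=0.95):
    """Smallest k such that the top-k components explain >= target
    of the total variance. Eigenvalues must be in descending order."""
    total = sum(eigenvalues)
    cum = 0.0
    for k, lam in enumerate(eigenvalues, start=1):
        cum += lam / total
        if cum >= target:
            return k
    return len(eigenvalues)
```

With eigenvalues `[5.0, 3.0, 1.0, 0.5, 0.5]`, the cumulative ratios are 0.5, 0.8, 0.9, 0.95, so four of the five components are needed to hit the 95% target.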

Your Task

Implement:

  • center(X) → subtract column means from each row
  • covariance_2d(X_centered) → $2 \times 2$ covariance matrix as a list of lists
  • explained_variance_ratio(eigenvalues) → list of ratios