Lesson 6 of 15

Distance Metrics

Many machine learning algorithms, such as k-NN (a supervised classifier) and k-means (an unsupervised clustering method), rely on measuring the distance between data points. Different metrics capture different notions of similarity.

Euclidean Distance

The straight-line distance between two points $\mathbf{a}$ and $\mathbf{b}$:

$$d_{\text{Euclid}}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$$
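As a sketch, the formula translates directly into plain Python (NumPy would work just as well; the function name matches the task at the end of this lesson):

```python
import math

def euclidean(a, b):
    # Square each coordinate-wise difference, sum, then take the square root.
    # Assumes a and b are equal-length sequences of numbers.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean([0, 0], [3, 4]))  # the classic 3-4-5 triangle: 5.0
```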

Manhattan Distance

The sum of absolute differences (also called the $L_1$ or "city block" distance):

$$d_{\text{Manhattan}}(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{d} |a_i - b_i|$$
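A corresponding sketch, again in plain Python:

```python
def manhattan(a, b):
    # Sum of absolute coordinate-wise differences ("city block" distance).
    # Assumes a and b are equal-length sequences of numbers.
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(manhattan([1, 2], [4, 6]))  # |1-4| + |2-6| = 7
```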

Cosine Similarity

Measures the angle between two vectors, ignoring magnitude:

$$\text{cos\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

  • $1$ means identical direction (angle = 0°)
  • $0$ means perpendicular (angle = 90°)
  • $-1$ means opposite direction (angle = 180°)

Cosine similarity is widely used in text and document similarity because it is scale-invariant.
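A plain-Python sketch of the formula (note that it divides by the product of the norms, so it is undefined for zero vectors; handling that case is left as a design choice):

```python
import math

def cosine_similarity(a, b):
    # Dot product of a and b, normalized by the product of their L2 norms.
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2], [2, 4]))  # parallel vectors -> 1.0
print(cosine_similarity([1, 0], [0, 1]))  # perpendicular    -> 0.0
```

Scale invariance is visible in the first call: `[2, 4]` is just `[1, 2]` doubled, and the similarity is still 1.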

A Warning: The Curse of Dimensionality

All distance metrics suffer in very high dimensions. As the number of features $d$ grows, the difference between the nearest and farthest points shrinks, making distances less meaningful. This is why dimensionality reduction (e.g., PCA, covered later in this course) is often applied before distance-based algorithms like k-NN or k-means.
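You can see this concentration effect with a small simulation (a sketch using random points in the unit cube; the exact ratios depend on the random seed, but the trend does not):

```python
import math
import random

def euclidean(a, b):
    # Plain-Python L2 distance, as defined earlier in this lesson.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

random.seed(0)  # make the experiment reproducible
ratios = {}
for d in (2, 10, 1000):
    # 200 random points in the d-dimensional unit cube, plus one query point.
    points = [[random.random() for _ in range(d)] for _ in range(200)]
    query = [random.random() for _ in range(d)]
    dists = sorted(euclidean(query, p) for p in points)
    ratios[d] = dists[-1] / dists[0]  # farthest-to-nearest distance ratio
    print(f"d={d:4d}  farthest/nearest ratio = {ratios[d]:.2f}")
```

In low dimensions the farthest point is many times farther away than the nearest; by $d = 1000$ the ratio is close to 1, so "nearest neighbor" barely distinguishes anything.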

Your Task

Implement:

  • euclidean(a, b) — Euclidean ($L_2$) distance
  • manhattan(a, b) — Manhattan ($L_1$) distance
  • cosine_similarity(a, b) — cosine similarity