Lesson 6 of 15

Distance Metrics

Many machine learning algorithms, such as k-NN (a supervised classifier) and k-means (an unsupervised clustering method), rely on measuring the distance between data points. Different metrics capture different notions of similarity.

Euclidean Distance

The straight-line distance between two points $\mathbf{a}$ and $\mathbf{b}$:

$$d_{\text{Euclid}}(\mathbf{a}, \mathbf{b}) = \sqrt{\sum_{i=1}^{d} (a_i - b_i)^2}$$
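As a sketch, the formula translates directly into plain Python (NumPy would work just as well; the function name matches the task at the end of this lesson):

```python
import math

def euclidean(a, b):
    # Square each coordinate-wise difference, sum, then take the square root.
    # Assumes a and b are equal-length sequences of numbers.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean([0, 0], [3, 4]))  # the classic 3-4-5 triangle: 5.0
```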

Manhattan Distance

The sum of absolute differences (also called the $L_1$ or "city block" distance):

$$d_{\text{Manhattan}}(\mathbf{a}, \mathbf{b}) = \sum_{i=1}^{d} |a_i - b_i|$$
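A corresponding sketch, again in plain Python:

```python
def manhattan(a, b):
    # Sum of absolute coordinate-wise differences ("city block" distance).
    # Assumes a and b are equal-length sequences of numbers.
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

print(manhattan([1, 2], [4, 6]))  # |1-4| + |2-6| = 7
```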

Cosine Similarity

Measures the angle between two vectors, ignoring magnitude:

$$\text{cos\_sim}(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|}$$

  • $1$ means identical direction (angle = 0°)
  • $0$ means perpendicular (angle = 90°)
  • $-1$ means opposite direction (angle = 180°)

Cosine similarity is widely used in text and document similarity because it is scale-invariant.
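A plain-Python sketch of the formula (note that it divides by the product of the norms, so it is undefined for zero vectors; handling that case is left as a design choice):

```python
import math

def cosine_similarity(a, b):
    # Dot product of a and b, normalized by the product of their L2 norms.
    dot = sum(ai * bi for ai, bi in zip(a, b))
    norm_a = math.sqrt(sum(ai * ai for ai in a))
    norm_b = math.sqrt(sum(bi * bi for bi in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 2], [2, 4]))  # parallel vectors -> 1.0
print(cosine_similarity([1, 0], [0, 1]))  # perpendicular    -> 0.0
```

Scale invariance is visible in the first call: `[2, 4]` is just `[1, 2]` doubled, and the similarity is still 1.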

A Warning: The Curse of Dimensionality

All distance metrics suffer in very high dimensions. As the number of features $d$ grows, the difference between the nearest and farthest points shrinks, making distances less meaningful. This is why dimensionality reduction (e.g., PCA, covered later in this course) is often applied before distance-based algorithms like k-NN or k-means.
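You can see this concentration effect with a small simulation (a sketch using random points in the unit cube; the exact ratios depend on the random seed, but the trend does not):

```python
import math
import random

def euclidean(a, b):
    # Plain-Python L2 distance, as defined earlier in this lesson.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

random.seed(0)  # make the experiment reproducible
ratios = {}
for d in (2, 10, 1000):
    # 200 random points in the d-dimensional unit cube, plus one query point.
    points = [[random.random() for _ in range(d)] for _ in range(200)]
    query = [random.random() for _ in range(d)]
    dists = sorted(euclidean(query, p) for p in points)
    ratios[d] = dists[-1] / dists[0]  # farthest-to-nearest distance ratio
    print(f"d={d:4d}  farthest/nearest ratio = {ratios[d]:.2f}")
```

In low dimensions the farthest point is many times farther away than the nearest; by $d = 1000$ the ratio is close to 1, so "nearest neighbor" barely distinguishes anything.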

Your Task

Implement:

  • euclidean(a, b) — Euclidean ($L_2$) distance
  • manhattan(a, b) — Manhattan ($L_1$) distance
  • cosine_similarity(a, b) — cosine similarity