Lesson 7 of 15

Cross-Entropy

Cross-entropy H(P, Q) measures the average number of bits needed to encode events drawn from distribution P using a code optimized for distribution Q:

H(P, Q) = -\sum_i p_i \log_2 q_i
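For a concrete feel, take P = (1/2, 1/2) and Q = (1/4, 3/4) (values chosen purely for illustration):

```latex
H(P, Q) = -\left( \tfrac{1}{2}\log_2\tfrac{1}{4} + \tfrac{1}{2}\log_2\tfrac{3}{4} \right) \approx 1 + 0.208 = 1.208 \text{ bits}
```

An optimal code for P needs exactly 1 bit per event, so encoding P with a code built for Q costs about 0.21 extra bits on average.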

Relationship to KL Divergence

H(P, Q) = H(P) + D_{KL}(P \| Q)

This decomposition shows:

  • H(P) — the irreducible entropy of P (the minimum bits needed)
  • D_{KL}(P \| Q) — the extra bits wasted by using the wrong model Q

Since D_{KL} \geq 0, we always have H(P, Q) \geq H(P).
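The decomposition is easy to verify numerically. Here is a minimal sketch (the helper names are my own, not part of the exercise):

```python
import math

def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL divergence D_KL(P || Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.2]
q = [0.6, 0.4]

# Check H(P, Q) = H(P) + D_KL(P || Q) up to floating-point noise
print(abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12)  # True
```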

In Machine Learning

Cross-entropy is the standard loss function for classification. With true labels P (one-hot vectors) and model predictions Q (softmax outputs):

\mathcal{L} = -\sum_i y_i \log_2 \hat{y}_i

When P is one-hot (true class probability 1, all others 0), this simplifies to just -\log_2(\hat{y}_{\text{true}}). (ML frameworks typically use the natural log rather than \log_2; the two differ only by a constant factor, so the minimizer is the same.)
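A quick numeric check of that simplification (the example values here are my own):

```python
import math

def cross_entropy_loss(y_true, y_pred):
    """Cross-entropy loss in bits; epsilon guards against log2(0)."""
    epsilon = 1e-15
    return -sum(t * math.log2(p + epsilon)
                for t, p in zip(y_true, y_pred) if t > 0)

y_true = [0, 1, 0]        # one-hot: the true class is index 1
y_pred = [0.2, 0.7, 0.1]  # softmax output

# With a one-hot target, the sum collapses to -log2 of the predicted
# probability of the true class
print(abs(cross_entropy_loss(y_true, y_pred) - (-math.log2(0.7 + 1e-15))) < 1e-12)  # True
```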

import math

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits; epsilon guards against log2(0)."""
    epsilon = 1e-15
    return -sum(p[i] * math.log2(q[i] + epsilon)
                for i in range(len(p)) if p[i] > 0)

p = [0.8, 0.2]
q = [0.6, 0.4]
print(round(cross_entropy(p, q), 4))  # 0.854

Your Task

Implement:

  • cross_entropy(p, q) — computes H(P, Q) = -\sum p_i \log_2(q_i + \varepsilon), summing only over indices where p_i > 0
  • cross_entropy_loss(y_true, y_pred) — same function, ML naming convention