Lesson 12 of 15

One-Hot Encoding

Preparing DNA for Neural Networks

Neural networks work with numbers, not letters. To feed a DNA sequence into a model, we convert each base into a one-hot vector — a 4-element list with exactly one 1 and three 0s:

Base    Encoding
A       [1, 0, 0, 0]
T       [0, 1, 0, 0]
G       [0, 0, 1, 0]
C       [0, 0, 0, 1]

A sequence of length L becomes a matrix of shape L × 4.

ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
}

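def one_hot_encode(seq):
    # Look up each base's 4-element vector.
    # Raises KeyError for any character other than A, T, G, or C.
    return [ONE_HOT[b] for b in seq]

def decode_one_hot(encoded):
    # Column order must match ONE_HOT: index 0 is A, 1 is T, 2 is G, 3 is C.
    bases = ["A", "T", "G", "C"]
    # row.index(1) finds the position of the single 1 in each row.
    return "".join(bases[row.index(1)] for row in encoded)

print(one_hot_encode("ATG"))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
print(decode_one_hot([[1,0,0,0],[0,1,0,0],[0,0,1,0]]))  # ATG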
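
Real sequences often contain lowercase letters or the ambiguity code N, which a plain dictionary lookup rejects with a KeyError. The sketch below shows one possible way to tolerate them and to check the L × 4 shape. It is not part of the required solution: the helper name one_hot_encode_safe and the choice to map unknown bases to an all-zero row are assumptions made for illustration.

```python
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
}

# Assumption: unknown bases (such as N) become an all-zero row.
UNKNOWN = [0, 0, 0, 0]

def one_hot_encode_safe(seq):
    # Hypothetical helper: uppercase first, then fall back to UNKNOWN.
    # list(...) copies each row so callers cannot mutate the shared table.
    return [list(ONE_HOT.get(b, UNKNOWN)) for b in seq.upper()]

encoded = one_hot_encode_safe("atgNc")
shape = (len(encoded), len(encoded[0]))  # (L, 4)

print(encoded)
print(shape)  # (5, 4)
```

Note that an all-zero row has no 1 in it, so a decoder that relies on row.index(1) would raise a ValueError on such a row and would need its own fallback.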

AlphaGenome accepts DNA sequences of up to 1 million base pairs as one-hot encoded input. Its convolutional layers scan this matrix for short patterns: the learned analogues of the TATA box, CpG island, and splice site detectors you have been writing by hand.
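
To make the idea of "scanning the matrix" concrete, here is a toy sketch of a single convolution filter sliding along a one-hot matrix. It is not AlphaGenome's actual architecture: the names scan_with_filter and tata_filter are hypothetical, and the filter weights are set by hand to match TATA exactly, whereas a trained model learns real-valued weights.

```python
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
}

def one_hot_encode(seq):
    return [ONE_HOT[b] for b in seq]

def scan_with_filter(encoded, kernel):
    """Slide a K x 4 kernel along an L x 4 matrix; return L - K + 1 scores."""
    k = len(kernel)
    scores = []
    for start in range(len(encoded) - k + 1):
        score = 0
        for offset in range(k):
            row = encoded[start + offset]
            weights = kernel[offset]
            # Dot product of one sequence position with one filter position.
            score += sum(x * w for x, w in zip(row, weights))
        scores.append(score)
    return scores

# Hand-set filter: the one-hot encoding of the motif itself, so the score
# at each position counts how many bases in the window match "TATA".
tata_filter = one_hot_encode("TATA")

seq = "GCTATAAG"
scores = scan_with_filter(one_hot_encode(seq), tata_filter)
print(scores)                     # [2, 0, 4, 1, 2]
print(scores.index(max(scores)))  # 2
```

The highest score, 4, appears at position 2, where the window reads TATA. Because each row holds a single 1, the dot product simply picks out the filter weight for the base that is present, which is why one-hot input pairs naturally with convolution.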

Your Task

Implement one_hot_encode(seq) and decode_one_hot(encoded).
