Lesson 12 of 15
One-Hot Encoding
Preparing DNA for Neural Networks
Neural networks work with numbers, not letters. To feed a DNA sequence into a model, we convert each base into a one-hot vector — a 4-element list with exactly one 1 and three 0s:
| Base | Encoding |
|---|---|
| A | [1, 0, 0, 0] |
| T | [0, 1, 0, 0] |
| G | [0, 0, 1, 0] |
| C | [0, 0, 0, 1] |
A sequence of length L becomes a matrix of shape L × 4.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
}

def one_hot_encode(seq):
    # One row of four numbers per base, in A, T, G, C column order.
    return [ONE_HOT[b] for b in seq]

def decode_one_hot(encoded):
    # The position of the 1 in each row selects the base.
    bases = ["A", "T", "G", "C"]
    return "".join(bases[row.index(1)] for row in encoded)

print(one_hot_encode("ATG"))  # [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
print(decode_one_hot([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]))  # ATG
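As a quick sanity check on the L × 4 shape and on the encode/decode round trip, here is a minimal sketch. It repeats the lookup table so the block runs on its own. The helper name `matrix_shape` is an assumption introduced for this example, not part of the lesson.

```python
# Lookup table copied from the lesson so this block is self-contained.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
}
BASES = ["A", "T", "G", "C"]


def one_hot_encode(seq):
    return [ONE_HOT[b] for b in seq]


def decode_one_hot(encoded):
    return "".join(BASES[row.index(1)] for row in encoded)


def matrix_shape(matrix):
    # Hypothetical helper: (rows, columns) of a list-of-lists matrix.
    return (len(matrix), len(matrix[0]) if matrix else 0)


seq = "GATTACA"
encoded = one_hot_encode(seq)

print(matrix_shape(encoded))                  # (7, 4)
print(all(sum(row) == 1 for row in encoded))  # True
print(decode_one_hot(encoded) == seq)         # True
```

Each row sums to 1 because exactly one position is set, and decoding the encoded matrix returns the original string.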
AlphaGenome takes a 1 million base-pair DNA sequence as one-hot encoded input. Its convolutional layers scan this matrix looking for short patterns — the learned analogues of the TATA box, CpG island, and splice site detectors you have been writing by hand.
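To make the scanning idea concrete, the sketch below slides one hand-written filter along a one-hot matrix and scores each window with a sum of element-wise products. It is an illustration only: a trained model learns its filter weights, and nothing here reflects AlphaGenome's actual architecture. The names `TATA_FILTER` and `scan` are assumptions made for this example.

```python
# Lookup table copied from the lesson so this block is self-contained.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
}


def one_hot_encode(seq):
    return [ONE_HOT[b] for b in seq]


# A filter is a k x 4 matrix with the same A, T, G, C column order.
# This one is hand-written to match "TATA" exactly (an assumption for
# illustration); a neural network would learn real-valued weights instead.
TATA_FILTER = one_hot_encode("TATA")


def scan(encoded, filt):
    """Return one score per window: the sum of element-wise products."""
    k = len(filt)
    scores = []
    for start in range(len(encoded) - k + 1):
        window = encoded[start:start + k]
        score = sum(
            w * x
            for filt_row, seq_row in zip(filt, window)
            for w, x in zip(filt_row, seq_row)
        )
        scores.append(score)
    return scores


scores = scan(one_hot_encode("GCTATAAG"), TATA_FILTER)
print(scores)                     # [2, 0, 4, 1, 2]
print(scores.index(max(scores)))  # 2, where "TATA" starts
```

With a one-hot filter, the score at each position is simply the number of bases in the window that match the motif, so the maximum of 4 marks an exact match.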
Your Task
Implement one_hot_encode(seq) and decode_one_hot(encoded).