Lesson 13 of 15
K-mer Features
Sequence Fingerprints
A k-mer is a DNA substring of length k. The set of all k-mers in a sequence — and their frequencies — is a powerful feature vector for machine learning.
For k=2, there are 4² = 16 possible dinucleotides. For k=6, there are 4096 possible hexamers. Many transcription factors recognize 6–10 base motifs.
def get_kmers(seq, k):
return [seq[i:i+k] for i in range(len(seq) - k + 1)]
def kmer_frequency(seq, k):
kmers = get_kmers(seq, k)
freq = {}
for kmer in kmers:
freq[kmer] = freq.get(kmer, 0) + 1
return freq
seq = "ATCGATCG"
print(get_kmers(seq, 3))
# ['ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']
K-mer frequencies were one of the earliest ML features for genomics. AlphaGenome's transformer architecture can be thought of as a vastly more powerful version: it learns to detect not just fixed k-mers but arbitrary long-range combinations of sequence patterns.
Your Task
Implement get_kmers(seq, k) and kmer_frequency(seq, k).
Python runtime loading...
Loading...
Click "Run" to execute your code.