Lesson 13 of 15

K-mer Features

Sequence Fingerprints

A k-mer is a DNA substring of length k. The set of all k-mers in a sequence — and their frequencies — is a powerful feature vector for machine learning.

For k=2, there are 4² = 16 possible dinucleotides. For k=6, there are 4096 possible hexamers. Many transcription factors recognize 6–10 base motifs.

def get_kmers(seq, k):
    return [seq[i:i+k] for i in range(len(seq) - k + 1)]

def kmer_frequency(seq, k):
    kmers = get_kmers(seq, k)
    freq = {}
    for kmer in kmers:
        freq[kmer] = freq.get(kmer, 0) + 1
    return freq

seq = "ATCGATCG"
print(get_kmers(seq, 3))
# ['ATC', 'TCG', 'CGA', 'GAT', 'ATC', 'TCG']

K-mer frequencies were one of the earliest ML features for genomics. AlphaGenome's transformer architecture can be thought of as a vastly more powerful version: it learns to detect not just fixed k-mers but arbitrary long-range combinations of sequence patterns.

Your Task

Implement get_kmers(seq, k) and kmer_frequency(seq, k).

Python runtime loading...
Loading...
Click "Run" to execute your code.