Lesson 8 of 17

Token Sequences

Training Data: Next-Token Prediction

A language model learns by predicting the next token in a sequence. For each position in a document, the input is the current token and the target is the next token.

Wrapping with BOS

We wrap each word with a BOS token on both sides:

"emma" → [BOS, e, m, m, a, BOS]

This lets the model learn:

  • Given BOS, predict 'e' (the first character)
  • Given 'a', predict BOS (the end of the word)
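A minimal sketch of the wrapping step, assuming the toy vocabulary used later in this lesson (encode = {'a': 0, 'e': 1, 'm': 2} and BOS = 3):

```python
# Toy character vocabulary; BOS gets its own id (assumed values).
encode = {'a': 0, 'e': 1, 'm': 2}
BOS = 3

word = "emma"
# Wrap the word with BOS on both sides.
tokens = [BOS] + [encode[ch] for ch in word] + [BOS]
print(tokens)  # [3, 1, 2, 2, 0, 3] i.e. BOS, e, m, m, a, BOS
```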

Sliding Window

From tokens [t0, t1, t2, ..., tn], we produce pairs:

  • (t0, t1)
  • (t1, t2)
  • ...
  • (tn-1, tn)
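The sliding window above can be written compactly with zip, which pairs each token with its successor:

```python
tokens = [3, 1, 2, 2, 0, 3]  # BOS, e, m, m, a, BOS for "emma"
# zip(tokens, tokens[1:]) yields (t0, t1), (t1, t2), ..., (tn-1, tn).
pairs = list(zip(tokens, tokens[1:]))
print(pairs)  # [(3, 1), (1, 2), (2, 2), (2, 0), (0, 3)]
```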

In Python:

def make_pairs(word, encode, BOS):
    # Wrap the word with BOS on both sides.
    tokens = [BOS] + [encode[ch] for ch in word] + [BOS]
    # Slide a window of size 2: pair each token with its successor.
    return [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)]

Example

For "emma" with encode = {'a':0, 'e':1, 'm':2} and BOS=3:

  Input   Target   Meaning
    3       1      BOS → e
    1       2      e → m
    2       2      m → m
    2       0      m → a
    0       3      a → BOS
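A quick check of the table, using the make_pairs implementation shown earlier (same toy vocabulary): the pairs it produces should match the table row for row.

```python
def make_pairs(word, encode, BOS):
    # Wrap the word with BOS on both sides, then slide a window of size 2.
    tokens = [BOS] + [encode[ch] for ch in word] + [BOS]
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

encode = {'a': 0, 'e': 1, 'm': 2}
pairs = make_pairs("emma", encode, BOS=3)
print(pairs)  # [(3, 1), (1, 2), (2, 2), (2, 0), (0, 3)]
```

Note that a word of length n always yields n + 1 pairs, because the two BOS tokens add one extra position.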

Your Task

Implement make_pairs(word, encode, BOS) that returns a list of (input, target) integer tuples.
