Lesson 8 of 17

Token Sequences

Training Data: Next-Token Prediction

A language model learns by predicting the next token in a sequence. For each position in a document, the input is the current token and the target is the next token.

Wrapping with BOS

We wrap each word with a BOS token on both sides:

"emma" → [BOS, e, m, m, a, BOS]

This lets the model learn:

  • Given BOS, predict 'e' (the first character)
  • Given 'a', predict BOS (the end of the word)
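A minimal sketch of the wrapping step, assuming the toy vocabulary used later in this lesson (encode = {'a': 0, 'e': 1, 'm': 2} and BOS = 3):

```python
# Toy character vocabulary; BOS gets its own id (assumed values).
encode = {'a': 0, 'e': 1, 'm': 2}
BOS = 3

word = "emma"
# Wrap the word with BOS on both sides.
tokens = [BOS] + [encode[ch] for ch in word] + [BOS]
print(tokens)  # [3, 1, 2, 2, 0, 3] i.e. BOS, e, m, m, a, BOS
```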

Sliding Window

From tokens [t0, t1, t2, ..., tn], we produce pairs:

  • (t0, t1)
  • (t1, t2)
  • ...
  • (tn-1, tn)
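The sliding window above can be written compactly with zip, which pairs each token with its successor:

```python
tokens = [3, 1, 2, 2, 0, 3]  # BOS, e, m, m, a, BOS for "emma"
# zip(tokens, tokens[1:]) yields (t0, t1), (t1, t2), ..., (tn-1, tn).
pairs = list(zip(tokens, tokens[1:]))
print(pairs)  # [(3, 1), (1, 2), (2, 2), (2, 0), (0, 3)]
```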

In Python:

def make_pairs(word, encode, BOS):
    # Wrap the word with BOS on both sides.
    tokens = [BOS] + [encode[ch] for ch in word] + [BOS]
    # Slide a window of size 2: pair each token with its successor.
    return [(tokens[i], tokens[i+1]) for i in range(len(tokens)-1)]

Example

For "emma" with encode = {'a':0, 'e':1, 'm':2} and BOS=3:

  Input   Target   Meaning
    3       1      BOS → e
    1       2      e → m
    2       2      m → m
    2       0      m → a
    0       3      a → BOS
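A quick check of the table, using the make_pairs implementation shown earlier (same toy vocabulary): the pairs it produces should match the table row for row.

```python
def make_pairs(word, encode, BOS):
    # Wrap the word with BOS on both sides, then slide a window of size 2.
    tokens = [BOS] + [encode[ch] for ch in word] + [BOS]
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

encode = {'a': 0, 'e': 1, 'm': 2}
pairs = make_pairs("emma", encode, BOS=3)
print(pairs)  # [(3, 1), (1, 2), (2, 2), (2, 0), (0, 3)]
```

Note that a word of length n always yields n + 1 pairs, because the two BOS tokens add one extra position.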

Your Task

Implement make_pairs(word, encode, BOS) that returns a list of (input, target) integer tuples.
