Lesson 6 of 17

Character Tokenizer

Tokenization

Neural networks process numbers, not strings. A tokenizer maps characters to integer IDs and back.

Character-Level Tokenizer

The simplest tokenizer assigns an integer to each unique character:

text = "hello"
chars = sorted(set(text))   # ['e', 'h', 'l', 'o']
encode = {ch: i for i, ch in enumerate(chars)}
decode = {i: ch for i, ch in enumerate(chars)}

The sort ensures a consistent, reproducible mapping.

The BOS Token

We add a special Beginning-Of-Sequence (BOS) token to mark sequence boundaries. It gets index len(chars), just after all the character tokens:

BOS = len(chars)        # e.g. 4 for "hello"
vocab_size = len(chars) + 1

The same BOS token serves as both start and end marker. A training sequence for "emma" would be:

[BOS, e, m, m, a, BOS]

The model learns to predict the next token at every position, including predicting BOS after the last character.
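As a concrete sketch (following the snippets above, not part of the lesson scaffold), here is how the "emma" training sequence comes together. Note that "emma" has only three unique characters, so BOS lands at index 3:

```python
# Build the vocabulary and BOS token for "emma", then wrap the
# encoded text in BOS markers to form a training sequence.
text = "emma"
chars = sorted(set(text))                       # ['a', 'e', 'm']
encode = {ch: i for i, ch in enumerate(chars)}
BOS = len(chars)                                # 3

sequence = [BOS] + [encode[ch] for ch in text] + [BOS]
# sequence == [3, 1, 2, 2, 0, 3]
```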

Encoding and Decoding

tokens = [encode[ch] for ch in "hello"]  # [1, 0, 2, 2, 3]
decoded = ''.join(decode[t] for t in tokens)  # "hello"
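When decoding a sequence that contains BOS markers (for example, a hypothetical model output), BOS has no character of its own, so one simple approach is to skip it:

```python
# Sketch: decode a token sequence while skipping BOS markers.
chars = sorted(set("hello"))                    # ['e', 'h', 'l', 'o']
decode = {i: ch for i, ch in enumerate(chars)}
BOS = len(chars)                                # 4

tokens = [BOS, 1, 0, 2, 2, 3, BOS]              # hypothetical output
decoded = ''.join(decode[t] for t in tokens if t != BOS)  # "hello"
```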

Your Task

Implement build_tokenizer(text) that returns (BOS, vocab_size, encode, decode).
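One possible shape for the function, assembled from the snippets above (a sketch, not the only valid implementation):

```python
def build_tokenizer(text):
    # Sorted, de-duplicated characters give a reproducible vocabulary.
    chars = sorted(set(text))
    encode = {ch: i for i, ch in enumerate(chars)}
    decode = {i: ch for i, ch in enumerate(chars)}
    BOS = len(chars)               # one extra ID just past the characters
    vocab_size = len(chars) + 1
    return BOS, vocab_size, encode, decode
```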
