Lesson 6 of 17
Character Tokenizer
Tokenization
Neural networks process numbers, not strings. A tokenizer maps characters to integer IDs and back.
Character-Level Tokenizer
The simplest tokenizer assigns an integer to each unique character:
```python
text = "hello"
chars = sorted(set(text))  # ['e', 'h', 'l', 'o']
encode = {ch: i for i, ch in enumerate(chars)}  # char -> int
decode = {i: ch for i, ch in enumerate(chars)}  # int -> char
```
The sort ensures a consistent, reproducible mapping.
The BOS Token
We need a special token to mark sequence boundaries. By convention it is called the BOS (Beginning-Of-Sequence) token, and we place it at index len(chars) — just after all the character tokens:
```python
BOS = len(chars)  # e.g. 4 for "hello"
vocab_size = len(chars) + 1
```
The same BOS token serves as both start and end marker. A training sequence for "emma" would be:
[BOS, e, m, m, a, BOS]
The model learns to predict the next token at every position, including predicting BOS after the last character.
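Putting the pieces together, here is a minimal sketch of building such a training sequence. The tiny one-word corpus is hypothetical; a real run would build the vocabulary from the full training text:

```python
# Build a BOS-delimited training sequence for a single word.
text = "emma"
chars = sorted(set(text))                       # ['a', 'e', 'm']
encode = {ch: i for i, ch in enumerate(chars)}  # {'a': 0, 'e': 1, 'm': 2}
BOS = len(chars)                                # 3

# Same BOS id marks both the start and the end of the sequence.
sequence = [BOS] + [encode[ch] for ch in text] + [BOS]
print(sequence)  # [3, 1, 2, 2, 0, 3]
```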
Encoding and Decoding
```python
tokens = [encode[ch] for ch in "hello"]        # [1, 0, 2, 2, 3]
decoded = ''.join(decode[t] for t in tokens)   # "hello"
```
Your Task
Implement build_tokenizer(text) that returns (BOS, vocab_size, encode, decode).
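If you get stuck, one possible implementation assembled from the pieces above — the dict-based mappings are one reasonable choice, not the only valid one:

```python
def build_tokenizer(text):
    chars = sorted(set(text))                       # deterministic ordering
    encode = {ch: i for i, ch in enumerate(chars)}  # char -> int
    decode = {i: ch for i, ch in enumerate(chars)}  # int -> char
    BOS = len(chars)                                # one id past the characters
    vocab_size = len(chars) + 1                     # characters + BOS
    return BOS, vocab_size, encode, decode

BOS, vocab_size, encode, decode = build_tokenizer("hello")
print(BOS, vocab_size)  # 4 5
```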