Lesson 14 of 15

Train / Test Split


To evaluate a model honestly, we must test it on data it has never seen during training. We split the dataset into a training set (used to fit the model) and a test set (used only for evaluation).

Splitting Strategy

  1. Create an index array $[0, 1, \ldots, n-1]$
  2. Shuffle the indices randomly (deterministic with a seed)
  3. Take the first $\lfloor n \cdot \text{test\_ratio} \rfloor$ indices as test indices; the rest as train indices
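The three steps can be sketched as follows. Note that `random.Random` stands in here for the lesson's LCG-based shuffle (described in the next section), so the exact index order will differ from the graded exercise; the function name `split_indices` is illustrative, not the required API.

```python
import random

def split_indices(n, test_ratio, seed=42):
    """Sketch of the three-step split on indices only."""
    idx = list(range(n))                 # step 1: index array [0..n-1]
    random.Random(seed).shuffle(idx)     # step 2: seeded (deterministic) shuffle
    n_test = int(n * test_ratio)         # step 3: floor(n * test_ratio)
    return idx[n_test:], idx[:n_test]    # (train indices, test indices)

train_idx, test_idx = split_indices(8, 0.25)
```

The same seed always yields the same partition, which is what makes the evaluation reproducible.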

Deterministic Shuffling with LCG

We use a Linear Congruential Generator (LCG) for reproducible shuffles:

$$s_{i+1} = (a \cdot s_i + c) \bmod m$$

with $a = 1664525$, $c = 1013904223$, $m = 2^{32}$.

Fisher-Yates shuffle: iterate $i$ from $n-1$ down to $1$; generate $j = s \bmod (i+1)$; swap indices $i$ and $j$.
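A minimal sketch combining the LCG update with the Fisher-Yates loop (the function names are illustrative; whether the seed itself counts as the first state is an assumption to check against the exercise's tests):

```python
def lcg(seed):
    """Yield an endless stream of LCG states: s = (a*s + c) mod 2**32."""
    a, c, m = 1664525, 1013904223, 2**32
    s = seed
    while True:
        s = (a * s + c) % m
        yield s

def fisher_yates(n, seed=42):
    """Return [0..n-1] shuffled with Fisher-Yates, driven by the LCG."""
    idx = list(range(n))
    rng = lcg(seed)
    for i in range(n - 1, 0, -1):    # i from n-1 down to 1
        j = next(rng) % (i + 1)      # j in [0, i]
        idx[i], idx[j] = idx[j], idx[i]
    return idx
```

Because the LCG is fully determined by its seed, the same seed always produces the same permutation.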

Stratified Ratio

For imbalanced datasets it is useful to check whether the positive class proportion is preserved:

$$\text{stratified\_ratio} = \frac{\sum_{i \in \text{test}} y_i}{|\text{test}|}$$
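A quick sketch of this check, assuming binary 0/1 labels. The helper name and the comparison against the overall positive rate are illustrative additions, not the lesson's required `stratified_ratio` signature:

```python
def check_stratification(y, test_idx):
    """Compare the positive-class proportion in the test slice
    against the full dataset; a large gap signals an unlucky split."""
    test_ratio = sum(y[i] for i in test_idx) / len(test_idx)
    overall = sum(y) / len(y)
    return test_ratio, overall

y = [1, 0, 0, 1, 0, 0, 0, 1]
test_ratio, overall = check_stratification(y, [0, 1, 2, 3])
```

Here the test slice contains 2 positives out of 4 (0.5) versus 3 of 8 overall (0.375), so this particular split over-represents the positive class.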

k-Fold Cross-Validation

A single train/test split can be noisy: the model's score depends heavily on which examples land in the test set. k-fold cross-validation gives a more robust estimate by repeating the evaluation $k$ times:

  1. Partition the $n$ indices into $k$ roughly equal folds (no shuffling needed; the split itself is deterministic).
  2. For each fold $i \in [0, k)$, treat fold $i$ as the validation set and the remaining $k-1$ folds as the training set.
  3. Average the $k$ scores to get the final estimate.

Each fold has size $\lfloor n / k \rfloor$, except the last fold, which absorbs any remainder. Fold $i$ covers indices $[i \cdot f,\; (i+1) \cdot f)$ for $i < k-1$, and the last fold covers $[(k-1) \cdot f,\; n)$, where $f = \lfloor n / k \rfloor$.
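One possible reading of the fold layout above, translated directly from the interval formula (treat this as a sketch, not the reference solution):

```python
def k_fold_indices(n, k):
    """Deterministic contiguous folds: each of size n // k,
    with the last fold absorbing any remainder.
    Returns k pairs of (train_indices, val_indices)."""
    f = n // k
    folds = []
    for i in range(k):
        start = i * f
        stop = (i + 1) * f if i < k - 1 else n   # last fold takes the remainder
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n))
        folds.append((train, val))
    return folds
```

For example, with $n = 10$ and $k = 3$, the validation folds are $[0, 3)$, $[3, 6)$, and $[6, 10)$: the last fold picks up the extra element.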

Cross-validation is especially important when data is limited — it ensures every example is used for both training and validation exactly once.

Your Task

Implement:

  • train_test_split(X, y, test_ratio, seed=42) → (X_train, X_test, y_train, y_test)
  • stratified_ratio(y, test_ratio) → fraction of positives in the test set
  • k_fold_indices(n, k) → list of $k$ tuples, each containing (train_indices, val_indices)