Lesson 14 of 15

Train / Test Split


To evaluate a model honestly, we must test it on data it has never seen during training. We split the dataset into a training set (used to fit the model) and a test set (used only for evaluation).

Splitting Strategy

  1. Create an index array $[0, 1, \ldots, n-1]$
  2. Shuffle the indices randomly (deterministic with a seed)
  3. Take the first $\lfloor n \cdot \text{test\_ratio} \rfloor$ indices as test indices; the rest as train indices
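The three steps can be sketched as follows. Note that `random.Random` stands in here for the lesson's LCG-based shuffle (described in the next section), so the exact index order will differ from the graded exercise; the function name `split_indices` is illustrative, not the required API.

```python
import random

def split_indices(n, test_ratio, seed=42):
    """Sketch of the three-step split on indices only."""
    idx = list(range(n))                 # step 1: index array [0..n-1]
    random.Random(seed).shuffle(idx)     # step 2: seeded (deterministic) shuffle
    n_test = int(n * test_ratio)         # step 3: floor(n * test_ratio)
    return idx[n_test:], idx[:n_test]    # (train indices, test indices)

train_idx, test_idx = split_indices(8, 0.25)
```

The same seed always yields the same partition, which is what makes the evaluation reproducible.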

Deterministic Shuffling with LCG

We use a Linear Congruential Generator (LCG) for reproducible shuffles:

$$s_{i+1} = (a \cdot s_i + c) \bmod m$$

with $a = 1664525$, $c = 1013904223$, $m = 2^{32}$.

Fisher-Yates shuffle: iterate $i$ from $n-1$ down to $1$; generate $j = s \bmod (i+1)$; swap indices $i$ and $j$.
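A minimal sketch combining the LCG update with the Fisher-Yates loop (the function names are illustrative; whether the seed itself counts as the first state is an assumption to check against the exercise's tests):

```python
def lcg(seed):
    """Yield an endless stream of LCG states: s = (a*s + c) mod 2**32."""
    a, c, m = 1664525, 1013904223, 2**32
    s = seed
    while True:
        s = (a * s + c) % m
        yield s

def fisher_yates(n, seed=42):
    """Return [0..n-1] shuffled with Fisher-Yates, driven by the LCG."""
    idx = list(range(n))
    rng = lcg(seed)
    for i in range(n - 1, 0, -1):    # i from n-1 down to 1
        j = next(rng) % (i + 1)      # j in [0, i]
        idx[i], idx[j] = idx[j], idx[i]
    return idx
```

Because the LCG is fully determined by its seed, the same seed always produces the same permutation.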

Stratified Ratio

For imbalanced datasets it is useful to check whether the positive class proportion is preserved:

$$\text{stratified\_ratio} = \frac{\sum_{i \in \text{test}} y_i}{|\text{test}|}$$
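A quick sketch of this check, assuming binary 0/1 labels. The helper name and the comparison against the overall positive rate are illustrative additions, not the lesson's required `stratified_ratio` signature:

```python
def check_stratification(y, test_idx):
    """Compare the positive-class proportion in the test slice
    against the full dataset; a large gap signals an unlucky split."""
    test_ratio = sum(y[i] for i in test_idx) / len(test_idx)
    overall = sum(y) / len(y)
    return test_ratio, overall

y = [1, 0, 0, 1, 0, 0, 0, 1]
test_ratio, overall = check_stratification(y, [0, 1, 2, 3])
```

Here the test slice contains 2 positives out of 4 (0.5) versus 3 of 8 overall (0.375), so this particular split over-represents the positive class.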

k-Fold Cross-Validation

A single train/test split can be noisy: the model's score depends heavily on which examples land in the test set. k-fold cross-validation gives a more robust estimate by repeating the evaluation $k$ times:

  1. Partition the $n$ indices into $k$ roughly equal folds (no shuffling needed; the split itself is deterministic).
  2. For each fold $i \in [0, k)$, treat fold $i$ as the validation set and the remaining $k-1$ folds as the training set.
  3. Average the $k$ scores to get the final estimate.

Each fold has size $\lfloor n / k \rfloor$, except the last fold, which absorbs any remainder. Fold $i$ covers indices $[i \cdot f,\; (i+1) \cdot f)$ for $i < k-1$, and the last fold covers $[(k-1) \cdot f,\; n)$, where $f = \lfloor n / k \rfloor$.
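One possible reading of the fold layout above, translated directly from the interval formula (treat this as a sketch, not the reference solution):

```python
def k_fold_indices(n, k):
    """Deterministic contiguous folds: each of size n // k,
    with the last fold absorbing any remainder.
    Returns k pairs of (train_indices, val_indices)."""
    f = n // k
    folds = []
    for i in range(k):
        start = i * f
        stop = (i + 1) * f if i < k - 1 else n   # last fold takes the remainder
        val = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n))
        folds.append((train, val))
    return folds
```

For example, with $n = 10$ and $k = 3$, the validation folds are $[0, 3)$, $[3, 6)$, and $[6, 10)$: the last fold picks up the extra element.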

Cross-validation is especially important when data is limited — it ensures every example is used for both training and validation exactly once.

Your Task

Implement:

  • train_test_split(X, y, test_ratio, seed=42) → (X_train, X_test, y_train, y_test)
  • stratified_ratio(y, test_ratio) → fraction of positives in the test set
  • k_fold_indices(n, k) → list of $k$ tuples, each containing (train_indices, val_indices)