Train / Test Split
To evaluate a model honestly, we must test it on data it has never seen during training. We split the dataset into a training set (used to fit the model) and a test set (used only for evaluation).
Splitting Strategy
- Create an index array
- Shuffle the indices randomly (deterministic with a seed)
- Take the first n_test = floor(n * test_ratio) indices as test indices; the rest as train indices
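The three steps above can be sketched as follows. This is one possible implementation, using Python's built-in `random` module as the shuffle for illustration; the signature follows the task spec:

```python
import random

def train_test_split(X, y, test_ratio, seed=42):
    """Split X and y into train/test sets (sketch, not a reference solution)."""
    n = len(X)
    indices = list(range(n))               # 1. index array
    random.Random(seed).shuffle(indices)   # 2. deterministic shuffle
    n_test = int(n * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]  # 3. split
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test
```

Because the shuffle is seeded, calling the function twice with the same seed yields the same split.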
Deterministic Shuffling with LCG
We use a Linear Congruential Generator (LCG) for reproducible shuffles:
x_{n+1} = (a * x_n + c) mod m, with multiplier a, increment c, and modulus m (a common choice is a = 1664525, c = 1013904223, m = 2^32, the Numerical Recipes constants).
Fisher-Yates shuffle: iterate i from n - 1 down to 1; generate j = x mod (i + 1), where x is the next LCG output; swap indices i and j.
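A minimal sketch of the LCG-driven Fisher-Yates shuffle. The specific constants are an assumption (the Numerical Recipes set), not mandated by the text; any valid LCG parameters work:

```python
def lcg(state):
    """Advance the LCG by one step (assumed constants: Numerical Recipes)."""
    a, c, m = 1664525, 1013904223, 2**32
    return (a * state + c) % m

def fisher_yates_shuffle(indices, seed):
    """In-place Fisher-Yates shuffle driven by the LCG above."""
    state = seed
    for i in range(len(indices) - 1, 0, -1):  # n-1 down to 1
        state = lcg(state)
        j = state % (i + 1)
        indices[i], indices[j] = indices[j], indices[i]
    return indices
```

Since the LCG state depends only on the seed, the same seed always produces the same permutation.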
Stratified Ratio
For imbalanced datasets it is useful to check whether the positive class proportion is preserved: the fraction of positives in the test set should be close to the fraction of positives in the full dataset.
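As a quick check, the two fractions can be compared directly; `positive_fraction` here is a hypothetical helper for illustration, not part of the required API:

```python
def positive_fraction(labels):
    """Share of examples labeled 1 (assumes binary 0/1 labels)."""
    return sum(1 for v in labels if v == 1) / len(labels)

# A split preserves stratification when the test-set fraction is
# close to the overall fraction:
y = [1, 0, 0, 1, 0, 1, 0, 0]
y_test = y[:4]  # hypothetical test slice, for illustration only
gap = abs(positive_fraction(y_test) - positive_fraction(y))
```

A large gap signals that the shuffle happened to concentrate one class in the test set, which is exactly the risk stratification guards against.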
k-Fold Cross-Validation
A single train/test split can be noisy — the model's score depends heavily on which examples land in the test set. k-fold cross-validation gives a more robust estimate by repeating the evaluation k times:
- Partition the indices into k roughly equal folds (no shuffling needed — the split itself is deterministic).
- For each fold i = 0, ..., k - 1, treat fold i as the validation set and the remaining k - 1 folds as the training set.
- Average the scores to get the final estimate.
Each fold has size s = floor(n / k), except the last fold, which absorbs any remainder. Fold i covers indices [i * s, (i + 1) * s) for i < k - 1, and the last fold covers [(k - 1) * s, n).
Cross-validation is especially important when data is limited — it ensures every example is used for both training and validation exactly once.
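One possible sketch of `k_fold_indices` following the fold layout above:

```python
def k_fold_indices(n, k):
    """Deterministic k-fold partition: fold i covers [i*s, (i+1)*s),
    and the last fold absorbs the remainder, where s = n // k."""
    s = n // k
    folds = []
    for i in range(k):
        start = i * s
        stop = (i + 1) * s if i < k - 1 else n
        val_indices = list(range(start, stop))
        # Training set = everything outside the validation slice.
        train_indices = list(range(0, start)) + list(range(stop, n))
        folds.append((train_indices, val_indices))
    return folds
```

For example, `k_fold_indices(10, 3)` yields validation folds of sizes 3, 3, and 4, and every index appears in exactly one validation fold.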
Your Task
Implement:
- `train_test_split(X, y, test_ratio, seed=42)` → `(X_train, X_test, y_train, y_test)`
- `stratified_ratio(y, test_ratio)` → fraction of positives in the test set
- `k_fold_indices(n, k)` → list of tuples, each containing `(train_indices, val_indices)`