Overview
1. Why splitting your data correctly matters
In machine learning, it is not enough to build a powerful model; you must also evaluate it
in a way that reflects real-world performance. This requires splitting your dataset into
distinct subsets and strictly avoiding data leakage—any situation where
information from the test set influences the training process.
This document explains:
- What train, validation, and test sets are and how they differ.
- Why using the test set for validation leads to data leakage.
- How to correctly implement K-fold cross-validation.
- How these ideas apply to sequence models like CNN–LSTM for gait analysis.
Navigation
2. Table of Contents
- 1. Why splitting your data correctly matters
- 3. Train / Validation / Test: Roles and definitions
- 4. What is data leakage?
- 5. Example of a wrong split (leakage scenario)
- 6. Correct split strategies (no leakage)
- 7. K-fold cross-validation (Version 2)
- 8. Applying this to gait-based CNN–LSTM models
- 9. Practical checklist: "Do I have data leakage?"
Concepts
3. Train / Validation / Test: Roles and definitions
3.1 Train set
The train set is the portion of data the model sees during learning.
Gradients are computed on this data, and model weights are updated accordingly. In deep
learning, this typically corresponds to X_train and y_train.
3.2 Validation set
The validation set is used during training to tune hyperparameters, select
the best model, and monitor overfitting. The model does not update its weights directly on
this set, but its performance on the validation data guides:
- early stopping (see the callback sketch after this list),
- learning rate schedules,
- architecture and hyperparameter choices.
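In Keras, for example, this guidance is typically wired up through callbacks. A minimal sketch, assuming a compiled TensorFlow/Keras model (model) and a held-out validation set (X_val, y_val); the patience values are placeholders, not recommendations:
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Stop training when validation loss stops improving and restore the best weights
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
    # Lower the learning rate when the validation loss plateaus
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=100,
    callbacks=callbacks
)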
3.3 Test set
The test set is used only once, at the very end, to estimate
how well the final, chosen model generalizes to truly unseen data. It should not influence
model design, tuning, or training decisions.
ℹ️
Rule of thumb: Once you use the test set to make a decision
(e.g., choose a model), it is no longer a test set—it has become part of the tuning
pipeline.
Data Leakage
4. What is data leakage?
Data leakage occurs when information from outside the training data
“leaks” into the training process, giving the model access to information it would not
have in a real-world scenario. As a result, performance metrics become overly optimistic
and do not reflect true generalization.
4.1 Common examples
- Using the test set for hyperparameter tuning (directly or indirectly).
- Applying normalization or SMOTE on the entire dataset before splitting.
- Including future information in time-series models (e.g., using future frames to predict past labels).
- Subject-level leakage in medical or gait data (same subject appears in both train and test).
⚠️
If your model sees the test data during training—even indirectly—you
cannot trust the reported accuracy, AUC, or F1-score. The model may simply be
memorizing patterns specific to those test samples.
Anti-Pattern
5. Example of a wrong split (leakage scenario)
Consider the following incorrect workflow:
# ❌ WRONG (data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Train on X_train
model.fit(X_train, y_train, validation_data=(X_test, y_test))
# Later...
model.evaluate(X_test, y_test) # "Final test performance"
In this setup:
- The test set (X_test, y_test) is used as a validation set during training.
- The model and hyperparameters (e.g., number of epochs, early stopping) are chosen based on performance on this “test” data.
- Then, the same data is used again for final evaluation.
🚫
This is data leakage because the test set is influencing the training
process. The model is indirectly tuned to perform well on that specific data, and
reported performance will be biased.
Best Practice
6. Correct split strategies (no leakage)
6.1 Simple 3-way split: train / validation / test
A common, leakage-free strategy is to split the data into three parts:
from sklearn.model_selection import train_test_split
# First: train + temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
# Second: validation + test (split the temp set)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)
This results in approximately:
- 70% training data
- 15% validation data
- 15% test data
Then, during model training:
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[...]
)
# Final evaluation (only once)
test_loss, test_acc = model.evaluate(X_test, y_test)
✅
The test set (X_test) is completely untouched until the end. It is
used exactly once for final evaluation, so your metrics reflect how the model
will behave on new, unseen subjects.
6.2 Important: Apply preprocessing only on training data
When using scaling, SMOTE, or other preprocessing steps, always fit them on
the training data only:
# Assumption: for illustration, scaler is scikit-learn's StandardScaler and
# smote is imbalanced-learn's SMOTE; the same rule applies to any scaler or resampler.
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE
scaler = StandardScaler()
smote = SMOTE(random_state=42)
# ✅ Correct: fit on train, transform on all
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)
# ✅ Correct: SMOTE on train only
X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)
# ❌ Wrong: NEVER fit on full dataset before splitting
# scaler.fit(X); smote.fit_resample(X, y)
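For scikit-learn-style workflows on a 2D feature matrix, this discipline can also be enforced automatically with an imbalanced-learn Pipeline: the scaler and SMOTE are then (re)fitted on each training split only, including inside cross-validation. A sketch under those assumptions, with RandomForestClassifier as a hypothetical stand-in estimator (not the CNN–LSTM discussed later):
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

pipe = Pipeline([
    ("scaler", StandardScaler()),       # fitted on each training split only
    ("smote", SMOTE(random_state=42)),  # applied only when fitting, never to validation/test data
    ("clf", RandomForestClassifier(random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring="f1")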
Advanced
7. K-fold cross-validation (Version 2)
For small or imbalanced datasets, a single train/validation split may not be
reliable enough. K-fold cross-validation improves robustness by
training and evaluating the model across multiple different splits.
7.1 Idea
- Split the dataset into K folds (e.g., 5 or 10).
- For each fold:
  - Use that fold as the validation (or test) set.
  - Train on the remaining K-1 folds.
- Average performance metrics across folds.
7.2 Example (conceptual)
from sklearn.model_selection import KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
fold_metrics = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold + 1}")
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Fit scaler / SMOTE only on this fold's training data
    # (scaler and smote instantiated as in section 6.2)
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_val_scaled = scaler.transform(X_val)
    X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)

    model = build_model()  # fresh CNN–LSTM model for each fold
    history = model.fit(
        X_train_res, y_train_res,
        validation_data=(X_val_scaled, y_val),
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        verbose=0
    )

    val_loss, val_acc = model.evaluate(X_val_scaled, y_val, verbose=0)
    fold_metrics.append(val_acc)
At the end, you can report (a short aggregation sketch follows this list):
- Mean validation accuracy across folds.
- Standard deviation of accuracy.
- Any additional metrics (F1-score, AUC) averaged across folds.
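For example, the fold_metrics list collected in the loop above can be aggregated with NumPy:
import numpy as np

fold_metrics = np.array(fold_metrics)
print(f"Mean validation accuracy: {fold_metrics.mean():.3f}")
print(f"Std of validation accuracy: {fold_metrics.std():.3f}")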
Application
8. Applying this to gait-based CNN–LSTM models
In gait analysis (e.g., ASD vs. non-ASD classification), each subject is
represented by a sequence of gait features, such as:
- hip and knee flexion/extension,
- hip and knee abduction/adduction,
- with each sequence normalized to a fixed number of frames (e.g., 120).
This produces a 3D tensor:
X.shape = (N_subjects, T_frames, F_features)
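The build_model() function used in the K-fold example above is not defined in this document. A minimal sketch of what a CNN–LSTM for such input could look like, assuming TensorFlow/Keras, 120 frames, four joint-angle features, and binary classification; the layer sizes are illustrative only:
from tensorflow import keras
from tensorflow.keras import layers

def build_model(t_frames=120, n_features=4):
    # Conv1D blocks extract local temporal patterns, the LSTM summarizes the
    # whole gait cycle, and a sigmoid head performs binary classification.
    model = keras.Sequential([
        keras.Input(shape=(t_frames, n_features)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Conv1D(128, kernel_size=3, padding="same", activation="relu"),
        layers.LSTM(64),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model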
Key points to avoid leakage in this setting:
- Split data at the subject level, not at the frame level (see the splitting sketch after this list).
- Ensure that the same subject never appears in both training and test sets.
- Apply sequence-level preprocessing (e.g., scaling, SMOTE on flattened sequences) only within each training split or fold.
- Use validation data for early stopping and model selection, not the test set.
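For the subject-level split mentioned above, scikit-learn's group-aware splitters are a natural fit. A sketch, assuming NumPy arrays and a hypothetical subject_ids array with one entry per sequence:
from sklearn.model_selection import GroupShuffleSplit, GroupKFold

# Hold out ~20% of subjects as the test set; no subject crosses the boundary
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=subject_ids))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# For cross-validation, GroupKFold keeps each subject inside a single fold
gkf = GroupKFold(n_splits=5)
for tr_idx, val_idx in gkf.split(X_train, y_train, groups=subject_ids[train_idx]):
    ...  # train and validate as in the K-fold example above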
🧠
When using CNN–LSTM architectures, keep the temporal structure of the data
intact during training (3D input). Only flatten sequences temporarily if needed
for operations like SMOTE, then reshape back to 3D afterwards.
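A minimal sketch of that flatten, resample, reshape round trip, assuming X_train has shape (n_samples, T, F) and smote is an imblearn SMOTE instance as in section 6.2:
n_samples, t_frames, n_features = X_train.shape
# SMOTE expects 2D input, so flatten each sequence temporarily
X_flat = X_train.reshape(n_samples, t_frames * n_features)
X_res_flat, y_train_res = smote.fit_resample(X_flat, y_train)
# Restore the 3D (samples, frames, features) layout for the CNN–LSTM
X_train_res = X_res_flat.reshape(-1, t_frames, n_features)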
Checklist
9. Practical checklist: “Do I have data leakage?”
Before trusting your results, ask yourself:
- Did I ever use the test set to:
- tune hyperparameters,
- decide when to stop training, or
- select between different model architectures?
- Did I fit any preprocessing (scaler, SMOTE, PCA, etc.) on the full dataset instead of only the training part?
- In subject-based data, could the same subject appear in both train and test sets? (A quick check is sketched after this list.)
- Did I reuse the test set multiple times to iteratively improve my model?
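For the subject-overlap question, a quick programmatic check helps; a sketch, assuming hypothetical train_subject_ids and test_subject_ids collections:
overlap = set(train_subject_ids) & set(test_subject_ids)
assert not overlap, f"Subject-level leakage: subjects in both splits: {sorted(overlap)}"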
❗
If the answer is “yes” to any of the questions above, your evaluation is
likely biased due to data leakage. You should redesign the split and
repeat the experiments.
A clean experiment pipeline typically looks like:
- Define task and labels (e.g., ASD vs. non-ASD).
- Split subjects into train / validation / test (or K folds).
- Fit preprocessing only on training data.
- Train models, tune on validation, use early stopping if needed.
- Freeze the final model and evaluate once on the test set.
- Report metrics with clear description of the split procedure.