Train / Validation / Test Splits & Data Leakage

Practical guide for deep learning projects (e.g., gait-based ASD classification using CNN–LSTM)

Python · TensorFlow / Keras · LSTM · CNN · K-fold Cross-Validation · No Data Leakage
Overview

1. Why splitting your data correctly matters

In machine learning, it is not enough to build a powerful model; you must also evaluate it in a way that reflects real-world performance. This requires splitting your dataset into distinct subsets and strictly avoiding data leakage—any situation where information from the test set influences the training process.

This document explains:

  • What train, validation, and test sets are and how they differ.
  • Why using the test set for validation leads to data leakage.
  • How to correctly implement K-fold cross-validation.
  • How these ideas apply to sequence models like CNN–LSTM for gait analysis.
Concepts

3. Train / Validation / Test: Roles and definitions

3.1 Train set

The train set is the portion of data the model sees during learning. Gradients are computed on this data, and model weights are updated accordingly. In deep learning, this typically corresponds to X_train and y_train.

3.2 Validation set

The validation set is used during training to tune hyperparameters, select the best model, and monitor overfitting. The model does not update its weights directly on this set, but its performance on the validation data guides:

  • early stopping,
  • learning rate schedules,
  • architecture and hyperparameter choices (see the Keras sketch below).
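
In Keras, these validation-driven decisions are usually expressed as callbacks that monitor validation metrics. A minimal sketch, with illustrative (not prescriptive) callback choices and patience values:

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Stop training when the validation loss stops improving and keep the best weights
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)

# Reduce the learning rate when the validation loss plateaus
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5)

# Later: model.fit(..., validation_data=(X_val, y_val), callbacks=[early_stop, reduce_lr])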

3.3 Test set

The test set is used only once, at the very end, to estimate how well the final, chosen model generalizes to truly unseen data. It should not influence model design, tuning, or training decisions.

ℹ️
Rule of thumb: Once you use the test set to make a decision (e.g., choose a model), it is no longer a test set—it has become part of the tuning pipeline.
Data Leakage

4. What is data leakage?

Data leakage occurs when information from outside the training data “leaks” into the training process, giving the model access to information it would not have in a real-world scenario. As a result, performance metrics become overly optimistic and do not reflect true generalization.

4.1 Common examples

  • Using the test set for hyperparameter tuning (directly or indirectly).
  • Applying normalization or SMOTE on the entire dataset before splitting.
  • Including future information in time-series models (e.g., using future frames to predict past labels).
  • Subject-level leakage in medical or gait data (same subject appears in both train and test).
⚠️
If your model sees the test data during training—even indirectly—you cannot trust the reported accuracy, AUC, or F1-score. The model may simply be memorizing patterns specific to those test samples.
Anti-Pattern

5. Example of a wrong split (leakage scenario)

Consider the following incorrect workflow:

# ❌ WRONG (data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train on X_train
model.fit(X_train, y_train, validation_data=(X_test, y_test))

# Later...
model.evaluate(X_test, y_test)  # "Final test performance"

In this setup:

  • The test set (X_test, y_test) is used as a validation set during training.
  • The model and hyperparameters (e.g., number of epochs, early stopping) are chosen based on performance on this “test” data.
  • Then, the same data is used again for final evaluation.
🚫
This is data leakage because the test set is influencing the training process. The model is indirectly tuned to perform well on that specific data, and reported performance will be biased.
Best Practice

6. Correct split strategies (no leakage)

6.1 Simple 3-way split: train / validation / test

A common, leakage-free strategy is to split the data into three parts:

from sklearn.model_selection import train_test_split

# First: train + temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Second: validation + test (split the temp set)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

This results in approximately:

  • 70% training data
  • 15% validation data
  • 15% test data

Then, during training, use the validation set for monitoring and model selection:

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[...]
)

# Final evaluation (only once)
test_loss, test_acc = model.evaluate(X_test, y_test)

The test set (X_test) is completely untouched until the end. It is used exactly once for final evaluation, so your metrics reflect how the model will behave on new, unseen subjects.

6.2 Important: Apply preprocessing only on training data

When using scaling, SMOTE, or other preprocessing steps, always fit them on the training data only:

from sklearn.preprocessing import StandardScaler  # any scaler works the same way here
from imblearn.over_sampling import SMOTE

scaler = StandardScaler()
smote = SMOTE(random_state=42)

# ✅ Correct: fit on train, transform on all
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled   = scaler.transform(X_val)
X_test_scaled  = scaler.transform(X_test)

# ✅ Correct: SMOTE on train only
X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)

# ❌ Wrong: NEVER fit on the full dataset before splitting
# scaler.fit(X); smote.fit_resample(X, y)
Advanced

7. K-fold cross-validation (Version 2)

For small or imbalanced datasets, a single train/validation split may not be reliable enough. K-fold cross-validation improves robustness by training and evaluating the model across multiple different splits.

7.1 Idea

  • Split the dataset into K folds (e.g., 5 or 10).
  • For each fold:
    • Use that fold as the validation (or test) set.
    • Train on the remaining K-1 folds.
  • Average performance metrics across folds.

7.2 Example (conceptual)

from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler  # any scaler works; the key is fitting it per fold
from imblearn.over_sampling import SMOTE

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_metrics = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold + 1}")
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Fit scaler / SMOTE only on this fold's training data (fresh objects per fold)
    # (for 3D sequence data, flatten to 2D for these steps and reshape back; see section 8)
    scaler = StandardScaler()
    smote = SMOTE(random_state=42)
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_val_scaled   = scaler.transform(X_val)

    X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)

    model = build_model()  # fresh CNN–LSTM model for each fold
    history = model.fit(
        X_train_res, y_train_res,
        validation_data=(X_val_scaled, y_val),
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        verbose=0
    )

    val_loss, val_acc = model.evaluate(X_val_scaled, y_val, verbose=0)
    fold_metrics.append(val_acc)

At the end, you can report:

  • Mean validation accuracy across folds.
  • Standard deviation of accuracy.
  • Any additional metrics (F1-score, AUC) averaged across folds.
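
For example, the fold-level accuracies collected in fold_metrics can be summarized as follows (a minimal sketch):

import numpy as np

mean_acc = np.mean(fold_metrics)
std_acc  = np.std(fold_metrics)
print(f"Validation accuracy across folds: {mean_acc:.3f} ± {std_acc:.3f}")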
Application

8. Applying this to gait-based CNN–LSTM models

In gait analysis (e.g., ASD vs. non-ASD classification), each subject is represented by a sequence of gait features, such as:

  • hip and knee flexion/extension,
  • hip and knee abduction/adduction,

with each sequence normalized to a fixed number of frames (e.g., 120).

This produces a 3D tensor:

X.shape = (N_subjects, T_frames, F_features)
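
As a quick illustration, such a tensor can be assembled by stacking one fixed-length sequence per subject. The sizes and random values below are placeholders, not real data:

import numpy as np

# hypothetical example: 30 subjects, 120 frames, 4 joint-angle features per frame
sequences = [np.random.rand(120, 4) for _ in range(30)]  # placeholder arrays, one per subject
X = np.stack(sequences)                                  # X.shape == (30, 120, 4)
y = np.random.randint(0, 2, size=30)                     # placeholder binary labels (ASD vs. non-ASD)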

Key points to avoid leakage in this setting:

  • Split data at the subject level, not at the frame level.
  • Ensure that the same subject never appears in both training and test sets (see the sketch after this list).
  • Apply sequence-level preprocessing (e.g., scaling, SMOTE on flattened sequences) only within each training split or fold.
  • Use validation data for early stopping and model selection, not the test set.
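
One way to enforce a subject-level split is to use scikit-learn's group-aware splitters. The sketch below assumes a hypothetical subject_ids array with one entry per sample, which matters whenever a subject contributes more than one sequence:

from sklearn.model_selection import GroupShuffleSplit

# subject_ids is a hypothetical array, aligned with X and y, naming each sample's subject
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=subject_ids))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# GroupKFold(n_splits=5) applies the same idea to the K-fold setup of section 7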
🧠
When using CNN–LSTM architectures, keep the temporal structure of the data intact during training (3D input). Only flatten sequences temporarily if needed for operations like SMOTE, then reshape back to 3D afterwards.
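
A minimal sketch of that flatten-then-reshape pattern, assuming X_train and y_train come from a subject-level split and SMOTE from imbalanced-learn is used as in the earlier examples:

from imblearn.over_sampling import SMOTE

N, T, F = X_train.shape
X_flat = X_train.reshape(N, T * F)                  # temporarily flatten each sequence to 1D for SMOTE
X_res_flat, y_res = SMOTE(random_state=42).fit_resample(X_flat, y_train)
X_res = X_res_flat.reshape(-1, T, F)                # restore (samples, frames, features) for the CNN–LSTM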
Checklist

9. Practical checklist: “Do I have data leakage?”

Before trusting your results, ask yourself:

  • Did I ever use the test set to:
    • tune hyperparameters,
    • decide when to stop training, or
    • select between different model architectures?
  • Did I fit any preprocessing (scaler, SMOTE, PCA, etc.) on the full dataset instead of only the training part?
  • In subject-based data, could the same subject appear in both train and test sets?
  • Did I reuse the test set multiple times to iteratively improve my model?
If the answer is “yes” to any of the questions above, your evaluation is likely biased due to data leakage. You should redesign the split and repeat the experiments.

A clean experiment pipeline typically looks like:

  1. Define task and labels (e.g., ASD vs. non-ASD).
  2. Split subjects into train / validation / test (or K folds).
  3. Fit preprocessing only on training data.
  4. Train models, tune on validation, use early stopping if needed.
  5. Freeze the final model and evaluate once on the test set.
  6. Report metrics with clear description of the split procedure.