Train / Validation / Test Splits & Data Leakage

Practical guide for deep learning projects (e.g., gait-based ASD classification using CNN–LSTM)

Python · TensorFlow / Keras · LSTM · CNN · K-fold Cross-Validation · No Data Leakage
Overview

1. Why splitting your data correctly matters

In machine learning, it is not enough to build a powerful model; you must also evaluate it in a way that reflects real-world performance. This requires splitting your dataset into distinct subsets and strictly avoiding data leakage—any situation where information from the test set influences the training process.

This document explains:

  • What train, validation, and test sets are and how they differ.
  • Why using the test set for validation leads to data leakage.
  • How to correctly implement K-fold cross-validation.
  • How these ideas apply to sequence models like CNN–LSTM for gait analysis.
Concepts

3. Train / Validation / Test: Roles and definitions

3.1 Train set

The train set is the portion of data the model sees during learning. Gradients are computed on this data, and model weights are updated accordingly. In deep learning, this typically corresponds to X_train and y_train.

3.2 Validation set

The validation set is used during training to tune hyperparameters, select the best model, and monitor overfitting. The model does not update its weights directly on this set, but its performance on the validation data guides:

  • early stopping,
  • learning rate schedules,
  • architecture and hyperparameter choices (see the Keras sketch below).
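
In Keras, these validation-driven decisions are usually expressed as callbacks that monitor validation metrics. A minimal sketch, with illustrative (not prescriptive) callback choices and patience values:

from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

# Stop training when the validation loss stops improving and keep the best weights
early_stop = EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True)

# Reduce the learning rate when the validation loss plateaus
reduce_lr = ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5)

# Later: model.fit(..., validation_data=(X_val, y_val), callbacks=[early_stop, reduce_lr])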

3.3 Test set

The test set is used only once, at the very end, to estimate how well the final, chosen model generalizes to truly unseen data. It should not influence model design, tuning, or training decisions.

ℹ️
Rule of thumb: Once you use the test set to make a decision (e.g., choose a model), it is no longer a test set—it has become part of the tuning pipeline.
Data Leakage

4. What is data leakage?

Data leakage occurs when information from outside the training data “leaks” into the training process, giving the model access to information it would not have in a real-world scenario. As a result, performance metrics become overly optimistic and do not reflect true generalization.

4.1 Common examples

  • Using the test set for hyperparameter tuning (directly or indirectly).
  • Applying normalization or SMOTE on the entire dataset before splitting.
  • Including future information in time-series models (e.g., using future frames to predict past labels).
  • Subject-level leakage in medical or gait data (same subject appears in both train and test).
⚠️
If your model sees the test data during training—even indirectly—you cannot trust the reported accuracy, AUC, or F1-score. The model may simply be memorizing patterns specific to those test samples.
Anti-Pattern

5. Example of a wrong split (leakage scenario)

Consider the following incorrect workflow:

# ❌ WRONG (data leakage)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train on X_train
model.fit(X_train, y_train, validation_data=(X_test, y_test))

# Later...
model.evaluate(X_test, y_test)  # "Final test performance"

In this setup:

  • The test set (X_test, y_test) is used as a validation set during training.
  • The model and hyperparameters (e.g., number of epochs, early stopping) are chosen based on performance on this “test” data.
  • Then, the same data is used again for final evaluation.
🚫
This is data leakage because the test set is influencing the training process. The model is indirectly tuned to perform well on that specific data, and reported performance will be biased.
Best Practice

6. Correct split strategies (no leakage)

6.1 Simple 3-way split: train / validation / test

A common, leakage-free strategy is to split the data into three parts:

from sklearn.model_selection import train_test_split

# First: train + temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Second: validation + test (split the temp set)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42
)

This results in approximately:

  • 70% training data
  • 15% validation data
  • 15% test data

Then, during training, use the validation set for monitoring and model selection:

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    callbacks=[...]
)

# Final evaluation (only once)
test_loss, test_acc = model.evaluate(X_test, y_test)

The test set (X_test) is completely untouched until the end. It is used exactly once for final evaluation, so your metrics reflect how the model will behave on new, unseen subjects.

6.2 Important: Apply preprocessing only on training data

When using scaling, SMOTE, or other preprocessing steps, always fit them on the training data only:

from sklearn.preprocessing import StandardScaler  # any scaler works the same way here
from imblearn.over_sampling import SMOTE

scaler = StandardScaler()
smote = SMOTE(random_state=42)

# ✅ Correct: fit on train, transform on all
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_val_scaled   = scaler.transform(X_val)
X_test_scaled  = scaler.transform(X_test)

# ✅ Correct: SMOTE on train only
X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)

# ❌ Wrong: NEVER fit on the full dataset before splitting
# scaler.fit(X); smote.fit_resample(X, y)
Advanced

7. K-fold cross-validation (Version 2)

For small or imbalanced datasets, a single train/validation split may not be reliable enough. K-fold cross-validation improves robustness by training and evaluating the model across multiple different splits.

7.1 Idea

  • Split the dataset into K folds (e.g., 5 or 10).
  • For each fold:
    • Use that fold as the validation (or test) set.
    • Train on the remaining K-1 folds.
  • Average performance metrics across folds.

7.2 Example (conceptual)

from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler  # any scaler works; the key is fitting it per fold
from imblearn.over_sampling import SMOTE

kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_metrics = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold + 1}")
    X_train, X_val = X[train_idx], X[val_idx]
    y_train, y_val = y[train_idx], y[val_idx]

    # Fit scaler / SMOTE only on this fold's training data (fresh objects per fold)
    # (for 3D sequence data, flatten to 2D for these steps and reshape back; see section 8)
    scaler = StandardScaler()
    smote = SMOTE(random_state=42)
    scaler.fit(X_train)
    X_train_scaled = scaler.transform(X_train)
    X_val_scaled   = scaler.transform(X_val)

    X_train_res, y_train_res = smote.fit_resample(X_train_scaled, y_train)

    model = build_model()  # fresh CNN–LSTM model for each fold
    history = model.fit(
        X_train_res, y_train_res,
        validation_data=(X_val_scaled, y_val),
        epochs=EPOCHS,
        batch_size=BATCH_SIZE,
        verbose=0
    )

    val_loss, val_acc = model.evaluate(X_val_scaled, y_val, verbose=0)
    fold_metrics.append(val_acc)

At the end, you can report:

  • Mean validation accuracy across folds.
  • Standard deviation of accuracy.
  • Any additional metrics (F1-score, AUC) averaged across folds.
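
For example, the fold-level accuracies collected in fold_metrics can be summarized as follows (a minimal sketch):

import numpy as np

mean_acc = np.mean(fold_metrics)
std_acc  = np.std(fold_metrics)
print(f"Validation accuracy across folds: {mean_acc:.3f} ± {std_acc:.3f}")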
Application

8. Applying this to gait-based CNN–LSTM models

In gait analysis (e.g., ASD vs. non-ASD classification), each subject is represented by a sequence of gait features, such as:

  • hip and knee flexion/extension,
  • hip and knee abduction/adduction,

with each sequence normalized to a fixed number of frames (e.g., 120).

This produces a 3D tensor:

X.shape = (N_subjects, T_frames, F_features)
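
As a quick illustration, such a tensor can be assembled by stacking one fixed-length sequence per subject. The sizes and random values below are placeholders, not real data:

import numpy as np

# hypothetical example: 30 subjects, 120 frames, 4 joint-angle features per frame
sequences = [np.random.rand(120, 4) for _ in range(30)]  # placeholder arrays, one per subject
X = np.stack(sequences)                                  # X.shape == (30, 120, 4)
y = np.random.randint(0, 2, size=30)                     # placeholder binary labels (ASD vs. non-ASD)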

Key points to avoid leakage in this setting:

  • Split data at the subject level, not at the frame level.
  • Ensure that the same subject never appears in both training and test sets (see the sketch after this list).
  • Apply sequence-level preprocessing (e.g., scaling, SMOTE on flattened sequences) only within each training split or fold.
  • Use validation data for early stopping and model selection, not the test set.
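
One way to enforce a subject-level split is to use scikit-learn's group-aware splitters. The sketch below assumes a hypothetical subject_ids array with one entry per sample, which matters whenever a subject contributes more than one sequence:

from sklearn.model_selection import GroupShuffleSplit

# subject_ids is a hypothetical array, aligned with X and y, naming each sample's subject
gss = GroupShuffleSplit(n_splits=1, test_size=0.15, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=subject_ids))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# GroupKFold(n_splits=5) applies the same idea to the K-fold setup of section 7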
🧠
When using CNN–LSTM architectures, keep the temporal structure of the data intact during training (3D input). Only flatten sequences temporarily if needed for operations like SMOTE, then reshape back to 3D afterwards.
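
A minimal sketch of that flatten-then-reshape pattern, assuming X_train and y_train come from a subject-level split and SMOTE from imbalanced-learn is used as in the earlier examples:

from imblearn.over_sampling import SMOTE

N, T, F = X_train.shape
X_flat = X_train.reshape(N, T * F)                  # temporarily flatten each sequence to 1D for SMOTE
X_res_flat, y_res = SMOTE(random_state=42).fit_resample(X_flat, y_train)
X_res = X_res_flat.reshape(-1, T, F)                # restore (samples, frames, features) for the CNN–LSTM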
Checklist

9. Practical checklist: “Do I have data leakage?”

Before trusting your results, ask yourself:

  • Did I ever use the test set to:
    • tune hyperparameters,
    • decide when to stop training, or
    • select between different model architectures?
  • Did I fit any preprocessing (scaler, SMOTE, PCA, etc.) on the full dataset instead of only the training part?
  • In subject-based data, could the same subject appear in both train and test sets?
  • Did I reuse the test set multiple times to iteratively improve my model?
If the answer is “yes” to any of the questions above, your evaluation is likely biased due to data leakage. You should redesign the split and repeat the experiments.

A clean experiment pipeline typically looks like:

  1. Define task and labels (e.g., ASD vs. non-ASD).
  2. Split subjects into train / validation / test (or K folds).
  3. Fit preprocessing only on training data.
  4. Train models, tune on validation, use early stopping if needed.
  5. Freeze the final model and evaluate once on the test set.
  6. Report metrics with clear description of the split procedure.