What is Reproducibility?
Reproducibility means that running your code multiple times with the same data, settings, and environment produces exactly the same results every time.
Example:
Run 1: Accuracy = 0.8542, Loss = 0.3241
Run 2: Accuracy = 0.8542, Loss = 0.3241 ✅ Reproducible
Run 3: Accuracy = 0.8542, Loss = 0.3241 ✅ Reproducible
Without Reproducibility:
Run 1: Accuracy = 0.8542, Loss = 0.3241
Run 2: Accuracy = 0.8301, Loss = 0.3689 ❌ Different results
Run 3: Accuracy = 0.8723, Loss = 0.2987 ❌ Different results
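The same behavior shows up at the level of a single random tensor. A minimal sketch (the printed values are only illustrative):
import tensorflow as tf
# Without a fixed seed: a different value on every program run
print(tf.random.normal([1]).numpy())
# With a fixed global seed: the same value on every program run
tf.random.set_seed(42)
print(tf.random.normal([1]).numpy())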
Why Reproducibility Matters
1. Scientific Validity
- Problem: If results change every time, how do you know what's real?
- Impact: Your findings cannot be trusted or verified
- Example: You report 85% accuracy in your paper, but reviewers get 78% when they try it
2. Debugging and Development
- Problem: Can't tell if a change improved your model or if it's just random variation
- Example:
Change 1: Accuracy went from 0.82 → 0.85 (good!)
Change 2: Accuracy went from 0.85 → 0.81 (bad!)
But wait... you rerun Change 1 and get 0.79 now
- Impact: Impossible to know what actually helps
3. Model Comparison
- Problem: Can't fairly compare different models
- Example:
Model A: Run 1 = 0.82, Run 2 = 0.88, Run 3 = 0.79
Model B: Run 1 = 0.84, Run 2 = 0.81, Run 3 = 0.87
Which is better? You can't tell!
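One common remedy, sketched here with a hypothetical train_fn (not from the original), is to run every model with the same fixed set of seeds and compare averages:
import numpy as np

def mean_accuracy(train_fn, seeds=(0, 1, 2)):
    """Average test accuracy over a fixed set of seeds.
    train_fn(seed) is a hypothetical function that seeds everything,
    trains the model, and returns its test accuracy."""
    return float(np.mean([train_fn(seed) for seed in seeds]))

# Fair comparison: identical seeds for both models
# acc_a = mean_accuracy(train_model_a)
# acc_b = mean_accuracy(train_model_b)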
4. Collaboration
- Problem: Team members get different results with same code
- Impact: Wasted time, confusion, inability to share findings
- Example:
- You train a model: 86% accuracy
- Your colleague runs your code: 79% accuracy
- Who's right? Neither can be sure
5. Production Deployment
- Problem: Model performance in production differs from training
- Impact:
- Customer complaints
- Financial losses
- Safety issues (medical/autonomous vehicles)
- Example: Medical diagnosis model shows 90% accuracy in your tests but only 75% in hospital
6. Research Publication
- Problem: Reviewers and other researchers cannot replicate your results
- Impact:
- Paper rejection
- Scientific credibility damaged
- Replication crisis in AI research
- Fact: Many papers are rejected because results can't be reproduced
7. Regulatory Compliance
- Problem: FDA, healthcare regulators require reproducible results
- Impact: Cannot deploy in regulated industries
- Example: AI medical device must produce consistent diagnoses
8. Hyperparameter Tuning
- Problem: Can't optimize hyperparameters if results vary randomly
- Example:
Learning rate 0.001: Acc = 0.85, 0.79, 0.88 (what's the real value?)
Learning rate 0.0001: Acc = 0.83, 0.86, 0.81 (which is better?)
What Happens Without Reproducibility
Real-World Consequences:
1. Wasted Time and Resources
- You spend weeks optimizing a model
- Later realize the improvements were just random variation
- All that work was meaningless
2. False Conclusions
# You think you improved the model
Old version: 82% accuracy
New version: 87% accuracy # Celebrate!
# But when you rerun...
Old version: 84% accuracy
New version: 81% accuracy # Actually worse!
3. Unreliable Production Systems
Training: Model achieves 90% accuracy
Production Day 1: 85% accuracy
Production Day 2: 78% accuracy
Production Day 3: 92% accuracy
Customer trust: 0%
4. Research Impact
- 2019 Study: Only 30% of deep learning papers could be reproduced
- Cost to science: Billions of dollars in wasted effort
- Public trust in AI: Damaged
5. Legal and Ethical Issues
- Medical misdiagnosis due to inconsistent models
- Financial losses in trading algorithms
- Discrimination in hiring AI due to random variation
Sources of Randomness in Deep Learning
Understanding where randomness comes from helps you control it:
1. Weight Initialization
# Neural network weights start with random values
Dense(64) # Initializes 64 neurons with random weights
- Impact: Different starting points → different final models
- Each run: New random weights
2. Data Shuffling
# Training data is shuffled each epoch
model.fit(X_train, y_train, shuffle=True)
- Impact: Different order of training examples
- Effect: Model learns differently
3. Dropout
Dropout(0.3) # Randomly drops 30% of neurons
- Impact: Different neurons dropped in each training batch
- Purpose: Regularization, but introduces randomness
4. Data Augmentation
# Random image transformations
ImageDataGenerator(rotation_range=20, zoom_range=0.2)
- Impact: Each epoch sees slightly different data
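If augmentation must also be repeatable, Keras preprocessing layers accept a seed argument; a minimal sketch (the factors are illustrative choices):
import tensorflow as tf

# Seeded augmentation: the "random" transforms repeat across runs
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(factor=0.06, seed=42),   # about ±20 degrees
    tf.keras.layers.RandomZoom(height_factor=0.2, seed=42),
])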
5. Train/Test Split
train_test_split(X, y, random_state=None) # Random split
- Impact: Different samples end up in the train and test sets on each run
6. Batch Sampling
model.fit(X, y, batch_size=32) # Random batches
- Impact: Different mini-batches each epoch
7. GPU Operations
- Parallel reductions on the GPU can execute in different orders
- Floating-point addition is not associative, so a different summation order gives slightly different values
- Impact: Tiny differences accumulate over training
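You can convince yourself that summation order matters with plain Python floats:
# Floating-point addition is not associative
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                 # 0.6000000000000001
print(a + (b + c))                 # 0.6
print((a + b) + c == a + (b + c))  # False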
8. Python Hash Randomization
# Python randomizes hash seeds for security
hash("example") # Different each program run
- Impact: Affects dictionary ordering, set operations
9. Operating System
- Thread scheduling
- Memory allocation
- Impact: Non-deterministic execution order
10. Multi-threading/Parallelism
# Multiple CPU cores processing data
dataset = dataset.map(func, num_parallel_calls=tf.data.AUTOTUNE)
- Impact: Race conditions, non-deterministic ordering
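To see source 1 in action, compare the initial weights of two freshly built layers; a minimal sketch:
import numpy as np
import tensorflow as tf

def first_weights():
    # Build a small Dense layer and return its initial (random) kernel
    layer = tf.keras.layers.Dense(4)
    layer.build((None, 3))  # triggers weight initialization
    return layer.kernel.numpy()

# Unseeded: two builds start from different weights
print(np.allclose(first_weights(), first_weights()))  # usually False

# Reseeding identically before each build: identical starting weights
tf.keras.utils.set_random_seed(42)
w1 = first_weights()
tf.keras.utils.set_random_seed(42)
w2 = first_weights()
print(np.allclose(w1, w2))  # True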
How to Make TensorFlow Reproducible
Complete Checklist
1. Set Python Hash Seed (Before anything else)
import os
os.environ['PYTHONHASHSEED'] = str(42)
Why: Makes hash-dependent ordering (e.g., set iteration order) consistent across runs
When: Python reads this variable at interpreter startup, so setting it inside a running script does not change the current process's string hashing; for full effect set it in the shell before launching Python, and set it at the top of the script so subprocesses inherit it
2. Set NumPy Random Seed
import numpy as np
np.random.seed(42)
Why: Controls NumPy operations (data shuffling, sampling)
3. Set TensorFlow Environment Variables (Before TF import)
os.environ['TF_DETERMINISTIC_OPS'] = '1'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
Why: Forces TensorFlow to use deterministic GPU kernels (these variables predate enable_op_determinism in step 5; setting both is harmless)
4. Import TensorFlow
import tensorflow as tf
5. Enable TensorFlow Determinism
tf.config.experimental.enable_op_determinism()
Why: Enforces deterministic behavior across all TF operations
6. Set TensorFlow Random Seeds
tf.random.set_seed(42)
tf.keras.utils.set_random_seed(42)
Why: Controls weight initialization, dropout masks, etc. Note that tf.keras.utils.set_random_seed also seeds Python's random module and NumPy in one call
7. Clear Keras Backend (Before each model)
tf.keras.backend.clear_session()
Why: Removes residual state from previous models
8. Set Random State in Sklearn
train_test_split(X, y, random_state=42)
StandardScaler() # No randomness, but keep pipeline consistent
Why: Ensures consistent data splits
9. Disable Shuffling or Set Seed
model.fit(X, y, shuffle=False) # Or ensure seed is set
10. Use Single Thread (Optional, for maximum reproducibility)
tf.config.threading.set_inter_op_parallelism_threads(1)
tf.config.threading.set_intra_op_parallelism_threads(1)
Warning: Significantly slower!
Complete Implementation Guide
Template for Reproducible TensorFlow Code
"""
Reproducible Deep Learning Template
"""
import os
# ============================================
# STEP 1: Set seeds BEFORE any imports
# ============================================
RANDOM_SEED = 42
# Python hash seed (must be first!)
os.environ['PYTHONHASHSEED'] = str(RANDOM_SEED)
# TensorFlow determinism (before TF import)
os.environ['TF_DETERMINISTIC_OPS'] = '1'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
# ============================================
# STEP 2: Import libraries
# ============================================
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
# ============================================
# STEP 3: Set all random seeds
# ============================================
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
tf.keras.utils.set_random_seed(RANDOM_SEED)
# Enable TensorFlow determinism
tf.config.experimental.enable_op_determinism()
# Optional: Limit threads for maximum reproducibility
# tf.config.threading.set_inter_op_parallelism_threads(1)
# tf.config.threading.set_intra_op_parallelism_threads(1)
print(f"Random seed set to: {RANDOM_SEED}")
print(f"TensorFlow version: {tf.__version__}")
print("Deterministic ops enabled: True")
# ============================================
# STEP 4: Function to reset seeds (for multiple runs)
# ============================================
def reset_seeds(seed=RANDOM_SEED):
"""Reset all random seeds - call before each training"""
np.random.seed(seed)
tf.random.set_seed(seed)
tf.keras.utils.set_random_seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)  # note: only affects subprocesses, not this process
# ============================================
# STEP 5: Load and prepare data with fixed seed
# ============================================
def prepare_data():
# Your data loading code here
X, y = load_data() # Your function
# Use random_state for consistent splits
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=RANDOM_SEED, # Important!
stratify=y
)
# Normalization (deterministic)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
return X_train, X_test, y_train, y_test
# ============================================
# STEP 6: Build model with session clearing
# ============================================
def build_model(input_shape):
"""Build model with clean session"""
# Clear any previous models
tf.keras.backend.clear_session()
model = models.Sequential([
layers.Dense(64, activation='relu', input_shape=input_shape),
layers.Dropout(0.3),
layers.Dense(32, activation='relu'),
layers.Dropout(0.3),
layers.Dense(1, activation='sigmoid')
])
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='binary_crossentropy',
metrics=['accuracy']
)
return model
# ============================================
# STEP 7: Training function
# ============================================
def train_model(X_train, y_train, X_test, y_test):
"""Train model with reproducible settings"""
# Reset seeds before training
reset_seeds(RANDOM_SEED)
# Build model
model = build_model((X_train.shape[1],))
# Train with fixed batch size and epochs
history = model.fit(
X_train, y_train,
validation_data=(X_test, y_test),
epochs=50,
batch_size=32,
shuffle=True, # OK because seed is set
verbose=1
)
return model, history
# ============================================
# STEP 8: Main execution
# ============================================
def main():
print("="*80)
print("REPRODUCIBLE DEEP LEARNING TRAINING")
print("="*80)
# Prepare data
X_train, X_test, y_train, y_test = prepare_data()
# Train model
model, history = train_model(X_train, y_train, X_test, y_test)
# Evaluate
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest Accuracy: {test_acc:.4f}")
print(f"Test Loss: {test_loss:.4f}")
# Save model
model.save('reproducible_model.h5')
print("\nModel saved. Running again will produce identical results!")
return model
if __name__ == "__main__":
model = main()
Verification Script
"""
Verify Reproducibility - Run this to test
(assumes main(), prepare_data(), and pandas as pd from the template above)
"""
def verify_reproducibility(n_runs=3):
"""Run training multiple times and verify identical results"""
results = []
for run in range(1, n_runs + 1):
print(f"\n{'='*60}")
print(f"RUN {run}/{n_runs}")
print(f"{'='*60}")
# Train model
model = main()
# Get final metrics
X_train, X_test, y_train, y_test = prepare_data()
loss, acc = model.evaluate(X_test, y_test, verbose=0)
results.append({
'run': run,
'accuracy': acc,
'loss': loss
})
print(f"Accuracy: {acc:.10f}")
print(f"Loss: {loss:.10f}")
# Check if all results are identical
df = pd.DataFrame(results)
print(f"\n{'='*60}")
print("REPRODUCIBILITY CHECK")
print(f"{'='*60}")
print(df)
if df['accuracy'].nunique() == 1 and df['loss'].nunique() == 1:
print("\n✅ SUCCESS! All runs produced identical results!")
print("Your code is fully reproducible!")
else:
print("\n❌ FAILURE! Results varied across runs!")
print("Check your seed settings!")
print(f"Accuracy std dev: {df['accuracy'].std()}")
print(f"Loss std dev: {df['loss'].std()}")
return df
# Run verification
verify_reproducibility(n_runs=3)
Trade-offs and Considerations
Advantages of Reproducibility
- Scientific Validity: Results can be trusted
- Debugging: Easy to identify what helps
- Collaboration: Team gets same results
- Publication: Reviewers can verify your results
- Deployment: Predictable production performance
Disadvantages/Trade-offs
1. Performance Impact
- Deterministic operations are 5-30% slower
- GPU parallelism limited
- Solution: Use only during development/testing
2. Single-threaded Operations
- Maximum reproducibility requires single threading
- Can be 2-10x slower
- Solution: Only enable when absolutely necessary
3. Hardware Dependence
- Different GPUs may still give slightly different results
- CPU vs GPU results may differ
- Solution: Specify hardware in documentation
4. Not Always Possible
- Some operations fundamentally non-deterministic
- Distributed training more challenging
- Solution: Document limitations
When is Reproducibility Critical?
- Always critical:
- Medical applications
- Financial models
- Safety-critical systems
- Research papers
- Regulatory submissions
- Important but flexible:
- Model development
- Hyperparameter tuning
- Team collaboration
- Less critical:
- Initial exploration
- Proof-of-concept
- When using ensemble methods
- When averaging across many runs
Best Practices
Do's
- Set Seeds Early: Before any imports
- Document Seeds: Write down all seed values used
- Version Control: Track TensorFlow/library versions
- Save Everything: Save data splits, preprocessors, and models (see the sketch after this list)
- Test Reproducibility: Run multiple times to verify
- Document Hardware: Note GPU/CPU used
- Use requirements.txt: Pin library versions
- Separate Exploration from Production: Different reproducibility needs
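A minimal sketch of the "Save Everything" practice, using illustrative stand-in data and file names (joblib is one reasonable choice for persisting sklearn objects; none of this is from the original):
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative stand-ins for your real split and fitted preprocessor
X_train, X_test = np.random.rand(80, 5), np.random.rand(20, 5)
scaler = StandardScaler().fit(X_train)

# Persist the exact split and the fitted preprocessor next to the model
np.savez('splits_seed42.npz', X_train=X_train, X_test=X_test)
joblib.dump(scaler, 'scaler_seed42.joblib')

# Later: reload the identical artifacts instead of re-splitting and refitting
data = np.load('splits_seed42.npz')
scaler = joblib.load('scaler_seed42.joblib')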
Don'ts
- Don't set seeds in random places: Do it once, at the start
- Don't ignore warnings: They often indicate non-determinism
- Don't assume it works: Always verify
- Don't use time-based seeds: `seed=int(time.time())` is wrong
- Don't mix random operations without seeds
- Don't forget to document: Future you will thank you
Example requirements.txt
tensorflow==2.15.0
numpy==1.24.3
pandas==2.0.3
scikit-learn==1.3.0
matplotlib==3.7.2
Documentation Template
## Reproducibility Information
- **Random Seed**: 42
- **TensorFlow Version**: 2.15.0
- **Python Version**: 3.10.12
- **Hardware**: NVIDIA RTX 3090 (24GB)
- **CUDA Version**: 12.2
- **cuDNN Version**: 8.9
## To Reproduce Results:
1. Install exact dependencies: `pip install -r requirements.txt`
2. Run with seed 42: `python train.py --seed 42`
3. Expected accuracy: 0.8542 ± 0.0001
## Known Limitations:
- Results may vary slightly on different GPU models
- CPU results may differ from GPU by ~0.1%
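The `python train.py --seed 42` line above assumes the script exposes its seed on the command line; one hypothetical way to wire that up:
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=42,
                    help='random seed for a reproducible run')
args = parser.parse_args()

# Use the documented seed everywhere a seed is required
os.environ['PYTHONHASHSEED'] = str(args.seed)  # full effect requires setting before launch
# ...then seed NumPy/TensorFlow and pass args.seed to train_test_split, etc.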
Summary: Quick Reference Card
Essential Steps (Minimum Required)
# 1. Before anything
import os
os.environ['PYTHONHASHSEED'] = '42'
os.environ['TF_DETERMINISTIC_OPS'] = '1'
os.environ['TF_CUDNN_DETERMINISTIC'] = '1'
# 2. Import and set seeds
import numpy as np
import tensorflow as tf
SEED = 42
np.random.seed(SEED)
tf.random.set_seed(SEED)
tf.keras.utils.set_random_seed(SEED)
tf.config.experimental.enable_op_determinism()
# 3. Before each model
tf.keras.backend.clear_session()
# 4. Use random_state everywhere
train_test_split(..., random_state=SEED)
Is Reproducibility Necessary?
| Scenario | Necessary? | Why |
|---|---|---|
| Research paper | ✅ Yes | Must be verifiable |
| Medical AI | ✅ Yes | Lives depend on it |
| Production model | ✅ Yes | Consistent performance |
| Debugging | ✅ Yes | Find what works |
| Quick experiment | ⚠️ Optional | Speed over precision |
| Model ensembles | ⚠️ Optional | Averaging reduces variance |
Key Takeaway
Reproducibility is not optional for serious work. It's the foundation of:
- Scientific integrity
- Reliable models
- Productive development
- Trustworthy AI systems
Without it, you're essentially doing random guessing with extra steps.
Further Reading
- TensorFlow Official Guide: Determinism in TensorFlow
- Papers With Code: Reproducibility Checklist
- Nature Paper: "Reproducibility crisis in AI research" (2019)
- NVIDIA Documentation: Determinism in Deep Learning