Bias–Variance Tradeoff
Why simple models underfit, complex models overfit, and how to find the “sweet spot” for generalization.
What is the Bias–Variance Tradeoff?
The bias–variance tradeoff explains why a model’s test performance worsens when it is too simple (high bias) or too complex (high variance). Our goal is a model that balances both, minimizing total generalization error.
Intuitive View
High Bias → Underfitting
- Overly simplistic assumptions (e.g., a straight line for a curved trend).
- Poor training and test performance.
High Variance → Overfitting
- Overly sensitive to data noise (fits fluctuations rather than the pattern).
- Great training performance, poor test performance.
Bias vs “Bias in Data” (Why the term is confusing)
The word bias is overloaded. In fairness and data-ethics discussions, it refers to unfairness or representation issues in the dataset (e.g., imbalanced data). In the tradeoff, bias means the model’s systematic error: how far the average prediction is from the true function.
Bias in Data (Fairness)
Imbalances or prejudices in the dataset that can produce unfair outcomes. Addressed with data collection, rebalancing, fairness constraints, etc.
Bias in Model (Tradeoff)
Systematic error from simplifying assumptions: the model cannot express the true relationship well. Addressed by adding capacity/features, reducing regularization, etc.
What “Variance” Means in the Tradeoff
Variance here measures how much a model’s predictions change when trained on different samples from the same population. High-variance models are unstable: small data changes ⇒ very different predictions.
- Decision trees/deep nets on small data → often high variance.
- Linear models/ridge regression → typically lower variance.
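To make this concrete, here is a minimal sketch (the sine-plus-noise data, noise level, and degrees are our own choices, mirroring the mini example at the end of this article): fit the same model class to two independent samples from the same population and measure how much the two fits disagree.

# Python / scikit-learn sketch (illustrative)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def sample_dataset(n=30):
    # Draw a fresh training set from the same "population": y = sin(x) + noise
    X = rng.uniform(0, 6, n)[:, None]
    return X, np.sin(X).ravel() + rng.normal(0, 0.3, n)

x_grid = np.linspace(0, 6, 50)[:, None]
for d in (1, 9):
    fits = []
    for _ in range(2):  # two independent training samples
        X, y = sample_dataset()
        m = make_pipeline(PolynomialFeatures(degree=d), LinearRegression()).fit(X, y)
        fits.append(m.predict(x_grid))
    gap = np.abs(fits[0] - fits[1]).mean()  # average disagreement between the two fits
    print(f"degree={d} mean |fit1 - fit2| = {gap:.3f}")
# The degree-9 fits typically disagree far more; that instability is variance.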
Dartboard Analogy
Low Bias, Low Variance: tight cluster at the bullseye.
High Bias, Low Variance: tight cluster far from center (consistently wrong).
Low Bias, High Variance: scattered around the center (inconsistent).
High Bias, High Variance: scattered and far (worst of both).
Mathematical Decomposition
The expected squared error at an input x, taken over random training sets and noise, decomposes as:
E[(Y - ŷ(x))²] = ( E[ŷ(x)] - f(x) )² + E[( ŷ(x) - E[ŷ(x)] )²] + σ²
i.e., total error = Bias² (systematic error) + Variance (instability) + σ² (irreducible noise),
where f is the true function, ŷ(x) is the prediction of a model trained on a random dataset, and Y = f(x) + noise with noise variance σ².
- Bias²: how far the average model is from truth.
- Variance: how much the model wiggles around its own average across different samples.
- Irreducible noise (σ²): randomness in data we can’t remove.
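As a rough numerical check of the decomposition, here is a sketch that assumes a known ground truth f(x) = sin(x) with noise σ = 0.3 (our simulated setup; the probe point x0 and degrees are arbitrary): retrain the same model on many fresh datasets and estimate each term at one input.

# Python / scikit-learn sketch (illustrative)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
sigma = 0.3                 # noise std (known here only because we simulate the data)
x0 = np.array([[2.5]])      # input at which we estimate Bias^2 and Variance

def estimate_terms(degree, n_datasets=300, n_points=30):
    preds = []
    for _ in range(n_datasets):
        X = rng.uniform(0, 6, n_points)[:, None]
        y = np.sin(X).ravel() + rng.normal(0, sigma, n_points)
        m = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression()).fit(X, y)
        preds.append(m.predict(x0)[0])
    preds = np.array(preds)
    bias_sq = (preds.mean() - np.sin(x0).item()) ** 2   # (E[ŷ] - f)²
    variance = preds.var()                               # E[(ŷ - E[ŷ])²]
    return bias_sq, variance

for d in (1, 9):
    b2, v = estimate_terms(d)
    print(f"degree={d} Bias^2≈{b2:.3f} Variance≈{v:.3f} Noise σ²={sigma**2:.3f}")
# Typical outcome: degree 1 → larger Bias², degree 9 → larger Variance; σ² never shrinks.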
Error vs. Model Complexity
As model complexity grows, training error keeps falling, while test error traces a U-shape: it first drops as bias shrinks, then rises again as variance takes over. The sweet spot is the bottom of that U.
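A minimal way to see that U-shape numerically (a sketch; the data, degree range, and 5-fold cross-validation are our own choices) is to sweep the polynomial degree and compare training error with cross-validated error.

# Python / scikit-learn sketch (illustrative)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(0, 6, 60)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, len(X))

degrees = list(range(1, 11))
train_scores, val_scores = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()), X, y,
    param_name="polynomialfeatures__degree", param_range=degrees,
    scoring="neg_mean_squared_error", cv=KFold(5, shuffle=True, random_state=0),
)
for d, tr, va in zip(degrees, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"degree={d:>2} train_MSE={tr:.3f} cv_MSE={va:.3f}")
# Training MSE keeps falling with degree; CV MSE usually falls, then climbs back up.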
How to Control the Tradeoff (Practical Tips)
Reduce Variance (tame overfitting)
- Regularization (L2/Ridge, L1/Lasso, weight decay, dropout).
- More training data / data augmentation.
- Ensembles: bagging, random forests, snapshot ensembles.
- Early stopping; simpler architectures.
Reduce Bias (fix underfitting)
- Add features, increase model capacity (deeper/wider nets, higher-degree polynomials).
- Reduce regularization strength (the sketch after this list sweeps this knob in both directions).
- Use more expressive models (kernels, residual connections).
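To see these knobs in action, here is a minimal sketch (our own setup: degree-9 polynomial features with Ridge; the alpha values are arbitrary) that sweeps regularization strength from heavy to almost none.

# Python / scikit-learn sketch (illustrative)
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
X = np.linspace(0, 6, 60)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, len(X))
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=42)

for alpha in (1e3, 1.0, 1e-8):  # heavy, moderate, almost no regularization
    model = make_pipeline(
        PolynomialFeatures(degree=9), StandardScaler(), Ridge(alpha=alpha)
    ).fit(Xtr, ytr)
    tr = mean_squared_error(ytr, model.predict(Xtr))
    te = mean_squared_error(yte, model.predict(Xte))
    print(f"alpha={alpha:g} train_MSE={tr:.3f} test_MSE={te:.3f}")
# Heavy regularization pushes toward high bias (both errors high); nearly none
# leaves the flexible model free to chase noise (low train error, higher test error).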
Common diagnostics
- High bias: train error high, val/test error also high → model too simple.
- High variance: train error low, val/test error much higher → model overfits.
- Use cross-validation curves and learning curves to locate the sweet spot.
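A minimal learning-curve sketch (the toy data, train sizes, and 5-fold CV are our own choices): grow the training set and watch the gap between training and validation error.

# Python / scikit-learn sketch (illustrative)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.linspace(0, 6, 150)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, len(X))

model = make_pipeline(PolynomialFeatures(degree=9), LinearRegression())
sizes, train_scores, val_scores = learning_curve(
    model, X, y, train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="neg_mean_squared_error", cv=5, shuffle=True, random_state=0,
)
for n, tr, va in zip(sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"n_train={n:>3} train_MSE={tr:.3f} cv_MSE={va:.3f}")
# A large, persistent train/CV gap points to high variance; two similarly high,
# flat curves point to high bias. More data usually narrows the variance gap.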
Quick Mental Trick
Bias = model’s simplicity → ignores patterns.
Variance = model’s sensitivity → chases noise.
Or even shorter: “Bias ignores. Variance chases.”
Mini Code Example (Polynomial fit intuition)
# Python / scikit-learn sketch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
rng = np.random.default_rng(0)
X = np.linspace(0, 6, 30)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, len(X))
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=42)
for d in (1, 3, 9):
    poly = PolynomialFeatures(degree=d, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(Xtr), ytr)
    mse_train = mean_squared_error(ytr, model.predict(poly.transform(Xtr)))
    mse_test = mean_squared_error(yte, model.predict(poly.transform(Xte)))
    print(f"degree={d:>2} train_MSE={mse_train:.3f} test_MSE={mse_test:.3f}")
# Expect: d=1 underfits (high bias); d=9 overfits (high variance).
Summary
- Bias (in tradeoff) ≠ fairness bias. It’s systematic error from simplicity.
- Variance is prediction instability across different samples.
- As model complexity increases: bias ↓, variance ↑. Seek the sweet spot.
- Use regularization/ensembles/more data to reduce variance; increase capacity to reduce bias.
- Quick recall: “Bias ignores. Variance chases.”