Machine Learning Notes

Bias–Variance Tradeoff

Why simple models underfit, complex models overfit, and how to find the “sweet spot” for generalization.

What is the Bias–Variance Tradeoff?

The bias–variance tradeoff explains why a model’s test performance worsens when it is too simple (high bias) or too complex (high variance). Our goal is a model that balances both, minimizing total generalization error.

Bias (Model)
Systematic error from simplifying assumptions; the model ignores real patterns. Leads to underfitting.
Variance (Model)
Instability: predictions change significantly across training samples; the model chases noise. Leads to overfitting.

Intuitive View

High Bias → Underfitting

  • Overly simplistic assumptions (e.g., a straight line for a curved trend).
  • Poor training and test performance.

High Variance → Overfitting

  • Overly sensitive to data noise (fits fluctuations rather than the pattern).
  • Great training performance, poor test performance.

Bias vs “Bias in Data” (Why the term is confusing)

The word bias is overloaded. In fairness and data-collection discussions, it refers to unfairness or representation issues in the dataset (e.g., imbalanced or unrepresentative samples). In the tradeoff, bias means the model’s systematic error: how far the average prediction is from the true function.

Bias in Data (Fairness)

Imbalances or prejudices in the dataset that can produce unfair outcomes. Addressed with data collection, rebalancing, fairness constraints, etc.

Bias in Model (Tradeoff)

Due to simplifying assumptions; the model can’t express the true relationship well. Addressed by adding capacity/features, reducing regularization, etc.

Key distinction: “Data bias” is about fairness. “Model bias” is about systematic error due to simplicity.

What “Variance” Means in the Tradeoff

Variance here measures how much a model’s predictions change when trained on different samples from the same population. High-variance models are unstable: small data changes ⇒ very different predictions.

  • Decision trees/deep nets on small data → often high variance.
  • Linear models/ridge regression → typically lower variance.
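
A quick way to see this empirically is to refit a model on many bootstrap resamples of the same data and watch how much its prediction at one fixed input spreads. This is a minimal sketch; the sine data, the query point, and the two model choices are illustrative assumptions, not part of the definition.

# Python / scikit-learn sketch: prediction spread across bootstrap resamples
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 6, 40)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, len(X))
x_query = np.array([[3.0]])                      # fixed input to watch

def prediction_spread(make_model, n_boot=200):
    # std. dev. of the prediction at x_query across bootstrap refits
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), len(X))    # resample with replacement
        preds.append(make_model().fit(X[idx], y[idx]).predict(x_query)[0])
    return np.std(preds)

print("linear model spread:", round(prediction_spread(LinearRegression), 3))
print("deep tree spread:   ", round(prediction_spread(DecisionTreeRegressor), 3))
# The unpruned tree's prediction typically varies far more across resamples.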

Dartboard Analogy

Bias = distance from center (systematic error). Variance = spread of the darts (instability).

Low Bias, Low Variance: tight cluster at the bullseye.

High Bias, Low Variance: tight cluster far from center (consistently wrong).

Low Bias, High Variance: scattered around the center (inconsistent).

High Bias, High Variance: scattered and far (worst of both).

Mathematical Decomposition

The expected squared error at input x, taking the expectation over both the random training set and the noise in Y, decomposes as:

E[(Y - ŷ(x))²] = (E[ŷ(x)] - f(x))²  +  E[(ŷ(x) - E[ŷ(x)])²]  +  σ²
               =       Bias²        +        Variance         +  Irreducible noise
  • Bias²: how far the average model is from truth.
  • Variance: how much the model wiggles around its own average across different samples.
  • Irreducible noise (σ²): randomness in data we can’t remove.
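
The identity can be checked numerically: simulate many training sets from the same data-generating process, refit the same model each time, and average. The sine ground truth, noise level, test point, and degree-3 polynomial below are arbitrary illustrative assumptions.

# Python / scikit-learn sketch: estimating Bias², Variance, and noise at one point
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
f = np.sin                       # "true" function (illustrative choice)
sigma = 0.3                      # noise std dev, so irreducible error = sigma²
x0 = np.array([[2.0]])           # fixed input where the error is decomposed

preds = []
for _ in range(2000):            # many independent training sets
    X = rng.uniform(0, 6, 30)[:, None]
    y = f(X).ravel() + rng.normal(0, sigma, len(X))
    model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(X, y)
    preds.append(model.predict(x0)[0])

preds = np.array(preds)
bias2 = (preds.mean() - f(x0).item()) ** 2   # (average prediction - truth)²
var = preds.var()                            # spread around the average prediction
print(f"bias²={bias2:.4f}  variance={var:.4f}  noise={sigma**2:.4f}")
print(f"sum  ={bias2 + var + sigma**2:.4f}   # ≈ expected squared error at x0")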

Error vs. Model Complexity

[Chart: Bias², Variance, and Total error vs. model complexity, with the sweet spot marked at the minimum of the Total curve.]
As complexity increases, bias decreases and variance increases; total error is U-shaped.

How to Control the Tradeoff (Practical Tips)

Reduce Variance (tame overfitting)

  • Regularization (L2/Ridge, L1/Lasso, weight decay, dropout).
  • More training data / data augmentation.
  • Ensembles: bagging, random forests, snapshot ensembles.
  • Early stopping; simpler architectures.
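
As a small illustration of the ensemble idea above (a hedged sketch on an arbitrary synthetic dataset, not a benchmark): a single deep tree tracks the noise, while averaging many trees fitted on bootstrap resamples (bagging) smooths that instability out.

# Python / scikit-learn sketch: bagging as a variance-reduction device
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 6, 120)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, len(X))
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

single = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)
bagged = BaggingRegressor(DecisionTreeRegressor(), n_estimators=100,
                          random_state=0).fit(Xtr, ytr)

for name, m in (("single tree", single), ("bagged trees", bagged)):
    print(f"{name:>12}  test_MSE={mean_squared_error(yte, m.predict(Xte)):.3f}")
# The bagged ensemble usually reaches a clearly lower test MSE here: each tree is
# still low-bias, but averaging cancels much of the tree-to-tree variance.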

Reduce Bias (fix underfitting)

  • Add features, increase model capacity (deeper/wider nets, higher-degree polynomials).
  • Reduce regularization strength.
  • Use more expressive models (kernels, residual connections).

Common diagnostics

  • High bias: train error high, val/test error also high → model too simple.
  • High variance: train error low, val/test error much higher → model overfits.
  • Use cross-validation curves and learning curves to locate the sweet spot.
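
Putting the diagnostics to work (a minimal sketch; the degree-12 features, the alpha grid, and the synthetic data are illustrative assumptions): sweep the ridge penalty and compare train vs. validation error. A large gap flags high variance; two similarly high errors flag high bias.

# Python / scikit-learn sketch: using train/validation error to find the sweet spot
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
X = np.linspace(0, 6, 60)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, len(X))
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=0)

for alpha in (1e-6, 1e-2, 1.0, 100.0):            # weak -> strong regularization
    model = make_pipeline(PolynomialFeatures(degree=12), StandardScaler(),
                          Ridge(alpha=alpha)).fit(Xtr, ytr)
    tr = mean_squared_error(ytr, model.predict(Xtr))
    va = mean_squared_error(yva, model.predict(Xva))
    print(f"alpha={alpha:>7}  train_MSE={tr:.3f}  val_MSE={va:.3f}")
# Typical pattern: tiny alpha -> low train error, larger val error (variance);
# huge alpha -> both errors high (bias). Pick alpha near the validation minimum.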

Quick Mental Trick

Bias = model’s simplicity → ignores patterns.

Variance = model’s sensitivity → chases noise.

Or even shorter: “Bias ignores. Variance chases.”

Mini Code Example (Polynomial fit intuition)

# Python / scikit-learn sketch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(0, 6, 30)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, len(X))

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=42)

for d in (1, 3, 9):
    poly = PolynomialFeatures(degree=d, include_bias=False)
    model = LinearRegression().fit(poly.fit_transform(Xtr), ytr)
    mse_train = mean_squared_error(ytr, model.predict(poly.transform(Xtr)))
    mse_test  = mean_squared_error(yte, model.predict(poly.transform(Xte)))
    print(f"degree={d:>2}  train_MSE={mse_train:.3f}  test_MSE={mse_test:.3f}")
# Expect: d=1 underfits (high bias); d=9 overfits (high variance).

Summary

  • Bias (in tradeoff) ≠ fairness bias. It’s systematic error from simplicity.
  • Variance is prediction instability across different samples.
  • As model complexity increases: bias ↓, variance ↑. Seek the sweet spot.
  • Use regularization/ensembles/more data to reduce variance; increase capacity to reduce bias.
  • Quick recall: “Bias ignores. Variance chases.”