Linear Regression

The foundation of supervised learning: from the normal equation to gradient descent, bias-variance tradeoff, and regularization.

Supervised Learning February 10, 2026 7 min read

The Simplest Useful Model

Linear regression is the workhorse of predictive modeling. Despite its simplicity, it’s the foundation upon which much of machine learning is built. Understanding it deeply means understanding optimization, overfitting, regularization, and the bias-variance tradeoff.

The model assumes a linear relationship between the $p$ input features and the output:

$$y = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b$$

Or in matrix form, with the bias $b$ absorbed into $\mathbf{w}$ by appending a column of ones to $\mathbf{X}$:

$$\mathbf{y} = \mathbf{X} \mathbf{w} + \boldsymbol{\epsilon}$$

where $\boldsymbol{\epsilon}$ represents noise (typically assumed Gaussian).

Why Linear Models Matter

Even in an era of deep learning, linear regression remains essential:

  • Interpretable: Each weight is the expected change in the output per unit change in its feature, holding the other features fixed
  • Fast: Training is near-instant, even on millions of data points
  • Baseline: Every ML project should start with a linear model
  • Foundation: Logistic regression, SVMs, and neural networks all generalize linear models

The Normal Equation

For linear regression, there’s a closed-form solution — no iterative optimization needed:

$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

This minimizes the sum of squared errors and, whenever $\mathbf{X}^T \mathbf{X}$ is invertible, gives the exact optimal weights in one step.

Derivation

We want to minimize the residual sum of squares:

$$\text{RSS}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w})$$

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

$$\frac{\partial \, \text{RSS}}{\partial \mathbf{w}} = -2 \mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w}) = 0$$

$$\mathbf{X}^T \mathbf{X} \, \mathbf{w} = \mathbf{X}^T \mathbf{y}$$

$$\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

Geometric interpretation: The prediction $\mathbf{X}\mathbf{w}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residuals are perpendicular to every feature vector.
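
As a rough illustration of the normal equation and its geometric reading, here is a minimal NumPy sketch. The synthetic data, the seed, and the names `true_w`, `w_hat`, and `residuals` are assumptions made for the example, and the model has no intercept column to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 3 features, known weights plus Gaussian noise.
n, p = 200, 3
X = rng.normal(size=(n, p))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Normal equation: solve (X^T X) w = X^T y without forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residuals should be orthogonal to every column of X.
residuals = y - X @ w_hat
print("estimated weights:", w_hat)
print("X^T r:", X.T @ residuals)
```

Solving the linear system with `np.linalg.solve` rather than calling an explicit inverse is a common choice because it is cheaper and numerically better behaved.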

When the Normal Equation Fails

The normal equation requires inverting $\mathbf{X}^T \mathbf{X}$, which can fail or become impractical when:

  • Multicollinearity: Features are highly correlated, so $\mathbf{X}^T \mathbf{X}$ is nearly singular and the inverse is numerically unstable
  • More features than samples: $n < p$ makes $\mathbf{X}^T \mathbf{X}$ rank-deficient, so no unique solution exists
  • Many features: Forming $\mathbf{X}^T \mathbf{X}$ costs $O(np^2)$ and inverting it costs $O(p^3)$, which is expensive when $p$ is large

In these cases, we turn to gradient descent or regularization.
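
The multicollinearity case can be made concrete with a small sketch. The data below is invented for illustration: the second feature is a near copy of the first, which pushes the condition number of $\mathbf{X}^T \mathbf{X}$ very high. `np.linalg.lstsq` is shown as one SVD-based alternative to inverting $\mathbf{X}^T \mathbf{X}$ directly.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 100
x1 = rng.normal(size=n)
# x2 is almost an exact copy of x1, so the two columns are nearly collinear.
x2 = x1 + 1e-6 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.normal(size=n)

# The condition number of X^T X shows how close it is to singular.
cond = np.linalg.cond(X.T @ X)
print(f"condition number of X^T X: {cond:.2e}")

# lstsq works from the SVD of X and never forms X^T X explicitly.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print("lstsq weights:", w_lstsq)
print("sum of weights:", w_lstsq.sum())
```

In this setup the individual weights are poorly determined, while their sum (the combined effect of the two near-identical features) remains stable.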

Gradient Descent

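For large-scale problems, we optimize iteratively. Taking the loss to be the mean squared error, $\mathcal{L}(\mathbf{w}) = \frac{1}{n} \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$, its gradient with respect to the weights is:

$$\nabla_{\mathbf{w}} \mathcal{L} = -\frac{2}{n} \mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w})$$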

Update rule:

$$\mathbf{w} := \mathbf{w} - \eta \, \nabla_{\mathbf{w}} \mathcal{L}$$
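
The gradient and update rule above can be sketched as a short batch gradient descent loop. The function name `gradient_descent`, the default learning rate and step count, and the synthetic data are illustrative assumptions; the loss is taken to be the MSE.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_steps=500):
    """Batch gradient descent on the MSE loss (1/n) * ||y - Xw||^2."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_steps):
        grad = -(2.0 / n) * X.T @ (y - X @ w)  # gradient of the MSE
        w = w - lr * grad                      # update rule
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + 0.1 * rng.normal(size=500)

w_gd = gradient_descent(X, y)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)
print("gradient descent:", w_gd)
print("normal equation: ", w_exact)
```

On a well-conditioned problem like this one, the iterative solution should agree with the closed-form solution to many decimal places.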

Variants

| Method | Batch Size | Pros | Cons |
| --- | --- | --- | --- |
| Batch GD | All data | Stable, exact gradient | Slow for large datasets |
| Stochastic GD | 1 sample | Fast per step | Noisy, oscillates |
| Mini-batch GD | 32-256 samples | Best of both worlds | Need to tune batch size |
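
A single loop can cover all three variants by changing the batch size. The sketch below is one possible arrangement, not a reference implementation: `minibatch_gd` and its defaults are hypothetical, and the data is synthetic. Setting `batch_size` to 1 gives stochastic GD (which may need a smaller learning rate), and setting it to the number of samples gives batch GD.

```python
import numpy as np

def minibatch_gd(X, y, lr=0.05, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch gradient descent on the MSE loss."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)  # reshuffle the samples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # MSE gradient estimated from the current batch only.
            grad = -(2.0 / len(idx)) * Xb.T @ (yb - Xb @ w)
            w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w_mb = minibatch_gd(X, y)
print("mini-batch weights:", w_mb)
```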

Learning Rate

The learning rate $\eta$ is usually the most important hyperparameter to tune:

  • Too large: Diverges, loss explodes
  • Too small: Converges painfully slowly
  • Just right: Smooth, steady decrease in loss

In practice, use learning rate schedules or adaptive methods (Adam, RMSprop).
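
The three regimes can be observed on a toy problem. In the sketch below, the helper `final_mse`, the three learning rates, and the data are all chosen for illustration. For this quadratic loss, plain gradient descent diverges once the learning rate exceeds 2 divided by the largest eigenvalue of the Hessian $\frac{2}{n} \mathbf{X}^T \mathbf{X}$, which is a little under 1 for this data.

```python
import numpy as np

def final_mse(X, y, lr, n_steps=100):
    """Run batch gradient descent from w = 0 and return the final MSE."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_steps):
        w = w + lr * (2.0 / n) * X.T @ (y - X @ w)
    return np.mean((y - X @ w) ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=200)

# Too large, too small, and a reasonable value for this problem.
for lr in (1.5, 0.001, 0.1):
    print(f"lr={lr:<6} final MSE = {final_mse(X, y, lr):.4g}")
```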

Evaluating the Model

Metrics

  • MSE (Mean Squared Error): Average of squared residuals. Penalizes large errors.
  • RMSE: Square root of MSE. Same units as $y$.
  • MAE (Mean Absolute Error): Average of absolute residuals. Less sensitive to outliers than MSE.
  • R-squared: Proportion of variance explained. $R^2 = 1$ means perfect fit, $R^2 = 0$ means the model is no better than predicting the mean, and on held-out data it can be negative when the model does worse than the mean.

$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$
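
The four metrics are short enough to compute by hand. The helper `regression_metrics` and the four-point example below are made up for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R-squared for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                 # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(residuals)),
        "r2": 1.0 - ss_res / ss_tot,
    }

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 8.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)  # mse = 0.375, mae = 0.5, r2 = 0.925
```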

Residual Analysis

A well-fit model should have residuals that are:

  1. Centered at zero — no systematic bias
  2. Constant variance — homoscedasticity
  3. Normally distributed — for valid confidence intervals
  4. Independent — no patterns over time or features

If residuals show patterns, the model is missing something.
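
As a small example of residuals revealing a missing term, the sketch below fits a straight line to data that is actually quadratic. The data-generating function and variable names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=300)
y = 1.0 + x**2 + 0.1 * rng.normal(size=300)  # truly quadratic data

# Fit a straight line (intercept + slope) by least squares.
A = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - A @ coef

# With an intercept, least-squares residuals average to zero by construction...
print("mean residual:", residuals.mean())
# ...but they can still track the missing x^2 term closely.
corr = np.corrcoef(residuals, x**2)[0, 1]
print("corr(residuals, x^2):", corr)
```

A mean of zero alone is therefore not evidence of a good fit; plotting or correlating residuals against candidate features is what exposes the pattern.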

The Bias-Variance Tradeoff

Under squared-error loss, a model's expected prediction error can be decomposed into three parts:

$$\mathbb{E}\!\big[\text{Error}\big] = \text{Bias}^2 + \text{Variance} + \sigma^2_{\text{noise}}$$

  • Bias: Error from wrong assumptions (underfitting). A linear model applied to curved data has high bias.
  • Variance: Error from sensitivity to training data (overfitting). A very flexible model has high variance.
  • Irreducible noise ($\sigma^2_{\text{noise}}$): Inherent randomness in the data. No model can reduce this.

The goal is to find the sweet spot — complex enough to capture patterns, simple enough to generalize.
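
The decomposition can be estimated empirically by refitting a model on many noisy copies of the same problem. The sketch below is a simulation under stated assumptions: a sine curve as the true function, a fixed grid of training inputs with only the noise resampled, and polynomial fits of degree 1, 3, and 9. The names `true_f` and `bias_variance` are hypothetical.

```python
import numpy as np

def true_f(x):
    return np.sin(2 * np.pi * x)

def bias_variance(degree, n_train=20, n_trials=300, noise=0.3, seed=0):
    """Estimate average squared bias and variance of a polynomial fit."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(0.0, 1.0, n_train)  # fixed design
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Same inputs every trial; only the noise in y is redrawn.
        y_train = true_f(x_train) + noise * rng.normal(size=n_train)
        coefs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

for degree in (1, 3, 9):
    b, v = bias_variance(degree)
    print(f"degree {degree}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

In this setup the straight line should show the largest bias and the smallest variance, with the degree-9 fit at the opposite end.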

Regularization

Regularization prevents overfitting by adding a penalty on the weights.

Ridge Regression (L2)

$$\mathcal{L}_{\text{ridge}} = \text{MSE} + \lambda \sum_{i} w_i^2$$

  • Shrinks all weights toward zero
  • Never sets weights exactly to zero
  • Closed-form solution: $\mathbf{w} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$ when the penalty is added to the RSS; with the MSE form above, $\lambda$ is replaced by $n\lambda$
  • Stabilizes the solution when features are correlated
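
The closed-form ridge solution takes only a few lines. The sketch below uses the RSS-scaled penalty that the closed form corresponds to, and two strongly correlated synthetic features. `ridge_fit` is a hypothetical helper and no intercept is included.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - Xw||^2 + lam * ||w||^2 in closed form."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # strongly correlated with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=n)

for lam in (0.0, 1.0, 100.0):
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:<6} w={w}  ||w||={np.linalg.norm(w):.3f}")
```

The norm of the weight vector shrinks as `lam` grows, while both weights stay nonzero.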

Lasso Regression (L1)

$$\mathcal{L}_{\text{lasso}} = \text{MSE} + \lambda \sum_{i} |w_i|$$

  • Can set weights exactly to zero (feature selection)
  • No closed-form solution — requires iterative optimization
  • Produces sparse models
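
One iterative option for the lasso objective is proximal gradient descent (often called ISTA), which alternates a gradient step on the MSE with a soft-thresholding step. This is a sketch of that particular technique, not the only or necessarily the fastest solver. The names `soft_threshold` and `lasso_ista`, the choice `lam=0.5`, and the data are assumptions for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_steps=2000):
    """Proximal gradient descent on MSE + lam * sum(|w_i|)."""
    n, p = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the MSE gradient.
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n
    step = 1.0 / L
    w = np.zeros(p)
    for _ in range(n_steps):
        grad = -(2.0 / n) * X.T @ (y - X @ w)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0])  # only two active features
y = X @ true_w + 0.1 * rng.normal(size=200)

w_lasso = lasso_ista(X, y, lam=0.5)
print("lasso weights:", w_lasso)
```

The soft-thresholding step is what produces exact zeros: any coordinate whose gradient step lands within `step * lam` of zero is set to zero.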

Elastic Net

$$\mathcal{L}_{\text{elastic}} = \text{MSE} + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2$$

Combines L1 and L2 — useful when features are correlated and you still want sparsity.

Bayesian interpretation: Ridge corresponds to a Gaussian prior on weights. Lasso corresponds to a Laplace prior. The regularization strength $\lambda$ is inversely proportional to the prior variance, so a tighter prior means stronger shrinkage. (See the MAP estimation article for details.)

Polynomial and Feature Engineering

Linear regression is linear in the weights, not necessarily in the features. We can model nonlinear relationships by creating new features:

$$y = w_0 + w_1 x + w_2 x^2 + w_3 x^3$$

This is still linear regression — just with polynomial features. Other transformations:

  • Interaction terms: $x_1 \cdot x_2$
  • Log transforms: $\log(x)$
  • Binning: Convert continuous to categorical
  • Basis functions: $\sin(x)$, $\exp(x)$, etc.

Warning: More features means higher variance. Always regularize when adding engineered features.
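
The cubic model from this section can be fit with ordinary least squares once the powers of $x$ are stacked into a design matrix. The coefficients used to generate the data and the variable names below are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 1.0 + 0.5 * x - 2.0 * x**2 + 0.3 * x**3 + 0.1 * rng.normal(size=200)

# Design matrix with columns [1, x, x^2, x^3]; the model stays linear in w.
Phi = np.vander(x, N=4, increasing=True)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("fitted weights (w0..w3):", w)
```

The same pattern applies to the other transformations: compute the new columns, append them to the design matrix, and fit as usual.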

Assumptions of Linear Regression

For the standard theory (confidence intervals, p-values) to be valid:

  1. Linearity: The relationship between $\mathbf{X}$ and $\mathbf{y}$ is truly linear
  2. Independence: Observations are independent
  3. Homoscedasticity: Constant variance of residuals
  4. Normality: Residuals are normally distributed (for inference)
  5. No perfect multicollinearity: No feature is a perfect linear combination of others

Violations don’t make linear regression useless — they just limit what you can conclude from it.

From Linear to Logistic Regression

For classification (predicting categories instead of numbers), we wrap the linear model in a sigmoid function:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$

This is logistic regression — still a linear model at its core, but outputting probabilities between 0 and 1. It’s trained with MLE (cross-entropy loss) instead of least squares.
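
A minimal sketch of logistic regression trained by gradient descent on the mean cross-entropy loss is shown below. The helper names `sigmoid` and `fit_logistic`, the learning rate, and the synthetic labels are assumptions. As in the formula above, there is no intercept term.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_steps=2000):
    """Gradient descent on the mean cross-entropy loss."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_steps):
        probs = sigmoid(X @ w)
        grad = X.T @ (probs - y) / n  # gradient of the mean cross-entropy
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
true_w = np.array([2.0, -1.0])
# Labels are sampled from the model's own probabilities, so classes overlap.
y = (rng.uniform(size=500) < sigmoid(X @ true_w)).astype(float)

w = fit_logistic(X, y)
accuracy = np.mean((sigmoid(X @ w) > 0.5) == (y == 1))
print("weights:", w, "train accuracy:", accuracy)
```

The gradient has the same form as in linear regression, with the sigmoid output taking the place of the linear prediction.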

Summary

  • Linear regression models $\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$
  • The normal equation gives an exact closed-form solution
  • Gradient descent handles large-scale optimization
  • $R^2$, RMSE, and residual analysis evaluate fit quality
  • Bias-variance tradeoff guides model complexity choices
  • Ridge (L2) and Lasso (L1) regularization prevent overfitting
  • Feature engineering lets linear models capture nonlinear patterns
  • Linear regression is the foundation for logistic regression, SVMs, and neural networks

