Linear Regression

The Simplest Useful Model

Linear regression is the workhorse of predictive modeling. Despite its simplicity, it’s the foundation upon which much of machine learning is built. Understanding it deeply means understanding optimization, overfitting, regularization, and the bias-variance tradeoff.

The model assumes a linear relationship between inputs and output:

y = w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b

Or in matrix form:

\mathbf{y} = \mathbf{X} \mathbf{w} + \boldsymbol{\epsilon}

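where $\mathbf{X}$ is the design matrix with one row per sample, the bias $b$ is absorbed into $\mathbf{w}$ by appending a constant column of ones to $\mathbf{X}$, and $\boldsymbol{\epsilon}$ represents noise (typically assumed Gaussian).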
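To make the link between the scalar and matrix forms concrete, here is a minimal NumPy sketch. The toy data, the weights, and the variable names are assumptions chosen for illustration; the point is only that appending a column of ones turns the bias into one more weight.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 5 samples, 2 features.
X_raw = rng.normal(size=(5, 2))
w = np.array([2.0, -1.0])   # feature weights w_1, w_2
b = 0.5                     # bias term

# Scalar form: y_i = w_1 * x_i1 + w_2 * x_i2 + b, computed row by row.
y_scalar = np.array([w[0] * x[0] + w[1] * x[1] + b for x in X_raw])

# Matrix form: append a column of ones so the bias becomes one more weight.
X = np.column_stack([X_raw, np.ones(len(X_raw))])
w_aug = np.append(w, b)
y_matrix = X @ w_aug

print(np.allclose(y_scalar, y_matrix))
```

The noise term is left out here so that the two forms can be compared exactly.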

Why Linear Models Matter

Even in an era of deep learning, linear regression remains essential:

Interpretable: Each weight gives the change in the prediction per unit change in its feature, holding the other features fixed
Fast: Training is near-instant, even on millions of data points
Baseline: Every ML project should start with a linear model
Foundation: Logistic regression, SVMs, and neural networks all generalize linear models

The Normal Equation

For linear regression, there’s a closed-form solution — no iterative optimization needed:

\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

This minimizes the sum of squared errors and, provided $\mathbf{X}^T \mathbf{X}$ is invertible, gives the exact optimal weights in one step.
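As a rough illustration, the normal equation can be solved in a few lines of NumPy. The synthetic data below is an assumption made for the example, and the sketch solves the linear system $\mathbf{X}^T \mathbf{X} \mathbf{w} = \mathbf{X}^T \mathbf{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, which is usually the numerically safer choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 100 samples, 3 features plus a bias column.
n = 100
X = np.column_stack([rng.normal(size=(n, 3)), np.ones(n)])
w_true = np.array([1.5, -2.0, 0.5, 4.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Solve (X^T X) w = X^T y directly instead of computing the inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from NumPy's SVD-based least-squares routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_normal)
print(np.allclose(w_normal, w_lstsq))
```

Both routes should agree on well-conditioned data like this; they differ mainly in how they behave when $\mathbf{X}^T \mathbf{X}$ is close to singular.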

Derivation

We want to minimize the residual sum of squares:

\text{RSS}(\mathbf{w}) = \|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2 = (\mathbf{y} - \mathbf{X}\mathbf{w})^T (\mathbf{y} - \mathbf{X}\mathbf{w})

Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:

\frac{\partial \, \text{RSS}}{\partial \mathbf{w}} = -2 \mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w}) = 0

\mathbf{X}^T \mathbf{X} \, \mathbf{w} = \mathbf{X}^T \mathbf{y}

\mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}

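Geometric interpretation: The prediction $\mathbf{X}\mathbf{w}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residual vector $\mathbf{y} - \mathbf{X}\mathbf{w}$ is therefore orthogonal to every column of $\mathbf{X}$, which is exactly the condition $\mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w}) = 0$ from the derivation.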
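The projection picture can be checked numerically. In this sketch (random data and names are assumptions for illustration), the residual of a least-squares fit is orthogonal to every column of $\mathbf{X}$, and the squared lengths satisfy the Pythagorean identity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])
y = rng.normal(size=n)  # arbitrary target; the property holds for any y

w, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ w            # projection of y onto the column space of X
residual = y - y_hat

# X^T r should vanish up to floating-point error.
orthogonality = X.T @ residual
print(orthogonality)

# Pythagoras: ||y||^2 = ||y_hat||^2 + ||r||^2.
print(y @ y, y_hat @ y_hat + residual @ residual)
```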

When the Normal Equation Fails

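The normal equation requires inverting $\mathbf{X}^T \mathbf{X}$, which can be impossible or impractical. Writing $n$ for the number of samples and $p$ for the number of features:

Multicollinearity: Features are highly correlated, so $\mathbf{X}^T \mathbf{X}$ is nearly singular and the solution is numerically unstable
More features than samples: $n < p$ makes $\mathbf{X}^T \mathbf{X}$ rank-deficient, so no unique solution exists
Large problems: Forming $\mathbf{X}^T \mathbf{X}$ costs $O(np^2)$ and inverting it costs $O(p^3)$, which becomes expensive as $p$ grows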

In these cases, we turn to gradient descent or regularization.
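A small sketch of the multicollinearity failure, under the assumption of a toy design in which the third feature is an exact sum of the first two. The rank and condition number reveal the problem; an SVD-based solver or a small ridge penalty (the value `lam = 1e-3` is an arbitrary choice for the example) still produces usable predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# The third column is an exact linear combination of the first two.
X = np.column_stack([x1, x2, x1 + x2])
y = 2.0 * x1 - x2 + 0.05 * rng.normal(size=n)

gram = X.T @ X
rank = np.linalg.matrix_rank(X)   # 2, not 3: one direction carries no information
cond = np.linalg.cond(gram)       # enormous: inverting gram is not meaningful
print(rank, cond)

# lstsq is SVD-based and returns the minimum-norm least-squares solution.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# A small ridge penalty makes the system well-posed again.
lam = 1e-3
w_ridge = np.linalg.solve(gram + lam * np.eye(3), X.T @ y)

print(w_min_norm, w_ridge)
```

The individual weights are not identifiable in this setting, but the fitted predictions from the two approaches should be nearly identical.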

Gradient Descent

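For large-scale problems, we optimize iteratively. Taking the loss to be the mean squared error $\mathcal{L} = \frac{1}{n} \text{RSS}(\mathbf{w})$ over $n$ samples, its gradient with respect to the weights is: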

\nabla_{\mathbf{w}} \mathcal{L} = -\frac{2}{n} \mathbf{X}^T (\mathbf{y} - \mathbf{X}\mathbf{w})

Update rule:

\mathbf{w} := \mathbf{w} - \eta \, \nabla_{\mathbf{w}} \mathcal{L}
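The gradient and update rule translate almost line for line into code. The following is a minimal sketch of batch gradient descent on the MSE loss; the function name `gradient_descent`, the step count, the learning rate, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, n_steps=500):
    """Batch gradient descent on the MSE loss (1/n) * ||y - X w||^2."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_steps):
        grad = -(2.0 / n) * X.T @ (y - X @ w)  # gradient of the MSE
        w = w - eta * grad                     # update rule
    return w

rng = np.random.default_rng(3)
n = 200
X = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])
w_true = np.array([1.0, -3.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_gd = gradient_descent(X, y)
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_gd, w_exact)
```

On a small, well-conditioned problem like this one, the iterative solution should match the closed-form least-squares solution to high precision.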

Variants

Method	Batch Size	Pros	Cons
Batch GD	All data	Stable, exact gradient	Slow for large datasets
Stochastic GD	1 sample	Fast per step	Noisy, oscillates
Mini-batch GD	32-256 samples	Best of both worlds	Need to tune batch size
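The three variants in the table differ only in how many samples feed each gradient estimate. A rough sketch with a configurable batch size follows; the function name `minibatch_gd`, the hyperparameters, and the synthetic data are assumptions chosen for the example.

```python
import numpy as np

def minibatch_gd(X, y, eta=0.05, batch_size=32, n_epochs=50, seed=0):
    """Mini-batch gradient descent on the MSE loss.

    batch_size = n recovers batch GD; batch_size = 1 recovers stochastic GD.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = -(2.0 / len(idx)) * Xb.T @ (yb - Xb @ w)
            w = w - eta * grad
    return w

rng = np.random.default_rng(3)
n = 512
X = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])
w_true = np.array([1.0, -3.0, 2.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_mb = minibatch_gd(X, y)
w_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_mb, w_exact)
```

With a constant learning rate the mini-batch iterate does not settle exactly on the least-squares solution; it hovers in a small neighbourhood whose size shrinks with larger batches or a smaller $\eta$.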

Learning Rate

The learning rate $\eta$ is often the most important hyperparameter to tune:

Too large: Diverges, loss explodes
Too small: Converges painfully slowly
Just right: Smooth, steady decrease in loss

In practice, use learning rate schedules or adaptive methods (Adam, RMSprop).
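For the quadratic MSE loss the three regimes can be made precise: batch gradient descent converges only when $\eta < 2 / \lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian $\frac{2}{n} \mathbf{X}^T \mathbf{X}$. The sketch below, on assumed synthetic data with a hypothetical helper `final_mse`, compares a rate just above that threshold, one comfortably below it, and one far too small.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
X = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=n)

def final_mse(eta, n_steps=200):
    """Run batch GD from w = 0 and return the training MSE at the end."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = -(2.0 / n) * X.T @ (y - X @ w)
        w = w - eta * grad
    return np.mean((y - X @ w) ** 2)

# Critical learning rate for this quadratic loss.
lambda_max = np.linalg.eigvalsh((2.0 / n) * X.T @ X).max()
eta_crit = 2.0 / lambda_max

loss_too_large = final_mse(1.1 * eta_crit)    # above the threshold: diverges
loss_good = final_mse(0.5 * eta_crit)         # converges quickly
loss_too_small = final_mse(0.001 * eta_crit)  # barely moves in 200 steps

print(loss_too_large, loss_good, loss_too_small)
```

This threshold is specific to quadratic losses; for general models the usable range of $\eta$ has to be found empirically.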

Evaluating the Model

Metrics

MSE (Mean Squared Error): Average of squared residuals. Penalizes large errors disproportionately.
RMSE: Square root of MSE. Same units as $y$.
MAE (Mean Absolute Error): Average of absolute residuals. Less sensitive to outliers than MSE.
R-squared: Proportion of variance explained. $R^2 = 1$ means a perfect fit, and $R^2 = 0$ means the model is no better than predicting the mean. On held-out data $R^2$ can be negative, meaning the model does worse than predicting the mean.

R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
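These metrics are simple enough to compute by hand. A minimal sketch follows, with a hypothetical helper `regression_metrics` and a four-point toy example chosen so the values are easy to verify.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R^2 for a set of predictions."""
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                      # numerator of the R^2 ratio
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)     # denominator of the R^2 ratio
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# Residuals are [1, 0, -1, 0], so MSE = 0.5, MAE = 0.5, and R^2 = 1 - 2/20 = 0.9.
m = regression_metrics(y_true, y_pred)
print(m)

# Predicting the mean everywhere gives R^2 = 0.
m_mean = regression_metrics(y_true, np.full(4, y_true.mean()))
print(m_mean["r2"])
```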

Residual Analysis

A well-fit model should have residuals that are:

Centered at zero — no systematic bias
Constant variance — homoscedasticity
Normally distributed — for valid confidence intervals
Independent — no patterns over time or features

If residuals show patterns, the model is missing something.
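As an illustration of what a residual pattern looks like, the sketch below fits a straight line to data whose true relationship is quadratic (the data-generating process is an assumption made for the example). The residuals average to zero, which an intercept guarantees, yet they correlate strongly with $x^2$, the term the model is missing.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(-2.0, 2.0, size=n)
y = 1.0 + 0.5 * x + 1.5 * x**2 + 0.2 * rng.normal(size=n)  # curved ground truth

# Straight-line fit with an intercept.
X_lin = np.column_stack([x, np.ones(n)])
w_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
res_lin = y - X_lin @ w_lin

# Zero mean holds by construction whenever an intercept is included,
# so it is not by itself evidence of a good fit.
mean_res = res_lin.mean()

# The residuals still track the missing x^2 term.
corr_lin = np.corrcoef(res_lin, x**2)[0, 1]

# Adding x^2 as a feature removes the pattern.
X_quad = np.column_stack([x**2, x, np.ones(n)])
w_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
res_quad = y - X_quad @ w_quad
corr_quad = np.corrcoef(res_quad, x**2)[0, 1]

print(mean_res, corr_lin, corr_quad)
```

In practice the same check is usually done visually, by plotting residuals against fitted values or against individual features.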

The Bias-Variance Tradeoff

For squared-error loss, a model's expected prediction error can be decomposed into three parts:

\mathbb{E}\!\big[\text{Error}\big] = \text{Bias}^2 + \text{Variance} + \sigma^2_{\text{noise}}

Bias: Error from wrong assumptions (underfitting). A linear model applied to curved data has high bias.
Variance: Error from sensitivity to training data (overfitting). A very flexible model has high variance.
Irreducible noise ($\sigma^2_{\text{noise}}$): Inherent randomness in the data. No model can reduce this.

The goal is to find the sweet spot — complex enough to capture patterns, simple enough to generalize.
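The decomposition can be estimated by simulation: refit the same model on many noisy training sets and look at how the predictions behave on a fixed test grid. The sketch below is one way to do this under stated assumptions (a sine ground truth, a fixed training grid with resampled noise, and polynomial degrees 1 and 7 as the rigid and flexible models); none of these choices come from the text above.

```python
import numpy as np

rng = np.random.default_rng(6)

def true_fn(x):
    return np.sin(2.0 * np.pi * x)

x_train = np.linspace(0.0, 1.0, 30)   # fixed design; only the noise is resampled
x_test = np.linspace(0.05, 0.95, 50)
noise_std = 0.3
n_trials = 300

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of given degree."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        y_train = true_fn(x_train) + noise_std * rng.normal(size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_fn(x_test)) ** 2)  # averaged over the test grid
    variance = np.mean(preds.var(axis=0))                  # averaged over the test grid
    return bias_sq, variance

bias_lin, var_lin = bias_variance(1)    # rigid model: high bias, low variance
bias_flex, var_flex = bias_variance(7)  # flexible model: low bias, higher variance

print(bias_lin, var_lin)
print(bias_flex, var_flex)
```

The straight line cannot follow the sine curve however much data it sees, while the degree-7 fit tracks the curve on average but moves more from one training set to the next.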

Regularization

Regularization prevents overfitting by adding a penalty on the weights.

Ridge Regression (L2)

\mathcal{L}_{\text{ridge}} = \text{MSE} + \lambda \sum_{i} w_i^2

Shrinks all weights toward zero
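Does not, in general, set weights exactly to zero
Closed-form solution: $\mathbf{w} = (\mathbf{X}^T \mathbf{X} + n\lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$ for the MSE-based loss above with $n$ samples (if the penalty is added to the RSS instead, the factor $n$ disappears and $\mathbf{w} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$)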
Stabilizes the solution when features are correlated
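A short sketch of the stabilizing effect, assuming two nearly identical features. The helper `ridge_fit` and the penalty value are illustrative, and `lam` here is on the RSS scale, that is, it multiplies $\|\mathbf{w}\|^2$ added to the unnormalized sum of squared errors.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge on the RSS scale: minimizes ||y - X w||^2 + lam * ||w||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(7)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # nearly a copy of x1
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.5 * rng.normal(size=n)

w_ols = ridge_fit(X, y, lam=0.0)    # unregularized: weights can be large and opposite in sign
w_ridge = ridge_fit(X, y, lam=1.0)  # penalized: weight is shared between the two features

print(w_ols)
print(w_ridge)
```

No intercept is fitted in this sketch; when one is included, it is common to leave it out of the penalty.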

Lasso Regression (L1)

\mathcal{L}_{\text{lasso}} = \text{MSE} + \lambda \sum_{i} |w_i|

Can set weights exactly to zero (feature selection)
No closed-form solution — requires iterative optimization
Produces sparse models
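One standard iterative method for the Lasso objective is proximal gradient descent (ISTA), which alternates a gradient step on the MSE with a soft-thresholding step. The sketch below is a minimal version under assumed synthetic data in which only two of ten features matter; it is one possible solver, not the only one (coordinate descent is another common choice).

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_steps=1000):
    """Proximal gradient (ISTA) for the loss MSE + lam * sum(|w_i|)."""
    n, p = X.shape
    # Step size 1/L, where L is the Lipschitz constant of the MSE gradient.
    L = 2.0 / n * np.linalg.eigvalsh(X.T @ X).max()
    step = 1.0 / L
    w = np.zeros(p)
    for _ in range(n_steps):
        grad = -(2.0 / n) * X.T @ (y - X @ w)
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(8)
n, p = 200, 10
X = rng.normal(size=(n, p))
w_true = np.zeros(p)
w_true[:2] = [3.0, -2.0]              # only the first two features matter
y = X @ w_true + 0.1 * rng.normal(size=n)

w_lasso = lasso_ista(X, y, lam=0.5)
print(w_lasso)
```

The soft-thresholding step is what produces exact zeros: any coordinate whose gradient-updated value falls inside the threshold is set to zero rather than merely made small. The surviving weights are shrunk toward zero as well.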

Elastic Net

\mathcal{L}_{\text{elastic}} = \text{MSE} + \lambda_1 \sum_{i} |w_i| + \lambda_2 \sum_{i} w_i^2

Combines L1 and L2 — useful when features are correlated and you still want sparsity.

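Bayesian interpretation: Ridge corresponds to MAP estimation with a zero-mean Gaussian prior on the weights, and Lasso to a Laplace prior. The regularization strength $\lambda$ grows as the prior becomes narrower: for Ridge it is proportional to the inverse prior variance, and for Lasso to the inverse of the Laplace scale. (See the MAP estimation article for details.)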

Polynomial Features and Feature Engineering

Linear regression is linear in the weights, not necessarily in the features. We can model nonlinear relationships by creating new features:

y = w_0 + w_1 x + w_2 x^2 + w_3 x^3

This is still linear regression — just with polynomial features. Other transformations:

Interaction terms: $x_1 \cdot x_2$
Log transforms: $\log(x)$
Binning: Convert continuous to categorical
Basis functions: $\sin(x)$, $\exp(x)$, etc.

Warning: Every added feature increases model flexibility and therefore variance. Regularize when adding many engineered features.
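The cubic model above can be fitted with ordinary least squares once the powers of $x$ are laid out as columns. A minimal sketch follows, assuming noise-free data generated from hypothetical coefficients so that the recovery can be checked exactly.

```python
import numpy as np

rng = np.random.default_rng(9)
x = rng.uniform(-1.0, 1.0, size=50)

# Hypothetical ground-truth coefficients w_0, w_1, w_2, w_3.
w_true = np.array([0.5, -1.0, 2.0, 3.0])
y = w_true[0] + w_true[1] * x + w_true[2] * x**2 + w_true[3] * x**3  # noise-free

# Design matrix with columns 1, x, x^2, x^3: nonlinear in x, linear in the weights.
X_poly = np.vander(x, N=4, increasing=True)

w_fit, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(w_fit)
```

The solver never sees that the columns are powers of one variable; from its point of view this is a plain four-feature linear regression.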

Assumptions of Linear Regression

For the standard theory (confidence intervals, p-values) to be valid:

Linearity: The relationship between $\mathbf{X}$ and $\mathbf{y}$ is truly linear
Independence: Observations are independent
Homoscedasticity: Constant variance of residuals
Normality: Residuals are normally distributed (for inference)
No perfect multicollinearity: No feature is a perfect linear combination of others

Violations don’t make linear regression useless — they just limit what you can conclude from it.

From Linear to Logistic Regression

For classification (predicting categories instead of numbers), we wrap the linear model in a sigmoid function:

P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}

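This is logistic regression: still a linear model at its core, since the decision boundary $\mathbf{w}^T \mathbf{x} = 0$ is linear, but outputting probabilities between 0 and 1. It is trained by maximum likelihood estimation (equivalently, by minimizing cross-entropy loss) instead of least squares.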
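A minimal sketch of logistic regression trained by gradient descent on the mean cross-entropy loss. The function names, learning rate, step count, and synthetic data (labels sampled from a logistic model with hypothetical weights) are assumptions for illustration. The gradient has the same shape as in linear regression, with predicted probabilities in place of linear predictions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, eta=0.5, n_steps=1000):
    """Gradient descent on the mean cross-entropy loss."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_steps):
        probs = sigmoid(X @ w)
        grad = X.T @ (probs - y) / n   # gradient of the mean cross-entropy
        w = w - eta * grad
    return w

rng = np.random.default_rng(10)
n = 300
X = np.column_stack([rng.normal(size=(n, 2)), np.ones(n)])  # bias as a column of ones

# Labels sampled from a logistic model with hypothetical weights.
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.uniform(size=n) < sigmoid(X @ w_true)).astype(float)

w_fit = fit_logistic(X, y)
probs = sigmoid(X @ w_fit)
accuracy = np.mean((probs > 0.5) == (y == 1.0))

print(w_fit, accuracy)
```

Because the labels are themselves random draws, even the true weights would not classify every point correctly, so the training accuracy stays below 1.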

Summary

Linear regression models $\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon}$
The normal equation gives an exact closed-form solution
Gradient descent handles large-scale optimization
$R^2$, RMSE, and residual analysis evaluate fit quality
Bias-variance tradeoff guides model complexity choices
Ridge (L2) and Lasso (L1) regularization prevent overfitting
Feature engineering lets linear models capture nonlinear patterns
Linear regression is the foundation for logistic regression, SVMs, and neural networks
