The Simplest Useful Model
Linear regression is the workhorse of predictive modeling. Despite its simplicity, it’s the foundation upon which much of machine learning is built. Understanding it deeply means understanding optimization, overfitting, regularization, and the bias-variance tradeoff.
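The model assumes a linear relationship between inputs and output:
$$y = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + \epsilon$$
Or in matrix form:
$$\mathbf{y} = X\mathbf{w} + \boldsymbol{\epsilon}$$
where $X$ is the design matrix (one row per sample, with a column of ones absorbing the intercept $w_0$) and $\boldsymbol{\epsilon}$ represents noise (typically assumed Gaussian).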
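As a small illustration, the matrix form can be written out in NumPy. This is a sketch only: the sizes, the weights in `true_w`, and the noise level `noise_std` are made-up values, and the intercept is handled by prepending a column of ones to the features.

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 100, 3             # illustrative sizes (assumption)
true_w = np.array([2.0, -1.0, 0.5, 3.0])   # [intercept, w_1, w_2, w_3], made up

# Design matrix: a column of ones for the intercept, then the raw features.
features = rng.normal(size=(n_samples, n_features))
X = np.column_stack([np.ones(n_samples), features])

# y = X w + noise, with Gaussian noise as the model assumes.
noise_std = 0.1
y = X @ true_w + rng.normal(scale=noise_std, size=n_samples)
```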
Why Linear Models Matter
Even in an era of deep learning, linear regression remains essential:
- Interpretable: Each weight tells you how much the prediction changes per unit change in that feature, holding the other features fixed
- Fast: Training is near-instant, even on millions of data points
- Baseline: Every ML project should start with a linear model
- Foundation: Logistic regression, SVMs, and neural networks all generalize linear models
The Normal Equation
For linear regression, there’s a closed-form solution, so no iterative optimization is needed:
$$\hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
This minimizes the sum of squared errors and gives the exact optimal weights in one step.
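A minimal sketch of the closed-form fit, assuming a full-rank design matrix. Rather than forming the inverse explicitly, it solves the linear system $(X^\top X)\mathbf{w} = X^\top \mathbf{y}$, which is the usual numerically safer route. The function name `fit_normal_equation` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve (X^T X) w = X^T y for w without forming an explicit inverse."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
true_w = np.array([1.0, 2.0, -3.0])        # made-up weights for the demo
y = X @ true_w + rng.normal(scale=0.05, size=200)

w_hat = fit_normal_equation(X, y)          # close to true_w, up to noise
```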
Derivation
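We want to minimize the residual sum of squares:
$$\text{RSS}(\mathbf{w}) = \|\mathbf{y} - X\mathbf{w}\|^2 = (\mathbf{y} - X\mathbf{w})^\top (\mathbf{y} - X\mathbf{w})$$
Taking the derivative with respect to $\mathbf{w}$ and setting it to zero:
$$\nabla_{\mathbf{w}} \text{RSS} = -2 X^\top (\mathbf{y} - X\mathbf{w}) = 0 \quad\Longrightarrow\quad X^\top X \mathbf{w} = X^\top \mathbf{y} \quad\Longrightarrow\quad \hat{\mathbf{w}} = (X^\top X)^{-1} X^\top \mathbf{y}$$
Geometric interpretation: The prediction $\hat{\mathbf{y}} = X\hat{\mathbf{w}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $X$. The residuals are perpendicular to every feature vector.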
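The orthogonality claim is easy to check numerically. The sketch below uses arbitrary random data (an assumption made purely for illustration) and verifies that the least-squares residuals have zero dot product with every column of $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)            # arbitrary targets; no linear structure needed

w_hat = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ w_hat                  # projection of y onto the column space of X
residuals = y - y_hat

# One dot product per feature; all vanish up to floating-point error.
feature_dot_residual = X.T @ residuals

# The projection ("hat") matrix H = X (X^T X)^{-1} X^T gives the same y_hat.
H = X @ np.linalg.solve(X.T @ X, X.T)
```

Because `H` is a projection, applying it twice changes nothing (`H @ H` equals `H` up to rounding).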
When the Normal Equation Fails
The normal equation requires inverting $X^\top X$, which can fail or become impractical when:
- Multicollinearity: Features are highly correlated ($X^\top X$ is nearly singular)
- More features than samples: $d > n$ (with $d$ features and $n$ samples) makes $X^\top X$ rank-deficient
- Large datasets: Matrix inversion is $O(d^3)$, expensive for many features
In these cases, we turn to gradient descent or regularization.
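The multicollinearity case can be reproduced with two perfectly collinear features. In this sketch (synthetic data, and the small penalty `lam` is a hypothetical value), the Gram matrix $X^\top X$ is singular, yet an SVD-based least-squares solver and a lightly regularized solve both still give usable predictions.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 2.0 * x1                          # perfectly collinear with x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=100)

rank = np.linalg.matrix_rank(X)        # rank 1 rather than 2
gram = X.T @ X
cond = np.linalg.cond(gram)            # enormous (or infinite) condition number

# lstsq is SVD-based and returns the minimum-norm least-squares solution.
w_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# A small ridge penalty makes the linear system well-posed again.
lam = 1e-3                             # hypothetical regularization strength
w_ridge = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)
```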
Gradient Descent
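For large-scale problems, we optimize iteratively. The gradient of the mean squared error loss $L(\mathbf{w}) = \frac{1}{n}\|\mathbf{y} - X\mathbf{w}\|^2$ with respect to the weights is:
$$\nabla_{\mathbf{w}} L = -\frac{2}{n} X^\top (\mathbf{y} - X\mathbf{w})$$
Update rule:
$$\mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla_{\mathbf{w}} L$$
where $\eta$ is the learning rate.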
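A minimal sketch of batch gradient descent, assuming the loss is the mean squared error $\frac{1}{n}\|\mathbf{y} - X\mathbf{w}\|^2$. The default learning rate `lr` and step count `n_steps` are illustrative choices that happen to suit this synthetic data; they are not recommendations.

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_steps=500):
    """Batch gradient descent on the mean squared error (1/n)||y - Xw||^2."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = -(2.0 / n) * (X.T @ (y - X @ w))   # gradient of the MSE
        w = w - lr * grad                          # step against the gradient
    return w

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(300), rng.normal(size=(300, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.1, size=300)

w_gd = gradient_descent(X, y)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form answer for comparison
```

On a small, well-conditioned problem like this one, the iterative and closed-form answers agree to many decimal places.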
Variants
| Method | Batch Size | Pros | Cons |
|---|---|---|---|
| Batch GD | All data | Stable, exact gradient | Slow for large datasets |
| Stochastic GD | 1 sample | Fast per step | Noisy, oscillates |
| Mini-batch GD | 32-256 samples | Best of both worlds | Need to tune batch size |
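The three variants in the table differ only in how many samples feed each gradient estimate, so one function can cover all of them. In this sketch, `batch_size=len(X)` gives batch GD and `batch_size=1` gives stochastic GD. The function name and default hyperparameters are assumptions chosen for the synthetic data.

```python
import numpy as np

def minibatch_sgd(X, y, batch_size=32, lr=0.05, n_epochs=50, seed=0):
    """Mini-batch gradient descent on the mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient estimated from the current batch only.
            grad = -(2.0 / len(idx)) * (Xb.T @ (yb - Xb @ w))
            w -= lr * grad
    return w

rng = np.random.default_rng(3)
X = np.column_stack([np.ones(320), rng.normal(size=(320, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(scale=0.1, size=320)

w_minibatch = minibatch_sgd(X, y, batch_size=32)
w_exact = np.linalg.solve(X.T @ X, X.T @ y)
```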
Learning Rate
The learning rate $\eta$ is often the single most important hyperparameter:
- Too large: Diverges, loss explodes
- Too small: Converges painfully slowly
- Just right: Smooth, steady decrease in loss
In practice, use learning rate schedules or adaptive methods (Adam, RMSprop).
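The three regimes can be seen on a one-feature problem. This sketch runs plain gradient descent on the mean squared error with three hypothetical learning rates and compares the final loss with the loss at the starting point $w = 0$.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)
y = 3.0 * x + rng.normal(scale=0.1, size=200)

def mse_after_gd(lr, n_steps=50):
    """Run gradient descent on a single weight and return the final MSE."""
    w = 0.0
    for _ in range(n_steps):
        grad = -2.0 * np.mean(x * (y - w * x))
        w -= lr * grad
    return np.mean((y - w * x) ** 2)

initial_mse = np.mean(y ** 2)              # loss at w = 0
losses = {lr: mse_after_gd(lr) for lr in (1e-4, 0.1, 1.5)}
```

With these values, `1e-4` barely moves the loss, `0.1` drives it close to the noise floor, and `1.5` overshoots on every step, so the loss grows.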
Evaluating the Model
Metrics
- MSE (Mean Squared Error): Average of squared residuals. Penalizes large errors.
- RMSE: Square root of MSE. Same units as $y$.
- MAE (Mean Absolute Error): Average of absolute residuals. More robust to outliers than MSE.
- R-squared: Proportion of variance explained. $R^2 = 1$ means perfect fit, $R^2 = 0$ means the model is no better than predicting the mean.
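These four metrics take only a few lines to compute. The helper below is a sketch (the name `regression_metrics` and the toy numbers are made up) that uses the definition $R^2 = 1 - \text{SS}_{\text{res}} / \text{SS}_{\text{tot}}$.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R^2 for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    ss_res = np.sum(residuals ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total variation
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(residuals)),
        "r2": 1.0 - ss_res / ss_tot,
    }

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred)
```

For this toy input the residuals are 1, 0, -1, 0, so MSE is 0.5, MAE is 0.5, and $R^2$ is 0.9.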
Residual Analysis
A well-fit model should have residuals that are:
- Centered at zero — no systematic bias
- Constant variance — homoscedasticity
- Normally distributed — for valid confidence intervals
- Independent — no patterns over time or features
If residuals show patterns, the model is missing something.
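One way to see "missing something" is to fit a straight line to data that are actually quadratic. In this sketch (synthetic data, illustrative coefficients), the residuals average to zero and are uncorrelated with $x$, as least squares guarantees, but they correlate strongly with $x^2$, the term the model left out.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, size=300)
y = 1.0 + 0.5 * x + 2.0 * x ** 2 + rng.normal(scale=0.2, size=300)

# Fit a straight line (intercept + slope) to curved data.
X = np.column_stack([np.ones_like(x), x])
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ w_hat

mean_residual = residuals.mean()                       # zero when an intercept is fit
corr_with_x = np.corrcoef(residuals, x)[0, 1]          # zero by construction
corr_with_x2 = np.corrcoef(residuals, x ** 2)[0, 1]    # large: unmodeled curvature
```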
The Bias-Variance Tradeoff
Every model’s expected squared error can be decomposed into three parts:
$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \text{Bias}\big[\hat{f}(x)\big]^2 + \text{Var}\big[\hat{f}(x)\big] + \sigma^2$$
- Bias: Error from wrong assumptions (underfitting). A linear model applied to curved data has high bias.
- Variance: Error from sensitivity to training data (overfitting). A very flexible model has high variance.
- Irreducible noise ($\sigma^2$): Inherent randomness in the data. No model can reduce this.
The goal is to find the sweet spot — complex enough to capture patterns, simple enough to generalize.
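The tradeoff can be estimated by simulation: draw many training sets, fit the same model class to each, and measure how the predictions behave at fixed test points. The setup below is entirely hypothetical (a sine-shaped target, 25 training points, polynomial degrees 1, 3, and 9), chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(6)

def true_function(x):
    return np.sin(2 * np.pi * x)       # hypothetical ground truth

x_test = np.linspace(0.05, 0.95, 50)
noise_std, n_train, n_trials = 0.3, 25, 300

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of this degree."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x_train = rng.uniform(0, 1, size=n_train)
        y_train = true_function(x_train) + rng.normal(scale=noise_std, size=n_train)
        coeffs = np.polyfit(x_train, y_train, degree)
        preds[t] = np.polyval(coeffs, x_test)
    # Bias: how far the average prediction is from the truth.
    bias_sq = np.mean((preds.mean(axis=0) - true_function(x_test)) ** 2)
    # Variance: how much predictions move between training sets.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {degree: bias_variance(degree) for degree in (1, 3, 9)}
```

In this setup the straight line has the largest squared bias and the degree-9 polynomial has the largest variance.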
Regularization
Regularization prevents overfitting by adding a penalty on the weights.
Ridge Regression (L2)
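$$\min_{\mathbf{w}} \; \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2$$
- Shrinks all weights toward zero
- Never sets weights exactly to zero
- Closed-form solution: $\hat{\mathbf{w}} = (X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$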
- Stabilizes the solution when features are correlated
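A sketch of the ridge closed form $(X^\top X + \lambda I)^{-1} X^\top \mathbf{y}$ on two nearly identical features. The function name `fit_ridge` and the penalty values are assumptions. For simplicity it penalizes every weight; real pipelines usually leave the intercept unpenalized.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge solution: solve (X^T X + lam * I) w = X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(7)
z = rng.normal(size=100)
# Two highly correlated features plus one independent feature.
X = np.column_stack([z, z + 0.01 * rng.normal(size=100), rng.normal(size=100)])
y = X @ np.array([1.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Weight norm for increasing penalty strength; lam = 0 is ordinary least squares.
norms = [np.linalg.norm(fit_ridge(X, y, lam)) for lam in (0.0, 1.0, 100.0)]
```

The weight norm shrinks as `lam` grows, which is the shrinkage described above.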
Lasso Regression (L1)
$$\min_{\mathbf{w}} \; \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_1$$
- Can set weights exactly to zero (feature selection)
- No closed-form solution — requires iterative optimization
- Produces sparse models
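One common iterative method for the lasso is coordinate descent with soft-thresholding. It is shown here as one possible choice, since the text does not name a solver. The sketch assumes the objective $\|\mathbf{y} - X\mathbf{w}\|^2 + \lambda\|\mathbf{w}\|_1$, for which each one-dimensional update has threshold $\lambda/2$. The function names, the data, and `lam=100.0` are illustrative.

```python
import numpy as np

def soft_threshold(a, t):
    """Shrink a toward zero by t, returning exactly zero when |a| <= t."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def fit_lasso_cd(X, y, lam, n_sweeps=200):
    """Coordinate descent for ||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = np.sum(X ** 2, axis=0)            # x_j^T x_j for each feature
    for _ in range(n_sweeps):
        for j in range(d):
            # Residual with feature j's current contribution removed.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return w

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 5))
true_w = np.array([4.0, 0.0, -3.0, 0.0, 0.0])  # only two informative features
y = X @ true_w + rng.normal(scale=0.5, size=200)

w_lasso = fit_lasso_cd(X, y, lam=100.0)
```

With this penalty the three uninformative weights come out exactly zero, while the two informative ones are kept but shrunk toward zero.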
Elastic Net
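$$\min_{\mathbf{w}} \; \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$$
Combines the L1 and L2 penalties, which is useful when features are correlated and you still want sparsity.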
Bayesian interpretation: Ridge corresponds to a Gaussian prior on weights. Lasso corresponds to a Laplace prior. The regularization strength $\lambda$ grows as the prior becomes narrower (for ridge, it is inversely proportional to the prior variance). (See the MAP estimation article for details.)
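The ridge case of this correspondence can be written out in a few lines. The derivation below is a sketch under two assumptions introduced here for illustration: a Gaussian likelihood with noise variance $\sigma^2$ and an isotropic Gaussian prior with variance $\tau^2$.

```latex
\begin{aligned}
\mathbf{y} \mid X, \mathbf{w} &\sim \mathcal{N}(X\mathbf{w},\, \sigma^2 I),
  \qquad \mathbf{w} \sim \mathcal{N}(\mathbf{0},\, \tau^2 I) \\
\hat{\mathbf{w}}_{\text{MAP}}
  &= \arg\max_{\mathbf{w}} \; \log p(\mathbf{y} \mid X, \mathbf{w}) + \log p(\mathbf{w}) \\
  &= \arg\min_{\mathbf{w}} \; \frac{1}{2\sigma^2}\|\mathbf{y} - X\mathbf{w}\|^2
     + \frac{1}{2\tau^2}\|\mathbf{w}\|_2^2 \\
  &= \arg\min_{\mathbf{w}} \; \|\mathbf{y} - X\mathbf{w}\|^2 + \lambda \|\mathbf{w}\|_2^2,
  \qquad \lambda = \frac{\sigma^2}{\tau^2}
\end{aligned}
```

A tighter prior (smaller $\tau^2$) therefore means a larger $\lambda$ and stronger shrinkage.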
Polynomial and Feature Engineering
Linear regression is linear in the weights, not necessarily in the features. We can model nonlinear relationships by creating new features:
$$y = w_0 + w_1 x + w_2 x^2 + \cdots + w_p x^p + \epsilon$$
This is still linear regression, just with polynomial features. Other transformations:
- Interaction terms: $x_1 x_2$
- Log transforms: $\log(x)$
- Binning: Convert continuous to categorical
- Basis functions: for example $\sin(x)$, $\exp(x)$, etc.
Warning: More features means higher variance. Always regularize when adding engineered features.
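A sketch of the polynomial case: expand a single input into columns $1, x, x^2, \dots$ and reuse ordinary least squares unchanged. The helper name `polynomial_features` and the quadratic toy data are assumptions. `np.vander` with `increasing=True` builds the power columns.

```python
import numpy as np

def polynomial_features(x, degree):
    """Columns [1, x, x^2, ..., x^degree] for a 1-D input array x."""
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(9)
x = rng.uniform(-3, 3, size=200)
y = 1.0 - 2.0 * x + 0.5 * x ** 2 + rng.normal(scale=0.3, size=200)

def fit_and_mse(degree):
    """Least-squares fit on polynomial features; returns weights and training MSE."""
    Phi = polynomial_features(x, degree)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w, np.mean((y - Phi @ w) ** 2)

w_linear, mse_linear = fit_and_mse(1)   # straight line: misses the curvature
w_quad, mse_quad = fit_and_mse(2)       # quadratic features: matches the data
```

The model stays linear in `w`. Only the feature matrix changed.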
Assumptions of Linear Regression
For the standard theory (confidence intervals, p-values) to be valid:
- Linearity: The relationship between $X$ and $y$ is truly linear
- Independence: Observations are independent
- Homoscedasticity: Constant variance of residuals
- Normality: Residuals are normally distributed (for inference)
- No perfect multicollinearity: No feature is a perfect linear combination of others
Violations don’t make linear regression useless — they just limit what you can conclude from it.
From Linear to Logistic Regression
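For classification (predicting categories instead of numbers), we wrap the linear model in a sigmoid function:
$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}}$$
This is logistic regression: still a linear model at its core, but outputting probabilities between 0 and 1. It’s trained by maximum likelihood (equivalently, by minimizing cross-entropy loss) instead of least squares.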
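A minimal sketch of logistic regression trained by gradient descent on the mean cross-entropy loss, whose gradient is $\frac{1}{n} X^\top (\mathbf{p} - \mathbf{y})$ with $\mathbf{p} = \sigma(X\mathbf{w})$. The function names, hyperparameters, and synthetic labels are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_steps=2000):
    """Gradient descent on the mean cross-entropy loss."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        p = sigmoid(X @ w)               # predicted probabilities
        grad = X.T @ (p - y) / n         # gradient of the mean cross-entropy
        w -= lr * grad
    return w

rng = np.random.default_rng(10)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_w = np.array([0.0, 3.0, -3.0])      # made-up weights for the demo
# Labels drawn as Bernoulli samples from the true probabilities.
y = (rng.uniform(size=500) < sigmoid(X @ true_w)).astype(float)

w_hat = fit_logistic(X, y)
probs = sigmoid(X @ w_hat)
accuracy = np.mean((probs >= 0.5) == (y == 1))
```

The update has the same shape as the least-squares gradient step, with predicted probabilities standing in for linear predictions.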
Summary
- Linear regression models $\mathbf{y} = X\mathbf{w} + \boldsymbol{\epsilon}$
- The normal equation gives an exact closed-form solution
- Gradient descent handles large-scale optimization
- $R^2$, RMSE, and residual analysis evaluate fit quality
- Bias-variance tradeoff guides model complexity choices
- Ridge (L2) and Lasso (L1) regularization prevent overfitting
- Feature engineering lets linear models capture nonlinear patterns
- Linear regression is the foundation for logistic regression, SVMs, and neural networks