Introduction
Linear algebra without geometry is just symbol manipulation. The concepts of length, distance, angle, and perpendicularity are what give vectors their geometric meaning — and these concepts are built on a single operation: the inner product.
In machine learning, inner products are everywhere. Cosine similarity in NLP, Euclidean distance in clustering, orthogonal projections in least squares, and the Gram-Schmidt process in QR decomposition — all rest on the ideas in this article. We build on linear transformations and set the stage for eigenvalues.
The Inner Product
Standard Inner Product
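The standard inner product (dot product) on $\mathbb{R}^n$ is:

$$\langle \mathbf{x}, \mathbf{y} \rangle = \mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i$$

This is the same dot product we introduced in the vectors article, now written in the more general inner product notation $\langle \mathbf{x}, \mathbf{y} \rangle$.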
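As a quick illustration, the sketch below computes the standard inner product three equivalent ways in NumPy. The vectors `x` and `y` are arbitrary example values chosen for this sketch.

```python
import numpy as np

# Arbitrary example vectors (not from the article)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, -5.0, 6.0])

ip_sum = sum(xi * yi for xi, yi in zip(x, y))  # explicit sum of products
ip_dot = np.dot(x, y)                          # NumPy dot product
ip_at = x @ y                                  # matrix-multiplication operator

print(ip_sum, ip_dot, ip_at)  # all three agree: 1*4 + 2*(-5) + 3*6 = 12
```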
General Inner Product
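An inner product on a real vector space $V$ is any function $\langle \cdot, \cdot \rangle : V \times V \to \mathbb{R}$ satisfying:
| Property | Statement |
|---|---|
| Symmetry | $\langle \mathbf{x}, \mathbf{y} \rangle = \langle \mathbf{y}, \mathbf{x} \rangle$ |
| Linearity (first argument) | $\langle a\mathbf{x} + b\mathbf{y}, \mathbf{z} \rangle = a \langle \mathbf{x}, \mathbf{z} \rangle + b \langle \mathbf{y}, \mathbf{z} \rangle$ |
| Positive definiteness | $\langle \mathbf{x}, \mathbf{x} \rangle \ge 0$ for all $\mathbf{x}$, and $\langle \mathbf{x}, \mathbf{x} \rangle = 0$ if and only if $\mathbf{x} = \mathbf{0}$ |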
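The standard dot product is just one inner product. Others exist: for example, the weighted inner product $\langle \mathbf{x}, \mathbf{y} \rangle_M = \mathbf{x}^\top M \mathbf{y}$, where $M$ is a symmetric positive definite matrix. This appears in Mahalanobis distance, where $M$ is the inverse covariance matrix of the features, so the distance accounts for feature correlations.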
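A minimal sketch of a weighted inner product and the Mahalanobis distance it induces follows. The function names `weighted_inner` and `mahalanobis` and the covariance matrix are hypothetical, chosen only for illustration; the weight matrix is assumed to be symmetric positive definite.

```python
import numpy as np

def weighted_inner(x, y, M):
    """Weighted inner product x^T M y (M assumed symmetric positive definite)."""
    return float(x @ M @ y)

def mahalanobis(x, y, cov):
    """Distance induced by the weighted inner product with M = inverse covariance."""
    M = np.linalg.inv(cov)
    d = x - y
    return float(np.sqrt(weighted_inner(d, d, M)))

# Made-up covariance for two positively correlated features
cov = np.array([[2.0, 0.8],
                [0.8, 1.0]])
x = np.array([1.0, 1.0])
y = np.array([0.0, 0.0])

print(mahalanobis(x, y, cov))        # correlation-aware distance
print(mahalanobis(x, y, np.eye(2)))  # identity covariance: plain Euclidean distance
```

With the identity as covariance, the weighted inner product reduces to the standard dot product, so the second value is the ordinary Euclidean distance.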
Norms Revisited
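Every inner product induces a norm (length measure):

$$\|\mathbf{x}\| = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$$

For the standard inner product, this gives the Euclidean norm:

$$\|\mathbf{x}\|_2 = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2}$$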
Important Norm Properties
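Cauchy-Schwarz Inequality:

$$|\langle \mathbf{x}, \mathbf{y} \rangle| \le \|\mathbf{x}\| \, \|\mathbf{y}\|$$

Equality holds if and only if $\mathbf{x}$ and $\mathbf{y}$ are parallel (one is a scalar multiple of the other). This is one of the most widely used inequalities in mathematics; here, it guarantees that the cosine formula gives values in $[-1, 1]$.
Triangle Inequality:

$$\|\mathbf{x} + \mathbf{y}\| \le \|\mathbf{x}\| + \|\mathbf{y}\|$$

The direct path is never longer than going through an intermediate point.
Parallelogram Law:

$$\|\mathbf{x} + \mathbf{y}\|^2 + \|\mathbf{x} - \mathbf{y}\|^2 = 2\|\mathbf{x}\|^2 + 2\|\mathbf{y}\|^2$$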
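The three norm properties above can be spot-checked numerically. This is only a sanity check on random vectors (the seed and dimension are arbitrary choices), not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
x = rng.standard_normal(5)
y = rng.standard_normal(5)
norm = np.linalg.norm

# Cauchy-Schwarz: |<x, y>| <= ||x|| ||y||
cs_lhs, cs_rhs = abs(x @ y), norm(x) * norm(y)

# Triangle inequality: ||x + y|| <= ||x|| + ||y||
tri_lhs, tri_rhs = norm(x + y), norm(x) + norm(y)

# Parallelogram law: ||x + y||^2 + ||x - y||^2 = 2||x||^2 + 2||y||^2
par_lhs = norm(x + y) ** 2 + norm(x - y) ** 2
par_rhs = 2 * norm(x) ** 2 + 2 * norm(y) ** 2

print("Cauchy-Schwarz holds:", cs_lhs <= cs_rhs)
print("Triangle inequality holds:", tri_lhs <= tri_rhs)
print("Parallelogram law holds:", np.isclose(par_lhs, par_rhs))
```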
Distance
The distance between two vectors is the norm of their difference:

$$d(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|$$
For the Euclidean norm, this is the standard Euclidean distance used in k-nearest neighbors, k-means clustering, and many other algorithms.
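As a rough illustration of how this distance is used in nearest-neighbor search, the sketch below finds the stored point closest to a query. The `points` and `query` values are invented for the example.

```python
import numpy as np

# Invented data: three stored points and one query point in R^2
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [1.0, 1.0]])
query = np.array([1.0, 2.0])

# Euclidean distance from the query to every row of `points`
dists = np.linalg.norm(points - query, axis=1)
nearest = int(np.argmin(dists))

print(dists)    # distances sqrt(5), sqrt(8), 1
print(nearest)  # index of the closest point: 2
```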
Angles and Cosine Similarity
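The angle $\theta$ between two nonzero vectors is defined by:

$$\cos\theta = \frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$

This ratio is called cosine similarity:

$$\text{cos\_sim}(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x}^\top \mathbf{y}}{\|\mathbf{x}\|_2 \, \|\mathbf{y}\|_2}$$

| Cosine similarity | Interpretation |
|---|---|
| $1$ | Vectors point in the same direction |
| $0$ | Vectors are orthogonal (perpendicular) |
| $-1$ | Vectors point in opposite directions |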
Key insight: Cosine similarity measures direction alignment regardless of magnitude. This is why it is so widely used in NLP: the meaning of an embedding is carried largely by its direction rather than its length. A short document and a long one on the same topic have high cosine similarity even though their word-count vectors have very different magnitudes.
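A minimal sketch of cosine similarity, assuming plain NumPy vectors, is shown below. The helper name `cosine_similarity` and the toy "document" vectors are illustrative choices, and raising on zero vectors is one possible design, since the ratio is undefined there.

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between two nonzero vectors."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    if nx == 0 or ny == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return float(x @ y / (nx * ny))

# Toy word-count vectors: same topic mix, very different lengths
short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc
other_doc = np.array([0.0, 0.0, 3.0])

print(cosine_similarity(short_doc, long_doc))    # same direction: 1 up to rounding
print(cosine_similarity(short_doc, other_doc))   # orthogonal: 0
print(cosine_similarity(short_doc, -short_doc))  # opposite direction: -1 up to rounding
```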
Orthogonality
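Two vectors are orthogonal if their inner product is zero:

$$\langle \mathbf{x}, \mathbf{y} \rangle = 0$$

A set of vectors $\{\mathbf{v}_1, \dots, \mathbf{v}_k\}$ is:
- Orthogonal if $\langle \mathbf{v}_i, \mathbf{v}_j \rangle = 0$ for all $i \ne j$
- Orthonormal if additionally $\|\mathbf{v}_i\| = 1$ for all $i$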
Orthonormal vectors are automatically linearly independent. Working with orthonormal bases makes everything simpler: coordinates become dot products, and matrix operations become clean.
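To make "coordinates become dot products" concrete, here is a small sketch with an orthonormal basis of $\mathbb{R}^2$. The basis vectors `q1`, `q2` and the vector `x` are example values picked for this illustration.

```python
import numpy as np

# An orthonormal basis of R^2 (example choice: the standard basis rotated by 45 degrees)
q1 = np.array([1.0, 1.0]) / np.sqrt(2)
q2 = np.array([1.0, -1.0]) / np.sqrt(2)

x = np.array([3.0, 1.0])

# Coordinates of x in this basis are just inner products with the basis vectors
c1, c2 = x @ q1, x @ q2

# Rebuilding x from its coordinates recovers the original vector
x_rebuilt = c1 * q1 + c2 * q2

print(c1, c2)
print(x_rebuilt)
```

No linear system has to be solved to find the coordinates, which is the practical payoff of orthonormality.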
The Pythagorean Theorem
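If $\mathbf{x} \perp \mathbf{y}$ (that is, $\langle \mathbf{x}, \mathbf{y} \rangle = 0$), then:

$$\|\mathbf{x} + \mathbf{y}\|^2 = \|\mathbf{x}\|^2 + \|\mathbf{y}\|^2$$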
This generalizes to any number of mutually orthogonal vectors.
Orthogonal Projection
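The projection of vector $\mathbf{b}$ onto vector $\mathbf{a}$ is:

$$\text{proj}_{\mathbf{a}} \mathbf{b} = \frac{\langle \mathbf{a}, \mathbf{b} \rangle}{\langle \mathbf{a}, \mathbf{a} \rangle} \, \mathbf{a}$$

The residual $\mathbf{b} - \text{proj}_{\mathbf{a}} \mathbf{b}$ is orthogonal to $\mathbf{a}$.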
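The projection formula translates almost directly into code. The sketch below uses a hypothetical helper `project_onto_vector` and example vectors chosen so the result is easy to check by hand.

```python
import numpy as np

def project_onto_vector(b, a):
    """Orthogonal projection of b onto the line spanned by a (a must be nonzero)."""
    return (a @ b) / (a @ a) * a

a = np.array([2.0, 0.0])  # direction to project onto (the x-axis)
b = np.array([3.0, 4.0])

p = project_onto_vector(b, a)  # component of b along a
r = b - p                      # residual

print(p)      # [3. 0.]
print(r)      # [0. 4.]
print(r @ a)  # 0.0, so the residual is orthogonal to a
```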
Projection onto a Subspace
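More generally, projecting $\mathbf{b}$ onto the column space of a matrix $A$ (with linearly independent columns) gives:

$$\hat{\mathbf{b}} = A (A^\top A)^{-1} A^\top \mathbf{b}$$

The matrix $P = A (A^\top A)^{-1} A^\top$ is the projection matrix. It satisfies $P^2 = P$ (applying it twice does nothing new) and $P^\top = P$ (it is symmetric).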
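Key insight: Least squares regression is an orthogonal projection. When $A\mathbf{x} = \mathbf{b}$ has no exact solution, the least squares solution $\hat{\mathbf{x}}$ is the one for which $A\hat{\mathbf{x}}$ is the projection of $\mathbf{b}$ onto the column space of $A$. The residual $\mathbf{b} - A\hat{\mathbf{x}}$ is orthogonal to every column of $A$.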
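The link between projection and least squares can be checked on a small overdetermined system. The matrix `A` and vector `b` below are made-up example data. Forming the projection matrix with an explicit inverse is done here only to mirror the formula; it is not how one would solve such problems in practice.

```python
import numpy as np

# Made-up overdetermined system: fit a line to three points
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 2.0, 2.0])

# Projection matrix onto the column space of A (explicit inverse for illustration only)
P = A @ np.linalg.inv(A.T @ A) @ A.T
b_hat = P @ b

# Least squares solution from NumPy's solver
x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
residual = b - A @ x_hat

print("P is idempotent:", np.allclose(P @ P, P))
print("P is symmetric:", np.allclose(P, P.T))
print("A @ x_hat equals the projection:", np.allclose(A @ x_hat, b_hat))
print("Residual is orthogonal to the columns of A:", np.allclose(A.T @ residual, 0.0))
```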
The Gram-Schmidt Process
The Gram-Schmidt process converts any set of linearly independent vectors into an orthonormal set spanning the same subspace.
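Algorithm: Given linearly independent vectors $\mathbf{v}_1, \dots, \mathbf{v}_k$:
- For $j = 1, \dots, k$:
  - Subtract the projections onto all previous vectors: $\mathbf{u}_j = \mathbf{v}_j - \sum_{i=1}^{j-1} \langle \mathbf{v}_j, \mathbf{q}_i \rangle \, \mathbf{q}_i$
  - Normalize: $\mathbf{q}_j = \mathbf{u}_j / \|\mathbf{u}_j\|$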
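The loop above can be sketched as a small general-purpose function. This is a minimal classical Gram-Schmidt, written for clarity rather than numerical robustness; the function name `gram_schmidt` and the tolerance `tol` are assumptions of this sketch.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-12):
    """Classical Gram-Schmidt: return an array whose rows are orthonormal.

    `vectors` is a sequence of linearly independent vectors of equal length.
    `tol` is an illustrative threshold for detecting linear dependence.
    """
    basis = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        u = v.copy()
        for q in basis:
            u = u - (v @ q) * q  # remove the component of v along each earlier q
        norm = np.linalg.norm(u)
        if norm < tol:
            raise ValueError("vectors are (numerically) linearly dependent")
        basis.append(u / norm)
    return np.array(basis)

Q = gram_schmidt([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
print(np.allclose(Q @ Q.T, np.eye(3)))  # True: the rows are orthonormal
```

In floating point, classical Gram-Schmidt can lose orthogonality on ill-conditioned inputs, so library routines are usually a safer choice for real workloads.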
Worked Example
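Orthogonalize $\mathbf{v}_1 = (1, 1, 0)$ and $\mathbf{v}_2 = (1, 0, 1)$.
Step 1: $\mathbf{u}_1 = \mathbf{v}_1 = (1, 1, 0)$
Step 2: $\mathbf{u}_2 = \mathbf{v}_2 - \dfrac{\langle \mathbf{v}_2, \mathbf{u}_1 \rangle}{\langle \mathbf{u}_1, \mathbf{u}_1 \rangle} \mathbf{u}_1 = (1, 0, 1) - \tfrac{1}{2}(1, 1, 0) = \left(\tfrac{1}{2}, -\tfrac{1}{2}, 1\right)$
Verify: $\langle \mathbf{u}_1, \mathbf{u}_2 \rangle = \tfrac{1}{2} - \tfrac{1}{2} + 0 = 0$. Orthogonal.
Step 3: Normalize: $\mathbf{q}_1 = \dfrac{\mathbf{u}_1}{\|\mathbf{u}_1\|} = \dfrac{1}{\sqrt{2}}(1, 1, 0)$ and $\mathbf{q}_2 = \dfrac{\mathbf{u}_2}{\|\mathbf{u}_2\|} = \dfrac{1}{\sqrt{6}}(1, -1, 2)$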
```python
import numpy as np

v1 = np.array([1, 1, 0], dtype=float)
v2 = np.array([1, 0, 1], dtype=float)

# Gram-Schmidt: keep v1, then remove from v2 its component along u1
u1 = v1
u2 = v2 - (v2 @ u1) / (u1 @ u1) * u1

# Normalize both vectors to unit length
q1 = u1 / np.linalg.norm(u1)
q2 = u2 / np.linalg.norm(u2)

print(f"q1 = {q1}")
print(f"q2 = {q2}")
print(f"q1 · q2 = {q1 @ q2:.10f}")  # ≈ 0 (orthogonal)
```
Connection to QR Decomposition
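Gram-Schmidt applied to the columns of $A$ produces the QR decomposition: $A = QR$, where $Q$ has orthonormal columns and $R$ is upper triangular. QR decomposition is a numerically stable way to solve least squares problems. In practice, libraries usually compute it with Householder reflections rather than classical Gram-Schmidt, which can lose orthogonality in floating point.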
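A short sketch of QR-based least squares with NumPy follows. The matrix `A` reuses the two example vectors as columns, the right-hand side `b` is made up, and `np.linalg.qr` is called in its default reduced mode. The signs of the columns of `Q` may differ from a hand-computed Gram-Schmidt result, since the factorization is unique only up to sign.

```python
import numpy as np

# Columns are the example vectors (1, 1, 0) and (1, 0, 1); b is made up
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

Q, R = np.linalg.qr(A)  # reduced QR: Q is 3x2, R is 2x2

# Least squares via QR: minimize ||Ax - b|| by solving R x = Q^T b
x_qr = np.linalg.solve(R, Q.T @ b)

# Reference solution from NumPy's least squares solver
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

print("Q has orthonormal columns:", np.allclose(Q.T @ Q, np.eye(2)))
print("R is upper triangular:", np.allclose(R, np.triu(R)))
print("QR reproduces A:", np.allclose(Q @ R, A))
print("Solutions agree:", np.allclose(x_qr, x_ref))
```

A dedicated triangular solver would exploit the structure of `R`; the general `np.linalg.solve` is used here only to keep the sketch dependency-free.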
Orthogonal Complements
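The orthogonal complement of a subspace $W \subseteq \mathbb{R}^n$ is the set of all vectors perpendicular to everything in $W$:

$$W^\perp = \{\mathbf{v} \in \mathbb{R}^n : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}$$

$W^\perp$ is itself a subspace, and $\dim W + \dim W^\perp = n$.
Every vector $\mathbf{v}$ can be uniquely decomposed as:

$$\mathbf{v} = \mathbf{v}_{\parallel} + \mathbf{v}_{\perp}$$

where $\mathbf{v}_{\parallel} \in W$ and $\mathbf{v}_{\perp} \in W^\perp$. The component $\mathbf{v}_{\parallel}$ is the orthogonal projection of $\mathbf{v}$ onto $W$.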
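The decomposition can be computed by projecting onto an orthonormal basis of the subspace. In this sketch the subspace is the column space of an example matrix `B`, and the names `v_par` and `v_perp` are illustrative labels for the two components.

```python
import numpy as np

# Columns of B span the subspace (an arbitrary example plane in R^3)
B = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
v = np.array([1.0, 2.0, 3.0])

Q, _ = np.linalg.qr(B)  # orthonormal basis for the subspace
v_par = Q @ (Q.T @ v)   # component inside the subspace
v_perp = v - v_par      # component in the orthogonal complement

print("Components sum to v:", np.allclose(v_par + v_perp, v))
print("v_perp is orthogonal to the subspace:", np.allclose(B.T @ v_perp, 0.0))
print("Pythagoras holds:",
      np.isclose(v @ v, v_par @ v_par + v_perp @ v_perp))
```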
Orthogonal Matrices
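An orthogonal matrix $Q$ is a square matrix that satisfies $Q^\top Q = Q Q^\top = I$, meaning its columns are orthonormal.
Key properties:
- $Q^{-1} = Q^\top$: inversion is just a transpose
- $\|Q\mathbf{x}\| = \|\mathbf{x}\|$: preserves lengths
- $\langle Q\mathbf{x}, Q\mathbf{y} \rangle = \langle \mathbf{x}, \mathbf{y} \rangle$: preserves inner products
Orthogonal matrices represent rotations ($\det Q = 1$) and reflections ($\det Q = -1$). They are the “rigid motions” of linear algebra: they move vectors around without distorting them.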
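Key insight: Orthogonal transformations are numerically ideal because they do not amplify errors: they preserve the 2-norm, so their condition number is 1. This is why QR decomposition (which produces an orthogonal factor) is preferred over solving the normal equations $A^\top A \mathbf{x} = A^\top \mathbf{b}$ for least squares, which squares the condition number of $A$, and why orthogonal initialization of neural network weights helps with training stability.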
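These properties are easy to verify numerically. The sketch below checks them for one rotation and one reflection in $\mathbb{R}^2$; the angle and the test vectors are arbitrary choices.

```python
import numpy as np

theta = np.pi / 6  # arbitrary rotation angle
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
reflection = np.array([[1.0,  0.0],
                       [0.0, -1.0]])  # flips the second coordinate

x = np.array([3.0, 4.0])
y = np.array([-1.0, 2.0])

for name, Q in [("rotation", rotation), ("reflection", reflection)]:
    print(name)
    print("  Q^T Q = I:", np.allclose(Q.T @ Q, np.eye(2)))
    print("  inverse equals transpose:", np.allclose(np.linalg.inv(Q), Q.T))
    print("  length preserved:", np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))
    print("  inner product preserved:", np.isclose((Q @ x) @ (Q @ y), x @ y))
    print("  determinant:", round(float(np.linalg.det(Q)), 6))
    print("  2-norm condition number:", round(float(np.linalg.cond(Q)), 6))
```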
Why This Matters for ML
- Cosine similarity drives similarity search in NLP embeddings, recommendation systems, and retrieval-augmented generation.
- Orthogonal projection is the mathematical core of least squares regression — projecting the target vector onto the feature space.
- Gram-Schmidt / QR decomposition provides numerically stable solutions to least squares problems.
- Orthogonal weight initialization (e.g., in RNNs) helps preserve gradient magnitudes across layers, mitigating vanishing/exploding gradients.
- Mahalanobis distance uses a weighted inner product to account for feature correlations, improving anomaly detection and clustering.
- Kernel methods generalize inner products to nonlinear feature spaces via the kernel trick.
Summary
- An inner product generalizes the dot product, providing notions of length, angle, and orthogonality.
- The Cauchy-Schwarz inequality bounds any inner product by the product of the norms, which is what makes angles and cosine similarity well defined.
- Cosine similarity measures directional alignment and is central to NLP and retrieval.
- Vectors are orthogonal when their inner product is zero: neither has any component along the other's direction.
- Orthogonal projection finds the closest point in a subspace and underlies least squares regression.
- Gram-Schmidt orthogonalizes vectors and leads to QR decomposition.
- Orthogonal matrices preserve geometry and provide numerical stability.
- Next, we discover the intrinsic structure of matrices through eigenvalues and eigenvectors.