Inner Products, Norms, and Orthogonality: Measuring Geometry

Linear Algebra Series 6 / 13

Introduction

Linear algebra without geometry is just symbol manipulation. The concepts of length, distance, angle, and perpendicularity are what give vectors their geometric meaning — and these concepts are built on a single operation: the inner product.

In machine learning, inner products are everywhere. Cosine similarity in NLP, Euclidean distance in clustering, orthogonal projections in least squares, and the Gram-Schmidt process in QR decomposition — all rest on the ideas in this article. We build on linear transformations and set the stage for eigenvalues.

The Inner Product

Standard Inner Product

The standard inner product (dot product) on $\mathbb{R}^n$ is:

\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^{n} u_i v_i

This is the same dot product we introduced in the vectors article, now written in the more general inner product notation $\langle \cdot, \cdot \rangle$ .

General Inner Product

An inner product on a vector space $V$ is any function $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R}$ satisfying:

Property	Statement
Symmetry	$\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$
Linearity (first argument)	$\langle a\mathbf{u} + b\mathbf{w}, \mathbf{v} \rangle = a\langle \mathbf{u}, \mathbf{v} \rangle + b\langle \mathbf{w}, \mathbf{v} \rangle$
Positive definiteness	$\langle \mathbf{v}, \mathbf{v} \rangle > 0$ for all $\mathbf{v} \neq \mathbf{0}$ , and $\langle \mathbf{0}, \mathbf{0} \rangle = 0$

The standard dot product is just one inner product. Others exist — for example, a weighted inner product $\langle \mathbf{u}, \mathbf{v} \rangle_\mathbf{M} = \mathbf{u}^T \mathbf{M} \mathbf{v}$ where $\mathbf{M}$ is positive definite. This appears in Mahalanobis distance, which accounts for feature correlations.
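As a rough illustration of the weighted inner product, the sketch below checks the three axioms numerically. The matrix `M` and the vectors are hypothetical values chosen only for this example, and `weighted_inner` is an illustrative helper name, not a library function.

```python
import numpy as np

def weighted_inner(u, v, M):
    """Weighted inner product <u, v>_M = u^T M v (M assumed symmetric positive definite)."""
    return u @ M @ v

# Hypothetical symmetric positive definite matrix, chosen only for illustration.
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])
u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
w = np.array([0.5, 4.0])
a, b = 2.0, -3.0

# Symmetry: <u, v>_M - <v, u>_M should be zero.
sym_gap = weighted_inner(u, v, M) - weighted_inner(v, u, M)

# Linearity in the first argument.
lin_gap = weighted_inner(a * u + b * w, v, M) - (
    a * weighted_inner(u, v, M) + b * weighted_inner(w, v, M)
)

# Positive definiteness of M: all eigenvalues are strictly positive.
eigs = np.linalg.eigvalsh(M)

print("symmetry gap:", sym_gap)
print("linearity gap:", lin_gap)
print("eigenvalues of M:", eigs)
print("<u, u>_M:", weighted_inner(u, u, M))
```

If `M` is taken to be the inverse of a covariance matrix, `weighted_inner(x - y, x - y, M)` is the squared Mahalanobis distance between `x` and `y`.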

Norms Revisited

Every inner product induces a norm (length measure):

\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}

For the standard inner product, this gives the Euclidean norm:

\|\mathbf{v}\|_2 = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}

Important Norm Properties

Cauchy-Schwarz Inequality:

|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|

Equality holds if and only if $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent (one is a scalar multiple of the other). This inequality is one of the most widely used in mathematics; among other things, it guarantees that the cosine formula gives values in $[-1, 1]$.

Triangle Inequality:

\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|

The direct path is never longer than going through an intermediate point.

Parallelogram Law:

\|\mathbf{u} + \mathbf{v}\|^2 + \|\mathbf{u} - \mathbf{v}\|^2 = 2\|\mathbf{u}\|^2 + 2\|\mathbf{v}\|^2
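The three properties above can be spot-checked numerically. This is only a sanity check on randomly drawn vectors (the seed and dimension are arbitrary assumptions), not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
u = rng.standard_normal(5)
v = rng.standard_normal(5)
norm = np.linalg.norm

# Cauchy-Schwarz: |<u, v>| <= ||u|| ||v||
cauchy_schwarz_ok = abs(u @ v) <= norm(u) * norm(v)

# Triangle inequality: ||u + v|| <= ||u|| + ||v||
triangle_ok = norm(u + v) <= norm(u) + norm(v)

# Parallelogram law: both sides should agree up to rounding.
lhs = norm(u + v) ** 2 + norm(u - v) ** 2
rhs = 2 * norm(u) ** 2 + 2 * norm(v) ** 2
parallelogram_ok = np.isclose(lhs, rhs)

# Equality case of Cauchy-Schwarz: w is a scalar multiple of u.
w = -3.0 * u
cs_equality = np.isclose(abs(u @ w), norm(u) * norm(w))

print(cauchy_schwarz_ok, triangle_ok, parallelogram_ok, cs_equality)
```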

Distance

The distance between two vectors is the norm of their difference:

d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|

For the Euclidean norm, this is the standard Euclidean distance used in k-nearest neighbors, k-means clustering, and many other algorithms.

Angles and Cosine Similarity

The angle $\theta$ between two nonzero vectors is defined by:

\cos\theta = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\| \cdot \|\mathbf{v}\|}

This ratio is called cosine similarity:

\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^T \mathbf{v}}{\|\mathbf{u}\|_2 \cdot \|\mathbf{v}\|_2}

Cosine similarity	Interpretation
$+1$	Vectors point in the same direction
$0$	Vectors are orthogonal (perpendicular)
$-1$	Vectors point in opposite directions

Key insight: Cosine similarity measures directional alignment regardless of magnitude. This is a large part of why it dominates in NLP: for many embeddings, semantic similarity is reflected mainly in direction rather than length. A short and a long document on the same topic have high cosine similarity even though their word-count vectors have very different magnitudes.
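A minimal sketch of this scale invariance, using made-up word-count vectors over a hypothetical four-word vocabulary (`cosine_similarity` is an illustrative helper, not a library call):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word-count vectors.
short_doc = np.array([2.0, 1.0, 0.0, 1.0])
long_doc = 10 * short_doc                    # same word proportions, ten times longer
other_doc = np.array([0.0, 0.0, 3.0, 0.0])   # shares no words with short_doc

sim_same_topic = cosine_similarity(short_doc, long_doc)
sim_unrelated = cosine_similarity(short_doc, other_doc)
euclid = np.linalg.norm(short_doc - long_doc)

print("cosine, same proportions:", sim_same_topic)
print("cosine, disjoint words:  ", sim_unrelated)
print("Euclidean distance, same proportions:", euclid)
```

The Euclidean distance between `short_doc` and `long_doc` is large even though they point in the same direction, which is the contrast the paragraph above describes.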

Orthogonality

Two vectors are orthogonal if their inner product is zero:

\mathbf{u} \perp \mathbf{v} \iff \langle \mathbf{u}, \mathbf{v} \rangle = 0

A set of vectors $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is:

Orthogonal if $\langle \mathbf{v}_i, \mathbf{v}_j \rangle = 0$ for all $i \neq j$
Orthonormal if additionally $\|\mathbf{v}_i\| = 1$ for all $i$

Orthonormal vectors are automatically linearly independent. Working with orthonormal bases makes everything simpler: coordinates become dot products, and matrix operations become clean.
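To make "coordinates become dot products" concrete, here is a small sketch with a hand-picked orthonormal basis of $\mathbb{R}^2$ (the basis and the vector `x` are assumptions for illustration):

```python
import numpy as np

# An orthonormal basis of R^2: the standard basis rotated by 45 degrees.
q1 = np.array([1.0, 1.0]) / np.sqrt(2)
q2 = np.array([-1.0, 1.0]) / np.sqrt(2)

x = np.array([3.0, 1.0])

# Coordinates of x in the basis {q1, q2} are plain inner products;
# no linear system needs to be solved.
c1, c2 = x @ q1, x @ q2

# Rebuild x from its coordinates.
x_rebuilt = c1 * q1 + c2 * q2

print("coordinates:", c1, c2)
print("rebuilt x:  ", x_rebuilt)
```

With a basis that is not orthonormal, finding the coordinates would instead require solving a linear system.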

The Pythagorean Theorem

If $\mathbf{u} \perp \mathbf{v}$ , then:

\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2

This generalizes to any number of mutually orthogonal vectors.
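The two-vector case follows directly from the inner product axioms. Expanding the squared norm with linearity and symmetry, the cross term vanishes because $\langle \mathbf{u}, \mathbf{v} \rangle = 0$:

```latex
\begin{aligned}
\|\mathbf{u} + \mathbf{v}\|^2
  &= \langle \mathbf{u} + \mathbf{v}, \mathbf{u} + \mathbf{v} \rangle \\
  &= \langle \mathbf{u}, \mathbf{u} \rangle + 2\langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{v}, \mathbf{v} \rangle \\
  &= \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2
\end{aligned}
```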

Orthogonal Projection

The projection of vector $\mathbf{v}$ onto a nonzero vector $\mathbf{u}$ is:

\text{proj}_\mathbf{u}(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} \mathbf{u} = \frac{\mathbf{u}^T \mathbf{v}}{\mathbf{u}^T \mathbf{u}} \mathbf{u}

The residual $\mathbf{v} - \text{proj}_\mathbf{u}(\mathbf{v})$ is orthogonal to $\mathbf{u}$ .
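A short sketch of the projection formula and the orthogonality of the residual. The helper name `project_onto` and the example vectors are assumptions made for illustration.

```python
import numpy as np

def project_onto(v, u):
    """Orthogonal projection of v onto the line spanned by a nonzero vector u."""
    return (u @ v) / (u @ u) * u

u = np.array([2.0, 0.0])
v = np.array([3.0, 4.0])

p = project_onto(v, u)   # component of v along u
r = v - p                # residual

print("projection:", p)
print("residual:  ", r)
print("residual . u =", r @ u)
```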

Projection onto a Subspace

More generally, projecting $\mathbf{v}$ onto the column space of a matrix $\mathbf{A}$ (with linearly independent columns) gives:

\text{proj}_{C(\mathbf{A})}(\mathbf{v}) = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T \mathbf{v}

The matrix $\mathbf{P} = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$ is the projection matrix. It satisfies $\mathbf{P}^2 = \mathbf{P}$ (applying it twice does nothing new) and $\mathbf{P}^T = \mathbf{P}$ (it is symmetric).

Key insight: Least squares regression is an orthogonal projection. When $\mathbf{X}\mathbf{w} = \mathbf{y}$ has no exact solution and $\mathbf{X}$ has linearly independent columns, the least squares solution is $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, and the fitted vector $\mathbf{X}\hat{\mathbf{w}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ is the projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residual $\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}$ is orthogonal to every column of $\mathbf{X}$.
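The sketch below builds the projection matrix for a small made-up regression problem and compares it with NumPy's least squares routine. The data are hypothetical, and forming the explicit inverse is done only to mirror the formula; it is not how one would normally compute the fit.

```python
import numpy as np

# Hypothetical overdetermined system: 4 observations, intercept + one feature.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Projection matrix onto the column space of X (textbook formula).
P = X @ np.linalg.inv(X.T @ X) @ X.T

# Least squares solution from a library routine.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ w_hat        # fitted values
residual = y - y_hat

print("P is idempotent:", np.allclose(P @ P, P))
print("P is symmetric: ", np.allclose(P, P.T))
print("P y equals X w_hat:", np.allclose(P @ y, y_hat))
print("X^T residual:", X.T @ residual)  # numerically close to zero
```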

The Gram-Schmidt Process

The Gram-Schmidt process converts any set of linearly independent vectors into an orthonormal set spanning the same subspace.

Algorithm: Given linearly independent vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ :

$\mathbf{u}_1 = \mathbf{v}_1$
For $j = 2, \ldots, k$:
- Subtract the projections onto all previous vectors:

\mathbf{u}_j = \mathbf{v}_j - \sum_{i=1}^{j-1} \frac{\langle \mathbf{v}_j, \mathbf{u}_i \rangle}{\langle \mathbf{u}_i, \mathbf{u}_i \rangle} \mathbf{u}_i

Normalize: $\mathbf{q}_i = \frac{\mathbf{u}_i}{\|\mathbf{u}_i\|}$ for $i = 1, \ldots, k$
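The steps above can be sketched as a small function for any number of vectors. This is the classical variant, written to follow the formula term by term; the function name `gram_schmidt`, the `tol` parameter, and the third example vector are assumptions added for illustration.

```python
import numpy as np

def gram_schmidt(vectors, tol=1e-12):
    """Orthonormalize linearly independent 1-D arrays (classical Gram-Schmidt)."""
    us = []
    for v in vectors:
        u = v.astype(float)  # astype returns a copy, so v is not modified
        # Subtract the projection of v onto each previously computed u_i.
        for ui in us:
            u = u - (v @ ui) / (ui @ ui) * ui
        # Hypothetical safeguard: a near-zero u means v depends on earlier vectors.
        if np.linalg.norm(u) < tol:
            raise ValueError("vectors are (numerically) linearly dependent")
        us.append(u)
    # Normalize every u_i to unit length.
    return [u / np.linalg.norm(u) for u in us]

vectors = [np.array([1, 1, 0]), np.array([1, 0, 1]), np.array([0, 1, 1])]
qs = gram_schmidt(vectors)

Q = np.column_stack(qs)
print(np.round(Q.T @ Q, 10))  # approximately the 3x3 identity
```

In floating point, classical Gram-Schmidt can gradually lose orthogonality on ill-conditioned inputs, so this sketch is best read as a teaching aid.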

Worked Example

Orthogonalize $\mathbf{v}_1 = [1, 1, 0]^T$ and $\mathbf{v}_2 = [1, 0, 1]^T$ .

Step 1: $\mathbf{u}_1 = \mathbf{v}_1 = [1, 1, 0]^T$

Step 2:

\begin{aligned} \mathbf{u}_2 &= \mathbf{v}_2 - \frac{\langle \mathbf{v}_2, \mathbf{u}_1 \rangle}{\langle \mathbf{u}_1, \mathbf{u}_1 \rangle} \mathbf{u}_1 \\[6pt] &= \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} - \frac{1}{2} \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix} \end{aligned}

Verify: $\langle \mathbf{u}_1, \mathbf{u}_2 \rangle = 1/2 - 1/2 + 0 = 0$ . Orthogonal.

Step 3: Normalize:

\mathbf{q}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \qquad \mathbf{q}_2 = \frac{1}{\sqrt{3/2}}\begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix} = \sqrt{\frac{2}{3}}\begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix}

import numpy as np

v1 = np.array([1, 1, 0], dtype=float)
v2 = np.array([1, 0, 1], dtype=float)

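# Gram-Schmidt on two vectors
u1 = v1                               # the first vector is kept as is
u2 = v2 - (v2 @ u1) / (u1 @ u1) * u1  # remove the component of v2 along u1

# Normalize to unit length
q1 = u1 / np.linalg.norm(u1)
q2 = u2 / np.linalg.norm(u2)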

print(f"q1 = {q1}")
print(f"q2 = {q2}")
print(f"q1 · q2 = {q1 @ q2:.10f}")  # ≈ 0 (orthogonal)

Connection to QR Decomposition

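Gram-Schmidt applied to the columns of $\mathbf{A}$ (assumed linearly independent) produces the QR decomposition: $\mathbf{A} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ has orthonormal columns and $\mathbf{R}$ is upper triangular. Solving least squares through QR is numerically more stable than solving the normal equations directly, although in floating point the factorization is usually computed with modified Gram-Schmidt or Householder reflections, because classical Gram-Schmidt can lose orthogonality.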
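As a hedged sketch of how this looks in practice, the snippet below factors the matrix whose columns are $\mathbf{v}_1$ and $\mathbf{v}_2$ from the worked example, then solves a least squares problem through $\mathbf{R}\mathbf{x} = \mathbf{Q}^T\mathbf{b}$. The right-hand side `b` is a made-up vector. `np.linalg.qr` does not use Gram-Schmidt internally, so its columns may differ in sign from the hand computation.

```python
import numpy as np

# Columns are v1 and v2 from the worked example.
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])  # hypothetical right-hand side

Q, R = np.linalg.qr(A)  # reduced QR: Q is 3x2, R is 2x2

# Least squares via QR: minimize ||A x - b|| by solving R x = Q^T b.
x_qr = np.linalg.solve(R, Q.T @ b)

# Reference solution from NumPy's least squares routine.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)

print("Q has orthonormal columns:", np.allclose(Q.T @ Q, np.eye(2)))
print("R is upper triangular:    ", np.allclose(R, np.triu(R)))
print("A == Q R:                 ", np.allclose(Q @ R, A))
print("QR solution matches lstsq:", np.allclose(x_qr, x_ref))
```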

Orthogonal Complements

The orthogonal complement of a subspace $W$ is the set of all vectors perpendicular to everything in $W$ :

W^\perp = \{\mathbf{v} \in V : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}

$W^\perp$ is itself a subspace, and when $V$ is finite-dimensional, $\dim(W) + \dim(W^\perp) = \dim(V)$.

Every vector $\mathbf{v} \in V$ can be uniquely decomposed as:

\mathbf{v} = \mathbf{w} + \mathbf{w}^\perp

where $\mathbf{w} \in W$ and $\mathbf{w}^\perp \in W^\perp$ . The component $\mathbf{w}$ is the orthogonal projection of $\mathbf{v}$ onto $W$ .
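The decomposition can be computed from any orthonormal basis of $W$. In the sketch below, $W$ is taken to be the column space of a small example matrix `A`, and `v` is an arbitrary vector; both are assumptions for illustration.

```python
import numpy as np

# W = column space of A (a plane in R^3).
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
v = np.array([1.0, 2.0, 3.0])

Q, _ = np.linalg.qr(A)   # columns of Q form an orthonormal basis of W
w = Q @ (Q.T @ v)        # component of v in W (orthogonal projection)
w_perp = v - w           # component of v in the orthogonal complement

print("w      =", w)
print("w_perp =", w_perp)
print("A^T w_perp =", A.T @ w_perp)  # numerically close to zero
```

Because `w_perp` is orthogonal to every column of `A`, it is orthogonal to everything in $W$.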

Orthogonal Matrices

An orthogonal matrix is a square matrix $\mathbf{Q}$ satisfying $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$, meaning its columns are orthonormal.

Key properties:

$\mathbf{Q}^{-1} = \mathbf{Q}^T$ — inversion is free
$\|\mathbf{Q}\mathbf{x}\|_2 = \|\mathbf{x}\|_2$ — preserves lengths
$\langle \mathbf{Q}\mathbf{u}, \mathbf{Q}\mathbf{v} \rangle = \langle \mathbf{u}, \mathbf{v} \rangle$ — preserves inner products
$\det(\mathbf{Q}) = \pm 1$

Orthogonal matrices represent rotations ( $\det = +1$ ) and reflections ( $\det = -1$ ). They are the “rigid motions” of linear algebra — they move vectors around without distorting them.
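The listed properties can be checked on a 2D rotation and a 2D reflection. The angle and the test vectors are arbitrary choices for this sketch.

```python
import numpy as np

theta = np.pi / 6  # arbitrary rotation angle
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
reflection = np.array([[1.0,  0.0],
                       [0.0, -1.0]])  # reflection across the x-axis

u = np.array([3.0, 4.0])
v = np.array([-1.0, 2.0])

for name, Q in (("rotation", rotation), ("reflection", reflection)):
    print(name)
    print("  Q^T Q == I:             ", np.allclose(Q.T @ Q, np.eye(2)))
    print("  inverse == transpose:   ", np.allclose(np.linalg.inv(Q), Q.T))
    print("  length preserved:       ", np.isclose(np.linalg.norm(Q @ u), np.linalg.norm(u)))
    print("  inner product preserved:", np.isclose((Q @ u) @ (Q @ v), u @ v))
    print("  det:", round(np.linalg.det(Q), 6))
```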

Key insight: Orthogonal transformations are numerically well behaved because they do not amplify errors: they preserve the 2-norm, so their condition number is 1. This is why QR decomposition (which produces a factor with orthonormal columns) is preferred over forming the normal equations for least squares, and why orthogonal initialization of neural network weights helps with training stability.
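One way to see the error-amplification point is through condition numbers. In the sketch below (random data with two nearly collinear columns, an assumption made for illustration), the orthonormal factor from QR has condition number close to 1, while forming $\mathbf{X}^T\mathbf{X}$ roughly squares the condition number of $\mathbf{X}$.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
X = rng.standard_normal((50, 3))
X[:, 2] = X[:, 0] + 1e-3 * X[:, 2]  # make two columns nearly collinear

Q, R = np.linalg.qr(X)

cond_X = np.linalg.cond(X)           # 2-norm condition number of X
cond_gram = np.linalg.cond(X.T @ X)  # condition number of the normal-equations matrix
cond_Q = np.linalg.cond(Q)           # orthonormal columns

print("cond(X)     =", cond_X)
print("cond(X^T X) =", cond_gram)
print("cond(X)^2   =", cond_X ** 2)
print("cond(Q)     =", cond_Q)
```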

Why This Matters for ML

Cosine similarity drives similarity search in NLP embeddings, recommendation systems, and retrieval-augmented generation.
Orthogonal projection is the mathematical core of least squares regression — projecting the target vector onto the feature space.
Gram-Schmidt / QR decomposition provides numerically stable solutions to least squares problems.
Orthogonal weight initialization (e.g., in RNNs) helps preserve gradient magnitudes across layers, mitigating vanishing/exploding gradients.
Mahalanobis distance uses a weighted inner product to account for feature correlations, improving anomaly detection and clustering.
Kernel methods generalize inner products to nonlinear feature spaces via the kernel trick.

Summary

An inner product generalizes the dot product, providing notions of length, angle, and orthogonality.
The Cauchy-Schwarz inequality $|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|$ is the foundational bound.
Cosine similarity measures directional alignment and is central to NLP and retrieval.
Vectors are orthogonal when their inner product is zero: neither has any component along the other.
Orthogonal projection finds the closest point in a subspace and underlies least squares regression.
Gram-Schmidt orthogonalizes vectors and leads to QR decomposition.
Orthogonal matrices preserve geometry and provide numerical stability.
Next, we discover the intrinsic structure of matrices through eigenvalues and eigenvectors.

References

Strang, G. (2016). Introduction to Linear Algebra (5th ed.). Wellesley-Cambridge Press. math.mit.edu/~gs/linearalgebra
Boyd, S., & Vandenberghe, L. (2018). Introduction to Applied Linear Algebra. Cambridge University Press. vmls-book.stanford.edu
Trefethen, L. N., & Bau, D. (1997). Numerical Linear Algebra. SIAM. Lectures 7-8.