Inner Products, Norms, and Orthogonality: Measuring Geometry

Master inner products, distance metrics, orthogonal projections, and Gram-Schmidt — the geometric tools behind PCA, least squares, and embeddings.

Linear Algebra March 6, 2026 8 min read

Introduction

Linear algebra without geometry is just symbol manipulation. The concepts of length, distance, angle, and perpendicularity are what give vectors their geometric meaning — and these concepts are built on a single operation: the inner product.

In machine learning, inner products are everywhere. Cosine similarity in NLP, Euclidean distance in clustering, orthogonal projections in least squares, and the Gram-Schmidt process in QR decomposition — all rest on the ideas in this article. We build on linear transformations and set the stage for eigenvalues.

The Inner Product

Standard Inner Product

The standard inner product (dot product) on $\mathbb{R}^n$ is:

$$\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v} = \sum_{i=1}^{n} u_i v_i$$

This is the same dot product we introduced in the vectors article, now written in the more general inner product notation $\langle \cdot, \cdot \rangle$.

General Inner Product

An inner product on a real vector space $V$ is any function $\langle \cdot, \cdot \rangle: V \times V \to \mathbb{R}$ satisfying:

| Property | Statement |
| --- | --- |
| Symmetry | $\langle \mathbf{u}, \mathbf{v} \rangle = \langle \mathbf{v}, \mathbf{u} \rangle$ |
| Linearity (first argument) | $\langle a\mathbf{u} + b\mathbf{w}, \mathbf{v} \rangle = a\langle \mathbf{u}, \mathbf{v} \rangle + b\langle \mathbf{w}, \mathbf{v} \rangle$ |
| Positive definiteness | $\langle \mathbf{v}, \mathbf{v} \rangle > 0$ for all $\mathbf{v} \neq \mathbf{0}$, and $\langle \mathbf{0}, \mathbf{0} \rangle = 0$ |

Symmetry together with linearity in the first argument gives linearity in the second argument as well.

The standard dot product is just one inner product. Others exist: for example, the weighted inner product $\langle \mathbf{u}, \mathbf{v} \rangle_\mathbf{M} = \mathbf{u}^T \mathbf{M} \mathbf{v}$, where $\mathbf{M}$ is symmetric positive definite. This appears in Mahalanobis distance, where $\mathbf{M}$ is the inverse of the data's covariance matrix, so the distance accounts for feature scales and correlations.
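As a rough illustration, the three axioms can be checked numerically for a weighted inner product. This is a minimal sketch: the helper name `weighted_inner` and the matrix `M` are made up for the example, and a numerical check on a few vectors is a sanity test, not a proof.

```python
import numpy as np

def weighted_inner(u, v, M):
    """Weighted inner product <u, v>_M = u^T M v (hypothetical helper)."""
    return u @ M @ v

# Illustrative symmetric positive definite matrix (values chosen for the example)
M = np.array([[2.0, 0.5],
              [0.5, 1.0]])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
w = np.array([0.5, 4.0])
a, b = 2.0, -3.0

# Symmetry: <u, v> - <v, u> should be zero
sym_gap = weighted_inner(u, v, M) - weighted_inner(v, u, M)

# Linearity in the first argument: <au + bw, v> - (a<u, v> + b<w, v>) should be zero
lin_gap = weighted_inner(a * u + b * w, v, M) - (
    a * weighted_inner(u, v, M) + b * weighted_inner(w, v, M)
)

# Positive definiteness holds when every eigenvalue of the symmetric M is positive
eigs = np.linalg.eigvalsh(M)

print("symmetry gap: ", sym_gap)
print("linearity gap:", lin_gap)
print("eigenvalues of M all positive:", bool(np.all(eigs > 0)))
```

Setting `M` to the identity matrix recovers the standard dot product.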

Norms Revisited

Every inner product induces a norm (length measure):

$$\|\mathbf{v}\| = \sqrt{\langle \mathbf{v}, \mathbf{v} \rangle}$$

For the standard inner product, this gives the Euclidean norm:

$$\|\mathbf{v}\|_2 = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}$$

Important Norm Properties

Cauchy-Schwarz Inequality:

$$|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|$$

Equality holds if and only if $\mathbf{u}$ and $\mathbf{v}$ are linearly dependent, that is, one is a scalar multiple of the other. This is one of the most widely used inequalities in mathematics; here it guarantees that the cosine formula gives values in $[-1, 1]$.

Triangle Inequality:

$$\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$$

The direct path is never longer than going through an intermediate point.

Parallelogram Law:

$$\|\mathbf{u} + \mathbf{v}\|^2 + \|\mathbf{u} - \mathbf{v}\|^2 = 2\|\mathbf{u}\|^2 + 2\|\mathbf{v}\|^2$$
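The three properties above can be spot-checked numerically for the Euclidean norm. The sketch below uses random vectors with a fixed seed; the seed and dimension are arbitrary choices, and the check illustrates the statements rather than proving them.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
u = rng.normal(size=5)
v = rng.normal(size=5)

norm = np.linalg.norm

# Cauchy-Schwarz: |<u, v>| <= ||u|| ||v||
cauchy_schwarz_holds = bool(abs(u @ v) <= norm(u) * norm(v))

# Equality case: v replaced by a scalar multiple of u
cs_equal = bool(np.isclose(abs(u @ (3 * u)), norm(u) * norm(3 * u)))

# Triangle inequality: ||u + v|| <= ||u|| + ||v||
triangle_holds = bool(norm(u + v) <= norm(u) + norm(v))

# Parallelogram law: both sides should agree up to rounding
lhs = norm(u + v) ** 2 + norm(u - v) ** 2
rhs = 2 * norm(u) ** 2 + 2 * norm(v) ** 2

print(cauchy_schwarz_holds, cs_equal, triangle_holds, np.isclose(lhs, rhs))
```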

Distance

The distance between two vectors is the norm of their difference:

$$d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$$

For the Euclidean norm, this is the standard Euclidean distance used in k-nearest neighbors, k-means clustering, and many other algorithms.
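To make the link to nearest-neighbor methods concrete, here is a small sketch of a single nearest-neighbor lookup. The points and the query are made-up values; a real k-NN implementation would add tie handling and usually a faster search structure.

```python
import numpy as np

# Made-up data: three points in the plane and one query point
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [1.0, 1.0]])
query = np.array([1.0, 2.0])

# d(u, v) = ||u - v||, computed for every row at once via broadcasting
dists = np.linalg.norm(points - query, axis=1)

# Index of the closest point
nearest = int(np.argmin(dists))

print("distances:", dists)
print("nearest point:", points[nearest])
```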

Angles and Cosine Similarity

The angle $\theta$ between two nonzero vectors is defined by:

$$\cos\theta = \frac{\langle \mathbf{u}, \mathbf{v} \rangle}{\|\mathbf{u}\| \cdot \|\mathbf{v}\|}$$

This ratio is called cosine similarity:

$$\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^T \mathbf{v}}{\|\mathbf{u}\|_2 \cdot \|\mathbf{v}\|_2}$$

| Cosine similarity | Interpretation |
| --- | --- |
| $+1$ | Vectors point in the same direction |
| $0$ | Vectors are orthogonal (perpendicular) |
| $-1$ | Vectors point in opposite directions |

Key insight: Cosine similarity measures direction alignment regardless of magnitude. This is why it dominates in NLP: much of the meaning of a word embedding is encoded in its direction rather than its length. Two documents on similar topics have high cosine similarity even if one is far longer than the other and their word count vectors have very different magnitudes.
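The scale invariance is easy to see in code. In this sketch the word-count vectors are hypothetical, over an imagined four-word vocabulary, and `cosine_similarity` is a helper written for the example.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical word-count vectors over a 4-word vocabulary
short_doc = np.array([2.0, 1.0, 0.0, 1.0])
long_doc = 10 * short_doc                    # same word proportions, ten times longer
other_doc = np.array([0.0, 0.0, 5.0, 0.0])   # shares no words with short_doc

sim_same = cosine_similarity(short_doc, long_doc)
sim_other = cosine_similarity(short_doc, other_doc)
dist_same = np.linalg.norm(short_doc - long_doc)

print("cosine(short, long): ", sim_same)
print("cosine(short, other):", sim_other)
print("Euclidean distance(short, long):", dist_same)
```

The two documents with identical word proportions are far apart in Euclidean distance yet have cosine similarity 1 up to rounding, which is the behavior described above.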

Orthogonality

Two vectors are orthogonal if their inner product is zero:

$$\mathbf{u} \perp \mathbf{v} \iff \langle \mathbf{u}, \mathbf{v} \rangle = 0$$

A set of vectors $\{\mathbf{v}_1, \ldots, \mathbf{v}_k\}$ is:

  • Orthogonal if $\langle \mathbf{v}_i, \mathbf{v}_j \rangle = 0$ for all $i \neq j$
  • Orthonormal if additionally $\|\mathbf{v}_i\| = 1$ for all $i$

Orthonormal vectors are automatically linearly independent. Working with orthonormal bases makes everything simpler: coordinates become dot products, and matrix operations become clean.
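The claim that coordinates become dot products can be sketched as follows. The basis here is the standard basis of the plane rotated by 45 degrees, chosen only for illustration.

```python
import numpy as np

# An orthonormal basis of R^2 (standard basis rotated by 45 degrees)
q1 = np.array([1.0, 1.0]) / np.sqrt(2)
q2 = np.array([-1.0, 1.0]) / np.sqrt(2)

x = np.array([3.0, 1.0])

# With an orthonormal basis, each coordinate is a single inner product;
# no linear system has to be solved
c1, c2 = x @ q1, x @ q2

# Rebuilding x from its coordinates recovers the original vector
reconstructed = c1 * q1 + c2 * q2

print("coordinates:", c1, c2)
print("reconstructed:", reconstructed)
```

For a basis that is not orthonormal, finding the coordinates would instead require solving a linear system.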

The Pythagorean Theorem

If $\mathbf{u} \perp \mathbf{v}$, then:

$$\|\mathbf{u} + \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2$$

This generalizes to any number of mutually orthogonal vectors.
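For two vectors, the identity follows in a few lines by expanding the squared norm with linearity and symmetry of the inner product:

```latex
\begin{aligned}
\|\mathbf{u} + \mathbf{v}\|^2
  &= \langle \mathbf{u} + \mathbf{v}, \mathbf{u} + \mathbf{v} \rangle \\
  &= \langle \mathbf{u}, \mathbf{u} \rangle + 2\langle \mathbf{u}, \mathbf{v} \rangle + \langle \mathbf{v}, \mathbf{v} \rangle \\
  &= \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2
  \qquad \text{since } \langle \mathbf{u}, \mathbf{v} \rangle = 0.
\end{aligned}
```

With more vectors the same expansion applies, and every cross term vanishes because the vectors are mutually orthogonal.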

Orthogonal Projection

The projection of vector $\mathbf{v}$ onto vector $\mathbf{u}$ is:

$$\text{proj}_\mathbf{u}(\mathbf{v}) = \frac{\langle \mathbf{v}, \mathbf{u} \rangle}{\langle \mathbf{u}, \mathbf{u} \rangle} \mathbf{u} = \frac{\mathbf{u}^T \mathbf{v}}{\mathbf{u}^T \mathbf{u}} \mathbf{u}$$

The residual $\mathbf{v} - \text{proj}_\mathbf{u}(\mathbf{v})$ is orthogonal to $\mathbf{u}$.
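A minimal sketch of the formula for the standard inner product. The function name `project_onto_vector` and the example vectors are illustrative, and the function assumes `u` is nonzero.

```python
import numpy as np

def project_onto_vector(v, u):
    """Orthogonal projection of v onto the line spanned by a nonzero u."""
    return (u @ v) / (u @ u) * u

u = np.array([2.0, 0.0])
v = np.array([3.0, 4.0])

p = project_onto_vector(v, u)   # component of v along u
r = v - p                       # residual

print("projection:", p)
print("residual:  ", r)
print("residual · u =", r @ u)
```

Note that the result depends only on the direction of `u`: rescaling `u` leaves the projection unchanged.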

Projection onto a Subspace

More generally, projecting $\mathbf{v}$ onto the column space of a matrix $\mathbf{A}$ (with linearly independent columns) gives:

$$\text{proj}_{C(\mathbf{A})}(\mathbf{v}) = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T \mathbf{v}$$

The matrix $\mathbf{P} = \mathbf{A}(\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T$ is the projection matrix. It satisfies $\mathbf{P}^2 = \mathbf{P}$ (applying it twice does nothing new) and $\mathbf{P}^T = \mathbf{P}$ (it is symmetric).
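Both properties of the projection matrix can be checked on a small example. The matrix `A` below is an arbitrary 3×2 matrix with independent columns; the sketch uses `np.linalg.solve` rather than an explicit inverse, which is a common numerical habit and not something the formula requires.

```python
import numpy as np

# Arbitrary example matrix with linearly independent columns
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

# P = A (A^T A)^{-1} A^T, computed by solving (A^T A) X = A^T for X
P = A @ np.linalg.solve(A.T @ A, A.T)

v = np.array([1.0, 0.0, 2.0])
p = P @ v   # projection of v onto the column space of A

print("P @ P == P:", np.allclose(P @ P, P))
print("P.T == P:  ", np.allclose(P.T, P))
print("A^T (v - p):", A.T @ (v - p))   # approximately zero
```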

Key insight: Least squares regression is an orthogonal projection. When $\mathbf{X}\mathbf{w} = \mathbf{y}$ has no exact solution and $\mathbf{X}$ has linearly independent columns, the least squares solution is $\hat{\mathbf{w}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$, and the fitted vector $\mathbf{X}\hat{\mathbf{w}}$ is the orthogonal projection of $\mathbf{y}$ onto the column space of $\mathbf{X}$. The residual $\mathbf{y} - \mathbf{X}\hat{\mathbf{w}}$ is orthogonal to every column of $\mathbf{X}$.
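A small line-fitting sketch shows the orthogonality of the residual. The data values are invented for the example; the normal-equations solve mirrors the formula in the text, and `np.linalg.lstsq` is included only as a cross-check.

```python
import numpy as np

# Invented data: fit y ≈ w0 + w1 * t at t = 0, 1, 2, 3
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 2.0, 4.0])

# Normal equations: (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library least squares solver, as a cross-check
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# The residual should be orthogonal to every column of X
residual = y - X @ w_normal

print("w (normal equations):", w_normal)
print("w (lstsq):           ", w_lstsq)
print("X^T residual:        ", X.T @ residual)   # approximately zero
```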

The Gram-Schmidt Process

The Gram-Schmidt process converts any set of linearly independent vectors into an orthonormal set spanning the same subspace.

Algorithm: Given linearly independent vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$:

  1. Set $\mathbf{u}_1 = \mathbf{v}_1$.
  2. For $j = 2, \ldots, k$, subtract the projections onto all previous vectors:

$$\mathbf{u}_j = \mathbf{v}_j - \sum_{i=1}^{j-1} \frac{\langle \mathbf{v}_j, \mathbf{u}_i \rangle}{\langle \mathbf{u}_i, \mathbf{u}_i \rangle} \mathbf{u}_i$$

  3. Normalize each vector: $\mathbf{q}_i = \frac{\mathbf{u}_i}{\|\mathbf{u}_i\|}$ for $i = 1, \ldots, k$.
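The steps above can be sketched as a short function for any number of vectors. This is the classical variant exactly as stated, using the standard dot product; the name `gram_schmidt` and the three test vectors are choices made for this example, and the input is assumed to be linearly independent.

```python
import numpy as np

def gram_schmidt(vectors):
    """Classical Gram-Schmidt on a sequence of linearly independent 1-D arrays.

    Returns a list of orthonormal vectors spanning the same subspace.
    """
    us = []
    for v in vectors:
        v = np.asarray(v, dtype=float)
        u = v.copy()
        # Subtract the projection of v onto each previously built u_i
        for prev in us:
            u -= (v @ prev) / (prev @ prev) * prev
        us.append(u)
    # Normalize every u_i to unit length
    return [u / np.linalg.norm(u) for u in us]

# Three linearly independent vectors in R^3 (example values)
qs = gram_schmidt([[1, 1, 0], [1, 0, 1], [0, 1, 1]])
Q = np.column_stack(qs)

print(np.round(Q.T @ Q, 10))   # approximately the 3x3 identity
```

If the inputs were linearly dependent, some `u` would come out as (nearly) the zero vector and the normalization step would divide by zero, which is why independence is required.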

Worked Example

Orthogonalize $\mathbf{v}_1 = [1, 1, 0]^T$ and $\mathbf{v}_2 = [1, 0, 1]^T$.

Step 1: $\mathbf{u}_1 = \mathbf{v}_1 = [1, 1, 0]^T$

Step 2: With $\langle \mathbf{v}_2, \mathbf{u}_1 \rangle = 1$ and $\langle \mathbf{u}_1, \mathbf{u}_1 \rangle = 2$:

$$\begin{aligned} \mathbf{u}_2 &= \mathbf{v}_2 - \frac{\langle \mathbf{v}_2, \mathbf{u}_1 \rangle}{\langle \mathbf{u}_1, \mathbf{u}_1 \rangle} \mathbf{u}_1 \\[6pt] &= \begin{bmatrix} 1 \\ 0 \\ 1 \end{bmatrix} - \frac{1}{2} \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix} \end{aligned}$$

Verify: $\langle \mathbf{u}_1, \mathbf{u}_2 \rangle = 1/2 - 1/2 + 0 = 0$. Orthogonal.

Step 3: Normalize:

$$\mathbf{q}_1 = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \qquad \mathbf{q}_2 = \frac{1}{\sqrt{3/2}}\begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix} = \sqrt{\frac{2}{3}}\begin{bmatrix} 1/2 \\ -1/2 \\ 1 \end{bmatrix}$$

The same computation in NumPy:

```python
import numpy as np

v1 = np.array([1, 1, 0], dtype=float)
v2 = np.array([1, 0, 1], dtype=float)

# Gram-Schmidt: keep v1, then remove from v2 its component along u1
u1 = v1
u2 = v2 - (v2 @ u1) / (u1 @ u1) * u1

# Normalize both vectors to unit length
q1 = u1 / np.linalg.norm(u1)
q2 = u2 / np.linalg.norm(u2)

print(f"q1 = {q1}")
print(f"q2 = {q2}")
print(f"q1 · q2 = {q1 @ q2:.10f}")  # ≈ 0 (orthogonal)
```

Connection to QR Decomposition

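Gram-Schmidt applied to the columns of $\mathbf{A}$ produces the QR decomposition $\mathbf{A} = \mathbf{Q}\mathbf{R}$, where $\mathbf{Q}$ has orthonormal columns and $\mathbf{R}$ is upper triangular. QR decomposition is a numerically stable way to solve least squares problems. In floating point, however, classical Gram-Schmidt can lose orthogonality, so numerical libraries typically compute QR with Householder reflections instead.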
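As a quick comparison, the sketch below runs `np.linalg.qr` on the matrix whose columns are the two vectors from the worked example. The library result should match the hand-computed $\mathbf{q}_1, \mathbf{q}_2$ only up to the sign of each column, since the factorization is unique only up to such sign flips.

```python
import numpy as np

# Columns are v1 = [1, 1, 0] and v2 = [1, 0, 1] from the worked example
A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Default mode is "reduced": Q is 3x2 with orthonormal columns, R is 2x2
Q, R = np.linalg.qr(A)

# Hand-computed Gram-Schmidt result, for comparison
q1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
q2 = np.sqrt(2 / 3) * np.array([0.5, -0.5, 1.0])

# Compare absolute values because a column of Q may carry the opposite sign
same_up_to_sign = np.allclose(np.abs(Q), np.abs(np.column_stack([q1, q2])))

print("Q R == A:", np.allclose(Q @ R, A))
print("R upper triangular:", np.allclose(R, np.triu(R)))
print("matches Gram-Schmidt up to sign:", same_up_to_sign)
```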

Orthogonal Complements

The orthogonal complement of a subspace $W$ is the set of all vectors perpendicular to everything in $W$:

$$W^\perp = \{\mathbf{v} \in V : \langle \mathbf{v}, \mathbf{w} \rangle = 0 \text{ for all } \mathbf{w} \in W\}$$

$W^\perp$ is itself a subspace, and when $V$ is finite-dimensional, $\dim(W) + \dim(W^\perp) = \dim(V)$.

Every vector $\mathbf{v} \in V$ can be uniquely decomposed as:

$$\mathbf{v} = \mathbf{w} + \mathbf{w}^\perp$$

where $\mathbf{w} \in W$ and $\mathbf{w}^\perp \in W^\perp$. The component $\mathbf{w}$ is the orthogonal projection of $\mathbf{v}$ onto $W$.
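The decomposition can be computed directly once an orthonormal basis of $W$ is available. In this sketch $W$ is the span of the columns of an example matrix `B`, and `np.linalg.qr` is used only as a convenient way to get an orthonormal basis for that span.

```python
import numpy as np

# W = span of the columns of B, a 2-dimensional subspace of R^3 (example values)
B = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

Q, _ = np.linalg.qr(B)        # columns of Q: orthonormal basis of W
v = np.array([1.0, 2.0, 0.0])

w = Q @ (Q.T @ v)             # component of v in W (orthogonal projection)
w_perp = v - w                # component of v in the orthogonal complement

print("w      =", w)
print("w_perp =", w_perp)
print("B^T w_perp =", B.T @ w_perp)   # approximately zero
```

Here $\dim(W) = 2$ and $\dim(W^\perp) = 1$, which adds up to the dimension of the ambient space, 3.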

Orthogonal Matrices

An orthogonal matrix $\mathbf{Q}$ is a square matrix satisfying $\mathbf{Q}^T\mathbf{Q} = \mathbf{I}$, meaning its columns are orthonormal.

Key properties:

  • $\mathbf{Q}^{-1} = \mathbf{Q}^T$: inversion costs only a transpose
  • $\|\mathbf{Q}\mathbf{x}\|_2 = \|\mathbf{x}\|_2$: preserves lengths
  • $\langle \mathbf{Q}\mathbf{u}, \mathbf{Q}\mathbf{v} \rangle = \langle \mathbf{u}, \mathbf{v} \rangle$: preserves inner products
  • $\det(\mathbf{Q}) = \pm 1$

Orthogonal matrices represent rotations ($\det = +1$) and reflections, possibly combined with a rotation ($\det = -1$). They are the “rigid motions” of linear algebra: they move vectors around without distorting them.

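Key insight: Orthogonal transformations are numerically well behaved because they do not amplify errors: they preserve the 2-norm of any perturbation. This is why QR decomposition (which produces an orthogonal factor) is preferred for least squares over solving the normal equations $\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$ directly, which squares the condition number, and why orthogonal initialization of neural network weights helps with training stability.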
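The listed properties can be verified on the two simplest cases, a plane rotation and a reflection. The angle and the test vectors below are arbitrary choices for the sketch.

```python
import numpy as np

theta = np.pi / 6   # arbitrary rotation angle
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
reflection = np.array([[1.0,  0.0],
                       [0.0, -1.0]])   # reflection across the x-axis

x = np.array([3.0, 4.0])
y = np.array([-1.0, 2.0])

for name, Q in [("rotation", rotation), ("reflection", reflection)]:
    print(name)
    print("  Q^T Q = I:              ", np.allclose(Q.T @ Q, np.eye(2)))
    print("  inverse is transpose:   ", np.allclose(np.linalg.inv(Q), Q.T))
    print("  length preserved:       ", np.isclose(np.linalg.norm(Q @ x), np.linalg.norm(x)))
    print("  inner product preserved:", np.isclose((Q @ x) @ (Q @ y), x @ y))
    print("  det:                    ", round(float(np.linalg.det(Q)), 6))
```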

Why This Matters for ML

  • Cosine similarity drives similarity search in NLP embeddings, recommendation systems, and retrieval-augmented generation.
  • Orthogonal projection is the mathematical core of least squares regression — projecting the target vector onto the feature space.
  • Gram-Schmidt / QR decomposition provides numerically stable solutions to least squares problems.
  • Orthogonal weight initialization (e.g., in RNNs) helps preserve gradient magnitudes across layers, mitigating vanishing/exploding gradients.
  • Mahalanobis distance uses a weighted inner product to account for feature correlations, improving anomaly detection and clustering.
  • Kernel methods generalize inner products to nonlinear feature spaces via the kernel trick.

Summary

  • An inner product generalizes the dot product, providing notions of length, angle, and orthogonality.
  • The Cauchy-Schwarz inequality $|\langle \mathbf{u}, \mathbf{v} \rangle| \leq \|\mathbf{u}\| \cdot \|\mathbf{v}\|$ is the foundational bound.
  • Cosine similarity measures directional alignment and is central to NLP and retrieval.
  • Vectors are orthogonal when their inner product is zero; nonzero orthogonal vectors are linearly independent, so none duplicates a direction covered by the others.
  • Orthogonal projection finds the closest point in a subspace and underlies least squares regression.
  • Gram-Schmidt orthogonalizes vectors and leads to QR decomposition.
  • Orthogonal matrices preserve geometry and provide numerical stability.
  • Next, we discover the intrinsic structure of matrices through eigenvalues and eigenvectors.

