Introduction
Every dataset you will ever work with in machine learning is, at its core, a collection of vectors. An image is a vector of pixel values. A sentence is a sequence of word-embedding vectors. A user profile is a vector of features. Understanding vectors is not optional: it is the entry point to everything that follows.
This article builds the vocabulary and intuition you need. We start with the concrete (arrows in space) and move toward the abstract (vector spaces over arbitrary fields), always keeping one eye on why these ideas matter for ML.
What Is a Vector?
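A vector is an ordered list of numbers. In $\mathbb{R}^n$, a vector $\mathbf{v}$ has $n$ components:
$$\mathbf{v} = (v_1, v_2, \dots, v_n)$$
We write $\mathbf{v} \in \mathbb{R}^n$ to say that $\mathbf{v}$ lives in $n$-dimensional real space.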
There are two complementary ways to think about vectors:
- Algebraic view: A vector is a tuple of numbers, that is, a point in $\mathbb{R}^n$.
- Geometric view: A vector is an arrow from the origin to that point, carrying both magnitude and direction.
Both views are useful. The algebraic view lets us compute. The geometric view lets us reason about distances, angles, and projections — concepts that appear everywhere in ML.
Example: A data point with three features (age, income, and credit score) is a vector $\mathbf{x} = (x_{\text{age}}, x_{\text{income}}, x_{\text{credit}}) \in \mathbb{R}^3$. Each feature is a dimension.
Basic Vector Operations
Addition
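Two vectors of the same dimension can be added component-wise:
$$\mathbf{u} + \mathbf{v} = (u_1 + v_1,\; u_2 + v_2,\; \dots,\; u_n + v_n)$$
Geometrically, vector addition follows the parallelogram rule: place the tail of $\mathbf{v}$ at the head of $\mathbf{u}$, and the sum $\mathbf{u} + \mathbf{v}$ is the diagonal of the resulting parallelogram.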
Scalar Multiplication
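Multiplying a vector $\mathbf{v}$ by a scalar $c$ scales every component:
$$c\,\mathbf{v} = (c\,v_1,\; c\,v_2,\; \dots,\; c\,v_n)$$
If $|c| > 1$, the vector stretches. If $|c| < 1$, it shrinks. If $c < 0$, it flips direction.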
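Both operations can be tried out directly with NumPy arrays. The snippet below is a minimal sketch; the vectors and scalars are arbitrary values chosen for illustration.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Component-wise addition: (1+4, 2+5, 3+6)
vec_sum = u + v

# Scalar multiplication scales every component
stretched = 2.0 * v   # |c| > 1: longer, same direction
shrunk = 0.5 * v      # |c| < 1: shorter, same direction
flipped = -1.0 * v    # c < 0: direction reversed

print(vec_sum)   # [5. 7. 9.]
print(flipped)   # [-4. -5. -6.]
```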
Properties
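For all vectors $\mathbf{u}, \mathbf{v}, \mathbf{w} \in \mathbb{R}^n$ and scalars $a, b \in \mathbb{R}$:
| Property | Statement |
|---|---|
| Commutativity | $\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$ |
| Associativity | $(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$ |
| Zero vector | $\mathbf{v} + \mathbf{0} = \mathbf{v}$ |
| Additive inverse | $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$ |
| Distributivity | $a(\mathbf{u} + \mathbf{v}) = a\mathbf{u} + a\mathbf{v}$ and $(a + b)\mathbf{v} = a\mathbf{v} + b\mathbf{v}$ |
| Scalar associativity | $a(b\mathbf{v}) = (ab)\mathbf{v}$ |
| Identity | $1\mathbf{v} = \mathbf{v}$ |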
These properties are not just bookkeeping. They are exactly the axioms that define a vector space.
The Dot Product
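The dot product (or inner product) of two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ is:
$$\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \dots + u_n v_n$$
The dot product is a single number, a scalar. It encodes geometric information:
$$\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\|\,\|\mathbf{v}\|\cos\theta$$
where $\theta$ is the angle between the two vectors, and $\|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}}$ is the Euclidean norm (length).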
What the Dot Product Tells Us
| Value of $\mathbf{u} \cdot \mathbf{v}$ | Meaning |
|---|---|
| Positive | Vectors point in similar directions ($\theta < 90^\circ$) |
| Zero | Vectors are orthogonal (perpendicular, $\theta = 90^\circ$) |
| Negative | Vectors point in opposing directions ($\theta > 90^\circ$) |
Key insight: The dot product measures similarity in direction. This is why cosine similarity, $\cos\theta = \dfrac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$, is one of the most widely used similarity metrics in NLP, recommendation systems, and retrieval.
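Cosine similarity is short enough to write out by hand. The sketch below assumes dense NumPy vectors; the helper name `cosine_similarity` and the test vectors are illustrative choices, and the function raises an error for a zero vector because the angle is undefined there.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two nonzero vectors."""
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return float(np.dot(u, v) / norm_product)

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])    # same direction as a, different length
c = np.array([0.0, 3.0])    # perpendicular to a
d = np.array([-1.0, 0.0])   # opposite to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
print(cosine_similarity(a, d))  # -1.0
```

Note that `b` is twice as long as `a` yet still scores 1.0: cosine similarity ignores magnitude and compares direction only.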
Vector Norms
A norm assigns a non-negative length to every vector. The most common norms are:
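L2 norm (Euclidean norm):
$$\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}$$
L1 norm (Manhattan norm):
$$\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|$$
L-infinity norm (max norm):
$$\|\mathbf{v}\|_\infty = \max_{1 \le i \le n} |v_i|$$
General Lp norm (for $p \ge 1$):
$$\|\mathbf{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}$$
Every norm must satisfy three properties: (1) $\|\mathbf{v}\| \ge 0$, with equality only when $\mathbf{v} = \mathbf{0}$, (2) $\|c\,\mathbf{v}\| = |c|\,\|\mathbf{v}\|$, and (3) the triangle inequality $\|\mathbf{u} + \mathbf{v}\| \le \|\mathbf{u}\| + \|\mathbf{v}\|$.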
Key insight: In ML, L1 and L2 norms appear directly in regularization. L1 regularization (Lasso) encourages sparsity. L2 regularization (Ridge) encourages small weights. The choice of norm shapes the geometry of the solution space.
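These norms are all available through `np.linalg.norm` via its `ord` argument. The sketch below also writes the general Lp norm out from its definition; `lp_norm` is a hypothetical helper for illustration, not a NumPy function.

```python
import numpy as np

v = np.array([3.0, -4.0])

l2 = np.linalg.norm(v)                  # sqrt(3^2 + 4^2) = 5.0
l1 = np.linalg.norm(v, ord=1)           # |3| + |-4| = 7.0
linf = np.linalg.norm(v, ord=np.inf)    # max(|3|, |-4|) = 4.0

def lp_norm(x: np.ndarray, p: float) -> float:
    """General Lp norm for p >= 1, written out from the definition."""
    return float(np.sum(np.abs(x) ** p) ** (1.0 / p))

print(l2, l1, linf)   # 5.0 7.0 4.0
```

For the same vector, the L1 norm is never smaller than the L2 norm, which in turn is never smaller than the max norm.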
Linear Combinations
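A linear combination of vectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k$ is any expression of the form:
$$c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \dots + c_k \mathbf{v}_k$$
where $c_1, c_2, \dots, c_k$ are scalars called coefficients.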
Linear combinations are the central operation in linear algebra. A neural network computes linear combinations of inputs at every layer. A prediction in linear regression is a linear combination of features. PCA finds directions that are linear combinations of the original axes.
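One way to see why linear combinations are so pervasive: a matrix-vector product is a linear combination of the matrix's columns. The sketch below uses made-up vectors and coefficients to show the two computations agree.

```python
import numpy as np

# Three illustrative vectors in R^2 and made-up coefficients
v1 = np.array([1.0, 0.0])
v2 = np.array([0.0, 1.0])
v3 = np.array([1.0, 1.0])
coeffs = np.array([2.0, -1.0, 3.0])

# Written out term by term: 2*v1 - 1*v2 + 3*v3
combo = coeffs[0] * v1 + coeffs[1] * v2 + coeffs[2] * v3

# The same result as a matrix-vector product; the columns of V are the vectors
V = np.column_stack([v1, v2, v3])
combo_matrix = V @ coeffs

print(combo)          # [5. 2.]
print(combo_matrix)   # [5. 2.]
```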
Span
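The span of a set of vectors is the set of all possible linear combinations:
$$\operatorname{span}(\mathbf{v}_1, \dots, \mathbf{v}_k) = \{\, c_1 \mathbf{v}_1 + \dots + c_k \mathbf{v}_k : c_1, \dots, c_k \in \mathbb{R} \,\}$$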
Geometrically:
- The span of one nonzero vector is a line through the origin.
- The span of two non-parallel vectors is a plane through the origin.
- The span of three vectors that don’t all lie in one plane is all of $\mathbb{R}^3$.
Geometric interpretation: The span tells you what “world” a set of vectors can reach. If your features span only a low-dimensional subspace, your model can only learn patterns within that subspace.
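Whether a given vector lies in a span can be tested numerically. The sketch below is one possible approach, not the only one: it solves a least-squares problem and checks whether the best linear combination reproduces the target. The helper name `in_span` and the tolerance are assumptions made for illustration.

```python
import numpy as np

def in_span(vectors, target, tol=1e-10):
    """Return True if `target` is (numerically) a linear combination of `vectors`."""
    A = np.column_stack(vectors)
    # Least squares finds the coefficients that bring A @ coeffs closest to target
    coeffs, *_ = np.linalg.lstsq(A, target, rcond=None)
    return bool(np.linalg.norm(A @ coeffs - target) < tol)

u = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])

print(in_span([u, v], np.array([2.0, -3.0, 0.0])))  # True: lies in the xy-plane
print(in_span([u, v], np.array([0.0, 0.0, 1.0])))   # False: points out of the plane
```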
Linear Independence
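Vectors $\mathbf{v}_1, \mathbf{v}_2, \dots, \mathbf{v}_k$ are linearly independent if the only solution to:
$$c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \dots + c_k \mathbf{v}_k = \mathbf{0}$$
is $c_1 = c_2 = \dots = c_k = 0$. If any other solution exists, the vectors are linearly dependent, meaning at least one vector is redundant (expressible as a combination of the others).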
Key distinction: Linearly independent vectors each contribute unique information. Linearly dependent vectors contain redundancy. In data, dependent features (like height in cm and height in inches) carry the same signal and can cause problems like multicollinearity.
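The height example can be checked with a rank computation. In the sketch below the measurements are made-up values; the point is only that a column which is a rescaled copy of another adds nothing to the rank of the feature matrix.

```python
import numpy as np

height_cm = np.array([150.0, 165.0, 180.0, 172.0])
height_in = height_cm / 2.54                       # same information, different unit
weight_kg = np.array([55.0, 70.0, 82.0, 64.0])     # made-up values

# Each column is one feature, each row one data point
X = np.column_stack([height_cm, height_in, weight_kg])

print(X.shape[1])                 # 3 feature columns
print(np.linalg.matrix_rank(X))   # 2: one column is redundant
```

Dropping either height column leaves a matrix whose rank equals its number of columns, which is the situation a linear model needs for a unique least-squares solution.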
Vector Spaces
A vector space over $\mathbb{R}$ is a set $V$ equipped with addition and scalar multiplication that satisfies the eight axioms listed in the properties table above (commutativity, associativity, zero vector, additive inverse, the two distributive laws, scalar associativity, and identity).
The key examples:
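| Vector Space | Elements | Dimension |
|---|---|---|
| $\mathbb{R}^n$ | Column vectors with $n$ real entries | $n$ |
| $\mathbb{R}^{m \times n}$ | $m \times n$ real matrices | $mn$ |
| $\mathcal{P}_n$ | Polynomials of degree $\le n$ | $n + 1$ |
| $C[a, b]$ | Continuous functions on $[a, b]$ | $\infty$ |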
The abstraction is powerful: once you prove a theorem about vector spaces in general, it applies to all these examples simultaneously.
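To make the abstraction concrete, polynomials can be handled as coefficient vectors: adding and scaling the coefficient arrays adds and scales the polynomials themselves. The sketch below uses two arbitrary polynomials and NumPy's `numpy.polynomial.polynomial.polyval`, which expects coefficients in increasing degree.

```python
import numpy as np
from numpy.polynomial import polynomial as P

# Coefficients in increasing degree: p(x) = 1 + 2x + 3x^2, q(x) = 4 - x
p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, -1.0, 0.0])

# Vector operations on the coefficients: r(x) = 2 p(x) + q(x) = 6 + 3x + 6x^2
r = 2.0 * p + q
print(r)   # [6. 3. 6.]

# Evaluating r agrees with combining the evaluated polynomials
x = 1.5
lhs = P.polyval(x, r)
rhs = 2.0 * P.polyval(x, p) + P.polyval(x, q)
print(np.isclose(lhs, rhs))   # True
```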
Subspaces
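A subspace $W$ of a vector space $V$ is a subset that is itself a vector space: it contains the zero vector and is closed under addition and scalar multiplication.
To verify that $W \subseteq V$ is a subspace, check three things:
- $\mathbf{0} \in W$
- If $\mathbf{u}, \mathbf{v} \in W$, then $\mathbf{u} + \mathbf{v} \in W$
- If $\mathbf{v} \in W$ and $c \in \mathbb{R}$, then $c\,\mathbf{v} \in W$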
Example: The set of all solutions to a homogeneous system $A\mathbf{x} = \mathbf{0}$, where $A$ has $n$ columns, is a subspace of $\mathbb{R}^n$. This is the null space of $A$.
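A basis for the null space can be computed numerically from the singular value decomposition. The sketch below is one common approach under stated assumptions: `null_space_basis` is a hypothetical helper, the matrix is made up, and the absolute tolerance is an arbitrary choice for a small, well-scaled example.

```python
import numpy as np

def null_space_basis(A: np.ndarray, tol: float = 1e-10) -> np.ndarray:
    """Orthonormal basis (as columns) for {x : A x = 0}, computed from the SVD."""
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    # Rows of vt beyond the rank correspond to (numerically) zero singular values
    return vt[rank:].T

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])   # second row is twice the first, so rank 1

N = null_space_basis(A)
print(N.shape)   # (3, 2): two basis vectors in R^3

# Closure check: a combination of null-space vectors is still mapped to zero
x = 2.0 * N[:, 0] - 5.0 * N[:, 1]
print(np.allclose(A @ x, 0.0))   # True
```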
Basis and Dimension
A basis for a vector space $V$ is a set of vectors that is:
- Linearly independent: no redundancy
- Spanning: every vector in $V$ can be written as a linear combination of the basis
The number of vectors in any basis of $V$ is always the same. This number is the dimension of $V$, written $\dim(V)$.
The Standard Basis
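The standard basis for $\mathbb{R}^n$ consists of the unit vectors:
$$\mathbf{e}_1 = (1, 0, \dots, 0), \quad \mathbf{e}_2 = (0, 1, \dots, 0), \quad \dots, \quad \mathbf{e}_n = (0, 0, \dots, 1)$$
Any vector $\mathbf{v} \in \mathbb{R}^n$ can be written as $\mathbf{v} = v_1 \mathbf{e}_1 + v_2 \mathbf{e}_2 + \dots + v_n \mathbf{e}_n$.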
But the standard basis is just one choice. Choosing a different basis is equivalent to choosing a different coordinate system — and finding the right coordinate system is what techniques like PCA are all about.
Key insight: PCA finds a new basis where the first basis vector captures the most variance, the second captures the next most, and so on. The “principal components” are the new basis vectors, and projecting data onto the first few gives dimensionality reduction.
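Changing basis amounts to solving a small linear system: the coordinates of a vector relative to a new basis are the coefficients that rebuild it from the new basis vectors. The sketch below uses a hypothetical basis for $\mathbb{R}^2$ chosen for illustration; it is not PCA, only the coordinate change that PCA relies on.

```python
import numpy as np

# A hypothetical alternative basis for R^2, stored as the columns of B
b1 = np.array([1.0, 1.0])
b2 = np.array([1.0, -1.0])
B = np.column_stack([b1, b2])

v = np.array([3.0, 1.0])   # coordinates in the standard basis

# Solve B @ coords = v for the coordinates of v relative to {b1, b2}
coords = np.linalg.solve(B, v)
print(coords)   # [2. 1.] since v = 2*b1 + 1*b2

# Rebuilding v from the new coordinates recovers the original vector
print(np.allclose(coords[0] * b1 + coords[1] * b2, v))   # True
```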
Worked Example: Checking Linear Independence
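Determine whether $\mathbf{v}_1 = (1, 2, 3)$, $\mathbf{v}_2 = (4, 5, 6)$, $\mathbf{v}_3 = (7, 8, 9)$ are linearly independent.
We need to check if $c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + c_3 \mathbf{v}_3 = \mathbf{0}$ implies $c_1 = c_2 = c_3 = 0$.
Row reduce the coefficient matrix, whose columns are the three vectors:
$$
\begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{bmatrix}
\;\longrightarrow\;
\begin{bmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 0 \end{bmatrix}
\;\longrightarrow\;
\begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & 2 \\ 0 & 0 & 0 \end{bmatrix}
$$
The third row is all zeros, so $c_3$ is a free variable. A nontrivial solution exists (for example, $c_1 = 1$, $c_2 = -2$, $c_3 = 1$), meaning the vectors are linearly dependent.
Indeed, $\mathbf{v}_3 = 2\mathbf{v}_2 - \mathbf{v}_1$: the third vector is a combination of the first two, so it contributes no new information.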
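```python
import numpy as np

# The three vectors from the worked example
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v3 = np.array([7, 8, 9])

# Stack the vectors as columns; rank below the number of vectors means dependence
A = np.column_stack([v1, v2, v3])
rank = np.linalg.matrix_rank(A)
print(f"Rank: {rank}")  # prints "Rank: 2" (< 3, so linearly dependent)
```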
Why This Matters for ML
Vectors and vector spaces are not abstract luxuries — they are the language of data and models:
- Feature vectors: Every data point is a vector. The dimension equals the number of features.
- Weight vectors: Every linear model stores its parameters as a vector. Training is finding the right vector.
- Embeddings: Word2Vec, BERT, and GPT all map tokens to dense vectors where geometric relationships encode semantic meaning.
- Similarity: The dot product and cosine similarity measure how “alike” two data points are — the backbone of recommendation systems and retrieval.
- Dimensionality reduction: PCA, t-SNE, and UMAP all operate on vectors, finding lower-dimensional representations that preserve structure.
- Regularization: L1 and L2 norms on weight vectors control model complexity and prevent overfitting.
Understanding the geometry of vectors (their lengths, angles, projections, and spans) gives you intuition for what models are actually doing.
Summary
- A vector is an ordered list of numbers, interpretable as a point or an arrow in $\mathbb{R}^n$.
- The dot product measures directional similarity and encodes angles between vectors.
- Norms measure vector length; L1 and L2 norms drive regularization in ML.
- Linear combinations are the fundamental operation; neural networks compute them at every layer.
- Span is the set of all reachable points via linear combinations.
- Linear independence means no vector is redundant — each carries unique information.
- A vector space is any set with addition and scalar multiplication satisfying the eight axioms.
- A basis is a minimal spanning set; its size is the dimension of the space.
- In the next article, we formalize how to organize and transform vectors using matrices and matrix operations.