Vectors and Vector Spaces: The Language of Data

Linear Algebra Series 1 / 13

Introduction

Every dataset you will ever work with in machine learning is, at its core, a collection of vectors. An image is a vector of pixel values. A sentence is a vector of word embeddings. A user profile is a vector of features. Understanding vectors is not optional — it is the entry point to everything that follows.

This article builds the vocabulary and intuition you need. We start with the concrete (arrows in space) and move toward the abstract (vector spaces over arbitrary fields), always keeping one eye on why these ideas matter for ML.

What Is a Vector?

A vector is an ordered list of numbers. In $\mathbb{R}^n$ , a vector $\mathbf{v}$ has $n$ components:

\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}

We write $\mathbf{v} \in \mathbb{R}^n$ to say that $\mathbf{v}$ lives in $n$ -dimensional real space.

There are two complementary ways to think about vectors:

Algebraic view: A vector is a tuple of numbers — a point in $\mathbb{R}^n$ .
Geometric view: A vector is an arrow from the origin to that point, carrying both magnitude and direction.

Both views are useful. The algebraic view lets us compute. The geometric view lets us reason about distances, angles, and projections — concepts that appear everywhere in ML.

Example: A data point with three features — age, income, and credit score — is a vector $\mathbf{x} = [25, 50000, 720]^T \in \mathbb{R}^3$ . Each feature is a dimension.
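As a minimal sketch, the example data point can be stored as a NumPy array. The variable name `x` and the choice of floating-point values are illustrative assumptions, not part of the original example.

```python
import numpy as np

# Feature vector from the example: age, income, credit score.
x = np.array([25.0, 50000.0, 720.0])

print(x.shape)  # (3,): one axis with three components, so x is in R^3
print(x[1])     # 50000.0: NumPy indexes from 0, so x[1] is the second component
```

Note that the features live on very different scales, which is one reason feature scaling is commonly applied before computing distances or dot products.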

Basic Vector Operations

Addition

Two vectors of the same dimension can be added component-wise:

\mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix}

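Geometrically, vector addition follows the head-to-tail rule: place the tail of $\mathbf{v}$ at the head of $\mathbf{u}$, and the sum is the arrow from the tail of $\mathbf{u}$ to the head of $\mathbf{v}$. Equivalently, the sum is the diagonal of the parallelogram with sides $\mathbf{u}$ and $\mathbf{v}$ (the parallelogram rule).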

Scalar Multiplication

Multiplying a vector by a scalar $c \in \mathbb{R}$ scales every component:

c\mathbf{v} = \begin{bmatrix} cv_1 \\ cv_2 \\ \vdots \\ cv_n \end{bmatrix}

If $c > 1$, the vector stretches. If $0 < c < 1$, it shrinks. If $c < 0$, it flips direction and its length is scaled by $|c|$. If $c = 0$, the result is the zero vector.
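Both operations map directly onto NumPy arithmetic. The sketch below uses made-up two-dimensional vectors; the variable names are illustrative only.

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

total = u + v        # component-wise addition
stretched = 2.0 * v  # c > 1: same direction, twice as long
shrunk = 0.5 * v     # 0 < c < 1: same direction, half as long
flipped = -1.0 * v   # c < 0: opposite direction

print(total.tolist())      # [4.0, 1.0]
print(stretched.tolist())  # [6.0, -2.0]
print(shrunk.tolist())     # [1.5, -0.5]
print(flipped.tolist())    # [-3.0, 1.0]
```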

Properties

For all vectors $\mathbf{u}, \mathbf{v}, \mathbf{w} \in \mathbb{R}^n$ and scalars $a, b \in \mathbb{R}$ :

Property	Statement
Commutativity	$\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}$
Associativity	$(\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})$
Zero vector	$\mathbf{v} + \mathbf{0} = \mathbf{v}$
Additive inverse	$\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$
Distributivity (over vectors)	$a(\mathbf{u} + \mathbf{v}) = a\mathbf{u} + a\mathbf{v}$
Distributivity (over scalars)	$(a + b)\mathbf{v} = a\mathbf{v} + b\mathbf{v}$
Scalar associativity	$(ab)\mathbf{v} = a(b\mathbf{v})$
Identity	$1\mathbf{v} = \mathbf{v}$

These properties are not just bookkeeping. They are exactly the axioms that define a vector space.
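A quick numerical spot check can make these properties tangible. The sketch below tests them on random vectors in $\mathbb{R}^4$, covering both forms of distributivity, $a(\mathbf{u} + \mathbf{v})$ and $(a + b)\mathbf{v}$. The seed, dimension, and scalar values are arbitrary assumptions, and a passing check on a few samples illustrates the properties rather than proving them.

```python
import numpy as np

rng = np.random.default_rng(0)     # fixed seed, chosen arbitrarily
u, v, w = rng.normal(size=(3, 4))  # three random vectors in R^4
a, b = 2.0, -3.0
zero = np.zeros(4)

# np.allclose is used instead of == because floating-point sums can differ
# in the last few bits depending on the order of operations.
checks = {
    "commutativity": np.allclose(u + v, v + u),
    "associativity": np.allclose((u + v) + w, u + (v + w)),
    "zero vector": np.allclose(v + zero, v),
    "additive inverse": np.allclose(v + (-v), zero),
    "distributivity over vectors": np.allclose(a * (u + v), a * u + a * v),
    "distributivity over scalars": np.allclose((a + b) * v, a * v + b * v),
    "scalar associativity": np.allclose((a * b) * v, a * (b * v)),
    "identity": np.allclose(1.0 * v, v),
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```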

The Dot Product

The dot product (the standard inner product on $\mathbb{R}^n$) of two vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$ is:

\mathbf{u} \cdot \mathbf{v} = \sum_{i=1}^{n} u_i v_i = u_1 v_1 + u_2 v_2 + \cdots + u_n v_n

The dot product is a single number — a scalar. It encodes geometric information:

\mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \, \|\mathbf{v}\| \cos\theta

where $\theta$ is the angle between the two vectors (defined when both are nonzero), and $\|\mathbf{u}\| = \sqrt{\sum_{i=1}^{n} u_i^2}$ is the Euclidean norm (length).

What the Dot Product Tells Us

Value of $\mathbf{u} \cdot \mathbf{v}$	Meaning
Positive	Vectors point in similar directions ( $\theta < 90°$ )
Zero	Vectors are orthogonal (perpendicular, $\theta = 90°$ )
Negative	Vectors point in opposing directions ( $\theta > 90°$ )

Key insight: The dot product measures similarity in direction. This is why cosine similarity — $\frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}$ — is one of the most widely used similarity metrics in NLP, recommendation systems, and retrieval.
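The three cases in the table can be reproduced with a small cosine similarity helper. This is a minimal sketch: the function name `cosine_similarity` and the choice to raise an error for the zero vector are assumptions made for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Return cos(theta) between two nonzero vectors."""
    norm_product = np.linalg.norm(u) * np.linalg.norm(v)
    if norm_product == 0.0:
        # The angle is undefined if either vector has zero length.
        raise ValueError("cosine similarity is undefined for the zero vector")
    return float(np.dot(u, v) / norm_product)

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # perpendicular to a
d = np.array([-1.0, 0.0])  # opposite to a

print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
print(cosine_similarity(a, d))  # -1.0
```

Because the result is normalized by both lengths, `a` and `b` score 1.0 even though `b` is twice as long: cosine similarity compares direction only.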

Vector Norms

A norm assigns a non-negative length to every vector. The most common norms are:

L2 norm (Euclidean norm):

\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}

L1 norm (Manhattan norm):

\|\mathbf{v}\|_1 = \sum_{i=1}^{n} |v_i|

L-infinity norm (max norm):

\|\mathbf{v}\|_\infty = \max_i |v_i|

General Lp norm (for $p \geq 1$):

\|\mathbf{v}\|_p = \left( \sum_{i=1}^{n} |v_i|^p \right)^{1/p}

Every norm must satisfy three properties: (1) $\|\mathbf{v}\| \geq 0$ with equality only when $\mathbf{v} = \mathbf{0}$ , (2) $\|c\mathbf{v}\| = |c| \, \|\mathbf{v}\|$ , and (3) the triangle inequality $\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|$ .

Key insight: In ML, L1 and L2 norms appear directly in regularization. L1 regularization (Lasso) encourages sparsity. L2 regularization (Ridge) encourages small weights. The choice of norm shapes the geometry of the solution space.
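The norms defined above can all be computed with `np.linalg.norm` through its `ord` parameter. The sketch below uses the vector $[3, -4]^T$, chosen only because its norms are easy to verify by hand.

```python
import numpy as np

v = np.array([3.0, -4.0])
u = np.array([1.0, 2.0])

l1 = np.linalg.norm(v, ord=1)         # |3| + |-4| = 7
l2 = np.linalg.norm(v)                # sqrt(9 + 16) = 5 (L2 is the default)
linf = np.linalg.norm(v, ord=np.inf)  # max(|3|, |-4|) = 4
l3 = np.linalg.norm(v, ord=3)         # (27 + 64) ** (1/3)

print(l1, l2, linf, l3)

# Triangle inequality for the L2 norm: ||u + v|| <= ||u|| + ||v||
print(np.linalg.norm(u + v) <= np.linalg.norm(u) + np.linalg.norm(v))
```

For the same vector, the values decrease as $p$ grows (7, 5, about 4.5, then 4), which reflects the general ordering $\|\mathbf{v}\|_\infty \leq \|\mathbf{v}\|_2 \leq \|\mathbf{v}\|_1$.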

Linear Combinations

A linear combination of vectors $\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k$ is any expression of the form:

c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k

where $c_1, c_2, \ldots, c_k \in \mathbb{R}$ are scalars called coefficients.

Linear combinations are the central operation in linear algebra. A neural network computes linear combinations of inputs at every layer. A prediction in linear regression is a linear combination of features. PCA finds directions that are linear combinations of the original axes.
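A linear combination can be computed term by term or, equivalently, as a matrix-vector product with the vectors stacked as columns. The sketch below shows both forms on made-up vectors; the names `V` and `c` are illustrative.

```python
import numpy as np

v1 = np.array([1.0, 0.0, 2.0])
v2 = np.array([0.0, 1.0, -1.0])
c = np.array([3.0, 2.0])  # coefficients c_1, c_2

# Explicit form: c_1 * v1 + c_2 * v2
explicit = c[0] * v1 + c[1] * v2

# Same result as a matrix-vector product, with the vectors as columns of V.
V = np.column_stack([v1, v2])
as_matvec = V @ c

print(explicit.tolist())   # [3.0, 2.0, 4.0]
print(as_matvec.tolist())  # [3.0, 2.0, 4.0]
```

The matrix form is the one used in practice: a linear regression prediction, for instance, is the product of a feature matrix with a weight vector.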

Span

The span of a set of vectors is the set of all possible linear combinations:

\text{span}(\mathbf{v}_1, \ldots, \mathbf{v}_k) = \left\{ \sum_{i=1}^{k} c_i \mathbf{v}_i \;\middle|\; c_i \in \mathbb{R} \right\}

Geometrically:

The span of one nonzero vector is a line through the origin.
The span of two non-parallel vectors is a plane through the origin.
The span of three vectors that don’t all lie in one plane is all of $\mathbb{R}^3$ .

Geometric interpretation: The span tells you what “world” a set of vectors can reach. If your features span only a low-dimensional subspace, your model can only learn patterns within that subspace.
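One way to test whether a vector lies in a span is to compare matrix ranks: appending the target as an extra column leaves the rank unchanged exactly when the target is already reachable. The helper name `in_span` is hypothetical, and the check relies on the default numerical tolerance of `np.linalg.matrix_rank`.

```python
import numpy as np

def in_span(vectors, target):
    """Check whether target is a linear combination of the given vectors."""
    V = np.column_stack(vectors)
    augmented = np.column_stack([V, target])
    # If target adds a new direction, the rank goes up by one.
    return bool(np.linalg.matrix_rank(augmented) == np.linalg.matrix_rank(V))

v1 = np.array([1.0, 0.0, 0.0])
v2 = np.array([0.0, 1.0, 0.0])

print(in_span([v1, v2], np.array([2.0, -5.0, 0.0])))  # True: lies in the xy-plane
print(in_span([v1, v2], np.array([0.0, 0.0, 1.0])))   # False: points out of the plane
```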

Linear Independence

Vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ are linearly independent if the only solution to:

c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_k \mathbf{v}_k = \mathbf{0}

is $c_1 = c_2 = \cdots = c_k = 0$ . If any other solution exists, the vectors are linearly dependent — meaning at least one vector is redundant (expressible as a combination of the others).

Key distinction: Linearly independent vectors each contribute unique information. Linearly dependent vectors contain redundancy. In data, dependent features (like height in cm and height in inches) carry the same signal and can cause problems like multicollinearity.
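The height example can be checked numerically. In the sketch below, all measurements are made-up values; the point is only that a column and its unit-converted copy give a rank of 1, while two genuinely different features give a rank of 2.

```python
import numpy as np

height_cm = np.array([160.0, 175.0, 182.0, 168.0])
height_in = height_cm / 2.54                    # same information, different unit
weight_kg = np.array([55.0, 80.0, 77.0, 62.0])  # made-up values

X_redundant = np.column_stack([height_cm, height_in])
X_informative = np.column_stack([height_cm, weight_kg])

print(np.linalg.matrix_rank(X_redundant))    # 1: the second column is a multiple of the first
print(np.linalg.matrix_rank(X_informative))  # 2: the columns are linearly independent
```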

Vector Spaces

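A vector space $V$ over $\mathbb{R}$ is a set equipped with addition and scalar multiplication that satisfies eight axioms, the same properties that hold in $\mathbb{R}^n$: commutativity and associativity of addition, existence of a zero vector, existence of additive inverses, distributivity over vector addition ( $a(\mathbf{u} + \mathbf{v}) = a\mathbf{u} + a\mathbf{v}$ ), distributivity over scalar addition ( $(a + b)\mathbf{v} = a\mathbf{v} + b\mathbf{v}$ ), scalar associativity, and the identity $1\mathbf{v} = \mathbf{v}$.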

The key examples:

Vector Space	Elements	Dimension
$\mathbb{R}^n$	Column vectors with $n$ real entries	$n$
$\mathbb{R}^{m \times n}$	$m \times n$ real matrices	$mn$
$\mathcal{P}_n$	Polynomials of degree $\leq n$	$n + 1$
$C[a,b]$	Continuous functions on $[a,b]$	$\infty$

The abstraction is powerful: once you prove a theorem about vector spaces in general, it applies to all these examples simultaneously.
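To see that polynomials behave like vectors, the sketch below represents two elements of $\mathcal{P}_2$ by their coefficients using `numpy.polynomial.Polynomial`. The particular polynomials are arbitrary examples.

```python
from numpy.polynomial import Polynomial

# Coefficients in increasing degree: p(x) = 1 + 2x + 3x^2, q(x) = -2x + x^2
p = Polynomial([1.0, 2.0, 3.0])
q = Polynomial([0.0, -2.0, 1.0])

s = p + q    # polynomial addition
t = 2.0 * p  # scalar multiplication

# The coefficient vectors add and scale exactly like vectors in R^3,
# which matches dim(P_2) = 2 + 1 = 3.
print(s.coef.tolist())  # [1.0, 0.0, 4.0]
print(t.coef.tolist())  # [2.0, 4.0, 6.0]
```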

Subspaces

A subspace $W$ of a vector space $V$ is a subset that is itself a vector space — closed under addition and scalar multiplication, and containing the zero vector.

To verify that $W$ is a subspace, check three things:

$\mathbf{0} \in W$
If $\mathbf{u}, \mathbf{v} \in W$ , then $\mathbf{u} + \mathbf{v} \in W$
If $\mathbf{v} \in W$ and $c \in \mathbb{R}$ , then $c\mathbf{v} \in W$

Example: The set of all solutions to a homogeneous system $\mathbf{A}\mathbf{x} = \mathbf{0}$ is a subspace of $\mathbb{R}^n$ . This is the null space of $\mathbf{A}$ .
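The null space example can be explored numerically. The sketch below builds an orthonormal basis for the null space from the singular value decomposition and then checks the closure conditions on it. The helper name `null_space_basis` and the absolute tolerance `1e-10` are assumptions for illustration; a production routine would typically scale the tolerance with the matrix.

```python
import numpy as np

def null_space_basis(A, tol=1e-10):
    """Return an orthonormal basis for the null space of A, as columns."""
    _, s, vt = np.linalg.svd(A)
    rank = int(np.sum(s > tol))
    # Rows of vt beyond the rank are right singular vectors that A maps to zero.
    return vt[rank:].T

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])  # rank 1, so the null space has dimension 3 - 1 = 2

N = null_space_basis(A)
x, y = N[:, 0], N[:, 1]

print(N.shape)                         # (3, 2)
print(np.allclose(A @ x, 0))           # True: x solves Ax = 0
print(np.allclose(A @ (x + y), 0))     # True: closed under addition
print(np.allclose(A @ (-3.5 * x), 0))  # True: closed under scalar multiplication
```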

Basis and Dimension

A basis for a vector space $V$ is a set of vectors that is:

Linearly independent — no redundancy
Spanning — every vector in $V$ can be written as a linear combination of the basis

Every basis of a finite-dimensional vector space $V$ contains the same number of vectors. This number is the dimension of $V$, written $\dim(V)$.

The Standard Basis

The standard basis for $\mathbb{R}^n$ consists of the unit vectors:

\mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad \mathbf{e}_n = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}

Any vector $\mathbf{v} \in \mathbb{R}^n$ can be written as $\mathbf{v} = v_1 \mathbf{e}_1 + v_2 \mathbf{e}_2 + \cdots + v_n \mathbf{e}_n$ .

But the standard basis is just one choice. Choosing a different basis is equivalent to choosing a different coordinate system — and finding the right coordinate system is what techniques like PCA are all about.
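To make the idea of a different coordinate system concrete, the sketch below finds the coordinates of a vector in a non-standard basis of $\mathbb{R}^2$ by solving a linear system. The basis vectors `b1` and `b2` are hypothetical example values.

```python
import numpy as np

# Columns of B form a non-standard basis of R^2 (hypothetical example values).
b1 = np.array([1.0, 1.0])
b2 = np.array([1.0, -1.0])
B = np.column_stack([b1, b2])

v = np.array([3.0, 1.0])  # coordinates in the standard basis

# Solve B @ coords = v for the coordinates of v in the basis {b1, b2}.
coords = np.linalg.solve(B, v)
print(coords)  # approximately [2, 1], since v = 2*b1 + 1*b2

# Rebuilding v from its new coordinates recovers the original vector.
reconstructed = coords[0] * b1 + coords[1] * b2
print(np.allclose(reconstructed, v))  # True
```

The vector itself has not changed; only the numbers used to describe it have.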

Key insight: PCA finds a new basis where the first basis vector captures the most variance, the second captures the next most, and so on. The “principal components” are the new basis vectors, and projecting data onto the first few gives dimensionality reduction.
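A minimal PCA sketch, assuming synthetic two-dimensional data stretched along the direction $(1, 1)$, shows the new basis emerging from the data. It uses the SVD of the centered data matrix, which is one common way to compute principal components; the seed, sample size, and noise level are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: points near the line y = x, plus a little noise (made-up values).
t = rng.normal(size=200)
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(200, 2))

Xc = X - X.mean(axis=0)  # center the data
_, s, vt = np.linalg.svd(Xc, full_matrices=False)

components = vt                            # rows are the new orthonormal basis vectors
explained_variance = s**2 / (len(X) - 1)   # variance along each component, largest first

Z = Xc @ components[:1].T                  # project onto the first component: 2D -> 1D

print(components[0])       # roughly +/-(0.707, 0.707); the sign is arbitrary
print(explained_variance)  # the first value dominates the second
print(Z.shape)             # (200, 1)
```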

Worked Example: Checking Linear Independence

Determine whether $\mathbf{v}_1 = [1, 2, 3]^T$ , $\mathbf{v}_2 = [4, 5, 6]^T$ , $\mathbf{v}_3 = [7, 8, 9]^T$ are linearly independent.

We need to check if $c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + c_3 \mathbf{v}_3 = \mathbf{0}$ implies $c_1 = c_2 = c_3 = 0$ .

\begin{aligned} c_1 + 4c_2 + 7c_3 &= 0 \\[6pt] 2c_1 + 5c_2 + 8c_3 &= 0 \\[6pt] 3c_1 + 6c_2 + 9c_3 &= 0 \end{aligned}

Row reduce the coefficient matrix:

\begin{bmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 9 \end{bmatrix} \xrightarrow{R_2 - 2R_1} \begin{bmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 3 & 6 & 9 \end{bmatrix} \xrightarrow{R_3 - 3R_1} \begin{bmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & -6 & -12 \end{bmatrix} \xrightarrow{R_3 - 2R_2} \begin{bmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 0 \end{bmatrix}

The third row is all zeros, so the system has only two pivots for three unknowns and $c_3$ is a free variable. A nontrivial solution exists (for example, $c_3 = 1, c_2 = -2, c_1 = 1$), meaning the vectors are linearly dependent.

Indeed, $\mathbf{v}_3 = 2\mathbf{v}_2 - \mathbf{v}_1$ : the third vector is a combination of the first two, so it contributes no new information.

import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v3 = np.array([7, 8, 9])

# Stack the vectors as columns; the rank counts the linearly independent columns.
A = np.column_stack([v1, v2, v3])
rank = np.linalg.matrix_rank(A)
print(f"Rank: {rank}")  # prints "Rank: 2" (2 < 3, so the vectors are linearly dependent)
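The specific dependency found by hand can also be confirmed directly. The short check below substitutes the coefficients from the worked example; integer arrays are used so the arithmetic is exact.

```python
import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
v3 = np.array([7, 8, 9])

# Nontrivial solution from the worked example: c1 = 1, c2 = -2, c3 = 1
c1, c2, c3 = 1, -2, 1
combo = c1 * v1 + c2 * v2 + c3 * v3

print(combo.tolist())                   # [0, 0, 0]
print(np.array_equal(v3, 2 * v2 - v1))  # True
```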

Why This Matters for ML

Vectors and vector spaces are not abstract luxuries — they are the language of data and models:

Feature vectors: Every data point is a vector. The dimension equals the number of features.
Weight vectors: Every linear model stores its parameters as a vector. Training is finding the right vector.
Embeddings: Word2Vec, BERT, and GPT all map tokens to dense vectors where geometric relationships encode semantic meaning.
Similarity: The dot product and cosine similarity measure how “alike” two data points are — the backbone of recommendation systems and retrieval.
Dimensionality reduction: PCA, t-SNE, and UMAP all operate on vectors, finding lower-dimensional representations that preserve structure.
Regularization: L1 and L2 norms on weight vectors control model complexity and prevent overfitting.

Understanding the geometry of vectors — their lengths, angles, projections, and spans — gives you geometric intuition for what models are actually doing.

Summary

A vector is an ordered list of numbers, interpretable as a point or an arrow in $\mathbb{R}^n$ .
The dot product measures directional similarity and encodes angles between vectors.
Norms measure vector length; L1 and L2 norms drive regularization in ML.
Linear combinations are the fundamental operation; neural networks compute them at every layer.
Span is the set of all reachable points via linear combinations.
Linear independence means no vector is redundant — each carries unique information.
A vector space is any set with addition and scalar multiplication satisfying the eight axioms.
A basis is a minimal spanning set; its size is the dimension of the space.
In the next article, we formalize how to organize and transform vectors using matrices and matrix operations.
