Randomized Linear Algebra: Speed Through Randomness

Learn how random projections, randomized SVD, and sketching algorithms solve massive linear algebra problems faster than classical methods.

Linear Algebra March 6, 2026 12 min read

Introduction

Classical linear algebra algorithms are exact but slow: SVD costs O(mnmin(m,n))O(mn \min(m,n)), eigendecomposition costs O(n3)O(n^3), and solving dense systems costs O(n3)O(n^3). When matrices have millions of rows and columns — which is routine in modern ML — these costs become prohibitive.

Randomized linear algebra trades a small, controlled amount of accuracy for dramatic speed improvements. The core idea is simple: use random sampling or projection to reduce a large problem to a smaller one, solve the small problem exactly, then lift the solution back. The theory guarantees that the error is small with high probability.

This article builds on matrix decompositions and sparse matrices, completing our linear algebra series with the tools that make large-scale ML computationally feasible.

The Core Idea: Dimensionality Reduction via Randomness

Given a large matrix ARm×n\mathbf{A} \in \mathbb{R}^{m \times n}, we want to compute its SVD, low-rank approximation, or solve a system involving it. The randomized approach:

  1. Sketch: Multiply A\mathbf{A} by a random matrix Ω\boldsymbol{\Omega} to get a much smaller matrix Y=AΩ\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega}.
  2. Solve: Perform the expensive computation on Y\mathbf{Y} (which is small).
  3. Recover: Reconstruct the answer for the original A\mathbf{A}.

The random matrix Ω\boldsymbol{\Omega} acts as a “sketch” — it compresses A\mathbf{A} while preserving its essential structure (the top singular vectors and values).

Key insight: Randomness is not a weakness here — it is a tool. Random projections preserve geometric structure (distances, angles, subspaces) with high probability. The algorithms come with rigorous error bounds, not just empirical heuristics.

Random Projections

The Johnson-Lindenstrauss Lemma

The Johnson-Lindenstrauss (JL) lemma is the theoretical foundation of random projections. It states:

For any set of nn points in Rd\mathbb{R}^d and any 0<ϵ<10 < \epsilon < 1, there exists a linear map f:RdRkf: \mathbb{R}^d \to \mathbb{R}^k with k=O(ϵ2logn)k = O(\epsilon^{-2} \log n) such that for all pairs of points u,v\mathbf{u}, \mathbf{v}:

(1ϵ)uv2f(u)f(v)2(1+ϵ)uv2(1 - \epsilon)\|\mathbf{u} - \mathbf{v}\|^2 \leq \|f(\mathbf{u}) - f(\mathbf{v})\|^2 \leq (1 + \epsilon)\|\mathbf{u} - \mathbf{v}\|^2

All pairwise distances are preserved to within a factor of (1±ϵ)(1 \pm \epsilon).

The remarkable fact: the target dimension kk depends only on logn\log n and ϵ\epsilon, not on the original dimension dd. You can project from a million dimensions to a few hundred and preserve all pairwise distances.

Constructing Random Projection Matrices

The projection f(x)=1kRxf(\mathbf{x}) = \frac{1}{\sqrt{k}}\mathbf{R}\mathbf{x} where RRk×d\mathbf{R} \in \mathbb{R}^{k \times d} has random entries. Several choices work:

Matrix TypeEntriesDensitySpeed
GaussianRijN(0,1)R_{ij} \sim \mathcal{N}(0, 1)DenseO(kd)O(kd) per vector
RademacherRij{1,+1}R_{ij} \in \{-1, +1\} uniformDenseO(kd)O(kd), no multiplications
Sparse RademacherRij{s,0,+s}R_{ij} \in \{-\sqrt{s}, 0, +\sqrt{s}\}1/s1/s densityO(kd/s)O(kd/s)
SRHT (subsampled randomized Hadamard)StructuredImplicitO(dlogk)O(d \log k)

The sparse and structured variants are faster because the projection itself costs less — critical when dd is large.

import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# Original high-dimensional data
X = np.random.randn(10000, 5000)  # 10K points in 5000D

# Project to 200 dimensions (preserving distances)
projector = GaussianRandomProjection(n_components=200)
X_proj = projector.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Projected shape: {X_proj.shape}")

# Verify distance preservation
from scipy.spatial.distance import pdist
d_orig = pdist(X[:100])
d_proj = pdist(X_proj[:100])

ratio = d_proj / d_orig
print(f"Distance ratio — mean: {ratio.mean():.3f}, std: {ratio.std():.3f}")
# Expect mean ≈ 1.0, small std

Applications of Random Projections

  • Approximate nearest neighbors: Project data to lower dimension, then run exact search. Much faster than searching in the original space.
  • Streaming algorithms: Process data points one at a time, maintaining a sketch. No need to store the full dataset.
  • Privacy: Random projections can preserve utility while obscuring individual data points (differential privacy connections).
  • Speeding up kernel methods: Random Fourier features approximate RBF kernels via random projections.

Randomized SVD

The randomized SVD is the most important algorithm in this article. It computes an approximate rank-kk SVD of a large matrix A\mathbf{A} much faster than the classical algorithm.

Algorithm

Given ARm×n\mathbf{A} \in \mathbb{R}^{m \times n} and target rank kk:

Stage 1: Find an approximate basis for the column space

  1. Draw a random matrix ΩRn×(k+p)\boldsymbol{\Omega} \in \mathbb{R}^{n \times (k + p)} (Gaussian or structured), where pp is a small oversampling parameter (typically p=5p = 5 or 1010).
  2. Form the sketch Y=AΩRm×(k+p)\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega} \in \mathbb{R}^{m \times (k+p)}.
  3. Compute an orthonormal basis Q\mathbf{Q} for the column space of Y\mathbf{Y} via QR: Y=QR\mathbf{Y} = \mathbf{Q}\mathbf{R}.

Stage 2: Form and decompose the small matrix

  1. Form B=QTAR(k+p)×n\mathbf{B} = \mathbf{Q}^T\mathbf{A} \in \mathbb{R}^{(k+p) \times n} — a small matrix.
  2. Compute the SVD of B\mathbf{B}: B=U^ΣVT\mathbf{B} = \hat{\mathbf{U}}\boldsymbol{\Sigma}\mathbf{V}^T.
  3. Recover the left singular vectors: U=QU^\mathbf{U} = \mathbf{Q}\hat{\mathbf{U}}.

The approximate SVD is AUΣVT\mathbf{A} \approx \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T.

Complexity Comparison

MethodCostNotes
Full SVD (LAPACK)O(mnmin(m,n))O(mn \min(m,n))Exact, all singular values
Truncated SVD (Lanczos)O(mnk)O(mnk)Iterative, top kk values
Randomized SVDO(mn(k+p))+O((k+p)2n)O(mn(k+p)) + O((k+p)^2 n)One pass + small SVD
Randomized SVD (single pass)O(mn(k+p))O(mn(k+p))For streaming data

For a 100,000×50,000100{,}000 \times 50{,}000 matrix with k=50k = 50:

  • Full SVD: ~101410^{14} operations
  • Randomized SVD: ~2.5×10112.5 \times 10^{11} operations (400x faster)

Power Iteration for Better Accuracy

When singular values decay slowly, the basic sketch Y=AΩ\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega} may not capture the top-kk subspace well. Power iteration fixes this:

Replace step 2 with:

Y=(AAT)qAΩ\mathbf{Y} = (\mathbf{A}\mathbf{A}^T)^q \mathbf{A}\boldsymbol{\Omega}

This amplifies the gap between the top-kk and remaining singular values by raising them to the (2q+1)(2q+1)-th power. Even q=1q = 1 or q=2q = 2 dramatically improves accuracy.

In practice, alternate between multiplying by A\mathbf{A} and AT\mathbf{A}^T with intermediate QR factorizations for numerical stability.

import numpy as np

def randomized_svd(A, k, p=5, q=2):
    """Randomized SVD with power iteration."""
    m, n = A.shape

    # Stage 1: Random sketch with power iteration
    Omega = np.random.randn(n, k + p)
    Y = A @ Omega

    for _ in range(q):
        Y = A @ (A.T @ Y)
        Y, _ = np.linalg.qr(Y)  # Stabilize
        Y = A @ (A.T @ Y)       # One more pass

    Q, _ = np.linalg.qr(Y)

    # Stage 2: Small SVD
    B = Q.T @ A
    U_hat, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat

    return U[:, :k], sigma[:k], Vt[:k, :]

# Test
A = np.random.randn(10000, 5000)
U, sigma, Vt = randomized_svd(A, k=20)
A_approx = U @ np.diag(sigma) @ Vt

error = np.linalg.norm(A - A_approx, 'fro') / np.linalg.norm(A, 'fro')
print(f"Relative error: {error:.4f}")

Scikit-Learn Integration

Scikit-learn uses randomized SVD internally for many algorithms:

from sklearn.decomposition import TruncatedSVD, PCA

# TruncatedSVD uses randomized algorithm by default
svd = TruncatedSVD(n_components=50, algorithm='randomized')
X_reduced = svd.fit_transform(X_sparse)

# PCA also uses randomized SVD for large datasets
pca = PCA(n_components=50, svd_solver='randomized')
X_pca = pca.fit_transform(X_dense)

Matrix Sketching

Sketching compresses a matrix into a much smaller representation that preserves key properties. Beyond random projection, several sketching techniques exist:

Count Sketch

The Count Sketch hashes each row to a random bucket with a random sign:

  1. Choose hash function h:[m][s]h: [m] \to [s] (mapping rows to ss buckets)
  2. Choose sign function σ:[m]{1,+1}\sigma: [m] \to \{-1, +1\}
  3. Sketch row ii by adding σ(i)ai\sigma(i) \cdot \mathbf{a}_i to bucket h(i)h(i)

The result is an s×ns \times n matrix. This is extremely fast — O(nnz(A))O(\text{nnz}(\mathbf{A})) — and works well with sparse matrices.

Frequent Directions

The Frequent Directions algorithm maintains a low-rank sketch of a matrix seen in a stream:

  1. Maintain a sketch BRl×n\mathbf{B} \in \mathbb{R}^{l \times n} (initially zero)
  2. For each new row aiT\mathbf{a}_i^T: insert into an empty row of B\mathbf{B}
  3. When B\mathbf{B} is full: compute SVD, shrink singular values, and discard small ones

This gives a streaming approximation to the covariance matrix without storing the full data.

Random Features for Kernel Approximation

Random Fourier Features

Computing with kernel matrices costs O(n2)O(n^2) in memory and O(n3)O(n^3) for solving. Random Fourier Features (RFF) approximate the RBF kernel using random projections:

k(x,y)=exp(xy22σ2)z(x)Tz(y)k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right) \approx \mathbf{z}(\mathbf{x})^T \mathbf{z}(\mathbf{y})

where z(x)=2Dcos(Wx+b)\mathbf{z}(\mathbf{x}) = \sqrt{\frac{2}{D}}\cos(\mathbf{W}\mathbf{x} + \mathbf{b}) with random WN(0,σ2I)\mathbf{W} \sim \mathcal{N}(0, \sigma^{-2}\mathbf{I}) and bUniform(0,2π)\mathbf{b} \sim \text{Uniform}(0, 2\pi).

This transforms a kernel problem into a linear problem in DD-dimensional random feature space, reducing cost from O(n3)O(n^3) to O(nD2)O(nD^2).

from sklearn.kernel_approximation import RBFSampler

# Approximate RBF kernel with random features
rbf = RBFSampler(n_components=500, gamma=1.0)
X_features = rbf.fit_transform(X)

# Now use linear model instead of kernel SVM
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
clf.fit(X_features, y)

Key insight: Random Fourier features turn an expensive nonlinear problem (kernel SVM) into a cheap linear problem. The approximation quality improves with more random features DD, but even D=500D = 500 often suffices for good accuracy.

Randomized Least Squares

For overdetermined least squares minxAxb2\min_\mathbf{x} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2 with ARm×n\mathbf{A} \in \mathbb{R}^{m \times n} (mnm \gg n):

Sketch-and-Solve

  1. Generate sketching matrix SRs×m\mathbf{S} \in \mathbb{R}^{s \times m} with sms \ll m
  2. Form SA\mathbf{S}\mathbf{A} and Sb\mathbf{S}\mathbf{b}
  3. Solve the smaller problem minxSAxSb2\min_\mathbf{x} \|\mathbf{S}\mathbf{A}\mathbf{x} - \mathbf{S}\mathbf{b}\|_2

The sketched system has only ss rows instead of mm, making it much cheaper. With s=O(n/ϵ2)s = O(n/\epsilon^2), the solution is within (1+ϵ)(1 + \epsilon) of optimal.

Iterative Sketching

For higher accuracy, use the sketch as a preconditioner:

  1. Compute R\mathbf{R} from QR of SA\mathbf{S}\mathbf{A} (cheap, since SA\mathbf{S}\mathbf{A} is small)
  2. Solve the preconditioned system AR1zb\mathbf{A}\mathbf{R}^{-1}\mathbf{z} \approx \mathbf{b} via iterative methods
  3. Recover x=R1z\mathbf{x} = \mathbf{R}^{-1}\mathbf{z}

The preconditioner R\mathbf{R} makes the system well-conditioned, so iterative methods converge in O(log(1/ϵ))O(\log(1/\epsilon)) iterations.

Theoretical Guarantees

Randomized algorithms come with provable bounds. For the randomized SVD with oversampling pp and power iteration parameter qq:

EAQQTAF(1+kp1)1/(2q+1)(j>kσj2)1/2\mathbb{E}\|\mathbf{A} - \mathbf{Q}\mathbf{Q}^T\mathbf{A}\|_F \leq \left(1 + \sqrt{\frac{k}{p-1}}\right)^{1/(2q+1)} \cdot \left(\sum_{j > k} \sigma_j^2\right)^{1/2}

The term (j>kσj2)1/2\left(\sum_{j>k} \sigma_j^2\right)^{1/2} is the best possible error (from the Eckart-Young theorem). The factor in front approaches 1 as pp and qq increase — even small values (p=5,q=2p = 5, q = 2) give near-optimal results.

Key insight: Randomized algorithms are not approximate heuristics — they achieve near-optimal accuracy with high probability. The theory provides explicit error bounds and failure probabilities that can be made arbitrarily small.

Worked Example: Randomized PCA on Large Data

Compare classical and randomized PCA on a large dataset:

import numpy as np
from sklearn.decomposition import PCA
import time

# Large dataset: 50K samples, 10K features
np.random.seed(42)
X = np.random.randn(50000, 10000)

# Classical PCA (full SVD)
start = time.time()
pca_full = PCA(n_components=50, svd_solver='full')
X_full = pca_full.fit_transform(X)
time_full = time.time() - start

# Randomized PCA
start = time.time()
pca_rand = PCA(n_components=50, svd_solver='randomized')
X_rand = pca_rand.fit_transform(X)
time_rand = time.time() - start

print(f"Full SVD:       {time_full:.1f}s")
print(f"Randomized SVD: {time_rand:.1f}s")
print(f"Speedup: {time_full/time_rand:.1f}x")

# Compare explained variance
var_diff = np.abs(pca_full.explained_variance_ratio_ - pca_rand.explained_variance_ratio_)
print(f"Max variance difference: {var_diff.max():.6f}")

When to Use Randomized Methods

SituationRecommended Method
Low-rank SVD of large dense matrixRandomized SVD
PCA on high-dimensional dataRandomized PCA (scikit-learn default)
Approximate nearest neighborsRandom projections + exact search
Large-scale kernel SVMRandom Fourier features
Streaming data, one passFrequent Directions / streaming sketch
Massive least squares (mnm \gg n)Sketch-and-solve or sketched preconditioner
Small/medium matrices (n<5000n < 5000)Classical algorithms (overhead not worth it)

Why This Matters for ML

  • Scalable PCA: Scikit-learn uses randomized SVD by default for PCA when n>500n > 500. This makes PCA practical on datasets with thousands or millions of features.
  • Recommendation systems: Randomized SVD enables matrix factorization on Netflix-scale data (480K users × 18K movies).
  • NLP embeddings: GloVe and similar methods use randomized methods to factorize massive co-occurrence matrices.
  • Approximate nearest neighbors: Libraries like Annoy, FAISS, and ScaNN use random projections as part of their indexing.
  • Kernel approximation: Random Fourier features make kernel methods scalable to millions of data points.
  • Streaming ML: Sketching algorithms process data in one pass with bounded memory, enabling online learning on massive streams.

Summary

  • Randomized linear algebra trades small, controlled accuracy loss for large speedups on massive problems.
  • The Johnson-Lindenstrauss lemma guarantees that random projections preserve pairwise distances in O(logn)O(\log n) dimensions.
  • Randomized SVD computes rank-kk approximations in O(mnk)O(mnk) instead of O(mnmin(m,n))O(mn\min(m,n)), with near-optimal accuracy.
  • Power iteration improves randomized SVD accuracy when singular values decay slowly.
  • Matrix sketching (Count Sketch, Frequent Directions) enables streaming and single-pass algorithms.
  • Random Fourier features turn expensive kernel methods into fast linear methods.
  • These algorithms have rigorous theoretical guarantees — they are not heuristics.
  • This completes our linear algebra series. The tools from vectors through randomized algorithms form the mathematical backbone of modern machine learning.

References

  • Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2), 217-288. arXiv:0909.4061
  • Woodruff, D. P. (2014). Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2), 1-157.
  • Martinsson, P. G., & Tropp, J. A. (2020). Randomized Numerical Linear Algebra: Foundations and Algorithms. Acta Numerica, 29, 403-572. arXiv:2002.01387
  • Rahimi, A., & Recht, B. (2007). Random Features for Large-Scale Kernel Machines. NeurIPS. papers.nips.cc
  • Scikit-learn Documentation. Random Projection. scikit-learn.org

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay