- 01 Vectors and Vector Spaces: The Language of Data 02 Matrices and Matrix Operations: Organizing Linear Computation 03 Systems of Linear Equations: From Geometry to Algorithms 04 Determinants: The Volume Factor of Linear Maps 05 Linear Transformations: Matrices as Functions 06 Inner Products, Norms, and Orthogonality: Measuring Geometry 07 Eigenvalues and Eigenvectors: The DNA of a Matrix 08 Matrix Decompositions: Breaking Matrices into Simpler Pieces 09 Linear Algebra in Machine Learning: Putting It All Together 10 Matrix Calculus: Derivatives for Machine Learning 11 Tensor Operations: Beyond Matrices 12 Sparse Matrices and Efficient Computation 13 Randomized Linear Algebra: Speed Through Randomness
Introduction
Classical linear algebra algorithms are exact but slow: SVD costs , eigendecomposition costs , and solving dense systems costs . When matrices have millions of rows and columns — which is routine in modern ML — these costs become prohibitive.
Randomized linear algebra trades a small, controlled amount of accuracy for dramatic speed improvements. The core idea is simple: use random sampling or projection to reduce a large problem to a smaller one, solve the small problem exactly, then lift the solution back. The theory guarantees that the error is small with high probability.
This article builds on matrix decompositions and sparse matrices, completing our linear algebra series with the tools that make large-scale ML computationally feasible.
The Core Idea: Dimensionality Reduction via Randomness
Given a large matrix , we want to compute its SVD, low-rank approximation, or solve a system involving it. The randomized approach:
- Sketch: Multiply by a random matrix to get a much smaller matrix .
- Solve: Perform the expensive computation on (which is small).
- Recover: Reconstruct the answer for the original .
The random matrix acts as a “sketch” — it compresses while preserving its essential structure (the top singular vectors and values).
Key insight: Randomness is not a weakness here — it is a tool. Random projections preserve geometric structure (distances, angles, subspaces) with high probability. The algorithms come with rigorous error bounds, not just empirical heuristics.
Random Projections
The Johnson-Lindenstrauss Lemma
The Johnson-Lindenstrauss (JL) lemma is the theoretical foundation of random projections. It states:
For any set of points in and any , there exists a linear map with such that for all pairs of points :
All pairwise distances are preserved to within a factor of .
The remarkable fact: the target dimension depends only on and , not on the original dimension . You can project from a million dimensions to a few hundred and preserve all pairwise distances.
Constructing Random Projection Matrices
The projection where has random entries. Several choices work:
| Matrix Type | Entries | Density | Speed |
|---|---|---|---|
| Gaussian | Dense | per vector | |
| Rademacher | uniform | Dense | , no multiplications |
| Sparse Rademacher | density | ||
| SRHT (subsampled randomized Hadamard) | Structured | Implicit |
The sparse and structured variants are faster because the projection itself costs less — critical when is large.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection
# Original high-dimensional data
X = np.random.randn(10000, 5000) # 10K points in 5000D
# Project to 200 dimensions (preserving distances)
projector = GaussianRandomProjection(n_components=200)
X_proj = projector.fit_transform(X)
print(f"Original shape: {X.shape}")
print(f"Projected shape: {X_proj.shape}")
# Verify distance preservation
from scipy.spatial.distance import pdist
d_orig = pdist(X[:100])
d_proj = pdist(X_proj[:100])
ratio = d_proj / d_orig
print(f"Distance ratio — mean: {ratio.mean():.3f}, std: {ratio.std():.3f}")
# Expect mean ≈ 1.0, small std
Applications of Random Projections
- Approximate nearest neighbors: Project data to lower dimension, then run exact search. Much faster than searching in the original space.
- Streaming algorithms: Process data points one at a time, maintaining a sketch. No need to store the full dataset.
- Privacy: Random projections can preserve utility while obscuring individual data points (differential privacy connections).
- Speeding up kernel methods: Random Fourier features approximate RBF kernels via random projections.
Randomized SVD
The randomized SVD is the most important algorithm in this article. It computes an approximate rank- SVD of a large matrix much faster than the classical algorithm.
Algorithm
Given and target rank :
Stage 1: Find an approximate basis for the column space
- Draw a random matrix (Gaussian or structured), where is a small oversampling parameter (typically or ).
- Form the sketch .
- Compute an orthonormal basis for the column space of via QR: .
Stage 2: Form and decompose the small matrix
- Form — a small matrix.
- Compute the SVD of : .
- Recover the left singular vectors: .
The approximate SVD is .
Complexity Comparison
| Method | Cost | Notes |
|---|---|---|
| Full SVD (LAPACK) | Exact, all singular values | |
| Truncated SVD (Lanczos) | Iterative, top values | |
| Randomized SVD | One pass + small SVD | |
| Randomized SVD (single pass) | For streaming data |
For a matrix with :
- Full SVD: ~ operations
- Randomized SVD: ~ operations (400x faster)
Power Iteration for Better Accuracy
When singular values decay slowly, the basic sketch may not capture the top- subspace well. Power iteration fixes this:
Replace step 2 with:
This amplifies the gap between the top- and remaining singular values by raising them to the -th power. Even or dramatically improves accuracy.
In practice, alternate between multiplying by and with intermediate QR factorizations for numerical stability.
import numpy as np
def randomized_svd(A, k, p=5, q=2):
"""Randomized SVD with power iteration."""
m, n = A.shape
# Stage 1: Random sketch with power iteration
Omega = np.random.randn(n, k + p)
Y = A @ Omega
for _ in range(q):
Y = A @ (A.T @ Y)
Y, _ = np.linalg.qr(Y) # Stabilize
Y = A @ (A.T @ Y) # One more pass
Q, _ = np.linalg.qr(Y)
# Stage 2: Small SVD
B = Q.T @ A
U_hat, sigma, Vt = np.linalg.svd(B, full_matrices=False)
U = Q @ U_hat
return U[:, :k], sigma[:k], Vt[:k, :]
# Test
A = np.random.randn(10000, 5000)
U, sigma, Vt = randomized_svd(A, k=20)
A_approx = U @ np.diag(sigma) @ Vt
error = np.linalg.norm(A - A_approx, 'fro') / np.linalg.norm(A, 'fro')
print(f"Relative error: {error:.4f}")
Scikit-Learn Integration
Scikit-learn uses randomized SVD internally for many algorithms:
from sklearn.decomposition import TruncatedSVD, PCA
# TruncatedSVD uses randomized algorithm by default
svd = TruncatedSVD(n_components=50, algorithm='randomized')
X_reduced = svd.fit_transform(X_sparse)
# PCA also uses randomized SVD for large datasets
pca = PCA(n_components=50, svd_solver='randomized')
X_pca = pca.fit_transform(X_dense)
Matrix Sketching
Sketching compresses a matrix into a much smaller representation that preserves key properties. Beyond random projection, several sketching techniques exist:
Count Sketch
The Count Sketch hashes each row to a random bucket with a random sign:
- Choose hash function (mapping rows to buckets)
- Choose sign function
- Sketch row by adding to bucket
The result is an matrix. This is extremely fast — — and works well with sparse matrices.
Frequent Directions
The Frequent Directions algorithm maintains a low-rank sketch of a matrix seen in a stream:
- Maintain a sketch (initially zero)
- For each new row : insert into an empty row of
- When is full: compute SVD, shrink singular values, and discard small ones
This gives a streaming approximation to the covariance matrix without storing the full data.
Random Features for Kernel Approximation
Random Fourier Features
Computing with kernel matrices costs in memory and for solving. Random Fourier Features (RFF) approximate the RBF kernel using random projections:
where with random and .
This transforms a kernel problem into a linear problem in -dimensional random feature space, reducing cost from to .
from sklearn.kernel_approximation import RBFSampler
# Approximate RBF kernel with random features
rbf = RBFSampler(n_components=500, gamma=1.0)
X_features = rbf.fit_transform(X)
# Now use linear model instead of kernel SVM
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
clf.fit(X_features, y)
Key insight: Random Fourier features turn an expensive nonlinear problem (kernel SVM) into a cheap linear problem. The approximation quality improves with more random features , but even often suffices for good accuracy.
Randomized Least Squares
For overdetermined least squares with ():
Sketch-and-Solve
- Generate sketching matrix with
- Form and
- Solve the smaller problem
The sketched system has only rows instead of , making it much cheaper. With , the solution is within of optimal.
Iterative Sketching
For higher accuracy, use the sketch as a preconditioner:
- Compute from QR of (cheap, since is small)
- Solve the preconditioned system via iterative methods
- Recover
The preconditioner makes the system well-conditioned, so iterative methods converge in iterations.
Theoretical Guarantees
Randomized algorithms come with provable bounds. For the randomized SVD with oversampling and power iteration parameter :
The term is the best possible error (from the Eckart-Young theorem). The factor in front approaches 1 as and increase — even small values () give near-optimal results.
Key insight: Randomized algorithms are not approximate heuristics — they achieve near-optimal accuracy with high probability. The theory provides explicit error bounds and failure probabilities that can be made arbitrarily small.
Worked Example: Randomized PCA on Large Data
Compare classical and randomized PCA on a large dataset:
import numpy as np
from sklearn.decomposition import PCA
import time
# Large dataset: 50K samples, 10K features
np.random.seed(42)
X = np.random.randn(50000, 10000)
# Classical PCA (full SVD)
start = time.time()
pca_full = PCA(n_components=50, svd_solver='full')
X_full = pca_full.fit_transform(X)
time_full = time.time() - start
# Randomized PCA
start = time.time()
pca_rand = PCA(n_components=50, svd_solver='randomized')
X_rand = pca_rand.fit_transform(X)
time_rand = time.time() - start
print(f"Full SVD: {time_full:.1f}s")
print(f"Randomized SVD: {time_rand:.1f}s")
print(f"Speedup: {time_full/time_rand:.1f}x")
# Compare explained variance
var_diff = np.abs(pca_full.explained_variance_ratio_ - pca_rand.explained_variance_ratio_)
print(f"Max variance difference: {var_diff.max():.6f}")
When to Use Randomized Methods
| Situation | Recommended Method |
|---|---|
| Low-rank SVD of large dense matrix | Randomized SVD |
| PCA on high-dimensional data | Randomized PCA (scikit-learn default) |
| Approximate nearest neighbors | Random projections + exact search |
| Large-scale kernel SVM | Random Fourier features |
| Streaming data, one pass | Frequent Directions / streaming sketch |
| Massive least squares () | Sketch-and-solve or sketched preconditioner |
| Small/medium matrices () | Classical algorithms (overhead not worth it) |
Why This Matters for ML
- Scalable PCA: Scikit-learn uses randomized SVD by default for PCA when . This makes PCA practical on datasets with thousands or millions of features.
- Recommendation systems: Randomized SVD enables matrix factorization on Netflix-scale data (480K users × 18K movies).
- NLP embeddings: GloVe and similar methods use randomized methods to factorize massive co-occurrence matrices.
- Approximate nearest neighbors: Libraries like Annoy, FAISS, and ScaNN use random projections as part of their indexing.
- Kernel approximation: Random Fourier features make kernel methods scalable to millions of data points.
- Streaming ML: Sketching algorithms process data in one pass with bounded memory, enabling online learning on massive streams.
Summary
- Randomized linear algebra trades small, controlled accuracy loss for large speedups on massive problems.
- The Johnson-Lindenstrauss lemma guarantees that random projections preserve pairwise distances in dimensions.
- Randomized SVD computes rank- approximations in instead of , with near-optimal accuracy.
- Power iteration improves randomized SVD accuracy when singular values decay slowly.
- Matrix sketching (Count Sketch, Frequent Directions) enables streaming and single-pass algorithms.
- Random Fourier features turn expensive kernel methods into fast linear methods.
- These algorithms have rigorous theoretical guarantees — they are not heuristics.
- This completes our linear algebra series. The tools from vectors through randomized algorithms form the mathematical backbone of modern machine learning.
References
- Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2), 217-288. arXiv:0909.4061
- Woodruff, D. P. (2014). Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2), 1-157.
- Martinsson, P. G., & Tropp, J. A. (2020). Randomized Numerical Linear Algebra: Foundations and Algorithms. Acta Numerica, 29, 403-572. arXiv:2002.01387
- Rahimi, A., & Recht, B. (2007). Random Features for Large-Scale Kernel Machines. NeurIPS. papers.nips.cc
- Scikit-learn Documentation. Random Projection. scikit-learn.org