Randomized Linear Algebra: Speed Through Randomness

Linear Algebra Series 13 / 13

Introduction

Classical linear algebra algorithms are exact but slow: SVD costs $O(mn \min(m,n))$ , eigendecomposition costs $O(n^3)$ , and solving dense systems costs $O(n^3)$ . When matrices have millions of rows and columns — which is routine in modern ML — these costs become prohibitive.

Randomized linear algebra trades a small, controlled amount of accuracy for dramatic speed improvements. The core idea is simple: use random sampling or projection to reduce a large problem to a smaller one, solve the small problem exactly, then lift the solution back. The theory guarantees that the error is small with high probability.

This article builds on matrix decompositions and sparse matrices, completing our linear algebra series with the tools that make large-scale ML computationally feasible.

The Core Idea: Dimensionality Reduction via Randomness

Given a large matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ , we want to compute its SVD, low-rank approximation, or solve a system involving it. The randomized approach:

Sketch: Multiply $\mathbf{A}$ by a random matrix $\boldsymbol{\Omega}$ to get a much smaller matrix $\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega}$ .
Solve: Perform the expensive computation on $\mathbf{Y}$ (which is small).
Recover: Reconstruct the answer for the original $\mathbf{A}$ .

The random matrix $\boldsymbol{\Omega}$ acts as a “sketch” — it compresses $\mathbf{A}$ while preserving its essential structure (the top singular vectors and values).

Key insight: Randomness is not a weakness here — it is a tool. Random projections preserve geometric structure (distances, angles, subspaces) with high probability. The algorithms come with rigorous error bounds, not just empirical heuristics.

Random Projections

The Johnson-Lindenstrauss Lemma

The Johnson-Lindenstrauss (JL) lemma is the theoretical foundation of random projections. It states:

For any set of $n$ points in $\mathbb{R}^d$ and any $0 < \epsilon < 1$ , there exists a linear map $f: \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that for all pairs of points $\mathbf{u}, \mathbf{v}$ :

(1 - \epsilon)\|\mathbf{u} - \mathbf{v}\|^2 \leq \|f(\mathbf{u}) - f(\mathbf{v})\|^2 \leq (1 + \epsilon)\|\mathbf{u} - \mathbf{v}\|^2

All pairwise distances are preserved to within a factor of $(1 \pm \epsilon)$ .

The remarkable fact: the target dimension $k$ depends only on $\log n$ and $\epsilon$ , not on the original dimension $d$ . You can project from a million dimensions to a few hundred and preserve all pairwise distances.

Constructing Random Projection Matrices

The projection $f(\mathbf{x}) = \frac{1}{\sqrt{k}}\mathbf{R}\mathbf{x}$ where $\mathbf{R} \in \mathbb{R}^{k \times d}$ has random entries. Several choices work:

Matrix Type	Entries	Density	Speed
Gaussian	$R_{ij} \sim \mathcal{N}(0, 1)$	Dense	$O(kd)$ per vector
Rademacher	$R_{ij} \in \{-1, +1\}$ uniform	Dense	$O(kd)$ , no multiplications
Sparse Rademacher	$R_{ij} \in \{-\sqrt{s}, 0, +\sqrt{s}\}$	$1/s$ density	$O(kd/s)$
SRHT (subsampled randomized Hadamard)	Structured	Implicit	$O(d \log k)$

The sparse and structured variants are faster because the projection itself costs less — critical when $d$ is large.

import numpy as np
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

# Original high-dimensional data
X = np.random.randn(10000, 5000)  # 10K points in 5000D

# Project to 200 dimensions (preserving distances)
projector = GaussianRandomProjection(n_components=200)
X_proj = projector.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Projected shape: {X_proj.shape}")

# Verify distance preservation
from scipy.spatial.distance import pdist
d_orig = pdist(X[:100])
d_proj = pdist(X_proj[:100])

ratio = d_proj / d_orig
print(f"Distance ratio — mean: {ratio.mean():.3f}, std: {ratio.std():.3f}")
# Expect mean ≈ 1.0, small std

Applications of Random Projections

Approximate nearest neighbors: Project data to lower dimension, then run exact search. Much faster than searching in the original space.
Streaming algorithms: Process data points one at a time, maintaining a sketch. No need to store the full dataset.
Privacy: Random projections can preserve utility while obscuring individual data points (differential privacy connections).
Speeding up kernel methods: Random Fourier features approximate RBF kernels via random projections.

Randomized SVD

The randomized SVD is the most important algorithm in this article. It computes an approximate rank- $k$ SVD of a large matrix $\mathbf{A}$ much faster than the classical algorithm.

Algorithm

Given $\mathbf{A} \in \mathbb{R}^{m \times n}$ and target rank $k$ :

Stage 1: Find an approximate basis for the column space

Draw a random matrix $\boldsymbol{\Omega} \in \mathbb{R}^{n \times (k + p)}$ (Gaussian or structured), where $p$ is a small oversampling parameter (typically $p = 5$ or $10$ ).
Form the sketch $\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega} \in \mathbb{R}^{m \times (k+p)}$ .
Compute an orthonormal basis $\mathbf{Q}$ for the column space of $\mathbf{Y}$ via QR: $\mathbf{Y} = \mathbf{Q}\mathbf{R}$ .

Stage 2: Form and decompose the small matrix

Form $\mathbf{B} = \mathbf{Q}^T\mathbf{A} \in \mathbb{R}^{(k+p) \times n}$ — a small matrix.
Compute the SVD of $\mathbf{B}$ : $\mathbf{B} = \hat{\mathbf{U}}\boldsymbol{\Sigma}\mathbf{V}^T$ .
Recover the left singular vectors: $\mathbf{U} = \mathbf{Q}\hat{\mathbf{U}}$ .

The approximate SVD is $\mathbf{A} \approx \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^T$ .

Complexity Comparison

Method	Cost	Notes
Full SVD (LAPACK)	$O(mn \min(m,n))$	Exact, all singular values
Truncated SVD (Lanczos)	$O(mnk)$	Iterative, top $k$ values
Randomized SVD	$O(mn(k+p)) + O((k+p)^2 n)$	One pass + small SVD
Randomized SVD (single pass)	$O(mn(k+p))$	For streaming data

For a $100{,}000 \times 50{,}000$ matrix with $k = 50$ :

Full SVD: ~ $10^{14}$ operations
Randomized SVD: ~ $2.5 \times 10^{11}$ operations (400x faster)

Power Iteration for Better Accuracy

When singular values decay slowly, the basic sketch $\mathbf{Y} = \mathbf{A}\boldsymbol{\Omega}$ may not capture the top- $k$ subspace well. Power iteration fixes this:

Replace step 2 with:

\mathbf{Y} = (\mathbf{A}\mathbf{A}^T)^q \mathbf{A}\boldsymbol{\Omega}

This amplifies the gap between the top- $k$ and remaining singular values by raising them to the $(2q+1)$ -th power. Even $q = 1$ or $q = 2$ dramatically improves accuracy.

In practice, alternate between multiplying by $\mathbf{A}$ and $\mathbf{A}^T$ with intermediate QR factorizations for numerical stability.

import numpy as np

def randomized_svd(A, k, p=5, q=2):
    """Randomized SVD with power iteration."""
    m, n = A.shape

    # Stage 1: Random sketch with power iteration
    Omega = np.random.randn(n, k + p)
    Y = A @ Omega

    for _ in range(q):
        Y = A @ (A.T @ Y)
        Y, _ = np.linalg.qr(Y)  # Stabilize
        Y = A @ (A.T @ Y)       # One more pass

    Q, _ = np.linalg.qr(Y)

    # Stage 2: Small SVD
    B = Q.T @ A
    U_hat, sigma, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_hat

    return U[:, :k], sigma[:k], Vt[:k, :]

# Test
A = np.random.randn(10000, 5000)
U, sigma, Vt = randomized_svd(A, k=20)
A_approx = U @ np.diag(sigma) @ Vt

error = np.linalg.norm(A - A_approx, 'fro') / np.linalg.norm(A, 'fro')
print(f"Relative error: {error:.4f}")

Scikit-Learn Integration

Scikit-learn uses randomized SVD internally for many algorithms:

from sklearn.decomposition import TruncatedSVD, PCA

# TruncatedSVD uses randomized algorithm by default
svd = TruncatedSVD(n_components=50, algorithm='randomized')
X_reduced = svd.fit_transform(X_sparse)

# PCA also uses randomized SVD for large datasets
pca = PCA(n_components=50, svd_solver='randomized')
X_pca = pca.fit_transform(X_dense)

Matrix Sketching

Sketching compresses a matrix into a much smaller representation that preserves key properties. Beyond random projection, several sketching techniques exist:

Count Sketch

The Count Sketch hashes each row to a random bucket with a random sign:

Choose hash function $h: [m] \to [s]$ (mapping rows to $s$ buckets)
Choose sign function $\sigma: [m] \to \{-1, +1\}$
Sketch row $i$ by adding $\sigma(i) \cdot \mathbf{a}_i$ to bucket $h(i)$

The result is an $s \times n$ matrix. This is extremely fast — $O(\text{nnz}(\mathbf{A}))$ — and works well with sparse matrices.

Frequent Directions

The Frequent Directions algorithm maintains a low-rank sketch of a matrix seen in a stream:

Maintain a sketch $\mathbf{B} \in \mathbb{R}^{l \times n}$ (initially zero)
For each new row $\mathbf{a}_i^T$ : insert into an empty row of $\mathbf{B}$
When $\mathbf{B}$ is full: compute SVD, shrink singular values, and discard small ones

This gives a streaming approximation to the covariance matrix without storing the full data.

Random Features for Kernel Approximation

Random Fourier Features

Computing with kernel matrices costs $O(n^2)$ in memory and $O(n^3)$ for solving. Random Fourier Features (RFF) approximate the RBF kernel using random projections:

k(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{y}\|^2}{2\sigma^2}\right) \approx \mathbf{z}(\mathbf{x})^T \mathbf{z}(\mathbf{y})

where $\mathbf{z}(\mathbf{x}) = \sqrt{\frac{2}{D}}\cos(\mathbf{W}\mathbf{x} + \mathbf{b})$ with random $\mathbf{W} \sim \mathcal{N}(0, \sigma^{-2}\mathbf{I})$ and $\mathbf{b} \sim \text{Uniform}(0, 2\pi)$ .

This transforms a kernel problem into a linear problem in $D$ -dimensional random feature space, reducing cost from $O(n^3)$ to $O(nD^2)$ .

from sklearn.kernel_approximation import RBFSampler

# Approximate RBF kernel with random features
rbf = RBFSampler(n_components=500, gamma=1.0)
X_features = rbf.fit_transform(X)

# Now use linear model instead of kernel SVM
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
clf.fit(X_features, y)

Key insight: Random Fourier features turn an expensive nonlinear problem (kernel SVM) into a cheap linear problem. The approximation quality improves with more random features $D$ , but even $D = 500$ often suffices for good accuracy.

Randomized Least Squares

For overdetermined least squares $\min_\mathbf{x} \|\mathbf{A}\mathbf{x} - \mathbf{b}\|_2$ with $\mathbf{A} \in \mathbb{R}^{m \times n}$ ( $m \gg n$ ):

Sketch-and-Solve

Generate sketching matrix $\mathbf{S} \in \mathbb{R}^{s \times m}$ with $s \ll m$
Form $\mathbf{S}\mathbf{A}$ and $\mathbf{S}\mathbf{b}$
Solve the smaller problem $\min_\mathbf{x} \|\mathbf{S}\mathbf{A}\mathbf{x} - \mathbf{S}\mathbf{b}\|_2$

The sketched system has only $s$ rows instead of $m$ , making it much cheaper. With $s = O(n/\epsilon^2)$ , the solution is within $(1 + \epsilon)$ of optimal.

Iterative Sketching

For higher accuracy, use the sketch as a preconditioner:

Compute $\mathbf{R}$ from QR of $\mathbf{S}\mathbf{A}$ (cheap, since $\mathbf{S}\mathbf{A}$ is small)
Solve the preconditioned system $\mathbf{A}\mathbf{R}^{-1}\mathbf{z} \approx \mathbf{b}$ via iterative methods
Recover $\mathbf{x} = \mathbf{R}^{-1}\mathbf{z}$

The preconditioner $\mathbf{R}$ makes the system well-conditioned, so iterative methods converge in $O(\log(1/\epsilon))$ iterations.

Theoretical Guarantees

Randomized algorithms come with provable bounds. For the randomized SVD with oversampling $p$ and power iteration parameter $q$ :

\mathbb{E}\|\mathbf{A} - \mathbf{Q}\mathbf{Q}^T\mathbf{A}\|_F \leq \left(1 + \sqrt{\frac{k}{p-1}}\right)^{1/(2q+1)} \cdot \left(\sum_{j > k} \sigma_j^2\right)^{1/2}

The term $\left(\sum_{j>k} \sigma_j^2\right)^{1/2}$ is the best possible error (from the Eckart-Young theorem). The factor in front approaches 1 as $p$ and $q$ increase — even small values ( $p = 5, q = 2$ ) give near-optimal results.

Key insight: Randomized algorithms are not approximate heuristics — they achieve near-optimal accuracy with high probability. The theory provides explicit error bounds and failure probabilities that can be made arbitrarily small.

Worked Example: Randomized PCA on Large Data

Compare classical and randomized PCA on a large dataset:

import numpy as np
from sklearn.decomposition import PCA
import time

# Large dataset: 50K samples, 10K features
np.random.seed(42)
X = np.random.randn(50000, 10000)

# Classical PCA (full SVD)
start = time.time()
pca_full = PCA(n_components=50, svd_solver='full')
X_full = pca_full.fit_transform(X)
time_full = time.time() - start

# Randomized PCA
start = time.time()
pca_rand = PCA(n_components=50, svd_solver='randomized')
X_rand = pca_rand.fit_transform(X)
time_rand = time.time() - start

print(f"Full SVD:       {time_full:.1f}s")
print(f"Randomized SVD: {time_rand:.1f}s")
print(f"Speedup: {time_full/time_rand:.1f}x")

# Compare explained variance
var_diff = np.abs(pca_full.explained_variance_ratio_ - pca_rand.explained_variance_ratio_)
print(f"Max variance difference: {var_diff.max():.6f}")

When to Use Randomized Methods

Situation	Recommended Method
Low-rank SVD of large dense matrix	Randomized SVD
PCA on high-dimensional data	Randomized PCA (scikit-learn default)
Approximate nearest neighbors	Random projections + exact search
Large-scale kernel SVM	Random Fourier features
Streaming data, one pass	Frequent Directions / streaming sketch
Massive least squares ( $m \gg n$ )	Sketch-and-solve or sketched preconditioner
Small/medium matrices ( $n < 5000$ )	Classical algorithms (overhead not worth it)

Why This Matters for ML

Scalable PCA: Scikit-learn uses randomized SVD by default for PCA when $n > 500$ . This makes PCA practical on datasets with thousands or millions of features.
Recommendation systems: Randomized SVD enables matrix factorization on Netflix-scale data (480K users × 18K movies).
NLP embeddings: GloVe and similar methods use randomized methods to factorize massive co-occurrence matrices.
Approximate nearest neighbors: Libraries like Annoy, FAISS, and ScaNN use random projections as part of their indexing.
Kernel approximation: Random Fourier features make kernel methods scalable to millions of data points.
Streaming ML: Sketching algorithms process data in one pass with bounded memory, enabling online learning on massive streams.

Summary

Randomized linear algebra trades small, controlled accuracy loss for large speedups on massive problems.
The Johnson-Lindenstrauss lemma guarantees that random projections preserve pairwise distances in $O(\log n)$ dimensions.
Randomized SVD computes rank- $k$ approximations in $O(mnk)$ instead of $O(mn\min(m,n))$ , with near-optimal accuracy.
Power iteration improves randomized SVD accuracy when singular values decay slowly.
Matrix sketching (Count Sketch, Frequent Directions) enables streaming and single-pass algorithms.
Random Fourier features turn expensive kernel methods into fast linear methods.
These algorithms have rigorous theoretical guarantees — they are not heuristics.
This completes our linear algebra series. The tools from vectors through randomized algorithms form the mathematical backbone of modern machine learning.

References

Halko, N., Martinsson, P. G., & Tropp, J. A. (2011). Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2), 217-288. arXiv:0909.4061
Woodruff, D. P. (2014). Sketching as a Tool for Numerical Linear Algebra. Foundations and Trends in Theoretical Computer Science, 10(1-2), 1-157.
Martinsson, P. G., & Tropp, J. A. (2020). Randomized Numerical Linear Algebra: Foundations and Algorithms. Acta Numerica, 29, 403-572. arXiv:2002.01387
Rahimi, A., & Recht, B. (2007). Random Features for Large-Scale Kernel Machines. NeurIPS. papers.nips.cc
Scikit-learn Documentation. Random Projection. scikit-learn.org