- 01 Vectors and Vector Spaces: The Language of Data 02 Matrices and Matrix Operations: Organizing Linear Computation 03 Systems of Linear Equations: From Geometry to Algorithms 04 Determinants: The Volume Factor of Linear Maps 05 Linear Transformations: Matrices as Functions 06 Inner Products, Norms, and Orthogonality: Measuring Geometry 07 Eigenvalues and Eigenvectors: The DNA of a Matrix 08 Matrix Decompositions: Breaking Matrices into Simpler Pieces 09 Linear Algebra in Machine Learning: Putting It All Together 10 Matrix Calculus: Derivatives for Machine Learning 11 Tensor Operations: Beyond Matrices 12 Sparse Matrices and Efficient Computation 13 Randomized Linear Algebra: Speed Through Randomness
Introduction
Most large matrices in ML are sparse — the vast majority of their entries are zero. A bag-of-words matrix for a million documents has 99.9%+ zeros. A social network adjacency matrix with a billion users stores only the edges. A recommender system’s user-item matrix has ratings for a tiny fraction of all possible pairs.
Storing and computing with these matrices as dense arrays is wasteful and often impossible — a billion-by-billion matrix does not fit in memory. Sparse matrix formats exploit the zero structure to reduce storage from to (number of nonzeros) and to speed up matrix operations dramatically.
This article builds on matrices and decompositions, focusing on the practical computational side that makes large-scale ML feasible.
When Are Matrices Sparse?
A matrix is considered sparse when the number of nonzero entries is much smaller than . The sparsity is the fraction of zeros:
| Application | Matrix | Typical Sparsity |
|---|---|---|
| NLP bag-of-words | Term-document | 99%+ |
| Recommender systems | User-item ratings | 99.9%+ |
| Social networks | Adjacency matrix | 99.99%+ |
| Finite elements | Stiffness matrix | 99%+ |
| Graph neural networks | Graph Laplacian | 99%+ |
| Feature engineering | One-hot encoded | 99%+ |
Key insight: Sparsity is not just about saving memory. It fundamentally changes which algorithms are practical. Dense matrix multiplication is ; sparse matrix-vector multiplication is , which can be orders of magnitude faster.
Sparse Storage Formats
Coordinate Format (COO)
The simplest format: store three parallel arrays of row indices, column indices, and values.
COO storage:
| Row | Col | Value |
|---|---|---|
| 0 | 2 | 3 |
| 1 | 0 | 4 |
| 2 | 1 | 5 |
Storage: entries.
Pros: Simple, easy to construct incrementally.
Cons: Slow for arithmetic (need to search for entries), duplicates possible.
from scipy.sparse import coo_matrix
import numpy as np
row = np.array([0, 1, 2])
col = np.array([2, 0, 1])
data = np.array([3, 4, 5])
A = coo_matrix((data, (row, col)), shape=(3, 3))
print(A.toarray())
Compressed Sparse Row (CSR)
CSR is the most widely used format for computation. It stores:
data: Array of all nonzero values (length )indices: Column indices for each nonzero (length )indptr: Row pointer array —indptr[i]toindptr[i+1]gives the range of entries in row (length )
For the same matrix:
data = [3, 4, 5]indices = [2, 0, 1]indptr = [0, 1, 2, 3]
Storage: entries.
Pros: Fast row slicing, fast matrix-vector multiplication, efficient arithmetic.
Cons: Slow column slicing, expensive to modify structure.
from scipy.sparse import csr_matrix
A_csr = csr_matrix(A) # Convert from COO
# Or construct directly
A_csr = csr_matrix((data, col, np.array([0, 1, 2, 3])), shape=(3, 3))
Compressed Sparse Column (CSC)
CSC is the column analog of CSR. It stores column pointers instead of row pointers.
Pros: Fast column slicing, efficient for operations.
Cons: Slow row slicing.
Choosing the Right Format
| Operation | Best Format |
|---|---|
| Constructing incrementally | COO |
| Row slicing | CSR |
| Column slicing | CSC |
| Matrix-vector product | CSR |
| Transpose product | CSC |
| Element-wise arithmetic | CSR or CSC |
| Changing sparsity pattern | COO → convert |
Key insight: In practice, build your sparse matrix in COO format (easy to add entries), then convert to CSR for computation. This is the standard pattern in SciPy and most sparse libraries.
Sparse Matrix-Vector Multiplication
The core operation in sparse linear algebra is SpMV (sparse matrix-vector multiplication):
For dense , this costs . For sparse with nonzeros, it costs .
In CSR format, the algorithm is:
# Conceptual SpMV in CSR
for i in range(m):
y[i] = 0
for j_idx in range(indptr[i], indptr[i+1]):
j = indices[j_idx]
y[i] += data[j_idx] * x[j]
Each row only touches its nonzero entries. For a matrix with 0.1% density, this is 1000x faster than dense multiplication.
from scipy.sparse import random as sparse_random
import numpy as np
import time
n = 100000
density = 0.001 # 0.1% nonzeros
A_sparse = sparse_random(n, n, density=density, format='csr')
x = np.random.randn(n)
start = time.time()
y = A_sparse @ x
sparse_time = time.time() - start
print(f"Sparse SpMV: {sparse_time:.4f}s, nnz = {A_sparse.nnz}")
Sparse Matrix Arithmetic
Addition
Adding two sparse matrices in CSR format produces a result whose sparsity pattern is the union of the inputs. The cost is .
Multiplication
Sparse matrix multiplication is more complex. Even if and are sparse, may be dense (or at least denser). The cost depends on the sparsity patterns of both inputs and the output.
Warning: Sparse-sparse multiplication can produce unexpectedly dense results. Always check the density of the output. In the worst case, multiplying two sparse matrices produces a completely dense matrix.
Sparse Solvers
Solving with sparse uses specialized algorithms:
Direct Methods
Sparse LU and sparse Cholesky decompositions exploit sparsity by reordering rows and columns to minimize fill-in — new nonzeros created during elimination.
from scipy.sparse.linalg import spsolve
from scipy.sparse import random as sparse_random
n = 10000
A = sparse_random(n, n, density=0.01, format='csr') + 5 * eye(n)
b = np.random.randn(n)
x = spsolve(A, b) # Sparse direct solver
Iterative Methods
For very large systems, iterative methods are preferred:
| Method | Matrix Type | Convergence |
|---|---|---|
| Conjugate Gradient (CG) | Symmetric positive definite | Fast with good preconditioner |
| GMRES | General nonsymmetric | Flexible but more expensive |
| BiCGSTAB | General nonsymmetric | Good for some systems |
| LSQR / LSMR | Rectangular (least squares) | For overdetermined systems |
Iterative methods only require the matrix-vector product — they never form explicitly. This makes them ideal for problems where is defined implicitly.
from scipy.sparse.linalg import cg
# Conjugate gradient for symmetric positive definite
x, info = cg(A, b, tol=1e-8, maxiter=1000)
print(f"Converged: {info == 0}")
Preconditioning
Preconditioning transforms the system into , where approximates but is easy to invert. Good preconditioning can reduce iterations from thousands to tens.
Common preconditioners:
- Diagonal (Jacobi): — trivial but sometimes effective
- Incomplete LU (ILU): Sparse LU that drops small fill-in entries
- Incomplete Cholesky: Same idea for SPD matrices
Sparse Data in ML
Text Data (NLP)
The term-document matrix (or TF-IDF matrix) is one of the most common sparse matrices in ML:
- Rows: documents
- Columns: vocabulary terms
- Values: word counts or TF-IDF scores
With a vocabulary of 100,000 words and documents averaging 200 unique words, the density is 0.2%.
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["the cat sat on the mat",
"the dog sat on the log",
"cats and dogs are pets"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(f"Shape: {X.shape}")
print(f"Density: {X.nnz / (X.shape[0] * X.shape[1]):.2%}")
print(f"Format: {type(X)}") # scipy.sparse.csr_matrix
One-Hot and Feature Hashing
Categorical features with many categories produce sparse representations:
- A feature with 1,000 categories creates a 1,000-dimensional one-hot vector with exactly 1 nonzero
- Feature hashing maps high-cardinality features to a fixed-size sparse vector
Graph Data
Graphs are naturally represented as sparse adjacency matrices:
Most real graphs are sparse: a social network with millions of nodes has each user connected to at most a few thousand others.
import networkx as nx
from scipy.sparse import csr_matrix
# Create a random sparse graph
G = nx.barabasi_albert_graph(10000, 5)
A = nx.adjacency_matrix(G) # Returns CSR matrix
print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Density: {A.nnz / (A.shape[0]**2):.4%}")
Recommender Systems
The user-item interaction matrix is extremely sparse — most users interact with a tiny fraction of all items:
| System | Users | Items | Density |
|---|---|---|---|
| Netflix | 480K | 18K | ~1.2% |
| Amazon | 100M+ | 10M+ | <0.01% |
| Spotify | 500M+ | 100M+ | <0.001% |
Matrix factorization for recommendation works directly on the sparse matrix, only fitting to observed entries.
Sparse Eigenvalue Problems
For large sparse matrices, computing all eigenvalues is impractical. Instead, we compute a few (typically the largest or smallest) using iterative methods:
from scipy.sparse.linalg import eigsh
# Find top 5 eigenvalues of a sparse symmetric matrix
eigenvalues, eigenvectors = eigsh(A, k=5, which='LM') # Largest magnitude
print(f"Top 5 eigenvalues: {eigenvalues}")
The Lanczos algorithm (for symmetric) and Arnoldi algorithm (for general) are the standard methods. They only require matrix-vector products and converge to the extremal eigenvalues first.
This is how spectral clustering works at scale: compute the bottom few eigenvectors of the sparse graph Laplacian using Lanczos, then cluster in the eigenvector space.
Sparse vs. Dense: When to Use What
| Factor | Use Sparse | Use Dense |
|---|---|---|
| Density | < 10% nonzeros | > 10% nonzeros |
| Size | Too large for dense storage | Fits in memory |
| Operations | SpMV, solving systems | Matrix decomposition, small systems |
| GPU support | Limited (improving) | Excellent |
| Frameworks | SciPy, PyTorch sparse | NumPy, PyTorch, TensorFlow |
Warning: Sparse operations on GPUs are less mature than dense operations. While libraries like cuSPARSE exist, the irregular memory access patterns of sparse matrices make GPU parallelization harder. For small-to-medium problems, dense computation on GPU can be faster than sparse computation on CPU.
Worked Example: Sparse Least Squares for NLP
Fit a linear model on TF-IDF features for text classification:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from scipy.sparse.linalg import lsqr
import numpy as np
# Simulated text classification
texts = ["machine learning is great"] * 500 + ["the weather is nice"] * 500
labels = np.array([1] * 500 + [0] * 500)
vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(texts) # Sparse CSR matrix
print(f"X shape: {X.shape}")
print(f"X density: {X.nnz / (X.shape[0] * X.shape[1]):.4%}")
print(f"Memory (sparse): {X.data.nbytes + X.indices.nbytes + X.indptr.nbytes:.0f} bytes")
print(f"Memory (dense): {X.shape[0] * X.shape[1] * 8:.0f} bytes")
# SGDClassifier works efficiently with sparse matrices
clf = SGDClassifier(loss='log_loss')
clf.fit(X, labels)
Why This Matters for ML
- NLP pipelines produce sparse TF-IDF or bag-of-words matrices. Scikit-learn’s vectorizers output sparse CSR matrices.
- Graph neural networks operate on sparse adjacency matrices. Message passing is sparse matrix-vector multiplication.
- Recommender systems factorize sparse user-item matrices. Only observed entries contribute to the loss.
- Feature engineering with one-hot encoding and feature hashing produces sparse inputs.
- Large-scale eigenproblems (spectral clustering, PageRank) use sparse iterative solvers.
- Memory efficiency: A 100K × 100K matrix at 1% density needs 80MB sparse vs. 80GB dense.
Summary
- Sparse matrices have mostly zero entries; exploiting this structure saves memory and computation.
- COO format is easy to construct; CSR is best for computation; convert COO → CSR as standard practice.
- SpMV costs instead of — the key operation enabling large-scale ML.
- Iterative solvers (CG, GMRES) only need matrix-vector products and scale to millions of dimensions.
- Preconditioning dramatically improves convergence of iterative solvers.
- NLP, graphs, and recommender systems all produce naturally sparse data.
- Sparse eigensolvers (Lanczos, Arnoldi) find extremal eigenvalues efficiently for spectral methods.
- Next, we push scalability further with randomized linear algebra.
References
- Davis, T. A. (2006). Direct Methods for Sparse Linear Systems. SIAM.
- Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM.
- SciPy Documentation. scipy.sparse. docs.scipy.org
- Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.