Sparse Matrices and Efficient Computation

Linear Algebra Series 12 / 13

Introduction

Most large matrices in ML are sparse — the vast majority of their entries are zero. A bag-of-words matrix for a million documents has 99.9%+ zeros. A social network adjacency matrix with a billion users stores only the edges. A recommender system’s user-item matrix has ratings for a tiny fraction of all possible pairs.

Storing and computing with these matrices as dense arrays is wasteful and often impossible — a billion-by-billion matrix does not fit in memory. Sparse matrix formats exploit the zero structure to reduce storage from $O(mn)$ to $O(\text{nnz})$ (number of nonzeros) and to speed up matrix operations dramatically.

This article builds on matrices and decompositions, focusing on the practical computational side that makes large-scale ML feasible.

When Are Matrices Sparse?

A matrix is considered sparse when the number of nonzero entries $\text{nnz}$ is much smaller than $mn$ . The sparsity is the fraction of zeros:

\text{sparsity} = 1 - \frac{\text{nnz}}{mn}

Application	Matrix	Typical Sparsity
NLP bag-of-words	Term-document	99%+
Recommender systems	User-item ratings	99.9%+
Social networks	Adjacency matrix	99.99%+
Finite elements	Stiffness matrix	99%+
Graph neural networks	Graph Laplacian	99%+
Feature engineering	One-hot encoded	99%+

Key insight: Sparsity is not just about saving memory. It fundamentally changes which algorithms are practical. Dense matrix multiplication is $O(n^3)$ ; sparse matrix-vector multiplication is $O(\text{nnz})$ , which can be orders of magnitude faster.

Sparse Storage Formats

Coordinate Format (COO)

The simplest format: store three parallel arrays of row indices, column indices, and values.

\mathbf{A} = \begin{bmatrix} 0 & 0 & 3 \\ 4 & 0 & 0 \\ 0 & 5 & 0 \end{bmatrix}

COO storage:

Row	Col	Value
0	2	3
1	0	4
2	1	5

Storage: $3 \times \text{nnz}$ entries.

Pros: Simple, easy to construct incrementally.

Cons: Slow for arithmetic (need to search for entries), duplicates possible.

from scipy.sparse import coo_matrix
import numpy as np

row = np.array([0, 1, 2])
col = np.array([2, 0, 1])
data = np.array([3, 4, 5])

A = coo_matrix((data, (row, col)), shape=(3, 3))
print(A.toarray())

Compressed Sparse Row (CSR)

CSR is the most widely used format for computation. It stores:

data: Array of all nonzero values (length $\text{nnz}$ )
indices: Column indices for each nonzero (length $\text{nnz}$ )
indptr: Row pointer array — indptr[i] to indptr[i+1] gives the range of entries in row $i$ (length $m + 1$ )

For the same matrix:

data = [3, 4, 5]
indices = [2, 0, 1]
indptr = [0, 1, 2, 3]

Storage: $2 \times \text{nnz} + (m + 1)$ entries.

Pros: Fast row slicing, fast matrix-vector multiplication, efficient arithmetic.

Cons: Slow column slicing, expensive to modify structure.

from scipy.sparse import csr_matrix

A_csr = csr_matrix(A)  # Convert from COO
# Or construct directly
A_csr = csr_matrix((data, col, np.array([0, 1, 2, 3])), shape=(3, 3))

Compressed Sparse Column (CSC)

CSC is the column analog of CSR. It stores column pointers instead of row pointers.

Pros: Fast column slicing, efficient for $\mathbf{A}^T\mathbf{x}$ operations.

Cons: Slow row slicing.

Choosing the Right Format

Operation	Best Format
Constructing incrementally	COO
Row slicing	CSR
Column slicing	CSC
Matrix-vector product $\mathbf{A}\mathbf{x}$	CSR
Transpose product $\mathbf{A}^T\mathbf{x}$	CSC
Element-wise arithmetic	CSR or CSC
Changing sparsity pattern	COO → convert

Key insight: In practice, build your sparse matrix in COO format (easy to add entries), then convert to CSR for computation. This is the standard pattern in SciPy and most sparse libraries.

Sparse Matrix-Vector Multiplication

The core operation in sparse linear algebra is SpMV (sparse matrix-vector multiplication):

\mathbf{y} = \mathbf{A}\mathbf{x}

For dense $\mathbf{A} \in \mathbb{R}^{m \times n}$ , this costs $O(mn)$ . For sparse $\mathbf{A}$ with $\text{nnz}$ nonzeros, it costs $O(\text{nnz})$ .

In CSR format, the algorithm is:

# Conceptual SpMV in CSR
for i in range(m):
    y[i] = 0
    for j_idx in range(indptr[i], indptr[i+1]):
        j = indices[j_idx]
        y[i] += data[j_idx] * x[j]

Each row only touches its nonzero entries. For a matrix with 0.1% density, this is 1000x faster than dense multiplication.

from scipy.sparse import random as sparse_random
import numpy as np
import time

n = 100000
density = 0.001  # 0.1% nonzeros

A_sparse = sparse_random(n, n, density=density, format='csr')
x = np.random.randn(n)

start = time.time()
y = A_sparse @ x
sparse_time = time.time() - start
print(f"Sparse SpMV: {sparse_time:.4f}s, nnz = {A_sparse.nnz}")

Sparse Matrix Arithmetic

Addition

Adding two sparse matrices in CSR format produces a result whose sparsity pattern is the union of the inputs. The cost is $O(\text{nnz}_A + \text{nnz}_B)$ .

Multiplication

Sparse matrix multiplication $\mathbf{C} = \mathbf{A}\mathbf{B}$ is more complex. Even if $\mathbf{A}$ and $\mathbf{B}$ are sparse, $\mathbf{C}$ may be dense (or at least denser). The cost depends on the sparsity patterns of both inputs and the output.

Warning: Sparse-sparse multiplication can produce unexpectedly dense results. Always check the density of the output. In the worst case, multiplying two sparse matrices produces a completely dense matrix.

Sparse Solvers

Solving $\mathbf{A}\mathbf{x} = \mathbf{b}$ with sparse $\mathbf{A}$ uses specialized algorithms:

Direct Methods

Sparse LU and sparse Cholesky decompositions exploit sparsity by reordering rows and columns to minimize fill-in — new nonzeros created during elimination.

from scipy.sparse.linalg import spsolve
from scipy.sparse import random as sparse_random

n = 10000
A = sparse_random(n, n, density=0.01, format='csr') + 5 * eye(n)
b = np.random.randn(n)

x = spsolve(A, b)  # Sparse direct solver

Iterative Methods

For very large systems, iterative methods are preferred:

Method	Matrix Type	Convergence
Conjugate Gradient (CG)	Symmetric positive definite	Fast with good preconditioner
GMRES	General nonsymmetric	Flexible but more expensive
BiCGSTAB	General nonsymmetric	Good for some systems
LSQR / LSMR	Rectangular (least squares)	For overdetermined systems

Iterative methods only require the matrix-vector product $\mathbf{A}\mathbf{x}$ — they never form $\mathbf{A}$ explicitly. This makes them ideal for problems where $\mathbf{A}$ is defined implicitly.

from scipy.sparse.linalg import cg

# Conjugate gradient for symmetric positive definite
x, info = cg(A, b, tol=1e-8, maxiter=1000)
print(f"Converged: {info == 0}")

Preconditioning

Preconditioning transforms the system $\mathbf{A}\mathbf{x} = \mathbf{b}$ into $\mathbf{M}^{-1}\mathbf{A}\mathbf{x} = \mathbf{M}^{-1}\mathbf{b}$ , where $\mathbf{M}$ approximates $\mathbf{A}$ but is easy to invert. Good preconditioning can reduce iterations from thousands to tens.

Common preconditioners:

Diagonal (Jacobi): $\mathbf{M} = \text{diag}(\mathbf{A})$ — trivial but sometimes effective
Incomplete LU (ILU): Sparse LU that drops small fill-in entries
Incomplete Cholesky: Same idea for SPD matrices

Sparse Data in ML

Text Data (NLP)

The term-document matrix (or TF-IDF matrix) is one of the most common sparse matrices in ML:

Rows: documents
Columns: vocabulary terms
Values: word counts or TF-IDF scores

With a vocabulary of 100,000 words and documents averaging 200 unique words, the density is 0.2%.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(f"Shape: {X.shape}")
print(f"Density: {X.nnz / (X.shape[0] * X.shape[1]):.2%}")
print(f"Format: {type(X)}")  # scipy.sparse.csr_matrix

One-Hot and Feature Hashing

Categorical features with many categories produce sparse representations:

A feature with 1,000 categories creates a 1,000-dimensional one-hot vector with exactly 1 nonzero
Feature hashing maps high-cardinality features to a fixed-size sparse vector

Graph Data

Graphs are naturally represented as sparse adjacency matrices:

A_{ij} = \begin{cases} 1 & \text{if edge } (i,j) \text{ exists} \\ 0 & \text{otherwise} \end{cases}

Most real graphs are sparse: a social network with millions of nodes has each user connected to at most a few thousand others.

import networkx as nx
from scipy.sparse import csr_matrix

# Create a random sparse graph
G = nx.barabasi_albert_graph(10000, 5)
A = nx.adjacency_matrix(G)  # Returns CSR matrix

print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Density: {A.nnz / (A.shape[0]**2):.4%}")

Recommender Systems

The user-item interaction matrix is extremely sparse — most users interact with a tiny fraction of all items:

System	Users	Items	Density
Netflix	480K	18K	~1.2%
Amazon	100M+	10M+	<0.01%
Spotify	500M+	100M+	<0.001%

Matrix factorization for recommendation works directly on the sparse matrix, only fitting to observed entries.

Sparse Eigenvalue Problems

For large sparse matrices, computing all eigenvalues is impractical. Instead, we compute a few (typically the largest or smallest) using iterative methods:

from scipy.sparse.linalg import eigsh

# Find top 5 eigenvalues of a sparse symmetric matrix
eigenvalues, eigenvectors = eigsh(A, k=5, which='LM')  # Largest magnitude
print(f"Top 5 eigenvalues: {eigenvalues}")

The Lanczos algorithm (for symmetric) and Arnoldi algorithm (for general) are the standard methods. They only require matrix-vector products and converge to the extremal eigenvalues first.

This is how spectral clustering works at scale: compute the bottom few eigenvectors of the sparse graph Laplacian using Lanczos, then cluster in the eigenvector space.

Sparse vs. Dense: When to Use What

Factor	Use Sparse	Use Dense
Density	< 10% nonzeros	> 10% nonzeros
Size	Too large for dense storage	Fits in memory
Operations	SpMV, solving systems	Matrix decomposition, small systems
GPU support	Limited (improving)	Excellent
Frameworks	SciPy, PyTorch sparse	NumPy, PyTorch, TensorFlow

Warning: Sparse operations on GPUs are less mature than dense operations. While libraries like cuSPARSE exist, the irregular memory access patterns of sparse matrices make GPU parallelization harder. For small-to-medium problems, dense computation on GPU can be faster than sparse computation on CPU.

Worked Example: Sparse Least Squares for NLP

Fit a linear model on TF-IDF features for text classification:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from scipy.sparse.linalg import lsqr
import numpy as np

# Simulated text classification
texts = ["machine learning is great"] * 500 + ["the weather is nice"] * 500
labels = np.array([1] * 500 + [0] * 500)

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(texts)  # Sparse CSR matrix

print(f"X shape: {X.shape}")
print(f"X density: {X.nnz / (X.shape[0] * X.shape[1]):.4%}")
print(f"Memory (sparse): {X.data.nbytes + X.indices.nbytes + X.indptr.nbytes:.0f} bytes")
print(f"Memory (dense): {X.shape[0] * X.shape[1] * 8:.0f} bytes")

# SGDClassifier works efficiently with sparse matrices
clf = SGDClassifier(loss='log_loss')
clf.fit(X, labels)

Why This Matters for ML

NLP pipelines produce sparse TF-IDF or bag-of-words matrices. Scikit-learn’s vectorizers output sparse CSR matrices.
Graph neural networks operate on sparse adjacency matrices. Message passing is sparse matrix-vector multiplication.
Recommender systems factorize sparse user-item matrices. Only observed entries contribute to the loss.
Feature engineering with one-hot encoding and feature hashing produces sparse inputs.
Large-scale eigenproblems (spectral clustering, PageRank) use sparse iterative solvers.
Memory efficiency: A 100K × 100K matrix at 1% density needs 80MB sparse vs. 80GB dense.

Summary

Sparse matrices have mostly zero entries; exploiting this structure saves memory and computation.
COO format is easy to construct; CSR is best for computation; convert COO → CSR as standard practice.
SpMV costs $O(\text{nnz})$ instead of $O(mn)$ — the key operation enabling large-scale ML.
Iterative solvers (CG, GMRES) only need matrix-vector products and scale to millions of dimensions.
Preconditioning dramatically improves convergence of iterative solvers.
NLP, graphs, and recommender systems all produce naturally sparse data.
Sparse eigensolvers (Lanczos, Arnoldi) find extremal eigenvalues efficiently for spectral methods.
Next, we push scalability further with randomized linear algebra.

References

Davis, T. A. (2006). Direct Methods for Sparse Linear Systems. SIAM.
Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM.
SciPy Documentation. scipy.sparse. docs.scipy.org
Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.