Sparse Matrices and Efficient Computation

Learn how sparse matrices save memory and speed up computation in NLP, graphs, and recommender systems — from storage formats to sparse solvers.

Linear Algebra March 6, 2026 11 min read

Introduction

Most large matrices in ML are sparse — the vast majority of their entries are zero. A bag-of-words matrix for a million documents has 99.9%+ zeros. A social network adjacency matrix with a billion users stores only the edges. A recommender system’s user-item matrix has ratings for a tiny fraction of all possible pairs.

Storing and computing with these matrices as dense arrays is wasteful and often impossible — a billion-by-billion matrix does not fit in memory. Sparse matrix formats exploit the zero structure to reduce storage from O(mn)O(mn) to O(nnz)O(\text{nnz}) (number of nonzeros) and to speed up matrix operations dramatically.

This article builds on matrices and decompositions, focusing on the practical computational side that makes large-scale ML feasible.

When Are Matrices Sparse?

A matrix is considered sparse when the number of nonzero entries nnz\text{nnz} is much smaller than mnmn. The sparsity is the fraction of zeros:

sparsity=1nnzmn\text{sparsity} = 1 - \frac{\text{nnz}}{mn}
ApplicationMatrixTypical Sparsity
NLP bag-of-wordsTerm-document99%+
Recommender systemsUser-item ratings99.9%+
Social networksAdjacency matrix99.99%+
Finite elementsStiffness matrix99%+
Graph neural networksGraph Laplacian99%+
Feature engineeringOne-hot encoded99%+

Key insight: Sparsity is not just about saving memory. It fundamentally changes which algorithms are practical. Dense matrix multiplication is O(n3)O(n^3); sparse matrix-vector multiplication is O(nnz)O(\text{nnz}), which can be orders of magnitude faster.

Sparse Storage Formats

Coordinate Format (COO)

The simplest format: store three parallel arrays of row indices, column indices, and values.

A=[003400050]\mathbf{A} = \begin{bmatrix} 0 & 0 & 3 \\ 4 & 0 & 0 \\ 0 & 5 & 0 \end{bmatrix}

COO storage:

RowColValue
023
104
215

Storage: 3×nnz3 \times \text{nnz} entries.

Pros: Simple, easy to construct incrementally.

Cons: Slow for arithmetic (need to search for entries), duplicates possible.

from scipy.sparse import coo_matrix
import numpy as np

row = np.array([0, 1, 2])
col = np.array([2, 0, 1])
data = np.array([3, 4, 5])

A = coo_matrix((data, (row, col)), shape=(3, 3))
print(A.toarray())

Compressed Sparse Row (CSR)

CSR is the most widely used format for computation. It stores:

  • data: Array of all nonzero values (length nnz\text{nnz})
  • indices: Column indices for each nonzero (length nnz\text{nnz})
  • indptr: Row pointer array — indptr[i] to indptr[i+1] gives the range of entries in row ii (length m+1m + 1)

For the same matrix:

  • data = [3, 4, 5]
  • indices = [2, 0, 1]
  • indptr = [0, 1, 2, 3]

Storage: 2×nnz+(m+1)2 \times \text{nnz} + (m + 1) entries.

Pros: Fast row slicing, fast matrix-vector multiplication, efficient arithmetic.

Cons: Slow column slicing, expensive to modify structure.

from scipy.sparse import csr_matrix

A_csr = csr_matrix(A)  # Convert from COO
# Or construct directly
A_csr = csr_matrix((data, col, np.array([0, 1, 2, 3])), shape=(3, 3))

Compressed Sparse Column (CSC)

CSC is the column analog of CSR. It stores column pointers instead of row pointers.

Pros: Fast column slicing, efficient for ATx\mathbf{A}^T\mathbf{x} operations.

Cons: Slow row slicing.

Choosing the Right Format

OperationBest Format
Constructing incrementallyCOO
Row slicingCSR
Column slicingCSC
Matrix-vector product Ax\mathbf{A}\mathbf{x}CSR
Transpose product ATx\mathbf{A}^T\mathbf{x}CSC
Element-wise arithmeticCSR or CSC
Changing sparsity patternCOO → convert

Key insight: In practice, build your sparse matrix in COO format (easy to add entries), then convert to CSR for computation. This is the standard pattern in SciPy and most sparse libraries.

Sparse Matrix-Vector Multiplication

The core operation in sparse linear algebra is SpMV (sparse matrix-vector multiplication):

y=Ax\mathbf{y} = \mathbf{A}\mathbf{x}

For dense ARm×n\mathbf{A} \in \mathbb{R}^{m \times n}, this costs O(mn)O(mn). For sparse A\mathbf{A} with nnz\text{nnz} nonzeros, it costs O(nnz)O(\text{nnz}).

In CSR format, the algorithm is:

# Conceptual SpMV in CSR
for i in range(m):
    y[i] = 0
    for j_idx in range(indptr[i], indptr[i+1]):
        j = indices[j_idx]
        y[i] += data[j_idx] * x[j]

Each row only touches its nonzero entries. For a matrix with 0.1% density, this is 1000x faster than dense multiplication.

from scipy.sparse import random as sparse_random
import numpy as np
import time

n = 100000
density = 0.001  # 0.1% nonzeros

A_sparse = sparse_random(n, n, density=density, format='csr')
x = np.random.randn(n)

start = time.time()
y = A_sparse @ x
sparse_time = time.time() - start
print(f"Sparse SpMV: {sparse_time:.4f}s, nnz = {A_sparse.nnz}")

Sparse Matrix Arithmetic

Addition

Adding two sparse matrices in CSR format produces a result whose sparsity pattern is the union of the inputs. The cost is O(nnzA+nnzB)O(\text{nnz}_A + \text{nnz}_B).

Multiplication

Sparse matrix multiplication C=AB\mathbf{C} = \mathbf{A}\mathbf{B} is more complex. Even if A\mathbf{A} and B\mathbf{B} are sparse, C\mathbf{C} may be dense (or at least denser). The cost depends on the sparsity patterns of both inputs and the output.

Warning: Sparse-sparse multiplication can produce unexpectedly dense results. Always check the density of the output. In the worst case, multiplying two sparse matrices produces a completely dense matrix.

Sparse Solvers

Solving Ax=b\mathbf{A}\mathbf{x} = \mathbf{b} with sparse A\mathbf{A} uses specialized algorithms:

Direct Methods

Sparse LU and sparse Cholesky decompositions exploit sparsity by reordering rows and columns to minimize fill-in — new nonzeros created during elimination.

from scipy.sparse.linalg import spsolve
from scipy.sparse import random as sparse_random

n = 10000
A = sparse_random(n, n, density=0.01, format='csr') + 5 * eye(n)
b = np.random.randn(n)

x = spsolve(A, b)  # Sparse direct solver

Iterative Methods

For very large systems, iterative methods are preferred:

MethodMatrix TypeConvergence
Conjugate Gradient (CG)Symmetric positive definiteFast with good preconditioner
GMRESGeneral nonsymmetricFlexible but more expensive
BiCGSTABGeneral nonsymmetricGood for some systems
LSQR / LSMRRectangular (least squares)For overdetermined systems

Iterative methods only require the matrix-vector product Ax\mathbf{A}\mathbf{x} — they never form A\mathbf{A} explicitly. This makes them ideal for problems where A\mathbf{A} is defined implicitly.

from scipy.sparse.linalg import cg

# Conjugate gradient for symmetric positive definite
x, info = cg(A, b, tol=1e-8, maxiter=1000)
print(f"Converged: {info == 0}")

Preconditioning

Preconditioning transforms the system Ax=b\mathbf{A}\mathbf{x} = \mathbf{b} into M1Ax=M1b\mathbf{M}^{-1}\mathbf{A}\mathbf{x} = \mathbf{M}^{-1}\mathbf{b}, where M\mathbf{M} approximates A\mathbf{A} but is easy to invert. Good preconditioning can reduce iterations from thousands to tens.

Common preconditioners:

  • Diagonal (Jacobi): M=diag(A)\mathbf{M} = \text{diag}(\mathbf{A}) — trivial but sometimes effective
  • Incomplete LU (ILU): Sparse LU that drops small fill-in entries
  • Incomplete Cholesky: Same idea for SPD matrices

Sparse Data in ML

Text Data (NLP)

The term-document matrix (or TF-IDF matrix) is one of the most common sparse matrices in ML:

  • Rows: documents
  • Columns: vocabulary terms
  • Values: word counts or TF-IDF scores

With a vocabulary of 100,000 words and documents averaging 200 unique words, the density is 0.2%.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(f"Shape: {X.shape}")
print(f"Density: {X.nnz / (X.shape[0] * X.shape[1]):.2%}")
print(f"Format: {type(X)}")  # scipy.sparse.csr_matrix

One-Hot and Feature Hashing

Categorical features with many categories produce sparse representations:

  • A feature with 1,000 categories creates a 1,000-dimensional one-hot vector with exactly 1 nonzero
  • Feature hashing maps high-cardinality features to a fixed-size sparse vector

Graph Data

Graphs are naturally represented as sparse adjacency matrices:

Aij={1if edge (i,j) exists0otherwiseA_{ij} = \begin{cases} 1 & \text{if edge } (i,j) \text{ exists} \\ 0 & \text{otherwise} \end{cases}

Most real graphs are sparse: a social network with millions of nodes has each user connected to at most a few thousand others.

import networkx as nx
from scipy.sparse import csr_matrix

# Create a random sparse graph
G = nx.barabasi_albert_graph(10000, 5)
A = nx.adjacency_matrix(G)  # Returns CSR matrix

print(f"Nodes: {G.number_of_nodes()}")
print(f"Edges: {G.number_of_edges()}")
print(f"Density: {A.nnz / (A.shape[0]**2):.4%}")

Recommender Systems

The user-item interaction matrix is extremely sparse — most users interact with a tiny fraction of all items:

SystemUsersItemsDensity
Netflix480K18K~1.2%
Amazon100M+10M+<0.01%
Spotify500M+100M+<0.001%

Matrix factorization for recommendation works directly on the sparse matrix, only fitting to observed entries.

Sparse Eigenvalue Problems

For large sparse matrices, computing all eigenvalues is impractical. Instead, we compute a few (typically the largest or smallest) using iterative methods:

from scipy.sparse.linalg import eigsh

# Find top 5 eigenvalues of a sparse symmetric matrix
eigenvalues, eigenvectors = eigsh(A, k=5, which='LM')  # Largest magnitude
print(f"Top 5 eigenvalues: {eigenvalues}")

The Lanczos algorithm (for symmetric) and Arnoldi algorithm (for general) are the standard methods. They only require matrix-vector products and converge to the extremal eigenvalues first.

This is how spectral clustering works at scale: compute the bottom few eigenvectors of the sparse graph Laplacian using Lanczos, then cluster in the eigenvector space.

Sparse vs. Dense: When to Use What

FactorUse SparseUse Dense
Density< 10% nonzeros> 10% nonzeros
SizeToo large for dense storageFits in memory
OperationsSpMV, solving systemsMatrix decomposition, small systems
GPU supportLimited (improving)Excellent
FrameworksSciPy, PyTorch sparseNumPy, PyTorch, TensorFlow

Warning: Sparse operations on GPUs are less mature than dense operations. While libraries like cuSPARSE exist, the irregular memory access patterns of sparse matrices make GPU parallelization harder. For small-to-medium problems, dense computation on GPU can be faster than sparse computation on CPU.

Worked Example: Sparse Least Squares for NLP

Fit a linear model on TF-IDF features for text classification:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from scipy.sparse.linalg import lsqr
import numpy as np

# Simulated text classification
texts = ["machine learning is great"] * 500 + ["the weather is nice"] * 500
labels = np.array([1] * 500 + [0] * 500)

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(texts)  # Sparse CSR matrix

print(f"X shape: {X.shape}")
print(f"X density: {X.nnz / (X.shape[0] * X.shape[1]):.4%}")
print(f"Memory (sparse): {X.data.nbytes + X.indices.nbytes + X.indptr.nbytes:.0f} bytes")
print(f"Memory (dense): {X.shape[0] * X.shape[1] * 8:.0f} bytes")

# SGDClassifier works efficiently with sparse matrices
clf = SGDClassifier(loss='log_loss')
clf.fit(X, labels)

Why This Matters for ML

  • NLP pipelines produce sparse TF-IDF or bag-of-words matrices. Scikit-learn’s vectorizers output sparse CSR matrices.
  • Graph neural networks operate on sparse adjacency matrices. Message passing is sparse matrix-vector multiplication.
  • Recommender systems factorize sparse user-item matrices. Only observed entries contribute to the loss.
  • Feature engineering with one-hot encoding and feature hashing produces sparse inputs.
  • Large-scale eigenproblems (spectral clustering, PageRank) use sparse iterative solvers.
  • Memory efficiency: A 100K × 100K matrix at 1% density needs 80MB sparse vs. 80GB dense.

Summary

  • Sparse matrices have mostly zero entries; exploiting this structure saves memory and computation.
  • COO format is easy to construct; CSR is best for computation; convert COO → CSR as standard practice.
  • SpMV costs O(nnz)O(\text{nnz}) instead of O(mn)O(mn) — the key operation enabling large-scale ML.
  • Iterative solvers (CG, GMRES) only need matrix-vector products and scale to millions of dimensions.
  • Preconditioning dramatically improves convergence of iterative solvers.
  • NLP, graphs, and recommender systems all produce naturally sparse data.
  • Sparse eigensolvers (Lanczos, Arnoldi) find extremal eigenvalues efficiently for spectral methods.
  • Next, we push scalability further with randomized linear algebra.

References

  • Davis, T. A. (2006). Direct Methods for Sparse Linear Systems. SIAM.
  • Saad, Y. (2003). Iterative Methods for Sparse Linear Systems (2nd ed.). SIAM.
  • SciPy Documentation. scipy.sparse. docs.scipy.org
  • Strang, G. (2019). Linear Algebra and Learning from Data. Wellesley-Cambridge Press.

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay