Introduction
Vectors are 1D arrays. Matrices are 2D arrays. But the data structures in modern ML are often higher-dimensional: a batch of color images is a 4D tensor (batch, channels, height, width), a batch of videos is 5D, and attention weights in multi-head Transformers are 4D. To work fluently with deep learning frameworks, you need to think in tensors.
This article extends the linear algebra from vectors and matrices into arbitrary dimensions, covering the operations that PyTorch and TensorFlow execute on every forward pass.
What Is a Tensor?
In the ML context, a tensor is a multi-dimensional array of numbers. The number of dimensions is called the order (or rank, though this term conflicts with matrix rank):
| Order | Name | Example | Shape |
|---|---|---|---|
| 0 | Scalar | Loss value | () |
| 1 | Vector | Feature vector | (n,) |
| 2 | Matrix | Weight matrix | (m, n) |
| 3 | 3rd-order tensor | RGB image | (C, H, W) |
| 4 | 4th-order tensor | Batch of images | (B, C, H, W) |
| 5 | 5th-order tensor | Video batch | (B, T, C, H, W) |
Each dimension is called an axis (or mode). The shape is the tuple of sizes along each axis.
Key distinction: In physics and differential geometry, “tensor” has a precise mathematical definition involving transformation laws. In ML, “tensor” simply means “multi-dimensional array.” The frameworks PyTorch (torch.Tensor) and TensorFlow (tf.Tensor) use this practical definition.
Tensor Shapes in Deep Learning
Understanding shapes is the most practical tensor skill. A large share of the bugs you will debug in deep learning come down to shape mismatches.
Fully Connected Layer
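Input: $X \in \mathbb{R}^{B \times d_{\text{in}}}$ (batch of vectors)
Weight: $W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$
Bias: $b \in \mathbb{R}^{d_{\text{out}}}$
Output: $Y = XW + b \in \mathbb{R}^{B \times d_{\text{out}}}$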
The bias is added to every row via broadcasting.
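A minimal sketch of these shapes in PyTorch. The sizes (`B = 32`, `d_in = 128`, `d_out = 64`) are illustrative assumptions, not values from the text. The check against `torch.nn.functional.linear` relies on that function storing its weight as `(d_out, d_in)`, which is why `W.T` is passed.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, d_in, d_out = 32, 128, 64          # illustrative sizes (assumed)

X = torch.randn(B, d_in)              # batch of input vectors, one per row
W = torch.randn(d_in, d_out)          # weight matrix
b = torch.randn(d_out)                # bias vector

Y = X @ W + b                         # b is broadcast across all B rows
Y_ref = F.linear(X, W.T, b)           # F.linear expects weight of shape (d_out, d_in)

print(Y.shape)                        # torch.Size([32, 64])
print(torch.allclose(Y, Y_ref, atol=1e-3))
```

Note that frameworks differ on whether the weight is stored as `(d_in, d_out)` or its transpose, so it is worth checking the convention before copying weights between implementations.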
Convolutional Layer
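Input: $(B, C_{\text{in}}, H, W)$, i.e. batch, input channels, height, width
Kernel: $(C_{\text{out}}, C_{\text{in}}, k_H, k_W)$, i.e. output channels, input channels, kernel height, kernel width
Output: $(B, C_{\text{out}}, H', W')$, where $H'$ and $W'$ depend on the kernel size, stride, and padding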
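To make the convolution output shape concrete, here is a small sketch using `torch.nn.functional.conv2d`. The tensor sizes, stride 1, and zero padding are assumptions chosen for illustration. The size formula is the standard one for a convolution without dilation.

```python
import torch
import torch.nn.functional as F

B, C_in, H, W = 4, 3, 32, 32          # illustrative sizes (assumed)
C_out, k_h, k_w = 16, 5, 5
stride, padding = 1, 0                # assumed settings for this sketch

x = torch.randn(B, C_in, H, W)
kernel = torch.randn(C_out, C_in, k_h, k_w)

y = F.conv2d(x, kernel, stride=stride, padding=padding)

# Output spatial size for a convolution without dilation
H_out = (H + 2 * padding - k_h) // stride + 1
W_out = (W + 2 * padding - k_w) // stride + 1

print(y.shape)                        # torch.Size([4, 16, 28, 28])
print((H_out, W_out))                 # (28, 28)
```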
Multi-Head Attention
Query: $(B, h, T, d_k)$, i.e. batch, heads, tokens, key dimension
Key: $(B, h, T, d_k)$
Value: $(B, h, T, d_v)$
Attention weights: $(B, h, T, T)$, a separate attention matrix per head per batch element
Basic Tensor Operations
Element-wise Operations
Element-wise operations apply a function to each element independently. They require tensors of the same shape (or broadcastable shapes):
$$C_{ijk} = f(A_{ijk}, B_{ijk})$$
All activation functions (ReLU, sigmoid, tanh) are element-wise operations.
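A short illustration with hand-picked values, so the results can be checked by eye. The point worth noticing is that `*` is the element-wise product, which differs from matrix multiplication with `@`.

```python
import torch

A = torch.tensor([[1.0, -2.0], [3.0, -4.0]])
B = torch.tensor([[10.0, 20.0], [30.0, 40.0]])

print(A + B)             # element-wise sum
print(A * B)             # element-wise (Hadamard) product, not matrix multiplication
print(A @ B)             # matrix product, shown for contrast
print(torch.relu(A))     # negative entries become 0
```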
Broadcasting
Broadcasting automatically expands dimensions of size 1 to match the other tensor’s size:
import torch
x = torch.randn(4, 3) # (4, 3)
b = torch.randn(3) # (3,) → broadcast to (4, 3)
result = x + b # (4, 3) — b is added to each row
Broadcasting rules (right-aligned comparison):
- Dimensions are compatible if they are equal, or one of them is 1.
- Missing dimensions are treated as size 1.
- The output shape is the element-wise maximum of the input shapes.
Warning: Broadcasting is powerful but can silently produce wrong results if shapes are accidentally compatible. Always verify shapes explicitly during development.
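One way the silent failure can show up is sketched below. The names `pred` and `target` are hypothetical stand-ins for a model output with a trailing singleton axis and a flat label vector.

```python
import torch

torch.manual_seed(0)
pred = torch.randn(5, 1)      # model output with a trailing singleton axis
target = torch.randn(5)       # labels stored as a flat vector

diff = pred - target          # (5, 1) and (5,) broadcast to (5, 5), not (5,)
print(diff.shape)             # torch.Size([5, 5])

# Making the shapes agree first gives the intended per-example difference
diff_ok = pred.squeeze(1) - target
print(diff_ok.shape)          # torch.Size([5])
```

No error is raised in the first case. A loss computed from `diff` would average over 25 entries instead of 5, which is why an explicit shape assertion during development is cheap insurance.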
Reshaping
Reshape changes the shape without changing the data order in memory:
x = torch.randn(2, 3, 4) # (2, 3, 4) — 24 elements
y = x.reshape(6, 4) # (6, 4) — same 24 elements
z = x.reshape(2, 12) # (2, 12) — same 24 elements
Common reshape operations:
| Operation | Effect | Example |
|---|---|---|
| reshape | Change shape, same data | (2,3,4) → (6,4) |
| squeeze | Remove dimensions of size 1 | (1,3,1,4) → (3,4) |
| unsqueeze | Add a dimension of size 1 | (3,4) → (1,3,4) |
| flatten | Collapse all dimensions | (2,3,4) → (24,) |
| permute/transpose | Reorder axes | (B,C,H,W) → (B,H,W,C) |
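The rows of the table can be checked directly in PyTorch. The image batch size below is an assumption for illustration. The other shapes are the ones from the table.

```python
import torch

x = torch.randn(1, 3, 1, 4)
print(x.squeeze().shape)              # torch.Size([3, 4])
print(x.squeeze(0).shape)             # torch.Size([3, 1, 4]), only axis 0 removed

y = torch.randn(3, 4)
print(y.unsqueeze(0).shape)           # torch.Size([1, 3, 4])

z = torch.randn(2, 3, 4)
print(z.flatten().shape)              # torch.Size([24])
print(z.flatten(start_dim=1).shape)   # torch.Size([2, 12]), keeps the first axis

img = torch.randn(8, 3, 32, 32)       # (B, C, H, W), sizes assumed
print(img.permute(0, 2, 3, 1).shape)  # torch.Size([8, 32, 32, 3]), i.e. (B, H, W, C)
```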
Transposition and Permutation
For matrices, transposition swaps two axes. For higher-order tensors, permutation reorders any axes:
x = torch.randn(2, 3, 4, 5)
# Swap axes 1 and 2
y = x.permute(0, 2, 1, 3) # (2, 4, 3, 5)
# Transpose last two dimensions (useful for batch matmul)
z = x.transpose(-2, -1) # (2, 3, 5, 4)
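A common source of confusion is that reshaping to the transposed shape is not the same as transposing. The small example below uses integer entries so the difference is visible.

```python
import torch

x = torch.arange(6).reshape(2, 3)
# tensor([[0, 1, 2],
#         [3, 4, 5]])

transposed = x.transpose(0, 1)   # swaps axes: rows become columns
reshaped = x.reshape(3, 2)       # keeps the flat order 0..5, only regroups it

print(transposed.tolist())       # [[0, 3], [1, 4], [2, 5]]
print(reshaped.tolist())         # [[0, 1], [2, 3], [4, 5]]
```

Both results have shape (3, 2), so a shape check alone would not catch the mix-up.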
Tensor Contraction and Einstein Notation
Tensor Contraction
Contraction is the generalization of matrix multiplication to tensors. It sums over one or more shared indices between two tensors.
Matrix multiplication is contraction over one index:
$$C_{ij} = \sum_k A_{ik} B_{kj}$$
A more general contraction might sum over multiple indices:
$$C_{il} = \sum_{j,k} A_{ijk} B_{jkl}$$
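As a sketch of a contraction over two indices, the example below uses `torch.tensordot`. The index layout (a tensor with indices i, j, k contracted with one indexed by j, k, l) and the sizes are illustrative choices. The result is cross-checked against an ordinary matrix product on flattened operands.

```python
import torch

torch.manual_seed(0)
A = torch.randn(2, 3, 4)      # indices i, j, k
B = torch.randn(3, 4, 5)      # indices j, k, l

# Contract over j and k: C_il = sum_{j,k} A_ijk * B_jkl
C = torch.tensordot(A, B, dims=([1, 2], [0, 1]))     # (2, 5)

# The same contraction as a matrix product after merging (j, k) into one axis
C_mat = A.reshape(2, 12) @ B.reshape(12, 5)

print(C.shape)                                       # torch.Size([2, 5])
print(torch.allclose(C, C_mat, atol=1e-4))
```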
Einstein Summation Notation
Einstein notation (einsum) provides a compact, explicit way to express any tensor contraction. The convention: repeated indices are summed over, so $C_{ij} = A_{ik} B_{kj}$ is shorthand for $C_{ij} = \sum_k A_{ik} B_{kj}$.
In code, torch.einsum takes a string specifying the contraction:
import torch
# Matrix multiplication: C_ij = A_ik * B_kj
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = torch.einsum('ik,kj->ij', A, B) # (3, 5)
# Batch matrix multiplication: C_bij = A_bik * B_bkj
A = torch.randn(8, 3, 4)
B = torch.randn(8, 4, 5)
C = torch.einsum('bik,bkj->bij', A, B) # (8, 3, 5)
# Dot product: c = a_i * b_i
a = torch.randn(5)
b = torch.randn(5)
c = torch.einsum('i,i->', a, b) # scalar
# Outer product: C_ij = a_i * b_j
C = torch.einsum('i,j->ij', a, b) # (5, 5)
# Trace: t = A_ii
A = torch.randn(4, 4)
t = torch.einsum('ii->', A) # scalar
# Transpose: B_ji = A_ij
B = torch.einsum('ij->ji', A) # (4, 4)
Einsum Patterns for ML
| Pattern | Einsum String | Operation |
|---|---|---|
| Matrix multiply | 'ik,kj->ij' | $C = AB$ |
| Batch matmul | 'bik,bkj->bij' | Batched $AB$ |
| Dot product | 'i,i->' | $a^\top b$ |
| Outer product | 'i,j->ij' | $ab^\top$ |
| Attention scores | 'bhid,bhjd->bhij' | $QK^\top$ per head |
| Bilinear form | 'i,ij,j->' | $x^\top A y$ |
| Trace | 'ii->' | $\operatorname{tr}(A)$ |
| Diagonal | 'ii->i' | $\operatorname{diag}(A)$ |
Key insight: Einsum is a common language for tensor contractions. Matrix multiplication, bilinear forms, the score and value-weighting steps of attention, and convolution (once the input has been unfolded into patches) can all be written as einsum strings. Nonlinear steps such as softmax sit outside it. Learning to read and write einsum strings makes you fluent in the computational language of deep learning.
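A quick way to build trust in an einsum string is to compare it with the equivalent built-in operation. The sketch below does that for a few of the patterns above, with tensor sizes chosen arbitrarily for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4)
y = torch.randn(5)
A = torch.randn(4, 5)
S = torch.randn(4, 4)

# Bilinear form x^T A y
bilinear = torch.einsum('i,ij,j->', x, A, y)
assert torch.allclose(bilinear, x @ A @ y, atol=1e-4)

# Diagonal and trace of a square matrix
assert torch.allclose(torch.einsum('ii->i', S), torch.diagonal(S))
assert torch.allclose(torch.einsum('ii->', S), torch.trace(S), atol=1e-5)

# Attention scores Q K^T for each batch element and head
Q = torch.randn(2, 3, 6, 8)        # (batch, heads, tokens, key dim), sizes assumed
K = torch.randn(2, 3, 6, 8)
scores = torch.einsum('bhid,bhjd->bhij', Q, K)
assert torch.allclose(scores, Q @ K.transpose(-2, -1), atol=1e-4)
print(scores.shape)                # torch.Size([2, 3, 6, 6])
```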
Tensor Decompositions
Just as matrices have SVD and eigendecomposition, tensors have analogous decompositions.
CP Decomposition (CANDECOMP/PARAFAC)
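A rank-$R$ CP decomposition expresses a tensor $\mathcal{T}$ as a sum of $R$ rank-1 tensors:
$$\mathcal{T} \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r$$
where $\circ$ denotes the outer product. In index notation:
$$T_{ijk} \approx \sum_{r=1}^{R} a_{ir} \, b_{jr} \, c_{kr}$$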
Tucker Decomposition
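Tucker decomposition generalizes PCA to tensors:
$$\mathcal{T} \approx \mathcal{G} \times_1 A \times_2 B \times_3 C$$
where $\mathcal{G}$ is a smaller core tensor, $A$, $B$, $C$ are factor matrices for each mode, and $\times_n$ denotes multiplication along mode $n$.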
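A sketch of how a third-order Tucker reconstruction can be written as a single einsum. The core and factors here are random placeholders, and the shapes (a 10×20×30 tensor with a 3×4×5 core) are assumptions for illustration. The example only shows the structure and the parameter count, not a fitted decomposition.

```python
import torch

torch.manual_seed(0)
I, J, K = 10, 20, 30          # full tensor shape (assumed)
P, Q, R = 3, 4, 5             # core shape (assumed)

G = torch.randn(P, Q, R)      # core tensor
A = torch.randn(I, P)         # factor matrix for mode 1
B = torch.randn(J, Q)         # factor matrix for mode 2
C = torch.randn(K, R)         # factor matrix for mode 3

# T_ijk = sum_{p,q,r} G_pqr * A_ip * B_jq * C_kr
T = torch.einsum('pqr,ip,jq,kr->ijk', G, A, B, C)

n_full = I * J * K
n_tucker = G.numel() + A.numel() + B.numel() + C.numel()
print(T.shape)                # torch.Size([10, 20, 30])
print(n_full, n_tucker)       # 6000 320
```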
Applications in ML
- Model compression: Decompose large weight tensors in neural networks into smaller factors, reducing parameters and computation.
- Tensor networks: Used in quantum-inspired ML models.
- Multi-relational learning: Knowledge graphs store data as 3rd-order tensors (subject, relation, object).
# Simple CP decomposition idea
import torch
# Original 3D tensor
T = torch.randn(10, 20, 30) # 6000 parameters
# Rank-5 CP factors (random placeholders here; in practice they are fitted to T, e.g. by alternating least squares)
R = 5
a = torch.randn(10, R)
b = torch.randn(20, R)
c = torch.randn(30, R)
# Reconstruct: T_ijk ≈ sum_r a_ir * b_jr * c_kr
T_approx = torch.einsum('ir,jr,kr->ijk', a, b, c)
# Only 10*5 + 20*5 + 30*5 = 300 parameters (20x compression)
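For the factors to approximate a given tensor they have to be fitted to it. The sketch below does this with plain gradient descent (Adam), which is a simple stand-in chosen for brevity rather than the alternating least squares method usually used for CP. The target is constructed to be exactly rank 3 so that a rank-3 fit is possible. Sizes, learning rate, and step count are all assumptions.

```python
import torch

torch.manual_seed(0)
R = 3
# Build a target that is exactly rank 3, so a rank-3 fit can succeed
a_true, b_true, c_true = torch.randn(6, R), torch.randn(7, R), torch.randn(8, R)
T = torch.einsum('ir,jr,kr->ijk', a_true, b_true, c_true)

# Trainable factors, randomly initialized
a = torch.randn(6, R, requires_grad=True)
b = torch.randn(7, R, requires_grad=True)
c = torch.randn(8, R, requires_grad=True)
opt = torch.optim.Adam([a, b, c], lr=0.05)

def rel_error():
    """Relative Frobenius-norm error of the current reconstruction."""
    with torch.no_grad():
        T_hat = torch.einsum('ir,jr,kr->ijk', a, b, c)
        return ((T - T_hat).pow(2).sum().sqrt() / T.pow(2).sum().sqrt()).item()

initial_error = rel_error()
for step in range(500):
    opt.zero_grad()
    T_hat = torch.einsum('ir,jr,kr->ijk', a, b, c)
    loss = ((T - T_hat) ** 2).mean()      # mean squared reconstruction error
    loss.backward()
    opt.step()
final_error = rel_error()

print(f"relative error: {initial_error:.3f} -> {final_error:.3f}")
```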
Memory Layout and Performance
Contiguous vs. Non-Contiguous
Tensors are stored as flat arrays in memory. A contiguous tensor has elements stored in row-major order (C order). Operations like transpose and permute may create non-contiguous views that share the same memory but with different strides.
x = torch.randn(3, 4) # contiguous
y = x.T # non-contiguous (view with different strides)
z = y.contiguous() # new contiguous copy
Non-contiguous tensors can be slower for some operations, and a few operations (for example .view()) raise an error on them. Call .contiguous() to get a row-major copy when that happens.
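Strides make the difference visible. The sketch below uses a small integer tensor so the memory order can be read off directly.

```python
import torch

x = torch.arange(12).reshape(3, 4)
y = x.T                               # view: same storage, swapped strides

print(x.stride(), x.is_contiguous())  # (4, 1) True
print(y.stride(), y.is_contiguous())  # (1, 4) False
print(y.data_ptr() == x.data_ptr())   # True, no data was copied

try:
    y.view(12)                        # view needs strides compatible with the new shape
except RuntimeError as err:
    print("view failed:", type(err).__name__)

flat = y.reshape(12)                  # reshape falls back to a copy when needed
z = y.contiguous()                    # explicit row-major copy
print(z.stride(), z.is_contiguous())  # (3, 1) True
```

This is why `reshape` is usually the safer default: it returns a view when the strides allow it and copies otherwise.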
Batch Operations
Always prefer batched operations over loops. GPUs are optimized for parallel computation on large tensors:
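import torch
batch_size, d_in, d_out = 32, 16, 8 # example sizes
x = torch.randn(batch_size, d_in) # one input vector per row
W = torch.randn(d_out, d_in) # maps d_in -> d_out
# Slow: loop over batch
results = []
for i in range(batch_size):
    results.append(W @ x[i]) # (d_out,)
result = torch.stack(results) # (batch_size, d_out)
# Fast: one matrix multiplication for the whole batch
result = x @ W.T # (batch_size, d_out)
# Equivalent with bmm, which does not broadcast and needs matching 3D operands:
result = torch.bmm(W.unsqueeze(0).expand(batch_size, -1, -1), x.unsqueeze(2)).squeeze(2)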
Worked Example: Multi-Head Attention with Einsum
Implement the core of multi-head attention using einsum:
import torch
import torch.nn.functional as F
B, T, d_model = 2, 10, 64 # batch, tokens, model dim
n_heads, d_k = 8, 8 # heads, key dim per head
# Input
X = torch.randn(B, T, d_model)
# Projection weights
W_Q = torch.randn(d_model, n_heads, d_k)
W_K = torch.randn(d_model, n_heads, d_k)
W_V = torch.randn(d_model, n_heads, d_k)
# Project: Q_bhtk = X_btd * W_Q_dhk (sum over d)
Q = torch.einsum('btd,dhk->bhtk', X, W_Q) # (B, H, T, d_k)
K = torch.einsum('btd,dhk->bhtk', X, W_K)
V = torch.einsum('btd,dhk->bhtk', X, W_V)
# Attention scores: score_bhij = Q_bhik * K_bhjk (sum over k)
scores = torch.einsum('bhik,bhjk->bhij', Q, K) / (d_k ** 0.5)
# Softmax over keys (last dimension)
attn = F.softmax(scores, dim=-1) # (B, H, T, T)
# Weighted values: out_bhid = attn_bhij * V_bhjd (sum over j)
out = torch.einsum('bhij,bhjd->bhid', attn, V) # (B, H, T, d_k)
print(f"Output shape: {out.shape}") # (2, 8, 10, 8)
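Two follow-up checks are sketched below as a self-contained block: the einsum attention agrees with a plain matmul version, and the per-head outputs can be merged back to the model dimension with a permute followed by a reshape. The random `Q`, `K`, `V` stand in for the projected tensors. A typical Transformer layer would apply a further output projection after the merge, which is omitted here.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, n_heads, d_k = 2, 10, 8, 8
d_model = n_heads * d_k

# Stand-ins for the per-head attention inputs, shape (B, H, T, d_k)
Q = torch.randn(B, n_heads, T, d_k)
K = torch.randn(B, n_heads, T, d_k)
V = torch.randn(B, n_heads, T, d_k)

# einsum version
scores = torch.einsum('bhik,bhjk->bhij', Q, K) / (d_k ** 0.5)
attn = F.softmax(scores, dim=-1)
out = torch.einsum('bhij,bhjd->bhid', attn, V)            # (B, H, T, d_k)

# matmul version of the same computation, as a cross-check
attn_ref = F.softmax(Q @ K.transpose(-2, -1) / (d_k ** 0.5), dim=-1)
out_ref = attn_ref @ V
assert torch.allclose(out, out_ref, atol=1e-4)

# Each row of attention weights sums to 1 over the keys
assert torch.allclose(attn.sum(dim=-1), torch.ones(B, n_heads, T), atol=1e-5)

# Merge heads: (B, H, T, d_k) -> (B, T, H, d_k) -> (B, T, d_model)
merged = out.permute(0, 2, 1, 3).reshape(B, T, d_model)
print(merged.shape)                                       # torch.Size([2, 10, 64])
```

The permute must come before the reshape. Reshaping `(B, H, T, d_k)` straight to `(B, T, d_model)` would produce the right shape but mix values from different tokens.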
Why This Matters for ML
- Framework fluency: PyTorch and TensorFlow operate on tensors. Understanding shapes, broadcasting, and permutation prevents bugs.
- Efficient computation: Einsum expresses complex operations concisely and allows frameworks to optimize execution.
- Model compression: Tensor decompositions reduce model size by factoring large weight tensors.
- Attention mechanisms: Multi-head attention is a series of batched tensor contractions.
- Data representation: Images, videos, point clouds, and sequences are all naturally tensors.
- Memory optimization: Understanding layout and contiguity helps optimize GPU memory usage.
Summary
- A tensor is a multi-dimensional array; its order is the number of dimensions.
- Broadcasting automatically expands dimensions of size 1, enabling concise batch operations.
- Reshaping, permutation, and squeezing change how you view the same data.
- Tensor contraction generalizes matrix multiplication by summing over shared indices.
- Einstein notation (einsum) gives a compact, explicit way to write tensor contractions such as matrix products, traces, and attention scores.
- Tensor decompositions (CP, Tucker) compress high-dimensional data, analogous to SVD for matrices.
- Memory layout (contiguous vs. non-contiguous) affects performance on GPUs.
- Next, we tackle large-scale efficiency with sparse matrices and efficient computation.