Tensor Operations: Beyond Matrices

Linear Algebra Series 11 / 13

Introduction

Vectors are 1D arrays. Matrices are 2D arrays. But the data structures in modern ML are often higher-dimensional: a batch of color images is a 4D tensor (batch, channels, height, width), a batch of videos is 5D, and attention weights in multi-head Transformers are 4D. To work fluently with deep learning frameworks, you need to think in tensors.

This article extends the linear algebra from vectors and matrices into arbitrary dimensions, covering the operations that PyTorch and TensorFlow execute on every forward pass.

What Is a Tensor?

In the ML context, a tensor is a multi-dimensional array of numbers. The number of dimensions is called the order (or rank, though this term conflicts with matrix rank):

| Order | Name | Example | Shape |
| --- | --- | --- | --- |
| 0 | Scalar | Loss value | `()` |
| 1 | Vector | Feature vector | `(n,)` |
| 2 | Matrix | Weight matrix | `(m, n)` |
| 3 | 3rd-order tensor | RGB image | `(C, H, W)` |
| 4 | 4th-order tensor | Batch of images | `(B, C, H, W)` |
| 5 | 5th-order tensor | Video batch | `(B, T, C, H, W)` |

Each dimension is called an axis (or mode). The shape is the tuple of sizes along each axis.

Key distinction: In physics and differential geometry, “tensor” has a precise mathematical definition involving transformation laws. In ML, “tensor” simply means “multi-dimensional array.” The frameworks PyTorch (torch.Tensor) and TensorFlow (tf.Tensor) use this practical definition.

Tensor Shapes in Deep Learning

Understanding shapes is the most practical tensor skill. A large share of the bugs you will debug in deep learning come down to shape mismatches.

Fully Connected Layer

Input: $\mathbf{X} \in \mathbb{R}^{B \times d_{\text{in}}}$ (batch of $B$ vectors)

Weight: $\mathbf{W} \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}$

Bias: $\mathbf{b} \in \mathbb{R}^{d_{\text{out}}}$

Output: $\mathbf{H} = \mathbf{X}\mathbf{W} + \mathbf{b} \in \mathbb{R}^{B \times d_{\text{out}}}$

The bias $\mathbf{b}$ is added to every row via broadcasting.
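The shapes above can be checked directly. The following is a minimal sketch, assuming PyTorch is installed; the sizes `B=4`, `d_in=3`, `d_out=2` are illustrative and not taken from the article. It also shows one detail worth knowing: `torch.nn.Linear` stores its weight with shape `(d_out, d_in)`, the transpose of $\mathbf{W}$ as written here.

```python
import torch

B, d_in, d_out = 4, 3, 2            # illustrative sizes chosen for this sketch
X = torch.randn(B, d_in)
W = torch.randn(d_in, d_out)
b = torch.randn(d_out)

H = X @ W + b                       # b is broadcast across the B rows
print(H.shape)                      # torch.Size([4, 2])

# torch.nn.Linear keeps weight as (d_out, d_in) and computes X @ weight.T + bias
layer = torch.nn.Linear(d_in, d_out)
with torch.no_grad():
    layer.weight.copy_(W.T)
    layer.bias.copy_(b)
same = torch.allclose(layer(X), H, atol=1e-5)
print(same)                         # True
```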

Convolutional Layer

Input: $(B, C_{\text{in}}, H, W)$ — batch, input channels, height, width

Kernel: $(C_{\text{out}}, C_{\text{in}}, k_H, k_W)$ — output channels, input channels, kernel height, kernel width

Output: $(B, C_{\text{out}}, H', W')$
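The output size $H'$, $W'$ depends on settings the shapes above do not show. As a rough sketch, the standard size formula for a 2D convolution can be written in plain Python. The function name `conv2d_output_shape` and the `stride`, `padding`, and `dilation` parameters are assumptions introduced here for illustration.

```python
def conv2d_output_shape(in_shape, kernel_shape, stride=1, padding=0, dilation=1):
    """Return (B, C_out, H_out, W_out) for a 2D convolution.

    stride, padding and dilation are assumptions of this sketch and are
    applied identically to height and width.
    """
    B, C_in, H, W = in_shape
    C_out, C_in_kernel, k_H, k_W = kernel_shape
    if C_in != C_in_kernel:
        raise ValueError(f"channel mismatch: input has {C_in}, kernel expects {C_in_kernel}")

    def out_size(size, k):
        return (size + 2 * padding - dilation * (k - 1) - 1) // stride + 1

    return (B, C_out, out_size(H, k_H), out_size(W, k_W))


print(conv2d_output_shape((8, 3, 32, 32), (16, 3, 3, 3)))                       # (8, 16, 30, 30)
print(conv2d_output_shape((8, 3, 32, 32), (16, 3, 3, 3), padding=1))            # (8, 16, 32, 32)
print(conv2d_output_shape((8, 3, 32, 32), (16, 3, 3, 3), stride=2, padding=1))  # (8, 16, 16, 16)
```

Note that the kernel's second axis must equal the input's channel axis, which is the convolutional counterpart of the inner dimensions agreeing in a matrix product.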

Multi-Head Attention

Query: $(B, H, T, d_k)$ — batch, heads, tokens, key dimension

Key: $(B, H, T, d_k)$

Value: $(B, H, T, d_v)$

Attention weights: $(B, H, T, T)$ — a separate $T \times T$ attention matrix per head per batch element

Basic Tensor Operations

Element-wise Operations

Element-wise operations apply a function to each element independently. They require tensors of the same shape (or broadcastable shapes):

$$\mathbf{C} = \mathbf{A} \odot \mathbf{B} \quad \text{where} \quad C_{ijk} = A_{ijk} \cdot B_{ijk}$$

All activation functions (ReLU, sigmoid, tanh) are element-wise operations.

Broadcasting

Broadcasting automatically expands dimensions of size 1 to match the other tensor’s size:

import torch

x = torch.randn(4, 3)     # (4, 3)
b = torch.randn(3)        # (3,) → broadcast to (4, 3)
result = x + b             # (4, 3) — b is added to each row

Broadcasting rules (right-aligned comparison):

Dimensions are compatible if they are equal, or one of them is 1.
Missing dimensions are treated as size 1.
The output shape is the element-wise maximum of the input shapes.

Warning: Broadcasting is powerful but can silently produce wrong results if shapes are accidentally compatible. Always verify shapes explicitly during development.
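The three rules are short enough to write out as code. The sketch below is plain Python with a hypothetical helper `broadcast_shape` (not a framework function); it computes the result shape and then reproduces the kind of accidental compatibility the warning describes.

```python
from itertools import zip_longest


def broadcast_shape(*shapes):
    """Apply the broadcasting rules and return the resulting shape."""
    result = []
    # Right-align by walking each shape from its last axis; missing axes count as 1.
    for dims in zip_longest(*(reversed(s) for s in shapes), fillvalue=1):
        sizes = {d for d in dims if d != 1}
        if len(sizes) > 1:
            raise ValueError(f"shapes {shapes} are not broadcastable")
        result.append(sizes.pop() if sizes else 1)
    return tuple(reversed(result))


print(broadcast_shape((4, 3), (3,)))       # (4, 3)
print(broadcast_shape((8, 1, 5), (4, 1)))  # (8, 4, 5)

# Silent failure mode: predictions of shape (4,) minus targets of shape (4, 1)
# give a (4, 4) grid of all pairwise differences, not 4 residuals.
print(broadcast_shape((4,), (4, 1)))       # (4, 4)
```

A loss averaged over that `(4, 4)` result would still be a scalar, so nothing fails at runtime. An explicit shape assertion before the subtraction is the cheap defense.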

Reshaping

Reshape changes the shape while keeping the elements in the same row-major order. For a contiguous tensor it returns a view and moves no data:

x = torch.randn(2, 3, 4)   # (2, 3, 4) — 24 elements
y = x.reshape(6, 4)          # (6, 4) — same 24 elements
z = x.reshape(2, 12)         # (2, 12) — same 24 elements

Common reshape operations:

| Operation | Effect | Example |
| --- | --- | --- |
| `reshape` | Change shape, same data | `(2,3,4)` → `(6,4)` |
| `squeeze` | Remove dimensions of size 1 | `(1,3,1,4)` → `(3,4)` |
| `unsqueeze` | Add a dimension of size 1 | `(3,4)` → `(1,3,4)` |
| `flatten` | Collapse all dimensions | `(2,3,4)` → `(24,)` |
| `permute/transpose` | Reorder axes | `(B,C,H,W)` → `(B,H,W,C)` |
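What "same data" means for `reshape` can be made concrete with row-major indexing: every element keeps its flat position, and only the way that position splits into indices changes. The sketch below is plain Python; `ravel_index` and `unravel_index` are hypothetical helper names used only here.

```python
def ravel_index(index, shape):
    """Flat row-major offset of a multi-index."""
    flat = 0
    for i, n in zip(index, shape):
        flat = flat * n + i
    return flat


def unravel_index(flat, shape):
    """Multi-index that corresponds to a flat row-major offset."""
    index = []
    for n in reversed(shape):
        flat, i = divmod(flat, n)
        index.append(i)
    return tuple(reversed(index))


old_shape, new_shape = (2, 3, 4), (6, 4)

flat = ravel_index((1, 2, 3), old_shape)
print(flat)                                   # 23
print(unravel_index(flat, new_shape))         # (5, 3)

# Element [1, 0, 2] of the (2, 3, 4) tensor is element [3, 2] after reshape(6, 4).
print(unravel_index(ravel_index((1, 0, 2), old_shape), new_shape))  # (3, 2)
```

A permutation behaves differently: it changes which element sits at a given multi-index, so `permute` followed by `reshape` is not the same as `reshape` alone.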

Transposition and Permutation

For matrices, transposition swaps two axes. For higher-order tensors, permutation reorders any axes:

x = torch.randn(2, 3, 4, 5)

# Swap axes 1 and 2
y = x.permute(0, 2, 1, 3)    # (2, 4, 3, 5)

# Transpose last two dimensions (useful for batch matmul)
z = x.transpose(-2, -1)       # (2, 3, 5, 4)

Tensor Contraction and Einstein Notation

Tensor Contraction

Contraction is the generalization of matrix multiplication to tensors. It sums over one or more shared indices between two tensors.

Matrix multiplication is contraction over one index:

$$C_{ij} = \sum_k A_{ik} B_{kj}$$

A more general contraction might sum over multiple indices:

$$C_{il} = \sum_{j,k} A_{ijk} B_{jkl}$$
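A contraction over two indices is still a matrix product in disguise: merge the summed axes $(j, k)$ into one axis and multiply. The sketch below, assuming PyTorch and using illustrative sizes, computes the same $C_{il}$ three ways: with `torch.tensordot`, with reshape plus matmul, and directly from the definition.

```python
import torch

I, J, K, L = 2, 3, 4, 5                  # illustrative sizes chosen for this sketch
A = torch.randn(I, J, K)
B = torch.randn(J, K, L)

# Contract the last two axes of A with the first two axes of B.
C = torch.tensordot(A, B, dims=2)        # (I, L)

# Same result as a plain matrix product after merging (j, k) into one axis.
C_matmul = A.reshape(I, J * K) @ B.reshape(J * K, L)

# Same result from the definition, one output entry at a time.
C_loop = torch.zeros(I, L)
for i in range(I):
    for l in range(L):
        C_loop[i, l] = (A[i] * B[:, :, l]).sum()

print(C.shape)                                    # torch.Size([2, 5])
print(torch.allclose(C, C_matmul, atol=1e-4))     # True
print(torch.allclose(C, C_loop, atol=1e-4))       # True
```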

Einstein Summation Notation

Einstein notation (einsum) provides a compact, explicit way to express any tensor contraction. The convention: repeated indices are summed over.

$$C_{ij} = A_{ik} B_{kj} \quad \text{(implicit sum over } k \text{)}$$

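In code, torch.einsum takes a string specifying the contraction. Indices listed after -> are kept; every index that appears only on the input side is summed over, so a repeated index that is kept (such as the batch index b below) is not summed: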

import torch

# Matrix multiplication: C_ij = A_ik * B_kj
A = torch.randn(3, 4)
B = torch.randn(4, 5)
C = torch.einsum('ik,kj->ij', A, B)  # (3, 5)

# Batch matrix multiplication: C_bij = A_bik * B_bkj
A = torch.randn(8, 3, 4)
B = torch.randn(8, 4, 5)
C = torch.einsum('bik,bkj->bij', A, B)  # (8, 3, 5)

# Dot product: c = a_i * b_i
a = torch.randn(5)
b = torch.randn(5)
c = torch.einsum('i,i->', a, b)  # scalar

# Outer product: C_ij = a_i * b_j
C = torch.einsum('i,j->ij', a, b)  # (5, 5)

# Trace: t = A_ii
A = torch.randn(4, 4)
t = torch.einsum('ii->', A)  # scalar

# Transpose: B_ji = A_ij
B = torch.einsum('ij->ji', A)  # (4, 4)
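To make the semantics of the einsum string explicit, here is a deliberately slow reference version in plain Python. It is a sketch for reading purposes only: `naive_einsum`, `get`, and `shape_of` are hypothetical helpers, it works on nested lists, and it returns a dictionary from output index to value rather than a tensor.

```python
from itertools import product


def get(nested, index):
    """Read one element of a nested list by multi-index."""
    for i in index:
        nested = nested[i]
    return nested


def shape_of(nested):
    """Shape of a rectangular nested list."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


def naive_einsum(spec, *operands):
    """Evaluate an explicit einsum string such as 'ik,kj->ij' on nested lists."""
    inputs, output = spec.split("->")
    terms = inputs.split(",")
    # Record the size of every index letter from the operand shapes.
    size = {}
    for term, op in zip(terms, operands):
        for letter, n in zip(term, shape_of(op)):
            if size.setdefault(letter, n) != n:
                raise ValueError(f"index {letter!r} has inconsistent sizes")
    letters = sorted(size)
    result = {}
    # Visit every assignment of index values. Letters missing from the output
    # are summed because all their values land in the same result entry.
    for values in product(*(range(size[l]) for l in letters)):
        env = dict(zip(letters, values))
        term_product = 1
        for term, op in zip(terms, operands):
            term_product *= get(op, [env[l] for l in term])
        key = tuple(env[l] for l in output)
        result[key] = result.get(key, 0) + term_product
    return result


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(naive_einsum("ik,kj->ij", A, B))              # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}
print(naive_einsum("ii->", A))                      # {(): 5}
print(naive_einsum("i,i->", [1, 2, 3], [4, 5, 6]))  # {(): 32}
```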

Einsum Patterns for ML

| Pattern | Einsum String | Operation |
| --- | --- | --- |
| Matrix multiply | `'ik,kj->ij'` | $\mathbf{C} = \mathbf{A}\mathbf{B}$ |
| Batch matmul | `'bik,bkj->bij'` | Batched $\mathbf{C}_b = \mathbf{A}_b\mathbf{B}_b$ |
| Dot product | `'i,i->'` | $c = \mathbf{a}^T\mathbf{b}$ |
| Outer product | `'i,j->ij'` | $\mathbf{C} = \mathbf{a}\mathbf{b}^T$ |
| Attention scores | `'bhid,bhjd->bhij'` | $\mathbf{Q}\mathbf{K}^T$ per head |
| Bilinear form | `'i,ij,j->'` | $\mathbf{u}^T\mathbf{A}\mathbf{v}$ |
| Trace | `'ii->'` | $\text{tr}(\mathbf{A})$ |
| Diagonal | `'ii->i'` | $\text{diag}(\mathbf{A})$ |

Key insight: Einsum is a common language for tensor contractions. Any operation built from multiplying entries and summing over indices (matrix multiplication, bilinear forms, the score and weighted-sum steps of attention, and convolution once the input is unfolded into patches) can be written as an einsum. Element-wise nonlinearities such as softmax fall outside it. Learning to read and write einsum strings makes you fluent in the computational language of deep learning.

Tensor Decompositions

Just as matrices have SVD and eigendecomposition, tensors have analogous decompositions.

CP Decomposition (CANDECOMP/PARAFAC)

A rank-$R$ CP decomposition expresses a tensor as a sum of rank-1 tensors:

$$\mathcal{T} \approx \sum_{r=1}^{R} \mathbf{a}_r \otimes \mathbf{b}_r \otimes \mathbf{c}_r$$

where $\otimes$ denotes the outer product. In index notation:

$$T_{ijk} \approx \sum_{r=1}^{R} a_{ir} \cdot b_{jr} \cdot c_{kr}$$

Tucker Decomposition

Tucker decomposition generalizes PCA to tensors:

$$\mathcal{T} \approx \mathcal{G} \times_1 \mathbf{U}_1 \times_2 \mathbf{U}_2 \times_3 \mathbf{U}_3$$

where $\mathcal{G}$ is a smaller core tensor and $\mathbf{U}_i$ are factor matrices for each mode.
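Written in index notation, in the same style as the CP formula, the Tucker model for a 3rd-order tensor reads as follows. The core sizes $P$, $Q$, $R$ and the entry names $g_{pqr}$ and $u^{(n)}$ are notation introduced here for illustration:

$$T_{ijk} \approx \sum_{p=1}^{P} \sum_{q=1}^{Q} \sum_{r=1}^{R} g_{pqr} \, u^{(1)}_{ip} \, u^{(2)}_{jq} \, u^{(3)}_{kr}$$

CP is the special case in which $P = Q = R$ and the core is superdiagonal, meaning $g_{pqr}$ is nonzero only when $p = q = r$.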

Applications in ML

Model compression: Decompose large weight tensors in neural networks into smaller factors, reducing parameters and computation.
Tensor networks: Used in quantum-inspired ML models.
Multi-relational learning: Knowledge graphs store data as 3rd-order tensors (subject, relation, object).

# Simple CP decomposition idea
import torch

# Original 3D tensor
T = torch.randn(10, 20, 30)  # 6000 parameters

# Rank-5 CP approximation
R = 5
a = torch.randn(10, R)
b = torch.randn(20, R)
c = torch.randn(30, R)

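# Reconstruct: T_ijk ≈ sum_r a_ir * b_jr * c_kr
# (the factors here are random, so this only illustrates the shapes;
#  fitting them to T requires an optimizer such as alternating least squares)
T_approx = torch.einsum('ir,jr,kr->ijk', a, b, c)
# Only 10*5 + 20*5 + 30*5 = 300 parameters (20x fewer than 6000)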
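The parameter arithmetic is easy to get wrong by hand, so here is a small plain-Python sketch that counts parameters for both decompositions. The helper names `cp_params` and `tucker_params` and the Tucker core shape `(5, 5, 5)` are assumptions chosen for illustration.

```python
from math import prod


def cp_params(shape, rank):
    """One factor matrix of size (n, rank) per mode."""
    return rank * sum(shape)


def tucker_params(shape, core_shape):
    """A core tensor plus one factor matrix of size (n, r) per mode."""
    return prod(core_shape) + sum(n * r for n, r in zip(shape, core_shape))


shape = (10, 20, 30)
full = prod(shape)
cp = cp_params(shape, 5)
tucker = tucker_params(shape, (5, 5, 5))

print(full, cp, full / cp)               # 6000 300 20.0
print(tucker, round(full / tucker, 1))   # 425 14.1
```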

Memory Layout and Performance

Contiguous vs. Non-Contiguous

Tensors are stored as flat arrays in memory. A contiguous tensor has elements stored in row-major order (C order). Operations like transpose and permute may create non-contiguous views that share the same memory but with different strides.

x = torch.randn(3, 4)       # contiguous
y = x.T                      # non-contiguous (view with different strides)
z = y.contiguous()            # new contiguous copy

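Non-contiguous tensors can be slower for some operations, and a few operations reject them: .view(), for example, raises an error when the requested shape is incompatible with the tensor's strides. Call .contiguous() first in those cases, or use .reshape(), which copies only when it has to.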
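Strides are what separate a view from a copy. The plain-Python sketch below models a tensor as just a shape and a tuple of strides (counted in elements). The helper names are hypothetical, and the contiguity check is simplified compared with what a framework does for axes of size 1.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) of a freshly allocated tensor."""
    strides = []
    step = 1
    for n in reversed(shape):
        strides.append(step)
        step *= n
    return tuple(reversed(strides))


def permute(shape, strides, order):
    """A permuted view only reorders shape and strides; no data moves."""
    return tuple(shape[a] for a in order), tuple(strides[a] for a in order)


def is_contiguous(shape, strides):
    """Simplified check: strides must equal the row-major strides of the shape."""
    return strides == contiguous_strides(shape)


shape = (3, 4)
strides = contiguous_strides(shape)
print(strides)                               # (4, 1)

t_shape, t_strides = permute(shape, strides, (1, 0))
print(t_shape, t_strides)                    # (4, 3) (1, 4)

print(is_contiguous(shape, strides))         # True
print(is_contiguous(t_shape, t_strides))     # False
```

The transposed view walks memory with a step of 1 along its first axis and 4 along its second, which is why a contiguous copy has to physically reorder the elements.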

Batch Operations

Always prefer batched operations over loops. GPUs are optimized for parallel computation on large tensors:

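import torch

batch_size, d_in, d_out = 32, 16, 8   # illustrative sizes
x = torch.randn(batch_size, d_in)     # one input vector per row
W = torch.randn(d_out, d_in)

# Slow: loop over batch
results = []
for i in range(batch_size):
    results.append(W @ x[i])           # (d_out,)
result = torch.stack(results)          # (batch_size, d_out)

# Fast: a single matrix multiplication covers the whole batch
result = x @ W.T                       # (batch_size, d_out)

# torch.bmm needs two 3D tensors with equal batch size; it does not broadcast
result = torch.bmm(x.unsqueeze(1), W.T.expand(batch_size, -1, -1)).squeeze(1)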

Worked Example: Multi-Head Attention with Einsum

Implement the core of multi-head attention using einsum:

import torch
import torch.nn.functional as F

B, T, d_model = 2, 10, 64   # batch, tokens, model dim
n_heads, d_k = 8, 8          # heads, key dim per head

# Input
X = torch.randn(B, T, d_model)

# Projection weights
W_Q = torch.randn(d_model, n_heads, d_k)
W_K = torch.randn(d_model, n_heads, d_k)
W_V = torch.randn(d_model, n_heads, d_k)

# Project: Q_bhtk = X_btd * W_Q_dhk  (sum over d)
Q = torch.einsum('btd,dhk->bhtk', X, W_Q)  # (B, H, T, d_k)
K = torch.einsum('btd,dhk->bhtk', X, W_K)
V = torch.einsum('btd,dhk->bhtk', X, W_V)

# Attention scores: score_bhij = Q_bhik * K_bhjk (sum over k)
scores = torch.einsum('bhik,bhjk->bhij', Q, K) / (d_k ** 0.5)

# Softmax over keys (last dimension)
attn = F.softmax(scores, dim=-1)  # (B, H, T, T)

# Weighted values: out_bhid = attn_bhij * V_bhjd (sum over j)
out = torch.einsum('bhij,bhjd->bhid', attn, V)  # (B, H, T, d_k)

print(f"Output shape: {out.shape}")  # (2, 8, 10, 8)
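As a sanity check on the einsum strings, the sketch below recomputes the two contractions with ordinary batched matmul and confirms that each row of attention weights sums to 1. It assumes PyTorch and reuses the sizes from the example. The final step, merging the heads back into one feature axis, is the usual follow-up in Transformer implementations and is included here as an assumption rather than as part of the example above.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, d_k = 2, 8, 10, 8                 # same sizes as the worked example
Q = torch.randn(B, H, T, d_k)
K = torch.randn(B, H, T, d_k)
V = torch.randn(B, H, T, d_k)

# Scores: einsum versus matmul with the last two axes of K transposed
scores_einsum = torch.einsum('bhik,bhjk->bhij', Q, K) / (d_k ** 0.5)
scores_matmul = Q @ K.transpose(-2, -1) / (d_k ** 0.5)

attn = F.softmax(scores_einsum, dim=-1)    # (B, H, T, T)

# Weighted values: einsum versus batched matmul
out_einsum = torch.einsum('bhij,bhjd->bhid', attn, V)
out_matmul = attn @ V

# Merge heads: (B, H, T, d_k) -> (B, T, H * d_k). reshape copies here because
# the permuted tensor is not contiguous.
merged = out_einsum.permute(0, 2, 1, 3).reshape(B, T, H * d_k)

print(torch.allclose(scores_einsum, scores_matmul, atol=1e-5))  # True
print(torch.allclose(out_einsum, out_matmul, atol=1e-5))        # True
print(merged.shape)                                             # torch.Size([2, 10, 64])
```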

Why This Matters for ML

Framework fluency: PyTorch and TensorFlow operate on tensors. Understanding shapes, broadcasting, and permutation prevents bugs.
Efficient computation: Einsum expresses complex operations concisely and allows frameworks to optimize execution.
Model compression: Tensor decompositions reduce model size by factoring large weight tensors.
Attention mechanisms: Multi-head attention is a series of batched tensor contractions.
Data representation: Images, videos, point clouds, and sequences are all naturally tensors.
Memory optimization: Understanding layout and contiguity helps optimize GPU memory usage.

Summary

A tensor is a multi-dimensional array; its order is the number of dimensions.
Broadcasting automatically expands dimensions of size 1, enabling concise batch operations.
Reshaping, permutation, and squeezing change how you view the same data.
Tensor contraction generalizes matrix multiplication by summing over shared indices.
Einstein notation (einsum) is a compact, explicit language for expressing tensor contractions.
Tensor decompositions (CP, Tucker) compress high-dimensional data, analogous to SVD for matrices.
Memory layout (contiguous vs. non-contiguous) affects performance on GPUs.
Next, we tackle large-scale efficiency with sparse matrices and efficient computation.
