Why Integration Matters for ML
Derivatives tell us how functions change. Integrals tell us how functions accumulate. In machine learning, integration is everywhere:
- Probability densities are defined through integrals: $P(a \le X \le b) = \int_a^b p(x)\,dx$
- Expectations are integrals: $\mathbb{E}[f(X)] = \int f(x)\,p(x)\,dx$
- Marginalizing over latent variables requires integrating them out
- Normalizing constants ensure that densities integrate to 1
- Evidence in Bayesian inference is an integral over all possible parameters
If derivatives are the engine of optimization, integrals are the engine of probabilistic reasoning.
The Definite Integral
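The definite integral of $f$ from $a$ to $b$ is the signed area under the curve:
$$\int_a^b f(x)\,dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^*)\,\Delta x$$
where $\Delta x = (b - a)/n$ and $x_i^*$ is a sample point in the $i$-th subinterval. This is the Riemann sum: approximate the area with rectangles, then take the limit as the rectangles become infinitely thin.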
Intuition: If the derivative answers “how fast is this changing?”, the integral answers “how much has accumulated?” They are inverse operations — the Fundamental Theorem of Calculus makes this precise.
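The rectangle construction can be tried numerically. The sketch below is a minimal illustration, assuming a midpoint choice for the sample point and the test function $x^2$ on $[0, 1]$; the helper name `riemann_sum` is hypothetical and not part of the original text.

```python
def riemann_sum(f, a, b, n):
    """Midpoint Riemann sum of f over [a, b] using n equal subintervals."""
    dx = (b - a) / n
    # The sample point x_i* is taken as the midpoint of the i-th subinterval.
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


def square(x):
    return x * x


# The exact integral of x^2 over [0, 1] is 1/3; the error shrinks as n grows.
errors = {n: abs(riemann_sum(square, 0.0, 1.0, n) - 1.0 / 3.0) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(n, err)
```

Any other sample point inside each subinterval (left endpoint, right endpoint) gives the same limit; the midpoint is used here only because it converges faster for smooth functions.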
The Fundamental Theorem of Calculus
The Fundamental Theorem connects differentiation and integration:
Part 1: If $F(x) = \int_a^x f(t)\,dt$, then $F'(x) = f(x)$.
Part 2: If $F$ is an antiderivative of $f$ (meaning $F' = f$), then:
$$\int_a^b f(x)\,dx = F(b) - F(a)$$
This transforms the problem of computing areas into the problem of finding antiderivatives.
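A quick numerical check of Part 2 might look like the following. It is a sketch under assumptions: the sigmoid is used as the integrand because its antiderivative is the softplus function $\ln(1 + e^x)$, the interval $[-2, 3]$ is arbitrary, and `midpoint_integral` is a hypothetical helper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x):
    # softplus(x) = ln(1 + e^x) is an antiderivative of the sigmoid.
    return math.log1p(math.exp(x))


def midpoint_integral(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with a midpoint rule."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


a, b = -2.0, 3.0
area_numeric = midpoint_integral(sigmoid, a, b)  # area computed from rectangles
area_ftc = softplus(b) - softplus(a)             # F(b) - F(a)
print(area_numeric, area_ftc)
```

The two numbers should agree to many decimal places, which is the theorem at work: one evaluation of an antiderivative at each endpoint replaces thousands of rectangle areas.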
Essential Antiderivatives
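| $f(x)$ | $\int f(x)\,dx$ | ML relevance |
|---|---|---|
| $x^n$ ($n \neq -1$) | $\frac{x^{n+1}}{n+1} + C$ | Polynomial features |
| $e^{x}$ | $e^{x} + C$ | Exponential family |
| $\frac{1}{x}$ | $\ln\lvert x\rvert + C$ | Log-likelihood |
| $e^{-x^2}$ | No closed form | Gaussian; requires special functions |
| $\frac{1}{1 + e^{-x}}$ | $\ln(1 + e^{x}) + C$ | Sigmoid / softplus |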
The Gaussian function $e^{-x^2}$ has no elementary antiderivative, yet its integral over the whole real line has a known closed-form value, $\sqrt{\pi}$. This result underpins the normalization of the normal distribution.
Integration Techniques
Substitution (Change of Variables)
If $u = g(x)$, then:
$$\int f(g(x))\,g'(x)\,dx = \int f(u)\,du$$
This is the integral counterpart of the chain rule.
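Example: $\int 2x\,e^{x^2}\,dx$. Let $u = x^2$, $du = 2x\,dx$:
$$\int 2x\,e^{x^2}\,dx = \int e^{u}\,du = e^{u} + C = e^{x^2} + C$$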
Integration by Parts
$$\int u\,dv = uv - \int v\,du$$
This is the integral counterpart of the product rule. It is essential for deriving expectations of products and for working with information-theoretic quantities.
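As a worked instance (the exponential distribution is an illustrative choice, not taken from the text above), integration by parts gives the mean of a random variable $X$ with density $\lambda e^{-\lambda x}$ on $x \ge 0$. Taking $u = x$ and $dv = \lambda e^{-\lambda x}\,dx$, so that $du = dx$ and $v = -e^{-\lambda x}$:

$$\mathbb{E}[X] = \int_0^\infty x\,\lambda e^{-\lambda x}\,dx = \Big[-x\,e^{-\lambda x}\Big]_0^\infty + \int_0^\infty e^{-\lambda x}\,dx = 0 + \frac{1}{\lambda}$$

The boundary term vanishes because $x\,e^{-\lambda x} \to 0$ as $x \to \infty$ for $\lambda > 0$.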
The Gaussian Integral
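The integral $\int_{-\infty}^{\infty} e^{-x^2}\,dx = \sqrt{\pi}$ is foundational. The normalization constant of the Gaussian distribution follows directly:
$$\int_{-\infty}^{\infty} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = \sqrt{2\pi\sigma^2}$$
Generalizing to $d$ dimensions with covariance matrix $\Sigma$:
$$\int_{\mathbb{R}^d} \exp\!\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right) dx = (2\pi)^{d/2}\,\lvert\Sigma\rvert^{1/2}$$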
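The one-dimensional results can be sanity-checked numerically. This is a minimal sketch: the truncation to a finite interval, the parameter values, and the helper names are assumptions made for illustration.

```python
import math


def midpoint_integral(f, a, b, n=20_000):
    """Approximate the integral of f over [a, b] with a midpoint rule."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


# exp(-x^2) is below 1e-43 outside [-10, 10], so truncating there is harmless.
gauss_area = midpoint_integral(lambda x: math.exp(-x * x), -10.0, 10.0)
print(gauss_area, math.sqrt(math.pi))


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi * sigma * sigma)


# With the normalization constant in place, the density integrates to 1.
mu, sigma = 1.5, 0.7
total_mass = midpoint_integral(
    lambda x: normal_pdf(x, mu, sigma), mu - 10 * sigma, mu + 10 * sigma
)
print(total_mass)
```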
Multiple Integrals
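Functions of several variables require multiple integrals. For $f(x, y)$ on a rectangle $R = [a, b] \times [c, d]$:
$$\iint_R f(x, y)\,dA = \int_a^b \int_c^d f(x, y)\,dy\,dx$$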
Fubini’s theorem lets us compute double integrals as iterated single integrals — integrating one variable at a time. This is precisely how marginalization works.
Change of Variables in Multiple Integrals
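When transforming coordinates $x = g(u)$:
$$\int f(x)\,dx = \int f(g(u))\,\left\lvert \det J_g(u) \right\rvert\,du$$
where $J_g$ is the Jacobian matrix of $g$.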
The Jacobian determinant accounts for how the transformation stretches or compresses volume. This is the mathematical foundation of normalizing flows — a class of generative models that transform simple distributions into complex ones using invertible functions with tractable Jacobians.
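A one-dimensional toy version of this idea is sketched below. It assumes a base variable $u \sim \mathcal{N}(0, 1)$ pushed through the invertible map $x = e^{u}$ (an illustrative choice, not a model from the text); in one dimension the Jacobian determinant reduces to $\lvert du/dx \rvert = 1/x$. The density from the formula is compared against a histogram-style estimate from samples.

```python
import math
import random


def standard_normal_pdf(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def flow_density(x):
    """Density of x = exp(u), u ~ N(0, 1), via the change-of-variables formula."""
    u = math.log(x)          # inverse transform u = g^{-1}(x)
    abs_jacobian = 1.0 / x   # |du/dx| for u = log(x)
    return standard_normal_pdf(u) * abs_jacobian


def midpoint_integral(f, a, b, n=10_000):
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx


# P(1 <= x <= 2) computed from the transformed density ...
prob_from_density = midpoint_integral(flow_density, 1.0, 2.0)

# ... and estimated by pushing base samples through the transform.
rng = random.Random(0)
n_samples = 200_000
hits = sum(1 for _ in range(n_samples) if 1.0 <= math.exp(rng.gauss(0.0, 1.0)) <= 2.0)
prob_from_samples = hits / n_samples

print(prob_from_density, prob_from_samples)
```

Dropping the `abs_jacobian` factor would give a function that no longer matches the samples, which is exactly the volume correction the Jacobian supplies.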
Expectation as Integration
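The expected value of a continuous random variable $X$ with density $p(x)$ is:
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x\,p(x)\,dx$$
More generally, for any function $f$:
$$\mathbb{E}[f(X)] = \int_{-\infty}^{\infty} f(x)\,p(x)\,dx$$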
This formula computes every statistical quantity we care about:
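| Quantity | Formula | Integral form |
|---|---|---|
| Mean | $\mu = \mathbb{E}[X]$ | $\int x\,p(x)\,dx$ |
| Variance | $\mathbb{E}[(X - \mu)^2]$ | $\int (x - \mu)^2\,p(x)\,dx$ |
| Entropy | $-\mathbb{E}_p[\log p(X)]$ | $-\int p(x) \log p(x)\,dx$ |
| KL divergence | $\mathbb{E}_p\!\left[\log \frac{p(X)}{q(X)}\right]$ | $\int p(x) \log \frac{p(x)}{q(x)}\,dx$ |
| Cross-entropy | $-\mathbb{E}_p[\log q(X)]$ | $-\int p(x) \log q(x)\,dx$ |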
Key insight: Nearly every loss function and evaluation metric in ML can be written as an expectation — an integral of some function weighted by a probability distribution. Understanding this unifies many seemingly different concepts under one framework.
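To make the "everything is an expectation" view concrete, the sketch below evaluates several of these quantities with one generic routine. The distributions ($p = \mathcal{N}(0, 1)$, $q = \mathcal{N}(1, 2^2)$), the integration range, and the helper names are assumptions chosen for illustration; the closed-form Gaussian values are printed alongside for comparison.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def expectation(f, pdf, lo=-12.0, hi=12.0, n=24_000):
    """Approximate E_p[f(X)] = integral of f(x) p(x) dx with a midpoint rule."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        total += f(x) * pdf(x) * dx
    return total


def p(x):
    return normal_pdf(x, 0.0, 1.0)


def q(x):
    return normal_pdf(x, 1.0, 2.0)


mean = expectation(lambda x: x, p)
variance = expectation(lambda x: (x - mean) ** 2, p)
entropy = expectation(lambda x: -math.log(p(x)), p)
kl_pq = expectation(lambda x: math.log(p(x) / q(x)), p)
cross_entropy = expectation(lambda x: -math.log(q(x)), p)

# Closed forms for these two Gaussians.
entropy_exact = 0.5 * math.log(2.0 * math.pi * math.e)
kl_exact = math.log(2.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5

print(mean, variance)
print(entropy, entropy_exact)
print(kl_pq, kl_exact)
print(cross_entropy, entropy + kl_pq)  # cross-entropy = entropy + KL
```

Only the function passed as `f` changes between quantities; the weighting by $p(x)$ and the integration stay the same.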
Marginalization
Marginalization integrates out (eliminates) variables we do not need. If $p(x, y)$ is a joint density:
$$p(x) = \int p(x, y)\,dy$$
This is the continuous version of the law of total probability.
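In low dimensions the integral can simply be computed. The following sketch assumes a zero-mean, unit-variance bivariate Gaussian with correlation 0.8 (an illustrative joint density), for which the marginal of $x$ is known to be standard normal; `marginal_x` is a hypothetical helper.

```python
import math


def joint_pdf(x, y, rho=0.8):
    """Zero-mean, unit-variance bivariate Gaussian with correlation rho."""
    norm = 1.0 / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))
    quad = (x * x - 2.0 * rho * x * y + y * y) / (1.0 - rho * rho)
    return norm * math.exp(-0.5 * quad)


def marginal_x(x, lo=-10.0, hi=10.0, n=4_000):
    """Integrate the joint density over y with a midpoint rule."""
    dy = (hi - lo) / n
    return sum(joint_pdf(x, lo + (j + 0.5) * dy) for j in range(n)) * dy


def standard_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


for x in (-1.0, 0.0, 2.0):
    print(x, marginal_x(x), standard_normal_pdf(x))
```

One variable costs a few thousand evaluations per point here. Integrating out $k$ variables on the same kind of grid would cost that number raised to the power $k$, which is why this direct approach stops working quickly.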
Why Marginalization is Hard
In Bayesian inference, the evidence (marginal likelihood) is:
$$p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta)\,p(\theta)\,d\theta$$
For a neural network with millions of parameters, this integral ranges over a space with millions of dimensions, which makes it computationally intractable. This intractability motivates:
- Variational inference: Approximate the integral with an optimization problem (see calculus of variations)
- Monte Carlo methods: Estimate the integral using random samples
- Laplace approximation: Approximate the integrand as Gaussian using a second-order Taylor expansion
Monte Carlo Integration
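When integrals lack closed-form solutions, Monte Carlo integration estimates them using random samples. The key identity:
$$\int f(x)\,p(x)\,dx = \mathbb{E}_{x \sim p}[f(x)] \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \qquad x_i \sim p$$
Draw $N$ samples from $p$, evaluate $f$ at each, and average. By the law of large numbers, this converges to the true integral as $N \to \infty$.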
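The sample-and-average recipe can be sketched in a few lines. The example assumes a standard normal sampling distribution and two test functions with known expectations ($\mathbb{E}[X^2] = 1$ and $\mathbb{E}[\cos X] = e^{-1/2}$); the function name `monte_carlo_expectation` and the fixed seed are illustrative choices.

```python
import math
import random


def monte_carlo_expectation(f, sampler, n):
    """Estimate E_p[f(X)] by averaging f over n samples drawn by sampler()."""
    return sum(f(sampler()) for _ in range(n)) / n


rng = random.Random(0)


def draw_standard_normal():
    return rng.gauss(0.0, 1.0)


second_moment = monte_carlo_expectation(lambda x: x * x, draw_standard_normal, 100_000)
mean_cos = monte_carlo_expectation(math.cos, draw_standard_normal, 100_000)

print(second_moment, 1.0)
print(mean_cos, math.exp(-0.5))
```

The estimator never evaluates the density itself; it only needs a way to draw samples, which is what makes it usable when the density is awkward to integrate directly.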
Convergence Rate
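Monte Carlo estimators converge at rate $O(1/\sqrt{N})$ regardless of dimension. This is remarkable: deterministic grid-based quadrature needs a number of evaluation points that grows exponentially with dimension to hold a fixed accuracy (the curse of dimensionality), whereas the Monte Carlo rate does not depend on dimension. The constant in front of the rate is the standard deviation of $f(x)$ under $p$, which can still be large for difficult integrands.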
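The square-root scaling can be observed empirically: an error that shrinks like $1/\sqrt{N}$ should roughly halve when the sample size is quadrupled. The sketch below assumes the target $\mathbb{E}[X^2] = 1$ for $X \sim \mathcal{N}(0, 1)$ and measures the root-mean-square error over repeated runs; the sample sizes, trial count, and helper name are illustrative.

```python
import math
import random


def rmse_of_estimator(n, trials, rng):
    """RMSE of the n-sample Monte Carlo estimate of E[X^2] = 1, X ~ N(0, 1)."""
    squared_errors = []
    for _ in range(trials):
        estimate = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n
        squared_errors.append((estimate - 1.0) ** 2)
    return math.sqrt(sum(squared_errors) / trials)


rng = random.Random(0)
rmse_small = rmse_of_estimator(100, 1000, rng)   # N = 100
rmse_large = rmse_of_estimator(400, 1000, rng)   # N = 400, four times as many samples
ratio = rmse_small / rmse_large

print(rmse_small, rmse_large, ratio)
```

The printed ratio should land near 2, up to sampling noise. The practical consequence is that each extra decimal digit of accuracy costs about 100 times more samples.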
Key insight: Monte Carlo integration is the reason probabilistic ML scales to high dimensions. A 1000-dimensional integral is intractable for grid-based methods but routine for Monte Carlo. This is why sampling-based methods (MCMC, variational inference with reparameterization) dominate modern Bayesian deep learning.
Importance Sampling
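When sampling from $p$ is difficult, we can sample from a different distribution $q$ (with $q(x) > 0$ wherever $f(x)\,p(x) \neq 0$) and reweight:
$$\mathbb{E}_{x \sim p}[f(x)] = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx = \mathbb{E}_{x \sim q}\!\left[f(x)\,\frac{p(x)}{q(x)}\right]$$
The ratio $w(x) = p(x)/q(x)$ is the importance weight. Importance sampling appears in: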
- Off-policy reinforcement learning (correcting for behavior policy)
- Variational autoencoders (importance-weighted ELBO)
- Rare event estimation
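Rare event estimation gives a compact illustration of reweighting. The sketch below estimates $P(X > 4)$ for $X \sim \mathcal{N}(0, 1)$, assuming a proposal $q = \mathcal{N}(4, 1)$ centered on the threshold; for these two unit-variance Gaussians the importance weight simplifies to $\exp(-t x + t^2/2)$ with $t$ the threshold. The threshold, proposal, and function names are illustrative choices.

```python
import math
import random


def tail_prob_importance(threshold, n, rng):
    """Estimate P(X > threshold), X ~ N(0, 1), sampling from q = N(threshold, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)  # sample from the proposal q
        if x > threshold:
            # Importance weight p(x) / q(x) for the two unit-variance Gaussians.
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n


def tail_prob_naive(threshold, n, rng):
    """Plain Monte Carlo: fraction of N(0, 1) samples above the threshold."""
    return sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold) / n


rng = random.Random(0)
threshold = 4.0
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))
estimate_is = tail_prob_importance(threshold, 50_000, rng)
estimate_naive = tail_prob_naive(threshold, 50_000, rng)

print(exact, estimate_is, estimate_naive)
```

With the same budget, the naive estimator sees the event only a handful of times (often not at all), while about half of the proposal samples land in the region of interest and are then weighted down to the correct scale.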
The Reparameterization Trick
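Variational autoencoders need to backpropagate through an expectation:
$$\nabla_\phi\,\mathbb{E}_{z \sim q_\phi(z \mid x)}[f(z)]$$
The problem: $z$ is sampled from $q_\phi$, which depends on the parameters $\phi$ we want to differentiate. We cannot backpropagate through a sampling operation.
The reparameterization trick rewrites the sampling as a deterministic function of a noise variable:
$$z = \mu_\phi + \sigma_\phi \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
Now the expectation is over $\epsilon$ (independent of $\phi$), and we can move the gradient inside:
$$\nabla_\phi\,\mathbb{E}_{\epsilon}[f(\mu_\phi + \sigma_\phi \odot \epsilon)] = \mathbb{E}_{\epsilon}[\nabla_\phi f(\mu_\phi + \sigma_\phi \odot \epsilon)]$$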
Key insight: The reparameterization trick connects integration and differentiation — it allows us to compute gradients of expectations, enabling end-to-end training of models with stochastic latent variables (VAEs, stochastic neural networks).
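A scalar toy case shows the mechanics without any autodiff library. The sketch assumes $f(z) = z^2$ and $z \sim \mathcal{N}(\mu, \sigma^2)$, chosen because $\mathbb{E}[z^2] = \mu^2 + \sigma^2$ gives exact gradients $2\mu$ and $2\sigma$ to compare against; the derivatives are written out by hand with the chain rule, and the function name is hypothetical.

```python
import random


def reparam_gradients(mu, sigma, n, rng):
    """Monte Carlo gradient of E[z^2], z ~ N(mu, sigma^2), w.r.t. mu and sigma."""
    grad_mu = 0.0
    grad_sigma = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)    # noise, independent of (mu, sigma)
        z = mu + sigma * eps         # deterministic, differentiable transform
        df_dz = 2.0 * z              # derivative of f(z) = z^2
        grad_mu += df_dz * 1.0       # chain rule: dz/dmu = 1
        grad_sigma += df_dz * eps    # chain rule: dz/dsigma = eps
    return grad_mu / n, grad_sigma / n


rng = random.Random(0)
mu, sigma = 1.0, 0.5
g_mu, g_sigma = reparam_gradients(mu, sigma, 100_000, rng)

# Closed form: E[z^2] = mu^2 + sigma^2, so the exact gradients are 2*mu and 2*sigma.
print(g_mu, 2.0 * mu)
print(g_sigma, 2.0 * sigma)
```

In a real model the hand-written chain rule is replaced by automatic differentiation, but the structure is the same: sample the noise, build $z$ from the parameters, and differentiate through that construction.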
Worked Example: Computing Evidence
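Consider a simple Bayesian model: $x \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$ with prior $\theta \sim \mathcal{N}(0, \tau^2)$. The evidence for observing $x$:
$$p(x) = \int \mathcal{N}(x; \theta, \sigma^2)\,\mathcal{N}(\theta; 0, \tau^2)\,d\theta$$
This is a Gaussian integral. Completing the square in the exponent yields:
$$p(x) = \mathcal{N}(x; 0, \sigma^2 + \tau^2)$$
The evidence is itself Gaussian with variance $\sigma^2 + \tau^2$. This closed-form solution is the exception; for most models, the evidence integral is intractable.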
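Because this model has a single scalar parameter, the closed form can be checked by brute-force integration. The sketch assumes a Gaussian likelihood with mean $\theta$ and variance $\sigma^2$ and a zero-mean Gaussian prior with variance $\tau^2$; the numerical values and helper names are illustrative.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def evidence_numeric(x, sigma2, tau2, n=20_000):
    """Integrate likelihood * prior over theta with a midpoint rule."""
    half_width = 12.0 * math.sqrt(tau2)  # prior mass outside this range is negligible
    d_theta = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        theta = -half_width + (i + 0.5) * d_theta
        total += normal_pdf(x, theta, sigma2) * normal_pdf(theta, 0.0, tau2) * d_theta
    return total


x_obs, sigma2, tau2 = 1.3, 0.5, 2.0
numeric = evidence_numeric(x_obs, sigma2, tau2)
closed_form = normal_pdf(x_obs, 0.0, sigma2 + tau2)  # Gaussian with summed variances
print(numeric, closed_form)
```

The grid has 20,000 points for one parameter. The same resolution per axis is out of reach for even a handful of parameters, which is the intractability the sentence above refers to.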
Why This Matters for ML
Integration is the mathematical backbone of probabilistic ML:
- Probability densities are normalized by integrals — without integration, we cannot define continuous distributions
- Expectations (means, variances, losses) are all integrals over distributions
- Marginalization integrates out latent variables — essential for Bayesian inference and mixture models
- Monte Carlo methods estimate intractable integrals using random samples, scaling to arbitrary dimensions
- The reparameterization trick allows backpropagation through stochastic sampling, enabling VAE training
- Normalizing flows use the change-of-variables formula (Jacobian determinant) to define complex distributions
Summary
- The definite integral computes accumulated area; the Fundamental Theorem links it to antiderivatives
- Multiple integrals and the change-of-variables formula (Jacobian) generalize to higher dimensions
- Expectation is an integral: $\mathbb{E}[f(X)] = \int f(x)\,p(x)\,dx$; nearly every ML quantity is an expectation
- Marginalization integrates out unwanted variables but is often intractable in high dimensions
- Monte Carlo integration estimates integrals via sampling at rate $O(1/\sqrt{N})$ regardless of dimension
- Importance sampling reweights samples from an easy distribution to estimate expectations under a hard one
- The reparameterization trick enables gradient-based optimization through stochastic sampling
- Next: calculus of variations optimizes over entire functions, not just parameters