Integration and Expectation: The Continuous Side of Probability

From Riemann integrals to Monte Carlo estimation — how integration underpins probability densities, expectations, and marginalizations in ML.

Calculus & Optimization March 7, 2026 9 min read

Why Integration Matters for ML

Derivatives tell us how functions change. Integrals tell us how functions accumulate. In machine learning, integration is everywhere:

  • Probability densities are defined through integrals: $P(a \leq X \leq b) = \int_a^b f(x) \, dx$
  • Expectations are integrals: $\mathbb{E}[X] = \int x \, f(x) \, dx$
  • Marginalizing over latent variables requires integrating them out
  • Normalizing constants ensure that densities integrate to 1
  • Evidence in Bayesian inference is an integral over all possible parameters

If derivatives are the engine of optimization, integrals are the engine of probabilistic reasoning.

The Definite Integral

The definite integral of $f$ from $a$ to $b$ is the signed area under the curve:

$$\int_a^b f(x) \, dx = \lim_{n \to \infty} \sum_{i=1}^{n} f(x_i^*) \, \Delta x$$

where $\Delta x = (b - a)/n$ and $x_i^*$ is a sample point in the $i$-th subinterval. The sum on the right is a Riemann sum: approximate the area with rectangles, then take the limit as the rectangles become infinitely thin.
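The limit above can be watched numerically. The following is a minimal sketch, assuming a hypothetical helper named `riemann_sum` and using $\sin(x)$ on $[0, \pi]$ as the test function, whose integral is exactly 2; it is an illustration rather than a production quadrature routine.

```python
import math

def riemann_sum(f, a, b, n, rule="midpoint"):
    """Approximate the integral of f over [a, b] with n equal-width rectangles."""
    dx = (b - a) / n
    # Where inside each subinterval the sample point x_i^* is taken.
    shift = {"left": 0.0, "midpoint": 0.5, "right": 1.0}[rule]
    return sum(f(a + (i + shift) * dx) for i in range(n)) * dx

# The exact value of the integral of sin(x) over [0, pi] is 2.
for n in (4, 16, 64, 256):
    approx = riemann_sum(math.sin, 0.0, math.pi, n)
    print(n, approx, abs(approx - 2.0))
```

The printed error should shrink as `n` grows, which is the limit in the definition made concrete.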

Intuition: If the derivative answers “how fast is this changing?”, the integral answers “how much has accumulated?” They are inverse operations — the Fundamental Theorem of Calculus makes this precise.

The Fundamental Theorem of Calculus

The Fundamental Theorem connects differentiation and integration:

Part 1: If $f$ is continuous and $F(x) = \int_a^x f(t) \, dt$, then $F'(x) = f(x)$.

Part 2: If $F$ is an antiderivative of $f$ (meaning $F' = f$), then:

$$\int_a^b f(x) \, dx = F(b) - F(a)$$

This transforms the problem of computing areas into the problem of finding antiderivatives.

Essential Antiderivatives

| $f(x)$ | $\int f(x) \, dx$ | ML relevance |
| --- | --- | --- |
| $x^n$ ($n \neq -1$) | $\frac{x^{n+1}}{n+1} + C$ | Polynomial features |
| $e^x$ | $e^x + C$ | Exponential family |
| $1/x$ | $\ln\lvert x \rvert + C$ | Log-likelihood |
| $e^{-x^2}$ | No elementary closed form | Gaussian; requires special functions |
| $\frac{1}{1+e^{-x}}$ | $\ln(1 + e^x) + C$ | Sigmoid / softplus |

The integrand of the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi}$ has no elementary antiderivative, yet the definite integral over the whole real line has a known closed-form value. This result underpins the normalization of the normal distribution.
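The sigmoid row of the table can be checked with Part 2 of the Fundamental Theorem: a numerical integral of the sigmoid over $[a, b]$ should match the difference of softplus values at the endpoints. This is a minimal sketch; the helper names and the interval $[-2, 3]$ are arbitrary choices made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softplus(x):
    # ln(1 + e^x), an antiderivative of the sigmoid
    return math.log1p(math.exp(x))

def midpoint_integral(f, a, b, n=10_000):
    """Midpoint Riemann sum of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

a, b = -2.0, 3.0
numeric = midpoint_integral(sigmoid, a, b)   # area under the sigmoid
exact = softplus(b) - softplus(a)            # F(b) - F(a)
print(numeric, exact)
```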

Integration Techniques

Substitution (Change of Variables)

If $u = g(x)$, then:

$$\int f(g(x)) \, g'(x) \, dx = \int f(u) \, du$$

This is the integral counterpart of the chain rule.

Example: $\int 2x \, e^{x^2} \, dx$. Let $u = x^2$, so $du = 2x \, dx$, and the integral becomes $\int e^u \, du = e^u + C = e^{x^2} + C$.

Integration by Parts

$$\int u \, dv = uv - \int v \, du$$

This is the integral counterpart of the product rule. It is essential for deriving expectations of products and for working with information-theoretic quantities.
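As a small worked instance (an illustration chosen here, not taken from the text above), integration by parts gives the mean of an exponential distribution with rate $\lambda > 0$. Take $u = x$ and $dv = \lambda e^{-\lambda x} \, dx$, so that $du = dx$ and $v = -e^{-\lambda x}$:

```latex
\begin{aligned}
\mathbb{E}[X] &= \int_0^{\infty} x \, \lambda e^{-\lambda x} \, dx \\
&= \left[ -x \, e^{-\lambda x} \right]_0^{\infty} + \int_0^{\infty} e^{-\lambda x} \, dx \\
&= 0 + \frac{1}{\lambda} = \frac{1}{\lambda}
\end{aligned}
```

The boundary term vanishes because $x \, e^{-\lambda x} \to 0$ as $x \to \infty$.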

The Gaussian Integral

The integral $\int_{-\infty}^{\infty} e^{-x^2/2} \, dx = \sqrt{2\pi}$ is foundational. The normalization constant of the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$ follows directly:

$$\int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) dx = 1$$

Generalizing to $n$ dimensions with covariance matrix $\boldsymbol{\Sigma}$:

$$\int_{\mathbb{R}^n} \frac{1}{(2\pi)^{n/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\right) d\mathbf{x} = 1$$
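The one-dimensional statements can be sanity-checked numerically. The sketch below assumes a simple midpoint rule on a wide but finite interval (the tails beyond ten standard deviations are negligible), and the values of `mu` and `sigma` are arbitrary illustration choices.

```python
import math

def midpoint_integral(f, a, b, n=20_000):
    """Midpoint Riemann sum of f over [a, b]."""
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

# Unnormalized integral of exp(-x^2 / 2): should be close to sqrt(2 * pi).
raw = midpoint_integral(lambda x: math.exp(-0.5 * x * x), -10.0, 10.0)

# Normalized density with assumed parameters: should be close to 1.
mu, sigma = 1.5, 0.7
total = midpoint_integral(
    lambda x: gaussian_pdf(x, mu, sigma), mu - 10.0 * sigma, mu + 10.0 * sigma
)
print(raw, math.sqrt(2.0 * math.pi), total)
```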

Multiple Integrals

Functions of several variables require multiple integrals. For $f: \mathbb{R}^2 \to \mathbb{R}$:

$$\iint_D f(x, y) \, dA = \int_a^b \int_{c(x)}^{d(x)} f(x, y) \, dy \, dx$$

Fubini’s theorem lets us compute double integrals as iterated single integrals — integrating one variable at a time. This is precisely how marginalization works.
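An iterated integral can be sketched as two nested one-dimensional sums. The example below is an assumption-laden illustration: the helper `iterated_integral` is hypothetical, and the region is the triangle $0 \leq y \leq x \leq 1$ with integrand $x + y$, for which the exact value is $\int_0^1 \tfrac{3}{2} x^2 \, dx = \tfrac{1}{2}$.

```python
def iterated_integral(f, a, b, c, d, n=400):
    """Integrate f(x, y) over a <= x <= b, c(x) <= y <= d(x) with nested midpoint sums."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        lo, hi = c(x), d(x)
        dy = (hi - lo) / n
        # Inner integral over y for this fixed x.
        inner = sum(f(x, lo + (j + 0.5) * dy) for j in range(n)) * dy
        total += inner * dx
    return total

value = iterated_integral(
    lambda x, y: x + y, 0.0, 1.0, lambda x: 0.0, lambda x: x
)
print(value)  # should be close to 0.5
```

Marginalizing a joint density over one variable has the same shape: the inner sum plays the role of integrating that variable out.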

Change of Variables in Multiple Integrals

When transforming coordinates $\mathbf{u} = g(\mathbf{x})$:

$$\int f(\mathbf{x}) \, d\mathbf{x} = \int f(g^{-1}(\mathbf{u})) \, |\det \mathbf{J}_{g^{-1}}| \, d\mathbf{u}$$

The Jacobian determinant $|\det \mathbf{J}|$ accounts for how the transformation stretches or compresses volume. This is the mathematical foundation of normalizing flows, a class of generative models that transform simple distributions into complex ones using invertible functions with tractable Jacobians.
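A one-dimensional instance makes the Jacobian factor concrete. The sketch below assumes $x \sim \mathcal{N}(0, 1)$ and the map $u = g(x) = e^x$ (both chosen here for illustration), so $g^{-1}(u) = \ln u$ and the Jacobian factor is $1/u$. It compares the probability of $1 \leq u \leq 2$ computed three ways: by integrating the transformed density, from the normal CDF, and by sampling.

```python
import math
import random

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def transformed_pdf(u):
    """Density of u = exp(x), x ~ N(0, 1): p_x(g^{-1}(u)) * |d g^{-1} / du|."""
    x = math.log(u)        # g^{-1}(u)
    jacobian = 1.0 / u     # |d/du log(u)|
    return std_normal_pdf(x) * jacobian

def midpoint_integral(f, a, b, n=10_000):
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

lo, hi = 1.0, 2.0
via_density = midpoint_integral(transformed_pdf, lo, hi)
via_cdf = std_normal_cdf(math.log(hi)) - std_normal_cdf(math.log(lo))

random.seed(0)
samples = [math.exp(random.gauss(0.0, 1.0)) for _ in range(100_000)]
via_sampling = sum(lo <= u <= hi for u in samples) / len(samples)
print(via_density, via_cdf, via_sampling)
```

Dropping the `jacobian` factor would make the first number disagree with the other two, which is the error the change-of-variables formula guards against.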

Expectation as Integration

The expected value of a continuous random variable $X$ with density $f(x)$ is:

$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$

More generally, for any function $h(X)$:

$$\mathbb{E}[h(X)] = \int h(x) \, f(x) \, dx$$

Many of the statistical quantities used in ML are instances of this formula:

| Quantity | Formula | Integral form |
| --- | --- | --- |
| Mean | $\mu = \mathbb{E}[X]$ | $\int x \, f(x) \, dx$ |
| Variance | $\text{Var}(X) = \mathbb{E}[(X-\mu)^2]$ | $\int (x-\mu)^2 f(x) \, dx$ |
| Entropy | $H(X) = -\mathbb{E}[\log f(X)]$ | $-\int f(x) \log f(x) \, dx$ |
| KL divergence | $D_{KL}(p \parallel q)$ | $\int p(x) \log \frac{p(x)}{q(x)} \, dx$ |
| Cross-entropy | $H(p, q)$ | $-\int p(x) \log q(x) \, dx$ |

Key insight: Nearly every loss function and evaluation metric in ML can be written as an expectation — an integral of some function weighted by a probability distribution. Understanding this unifies many seemingly different concepts under one framework.
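Each row of the table can be evaluated with the same numerical routine by swapping the function $h$. The sketch below assumes $p = \mathcal{N}(1, 2^2)$ and $q = \mathcal{N}(0, 1)$ (arbitrary illustration choices) and a hypothetical helper `expectation` that applies a midpoint rule on a wide finite interval.

```python
import math

def gaussian_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def expectation(h, logpdf, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of h(x) * exp(logpdf(x)) over [a, b]."""
    dx = (b - a) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * dx
        total += h(x) * math.exp(logpdf(x))
    return total * dx

mu, sigma = 1.0, 2.0

def log_p(x):
    return gaussian_logpdf(x, mu, sigma)

def log_q(x):
    return gaussian_logpdf(x, 0.0, 1.0)

a, b = mu - 12.0 * sigma, mu + 12.0 * sigma   # tails beyond this are negligible

mean = expectation(lambda x: x, log_p, a, b)
variance = expectation(lambda x: (x - mean) ** 2, log_p, a, b)
entropy = expectation(lambda x: -log_p(x), log_p, a, b)
kl = expectation(lambda x: log_p(x) - log_q(x), log_p, a, b)
cross_entropy = expectation(lambda x: -log_q(x), log_p, a, b)
print(mean, variance, entropy, kl, cross_entropy)
```

For these Gaussians the results can be compared with the known closed forms, for example entropy $\tfrac{1}{2}\ln(2\pi e \sigma^2)$, and the identity $H(p, q) = H(p) + D_{KL}(p \parallel q)$ should hold up to numerical error.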

Marginalization

Marginalization integrates out (eliminates) variables we do not need. If $p(x, z)$ is a joint density:

$$p(x) = \int p(x, z) \, dz = \int p(x \mid z) \, p(z) \, dz$$

This is the continuous version of the law of total probability.

Why Marginalization is Hard

In Bayesian inference, the evidence (marginal likelihood) is:

$$p(\mathcal{D}) = \int p(\mathcal{D} \mid \boldsymbol{\theta}) \, p(\boldsymbol{\theta}) \, d\boldsymbol{\theta}$$

For a neural network with millions of parameters, this integral is over a million-dimensional space — computationally intractable. This intractability motivates:

  • Variational inference: Approximate the integral with an optimization problem (see calculus of variations)
  • Monte Carlo methods: Estimate the integral using random samples
  • Laplace approximation: Approximate the integrand as Gaussian using a second-order Taylor expansion

Monte Carlo Integration

When integrals lack closed-form solutions, Monte Carlo integration estimates them using random samples. The key identity:

$$\mathbb{E}_{x \sim p}[h(x)] = \int h(x) \, p(x) \, dx \approx \frac{1}{N}\sum_{i=1}^{N} h(x_i), \quad x_i \sim p$$

Draw $N$ samples from $p$, evaluate $h$ at each, and average. By the law of large numbers, this converges to the true integral as $N \to \infty$.
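The recipe is only a few lines of code. This minimal sketch assumes $p = \mathcal{N}(0, 1)$ and $h(x) = x^2$, for which the true expectation is 1; the helper name `monte_carlo_expectation` is hypothetical.

```python
import random

def monte_carlo_expectation(h, sampler, n):
    """Average h over n independent draws from sampler()."""
    return sum(h(sampler()) for _ in range(n)) / n

random.seed(0)
estimate = monte_carlo_expectation(
    lambda x: x * x,                    # h(x)
    lambda: random.gauss(0.0, 1.0),     # draws x_i ~ p = N(0, 1)
    100_000,
)
print(estimate)  # should be close to E[x^2] = 1
```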

Convergence Rate

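The error of a Monte Carlo estimator shrinks at rate $O(1/\sqrt{N})$ whenever $h(x)$ has finite variance, and this rate does not depend on the dimension of $x$ (although the variance constant in front of it can). This is remarkable: grid-based quadrature needs a number of evaluation points that grows exponentially with dimension (the curse of dimensionality), but Monte Carlo does not.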

Key insight: Monte Carlo integration is the reason probabilistic ML scales to high dimensions. A 1000-dimensional integral is intractable for grid-based methods but routine for Monte Carlo. This is why sampling-based methods (MCMC, variational inference with reparameterization) dominate modern Bayesian deep learning.
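The $1/\sqrt{N}$ scaling can be observed empirically by repeating an estimate many times and measuring its root-mean-square error. The sketch below is a one-dimensional illustration only (it does not demonstrate dimension independence): it assumes $h(x) = x^2$ under $\mathcal{N}(0, 1)$, and the sample sizes and replicate count are arbitrary. Increasing $N$ by a factor of 16 should cut the error by roughly a factor of 4.

```python
import math
import random

def mc_estimate(n, rng):
    """Monte Carlo estimate of E[x^2] for x ~ N(0, 1) from n samples."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)) / n

def rmse(n, replicates, rng, truth=1.0):
    """Root-mean-square error of the estimator over independent replicates."""
    squared_errors = [(mc_estimate(n, rng) - truth) ** 2 for _ in range(replicates)]
    return math.sqrt(sum(squared_errors) / replicates)

rng = random.Random(0)
rmse_small = rmse(100, 200, rng)     # N = 100
rmse_large = rmse(1600, 200, rng)    # N = 1600, sixteen times more samples
ratio = rmse_small / rmse_large
print(rmse_small, rmse_large, ratio)  # ratio should be near sqrt(16) = 4
```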

Importance Sampling

When sampling from $p$ is difficult, we can sample from a different distribution $q$ (with $q(x) > 0$ wherever $h(x) \, p(x) \neq 0$) and reweight:

$$\mathbb{E}_{p}[h(x)] = \mathbb{E}_{q}\left[h(x) \frac{p(x)}{q(x)}\right] \approx \frac{1}{N}\sum_{i=1}^{N} h(x_i) \frac{p(x_i)}{q(x_i)}, \quad x_i \sim q$$

The ratio $w(x) = p(x)/q(x)$ is the importance weight. Importance sampling appears in:

  • Off-policy reinforcement learning (correcting for behavior policy)
  • Variational autoencoders (importance-weighted ELBO)
  • Rare event estimation
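Rare event estimation gives a compact illustration. The sketch below assumes the target $p = \mathcal{N}(0, 1)$, the event $x > 3$, and a proposal $q = \mathcal{N}(3, 1)$ centered on the rare region; these choices and the function names are made here for illustration. Plain sampling from $p$ would hit the event only about once per thousand draws, while the reweighted proposal samples land in it about half the time.

```python
import math
import random

def normal_logpdf(x, mu, sigma):
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))

def tail_probability_is(threshold, n, rng):
    """Estimate P(x > threshold) for x ~ N(0, 1) using proposal q = N(threshold, 1)."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)          # x_i ~ q
        if x > threshold:                      # h(x) = indicator of the event
            log_w = normal_logpdf(x, 0.0, 1.0) - normal_logpdf(x, threshold, 1.0)
            total += math.exp(log_w)           # importance weight p(x) / q(x)
    return total / n

rng = random.Random(0)
is_estimate = tail_probability_is(3.0, 20_000, rng)
exact = 0.5 * math.erfc(3.0 / math.sqrt(2.0))   # closed form via the error function
print(is_estimate, exact)
```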

The Reparameterization Trick

Variational autoencoders need to backpropagate through an expectation:

$$\nabla_\phi \mathbb{E}_{z \sim q_\phi(z)}[f(z)]$$

The problem: $z$ is sampled from $q_\phi$, which depends on the parameters $\phi$ we want to differentiate. We cannot backpropagate through a sampling operation.

The reparameterization trick rewrites the sampling as a deterministic function of a noise variable:

$$z = \mu_\phi + \sigma_\phi \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

Now the expectation is over $\epsilon$ (independent of $\phi$), and we can move the gradient inside:

$$\nabla_\phi \mathbb{E}_{\epsilon}[f(\mu_\phi + \sigma_\phi \odot \epsilon)] = \mathbb{E}_{\epsilon}[\nabla_\phi f(\mu_\phi + \sigma_\phi \odot \epsilon)]$$

Key insight: The reparameterization trick connects integration and differentiation — it allows us to compute gradients of expectations, enabling end-to-end training of models with stochastic latent variables (VAEs, stochastic neural networks).
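A scalar example shows the trick without any autodiff library. The sketch below assumes $f(z) = z^2$ and $z \sim \mathcal{N}(\mu, \sigma^2)$, with gradients written out by hand through the chain rule; in practice a framework would compute them automatically. Since $\mathbb{E}[z^2] = \mu^2 + \sigma^2$, the true gradients are $2\mu$ and $2\sigma$, which gives something to compare against.

```python
import random

def reparam_gradients(mu, sigma, n, rng):
    """Monte Carlo estimates of d/dmu and d/dsigma of E[z^2], z = mu + sigma * eps."""
    grad_mu_sum = 0.0
    grad_sigma_sum = 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)        # noise, independent of the parameters
        z = mu + sigma * eps             # deterministic function of (mu, sigma, eps)
        df_dz = 2.0 * z                  # derivative of f(z) = z^2
        grad_mu_sum += df_dz * 1.0       # chain rule: dz/dmu = 1
        grad_sigma_sum += df_dz * eps    # chain rule: dz/dsigma = eps
    return grad_mu_sum / n, grad_sigma_sum / n

rng = random.Random(0)
mu, sigma = 1.5, 0.8
grad_mu, grad_sigma = reparam_gradients(mu, sigma, 50_000, rng)
print(grad_mu, 2.0 * mu)        # estimate vs. exact gradient in mu
print(grad_sigma, 2.0 * sigma)  # estimate vs. exact gradient in sigma
```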

Worked Example: Computing Evidence

Consider a simple Bayesian model: $y \sim \mathcal{N}(\theta, 1)$ with prior $\theta \sim \mathcal{N}(0, \sigma_0^2)$. The evidence for observing $y = 3$:

$$\begin{aligned}
p(y = 3) &= \int_{-\infty}^{\infty} p(y = 3 \mid \theta) \, p(\theta) \, d\theta \\[6pt]
&= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(3-\theta)^2/2} \cdot \frac{1}{\sigma_0\sqrt{2\pi}} e^{-\theta^2/(2\sigma_0^2)} \, d\theta \\[6pt]
&= \frac{1}{2\pi\sigma_0} \int_{-\infty}^{\infty} \exp\left(-\frac{(3-\theta)^2}{2} - \frac{\theta^2}{2\sigma_0^2}\right) d\theta
\end{aligned}$$

This is a Gaussian integral. Completing the square in the exponent gives

$$-\frac{(3-\theta)^2}{2} - \frac{\theta^2}{2\sigma_0^2} = -\frac{1 + \sigma_0^2}{2\sigma_0^2}\left(\theta - \frac{3\sigma_0^2}{1 + \sigma_0^2}\right)^2 - \frac{9}{2(1 + \sigma_0^2)}$$

and the remaining integral over $\theta$ equals $\sigma_0\sqrt{2\pi/(1 + \sigma_0^2)}$, which yields:

$$p(y = 3) = \frac{1}{\sqrt{2\pi(1 + \sigma_0^2)}} \exp\left(-\frac{9}{2(1 + \sigma_0^2)}\right)$$

As a function of $y$, the evidence is a Gaussian density with mean $0$ and variance $1 + \sigma_0^2$: the observation noise and the prior variance add. This closed-form solution is the exception; for most models, the evidence integral is intractable.
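The closed form can be checked against direct numerical integration over $\theta$. This is a minimal sketch, assuming a prior variance $\sigma_0^2 = 4$ (an arbitrary illustration value) and a midpoint rule on a finite interval wide enough that the neglected tails are negligible.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def evidence_numeric(y, prior_var, half_width=30.0, n=40_000):
    """Midpoint-rule integral of p(y | theta) * p(theta) over theta."""
    dtheta = 2.0 * half_width / n
    total = 0.0
    for i in range(n):
        theta = -half_width + (i + 0.5) * dtheta
        likelihood = normal_pdf(y, theta, 1.0)       # y ~ N(theta, 1)
        prior = normal_pdf(theta, 0.0, prior_var)    # theta ~ N(0, sigma_0^2)
        total += likelihood * prior
    return total * dtheta

def evidence_closed_form(y, prior_var):
    # Marginal of y: Gaussian with mean 0 and variance 1 + sigma_0^2
    return normal_pdf(y, 0.0, 1.0 + prior_var)

y, prior_var = 3.0, 4.0
numeric = evidence_numeric(y, prior_var)
closed = evidence_closed_form(y, prior_var)
print(numeric, closed)
```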

Why This Matters for ML

Integration is the mathematical backbone of probabilistic ML:

  • Probability densities are normalized by integrals — without integration, we cannot define continuous distributions
  • Expectations (means, variances, losses) are all integrals over distributions
  • Marginalization integrates out latent variables — essential for Bayesian inference and mixture models
  • Monte Carlo methods estimate intractable integrals using random samples, scaling to arbitrary dimensions
  • The reparameterization trick allows backpropagation through stochastic sampling, enabling VAE training
  • Normalizing flows use the change-of-variables formula (Jacobian determinant) to define complex distributions

Summary

  • The definite integral computes accumulated area; the Fundamental Theorem links it to antiderivatives
  • Multiple integrals and the change-of-variables formula (Jacobian) generalize to higher dimensions
  • Expectation is an integral: $\mathbb{E}[h(X)] = \int h(x) f(x) \, dx$; nearly every ML quantity is an expectation
  • Marginalization integrates out unwanted variables but is often intractable in high dimensions
  • Monte Carlo integration estimates integrals via sampling at rate $O(1/\sqrt{N})$ regardless of dimension
  • Importance sampling reweights samples from an easy distribution to estimate expectations under a hard one
  • The reparameterization trick enables gradient-based optimization through stochastic sampling
  • Next: calculus of variations optimizes over entire functions, not just parameters

References

  • Stewart, J. (2015). Calculus: Early Transcendentals (8th ed.). Cengage Learning. Chapters 5-7, 15.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 1-2, 10.
  • Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapters 2-4. probml.github.io
  • Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ICLR. arXiv:1312.6114
  • Robert, C. P., & Casella, G. (2004). Monte Carlo Statistical Methods (2nd ed.). Springer.
