Probabilistic Graphical Models

Bayesian networks, Markov random fields, and how graphs encode conditional independence for tractable reasoning over complex distributions.

Probability & Statistics March 6, 2026 8 min read

The Curse of High Dimensions

A joint distribution over dd binary variables has 2d12^d - 1 free parameters. For d=30d = 30, that’s over a billion parameters — impossible to estimate from any reasonable dataset.

Probabilistic Graphical Models (PGMs) solve this by exploiting conditional independence structure. If variables only interact locally, the joint distribution factors into a product of smaller terms, each involving only a few variables.

PGMs use graphs to make this structure explicit and visual. The nodes represent random variables; the edges encode dependencies.

Two Families of Graphical Models

PropertyBayesian NetworksMarkov Random Fields
Graph typeDirected acyclic graph (DAG)Undirected graph
InterpretationCausal / generativeAssociative / symmetric
FactorizationConditional probabilitiesPotential functions
NormalizationAutomatic (local CPDs)Requires partition function ZZ
Common useCausal reasoning, generationImages, physics, social networks

Bayesian Networks

A Bayesian network is a directed acyclic graph where each node XiX_i has a conditional probability distribution given its parents Pa(Xi)\text{Pa}(X_i).

Factorization

The joint distribution factors as:

P(X1,X2,,Xd)=i=1dP(XiPa(Xi))P(X_1, X_2, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid \text{Pa}(X_i))

Each factor P(XiPa(Xi))P(X_i \mid \text{Pa}(X_i)) is a conditional probability table (CPT) for discrete variables, or a conditional density for continuous ones.

Example: Disease Diagnosis

Consider a simple medical model:

  • Smoking (SS) — binary
  • Pollution (PP) — binary
  • Cancer (CC) — depends on SS and PP
  • X-ray (XX) — depends on CC
  • Dyspnea (DD) — depends on CC

The joint factors as:

P(S,P,C,X,D)=P(S)P(P)P(CS,P)P(XC)P(DC)P(S, P, C, X, D) = P(S) \cdot P(P) \cdot P(C \mid S, P) \cdot P(X \mid C) \cdot P(D \mid C)

Without the graph structure, we’d need 251=312^5 - 1 = 31 parameters. With it, we need 1+1+4+2+2=101 + 1 + 4 + 2 + 2 = 10 — a significant reduction.

Conditional Independence in DAGs

The graph encodes conditional independence statements. Three fundamental structures:

Chain: ABCA \to B \to C

  • A ⁣ ⁣ ⁣CBA \perp\!\!\!\perp C \mid B — knowing BB blocks information flow from AA to CC

Fork: ABCA \leftarrow B \to C

  • A ⁣ ⁣ ⁣CBA \perp\!\!\!\perp C \mid BAA and CC are independent given their common cause BB

Collider (V-structure): ABCA \to B \leftarrow C

  • A ⁣ ⁣ ⁣CA \perp\!\!\!\perp C but A̸ ⁣ ⁣ ⁣ ⁣CBA \not\!\perp\!\!\!\perp C \mid B — conditioning on the common effect BB creates a dependency (explaining away)

Explaining away: If a patient has cancer (BB), knowing they smoke (AA) makes pollution (CC) a less likely explanation. Observing the effect creates competition between its causes.

D-Separation

D-separation is the graphical criterion that determines conditional independence in Bayesian networks.

A path between nodes AA and BB is blocked by a set of observed variables C\mathbf{C} if it contains:

  • A chain or fork node that is in C\mathbf{C} (observed intermediate blocks information)
  • A collider node that is NOT in C\mathbf{C} (and has no descendant in C\mathbf{C})

If all paths between AA and BB are blocked by C\mathbf{C}, then A ⁣ ⁣ ⁣BCA \perp\!\!\!\perp B \mid \mathbf{C}.

Naive Bayes as a Bayesian Network

The Naive Bayes classifier is a simple Bayesian network where the class variable CC is the parent of all features:

P(C,X1,,Xd)=P(C)i=1dP(XiC)P(C, X_1, \ldots, X_d) = P(C) \prod_{i=1}^{d} P(X_i \mid C)

The “naive” assumption is that features are conditionally independent given the class. This is rarely true, but the classifier works surprisingly well in practice.

Markov Random Fields (Undirected Models)

A Markov Random Field (MRF) uses an undirected graph. Instead of conditional probabilities, it uses potential functions (also called factors or clique potentials).

Factorization

P(X)=1ZcCψc(Xc)P(\mathbf{X}) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{X}_c)

where:

  • C\mathcal{C} is the set of cliques (fully connected subgraphs)
  • ψc(Xc)0\psi_c(\mathbf{X}_c) \geq 0 is the potential function for clique cc
  • Z=Xcψc(Xc)Z = \sum_{\mathbf{X}} \prod_c \psi_c(\mathbf{X}_c) is the partition function (normalizing constant)

The potentials encode compatibility — how “happy” a configuration is. Higher potential means more likely, but the potentials are not probabilities.

The Partition Function Problem

Computing ZZ requires summing over all possible configurations — exponential in the number of variables. This intractability is the central computational challenge of undirected models.

Conditional Independence in MRFs

The Markov property for undirected graphs is simpler than d-separation:

XA ⁣ ⁣ ⁣XBXCif C separates A from B in the graphX_A \perp\!\!\!\perp X_B \mid X_C \quad \text{if } C \text{ separates } A \text{ from } B \text{ in the graph}

If removing the nodes in CC disconnects AA from BB, they are conditionally independent given CC.

Ising Model

A classical MRF from statistical physics, used for binary variables on a grid:

P(x)=1Zexp ⁣(iαixi+(i,j)Eβijxixj)P(\mathbf{x}) = \frac{1}{Z} \exp\!\left(\sum_{i} \alpha_i x_i + \sum_{(i,j) \in E} \beta_{ij} x_i x_j\right)
  • Node potentials αi\alpha_i: bias toward +1+1 or 1-1
  • Edge potentials βij\beta_{ij}: preference for neighbors to agree (β>0\beta > 0) or disagree (β<0\beta < 0)

In ML: Image denoising (neighboring pixels tend to have similar values), social network modeling, and as a building block for Boltzmann machines.

Conditional Random Fields (CRFs)

CRFs are discriminative undirected models that model P(yx)P(\mathbf{y} \mid \mathbf{x}) directly:

P(yx)=1Z(x)exp ⁣(cwϕc(yc,x))P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left(\sum_{c} \mathbf{w}^\top \boldsymbol{\phi}_c(\mathbf{y}_c, \mathbf{x})\right)

Unlike generative models, CRFs don’t model the input distribution P(x)P(\mathbf{x}) — they focus entirely on the mapping from inputs to outputs.

In ML: Named entity recognition (NER), part-of-speech tagging, image segmentation, and any structured prediction task where outputs have dependencies.

Inference in Graphical Models

Given a model, we need to answer queries: What is P(Xievidence)P(X_i \mid \text{evidence})?

Exact Inference

Variable Elimination: Marginalize out variables one at a time, choosing an elimination order that minimizes intermediate factor sizes.

Belief Propagation (Sum-Product Algorithm): Pass messages between nodes on the graph. Exact on trees, approximate on graphs with cycles (loopy belief propagation).

For a tree-structured graph:

mij(xj)=xiψ(xi,xj)ψ(xi)kNe(i)jmki(xi)m_{i \to j}(x_j) = \sum_{x_i} \psi(x_i, x_j) \cdot \psi(x_i) \prod_{k \in \text{Ne}(i) \setminus j} m_{k \to i}(x_i)

After messages converge, the marginal at node ii is proportional to the product of all incoming messages times the node potential.

Approximate Inference

When exact inference is intractable:

  • Loopy Belief Propagation: Run message passing on graphs with cycles. Not guaranteed to converge, but often works well in practice
  • Variational Inference: Approximate the posterior with a simpler distribution (see Bayesian inference)
  • MCMC sampling: Gibbs sampling is natural for graphical models — each variable’s full conditional depends only on its Markov blanket

Learning Graphical Models

Parameter Learning (Known Structure)

Given the graph structure, learn the parameters from data:

  • Complete data: MLE or MAP for each conditional distribution independently
  • Missing data: EM algorithm — E-step infers missing values, M-step updates parameters

Structure Learning (Unknown Structure)

Learning the graph itself is much harder:

Score-based methods: Search over possible structures and score each one:

  • BIC/AIC scores balance fit and complexity
  • Bayesian structure learning computes posterior over graphs

Constraint-based methods: Use conditional independence tests to determine edges:

  • The PC algorithm tests Xi ⁣ ⁣ ⁣XjSX_i \perp\!\!\!\perp X_j \mid \mathbf{S} for increasing subsets S\mathbf{S}
  • Discovers the graph skeleton, then orients edges

In ML: Structure learning is used in causal discovery — inferring cause-effect relationships from observational data.

PGMs in Modern ML

Classical Applications

  • Medical diagnosis: Bayesian networks model disease-symptom relationships
  • Speech recognition: HMMs (a type of Bayesian network) for acoustic modeling
  • Computer vision: MRFs for image segmentation and denoising
  • NLP: CRFs for sequence labeling tasks

Connection to Deep Learning

Modern deep learning has largely replaced explicit PGMs for many tasks, but the ideas persist:

  • Variational Autoencoders are latent variable models with neural network parameterization — directed graphical models trained with variational inference
  • Attention mechanisms in transformers can be viewed through the lens of graphical model message passing
  • Graph Neural Networks generalize belief propagation to learned message functions
  • Diffusion models are hierarchical latent variable models with a specific graphical structure

The core insight of PGMs — that exploiting conditional independence makes high-dimensional inference tractable — remains as relevant as ever, even when the specific parameterization is a neural network.

Summary

  • Graphical models encode conditional independence structure using graphs
  • Bayesian networks (directed) factor into conditional distributions; factorization follows from the graph
  • MRFs (undirected) use potential functions; the partition function ZZ is the central challenge
  • D-separation determines conditional independence in DAGs; graph separation in MRFs
  • Exact inference (variable elimination, belief propagation) works on trees; approximate methods handle general graphs
  • CRFs are discriminative undirected models for structured prediction
  • Structure learning discovers causal relationships from data
  • PGM concepts (latent variables, message passing, conditional independence) underpin modern deep learning architectures

References

  • Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 8.
  • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. Chapters 10, 19-20.
  • Jordan, M. I. (2003). “An Introduction to Probabilistic Graphical Models.” Unpublished manuscript. UC Berkeley.
  • Wainwright, M. J., & Jordan, M. I. (2008). “Graphical Models, Exponential Families, and Variational Inference.” Foundations and Trends in Machine Learning, 1(1-2), 1—305.
  • Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
  • Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.” ICML 2001.

Keyboard Shortcuts

Navigation
j
Next heading
k
Previous heading
n
Next article in series
p
Previous article in series
t
Scroll to top
Actions
r
Toggle reading mode
Ctrl K
Search articles
?
Toggle this help
Esc
Close overlay