Probabilistic Graphical Models

Probability & Statistics Series 12 / 13

The Curse of High Dimensions

A joint distribution over $d$ binary variables has $2^d - 1$ free parameters. For $d = 30$ , that’s over a billion parameters — impossible to estimate from any reasonable dataset.

Probabilistic Graphical Models (PGMs) solve this by exploiting conditional independence structure. If variables only interact locally, the joint distribution factors into a product of smaller terms, each involving only a few variables.

PGMs use graphs to make this structure explicit and visual. The nodes represent random variables; the edges encode dependencies.

Two Families of Graphical Models

Property	Bayesian Networks	Markov Random Fields
Graph type	Directed acyclic graph (DAG)	Undirected graph
Interpretation	Causal / generative	Associative / symmetric
Factorization	Conditional probabilities	Potential functions
Normalization	Automatic (local CPDs)	Requires partition function $Z$
Common use	Causal reasoning, generation	Images, physics, social networks

Bayesian Networks

A Bayesian network is a directed acyclic graph where each node $X_i$ has a conditional probability distribution given its parents $\text{Pa}(X_i)$ .

Factorization

The joint distribution factors as:

P(X_1, X_2, \ldots, X_d) = \prod_{i=1}^{d} P(X_i \mid \text{Pa}(X_i))

Each factor $P(X_i \mid \text{Pa}(X_i))$ is a conditional probability table (CPT) for discrete variables, or a conditional density for continuous ones.

Example: Disease Diagnosis

Consider a simple medical model:

Smoking ( $S$ ) — binary
Pollution ( $P$ ) — binary
Cancer ( $C$ ) — depends on $S$ and $P$
X-ray ( $X$ ) — depends on $C$
Dyspnea ( $D$ ) — depends on $C$

The joint factors as:

P(S, P, C, X, D) = P(S) \cdot P(P) \cdot P(C \mid S, P) \cdot P(X \mid C) \cdot P(D \mid C)

Without the graph structure, we’d need $2^5 - 1 = 31$ parameters. With it, we need $1 + 1 + 4 + 2 + 2 = 10$ — a significant reduction.

Conditional Independence in DAGs

The graph encodes conditional independence statements. Three fundamental structures:

Chain: $A \to B \to C$

$A \perp\!\!\!\perp C \mid B$ — knowing $B$ blocks information flow from $A$ to $C$

Fork: $A \leftarrow B \to C$

$A \perp\!\!\!\perp C \mid B$ — $A$ and $C$ are independent given their common cause $B$

Collider (V-structure): $A \to B \leftarrow C$

$A \perp\!\!\!\perp C$ but $A \not\!\perp\!\!\!\perp C \mid B$ — conditioning on the common effect $B$ creates a dependency (explaining away)

Explaining away: If a patient has cancer ( $B$ ), knowing they smoke ( $A$ ) makes pollution ( $C$ ) a less likely explanation. Observing the effect creates competition between its causes.

D-Separation

D-separation is the graphical criterion that determines conditional independence in Bayesian networks.

A path between nodes $A$ and $B$ is blocked by a set of observed variables $\mathbf{C}$ if it contains:

A chain or fork node that is in $\mathbf{C}$ (observed intermediate blocks information)
A collider node that is NOT in $\mathbf{C}$ (and has no descendant in $\mathbf{C}$ )

If all paths between $A$ and $B$ are blocked by $\mathbf{C}$ , then $A \perp\!\!\!\perp B \mid \mathbf{C}$ .

Naive Bayes as a Bayesian Network

The Naive Bayes classifier is a simple Bayesian network where the class variable $C$ is the parent of all features:

P(C, X_1, \ldots, X_d) = P(C) \prod_{i=1}^{d} P(X_i \mid C)

The “naive” assumption is that features are conditionally independent given the class. This is rarely true, but the classifier works surprisingly well in practice.

Markov Random Fields (Undirected Models)

A Markov Random Field (MRF) uses an undirected graph. Instead of conditional probabilities, it uses potential functions (also called factors or clique potentials).

Factorization

P(\mathbf{X}) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(\mathbf{X}_c)

where:

$\mathcal{C}$ is the set of cliques (fully connected subgraphs)
$\psi_c(\mathbf{X}_c) \geq 0$ is the potential function for clique $c$
$Z = \sum_{\mathbf{X}} \prod_c \psi_c(\mathbf{X}_c)$ is the partition function (normalizing constant)

The potentials encode compatibility — how “happy” a configuration is. Higher potential means more likely, but the potentials are not probabilities.

The Partition Function Problem

Computing $Z$ requires summing over all possible configurations — exponential in the number of variables. This intractability is the central computational challenge of undirected models.

Conditional Independence in MRFs

The Markov property for undirected graphs is simpler than d-separation:

X_A \perp\!\!\!\perp X_B \mid X_C \quad \text{if } C \text{ separates } A \text{ from } B \text{ in the graph}

If removing the nodes in $C$ disconnects $A$ from $B$ , they are conditionally independent given $C$ .

Ising Model

A classical MRF from statistical physics, used for binary variables on a grid:

P(\mathbf{x}) = \frac{1}{Z} \exp\!\left(\sum_{i} \alpha_i x_i + \sum_{(i,j) \in E} \beta_{ij} x_i x_j\right)

Node potentials $\alpha_i$ : bias toward $+1$ or $-1$
Edge potentials $\beta_{ij}$ : preference for neighbors to agree ( $\beta > 0$ ) or disagree ( $\beta < 0$ )

In ML: Image denoising (neighboring pixels tend to have similar values), social network modeling, and as a building block for Boltzmann machines.

Conditional Random Fields (CRFs)

CRFs are discriminative undirected models that model $P(\mathbf{y} \mid \mathbf{x})$ directly:

P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\left(\sum_{c} \mathbf{w}^\top \boldsymbol{\phi}_c(\mathbf{y}_c, \mathbf{x})\right)

Unlike generative models, CRFs don’t model the input distribution $P(\mathbf{x})$ — they focus entirely on the mapping from inputs to outputs.

In ML: Named entity recognition (NER), part-of-speech tagging, image segmentation, and any structured prediction task where outputs have dependencies.

Inference in Graphical Models

Given a model, we need to answer queries: What is $P(X_i \mid \text{evidence})$ ?

Exact Inference

Variable Elimination: Marginalize out variables one at a time, choosing an elimination order that minimizes intermediate factor sizes.

Belief Propagation (Sum-Product Algorithm): Pass messages between nodes on the graph. Exact on trees, approximate on graphs with cycles (loopy belief propagation).

For a tree-structured graph:

m_{i \to j}(x_j) = \sum_{x_i} \psi(x_i, x_j) \cdot \psi(x_i) \prod_{k \in \text{Ne}(i) \setminus j} m_{k \to i}(x_i)

After messages converge, the marginal at node $i$ is proportional to the product of all incoming messages times the node potential.

Approximate Inference

When exact inference is intractable:

Loopy Belief Propagation: Run message passing on graphs with cycles. Not guaranteed to converge, but often works well in practice
Variational Inference: Approximate the posterior with a simpler distribution (see Bayesian inference)
MCMC sampling: Gibbs sampling is natural for graphical models — each variable’s full conditional depends only on its Markov blanket

Learning Graphical Models

Parameter Learning (Known Structure)

Given the graph structure, learn the parameters from data:

Complete data: MLE or MAP for each conditional distribution independently
Missing data: EM algorithm — E-step infers missing values, M-step updates parameters

Structure Learning (Unknown Structure)

Learning the graph itself is much harder:

Score-based methods: Search over possible structures and score each one:

BIC/AIC scores balance fit and complexity
Bayesian structure learning computes posterior over graphs

Constraint-based methods: Use conditional independence tests to determine edges:

The PC algorithm tests $X_i \perp\!\!\!\perp X_j \mid \mathbf{S}$ for increasing subsets $\mathbf{S}$
Discovers the graph skeleton, then orients edges

In ML: Structure learning is used in causal discovery — inferring cause-effect relationships from observational data.

PGMs in Modern ML

Classical Applications

Medical diagnosis: Bayesian networks model disease-symptom relationships
Speech recognition: HMMs (a type of Bayesian network) for acoustic modeling
Computer vision: MRFs for image segmentation and denoising
NLP: CRFs for sequence labeling tasks

Connection to Deep Learning

Modern deep learning has largely replaced explicit PGMs for many tasks, but the ideas persist:

Variational Autoencoders are latent variable models with neural network parameterization — directed graphical models trained with variational inference
Attention mechanisms in transformers can be viewed through the lens of graphical model message passing
Graph Neural Networks generalize belief propagation to learned message functions
Diffusion models are hierarchical latent variable models with a specific graphical structure

The core insight of PGMs — that exploiting conditional independence makes high-dimensional inference tractable — remains as relevant as ever, even when the specific parameterization is a neural network.

Summary

Graphical models encode conditional independence structure using graphs
Bayesian networks (directed) factor into conditional distributions; factorization follows from the graph
MRFs (undirected) use potential functions; the partition function $Z$ is the central challenge
D-separation determines conditional independence in DAGs; graph separation in MRFs
Exact inference (variable elimination, belief propagation) works on trees; approximate methods handle general graphs
CRFs are discriminative undirected models for structured prediction
Structure learning discovers causal relationships from data
PGM concepts (latent variables, message passing, conditional independence) underpin modern deep learning architectures

References

Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 8.
Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. Chapters 10, 19-20.
Jordan, M. I. (2003). “An Introduction to Probabilistic Graphical Models.” Unpublished manuscript. UC Berkeley.
Wainwright, M. J., & Jordan, M. I. (2008). “Graphical Models, Exponential Families, and Variational Inference.” Foundations and Trends in Machine Learning, 1(1-2), 1—305.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press.
Lafferty, J. D., McCallum, A., & Pereira, F. C. N. (2001). “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.” ICML 2001.

Probabilistic Graphical Models

The Curse of High Dimensions

Two Families of Graphical Models

Bayesian Networks

Factorization

Example: Disease Diagnosis

Conditional Independence in DAGs

D-Separation

Naive Bayes as a Bayesian Network

Markov Random Fields (Undirected Models)

Factorization

The Partition Function Problem

Conditional Independence in MRFs

Ising Model

Conditional Random Fields (CRFs)

Inference in Graphical Models

Exact Inference

Approximate Inference

Learning Graphical Models

Parameter Learning (Known Structure)

Structure Learning (Unknown Structure)

PGMs in Modern ML

Classical Applications

Connection to Deep Learning

Summary

References

Keyboard Shortcuts