We prove the necessary direction of the Fokker-Planck (FP) equation and derive several important consequences, including the continuity equation and the reverse-time diffusion process. We then examine a notable variant of the diffusion process, namely flow-matching processes with linear interpolation paths, and analyze the relationship between their score and velocity. Finally, we show how to apply the FP equation to derive an ODE-SDE conversion formula that links flow ODEs with diffusion SDEs.
To fully understand the proof of the FP equation, we first recall several results from probability, analysis, and linear algebra.
Recall that a Wiener process $W_t$ in $\mathbb{R}^d$ possesses the following property:
This naturally implies that, for any matrix $A \in \mathbb{R}^{d\times d}$ independent of $W_t$,
Here, $\epsilon \in \mathbb{R}^d$ is a random vector with independent coordinates, and the expectations are taken with respect to the randomness in $W_t$.
For $\epsilon \sim \mathcal{N}(0, h\mathbb{I}_{d\times d})$ and a matrix $A \in \mathbb{R}^{d\times d}$, the expected quadratic form of $\epsilon$ and $A$ equals $h$ times the trace of $A$: $\mathbb{E}[\epsilon^\top A \epsilon] = h \, \text{tr} A$.
Consequently, $\hat{A}_\epsilon = \frac{1}{h} \epsilon^\top A \epsilon$ is an unbiased estimator of $\text{tr} A$.
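As a quick numerical check, the following Python sketch (a minimal illustration; the matrix, step size, and sample count are arbitrary choices) verifies the unbiasedness of this estimator by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n_samples = 5, 0.01, 200_000

A = rng.standard_normal((d, d))                          # arbitrary fixed matrix
eps = rng.normal(0.0, np.sqrt(h), size=(n_samples, d))   # eps ~ N(0, h * I)

# Unbiased estimator (1/h) * eps^T A eps, averaged over samples.
estimates = np.einsum('ni,ij,nj->n', eps, A, eps) / h
print(np.trace(A), estimates.mean())                     # the two values should be close
```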
For a scalar function $u: \mathbb{R}^d \to \mathbb{R}$ (often interpreted physically as a potential field over $\mathbb{R}^d$), the Hessian matrix is defined by packing the second-order partial derivatives into a matrix:
The Laplacian of $u$ is defined as the sum of the second-order derivatives along each coordinate:
By definition, the trace of the Hessian matrix is the Laplacian.
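For instance, for the quadratic potential $u(x) = \frac{1}{2}\|x\|^2$, the Hessian is $\nabla^2 u(x) = \mathbb{I}_{d\times d}$, so $\Delta u(x) = \text{tr}\, \nabla^2 u(x) = d$ at every point.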
For a scalar function $u: \mathbb{R}^d \to \mathbb{R}$ and a white Gaussian vector $\epsilon \sim \mathcal{N}(0, h\mathbb{I}_{d\times d})$ (distributed as a Wiener increment over a time step $h$), the expected quadratic form of $\epsilon$ and the Hessian matrix $\nabla^2 u$, normalized by $h$, recovers the Laplacian of the potential field $u$.
The Laplacian of a probability density function $p$ appears frequently in the diffusion-model literature. By definition, we have
Furthermore, we relate the density gradient to the score function, $s(x) = \nabla \log p(x) = \frac{\nabla p(x)}{p(x)}$, which implies
The Laplacian of a probability density is the divergence of the product of the density and its score function:
To prove that two real-valued functions are point-wise equal, we can evaluate their inner products with the same test function. If the inner products agree for every test function, then the functions are equal. Rigorously speaking,
For arbitrary integrable functions $g_1, g_2: \mathbb{R}^d \to \mathbb{R}$ it holds that
Recall that a test function is an infinitely differentiable function that is non-zero only on a compact support. Strictly speaking, the Dirac delta is not a test function, but it can be approximated arbitrarily well by test functions, so we use it informally below. From RHS to LHS is then simple: pick (an approximation of) the Dirac delta centered at the point of interest. The transition from LHS to RHS is also straightforward.
First, test functions vanish at infinity, and in particular on the boundary of their compact support, by definition.
Second, for arbitrary test functions $f_1, f_2$, we can use integration by parts to derive
By using this together with the definition of the divergence and Laplacian, we get the identities:
Notice that in each equation, $x$ is understood as $(x_1,x_2,\ldots,x_d)$ and
The first identity follows by applying integration by parts along each coordinate $x_i$ in turn. The second identity follows by applying integration by parts twice.
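The following Python sketch (a minimal one-dimensional illustration, with an arbitrary smooth bump function standing in for a test function) verifies the integration-by-parts identity numerically:

```python
import numpy as np

# 1D sanity check of integration by parts with a compactly supported f:
#   integral of f * g'  should equal  -integral of f' * g,
# because f vanishes at the boundary of its support.
x = np.linspace(-1.0, 1.0, 20_001)
dx = x[1] - x[0]
f = np.exp(-1.0 / (1.0 - x**2 + 1e-300))   # smooth bump, vanishing at x = +/-1
g = np.sin(3.0 * x)                        # arbitrary smooth function

fp = np.gradient(f, x)                     # numerical f'
gp = np.gradient(g, x)                     # numerical g'

lhs = np.sum(f * gp) * dx
rhs = -np.sum(fp * g) * dx
print(lhs, rhs)                            # the two values should closely agree
```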
For a stochastic process determined by a stochastic differential equation (SDE),
the Fokker-Planck (FP) equation tells us how the marginal density of the process, $X_t \sim p_t(\cdot)$, evolves over time, expressing its time derivative in terms of the coefficients of the underlying SDE. The FP equation precisely characterizes the dynamics of the stochastic process, connecting the differential equation of evolution with the statistical properties of the system.
In the following sections, we detail the proof of the Fokker-Planck equation in its most general form. Afterwards, we use the FP equation to study two special types of stochastic processes: the first is deterministic, and the second is stochastic with a noise magnitude independent of the current state. These two cases are of particular interest in machine learning, as the former leads to flow-matching processes and the latter to diffusion models.
First, we provide a statement of the Fokker-Planck equation.
Let $p_t$ be a probability path and consider the Itô SDE
where $X_t \in \mathbb{R}^d$. Then $X_t$ has distribution $p_t$ for all $0 \leq t \leq 1$ if and only if the Fokker-Planck equation holds:
where we use the abbreviations $\mu_t(x) = \mu(x, t)$ and $\sigma_t^2(x) = \langle \sigma(x, t), \sigma(x, t) \rangle$.
We prove only the necessary direction, which is the nontrivial one.
This equation asserts point-wise equality between two real-valued functions. To establish it, we pick an arbitrary test function $f: \mathbb{R}^d \to \mathbb{R}$ and show that the two sides agree after forming inner products with $f$. That is, we want to show that
holds for all test functions. Then, for any point $x$, we pick $f(z) = \delta(z-x)$ and conclude that LHS $=$ RHS at that point.
Let us start from the LHS. We use the fact that $p_t(x)$ is a probability density and that the test function does not depend on time. Consequently, the LHS can be written as an expectation:
We use a second-order Taylor expansion to evaluate the numerator, since the limit eliminates all remaining terms of order $o(h)$. We then invoke the helper results derived in the preliminary section and take the limit to finish the proof. In what follows, we drop the higher-order terms for readability.
Now take the conditional expectation given $X_t$ on both sides. The second term vanishes because the Wiener increment is independent of $X_t$ and has zero mean. The third term on the RHS is simplified by the corollary derived earlier, which states that
With these observations, we arrive at
Finally, we marginalize the conditional expectation by integrating over the law of $X_t$ and apply the integration-by-parts formula for vector fields, obtaining
Since this result holds for any test function, at any point $x_0$ we can pick $f(x) = \delta(x-x_0)$ and obtain
thus the two functions are equal point-wise.
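As a sanity check of the theorem, the following Python sketch (assuming a simple Ornstein-Uhlenbeck process with drift $-x$ and diffusion $\sqrt{2}$, for which the FP equation has a closed-form Gaussian solution) simulates the SDE with Euler-Maruyama and compares the empirical variance against the FP prediction:

```python
import numpy as np

# Minimal sketch: Euler-Maruyama simulation of the OU process
#   dX_t = -X_t dt + sqrt(2) dW_t,  X_0 = 0,
# whose Fokker-Planck equation has the Gaussian solution
#   p_t = N(0, 1 - exp(-2t)).
rng = np.random.default_rng(0)
n_paths, n_steps, T = 50_000, 500, 1.0
dt = T / n_steps

x = np.zeros(n_paths)
for _ in range(n_steps):
    x += -x * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n_paths)

print(x.var(), 1 - np.exp(-2 * T))   # empirical vs. FP-predicted variance
```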
With the Fokker-Planck equation, we naturally obtain an ODE counterpart by setting the noise magnitude $\sigma_t(X_t)$ to zero. This is known as the continuity equation.
Let $p_t$ be a probability path and consider the ODE
where $X_t \in \mathbb{R}^d$. Then $X_t$ has distribution $p_t$ for all $0 \leq t \leq 1$ if and only if the continuity equation holds:
where we use the abbreviation $\mu_t(x) = \mu(x, t)$.
This relationship is very useful for deriving the log-probabilities of flow-matching processes.
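Concretely, combining the continuity equation with the chain rule along an ODE trajectory yields the instantaneous change-of-variables formula (a standard consequence, sketched here): since $\frac{\mathrm{d}}{\mathrm{d}t} \log p_t(X_t) = \partial_t \log p_t(X_t) + \mu_t(X_t) \cdot \nabla \log p_t(X_t)$ and the continuity equation gives $\partial_t p_t = -\nabla \cdot (p_t \mu_t)$, the two inner-product terms cancel and we are left with $\frac{\mathrm{d}}{\mathrm{d}t} \log p_t(X_t) = -\nabla \cdot \mu_t(X_t)$, so the log-density of a sample can be computed by integrating the negative divergence of the velocity field along its trajectory.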
A diffusion process is a continuous-time stochastic process $\{X_t\}_{t\geq 0}$ governed by the Itô stochastic differential equation (SDE):
where:
The key characteristic of a diffusion process is that the diffusion coefficient $g(t)$ depends only on time $t$ and not on the current state $X_t$, distinguishing it from the general Itô SDE where $\sigma(X_t, t)$ may depend on the state.
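A familiar example, stated here for concreteness, is the variance-preserving SDE used in score-based generative models, $\mathrm{d}X_t = -\frac{1}{2}\beta(t) X_t \,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}W_t$, whose noise schedule $\beta(t)$ depends only on time and therefore fits the definition above.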
By the Fokker-Planck equation, the marginal probability density of a forward diffusion process is given by
It turns out that the forward diffusion process is mathematically invertible. In detail, there exists a reverse-time diffusion process with density $\tilde{p}_s(x)$ such that $\tilde{p}_s(x) = p_{T-s}(x)$: as the reverse-time process evolves from $s=0$ to $s=T$, its marginals exactly retrace those of the forward process in reverse. Next, we find the SDE of the time-reversed process.
We denote by $s$ the increasing time index of the reverse process, and let $t$ be the increasing time index of the forward process. With a time endpoint $T$, these two time indices are related by $s + t = T$. Since $\mathrm{d}s = -\mathrm{d}t$, the evolution of the reversed density with respect to the reverse time $s$ is:
Substituting the forward Fokker-Planck equation into the above, we can express the evolution of the reverse-process density in terms of the forward diffusion coefficients:
Since the noise magnitude of a diffusion process is independent of the state, the Laplacian in the Fokker-Planck equation applies directly to the probability density. By the identity derived earlier, this naturally produces a score function; hence, in diffusion processes, the Laplacian is closely related to the score. This is why the score appears so often in diffusion model learning. Concretely, the Laplacian in the diffusion term can be rewritten as
Substituting this into the reversed density evolution and factoring out the divergence operator and the reverse density, we have
If the reverse process exists and is itself a diffusion process, it must be governed by an FP equation with state-independent noise. Let the density $\tilde{p}_s(x)$ of the hypothesized reverse-time SDE be determined by
Being governed by this SDE, it must satisfy its own Fokker-Planck equation:
If we assume that the diffusion coefficient remains the same, $\tilde{g}(s) = g(T-s)$, we can equate the two equations to determine the reverse drift $h(x, s)$.
By factoring the reverse FP equation using the identity $\frac{1}{2} \tilde{g}^2 \nabla^2 \tilde{p}_s = -\nabla \cdot (-\frac{1}{2} \tilde{g}^2 \tilde{p}_s \nabla \log \tilde{p}_s)$:
Equating the drift terms from this and the previous equation:
Solving for the reverse drift $h(x, s)$:
Substituting $h(\tilde{X}_s, s)$ back into the reverse SDE and replacing $\tilde{p}_s(\tilde{X}_s)$ with $p_{T-s}(\tilde{X}_s)$, we get the Reverse-Time Diffusion Process Equation:
The equation derived above is valid for a process $\tilde{p}_s$ running in a forward time index $s \in [0, T]$. However, the literature, particularly in generative modeling, often keeps the original time index $t \in [0, T]$ and interprets the process as running backward from $T$ to $0$.
Let us use the original time index $t$, but define the differential as a negative change $\mathrm{d}(-t)$. Let $x(t)$ denote the reverse process, where $t$ now runs from $T$ (start) to $0$ (end). The equation can be written as
and we remind the reader that $t$ now decreases from $T$ to $0$. Here, $(-\mathrm{d}t)$ indicates the backward direction and $\mathrm{d}\overline{W}_t$ is a backward Wiener process: as $t$ decreases, its increments are statistically equivalent to $\mathrm{d}W_s$ with $s$ increasing.
To fully express backward time, we abuse notation and write $-\mathrm{d}t$ as $\mathrm{d}t$, even though this violates the convention that an infinitesimal time step is positive. With this negative $\mathrm{d}t$, we arrive at the form seen in the diffusion model literature:
and we remind the reader again that this $\mathrm{d}t$ is negative under the machine learning convention.
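To make the reverse-time equation concrete, the following Python sketch (assuming the same Ornstein-Uhlenbeck forward process as before, for which the score is available in closed form) integrates the reverse SDE with Euler-Maruyama; in practice, diffusion models replace this analytic score with a learned network:

```python
import numpy as np

# Minimal sketch of reverse-time sampling for the forward OU diffusion
#   dX_t = -X_t dt + sqrt(2) dW_t,  X_0 = 0  =>  p_t = N(0, 1 - exp(-2t)),
# whose score is known in closed form: score(x, t) = -x / (1 - exp(-2t)).
rng = np.random.default_rng(0)
n_paths, n_steps, T, t_min = 50_000, 1_000, 1.0, 1e-3

def score(x, t):
    return -x / (1.0 - np.exp(-2.0 * t))

ts = np.linspace(T, t_min, n_steps + 1)                    # t decreases from T to t_min
x = rng.normal(0.0, np.sqrt(1 - np.exp(-2 * T)), n_paths)  # sample from p_T

for i in range(n_steps):
    t, dt = ts[i], ts[i + 1] - ts[i]                       # dt < 0: backward in time
    drift = -x - 2.0 * score(x, t)                         # f(x,t) - g^2(t) * score
    x += drift * dt + np.sqrt(2.0 * abs(dt)) * rng.standard_normal(n_paths)

print(x.var(), 1 - np.exp(-2 * t_min))  # empirical vs. true variance near t = 0
```

The empirical variance at $t_{\min}$ should approximately match the forward marginal's, confirming that the reverse SDE retraces the forward marginals; the residual gap is Euler-Maruyama discretization error.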