We prove the necessary direction of the Fokker-Planck (FP) equation and derive several important consequences, including the continuity equation and the reverse-time diffusion process. We then examine a notable variant of the diffusion process, namely flow-matching processes with linear interpolation paths, and analyze the relationship between their score and velocity. Finally, we show how to apply the FP equation to derive an ODE-SDE conversion formula that links flow ODEs with diffusion SDEs.
To fully understand the proof of the FP equation, we first recall several results from probability, analysis, and linear algebra.
Recall that a Wiener process $W_t$ in $\mathbb{R}^d$ possesses the following property:
This naturally implies that, for any matrix $A \in \mathbb{R}^{d\times d}$ independent of $W_t$,
Here, $\epsilon \in \mathbb{R}^d$ is a random vector with independent coordinates, and the expectations are taken with respect to the randomness in $W_t$.
For $\epsilon \sim \mathcal{N}(0, h\mathbb{I}_{d\times d})$ and a matrix $A \in \mathbb{R}^{d\times d}$, the expected quadratic form of $\epsilon$ and $A$ equals $h$ times the trace of $A$: $\mathbb{E}[\epsilon^\top A \epsilon] = h \, \text{tr} A$.
Consequently, $\hat{A}_\epsilon = \frac{1}{h} \epsilon^\top A \epsilon$ is an unbiased estimator of $\text{tr} A$.
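As a quick numerical check, the following Python sketch (a minimal illustration; the matrix, step size, and sample count are arbitrary choices) verifies the unbiasedness of this estimator by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, n_samples = 5, 0.01, 200_000

A = rng.standard_normal((d, d))                          # arbitrary fixed matrix
eps = rng.normal(0.0, np.sqrt(h), size=(n_samples, d))   # eps ~ N(0, h * I)

# Unbiased estimator (1/h) * eps^T A eps, averaged over samples.
estimates = np.einsum('ni,ij,nj->n', eps, A, eps) / h
print(np.trace(A), estimates.mean())                     # the two values should be close
```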
For a scalar function $u: \mathbb{R}^d \to \mathbb{R}$ (often interpreted physically as a potential field over $\mathbb{R}^d$), the Hessian matrix is defined by packing the second-order partial derivatives into a matrix:
The Laplacian of $u$ is defined as the sum of the second-order derivatives along each coordinate:
By definition, the trace of the Hessian matrix is the Laplacian.
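For instance, for the quadratic potential $u(x) = \frac{1}{2}\|x\|^2$, the Hessian is $\nabla^2 u(x) = \mathbb{I}_{d\times d}$, so $\Delta u(x) = \text{tr}\, \nabla^2 u(x) = d$ at every point.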
For a scalar function $u: \mathbb{R}^d \to \mathbb{R}$ and a white Gaussian vector $\epsilon \sim \mathcal{N}(0, h\mathbb{I}_{d\times d})$ (distributed as a Wiener increment over a time step $h$), the expected quadratic form of $\epsilon$ and the Hessian matrix $\nabla^2 u$, normalized by $h$, recovers the Laplacian of the potential field $u$.
The Laplacian of a probability density function $p$ appears frequently in the diffusion-model literature. By definition, we have
Furthermore, we relate the density gradient to the score function, $s(x) = \nabla \log p(x) = \frac{\nabla p(x)}{p(x)}$, which implies
The Laplacian of a probability density is the divergence of the product of the density and its score function:
To prove that two real-valued functions are point-wise equal, we can evaluate their inner products with the same test function. If the inner products agree for every test function, then the functions are equal. Rigorously speaking,
For arbitrary integrable functions $g_1, g_2: \mathbb{R}^d \to \mathbb{R}$ it holds that
Recall that a test function is an infinitely differentiable function that is non-zero only on a compact support. Strictly speaking, the Dirac delta is not a test function, but it can be approximated arbitrarily well by test functions, so we use it informally below. From RHS to LHS is then simple: pick (an approximation of) the Dirac delta centered at the point of interest. The transition from LHS to RHS is also straightforward.
First, test functions vanish at infinity, and in particular on the boundary of their compact support, by definition.
Second, for arbitrary test functions $f_1, f_2$, we can use integration by parts to derive
By using this together with the definition of the divergence and Laplacian, we get the identities:
Notice that in each equation, $x$ is understood as $(x_1,x_2,\ldots,x_d)$ and
The first identity follows by applying integration by parts along each coordinate $x_i$ in turn. The second identity follows by applying integration by parts twice.
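The following Python sketch (a minimal one-dimensional illustration, with an arbitrary smooth bump function standing in for a test function) verifies the integration-by-parts identity numerically:

```python
import numpy as np

# 1D sanity check of integration by parts with a compactly supported f:
#   integral of f * g'  should equal  -integral of f' * g,
# because f vanishes at the boundary of its support.
x = np.linspace(-1.0, 1.0, 20_001)
dx = x[1] - x[0]
f = np.exp(-1.0 / (1.0 - x**2 + 1e-300))   # smooth bump, vanishing at x = +/-1
g = np.sin(3.0 * x)                        # arbitrary smooth function

fp = np.gradient(f, x)                     # numerical f'
gp = np.gradient(g, x)                     # numerical g'

lhs = np.sum(f * gp) * dx
rhs = -np.sum(fp * g) * dx
print(lhs, rhs)                            # the two values should closely agree
```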
For a stochastic process determined by a stochastic differential equation (SDE),
the Fokker-Planck (FP) equation tells us how the marginal density of the process, $X_t \sim p_t(\cdot)$, evolves over time, expressing its time derivative in terms of the coefficients of the underlying SDE. The FP equation precisely characterizes the dynamics of the stochastic process, connecting the differential equation of evolution with the statistical properties of the system.
In the following sections, we detail the proof of the Fokker-Planck equation in its most general form. Afterwards, we use the FP equation to study two special types of stochastic processes: the first is deterministic, and the second is stochastic with a noise magnitude independent of the current state. These two cases are of particular interest in machine learning, as the former leads to flow-matching processes and the latter to diffusion models.
First, we provide a statement of the Fokker-Planck equation.
Let $p_t$ be a probability path and consider the Itô SDE
where $X_t \in \mathbb{R}^d$. Then $X_t$ has distribution $p_t$ for all $0 \leq t \leq 1$ if and only if the Fokker-Planck equation holds:
where we use the abbreviations $\mu_t(x) = \mu(x, t)$ and $\sigma_t^2(x) = \langle \sigma(x, t), \sigma(x, t) \rangle$.
We prove only the necessary direction, which is the nontrivial one.
This equation asserts point-wise equality between two real-valued functions. To establish it, we pick an arbitrary test function $f: \mathbb{R}^d \to \mathbb{R}$ and show that the two sides agree after forming inner products with $f$. That is, we want to show that
holds for all test functions. Then, for any point $x$, we pick $f(z) = \delta(z-x)$ and conclude that LHS $=$ RHS at that point.
Let us start from the LHS. We use the fact that $p_t(x)$ is a probability density and that the test function does not depend on time. Consequently, the LHS can be written as an expectation:
We use a second-order Taylor expansion to evaluate the numerator, since the limit eliminates all remaining terms of order $o(h)$. We then invoke the helper results derived in the preliminary section and take the limit to finish the proof. In what follows, we drop the higher-order terms for readability.
Now take the conditional expectation given $X_t$ on both sides. The second term vanishes because the Wiener increment is independent of $X_t$ and has zero mean. The third term on the RHS is simplified by the corollary derived earlier, which states that
With these observations, we arrive at
Finally, we marginalize the conditional expectation by integrating over the law of $X_t$ and apply the integration-by-parts formula for vector fields, obtaining
Since this result holds for any test function, at any point $x_0$ we can pick $f(x) = \delta(x-x_0)$ and obtain
thus the two functions are equal point-wise.
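As a sanity check of the theorem, the following Python sketch (assuming a simple Ornstein-Uhlenbeck process with drift $-x$ and diffusion $\sqrt{2}$, for which the FP equation has a closed-form Gaussian solution) simulates the SDE with Euler-Maruyama and compares the empirical variance against the FP prediction:

```python
import numpy as np

# Minimal sketch: Euler-Maruyama simulation of the OU process
#   dX_t = -X_t dt + sqrt(2) dW_t,  X_0 = 0,
# whose Fokker-Planck equation has the Gaussian solution
#   p_t = N(0, 1 - exp(-2t)).
rng = np.random.default_rng(0)
n_paths, n_steps, T = 50_000, 500, 1.0
dt = T / n_steps

x = np.zeros(n_paths)
for _ in range(n_steps):
    x += -x * dt + np.sqrt(2.0 * dt) * rng.standard_normal(n_paths)

print(x.var(), 1 - np.exp(-2 * T))   # empirical vs. FP-predicted variance
```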
With the Fokker-Planck equation, we naturally obtain an ODE counterpart by setting the noise magnitude $\sigma_t(X_t)$ to zero. This is known as the continuity equation.
Let $p_t$ be a probability path and consider the ODE
where $X_t \in \mathbb{R}^d$. Then $X_t$ has distribution $p_t$ for all $0 \leq t \leq 1$ if and only if the continuity equation holds:
where we use the abbreviation $\mu_t(x) = \mu(x, t)$.
This relationship is very useful for deriving the log-probabilities of flow-matching processes.
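Concretely, combining the continuity equation with the chain rule along an ODE trajectory yields the instantaneous change-of-variables formula (a standard consequence, sketched here): since $\frac{\mathrm{d}}{\mathrm{d}t} \log p_t(X_t) = \partial_t \log p_t(X_t) + \mu_t(X_t) \cdot \nabla \log p_t(X_t)$ and the continuity equation gives $\partial_t p_t = -\nabla \cdot (p_t \mu_t)$, the two inner-product terms cancel and we are left with $\frac{\mathrm{d}}{\mathrm{d}t} \log p_t(X_t) = -\nabla \cdot \mu_t(X_t)$, so the log-density of a sample can be computed by integrating the negative divergence of the velocity field along its trajectory.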
A diffusion process is a continuous-time stochastic process $\{X_t\}_{t\geq 0}$ governed by the Itô stochastic differential equation (SDE):
where:
The key characteristic of a diffusion process is that the diffusion coefficient $g(t)$ depends only on time $t$ and not on the current state $X_t$, distinguishing it from the general Itô SDE where $\sigma(X_t, t)$ may depend on the state.
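A familiar example, stated here for concreteness, is the variance-preserving SDE used in score-based generative models, $\mathrm{d}X_t = -\frac{1}{2}\beta(t) X_t \,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}W_t$, whose noise schedule $\beta(t)$ depends only on time and therefore fits the definition above.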
By the Fokker-Planck equation, the marginal probability density of a forward diffusion process is given by
It turns out that the forward diffusion process is mathematically invertible. In detail, there exists a reverse-time diffusion process with density $\tilde{p}_s(x)$ such that $\tilde{p}_s(x) = p_{T-s}(x)$: as the reverse-time process evolves from $s=0$ to $s=T$, its marginals exactly retrace those of the forward process in reverse. Next, we find the SDE of the time-reversed process.
We denote by $s$ the increasing time index of the reverse process, and let $t$ be the increasing time index of the forward process. With a time endpoint $T$, these two time indices are related by $s + t = T$. Since $\mathrm{d}s = -\mathrm{d}t$, the evolution of the reversed density with respect to the reverse time $s$ is:
Substituting the forward Fokker-Planck equation into the above, we can express the evolution of the reverse-process density in terms of the forward diffusion coefficients:
Since the noise magnitude of a diffusion process is independent of the state, the Laplacian in the Fokker-Planck equation applies directly to the probability density. By the identity derived earlier, this naturally produces a score function; hence, in diffusion processes, the Laplacian is closely related to the score. This is why the score appears so often in diffusion model learning. Concretely, the Laplacian in the diffusion term can be rewritten as
Substituting this into the reversed density evolution and factoring out the divergence operator and the reverse density, we have
If the reverse process exists and is itself a diffusion process, it must be governed by an FP equation with state-independent noise. Let the density $\tilde{p}_s(x)$ of the hypothesized reverse-time SDE be determined by
Being governed by this SDE, it must satisfy its own Fokker-Planck equation:
If we assume that the diffusion coefficient remains the same, $\tilde{g}(s) = g(T-s)$, we can equate the two equations to determine the reverse drift $h(x, s)$.
By factoring the reverse FP equation using the identity $\frac{1}{2} \tilde{g}^2 \nabla^2 \tilde{p}_s = -\nabla \cdot (-\frac{1}{2} \tilde{g}^2 \tilde{p}_s \nabla \log \tilde{p}_s)$:
Equating the drift terms from this and the previous equation:
Solving for the reverse drift $h(x, s)$:
Substituting $h(\tilde{X}_s, s)$ back into the reverse SDE and replacing $\tilde{p}_s(\tilde{X}_s)$ with $p_{T-s}(\tilde{X}_s)$, we get the Reverse-Time Diffusion Process Equation:
The equation derived above is valid for a process $\tilde{p}_s$ running in a forward time index $s \in [0, T]$. However, the literature, particularly in generative modeling, often keeps the original time index $t \in [0, T]$ and interprets the process as running backward from $T$ to $0$.
Let us use the original time index $t$, but define the differential as a negative change $\mathrm{d}(-t)$. Let $x(t)$ denote the reverse process, where $t$ now runs from $T$ (start) to $0$ (end). The equation can be written as
and we remind the reader that $t$ now decreases from $T$ to $0$. Here, $(-\mathrm{d}t)$ indicates the backward direction and $\mathrm{d}\overline{W}_t$ is a backward Wiener process: as $t$ decreases, its increments are statistically equivalent to $\mathrm{d}W_s$ with $s$ increasing.
To fully express backward time, we abuse notation and write $-\mathrm{d}t$ as $\mathrm{d}t$, even though this violates the convention that an infinitesimal time step is positive. With this negative $\mathrm{d}t$, we arrive at the form seen in the diffusion model literature:
and we remind the reader again that this $\mathrm{d}t$ is negative under the machine learning convention.
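To make the reverse-time equation concrete, the following Python sketch (assuming the same Ornstein-Uhlenbeck forward process as before, for which the score is available in closed form) integrates the reverse SDE with Euler-Maruyama; in practice, diffusion models replace this analytic score with a learned network:

```python
import numpy as np

# Minimal sketch of reverse-time sampling for the forward OU diffusion
#   dX_t = -X_t dt + sqrt(2) dW_t,  X_0 = 0  =>  p_t = N(0, 1 - exp(-2t)),
# whose score is known in closed form: score(x, t) = -x / (1 - exp(-2t)).
rng = np.random.default_rng(0)
n_paths, n_steps, T, t_min = 50_000, 1_000, 1.0, 1e-3

def score(x, t):
    return -x / (1.0 - np.exp(-2.0 * t))

ts = np.linspace(T, t_min, n_steps + 1)                    # t decreases from T to t_min
x = rng.normal(0.0, np.sqrt(1 - np.exp(-2 * T)), n_paths)  # sample from p_T

for i in range(n_steps):
    t, dt = ts[i], ts[i + 1] - ts[i]                       # dt < 0: backward in time
    drift = -x - 2.0 * score(x, t)                         # f(x,t) - g^2(t) * score
    x += drift * dt + np.sqrt(2.0 * abs(dt)) * rng.standard_normal(n_paths)

print(x.var(), 1 - np.exp(-2 * t_min))  # empirical vs. true variance near t = 0
```

The empirical variance at $t_{\min}$ should approximately match the forward marginal's, confirming that the reverse SDE retraces the forward marginals; the residual gap is Euler-Maruyama discretization error.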