Diffusion 101

Diffusion models are among the most widely used models for image / video generation. This note covers the basics of diffusion models.

Generation as Sampling

Unlike many other models, which generate the outcome in one forward pass, a diffusion model generates a single step toward the outcome, and it takes multiple steps to get there. Generating a step is also known as sampling a step, because a random variable affects the step direction.

In the context of image generation, a diffusion model typically starts with some random noise. It denoises the image step by step and reveals the meaningful image contents in the end. The model is called diffusion, but a better name would be reverse diffusion. Imagine we had this meaningful image in the beginning, and a naughty boy spilled the color palette onto the canvas. The noise diffused into the canvas, but the model is able to repair it by reversing the process step by step.

Differential Equations

Mathematically, the process can be formulated as a Stochastic Differential Equation (SDE):

\[\begin{gather} X_0 \sim P_{init} \\ dX_t = u_t(X_t)dt + \sigma_t dW_t \end{gather}\]

A few notes:

  • $t \in [0, 1]$ is the normalized timestamp.
  • We start at $X_0$ as random noise. We end at $X_1$ as the meaningful outcome.
  • The way we proceed from $t=0$ to $t=1$ is called simulation. We sample timestamps within the range and simulate $X_t$ throughout.
  • $u_t$ is called the drift coefficient, as it determines the main direction to proceed. It is also known as a vector field, as each vector in the field determines the direction to proceed at the next timestamp.
  • $dW_t$ defines a stochastic process that adds randomness to the system. The process is called a “Wiener process” or “Brownian motion”, which is defined as $W_{t+h} = W_t + \sqrt{h} \epsilon_t$, where $\epsilon_t \sim N(0, I)$ (see the short sketch after this list).
  • $\sigma_t$ is called the diffusion coefficient, as it determines the weight of the stochastic process in the system.
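To make the Wiener process concrete, here is a minimal sketch that simulates one path with the update rule above; the step size and function name are illustrative, not part of the original note.

import numpy as np

def wiener_path(steps=1000, h=1e-3, rng=None):
  # Simulate W_{t+h} = W_t + sqrt(h) * eps with eps ~ N(0, 1), starting from W_0 = 0.
  rng = rng or np.random.default_rng(0)
  w = np.zeros(steps + 1)
  for i in range(steps):
    w[i + 1] = w[i] + np.sqrt(h) * rng.standard_normal()
  return w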

When $\sigma_t$ is 0, the system reduces to an Ordinary Differential Equation (ODE):

\[\begin{gather} X_0 \sim P_{init} \\ dX_t = u_t(X_t)dt \end{gather}\]

In both systems, we train a model $u_t^\theta$ to approximate the vector field $u_t$, which guides the process from random noise to a meaningful outcome. When the diffusion term is included, we call it a “diffusion” model. Otherwise, we call it a “flow matching” model.

Conditional Probability Path

The way $X_0$ proceeds to $X_1$ forms a probability path. We can add constraints to the system so that the solution $u_t^\theta$ is tractable. Just like how physics defines gravity to drop an apple from the tree to the ground, so that we can revert the process and infer what the apple looked like back on the tree, we are free to define the “physics” of our system in a way that is easy for models to learn.

A commonly used probability path is the Gaussian probability path, which has the following properties:

\[\begin{gather} P(X_t|X_1) = N(\alpha_t X_1, \beta_t^2 I) \\ X_1 \sim P_{data} \\ X_0 \sim N(0, I) \end{gather}\]

, where $\alpha_t$ is a sequence that goes from 0 to 1 and $\beta_t$ is a sequence that goes from 1 to 0. These two sequences define the noise schedule of the diffusion process. Model developers may choose whichever works best for their use cases. A linear schedule could be a good starting point:

\[\begin{gather} \alpha_t = t \ \ , \ \ \beta_t = 1 - t \end{gather}\]

$P(X_t|X_1)$ is also known as a conditional probability path due to the condition on the data sample $X_1$. A quick way to generate $x_t$ given $x_1$ is as follows:

\[\begin{gather} \epsilon \sim N(0, I) \\ x_t = \alpha_t x_1 + \beta_t \epsilon \end{gather}\]
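As a minimal sketch, assuming NumPy and the linear schedule above (the function name mirrors the cond_prob_path helper used in the training loops later):

import numpy as np

def linear_schedule(t):
  # alpha_t = t goes from 0 to 1; beta_t = 1 - t goes from 1 to 0.
  return t, 1.0 - t

def cond_prob_path(t, x1, rng=None):
  # Sample x_t ~ N(alpha_t * x_1, beta_t^2 * I) as x_t = alpha_t * x_1 + beta_t * eps.
  rng = rng or np.random.default_rng()
  alpha_t, beta_t = linear_schedule(t)
  eps = rng.standard_normal(x1.shape)
  return alpha_t * x1 + beta_t * eps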

Conditional Vector Field

A conditional vector field $u_t(x|z)$ is a vector field whose simulation produces a given conditional probability path $p_t(x|z)$. ODEs and SDEs start to differ slightly here. For the ODE, it can be proved (but not in this article) that the following conditional vector field $u_t(x|z)$ fulfills such a conditional probability path:

\[\begin{gather} u_t^{ODE}(x_t|x_1) = (\dot\alpha_t - \frac{\dot\beta_t}{\beta_t}\alpha_t) x_1 + \frac{\dot\beta_t}{\beta_t} x_t \\ \dot\alpha_t = \frac{d}{dt}\alpha_t \ \ , \ \ \dot\beta_t = \frac{d}{dt}\beta_t \end{gather}\]
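For the linear schedule, $\dot\alpha_t = 1$ and $\dot\beta_t = -1$, and the expression collapses to a simple form. A sketch under that assumption (the name mirrors the cond_vec_field_ode helper used in the training loop later):

def cond_vec_field_ode(t, x1, xt):
  # Linear schedule: u_t^ODE(x_t|x_1) = (1 + t/(1-t)) * x_1 - x_t/(1-t) = (x_1 - x_t) / (1 - t).
  # Keep t strictly below 1 to avoid division by zero.
  return (x1 - xt) / (1.0 - t)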

For SDE, it has to include an additional score term:

\[\begin{gather} u_t^{SDE}(x_t|x_1) = u_t^{ODE}(x_t|x_1) + \frac{\sigma_t^2}{2} s_t(x_t|x_1) \\ s_t(x_t|x_1) = \nabla \log p_t(x_t|x_1) \end{gather}\]

$s_t$ is called a conditional score function, after the score function of classical statistics. For a Gaussian probability path, it has an analytical form:

\[\begin{gather} s_t(x_t|x_1) = - \frac{x_t - \alpha_t x_1}{\beta_t^2} \end{gather}\]
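A sketch of this analytical score, again assuming the linear schedule (the name mirrors the cond_score helper used in the score-matching loop later):

def cond_score(t, x1, xt):
  # s_t(x_t|x_1) = -(x_t - alpha_t * x_1) / beta_t^2 with alpha_t = t, beta_t = 1 - t.
  # Keep t strictly below 1 so that beta_t > 0.
  alpha_t, beta_t = t, 1.0 - t
  return -(xt - alpha_t * x1) / beta_t**2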

Marginal Probability Path

However, a conditional probability path is not enough, because we have no $X_1$ to condition on at inference time: diffusion models must navigate from random noise to a meaningful outcome without ever seeing it. Instead, we need a marginal probability path that guides any $X_t$ toward the data distribution without access to a specific $X_1$:

\[\begin{gather} P(X_t) = \int P(X_t | z) \ p_{data}(z) \ dz \end{gather}\]

It can be proved (but not in this article) that the following marginal vector field $u_t(x)$ can fulfill such marginal probability path:

\[\begin{gather} u_t(x_t) = \int u_t (x_t | z) \ \frac{p_t(x_t|z) \ p_{data}(z)}{p_t(x_t)} \ dz \end{gather}\]

The integral is intractable due to the complexity of the data distribution. However, a model $u_t^\theta$ can be trained to approximate this marginal vector field.

Training Objective

An intuitive loss can be defined as the L2 loss between $u_t^\theta$ and $u_t$.

\[\begin{gather} L(\theta) = E_{t \sim unif[0, 1],\ x \sim p_t} || u_t^\theta(x) - u_t(x) ||^2 \end{gather}\]

This loss is intractable, but we can find an upper bound that is tractable:

\[\begin{gather} \tilde{L}(\theta) = E_{t \sim unif[0, 1],\ z \sim p_{data} ,\ x \sim p_{t|z}} || u_t^\theta(x) - u_t(x|z) ||^2 \end{gather}\]

It can be proved (but not in this article) that

\[\begin{gather} \tilde{L}(\theta) = L(\theta) + C \ \ , \ \ C > 0 \end{gather}\]

, where $C$ does not depend on $\theta$, so minimizing $\tilde{L}(\theta)$ also minimizes $L(\theta)$.

For ODE, we can just plug in $u_t^{ODE}$ for training. This algorithm is also known as “flow matching”.

for z in dataset:                       # z is a data sample x1
  t = uniform(0, 1)                     # sample a timestamp
  xt = cond_prob_path(t, z)             # x_t = alpha_t * z + beta_t * eps
  u_gt = cond_vec_field_ode(t, z, xt)   # regression target u_t^ODE(x_t|z)
  u_pd = model(t, xt)                   # model prediction u_t^theta(x_t)
  loss = l2(u_pd, u_gt)
  loss.backward()
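For completeness, here is a minimal runnable version of the same loop in PyTorch, assuming a toy 2-D data distribution and the linear schedule; the architecture and hyperparameters are illustrative choices, not a recipe.

import torch
import torch.nn as nn

class VectorFieldMLP(nn.Module):
  # A small MLP u_t^theta(x) that takes (t, x) and predicts a vector with the shape of x.
  def __init__(self, dim=2, hidden=128):
    super().__init__()
    self.net = nn.Sequential(
      nn.Linear(dim + 1, hidden), nn.SiLU(),
      nn.Linear(hidden, hidden), nn.SiLU(),
      nn.Linear(hidden, dim),
    )

  def forward(self, t, x):
    return self.net(torch.cat([x, t], dim=-1))

model = VectorFieldMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):
  x1 = torch.randn(256, 2) + 2.0               # toy "data": a Gaussian blob centered at (2, 2)
  t = torch.rand(256, 1) * 0.999               # keep t < 1 so that beta_t > 0
  eps = torch.randn_like(x1)
  xt = t * x1 + (1 - t) * eps                  # conditional path: x_t = alpha_t x_1 + beta_t eps
  u_gt = (x1 - xt) / (1 - t)                   # conditional vector field for the linear schedule
  loss = ((model(t, xt) - u_gt) ** 2).mean()   # flow matching loss
  opt.zero_grad()
  loss.backward()
  opt.step()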

For SDE, we may encounter numerical issues, because the score term matters but is typically much smaller in magnitude than the ODE term.

\[\begin{gather} \frac{\sigma_t^2}{2} s_t(x_t|x_1) \ \ll \ u_t^{ODE}(x_t|x_1) \end{gather}\]

To avoid this, we can rewrite $u_t^{SDE}$ entirely in terms of $s_t$:

\[\begin{gather} u_t^{SDE}(x_t|x_1) = (\frac{\dot\alpha_t}{\alpha_t} \beta_t^2 - \dot\beta_t \beta_t + \frac{\sigma_t^2}{2}) s_t(x_t|x_1) + \frac{\dot\alpha_t}{\alpha_t} x_t \end{gather}\]
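To make the two coefficients concrete, here is a sketch assuming the linear schedule ($\dot\alpha_t = 1$, $\dot\beta_t = -1$); the names mirror the helpers used in the SDE sampler later, with the score weight taking $\sigma_t$ as an extra argument since the formula depends on it.

def get_score_weight(alpha_t, beta_t, sigma_t, dot_alpha_t=1.0, dot_beta_t=-1.0):
  # Coefficient on s_t: (alpha'_t / alpha_t) * beta_t^2 - beta'_t * beta_t + sigma_t^2 / 2.
  # Note alpha_t = 0 at t = 0 for the linear schedule, so simulation should start at t > 0.
  return (dot_alpha_t / alpha_t) * beta_t**2 - dot_beta_t * beta_t + sigma_t**2 / 2

def get_base_weight(alpha_t, dot_alpha_t=1.0):
  # Coefficient on x_t: alpha'_t / alpha_t.
  return dot_alpha_t / alpha_t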

Instead of training a model to approximate $u_t$, we train $s_t^\theta$ to approximate $s_t$:

\[\begin{gather} \tilde{L}^{SDE}(\theta) = E_{t \sim unif[0, 1],\ z \sim p_{data} ,\ x \sim p_{t|z}} || s_t^\theta(x) - s_t(x|z) ||^2 \end{gather}\]

As $s_t(x|z)$ has an analytical form, we can just plug it in for training. This algorithm is also known as “score matching”.

for z in dataset:                       # z is a data sample x1
  t = uniform(0, 1)                     # sample a timestamp
  xt = cond_prob_path(t, z)             # x_t = alpha_t * z + beta_t * eps
  s_gt = cond_score(t, z, xt)           # analytical target s_t(x_t|z)
  s_pd = model(t, xt)                   # model prediction s_t^theta(x_t)
  loss = l2(s_pd, s_gt)
  loss.backward()

Simulation

Once the model is trained, we can sample it multiple times to generate meaningful contents. This process is called “simulation” in the context of differential equations. There is a whole literature on different methods; a commonly used one is the Euler method.

For ODE, the simulation is straightforward:

xt = gauss(0, 1, shape)        # start from random noise x_0 ~ N(0, I)
for t in arange(0, 1, dt):
  ut = model(t, xt)            # predicted vector field u_t^theta(x_t)
  xt += ut * dt                # Euler step: x_{t+dt} = x_t + u_t * dt
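A runnable version of this Euler sampler for the PyTorch sketch from the training section (the step count is an arbitrary illustration):

@torch.no_grad()
def sample_ode(model, n=256, dim=2, steps=100):
  # Euler method: start from noise at t = 0 and integrate the learned vector field to t = 1.
  xt = torch.randn(n, dim)
  dt = 1.0 / steps
  for i in range(steps):
    t = torch.full((n, 1), i * dt)
    xt = xt + model(t, xt) * dt
  return xt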

For SDE, we need to plug in the score term and handle the diffusion term properly:

xt = gauss(0, 1, shape)                            # start from random noise x_0 ~ N(0, I)
for t in arange(dt, 1, dt):                        # start at t = dt > 0 so that alpha_t > 0
  alpha_t, beta_t, sigma_t = get_noise_schedule(t)
  ws = get_score_weight(alpha_t, beta_t, sigma_t)  # coefficient on the score term
  wx = get_base_weight(alpha_t)                    # coefficient on x_t
  st = model(t, xt)                                # predicted score s_t^theta(x_t)
  ut = ws * st + wx * xt                           # reconstruct u_t^SDE from the score
  dw = sqrt(dt) * gauss(0, 1, shape)               # Wiener increment
  xt = xt + ut * dt + sigma_t * dw                 # Euler-Maruyama step

Guidance

In practice, we also add guidance $y$ (e.g. a label, text, or an image) as a condition to the model, so as to generate the desired contents. A commonly used technique is called classifier-free guidance (CFG). For ODE, the model becomes $u_t^\theta(x|y)$, and the loss function is extended to include a conditional dropout with rate $\eta$.

\[\begin{gather} L^{ODE}_{CFG}(\theta) = E_{...} || u_t^\theta(x|y) - u_t(x|z) ||^2 \end{gather}\]

, where “…” includes the following:

\[\begin{gather} t \sim unif[0, 1] \\ z \sim p_{data} \\ x \sim p_{t|z} \\ y \sim Bern_{1-\eta,\ \eta}\{c, \varnothing\} \end{gather}\]
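The Bernoulli term simply means dropping the condition with probability $\eta$ during training; a minimal sketch, where representing the null token $\varnothing$ as None is an illustrative choice:

import random

def drop_condition(y, eta=0.1):
  # With probability eta, replace the condition with the null token (here: None),
  # so the same model learns both the guided and the unguided vector field.
  return None if random.random() < eta else y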

Upon inference, the vector field $u_t$ is extended to include both the unguided and the guided term with a guidance scale $w > 1$:

\[\begin{gather} \tilde{u_t}(x|c) = (1-w)\ u_t^\theta(x|\varnothing)\ + \ w\ u_t^\theta(x|c) \end{gather}\]
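At inference time, both predictions come from the same model; a sketch, where the cond keyword argument and the scale value are illustrative assumptions:

def cfg_vector_field(model, t, xt, cond, w=3.0):
  # Classifier-free guidance: blend the unguided and guided predictions with scale w > 1.
  u_uncond = model(t, xt, cond=None)
  u_cond = model(t, xt, cond=cond)
  return (1 - w) * u_uncond + w * u_cond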

For SDE, the model becomes $s_t^\theta(x|y)$, and the loss function is extended as follows:

\[\begin{gather} L^{SDE}_{CFG}(\theta) = E_{...} || s_t^\theta(x|y) - s_t(x|z) ||^2 \end{gather}\]

, where “…” is the same as that in ODE. Upon inference, the score $s_t$ is extended to include both the unguided and the guided term with a guidance scale $w > 1$:

\[\begin{gather} \tilde{s_t}(x|c) = (1-w)\ s_t^\theta(x|\varnothing)\ + \ w\ s_t^\theta(x|c) \end{gather}\]

Acknowledgement

This note is heavily inspired by the MIT Class 6.S184.

Written on November 27, 2025