Basics of Diffusion

Fundamentals

Theorem

For any distribution over $(X,Y)$ , we have

$\text{argmin}_{f} \space \mathbb{E} ||f(X)-Y||^2=\mathbb{E}[Y|X]$

Proof

$\begin{aligned} &\mathop{\operatorname{argmin}\,}_{f} \mathbb{E} \lVert f(X)-Y \rVert ^2 \\ =&\mathop{\operatorname{argmin}\,}_{f} \mathbb{E}_{X} \mathbb{E}_{Y|X} \left[ \lVert f(x)-Y \rVert ^2 \mid X=x \right]\\ =&\mathop{\operatorname{argmin}\,}_{f} \mathbb{E}_{X} \left[ \lVert f(x) \rVert ^2 - 2\langle f(x), \mathbb{E}[Y|X=x] \rangle + \mathbb{E}[\lVert Y \rVert ^2|X=x] \right] \end{aligned}$

The outer expectation is minimized by minimizing the bracket separately for each $x$ . Differentiating with respect to $f(x)$ and setting the result to zero gives $2f(x)-2\mathbb{E}[Y|X=x] = 0 \Leftrightarrow f(x)=\mathbb{E}[Y|X=x]$
Hence $f(x)=\mathbb{E}[Y|X=x]$ .
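As a quick numerical sanity check of the theorem, the sketch below uses a small hand-made joint distribution over $(X,Y)$ and confirms that shifting the conditional mean can only increase the squared error. The table `joint` and the helper names are illustrative assumptions, not part of the original notes.

```python
# Toy joint distribution p(x, y) on a small grid (made-up numbers that sum to 1).
joint = {
    (0, 1.0): 0.10, (0, 3.0): 0.30,
    (1, 2.0): 0.25, (1, 6.0): 0.35,
}

def conditional_mean(joint):
    """Return {x: E[Y | X = x]} for a discrete joint distribution."""
    mass, weighted = {}, {}
    for (x, y), p in joint.items():
        mass[x] = mass.get(x, 0.0) + p
        weighted[x] = weighted.get(x, 0.0) + p * y
    return {x: weighted[x] / mass[x] for x in mass}

def mse(f, joint):
    """Expected squared error E[(f(X) - Y)^2] for a lookup-table predictor f."""
    return sum(p * (f[x] - y) ** 2 for (x, y), p in joint.items())

best = conditional_mean(joint)
best_mse = mse(best, joint)
# Shift the predictor away from the conditional mean and record the resulting errors.
perturbed = [mse({x: v + dx for x, v in best.items()}, joint)
             for dx in (-0.5, -0.1, 0.1, 0.5)]
```

Every entry of `perturbed` should exceed `best_mse`, because the cross term vanishes at the conditional mean and only the squared shift remains.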

Gradient Formula of Gaussian Distribution

$\begin{aligned} &y\sim \mathcal{N}(x,\sigma^{2}I) \text{ i.e. } p(y|x)=\dfrac{1}{(2\pi \sigma^{2})^{d/2}}\exp \left( -\dfrac{1}{2\sigma^{2}} \left\| y-x \right\| ^{2} \right) \\ \implies &\nabla_{y}p(y|x)=-\dfrac{1}{\sigma^{2}}(y-x) p(y|x) \end{aligned}$

Proof

$\nabla_{y} p(y)=\int \nabla_{y} p(y|x) p(x) \,\mathrm{d}x$

$\begin{aligned} \nabla_{y} p(y|x) &= p(y|x) \cdot \nabla_{y} \log p(y|x)\\ &=p(y|x) \cdot \nabla_{y} \left( -\dfrac{d}{2} \log(2\pi \sigma^{2})-\dfrac{1}{2\sigma^{2}}\left\| y-x \right\| ^{2} \right)\\ &=p(y|x) \cdot \nabla_{y}\left( -\dfrac{1}{2\sigma^{2}}\left\| y-x \right\| ^{2} \right)\\ &=p(y|x) \cdot \left( -\dfrac{1}{\sigma^{2}}(y-x) \right) \end{aligned}$
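The gradient formula can be checked against finite differences. The following is a minimal sketch in plain Python; the test point, the value of `sigma`, and all function names are assumptions chosen for illustration.

```python
import math

def gaussian_pdf(y, x, sigma):
    """Density of N(x, sigma^2 I) at y; y and x are equal-length lists."""
    d = len(y)
    sq = sum((yi - xi) ** 2 for yi, xi in zip(y, x))
    return math.exp(-sq / (2 * sigma**2)) / (2 * math.pi * sigma**2) ** (d / 2)

def grad_formula(y, x, sigma):
    """Closed form: -(y - x) / sigma^2 * p(y | x)."""
    p = gaussian_pdf(y, x, sigma)
    return [-(yi - xi) / sigma**2 * p for yi, xi in zip(y, x)]

def grad_numeric(y, x, sigma, h=1e-5):
    """Central finite differences of p(y | x) in each coordinate of y."""
    grad = []
    for i in range(len(y)):
        up = y[:i] + [y[i] + h] + y[i + 1:]
        down = y[:i] + [y[i] - h] + y[i + 1:]
        grad.append((gaussian_pdf(up, x, sigma) - gaussian_pdf(down, x, sigma)) / (2 * h))
    return grad

y, x, sigma = [0.3, -1.2], [1.0, 0.5], 0.8
analytic = grad_formula(y, x, sigma)
numeric = grad_numeric(y, x, sigma)
```

The two gradients should agree up to the finite-difference error.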

Tweedie’s Formula

If $Y \mid X=x \sim \mathcal{N}(x, \sigma^{2}I)$ , then

$\mathbb{E}[X|Y=y]=y+\sigma^2 \nabla _y \log p(y)$

where

$p(y)$ : the marginal density of $Y$ (the distribution of the observed $Y$ )
$\nabla_{y} \log p(y)$ : the gradient of the log-density of $Y$ with respect to $y$ , a.k.a. the score function

Proof

$p(x|y)=\dfrac{p(y|x)p(x)}{p(y)}$
Since $Y|X=x \sim \mathcal{N}(x, \sigma^2I)$ ,

$p(y|x)=\dfrac{1}{(2\pi \sigma^2)^{d/2}} \exp \left(-\dfrac{1}{2\sigma^{2}}\left\| y-x \right\| ^{2}\right)$

$\begin{aligned} \mathbb{E}[X|Y=y]&=\int x p(x|y)\,\mathrm{d}x\\ &=\dfrac{1}{p(y)}\int x p(y|x) p(x) \,\mathrm{d}x \end{aligned}$

$\nabla_{y} p(y)=\int \nabla_{y} p(y|x) p(x) \,\mathrm{d}x$

By the Gaussian gradient formula,

$\nabla_{y} p(y|x)=-\dfrac{1}{\sigma^{2}} (y-x) p(y|x)$

Substituting, we get

$\begin{aligned} \nabla_{y} p(y)&=\int(-\dfrac{1}{\sigma^{2}}(y-x)p(y|x))p(x)\,\mathrm{d}x\\ &=-\dfrac{1}{\sigma^{2}}\left( y\int p(y|x)p(x)\,\mathrm{d}x-\int xp(y|x)p(x)\,\mathrm{d}x \right) \\ &=-\dfrac{1}{\sigma^{2}}\left( yp(y)-\int xp(y|x)p(x)\,\mathrm{d}x \right) \end{aligned}$

Rearranging, we get

$yp(y)+\sigma^{2}\nabla_{y}p(y)=\int xp(y|x)p(x)\,\mathrm{d}x$

For $\mathbb{E}[X|Y=y]$ , dividing by $p(y)$ and using $\nabla_{y}p(y)/p(y)=\nabla_{y}\log p(y)$ , we have

$\mathbb{E}[X|Y=y]=\int x\dfrac{p(y|x)p(x)}{p(y)}\,\mathrm{d}x=\dfrac{1}{p(y)}\int xp(y|x)p(x)\,\mathrm{d}x=y+\sigma^{2}\nabla_{y}\log p(y)$
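Tweedie's formula is easy to verify numerically in one dimension. The sketch below assumes a two-point prior for $X$ (the values in `atoms` and `weights` are made up), computes $\mathbb{E}[X|Y=y]$ directly from Bayes' rule, and compares it with $y+\sigma^{2}\nabla_{y}\log p(y)$ where the score is taken by central differences.

```python
import math

def normal_pdf(y, mean, sigma):
    return math.exp(-((y - mean) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

# Assumed toy prior: X takes the values in `atoms` with probabilities `weights`.
atoms, weights, sigma = [-1.0, 2.0], [0.3, 0.7], 0.6

def marginal(y):
    """p(y) = sum_i w_i N(y; x_i, sigma^2)."""
    return sum(w * normal_pdf(y, x, sigma) for x, w in zip(atoms, weights))

def posterior_mean(y):
    """E[X | Y = y] computed directly from Bayes' rule."""
    return sum(x * w * normal_pdf(y, x, sigma) for x, w in zip(atoms, weights)) / marginal(y)

def tweedie(y, h=1e-5):
    """y + sigma^2 * d/dy log p(y), with the score from central differences."""
    score = (math.log(marginal(y + h)) - math.log(marginal(y - h))) / (2 * h)
    return y + sigma**2 * score

y = 0.4
direct, via_score = posterior_mean(y), tweedie(y)
```

Both routes should give the same posterior mean up to the finite-difference error, which illustrates why knowing the score of the noisy marginal is enough to denoise.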

VAE

VAE: Variational Auto-Encoder

Latent Variables

Latent variables $z$ are variables that we do not observe and hence are not part of the training dataset.

Encoder: convert from input $x$ to latent variables $z$ .

Decoder: convert from $z$ to generated vector $\hat{x}$

Variational: refers to the calculus of variations, i.e. optimization over functions

VAE: search for the optimal probability distributions to describe $x$ and $z$ .

Key Distributions

$p(x)$ : The true distribution of $x$ . THE ULTIMATE GOAL of diffusion is to draw a sample from $p(x)$ .
$p(z)$ : The distribution of the latent variable. Typically it is chosen to be $\mathcal{N}(0,I)$ , since any distribution can be generated by mapping a Gaussian through a sufficiently complicated function.
$p(z|x)$ : The conditional distribution associated with the encoder, i.e. the posterior of $z$ given $x$ .
$p(x|z)$ : The conditional distribution associated with the decoder, i.e. the likelihood of $x$ given $z$ .
$q_{\phi}(z|x)$ : The proxy for $p(z|x)$ that can be parameterized using deep neural networks, e.g.
$(\mu,\sigma^{2})=\text{EncoderNetwork}_{\phi}(x), q_{\phi}(z|x) =\mathcal{N}(z|\mu,\text{diag}(\sigma^{2}))$
$p_{\theta}(x|z)$ : The proxy for $p(x|z)$

![[Pasted image 20250503161111.png]]

ELBO

ELBO: Evidence Lower Bound

$\text{ELBO}(x)\stackrel{\text{def}}{=} \mathbb{E}_{q_{\phi}(z|x)}\left[ \log \dfrac{p(x,z)}{q_{\phi}(z|x)} \right]$

KL-divergence

$\mathbb{D}_{\text{KL}}(P\|Q)=\mathop{\mathbb{E}}_{x \sim P}\left[ \ln \dfrac{p(x)}{q(x)} \right]$

Decomposition of Log-Likelihood

$\log p(x)=\mathbb{E}_{q_{\phi}(z|x)}\left[ \log \dfrac{p(x,z)}{q_{\phi}(z|x)} \right]+\mathbb{D}_{\text{KL}}(q_{\phi}(z|x)\|p(z|x))$

Proof

$\begin{aligned} \log p(x)&=\log p(x)\times \underbrace{ \int q_{\phi} (z|x)\,\mathrm{d}z }_{ 1 }\\ &=\int \log p(x) \times q_{\phi}(z|x) \,\mathrm{d}z\\ &=\mathbb{E}_{q_\phi(z|x)}[\log p(x)]\\ &=\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{p(z|x)} \right] \quad &\text{Bayes Theorem}\\ &=\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{p(z|x)} \cdot \dfrac{q_\phi(z|x)}{q_\phi(z|x)} \right]\\ &=\underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{q_\phi(z|x)} \right] }_{ \text{ELBO} }+\underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{q_\phi(z|x)}{p(z|x)} \right] }_{ \mathbb{D}_{\text{KL}} (q_\phi(z|x)\|p(z|x)) } \quad& \end{aligned}$

Since the KL-divergence is non-negative, the ELBO is a lower bound on $\log p(x)$ , so maximizing the ELBO pushes $\log p(x)$ up.
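The decomposition can be verified exactly on a tiny discrete model. This is a minimal sketch: the latent takes three values, the numbers in `p_joint` and `q` are made up, and the observation $x$ is held fixed.

```python
import math

# Assumed toy model with a discrete latent z in {0, 1, 2} and one fixed observation x.
p_joint = [0.10, 0.25, 0.05]      # p(x, z) for z = 0, 1, 2 (made-up numbers)
q = [0.5, 0.3, 0.2]               # an arbitrary approximate posterior q(z | x)

p_x = sum(p_joint)                        # evidence p(x) = sum_z p(x, z)
p_post = [pj / p_x for pj in p_joint]     # true posterior p(z | x)

# ELBO = E_q[log p(x, z) / q(z | x)]
elbo = sum(qz * math.log(pj / qz) for qz, pj in zip(q, p_joint))
# KL(q(z | x) || p(z | x))
kl = sum(qz * math.log(qz / pz) for qz, pz in zip(q, p_post))
log_evidence = math.log(p_x)
```

Here `elbo + kl` should equal `log_evidence`, and the gap closes only when `q` equals `p_post`.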

When the KL-divergence is zero, $q_\phi(z|x)=p(z|x)$ . For example, if $x=\mu+\sigma z$ holds deterministically, $p(z|x)$ is a delta function and we have

$q_\phi(z|x)=\mathcal{N}\left( z \left| \frac{x-\mu}{\sigma},0 \right.\right)=\delta\left( z-\dfrac{x-\mu}{\sigma} \right)$

In this form the ELBO is still not directly usable, because it involves $p(x,z)$ , which we do not have access to.

Theorem

$\text{ELBO}(x)=\underbrace{ \mathbb{E}_{q_\phi(z|x)}[\log p_{\theta}(x|z)] }_{ \text{how good your decoder is} }\quad\underbrace{-\quad \mathbb{D}_{\text{KL}}(q_\phi(z|x)\|p(z)) }_{ \text{how good your encoder is} }$

$p_{\theta}(x|z),q_\phi(z|x),p(z)$ are all assumed to be Gaussian.

Proof

$\begin{aligned} \text{ELBO}(x)&\stackrel{\text{def}}{=}\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{q_\phi(z|x)} \right]\\ &=\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x|z)p(z)}{q_\phi(z|x)} \right]\\ &=\mathbb{E}_{q_\phi(z|x)}[\log p_{\theta}(x|z)]+\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(z)}{q_\phi(z|x)} \right] \end{aligned}$

Note that we replaced $p(x|z)$ by $p_{\theta}(x|z)$ since the latter is accessible.

The meaning of each term:

Reconstruction:
$\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]$ . It is similar to maximum likelihood, where we look for the model parameters that maximize the likelihood. The expectation is taken over samples $\mathbf{z}$ drawn from $q_\phi(\mathbf{z}|\mathbf{x})$ .
Prior Matching:
The KL divergence for the encoder. We want the encoder to turn $x$ into a latent vector $z$ such that $z$ follows the chosen latent distribution $p(\mathbf{z})$ .
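With a Gaussian encoder and a standard normal prior, the prior-matching term has a closed form, and the reconstruction term can be estimated by sampling. The sketch below is one way to compute the two terms; the `decode` callable, the decoder standard deviation `s_dec`, and the sample count are assumptions made for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(s**2 + m**2 - 1.0 - math.log(s**2) for m, s in zip(mu, sigma))

def reconstruction_term(x, mu, sigma, decode, s_dec, n_samples=1000, seed=0):
    """Monte Carlo estimate of E_q[log N(x | decode(z), s_dec^2 I)]."""
    rng = random.Random(seed)
    d = len(x)
    total = 0.0
    for _ in range(n_samples):
        # Draw z ~ q(z | x) = N(mu, diag(sigma^2)).
        z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
        x_hat = decode(z)
        sq = sum((xi - xh) ** 2 for xi, xh in zip(x, x_hat))
        total += -0.5 * d * math.log(2 * math.pi * s_dec**2) - sq / (2 * s_dec**2)
    return total / n_samples

def elbo(x, mu, sigma, decode, s_dec):
    """Reconstruction term minus prior-matching term."""
    return reconstruction_term(x, mu, sigma, decode, s_dec) - kl_to_standard_normal(mu, sigma)

# Hypothetical 1-D example: the encoder outputs (mu, sigma) and the decoder doubles z.
decode = lambda z: [2.0 * z[0]]
value = elbo([1.0], [0.5], [0.1], decode, 1.0)
```

The KL term vanishes exactly when the encoder outputs `mu = 0` and `sigma = 1`, which is the sense in which it measures how well the encoder matches the prior.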

Example

$\mathbf{x} \sim p(\mathbf{x})=\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\sigma^{2}\mathbf{I})$
$\mathbf{z}\sim p(\mathbf{z})=\mathcal{N}(\mathbf{z}|0,\mathbf{I})$
A trivial solution is then $z=\dfrac{x-\mu}{\sigma}$ and $\hat{x}=\mu+\sigma z$ .
$p(\mathbf{x}|\mathbf{z})=\delta(\mathbf{x}-(\sigma \mathbf{z}+\boldsymbol{\mu}))$
$p(\mathbf{z}|\mathbf{x})=\delta\left( \mathbf{z}-\dfrac{\mathbf{x}-\boldsymbol{\mu}}{\sigma} \right)$

Suppose we don’t know $p(\mathbf{x})$ so we need to estimate $z$ and $x$ .

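$\begin{aligned}(\hat{\boldsymbol{\mu}}(\mathbf{x}),\hat{\sigma}(\mathbf{x})^{2})&=\text{Encoder}_{\phi}(\mathbf{x})\\q_\phi(\mathbf{z}|\mathbf{x})&=\mathcal{N}(\mathbf{z}|\hat{\boldsymbol{\mu}}(\mathbf{x}),\hat{\sigma}(\mathbf{x})^{2}\mathbf{I})\end{aligned}$

Assume $\hat{\boldsymbol{\mu}}$ is an affine function of $x$ , i.e. $\hat{\boldsymbol{\mu}}(\mathbf{x})=a\mathbf{x}+\mathbf{b}$ , and $\hat{\sigma}(\mathbf{x})^{2}=t^{2}$ .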

$q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}(\mathbf{z}|a\mathbf{x}+\mathbf{b},t^{2}\mathbf{I})$

For decoder, we have

$p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}|c\mathbf{z}+\mathbf{v},s^{2}\mathbf{I})$

For the KL-divergence $\mathbb{D}_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}|\mathbf{x}))$ to be zero, we need

$q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \mathbf{z}| \dfrac{x-\mu}{\sigma},0 \right)=\delta\left( \mathbf{z}-\dfrac{x-\mu}{\sigma} \right)$

Substituting into $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]$ :

$\begin{aligned} \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]&=\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log \mathcal{N}(\mathbf{x}|c\mathbf{z}+\mathbf{v},s^{2}\mathbf{I})]\\ &=-\dfrac{1}{2}\log 2\pi-\log s-\dfrac{c^{2}}{2s^{2}}\left[ \left\lVert \dfrac{x-\mu}{\sigma} -\dfrac{x-v}{c} \right\rVert ^{2} \right]\\ &\leq-\dfrac{1}{2} \log 2\pi-\log s \end{aligned}$

Equality holds when $\mathbf{v}=\boldsymbol{\mu},c=\sigma$ . The bound itself grows without limit as $s\to 0$ , so the term is maximized in the limit $s=0$ . This implies that $p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}|\sigma \mathbf{z}+\boldsymbol{\mu},0)=\delta(\mathbf{x}-(\sigma \mathbf{z}+\boldsymbol{\mu}))$

If $p(z)$ and $q_{\phi}$ are both Gaussian with the same covariance, minimizing the KL-divergence is equivalent to minimizing the distance between the means of the two distributions.

The ELBO has a limitation: $q_\phi(\mathbf{z}|\mathbf{x})$ may not equal $p(\mathbf{z}|\mathbf{x})$ , in which case the ELBO is strictly smaller than $\log p(\mathbf{x})$ .

Example

If we don’t know $p(\mathbf{z}|\mathbf{x})$ , we need to train the VAE by maximizing the ELBO.

$\begin{aligned} &q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \mathbf{z}| \dfrac{\mathbf{x}-\boldsymbol{\mu}}{\sigma},t^{2}\mathbf{I} \right)\\ &p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}|\sigma \mathbf{z}+\boldsymbol{\mu},s^{2}\mathbf{I}) \end{aligned}$

After minimizing $\mathbb{D}_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$ and maximizing $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]$ , we have

$\begin{aligned} q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \mathbf{z}| \dfrac{\mathbf{x}-\mu}{\sigma},\mathbf{I} \right)\\ p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}| \sigma \mathbf{z}+\boldsymbol{\mu},\sigma^{2}\mathbf{I}) \end{aligned}$

Compared to the former result, the variances here are $\mathbf{I}$ and $\sigma^{2}\mathbf{I}$ rather than zero, which adds randomness to the samples.
Thus maximizing the ELBO is not the same as maximizing $\log p(x)$ .

Optimizing VAE

A plain Monte Carlo estimate cannot give the gradient of an expectation $\mathbb{E}_{z\sim P_{\phi}(z)}[f(z)]$ with respect to $\phi$ when the sampling distribution itself depends on $\phi$ : in general $\int \nabla_{\phi}\{f(z)P_{\phi}(z)\}\,\mathrm{d}z\neq \int \nabla_{\phi} \{ f(z) \}P_{\phi}(z)\,\mathrm{d}z\approx\dfrac{1}{N}\sum_{i=1}^N \nabla_{\phi}f(z_{i})$ , where $z_{i}\sim P_{\phi}$ .

We therefore introduce the reparameterization trick: express $z$ as a differentiable transformation of another random variable $\varepsilon$ that is independent of the parameter $\phi$ .

In the ELBO context, we define a function $\mathbf{g}$ such that $\mathbf{z}=\mathbf{g}(\boldsymbol{\varepsilon},\boldsymbol{\phi},\mathbf{x})$ for a random variable $\boldsymbol{\varepsilon}\sim p(\boldsymbol{\varepsilon})$ , with $q_\phi(\mathbf{z}|\mathbf{x}) \cdot \left\lvert \det\left( \frac{ \partial \mathbf{z} }{ \partial \boldsymbol{\varepsilon} } \right) \right\rvert=p(\boldsymbol{\varepsilon})$

$\begin{aligned} \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})] &=\int f(\mathbf{z})\cdot q_\phi(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z}\\ &=\int f(\mathbf{g}(\boldsymbol{\varepsilon}))\cdot q_\phi(\mathbf{g}(\boldsymbol{\varepsilon})|\mathbf{x}) \, \mathrm{d}\mathbf{g}(\boldsymbol{\varepsilon}) \\ &=\int f(\mathbf{g}(\boldsymbol{\varepsilon}))\cdot q_\phi(\mathbf{g}(\boldsymbol{\varepsilon})|\mathbf{x})\cdot \left\lvert \det\left( \frac{ \partial \mathbf{g}(\boldsymbol{\varepsilon}) }{ \partial \boldsymbol{\varepsilon} } \right) \right\rvert \, \mathrm{d}\boldsymbol{\varepsilon}\quad\text{(change of variables)}\\ &=\int f(\mathbf{g}(\boldsymbol{\varepsilon}))\cdot p(\boldsymbol{\varepsilon}) \, \mathrm{d}\boldsymbol{\varepsilon} \\ &=\mathbb{E}_{p(\boldsymbol{\varepsilon})}[f(\mathbf{z})] \end{aligned}$

$\nabla_{\phi}\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})]=E_{p(\boldsymbol{\varepsilon})} [\nabla_{\phi}f(\mathbf{z})]$

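Recalling the ELBO formula, we can substitute $f(\mathbf{z})=-\log q_{\phi}(\mathbf{z}|\mathbf{x})=-\log p(\boldsymbol{\varepsilon})+\log \left\lvert \det \frac{ \partial \mathbf{z} }{ \partial \boldsymbol{\varepsilon} } \right\rvert$ . The first term does not depend on $\phi$ , so with $L$ samples $\boldsymbol{\varepsilon}^{(l)}\sim p(\boldsymbol{\varepsilon})$ :

$\nabla_{\phi}\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[-\log q_\phi(\mathbf{z}|\mathbf{x})]\approx\dfrac{1}{L}\sum_{l=1}^L \nabla_{\phi}\left[ \log \left\lvert \det \frac{ \partial \mathbf{z}^{(l)}}{ \partial \boldsymbol{\varepsilon}^{(l)} }\right\rvert \right]$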
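To make the reparameterization trick concrete, the sketch below estimates the gradient of $\mathbb{E}_{z\sim\mathcal{N}(\mu,\sigma^{2})}[z^{2}]$ with respect to $(\mu,\sigma)$ by writing $z=\mu+\sigma\varepsilon$ and differentiating through the samples. The test function $f(z)=z^{2}$ is an assumption chosen because the exact answer, $(2\mu, 2\sigma)$, is known.

```python
import random

def reparam_grad(mu, sigma, n_samples=20000, seed=0):
    """Estimate d/dmu and d/dsigma of E_{z ~ N(mu, sigma^2)}[z^2] via z = mu + sigma * eps."""
    rng = random.Random(seed)
    g_mu = g_sigma = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, 1.0)      # noise that does not depend on the parameters
        z = mu + sigma * eps           # differentiable transformation g(eps, phi)
        df_dz = 2.0 * z                # derivative of f(z) = z^2
        g_mu += df_dz * 1.0            # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps         # chain rule: dz/dsigma = eps
    return g_mu / n_samples, g_sigma / n_samples

grad_mu, grad_sigma = reparam_grad(mu=1.5, sigma=0.5)
```

Since $\mathbb{E}[z^{2}]=\mu^{2}+\sigma^{2}$, the estimates should land close to $3.0$ and $1.0$ for these inputs, up to Monte Carlo noise.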

Integration by Substitution

$\int _{D}f(\mathbf{z}) \, \mathrm{d}\mathbf{z}=\int_{\mathbf{g}^{-1}(D)}f(\mathbf{g}(\varepsilon)) \cdot \left\lvert \det\left( \frac{ \partial \mathbf{z} }{ \partial \boldsymbol{\varepsilon} } \right) \right\rvert \mathrm{d}\boldsymbol{\varepsilon}$

Gaussian Diffusion

$x_{t+1}:=x_{t}+\eta_{t}, \eta_{t} \sim \mathcal{N}(0, \sigma^2)$

The goal is to learn to reverse each intermediate step.

DDPM

At time $t$ , given an input $z$ (a sample from $p_t$ ), output a sample from the conditional distribution $p(x_{t-1}|x_{t}=z)$ .

Learning only the mean of $p(x_{t-1}|x_{t})$ is much simpler than learning the whole distribution.

$\mu_{t-1} (z):= \mathbb{E}[x_{t-1}|x_t=z]$

$\begin{aligned} \implies \mu_{t-1}=&\mathop{\,\operatorname{argmin}\,}_{f:\mathbb{R}^d \to \mathbb{R}^d} \mathbb{E}_{x_{t}, x_{t-1}} ||f(x_{t}) - x_{t-1}||^2\\ =&\mathop{\,\operatorname{argmin}\,}_{f} \mathbb{E}_{x_{t-1}, \eta} || f(x_{t-1} + \eta) - x_{t-1}||^2\\ \end{aligned}$

$\mathbb{E}[x_{t-1}|x_{t}]$ can therefore be estimated with a standard regression loss.

Reverse Sampler

A reverse sampler for step $t$ is a function $F_{t}$ such that if $x_{t}\sim p_{t}$ then the marginal distribution of $F_{t}(x_{t})$ is $p_{t-1}$

$\{F_{t}(z):z \sim p_{t}\} \equiv p_{t-1}$

The notation $\{ F(x): x \sim D \}$ means applying a function $F$ to a variable $x$ that follows the distribution $D$ , thus creating a new distribution.

Variance scaling:

$p(x, k \Delta t)=p_{k}(x)\text{, where }\Delta t=\frac{1}{T}$ and $T$ is the number of discretization steps.

If $x_{k}=x_{k-1} + \mathcal{N}(0,\sigma^2)$ , then $x_{T} \sim \mathcal{N}(x_{0}, T \sigma^2)$ , so the terminal variance grows with the number of steps. We therefore scale the per-step noise as $\sigma=\sigma_{q} \sqrt{ \Delta t }$ , which gives $T\sigma^{2}=\sigma_{q}^{2}$ ; $\sigma_{q}^{2}$ is the desired terminal variance.

Notation:
Below, $t$ represents a continuous value in the interval $[0,1]$ , and subscripts indicate time rather than a step index.

Claim

For Gaussian diffusion setting, we have

$\mathbb{E}[x_{t-\Delta t}|x_{t}]=\dfrac{\Delta t}{t}\mathbb{E}[x_{0}|x_{t}]+\left( 1- \dfrac{\Delta t}{t} \right)x_{t}$
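The identity $\mathbb{E}[x_{t-\Delta t}|x_{t}]=\frac{\Delta t}{t}\mathbb{E}[x_{0}|x_{t}]+(1-\frac{\Delta t}{t})x_{t}$ can be checked in closed form when the data distribution is itself Gaussian. The sketch below assumes $x_{0}\sim\mathcal{N}(0,s_0^{2})$ in one dimension, an illustrative choice under which both conditional means are linear in $x_t$.

```python
def posterior_mean_x0(x_t, t, s0, sigma_q):
    """E[x_0 | x_t] when x_0 ~ N(0, s0^2) and x_t = x_0 + N(0, sigma_q^2 t)."""
    return s0**2 / (s0**2 + sigma_q**2 * t) * x_t

def posterior_mean_prev(x_t, t, dt, s0, sigma_q):
    """E[x_{t-dt} | x_t], from the joint Gaussian of (x_{t-dt}, x_t)."""
    var_prev = s0**2 + sigma_q**2 * (t - dt)
    var_t = s0**2 + sigma_q**2 * t
    return var_prev / var_t * x_t      # uses Cov(x_{t-dt}, x_t) = Var(x_{t-dt})

def claim_rhs(x_t, t, dt, s0, sigma_q):
    """Right-hand side of the claim: (dt/t) E[x_0 | x_t] + (1 - dt/t) x_t."""
    return dt / t * posterior_mean_x0(x_t, t, s0, sigma_q) + (1 - dt / t) * x_t

lhs = posterior_mean_prev(0.7, 0.5, 0.01, 1.2, 2.0)
rhs = claim_rhs(0.7, 0.5, 0.01, 1.2, 2.0)
```

In this Gaussian toy case `lhs` and `rhs` coincide, which shows that predicting $x_0$ and predicting $x_{t-\Delta t}$ carry the same information.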

DDPM: Stochastic Sampling

DDPM stands for Denoising Diffusion Probabilistic Models

Stochastic Reverse Sampler

For input $x_{t}$ and timestep $t$ , output $\hat{x}_{t-\Delta t} \leftarrow \mu_{t-\Delta t}(x_{t}) + \mathcal{N}(0, \sigma_{q}^2 \Delta t)$

Claim

$\exists \mu_{z}\text{ , s.t. }p(x_{t-\Delta t}|x_{t}=z) \approx \mathcal N (x_{t-\Delta t}; \mu_{z}, \sigma_{q}^2 \Delta t)$

Here $\mu_{z}$ is a constant that depends only on $z$ , and we can take

$\begin{aligned} \mu_{z} &:= \mathbb{E}_{x_{t-\Delta t}, x_{t}}[x_{t - \Delta t} | x_{t}=z]\\ &=z+(\sigma_{q}^2 \Delta t)\nabla \log p_{t}(z) \end{aligned}$

Proof

The Bayes rule:

$p(x_{t-\Delta t}|x_{t})=\dfrac{p(x_{t}|x_{t-\Delta t})p_{t-\Delta t}(x_{t-\Delta t})}{p_{t}(x_{t})}$

Take log on both side:

$\begin{aligned} &\log p(x_{t-\Delta t}|x_{t})\\ =&\log p(x_{t}|x_{t-\Delta t})+\log p_{t-\Delta t}(x_{t-\Delta t})\cancel{-\log p_{t}(x_{t})} \quad\quad&\text{Drop constants not involving }x_{t-\Delta t}\\ =&\log p(x_{t}|x_{t-\Delta t})+\log p_{t}(x_{t-\Delta t})+\mathcal{O}(\Delta t) &\text{Because } p_{t - \Delta t}=p_{t}-\Delta t \frac{ \partial }{ \partial t } p_{t}+\mathcal{O}(\Delta t^{2})\\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-x_{t} \rVert ^{2}+\log p_{t}(x_{t-\Delta t})+\mathcal{O}(\Delta t)&\text{Substitute } \mathcal{N}(x_{t};\,x_{t-\Delta t},\sigma_{q}^{2}\Delta t)\\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-x_{t} \rVert ^{2}+\cancel{\log p_{t}(x_{t})}+\langle \nabla_{x}\log p_{t}(x_{t}),(x_{t-\Delta t}-x_{t}) \rangle +\mathcal{O}(\Delta t)&\text{Taylor expand, }\langle \rangle\text{ is inner product} \\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-x_{t}-\sigma_{q}^{2}\Delta t \nabla_{x} \log p_{t}(x_{t}) \rVert ^{2}+C+\mathcal{O}(\Delta t)&\text{Complete the square}\\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-\mu \rVert ^{2}+C+\mathcal{O}(\Delta t),\quad \mu:=x_{t}+\sigma_{q}^{2}\Delta t \nabla_{x} \log p_{t}(x_{t}) \end{aligned}$

Up to an additive constant, this is the log density of $\mathcal{N}(x_{t-\Delta t};\mu,\sigma_q^{2}\Delta t)$ .

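The training loss of DDPM:

import numpy as np

rng = np.random.default_rng()

def train_loss(f_theta, p, sigma_q=1.0, dt=0.01):
    x0 = np.asarray(p.sample(), dtype=float)   # clean sample x_0 from the target distribution
    t = rng.uniform(0.0, 1.0 - dt)             # random time, keeping t + dt inside [0, 1]
    x_t = x0 + sigma_q * np.sqrt(t) * rng.standard_normal(x0.shape)       # x_t ~ N(x_0, sigma_q^2 t)
    x_next = x_t + sigma_q * np.sqrt(dt) * rng.standard_normal(x0.shape)  # x_{t+dt} ~ N(x_t, sigma_q^2 dt)
    return np.sum((f_theta(x_next, t + dt) - x_t) ** 2)  # regress x_t from x_{t+dt}

Sampling:

def DDPM(f_theta, shape, sigma_q=1.0, dt=0.01):
    x = sigma_q * rng.standard_normal(shape)   # x_1 ~ N(0, sigma_q^2)
    for k in range(round(1 / dt), 0, -1):      # t = 1, 1 - dt, ..., dt
        eta = sigma_q * np.sqrt(dt) * rng.standard_normal(shape)
        x = f_theta(x, k * dt) + eta           # predicted mean of x_{t-dt} plus fresh noise
    return x

def DDIM(f_theta, shape, sigma_q=1.0, dt=0.01):
    x = sigma_q * rng.standard_normal(shape)   # x_1 ~ N(0, sigma_q^2)
    for k in range(round(1 / dt), 0, -1):      # t = 1, 1 - dt, ..., dt
        t, t_prev = k * dt, (k - 1) * dt
        weight = np.sqrt(t) / (np.sqrt(t_prev) + np.sqrt(t))
        x = x + weight * (f_theta(x, t) - x)   # deterministic step toward the predicted mean
    return x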

The ELBO of DDPM

$\begin{align} \text{ELBO}_{\phi,\theta}(\mathbf{x})&=\mathbb{E}_{q_{\phi}(x_{\Delta t}|x_{0})}[\log p_{\theta}(x_{0}|x_{\Delta t})] \\ &\quad\quad -\mathbb{E}_{q_{\phi}(x_{1-\Delta t}|x_{0})}[\mathbb{D}_{\text{KL}}(q_{\phi}(x_{1}|x_{1-\Delta t})\|p(x_{1}))] \\ &\quad\quad -\sum_{t=\Delta t}^{1-\Delta t} \mathbb{E}_{q_{\phi}(x_{t-\Delta t},x_{t+\Delta t}|x_{0})}[\mathbb{D}_{\text{KL}}(q_{\phi}(x_{t}|x_{t-\Delta t})\|p_{\theta}(x_{t}|x_{t+\Delta t}))] \end{align}$

The last KL term involves sampling in two directions (both forward and backward), which can increase the variance of its Monte Carlo estimate. We therefore rewrite it to reduce the variance.

$q(x_{t}|x_{t-\Delta t})=\dfrac{q(x_{t-\Delta t}|x_{t})q(x_{t})}{q(x_{t-\Delta t})}=\dfrac{q(x_{t-\Delta t}|x_{t},x_{0})q(x_{t}|x_{0})}{q(x_{t-\Delta t}|x_{0})}$

By substituting this formula, we get:

The ELBO of DDPM, optimized

$\begin{align} \text{ELBO}_{\phi,\theta}(\mathbf{x}) &=\mathbb{E}_{q_{\phi}(x_{\Delta t}|x_{0})}[\log p_{\theta}(x_{0}|x_{\Delta t})] - \mathbb{D}_{\text{KL}}(q_{\phi}(x_{1}|x_{0})\|p(x_{1})) \\ &\quad-\sum \mathbb{E}_{q_{\phi}(x_{t}|x_{0})}[\mathbb{D}_{\text{KL}}(q_{\phi}(x_{t-\Delta t}|x_{t},x_{0})\|p_{\theta}(x_{t-\Delta t}|x_{t}))] \end{align}$

Forward Distribution of DDPM

$q(x_{t+\Delta t}|x_{t})=\mathcal{N}(x_{t+\Delta t}|\sqrt{ \alpha_{t} }x_{t},(1-\alpha_{t})\mathbf{I})$

$q_{\phi}(x_{t}|x_{0})$ in DDPM

$q_{\phi}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t}|\sqrt{ \overline{\alpha}_{t} }\mathbf{x}_{0},(1-\overline{\alpha}_{t})\mathbf{I})$ , where $\overline{\alpha}_{t}=\prod_{i=1}^{t/\Delta t} \alpha_{i\Delta t}$

Proof

By reparameterizing:
$x_{t+\Delta t}=\sqrt{ \alpha_{t} }x_{t}+\sqrt{ 1-\alpha_{t} }\varepsilon_{1},\quad \varepsilon_{1}\sim \mathcal{N}(0,1)$
$x_{t+2\Delta t}=\sqrt{ \alpha_{t+\Delta t} }(\sqrt{ \alpha_{t} }x_{t}+\sqrt{ 1-\alpha _{t} }\varepsilon_{1})+\sqrt{ 1-\alpha_{t+\Delta t} }\varepsilon_{2}, \quad\varepsilon_{1},\varepsilon_{2}\sim \mathcal{N}(0,1)$
$=\sqrt{ \alpha_{t+\Delta t}\alpha_{t} }x_{t}+\sqrt{ \alpha_{t+\Delta t}(1-\alpha_{t}) }\varepsilon_{1}+\sqrt{ 1-\alpha_{t+\Delta t} }\varepsilon_{2}$
$=\sqrt{ \alpha_{t+\Delta t}\alpha_{t} }x_{t}+\sqrt{ 1-\alpha_{t+\Delta t}\alpha_{t} }\varepsilon, \quad\varepsilon\sim \mathcal{N}(0,1)$ (a sum of independent Gaussians is Gaussian, and the variances add: $\alpha_{t+\Delta t}(1-\alpha_{t})+(1-\alpha_{t+\Delta t})=1-\alpha_{t+\Delta t}\alpha_{t}$ )
Repeating the same argument for every step from time $0$ up to time $t$ gives the result.
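The closed form can be confirmed by tracking the mean coefficient and the variance step by step. This is a small sketch; the noise schedule in `alphas` is a made-up example.

```python
import math

def forward_moments_iterative(alphas):
    """Track x_t = m * x_0 + N(0, v) through the steps x <- sqrt(a) x + sqrt(1 - a) eps."""
    m, v = 1.0, 0.0
    for a in alphas:
        m = math.sqrt(a) * m          # the mean coefficient is scaled by sqrt(a)
        v = a * v + (1.0 - a)         # old variance is scaled by a, new noise adds 1 - a
    return m, v

def forward_moments_closed_form(alphas):
    """Closed form: mean coefficient sqrt(alpha_bar), variance 1 - alpha_bar."""
    alpha_bar = math.prod(alphas)
    return math.sqrt(alpha_bar), 1.0 - alpha_bar

alphas = [1.0 - 0.02 * (k + 1) / 10 for k in range(10)]   # assumed toy noise schedule
iterative = forward_moments_iterative(alphas)
closed = forward_moments_closed_form(alphas)
```

Both routes should give the same pair, which is what allows $x_t$ to be sampled from $x_0$ in a single step during training.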

Since $q$ is determined after the noise schedule $\alpha_{t}$ is chosen, all we need to do is to minimize the KL-divergence between the $p_{\theta}$ and fixed $q_{\phi}$ .

If we set $p_{\theta}=\mathcal{N}(x_{t-\Delta t}|\mu_{\theta}(x_{t}),\sigma_{q}^{2}(t)\mathbf{I})$ , the KL term simplifies to $\dfrac{1}{2\sigma_{q}^{2}(t)}\lVert \mu_{q}(x_{t},x_{0})-\mu_{\theta}(x_{t}) \rVert^{2}$ (the KL-divergence between two Gaussian distributions with the same covariance depends only on the distance between their means)
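The equal-variance simplification can be checked against the general closed-form KL between one-dimensional Gaussians. The sketch below sums the per-coordinate KL for an isotropic pair; the numbers are arbitrary.

```python
import math

def kl_gaussian_1d(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ) in closed form."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2) ** 2) / (2 * s2**2) - 0.5

def kl_isotropic_same_var(mu_q, mu_theta, sigma):
    """KL between N(mu_q, sigma^2 I) and N(mu_theta, sigma^2 I): squared distance / (2 sigma^2)."""
    return sum((a - b) ** 2 for a, b in zip(mu_q, mu_theta)) / (2 * sigma**2)

mu_q, mu_theta, sigma = [0.2, -1.0, 3.0], [0.5, 0.4, 2.5], 0.7
per_coordinate = sum(kl_gaussian_1d(a, sigma, b, sigma) for a, b in zip(mu_q, mu_theta))
simplified = kl_isotropic_same_var(mu_q, mu_theta, sigma)
```

With equal variances the log-ratio and trace terms cancel, so only the scaled squared distance between the means remains.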

$\mu_{q}(x_{t},x_{0})$ can be derived from the distribution of $q(x_{t-\Delta t}|x_{t},x_{0})$

Recall that $q(x_{t-\Delta t}|x_{t},x_{0})$ is $\dfrac{q(x_{t-\Delta t}|x_{0})q(x_{t}|x_{t-\Delta t})}{q(x_{t}|x_{0})}$

$q(x_{t-\Delta t}|x_{t},x_{0})=\dfrac{\mathcal{N}(x_{t}|\sqrt{ \alpha_{t} }x_{t-\Delta t},(1-\alpha_{t})\mathbf{I})\mathcal{N}(x_{t-\Delta t}|\sqrt{ \overline{\alpha_{t-\Delta t}} }x_{0},(1-\overline{\alpha_{t-\Delta t}})\mathbf{I})}{\mathcal{N}(x_{t}|\sqrt{ \overline\alpha_{t} }x_{0},(1-\overline{\alpha_{t}})\mathbf{I})}$

Consider the negative log-density $f$ of the above expression as a function of $x_{t-\Delta t}$ . The zero of $f'$ gives the mean (i.e. $f'(\mu)=0$ ), and $\dfrac{1}{f''}$ gives the variance.

Intuitively, we take the simple Gaussian Distribution $\mathcal{N}(x|\mu,\sigma^{2})$

$f(x)=\dfrac{(x-\mu)^{2}}{2\sigma^{2}}$

$f'(x)=\dfrac{x-\mu}{\sigma^{2}}\implies f'(\mu)=0$

$f''(x)=\dfrac{1}{\sigma^{2}}\implies \dfrac{1}{f''}=\sigma^{2}$

Anyway, we can get $\bar{x}_{t-\Delta t}=\mu_{q}(x_{t},x_{0})=\dfrac{(1-\bar{\alpha}_{t-\Delta t})\sqrt{ \alpha_{t} }}{1-\overline{\alpha}_{t}}x_{t}+\dfrac{(1-\alpha_{t})\sqrt{ \overline{\alpha}_{t-\Delta t} }}{1-\overline{\alpha}_{t}}x_{0}$

$\boldsymbol{\Sigma} _q(t)=\dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-\Delta t})}{1-\overline{\alpha}_{t}}\mathbf{I}$
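The posterior mean and variance can be cross-checked by multiplying the two Gaussian factors in $x_{t-\Delta t}$ and adding their precisions, which is the same computation as the $f'$ and $f''$ argument. The sketch below works in the scalar case; `abar_prev` stands for $\overline{\alpha}_{t-\Delta t}$ and the numeric inputs are arbitrary.

```python
import math

def posterior_closed_form(x_t, x_0, alpha_t, abar_prev):
    """Mean and variance of q(x_{t-dt} | x_t, x_0) from the stated formulas (scalar case)."""
    abar_t = alpha_t * abar_prev
    mean = ((1 - abar_prev) * math.sqrt(alpha_t) * x_t
            + (1 - alpha_t) * math.sqrt(abar_prev) * x_0) / (1 - abar_t)
    var = (1 - alpha_t) * (1 - abar_prev) / (1 - abar_t)
    return mean, var

def posterior_by_precision(x_t, x_0, alpha_t, abar_prev):
    """Same quantities by multiplying the two Gaussians in x_{t-dt} and adding precisions."""
    # Factor N(x_t | sqrt(alpha_t) x, 1 - alpha_t), viewed as a function of x = x_{t-dt}.
    prec_lik = alpha_t / (1 - alpha_t)
    lin_lik = math.sqrt(alpha_t) * x_t / (1 - alpha_t)
    # Factor N(x | sqrt(abar_prev) x_0, 1 - abar_prev).
    prec_prior = 1 / (1 - abar_prev)
    lin_prior = math.sqrt(abar_prev) * x_0 / (1 - abar_prev)
    var = 1 / (prec_lik + prec_prior)          # 1 / f''
    return var * (lin_lik + lin_prior), var    # zero of f'

a = posterior_closed_form(0.8, 1.5, 0.97, 0.60)
b = posterior_by_precision(0.8, 1.5, 0.97, 0.60)
```

The denominator $q(x_t|x_0)$ does not depend on $x_{t-\Delta t}$, so it only affects the normalizing constant and can be ignored here.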

There are several variations of the training target.

Consider $\mu_{\theta}(x_{t},t)\stackrel{\text{def}}{=}\dfrac{(1-\bar{\alpha}_{t-\Delta t})\sqrt{ \alpha_{t} }}{1-\overline{\alpha}_{t}}x_{t}+\dfrac{(1-\alpha_{t})\sqrt{ \overline{\alpha}_{t-\Delta t} }}{1-\overline{\alpha}_{t}}\hat{x}_{\theta}(x_{t},t)$ , i.e. the network predicts the clean sample $x_{0}$ .
The consistency term becomes $\dfrac{1}{2\sigma_{q}(t)^{2}} \dfrac{(1-\alpha_{t})^{2}\overline{\alpha}_{t-\Delta t}}{(1-\overline{\alpha}_{t})^{2}}\lVert \hat{x}_{\theta}(x_{t},t)-x_{0} \rVert^{2}$
Consider the noise $\varepsilon_{0}$ from $x_{t}=\sqrt{ \overline{\alpha}_{t} }x_{0}+\sqrt{ 1-\overline{\alpha}_{t} }\varepsilon_{0}$ :
$x_{0}=\dfrac{x_{t}-\sqrt{ 1-\overline{\alpha}_{t} }\varepsilon_{0}}{\sqrt{ \overline{\alpha}_{t} }}$ . Substituting this into $\mu_{q}$ above, we get $\mu_{q}=\dfrac{1}{\sqrt{ \alpha _{t} }}x_{t}-\dfrac{1-\alpha_{t}}{\sqrt{ 1-\overline{\alpha}_{t} }\sqrt{ \alpha_{t} }}\varepsilon_{0}$
If we give $\mu_{\theta}$ the same form, except that $\varepsilon_{0}$ is replaced by a network prediction $\hat{\varepsilon}_{\theta}(x_{t},t)$ , we get a new consistency term: $\dfrac{1}{2\sigma_{q}(t)^{2}} \dfrac{(1-\alpha_{t})^{2}}{(1-\overline{\alpha}_{t})\alpha_{t}}\lVert \varepsilon_{0}-\hat{\varepsilon}_{\theta}(x_{t},t) \rVert^{2}$
Consider Tweedie's formula:
$q(x_{t}|x_{0})=\mathcal{N}(x_{t}|\sqrt{ \overline{\alpha}_{t} }x_{0},(1-\overline{\alpha}_{t})\mathbf{I})$
$\mathbb{E}[\mu_{x_{t}}|x_{t}]=x_{t}+(1-\overline{\alpha}_{t})\nabla_{x_{t}} \log p(x_{t})=\sqrt{ \overline{\alpha}_{t} }x_{0}$
So $x_{0}=\dfrac{x_{t}+(1-\overline{\alpha}_{t})\nabla \log p(x_{t})}{\sqrt{ \overline{\alpha}_{t} }}$ . Substituting into $\mu_{q}(x_{t},x_{0})$ , we get $\mu_{q}(x_{t} ,x_{0})=\dfrac{1}{\sqrt{ \alpha_{t} }}x_{t}+\dfrac{1-\alpha_{t}}{\sqrt{ \alpha_{t} }}\nabla \log p(x_{t})$
The corresponding consistency term, with a score network $s_{\theta}$ : $\dfrac{1}{2\sigma_{q}(t)^{2}} \dfrac{(1-\alpha_{t})^{2}}{\alpha_{t}}\lVert s_{\theta}(x_{t},t)-\nabla \log p(x_{t}) \rVert^{2}$

The score function $\nabla \log p(x_{t})$ is equivalent to the noise $\varepsilon_{0}$ up to a negative scale factor:

$x_{0}=\dfrac{x_{t}+(1-\overline{\alpha}_{t})\nabla \log p(x_{t})}{\sqrt{ \overline{\alpha}_{t} }}=\dfrac{x_{t}-\sqrt{ 1-\overline{\alpha}_{t} }\varepsilon_{0}}{\sqrt{ \overline{\alpha}_{t} }}$

$\begin{align} \implies(1-\overline{\alpha}_{t})\nabla \log p(x_{t})&=-\sqrt{ 1-\overline{\alpha}_{t} }\varepsilon_{0} \\ \nabla \log p(x_{t})&=-\dfrac{1}{\sqrt{ 1-\overline{\alpha}_{t} }}\varepsilon_{0} \end{align}$

Intuitively, the score points in the direction from $x_{t}$ back toward $x_{0}$ , which is the direction of $-\varepsilon_{0}$ .
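The three parameterizations of the posterior mean (clean sample, noise, score) can be compared numerically. The sketch below works with scalars, and `score` denotes the score value implied by a given $(x_{0},\varepsilon_{0})$ pair through the relation derived above; all numeric inputs are arbitrary.

```python
import math

def mu_from_x0(x_t, x0, alpha_t, abar_prev):
    """Posterior mean written in terms of the clean sample x_0."""
    abar_t = alpha_t * abar_prev
    return ((1 - abar_prev) * math.sqrt(alpha_t) * x_t
            + (1 - alpha_t) * math.sqrt(abar_prev) * x0) / (1 - abar_t)

def mu_from_eps(x_t, eps, alpha_t, abar_prev):
    """Posterior mean written in terms of the noise eps_0."""
    abar_t = alpha_t * abar_prev
    return (x_t - (1 - alpha_t) / math.sqrt(1 - abar_t) * eps) / math.sqrt(alpha_t)

def mu_from_score(x_t, score, alpha_t):
    """Posterior mean written in terms of the score."""
    return (x_t + (1 - alpha_t) * score) / math.sqrt(alpha_t)

alpha_t, abar_prev = 0.97, 0.60
x0, eps = 1.3, -0.4
abar_t = alpha_t * abar_prev
x_t = math.sqrt(abar_t) * x0 + math.sqrt(1 - abar_t) * eps
score = -eps / math.sqrt(1 - abar_t)     # score expressed through eps_0

means = (mu_from_x0(x_t, x0, alpha_t, abar_prev),
         mu_from_eps(x_t, eps, alpha_t, abar_prev),
         mu_from_score(x_t, score, alpha_t))
```

All three entries of `means` should coincide, so the choice of training target changes the weighting of the loss rather than the quantity being learned.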

Inferencing

Inferencing of DDPM

$x_{1}\sim \mathcal{N}(0,1)$
$x_{t-\Delta t}=\dfrac{(1-\overline{\alpha}_{t-\Delta t})\sqrt{ \alpha_{t} }}{1-\overline{\alpha}_{t}}x_{t}+ \dfrac{(1-\alpha_{t})\sqrt{ \overline{\alpha_{t-\Delta t}} }}{1-\overline{\alpha}_{t}}\hat{x}_{\theta}(x_{t})+\sigma_{q}(t)\varepsilon, \quad\varepsilon\sim \mathcal{N}(0,1)$

Conditional Generation

For condition $y$ :

$\nabla \log p(\mathbf{x}_{t}|y)=\nabla \log\left( \dfrac{p(\mathbf{x}_{t})p(y|\mathbf{x}_{t})}{p(y)} \right)=\underbrace{ \nabla \log p(\mathbf{x}_{t}) }_{ \text{unconditional score} }+\underbrace{ \nabla \log p(y|\mathbf{x}_{t}) }_{ \text{adversarial gradient from classifier} }\cancel{ -\nabla \log p(y) }$

Scale the classifier gradient by a guidance weight $\gamma$ :

$\nabla \log p_{\gamma}(\mathbf{x}_{t}|y)=\nabla \log p(\mathbf{x}_{t})+\gamma \nabla \log(p(y|\mathbf{x}_{t}))$
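As a small illustration of the guided score, the sketch below uses a one-dimensional toy problem in which everything is Gaussian, so the exact conditional score is available for comparison. The prior $\mathcal{N}(0,1)$ and the Gaussian stand-in for the classifier likelihood are assumptions made only for this example.

```python
def guided_score(x, y, uncond_score, classifier_grad, gamma=1.0):
    """score(x_t) + gamma * grad_x log p(y | x_t)."""
    return uncond_score(x) + gamma * classifier_grad(x, y)

# Assumed toy setup: p(x) = N(0, 1) and a Gaussian "classifier" p(y | x) = N(y; x, tau^2).
tau = 0.5
uncond_score = lambda x: -x                       # d/dx log N(x; 0, 1)
classifier_grad = lambda x, y: (y - x) / tau**2   # d/dx log N(y; x, tau^2)

def exact_posterior_score(x, y):
    """Score of p(x | y) = N(y / (1 + tau^2), tau^2 / (1 + tau^2))."""
    mean, var = y / (1 + tau**2), tau**2 / (1 + tau**2)
    return -(x - mean) / var

x, y = 0.3, 1.2
plain = guided_score(x, y, uncond_score, classifier_grad, gamma=1.0)
strong = guided_score(x, y, uncond_score, classifier_grad, gamma=3.0)
```

With `gamma=1.0` the guided score matches the exact conditional score, while larger values push samples harder toward regions the classifier associates with $y$.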

However, to classify $\mathbf{x}_{t}$ we need to train a separate classifier that works on noised samples.

Alternatively, apply an ordinary classifier to the predicted clean sample $\hat{x}_{0}$ .

The classifier can be removed altogether by training a conditional denoising model $p(x|z,y)$ .

DDIM: Deterministic Sampling

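To simplify notation, let $\alpha_{t}$ from here on denote the cumulative product written $\overline{\alpha}_{t}$ above; the per-step coefficient of the DDPM section then equals $\dfrac{\alpha_{t}}{\alpha_{t-\Delta t}}$ .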

Then $q(x_{t}|x_{0})=\mathcal{N}(x_{t}|\sqrt{ \alpha_{t} }x_{0},(1-\alpha_{t})\mathbf{I})$

$\implies\varepsilon= \dfrac{x_{t}-\sqrt{ \alpha _{t} }x_{0}}{\sqrt{ 1-\alpha_{t} }}$

$\implies x_{t-\Delta t}=\sqrt{ \alpha_{t-\Delta t} }x_{0}+\sqrt{ 1-\alpha_{t-\Delta t} }\varepsilon=\sqrt{ \alpha_{t-\Delta t} }x_{0}+\sqrt{ 1-\alpha_{t-\Delta t} }\left( \dfrac{x_{t}-\sqrt{ \alpha_{t} }x_{0}}{\sqrt{ 1-\alpha_{t} }} \right)$

We look for a transition of the form $q(x_{t-\Delta t}|x_{t},x_{0})=\mathcal{N}\left( \sqrt{ \alpha_{t-\Delta t} }x_{0}+k\left( \dfrac{x_{t}-\sqrt{ \alpha_{t} }x_{0}}{\sqrt{ 1-\alpha_{t} }} \right), \sigma_{t}^{2}I \right)$ whose marginal over $x_{t}\sim q(x_{t}|x_{0})$ is $q(x_{t-\Delta t}|x_{0})=\mathcal{N}(\sqrt{ \alpha_{t-\Delta t} }x_{0},(1-\alpha_{t-\Delta t})\mathbf{I})$ . The mean matches for any $k$ ; matching the variance requires $k^{2}+\sigma_{t}^{2}=1-\alpha_{t-\Delta t}$ .

By matching mean and variance, we get the result

Transition Distribution of DDIM

$q(x_{t-\Delta t}|x_{t},x_{0})=\mathcal{N}\left( \sqrt{ \alpha_{t-\Delta t} }x_{0}+\sqrt{ 1-\alpha_{t-\Delta t} \textcolor{red}{-\sigma_{t}^{2}}}\left( \dfrac{x_{t}-\sqrt{ \alpha_{t} }x_{0}}{\sqrt{ 1-\alpha_{t} }} \right), \sigma_{t}^{2}I \right)$

Setting $\sigma_{t}=0$ makes the transition deterministic.
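A single DDIM transition can be written as a short function. The sketch below works with scalars; `x0_hat` stands for whatever estimate of the clean sample is available (an assumption, since the notes do not fix how it is produced), and `alpha_t`, `alpha_prev` are cumulative products in the notation of this section.

```python
import math

def ddim_step(x_t, x0_hat, alpha_t, alpha_prev, sigma_t=0.0, noise=0.0):
    """One step x_t -> x_{t-dt}; alpha_t and alpha_prev are cumulative products."""
    # Noise implied by the current sample and the clean-sample estimate.
    eps_hat = (x_t - math.sqrt(alpha_t) * x0_hat) / math.sqrt(1 - alpha_t)
    mean = math.sqrt(alpha_prev) * x0_hat + math.sqrt(1 - alpha_prev - sigma_t**2) * eps_hat
    return mean + sigma_t * noise      # noise should be a standard normal draw when sigma_t > 0

# Build x_t from a known (x0, eps) pair and step back with the exact x0.
alpha_t, alpha_prev = 0.5, 0.6
x0, eps = 1.0, -0.7
x_t = math.sqrt(alpha_t) * x0 + math.sqrt(1 - alpha_t) * eps
x_prev = ddim_step(x_t, x0, alpha_t, alpha_prev)
```

With `sigma_t=0` and the exact `x0`, the step lands on $\sqrt{\alpha_{t-\Delta t}}x_{0}+\sqrt{1-\alpha_{t-\Delta t}}\,\varepsilon$, i.e. the same noise is carried to the earlier time.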

Flow Matching

flow

A flow is a collection of time-indexed vector fields $v=\{ v_{t} \}_{t \in[0,1]}$ ; $v_{t}$ can be thought of as the velocity field of a gas at time $t$ .

For a flow $v$ and an initial point $x_{1}$ , the trajectory satisfies $\dfrac{\,\mathrm{d}x_{t}}{\,\mathrm{d}t}=-v_{t}(x_{t})$ .

The Goal of Flow Matching

Learn a flow $v^*$ that transports $q$ to $p$ , where $p$ is the target distribution and $q$ is some easy-to-sample base distribution (e.g. a Gaussian).
The DDIM algorithm is a special case of this.

Pointwise Flow

A pointwise flow $v^{[x_{1},x_{0}]}$ is a flow $\{ v_{t} \}_{t}$ that satisfies $\dfrac{\,\mathrm{d}x_{t}}{\,\mathrm{d}t}=-v_{t}(x_{t})$ , with boundary conditions $x_{1}$ and $x_{0}$

Marginal Flow

A weighted average of all individual particle velocities $v_{t}^{[x_{1},x_{0}]}$

$\mathbb{E}_{x_{0},x_{1}|x_{t}} [v_{t}^{[x_{1},x_{0}]}(x_{t})|x_{t}]$

The $(x_{1},x_{0},x_{t})$ is induced by sampling $(x_{1},x_{0}) \sim \Pi_{q,p}$ , $x_{t}\leftarrow\text{RunFlow}(v^{[x_{1},x_{0}]},x_{1},t)$

Flow Matching:

$\begin{aligned} &v_{t}^*(x_{t}):= \mathbb{E}_{x_{0},x_{1}|x_{t}} [v_{t}^{[x_{1},x_{0}]} (x_{t})|x_{t}]\\ \implies &v_{t}^*=\mathop{\,\operatorname{argmin}\,}_{f:\mathbb{R}^d\to \mathbb{R}^{d}} \mathbb{E}_{(x_{0},x_{1},x_{t})} \lVert f(x_{t})-v_{t}^{[x_{1},x_{0}]}(x_{t}) \rVert ^{2} \end{aligned}$
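One common concrete choice of pointwise flow is the straight line between $x_{1}$ and $x_{0}$. The sketch below is an illustrative assumption, not the only option: it defines such a flow for scalars and an Euler integrator for $\frac{\mathrm{d}x_{t}}{\mathrm{d}t}=-v_{t}(x_{t})$ running from time $1$ down to time $t$.

```python
def pointwise_flow(x1, x0):
    """Straight-line flow between x1 (time 1) and x0 (time 0): constant velocity x0 - x1."""
    return lambda x, t: x0 - x1

def run_flow(v, x1, t, n_steps=100):
    """Integrate dx/dt = -v_t(x) from time 1 down to time t with Euler steps."""
    x, step = x1, (1.0 - t) / n_steps
    for k in range(n_steps):
        s = 1.0 - k * step             # current time
        x = x + v(x, s) * step         # moving backward in time flips the sign of -v
    return x

v = pointwise_flow(x1=2.0, x0=-1.0)
x_end = run_flow(v, 2.0, 0.0)          # should arrive at x0
x_mid = run_flow(v, 2.0, 0.25)         # equals 0.25 * x1 + 0.75 * x0 for this flow
```

For this flow $x_{t}=t\,x_{1}+(1-t)\,x_{0}$, so the regression target in the flow-matching loss is simply $x_{0}-x_{1}$.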

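Training loss of flow matching:

import numpy as np

rng = np.random.default_rng()

def train_loss(f_theta, q_p_dist, pointwise_flow, run_flow):
    x1, x0 = q_p_dist.sample()              # coupled pair: x1 from the base q, x0 from the target p
    t = rng.uniform(0.0, 1.0)
    v = pointwise_flow(x1, x0)              # callable v(x, t): the pointwise velocity field
    xt = run_flow(v, x1, t)                 # position at time t when starting from x1 at time 1
    return np.sum((f_theta(xt, t) - v(xt, t)) ** 2)

def sample(f_theta, base_dist, step_size):
    x = base_dist.sample()                  # x_1 from the base distribution
    for k in range(round(1 / step_size), 0, -1):   # t = 1, 1 - step_size, ..., step_size
        t = k * step_size
        x = x + f_theta(x, t) * step_size   # Euler step of dx/dt = -v_t(x), backward in time
    return x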

Reference

Step-by-Step Diffusion: An Elementary Tutorial
Tutorial on Diffusion Models for Imaging and Vision