Fundamentals

Theorem

For any distribution over $(X,Y)$, we have

$$\mathop{\operatorname{argmin}}_{f}\, \mathbb{E}\,\lVert f(X)-Y \rVert^2=\mathbb{E}[Y|X]$$

Proof

$$\begin{aligned} &\mathop{\operatorname{argmin}}_{f}\, \mathbb{E}\,\lVert f(X)-Y \rVert ^2 \\ =&\mathop{\operatorname{argmin}}_{f}\, \mathbb{E}_{X}\, \mathbb{E}_{Y|X}\left[ \lVert f(x)-Y \rVert^2 \,\middle|\, X=x \right]\\ =&\mathop{\operatorname{argmin}}_{f(x)}\, \mathbb{E}_{X} \left[ f(x)^2 - 2f(x)\, \mathbb{E}_{Y|X} (Y|X=x) + \mathbb{E}_{Y|X} (Y^2|X=x) \right]\\ =&\mathbb{E}_{X} \left( \mathop{\operatorname{argmin}}_{f(x)}\, f(x)^2 - 2f(x)\, \mathbb{E}_{Y|X} (Y|X=x) + \mathbb{E}_{Y|X} (Y^2|X=x) \right) \end{aligned}$$

(The last step holds because the minimization decouples pointwise in $x$, so it can be moved inside the expectation.)

f(x)f(x) 求导,得2f(x)2EYX(YX=x)=0f(x)=EYX(YX=x)2f(x)-2\mathbb{E}_{Y|X} (Y|X=x) = 0 \Leftrightarrow f(x)=\mathbb{E}_{Y|X} (Y|X=x)
所以 f=E[YX](Y)f=\mathbb{E}_{[Y|X]}(Y)
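
A quick numerical sanity check of the theorem (a sketch; the joint distribution of $(X,Y)$ below is a made-up choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.integers(0, 3, size=30_000)          # X uniform on {0, 1, 2}
ys = 2.0 * xs + rng.normal(size=xs.size)      # Y | X=x ~ N(2x, 1)

# For each x, the constant minimizing the squared error over a grid
# should coincide with the conditional mean E[Y | X=x] = 2x.
for x in range(3):
    y = ys[xs == x]
    grid = np.linspace(y.min(), y.max(), 1001)
    mse = np.array([np.mean((y - g) ** 2) for g in grid])
    print(x, grid[np.argmin(mse)], y.mean())
```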

Gradient Formula of Gaussian Distribution

$$\begin{aligned} &y\sim \mathcal{N}(x,\sigma^{2}I) \text{, i.e. } p(y|x)=\dfrac{1}{(2\pi \sigma^{2})^{d/2}}\exp \left( -\dfrac{1}{2\sigma^{2}} \left\| y-x \right\| ^{2} \right) \\ \implies &\nabla_{y}p(y|x)=-\dfrac{1}{\sigma^{2}}(y-x)\, p(y|x) \end{aligned}$$

Proof

$$\nabla_{y} p(y)=\int \nabla_{y} p(y|x)\, p(x) \,\mathrm{d}x$$

$$\begin{aligned} \nabla_{y} p(y|x) &= p(y|x) \cdot \nabla_{y} \log p(y|x)\\ &=p(y|x) \cdot \nabla_{y} \left( -\dfrac{d}{2} \log(2\pi \sigma^{2})-\dfrac{1}{2\sigma^{2}}\left\| y-x \right\| ^{2} \right)\\ &=p(y|x) \cdot \nabla_{y}\left( -\dfrac{1}{2\sigma^{2}}\left\| y-x \right\| ^{2} \right)\\ &=p(y|x) \cdot \left( -\dfrac{1}{\sigma^{2}}(y-x) \right) \end{aligned}$$

Tweedie’s Formula

If $Y|X=x \sim \mathcal{N}(x, \sigma^{2}I)$, with $X$ drawn from some prior, then

$$\mathbb{E}[X|Y=y]=y+\sigma^2 \nabla _y \log p(y)$$

where

- $p(y)$: the marginal density of $Y$ (the distribution of the observed $Y$)
- $\nabla_{y} \log p(y)$: the gradient of the log density of $Y$ with respect to $y$, a.k.a. the score function
Proof

$$p(x|y)=\dfrac{p(y|x)\,p(x)}{p(y)}$$
Since $Y|X=x \sim \mathcal{N}(x, \sigma^2I)$,

$$p(y|x)=\dfrac{1}{(2\pi \sigma^2)^{d/2}} \exp \left(-\frac{1}{2\sigma^{2}}\lVert y-x \rVert^{2}\right)$$

$$\begin{aligned} \mathbb{E}[X|Y=y]&=\int x\, p(x|y)\,\mathrm{d}x\\ &=\dfrac{1}{p(y)}\int x\, p(y|x)\, p(x) \,\mathrm{d}x \end{aligned}$$

$$\nabla_{y} p(y)=\int \nabla_{y} p(y|x)\, p(x) \,\mathrm{d}x$$

By the Gaussian gradient formula above,

$$\nabla_{y} p(y|x)=-\dfrac{1}{\sigma^{2}} (y-x)\, p(y|x)$$

Substituting, we get

$$\begin{aligned} \nabla_{y} p(y)&=\int\left(-\dfrac{1}{\sigma^{2}}(y-x)\,p(y|x)\right)p(x)\,\mathrm{d}x\\ &=-\dfrac{1}{\sigma^{2}}\left( y\int p(y|x)p(x)\,\mathrm{d}x-\int x\,p(y|x)p(x)\,\mathrm{d}x \right) \\ &=-\dfrac{1}{\sigma^{2}}\left( y\,p(y)-\int x\,p(y|x)p(x)\,\mathrm{d}x \right) \end{aligned}$$

Rearranging,

$$y\,p(y)+\sigma^{2}\nabla_{y}p(y)=\int x\,p(y|x)p(x)\,\mathrm{d}x$$

Then for $\mathbb{E}[X|Y=y]$, we have

$$\mathbb{E}[X|Y=y]=\int x\,\dfrac{p(y|x)p(x)}{p(y)}\,\mathrm{d}x=\dfrac{1}{p(y)}\int x\,p(y|x)p(x)\,\mathrm{d}x=\dfrac{y\,p(y)+\sigma^{2}\nabla_{y}p(y)}{p(y)}=y+\sigma^{2}\nabla_{y}\log p(y)$$

using $\nabla_{y}\log p(y)=\nabla_{y}p(y)/p(y)$ in the last step.
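
A numerical check of Tweedie's formula in one dimension (a sketch; the Gaussian prior $\mathcal{N}(m_0,\tau^2)$ is a made-up choice so that $p(y)$ and its score are known in closed form):

```python
import numpy as np

m0, tau, sigma, y = 2.0, 1.5, 0.8, 3.0
rng = np.random.default_rng(0)
x = rng.normal(m0, tau, size=1_000_000)           # X ~ N(m0, tau^2)
obs = x + rng.normal(0.0, sigma, size=x.size)     # Y = X + N(0, sigma^2)

# Empirical E[X | Y ≈ y]: average X over samples whose Y landed near y.
mask = np.abs(obs - y) < 0.02
print(x[mask].mean())

# Tweedie: y + sigma^2 * d/dy log p(y), with p(y) = N(m0, tau^2 + sigma^2).
score = -(y - m0) / (tau**2 + sigma**2)
print(y + sigma**2 * score)                       # the two should agree
```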

VAE

VAE: Variational Auto-Encoder

Latent Variables

Latent variables $z$ are variables that we do not observe and hence are not part of the training dataset.

Encoder: converts the input $x$ to latent variables $z$.

Decoder: converts $z$ back to a generated vector $\hat{x}$.

Variational: refers to optimization over functions, as in the calculus of variations.

VAE: searches for the optimal probability distributions to describe $x$ and $z$.

Key Distributions

- $p(x)$: The true distribution of $x$. THE ULTIMATE GOAL of diffusion is to draw a sample from $p(x)$.
- $p(z)$: The distribution of the latent variable. Typically it is chosen to be $\mathcal{N}(0,I)$: any distribution can be generated by mapping a Gaussian through a sufficiently complicated function.
- $p(z|x)$: The conditional distribution associated with the encoder; the likelihood of $z$ given $x$.
- $p(x|z)$: The conditional distribution associated with the decoder; the posterior probability of getting $x$ given $z$.
- $q_{\phi}(z|x)$: The proxy for $p(z|x)$, which can be parameterized by a deep neural network, e.g.

  $$(\mu,\sigma^{2})=\text{EncoderNetwork}_{\phi}(x), \quad q_{\phi}(z|x) =\mathcal{N}(z\mid\mu,\operatorname{diag}(\sigma^{2}))$$

- $p_{\theta}(x|z)$: The proxy for $p(x|z)$.

![[Pasted image 20250503161111.png]]

ELBO

ELBO: Evidence Lower Bound

$$\text{ELBO}(x)\stackrel{\text{def}}{=} \mathbb{E}_{q_{\phi}(z|x)}\left[ \log \dfrac{p(x,z)}{q_{\phi}(z|x)} \right]$$

KL-divergence

$$\mathbb{D}_{\text{KL}}(P\|Q)=\mathop{\mathbb{E}}_{x \sim P}\left[ \ln \dfrac{p(x)}{q(x)} \right]$$

Decomposition of Log-Likelihood

$$\log p(x)=\mathbb{E}_{q_{\phi}(z|x)}\left[ \log \dfrac{p(x,z)}{q_{\phi}(z|x)} \right]+\mathbb{D}_{\text{KL}}(q_{\phi}(z|x)\|p(z|x))$$

Proof

$$\begin{aligned} \log p(x)&=\log p(x)\times \underbrace{ \int q_{\phi} (z|x)\,\mathrm{d}z }_{ 1 }\\ &=\int \log p(x) \times q_{\phi}(z|x) \,\mathrm{d}z\\ &=\mathbb{E}_{q_\phi(z|x)}[\log p(x)]\\ &=\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{p(z|x)} \right] &&\text{Bayes' theorem}\\ &=\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{p(z|x)} \cdot \dfrac{q_\phi(z|x)}{q_\phi(z|x)} \right]\\ &=\underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{q_\phi(z|x)} \right] }_{ \text{ELBO} }+\underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{q_\phi(z|x)}{p(z|x)} \right] }_{ \mathbb{D}_{\text{KL}} (q_\phi(z|x)\|p(z|x)) } \end{aligned}$$

Since $\mathbb{D}_{\text{KL}}\geq 0$, the ELBO is a lower bound of $\log p(x)$, so maximizing the ELBO serves the goal of maximizing $\log p(x)$.

When the KL divergence is zero, $q_\phi(z|x)=p(z|x)$. In the linear-Gaussian example below, $p(z|x)$ is a delta function, so

$$q_\phi(z|x)=\mathcal{N}\left( z \,\middle|\, \frac{x-\mu}{\sigma},0 \right)=\delta\left( z-\dfrac{x-\mu}{\sigma} \right)$$

In this form the ELBO is still not useful, for it involves $p(x,z)$, which we do not have access to.

Theorem

$$\text{ELBO}(x)=\underbrace{ \mathbb{E}_{q_\phi(z|x)}[\log p_{\theta}(x|z)] }_{ \text{how good your decoder is} }-\underbrace{ \mathbb{D}_{\text{KL}}(q_\phi(z|x)\|p(z)) }_{ \text{how good your encoder is} }$$

where $p_{\theta}(x|z)$, $q_\phi(z|x)$, and $p(z)$ are all Gaussian.

Proof

$$\begin{aligned} \text{ELBO}(x)&\stackrel{\text{def}}{=}\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{q_\phi(z|x)} \right]\\ &=\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x|z)p(z)}{q_\phi(z|x)} \right]\\ &=\mathbb{E}_{q_\phi(z|x)}[\log p_{\theta}(x|z)]+\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(z)}{q_\phi(z|x)} \right]\\ &=\mathbb{E}_{q_\phi(z|x)}[\log p_{\theta}(x|z)]-\mathbb{D}_{\text{KL}}(q_\phi(z|x)\|p(z)) \end{aligned}$$

Note that we replaced $p(x|z)$ by $p_{\theta}(x|z)$, since the latter is accessible.

The meaning of each term:

- Reconstruction: $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]$. It is similar to maximum likelihood: we want to find the model parameters that maximize the likelihood. The expectation is taken w.r.t. samples $\mathbf{z}$ drawn from $q_\phi(\mathbf{z}|\mathbf{x})$.
- Prior Matching: the KL divergence term for the encoder. It pushes the encoder to turn $x$ into latent vectors $z$ that follow the chosen latent distribution $p(\mathbf{z})$.
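
Putting the two terms together, a minimal sketch of the ELBO as a training objective (assuming PyTorch; the layer sizes and the unit-variance Gaussian decoder are illustrative assumptions, and the reparameterized sampling of $z$ is explained in the Optimizing VAE section below):

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, x_dim))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # reparameterization: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        x_hat = self.dec(z)
        # reconstruction term: log p_theta(x|z) for a unit-variance Gaussian decoder
        rec = -0.5 * ((x_hat - x) ** 2).sum(dim=1)
        # prior matching: KL(N(mu, diag(sigma^2)) || N(0, I)) in closed form
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(dim=1)
        return (rec - kl).mean()   # maximize this (minimize its negative)
```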
Example

$\mathbf{x} \sim p(\mathbf{x})=\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\sigma^{2}\mathbf{I})$
$\mathbf{z}\sim p(\mathbf{z})=\mathcal{N}(\mathbf{z}|0,\mathbf{I})$
So $z$ has the trivial solution $\dfrac{x-\mu}{\sigma}$, and $\hat{x}=\mu+\sigma z$. Consequently
$p(\mathbf{x}|\mathbf{z})=\delta(\mathbf{x}-(\sigma \mathbf{z}+\boldsymbol{\mu}))$
$p(\mathbf{z}|\mathbf{x})=\delta\left( \mathbf{z}-\dfrac{\mathbf{x}-\boldsymbol{\mu}}{\sigma} \right)$

Suppose we don't know $p(\mathbf{x})$, so we need to estimate $z$ and $x$.

$$(\hat{\boldsymbol{\mu}}(\mathbf{x}),\hat{\sigma}(\mathbf{x})^{2})=\text{Encoder}_{\phi}(\mathbf{x})$$

Assume $\hat{\boldsymbol{\mu}}$ is an affine function of $\mathbf{x}$:

$$q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}(\mathbf{z}\mid a\mathbf{x}+\mathbf{b},t^{2}\mathbf{I})$$

For the decoder, we have

$$p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}\mid c\mathbf{z}+\mathbf{v},s^{2}\mathbf{I})$$

For the KL divergence $\mathbb{D}_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}|\mathbf{x}))$ to be zero, we need

$$q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \mathbf{z}\,\middle|\, \dfrac{\mathbf{x}-\boldsymbol{\mu}}{\sigma},0 \right)=\delta\left( \mathbf{z}-\dfrac{\mathbf{x}-\boldsymbol{\mu}}{\sigma} \right)$$

Substituting into $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]$:

$$\begin{aligned} \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]&=\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log \mathcal{N}(\mathbf{x}\mid c\mathbf{z}+\mathbf{v},s^{2}\mathbf{I})]\\ &=-\dfrac{1}{2}\log 2\pi-\log s-\dfrac{c^{2}}{2s^{2}}\left\lVert \dfrac{x-\mu}{\sigma} -\dfrac{x-v}{c} \right\rVert ^{2}\\ &\leq-\dfrac{1}{2} \log 2\pi-\log s \end{aligned}$$

When $\mathbf{v}=\boldsymbol{\mu}$ and $c=\sigma$, equality holds; as $s\to 0$, this term reaches its maximum. This implies $p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}\mid\sigma \mathbf{z}+\boldsymbol{\mu},0)=\delta(\mathbf{x}-(\sigma \mathbf{z}+\boldsymbol{\mu}))$.

If $p(z)$ and $q_{\phi}$ are both Gaussian with the same covariance, minimizing the KL divergence is equivalent to minimizing the distance between the means of the two distributions.
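
This follows from the closed-form KL divergence between Gaussians: with equal covariance $\Sigma$, the trace and log-determinant terms cancel, leaving

$$\mathbb{D}_{\text{KL}}\big(\mathcal{N}(\mu_{1},\Sigma)\,\|\,\mathcal{N}(\mu_{2},\Sigma)\big)=\frac{1}{2}(\mu_{1}-\mu_{2})^{\top}\Sigma^{-1}(\mu_{1}-\mu_{2})$$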

The ELBO has a limitation: when $q_\phi(\mathbf{z}|\mathbf{x})$ does not equal $p(\mathbf{z}|\mathbf{x})$, the ELBO is not the same as $\log p(\mathbf{x})$.

Example

If we don't know $p(\mathbf{z}|\mathbf{x})$, we need to train the VAE by maximizing the ELBO.

$$\begin{aligned} &q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \mathbf{z}\,\middle|\,\dfrac{\mathbf{x}-\boldsymbol{\mu}}{\sigma},t^{2}\mathbf{I} \right)\\ &p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}\mid\sigma \mathbf{z}+\boldsymbol{\mu},s^{2}\mathbf{I}) \end{aligned}$$

After maximizing the ELBO, i.e. minimizing $\mathbb{D}_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z}))$ and maximizing $\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]$, we have

$$\begin{aligned} q_\phi(\mathbf{z}|\mathbf{x})&=\mathcal{N}\left( \mathbf{z}\,\middle|\, \dfrac{\mathbf{x}-\boldsymbol{\mu}}{\sigma},\mathbf{I} \right)\\ p_\theta(\mathbf{x}|\mathbf{z})&=\mathcal{N}(\mathbf{x}\mid \sigma \mathbf{z}+\boldsymbol{\mu},\sigma^{2}\mathbf{I}) \end{aligned}$$

Compared to the former result, the result here has variances $\mathbf{I}$ and $\sigma^{2}\mathbf{I}$, which add randomness to the samples.
Thus maximizing the ELBO is not exactly the same as maximizing $\log p(x)$.

Optimizing VAE

Since a naive Monte-Carlo estimate cannot handle the gradient of an expectation taken over a parameter-dependent distribution $\mathbb{E}_{z\sim p_{\phi}(z)}[f(z)]$ (i.e. $\nabla_{\phi}\mathbb{E}_{z\sim p_{\phi}(z)}[f(z)]=\int \nabla_{\phi}\{f(z)p_{\phi}(z)\}\,\mathrm{d}z\neq \int \nabla_{\phi} \{ f(z) \}\,p_{\phi}(z)\,\mathrm{d}z\approx\dfrac{1}{N}\sum_{i=1}^N \nabla_{\phi}f(z_{i})$),

we need to introduce the reparameterization trick: express $z$ as a differentiable transformation of another random variable $\varepsilon$ that is independent of the parameter $\phi$.

In the ELBO context, we define a function $\mathbf{g}$ s.t. $\mathbf{z}=\mathbf{g}(\boldsymbol{\varepsilon},\boldsymbol{\phi},\mathbf{x})$ for a random variable $\boldsymbol{\varepsilon}\sim p(\boldsymbol{\varepsilon})$, with $q_\phi(\mathbf{z}|\mathbf{x}) \cdot \left\lvert \det\left( \frac{ \partial \mathbf{z} }{ \partial \boldsymbol{\varepsilon} } \right) \right\rvert=p(\boldsymbol{\varepsilon})$.

$$\begin{aligned} \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})] &=\int f(\mathbf{z})\, q_\phi(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z}\\ &=\int f(\mathbf{g}(\boldsymbol{\varepsilon}))\, q_\phi(\mathbf{g}(\boldsymbol{\varepsilon})|\mathbf{x}) \, \mathrm{d}\mathbf{g}(\boldsymbol{\varepsilon}) \\ &=\int f(\mathbf{g}(\boldsymbol{\varepsilon}))\, q_\phi(\mathbf{g}(\boldsymbol{\varepsilon})|\mathbf{x})\, \left\lvert \det\left( \frac{ \partial \mathbf{g}(\boldsymbol{\varepsilon}) }{ \partial \boldsymbol{\varepsilon} } \right) \right\rvert \, \mathrm{d}\boldsymbol{\varepsilon}\quad\text{(change of variables)}\\ &=\int f(\mathbf{z})\, p(\boldsymbol{\varepsilon}) \, \mathrm{d}\boldsymbol{\varepsilon} \\ &=\mathbb{E}_{p(\boldsymbol{\varepsilon})}[f(\mathbf{z})] \end{aligned}$$

$$\nabla_{\phi}\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})]=\mathbb{E}_{p(\boldsymbol{\varepsilon})} [\nabla_{\phi}f(\mathbf{z})]$$

Recall the ELBO formula; for its entropy term we can substitute $f(\mathbf{z})=-\log q_{\phi}(\mathbf{z}|\mathbf{x})$:

$$\nabla_{\phi}\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[-\log q_\phi(\mathbf{z}|\mathbf{x})]=\dfrac{1}{L}\sum_{l=1}^L \nabla_{\phi}\left[ \log \left\lvert \det \frac{ \partial \mathbf{z}^{(l)}}{ \partial \boldsymbol{\varepsilon}^{(l)} }\right\rvert \right]$$
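
A tiny numerical sketch of the trick (assuming PyTorch; the objective $f(z)=z^{2}$ with $z\sim\mathcal{N}(\phi,1)$ is a made-up example, where the true gradient is $\frac{\mathrm{d}}{\mathrm{d}\phi}(\phi^{2}+1)=2\phi$):

```python
import torch

phi = torch.tensor(1.5, requires_grad=True)
eps = torch.randn(100_000)            # eps ~ N(0, 1), independent of phi
z = phi + eps                         # z = g(eps, phi), so z ~ N(phi, 1)
loss = (z ** 2).mean()                # Monte-Carlo estimate of E[z^2]
loss.backward()                       # gradient flows through the sampling path
print(phi.grad)                       # ~= 2 * phi = 3.0
```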

Integration by Substitution

$$\int _{D}f(\mathbf{z}) \, \mathrm{d}\mathbf{z}=\int_{\mathbf{g}^{-1}(D)}f(\mathbf{g}(\boldsymbol{\varepsilon})) \cdot \left\lvert \det\left( \frac{ \partial \mathbf{z} }{ \partial \boldsymbol{\varepsilon} } \right) \right\rvert \,\mathrm{d}\boldsymbol{\varepsilon}$$

Gaussian Diffusion

$$x_{t+1}:=x_{t}+\eta_{t},\quad \eta_{t} \sim \mathcal{N}(0, \sigma^2)$$

We then learn to reverse each intermediate step.

DDPM

At time $t$, given input $z$ (a sample from $p_t$), output a sample from the conditional distribution $p(x_{t-1}|x_{t}=z)$.

Learning the mean of $p(x_{t-1}|x_{t})$ is much simpler:

$$\mu_{t-1} (z):= \mathbb{E}[x_{t-1}|x_t=z]$$

$$\begin{aligned} \implies \mu_{t-1}=&\mathop{\operatorname{argmin}}_{f:\mathbb{R}^d \to \mathbb{R}^d} \mathbb{E}_{x_{t}, x_{t-1}} \lVert f(x_{t}) - x_{t-1}\rVert^2\\ =&\mathop{\operatorname{argmin}}_{f} \mathbb{E}_{x_{t-1}, \eta} \lVert f(x_{t-1} + \eta) - x_{t-1}\rVert^2 \end{aligned}$$

Then the estimation of $\mathbb{E}[x_{t-1}|x_{t}]$ can be done with a standard regression loss.

Reverse Sampler

A reverse sampler for step $t$ is a function $F_{t}$ such that if $x_{t}\sim p_{t}$, then the marginal distribution of $F_{t}(x_{t})$ is $p_{t-1}$:

$$\{F_{t}(z):z \sim p_{t}\} \equiv p_{t-1}$$

The $\{ F(x): x \sim D \}$ notation means applying a function to a variable $x$ that follows distribution $D$, thus inducing a new distribution.

Variance scaling:

$p(x, k \Delta t)=p_{k}(x)$, where $\Delta t=\frac{1}{T}$ and $T$ is the number of discretization steps.

If $x_{k}=x_{k-1} + \mathcal{N}(0,\sigma^2)$, then $x_{T} \sim \mathcal{N}(x_{0}, T \sigma^2)$. So we scale the per-step noise as $\sigma=\sigma_{q} \sqrt{ \Delta t }$, where $\sigma_{q}^{2}$ is the desired terminal variance.

Notation:
In what follows, $t$ denotes a continuous value in the interval $[0,1]$, and subscripts indicate time rather than index.

Claim

For the Gaussian diffusion setting, we have

$$\mathbb{E}[x_{t-\Delta t}\mid x_{t}]=\dfrac{\Delta t}{t}\,\mathbb{E}[x_{0}\mid x_{t}]+\left( 1- \dfrac{\Delta t}{t} \right)x_{t}$$

This holds because, conditioned on $(x_{0},x_{t})$, the point $x_{t-\Delta t}$ is a Gaussian bridge with mean $\frac{\Delta t}{t}x_{0}+\left(1-\frac{\Delta t}{t}\right)x_{t}$; taking the expectation given $x_{t}$ (tower rule) gives the claim.

DDPM: Stochastic Sampling

DDPM stands for Denoising Diffusion Probabilistic Models

Stochastic Reverse Sampler

For input $x_{t}$ and timestep $t$, output $\hat{x}_{t-\Delta t} \leftarrow \mu_{t-\Delta t}(x_{t}) + \mathcal{N}(0, \sigma_{q}^2 \Delta t)$

Claim

$$\exists \mu_{z}\ \text{ s.t. }\ p(x_{t-\Delta t}|x_{t}=z) \approx \mathcal N (x_{t-\Delta t};\, \mu_{z}, \sigma_{q}^2 \Delta t)$$

where the constant $\mu_{z}$ depends only on $z$; we can take

$$\begin{aligned} \mu_{z} &:= \mathbb{E}_{x_{t-\Delta t}, x_{t}}[x_{t - \Delta t} \mid x_{t}=z]\\ &=z+(\sigma_{q}^2 \Delta t)\nabla \log p_{t}(z) \end{aligned}$$

Proof

By Bayes' rule:

$$p(x_{t-\Delta t}|x_{t})=\dfrac{p(x_{t}|x_{t-\Delta t})\,p_{t-\Delta t}(x_{t-\Delta t})}{p_{t}(x_{t})}$$

Take the log on both sides:

$$\begin{aligned} &\log p(x_{t-\Delta t}|x_{t})\\ =&\log p(x_{t}|x_{t-\Delta t})+\log p_{t-\Delta t}(x_{t-\Delta t})\cancel{-\log p_{t}(x_{t})} &&\text{drop constants not involving }x_{t-\Delta t}\\ =&\log p(x_{t}|x_{t-\Delta t})+\log p_{t}(x_{t-\Delta t})+\mathcal{O}(\Delta t) &&\text{because } p_{t - \Delta t}=p_{t}+\Delta t \tfrac{ \partial }{ \partial t } p_{t}\\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-x_{t} \rVert ^{2}+\log p_{t}(x_{t-\Delta t}) &&\text{substitute } \mathcal{N}(x_{t};\,x_{t-\Delta t},\sigma_{q}^{2}\Delta t)\\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-x_{t} \rVert ^{2}+\cancel{\log p_{t}(x_{t})}+\langle \nabla_{x}\log p_{t}(x_{t}),\,x_{t-\Delta t}-x_{t} \rangle +\mathcal{O}(\Delta t) &&\text{Taylor expansion; }\langle \cdot,\cdot \rangle \text{ is the inner product} \\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-x_{t}-\sigma_{q}^{2}\Delta t \nabla_{x} \log p_{t}(x_{t}) \rVert ^{2}+C\\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-\mu_{z} \rVert ^{2}+C, \quad \mu_{z}=x_{t}+\sigma_{q}^{2}\Delta t\, \nabla_{x} \log p_{t}(x_{t}) \end{aligned}$$

Up to the constant, this is the log density of $\mathcal{N}(x_{t-\Delta t};\,\mu_{z},\sigma_q^{2}\Delta t)$.

The training loss of DDPM:

```python
import numpy as np

sigma_q, dt = 1.0, 0.01  # terminal std and step size

def train_loss(f_theta, p):
    x0 = p.sample()                     # clean sample x0 ~ p
    t = np.random.uniform(0, 1)         # random time in [0, 1]
    # x_t = x0 + N(0, sigma_q^2 * t); np.random.normal takes a std, not a variance
    x = x0 + np.random.normal(0, sigma_q * np.sqrt(t), size=np.shape(x0))
    x_next = x + np.random.normal(0, sigma_q * np.sqrt(dt), size=np.shape(x0))
    # regress f_theta(x_{t+dt}, t+dt) onto the less-noisy x_t
    return np.sum((f_theta(x_next, t + dt) - x) ** 2)
```

Sampling:

```python
def DDPM(f_theta):
    x = np.random.normal(0, sigma_q)            # x_1 ~ N(0, sigma_q^2)
    for t in np.arange(1, 0, -dt):              # t = 1, 1 - dt, ..., dt
        eta = np.random.normal(0, sigma_q * np.sqrt(dt))
        x = f_theta(x, t) + eta                 # stochastic reverse step
    return x
```

and the deterministic DDIM variant (see the DDIM section below):

```python
def DDIM(f_theta):
    x = np.random.normal(0, sigma_q)
    for t in np.arange(1, 0, -dt):
        # weight = sigma_t / (sigma_{t-dt} + sigma_t) with sigma_t = sigma_q * sqrt(t)
        weight = t ** 0.5 / (max(t - dt, 0.0) ** 0.5 + t ** 0.5)
        x = x + weight * (f_theta(x, t) - x)    # deterministic reverse step
    return x
```

The ELBO of DDPM:

The ELBO of DDPM

$$\begin{aligned} \text{ELBO}_{\phi,\theta}(x)&=\mathbb{E}_{q_{\phi}(x_{\Delta t}|x_{0})}[\log p_{\theta}(x_{0}|x_{\Delta t})] \\ &\quad -\mathbb{E}_{q_{\phi}(x_{1-\Delta t}|x_{0})}\left[\mathbb{D}_{\text{KL}}(q_{\phi}(x_{1}|x_{1-\Delta t})\|p(x_{1}))\right] \\ &\quad -\sum_{t} \mathbb{E}_{q_{\phi}(x_{t-\Delta t},x_{t+\Delta t}|x_{0})}[\mathbb{D}_{\text{KL}}(q_{\phi}(x_{t}|x_{t-\Delta t})\|p_{\theta}(x_{t}|x_{t+\Delta t}))] \end{aligned}$$

The last KL term involves sampling in two directions (both forward and backward), which may introduce extra variance in the Monte-Carlo estimate. We want to reduce it. By Bayes' rule and the Markov property,

$$q(x_{t}|x_{t-\Delta t})=\dfrac{q(x_{t-\Delta t}|x_{t})\,q(x_{t})}{q(x_{t-\Delta t})}=\dfrac{q(x_{t-\Delta t}|x_{t},x_{0})\,q(x_{t}|x_{0})}{q(x_{t-\Delta t}|x_{0})}$$

By substituting this formula, we get:

The ELBO of DDPM, optimized

$$\begin{aligned} \text{ELBO}_{\phi,\theta}(x) &=\mathbb{E}_{q_{\phi}(x_{\Delta t}|x_{0})}[\log p_{\theta}(x_{0}|x_{\Delta t})] - \mathbb{D}_{\text{KL}}(q_{\phi}(x_{1}|x_{0})\|p(x_{1})) \\ &\quad-\sum_{t} \mathbb{E}_{q_{\phi}(x_{t}|x_{0})}[\mathbb{D}_{\text{KL}}(q_{\phi}(x_{t-\Delta t}|x_{t},x_{0})\|p_{\theta}(x_{t-\Delta t}|x_{t}))] \end{aligned}$$

Forward Distribution of DDPM

$$q(x_{t+\Delta t}|x_{t})=\mathcal{N}(x_{t+\Delta t}\mid\sqrt{ \alpha_{t} }\,x_{t},(1-\alpha_{t})\mathbf{I})$$

$q_{\phi}(x_{t}|x_{0})$ in DDPM

$$q_{\phi}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t}\mid\sqrt{ \overline{\alpha}_{t} }\,\mathbf{x}_{0},(1-\overline{\alpha}_{t})\mathbf{I}),\quad\text{where } \overline{\alpha}_{t}=\prod_{i=1}^{t/\Delta t} \alpha_{i\Delta t}$$

Proof

By reparameterizing:

$$x_{t+\Delta t}=\sqrt{ \alpha_{t} }\,x_{t}+\sqrt{ 1-\alpha_{t} }\,\varepsilon_{1},\quad \varepsilon_{1}\sim \mathcal{N}(0,1)$$

$$\begin{aligned} x_{t+2\Delta t}&=\sqrt{ \alpha_{t+\Delta t} }\left(\sqrt{ \alpha_{t} }x_{t}+\sqrt{ 1-\alpha _{t} }\varepsilon_{1}\right)+\sqrt{ 1-\alpha_{t+\Delta t} }\,\varepsilon_{2}, \quad\varepsilon_{1},\varepsilon_{2}\sim \mathcal{N}(0,1)\\ &=\sqrt{ \alpha_{t+\Delta t}\alpha_{t} }\,x_{t}+\sqrt{ \alpha_{t+\Delta t}(1-\alpha_{t}) }\,\varepsilon_{1}+\sqrt{ 1-\alpha_{t+\Delta t} }\,\varepsilon_{2}\\ &=\sqrt{ \alpha_{t+\Delta t}\alpha_{t} }\,x_{t}+\sqrt{ 1-\alpha_{t+\Delta t}\alpha_{t} }\,\varepsilon, \quad\varepsilon\sim \mathcal{N}(0,1) \end{aligned}$$

since a sum of independent Gaussian samples is still Gaussian and the variances add: $\alpha_{t+\Delta t}(1-\alpha_{t})+(1-\alpha_{t+\Delta t})=1-\alpha_{t+\Delta t}\alpha_{t}$. Iterating the same step over all $t/\Delta t$ transitions from $x_{0}$ gives the result.
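
A quick simulation check of this closed form (a sketch; the linear schedule below is a made-up choice):

```python
import numpy as np

T, x0 = 50, 3.0
alphas = np.linspace(0.99, 0.95, T)            # made-up noise schedule
x = np.full(200_000, x0)
for a in alphas:                               # run the forward chain
    x = np.sqrt(a) * x + np.sqrt(1 - a) * np.random.normal(size=x.size)

abar = np.prod(alphas)
print(x.mean(), np.sqrt(abar) * x0)            # empirical vs sqrt(abar) * x0
print(x.std(), np.sqrt(1 - abar))              # empirical vs sqrt(1 - abar)
```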

Since $q$ is fully determined once the noise schedule $\alpha_{t}$ is chosen, all we need to do is minimize the KL divergence between $p_{\theta}$ and the fixed $q_{\phi}$.

If we set $p_{\theta}=\mathcal{N}(x_{t-\Delta t}\mid\mu_{\theta}(x_{t}),\sigma_{q}^{2}(t)\mathbf{I})$, the KL term simplifies to $\dfrac{1}{2\sigma_{q}^{2}(t)}\lVert \mu_{q}(x_{t},x_{0})-\mu_{\theta}(x_{t}) \rVert^{2}$ (the KL divergence between two Gaussians with the same variance reduces to the squared distance between their means).

$\mu_{q}(x_{t},x_{0})$ can be derived from the distribution $q(x_{t-\Delta t}|x_{t},x_{0})$.

Recall that $q(x_{t-\Delta t}|x_{t},x_{0})=\dfrac{q(x_{t-\Delta t}|x_{0})\,q(x_{t}|x_{t-\Delta t})}{q(x_{t}|x_{0})}$, so

$$q(x_{t-\Delta t}|x_{t},x_{0})=\dfrac{\mathcal{N}(x_{t}\mid\sqrt{ \alpha_{t} }x_{t-\Delta t},(1-\alpha_{t})\mathbf{I})\,\mathcal{N}(x_{t-\Delta t}\mid\sqrt{ \overline{\alpha}_{t-\Delta t} }x_{0},(1-\overline{\alpha}_{t-\Delta t})\mathbf{I})}{\mathcal{N}(x_{t}\mid\sqrt{ \overline{\alpha}_{t} }x_{0},(1-\overline{\alpha}_{t})\mathbf{I})}$$

Consider the negative log-likelihood $f$ of the above expression as a function of $x_{t-\Delta t}$: the zero of $f'$ gives the mean (i.e. $f'(\mu)=0$), and $\dfrac{1}{f''}$ gives the variance.

Intuitively, take the simple Gaussian distribution $\mathcal{N}(x|\mu,\sigma^{2})$, whose negative log-likelihood (up to a constant) is

$$f(x)=\dfrac{(x-\mu)^{2}}{2\sigma^{2}}$$

$$f'(x)=\dfrac{x-\mu}{\sigma^{2}}\implies f'(\mu)=0$$

$$f''(x)=\dfrac{1}{\sigma^{2}}\implies \dfrac{1}{f''}=\sigma^{2}$$

Carrying out the algebra, we get

$$\bar{x}_{t-\Delta t}=\mu_{q}(x_{t},x_{0})=\dfrac{(1-\overline{\alpha}_{t-\Delta t})\sqrt{ \alpha_{t} }}{1-\overline{\alpha}_{t}}x_{t}+\dfrac{(1-\alpha_{t})\sqrt{ \overline{\alpha}_{t-\Delta t} }}{1-\overline{\alpha}_{t}}x_{0}$$

$$\boldsymbol{\Sigma} _q(t)=\dfrac{(1-\alpha_{t})(1-\overline{\alpha}_{t-\Delta t})}{1-\overline{\alpha}_{t}}\mathbf{I}$$

There are several variations of the training target:

- Consider $\mu_{\theta}(x_{t},t)\stackrel{\text{def}}{=}\dfrac{(1-\overline{\alpha}_{t-\Delta t})\sqrt{ \alpha_{t} }}{1-\overline{\alpha}_{t}}x_{t}+\dfrac{(1-\alpha_{t})\sqrt{ \overline{\alpha}_{t-\Delta t} }}{1-\overline{\alpha}_{t}}\hat{x}_{\theta}(x_{t},t)$, i.e. predicting the clean sample $x_{0}$.
  The consistency term becomes $\dfrac{1}{2\sigma_{q}(t)^{2}} \dfrac{(1-\alpha_{t})^{2}\overline{\alpha}_{t-\Delta t}}{(1-\overline{\alpha}_{t})^{2}}\lVert \hat{x}_{\theta}(x_{t},t)-x_{0} \rVert^{2}$.

- Consider the noise $\varepsilon_{0}$ from $x_{t}=\sqrt{ \overline{\alpha}_{t} }x_{0}+\sqrt{ 1-\overline{\alpha}_{t} }\varepsilon_{0}$:
  $x_{0}=\dfrac{x_{t}-\sqrt{ 1-\overline{\alpha}_{t} }\varepsilon_{0}}{\sqrt{ \overline{\alpha}_{t} }}$; substituting into $\mu_{q}$ above gives $\mu_{q}=\dfrac{1}{\sqrt{ \alpha _{t} }}x_{t}-\dfrac{1-\alpha_{t}}{\sqrt{ 1-\overline{\alpha}_{t} }\sqrt{ \alpha_{t} }}\varepsilon_{0}$.
  If we give $\mu_{\theta}$ the same form, except with $\varepsilon_{0}$ replaced by $\hat{\varepsilon}_{\theta}(x_{t},t)$, the new consistency term is $\dfrac{1}{2\sigma_{q}(t)^{2}} \dfrac{(1-\alpha_{t})^{2}}{(1-\overline{\alpha}_{t})\alpha_{t}}\lVert \varepsilon_{0}-\hat{\varepsilon}_{\theta}(x_{t},t) \rVert^{2}$ (see the sketch after this list).

- Consider the Tweedie formula: with $q(x_{t}|x_{0})=\mathcal{N}(x_{t}\mid\sqrt{ \overline{\alpha}_{t} }x_{0},(1-\overline{\alpha}_{t})\mathbf{I})$,
  $\mathbb{E}[\sqrt{ \overline{\alpha}_{t} }x_{0}\mid x_{t}]=x_{t}+(1-\overline{\alpha}_{t})\nabla_{x_{t}} \log p(x_{t})$.
  So $x_{0}=\dfrac{x_{t}+(1-\overline{\alpha}_{t})\nabla \log p(x_{t})}{\sqrt{ \overline{\alpha}_{t} }}$; substituting into $\mu_{q}(x_{t},x_{0})$ gives $\mu_{q}(x_{t},x_{0})=\dfrac{1}{\sqrt{ \alpha_{t} }}x_{t}+\dfrac{1-\alpha_{t}}{\sqrt{ \alpha_{t} }}\nabla \log p(x_{t})$.
  The corresponding consistency term: $\dfrac{1}{2\sigma_{q}(t)^{2}} \dfrac{(1-\alpha_{t})^{2}}{\alpha_{t}}\lVert s_{\theta}(x_{t},t)-\nabla \log p(x_{t}) \rVert^{2}$.
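
A sketch of the ε-prediction training step in code (the names `eps_theta` and the discrete cumulative schedule `alpha_bar` are illustrative assumptions):

```python
import numpy as np

def eps_prediction_loss(eps_theta, x0, alpha_bar):
    k = np.random.randint(1, len(alpha_bar))    # random timestep index
    eps0 = np.random.normal(size=np.shape(x0))  # eps_0 ~ N(0, I)
    # forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps_0
    xt = np.sqrt(alpha_bar[k]) * x0 + np.sqrt(1 - alpha_bar[k]) * eps0
    # regress the predicted noise onto the true noise
    return np.sum((eps0 - eps_theta(xt, k)) ** 2)
```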

The score function $\nabla \log p(x_{t})$ carries the same information as $\varepsilon_{0}$:

$$x_{0}=\dfrac{x_{t}+(1-\overline{\alpha}_{t})\nabla \log p(x_{t})}{\sqrt{ \overline{\alpha}_{t} }}=\dfrac{x_{t}-\sqrt{ 1-\overline{\alpha}_{t} }\varepsilon_{0}}{\sqrt{ \overline{\alpha}_{t} }}$$

$$\begin{aligned} \implies(1-\overline{\alpha}_{t})\nabla \log p(x_{t})&=-\sqrt{ 1-\overline{\alpha}_{t} }\,\varepsilon_{0} \\ \nabla \log p(x_{t})&=-\dfrac{1}{\sqrt{ 1-\overline{\alpha}_{t} }}\varepsilon_{0} \end{aligned}$$

Intuitively, the score points in the denoising direction $x_{t}\to x_{0}$, which is $-\varepsilon_{0}$ (up to scale).

Inference

Inference of DDPM

$$x_{1}\sim \mathcal{N}(0,1)$$

$$x_{t-\Delta t}=\dfrac{(1-\overline{\alpha}_{t-\Delta t})\sqrt{ \alpha_{t} }}{1-\overline{\alpha}_{t}}x_{t}+ \dfrac{(1-\alpha_{t})\sqrt{ \overline{\alpha}_{t-\Delta t} }}{1-\overline{\alpha}_{t}}\hat{x}_{\theta}(x_{t})+\sigma_{q}(t)\varepsilon, \quad\varepsilon\sim \mathcal{N}(0,1)$$

Conditional Generation

For a condition $y$:

$$\nabla \log p(\mathbf{x}_{t}|y)=\nabla \log\left( \dfrac{p(\mathbf{x}_{t})p(y|\mathbf{x}_{t})}{p(y)} \right)=\underbrace{ \nabla \log p(\mathbf{x}_{t}) }_{ \text{unconditional score} }+\underbrace{ \nabla \log p(y|\mathbf{x}_{t}) }_{ \text{adversarial gradient from classifier} }\cancel{ -\nabla \log p(y) }$$

Scale the classifier gradient by a guidance weight $\gamma$:

$$\nabla \log p_{\gamma}(\mathbf{x}_{t}|y)=\nabla \log p(\mathbf{x}_{t})+\gamma \nabla \log p(y|\mathbf{x}_{t})$$

However, to classify $\mathbf{x}_{t}$, we need to train another classifier that works on noised samples.

Or use the predicted clean sample $\hat{x}_{0}$ as the classifier input.

The classifier can be removed by training a conditional denoising model $p(x|z,y)$ (classifier-free guidance), as sketched below.
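
A sketch of classifier-free guidance at sampling time (assuming an ε-prediction network `eps_theta` that accepts an optional condition, with `None` meaning unconditional):

```python
def guided_eps(eps_theta, xt, t, y, gamma):
    eps_uncond = eps_theta(xt, t, None)   # estimates the score of p(x_t)
    eps_cond = eps_theta(xt, t, y)        # estimates the score of p(x_t | y)
    # since eps is proportional to the (negative) score, this matches
    # grad log p_gamma = (1 - gamma) grad log p(x_t) + gamma grad log p(x_t | y)
    return (1 - gamma) * eps_uncond + gamma * eps_cond
```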

DDIM: Deterministic Sampling

To simplify notation, substitute $\alpha_{t}\to\dfrac{\alpha_{t}}{\alpha_{t-\Delta t}}$ in the per-step transition, so that $\alpha_{t}$ now plays the role of the previous $\overline{\alpha}_{t}$.

Then $q(x_{t}|x_{0})=\mathcal{N}(x_{t}\mid\sqrt{ \alpha_{t} }x_{0},(1-\alpha_{t})\mathbf{I})$

$$\implies\varepsilon= \dfrac{x_{t}-\sqrt{ \alpha _{t} }x_{0}}{\sqrt{ 1-\alpha_{t} }}$$

$$\implies x_{t-\Delta t}=\sqrt{ \alpha_{t-\Delta t} }x_{0}+\sqrt{ 1-\alpha_{t-\Delta t} }\,\varepsilon=\sqrt{ \alpha_{t-\Delta t} }x_{0}+\sqrt{ 1-\alpha_{t-\Delta t} }\left( \dfrac{x_{t}-\sqrt{ \alpha_{t} }x_{0}}{\sqrt{ 1-\alpha_{t} }} \right)$$

We want $q(x_{t-\Delta t}|x_{t},x_{0})=\mathcal{N}\left( \sqrt{ \alpha_{t-\Delta t} }x_{0}+\sqrt{ 1-\alpha_{t-\Delta t} }\left( \dfrac{x_{t}-\sqrt{ \alpha_{t} }x_{0}}{\sqrt{ 1-\alpha_{t} }} \right), \sigma_{t}^{2}\mathbf{I} \right)$ to be consistent with the marginal distribution $q(x_{t-\Delta t}|x_{0})=\mathcal{N}(\sqrt{ \alpha_{t-\Delta t} }x_{0},(1-\alpha_{t-\Delta t})\mathbf{I})$.

Matching the mean and variance, we get the result.

Transition Distribution of DDIM

$$q(x_{t-\Delta t}|x_{t},x_{0})=\mathcal{N}\left( \sqrt{ \alpha_{t-\Delta t} }x_{0}+\sqrt{ 1-\alpha_{t-\Delta t} \textcolor{red}{-\sigma_{t}^{2}}}\left( \dfrac{x_{t}-\sqrt{ \alpha_{t} }x_{0}}{\sqrt{ 1-\alpha_{t} }} \right), \sigma_{t}^{2}\mathbf{I} \right)$$

Setting $\sigma_{t}=0$ makes the sampler deterministic, as sketched below.
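
One deterministic DDIM update in code (a sketch with $\sigma_{t}=0$; `x0_hat` is the clean-sample prediction $\hat{x}_{\theta}(x_{t},t)$ and the cumulative schedule values `alpha_t`, `alpha_prev` are assumed inputs):

```python
import numpy as np

def ddim_step(xt, x0_hat, alpha_t, alpha_prev):
    # recover the noise direction implied by the clean-sample prediction
    eps_hat = (xt - np.sqrt(alpha_t) * x0_hat) / np.sqrt(1 - alpha_t)
    # move to the previous time with sigma_t = 0 (fully deterministic)
    return np.sqrt(alpha_prev) * x0_hat + np.sqrt(1 - alpha_prev) * eps_hat
```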

Flow Matching

flow

A flow is a collection of time-indexed vector fields $v=\{ v_{t} \}_{t \in[0,1]}$. Think of $v_{t}$ as the velocity field of a gas at time $t$.

For a flow $v$ and initial point $x_{1}$, the trajectory satisfies $\dfrac{\mathrm{d}x_{t}}{\mathrm{d}t}=-v_{t}(x_{t})$ (time runs from $t=1$ down to $t=0$).

The Goal of Flow Matching

Learn a flow $v^*$ that transports $q$ to $p$, where $p$ is the target distribution and $q$ is an easy-to-sample base distribution (e.g. Gaussian).
The DDIM algorithm is a special case of this.

Pointwise Flow

A pointwise flow $v^{[x_{1},x_{0}]}$ is a flow $\{ v_{t} \}_{t}$ that satisfies $\dfrac{\mathrm{d}x_{t}}{\mathrm{d}t}=-v_{t}(x_{t})$, with boundary conditions $x_{1}$ at $t=1$ and $x_{0}$ at $t=0$.

Marginal Flow

The weighted average of all individual particle velocities $v_{t}^{[x_{1},x_{0}]}$:

$$v_{t}^{*}(x_{t})=\mathbb{E}_{x_{0},x_{1}|x_{t}} [v_{t}^{[x_{1},x_{0}]}(x_{t})\mid x_{t}]$$

The joint distribution of $(x_{1},x_{0},x_{t})$ is induced by sampling $(x_{1},x_{0}) \sim \Pi_{q,p}$ and $x_{t}\leftarrow\text{RunFlow}(v^{[x_{1},x_{0}]},x_{1},t)$.

Flow Matching:

$$\begin{aligned} &v_{t}^*(x_{t}):= \mathbb{E}_{x_{0},x_{1}|x_{t}} [v_{t}^{[x_{1},x_{0}]} (x_{t})\mid x_{t}]\\ \implies &v_{t}^*=\mathop{\operatorname{argmin}}_{f:\mathbb{R}^d\to \mathbb{R}^{d}} \mathbb{E}_{(x_{0},x_{1},x_{t})} \lVert f(x_{t})-v_{t}^{[x_{1},x_{0}]}(x_{t}) \rVert ^{2} \end{aligned}$$
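
A common concrete choice of pointwise flow (an illustrative assumption, not the only option) is the straight-line interpolation between $x_{1}$ and $x_{0}$:

$$x_{t}=t\,x_{1}+(1-t)\,x_{0} \implies \dfrac{\mathrm{d}x_{t}}{\mathrm{d}t}=x_{1}-x_{0},\quad\text{so}\quad v_{t}^{[x_{1},x_{0}]}(x_{t})=x_{0}-x_{1}$$

with the sign following the convention $\frac{\mathrm{d}x_{t}}{\mathrm{d}t}=-v_{t}(x_{t})$, so each sampling step $x_{t-\Delta t}=x_{t}+v_{t}(x_{t})\Delta t$ moves toward $x_{0}$.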

Training loss of flow matching:

```python
def train_loss(f_theta, q_p_dist, pointwise_flow):
    x1, x0 = q_p_dist.sample()            # (x1, x0) ~ Pi_{q,p}
    t = np.random.uniform(0, 1)
    v = pointwise_flow(x1, x0)            # pointwise flow v^[x1,x0], callable as v(x, t)
    xt = run_flow(v, x1, t)               # transport x1 to time t along v (assumed helper)
    return np.sum((f_theta(xt, t) - v(xt, t)) ** 2)
```

Sampling (Euler integration of the learned flow):

```python
def sample(f_theta, base_dist, step_size):
    x = base_dist.sample()                # x1 ~ q
    for t in np.arange(1, 0, -step_size): # integrate from t = 1 down to t = 0
        x = x + f_theta(x, t) * step_size # x_{t - dt} = x_t + v_t(x_t) * dt
    return x
```

Reference

Step-by-Step Diffusion: An Elementary Tutorial
Tutorial on Diffusion Models for Imaging and Vision