Fundamentals

Theorem

For any distribution over (X,Y)(X,Y), we have

argminf Ef(X)Y2=E[YX]\text{argmin}_{f} \space \mathbb{E} ||f(X)-Y||^2=\mathbb{E}[Y|X]

Proof

argminfEf(x)y2=argminfEf(x)y2=argminfEXEYX[f(x)YX=x]2=argminf(x)EX[f(x)22f(x)EYX(YX=x)+EYX(YX=x)2]=EX(argminf(x)f(x)22f(x)EYX(YX=x)+EYX(YX=x)2)\begin{aligned} &\mathop{\operatorname{argmin}\,}_{f} \mathbb{E} \lvert \lvert f(x)-y \rvert \rvert ^2 \\ =&\mathop{\operatorname{argmin}\,}_{f} \mathbb{E} ||f(x)-y||^2\\ =&\mathop{\operatorname{argmin}\,}_{f} \mathbb{E}_{X} \mathbb{E}_{Y|X} [f(x)-Y|X=x]^2\\ =&\mathop{\operatorname{argmin}\,}_{f(x)} \mathbb{E}_{X} [f(x)^2 - 2f(x) E_{Y|X} (Y|X=x) + \mathbb{E}_{Y|X} (Y|X=x)^2]\\ =&\mathbb{E}_{X} (\mathop{\operatorname{argmin}\,}_{f(x)} f(x)^2 - 2f(x) \mathbb{E}_{Y|X} (Y|X=x) + \mathbb{E}_{Y|X} (Y|X=x)^2) \end{aligned}

f(x)f(x) 求导,得2f(x)2EYX(YX=x)=0f(x)=EYX(YX=x)2f(x)-2\mathbb{E}_{Y|X} (Y|X=x) = 0 \Leftrightarrow f(x)=\mathbb{E}_{Y|X} (Y|X=x)
所以 f=E[YX](Y)f=\mathbb{E}_{[Y|X]}(Y)

Gradient Formula of Gaussian Distribution

yN(x,σ2I) i.e. p(yx)=1(2πσ2)d/2exp(12σ2yx2)    yp(yx)=1σ2(yx)p(yx)\begin{aligned} &y\sim \mathcal{N}(x,\sigma^{2}I) \text{ i.e. } p(y|x)=\dfrac{1}{(2\pi \sigma^{2})^{d/2}}\exp \left( -\dfrac{1}{2\sigma^{2}} \left\| y-x \right\| ^{2} \right) \\ \implies &\nabla_{y}p(y|x)=-\dfrac{1}{\sigma^{2}}(y-x) p(y|x) \end{aligned}

Proof

yp(y)=yp(yx)p(x)dx\nabla_{y} p(y)=\int \nabla_{y} p(y|x) p(x) \,\mathrm{d}x

yp(yx)=p(yx)ylogp(yx)=p(yx)y(d2log(2πσ2)12σ2yx2)=p(yx)y(12σ2yx2)=p(yx)(1σ2(yx))\begin{aligned} \nabla_{y} p(y|x) &= p(y|x) \cdot \nabla_{y} \log p(y|x)\\ &=p(y|x) \cdot \nabla_{y} \left( -\dfrac{d}{2} \log(2\pi \sigma^{2})-\dfrac{1}{2\sigma^{2}}\left\| y-x \right\| ^{2} \right)\\ &=p(y|x) \cdot \nabla_{y}\left( -\dfrac{1}{2\sigma^{2}}\left\| y-x \right\| ^{2} \right)\\ &=p(y|x) \cdot \left( -\dfrac{1}{\sigma^{2}}(y-x) \right) \end{aligned}

Tweedie’s Formula

If YN(x,σ2)Y \sim \mathcal{N}(x, \sigma^{2}), then

E[XY=y]=y+σ2ylogp(y)\mathbb{E}[X|Y=y]=y+\sigma^2 \nabla _y \log p(y)

其中

  • p(y):p(y): YY 的边缘密度(观测到的 YY 的分布)
  • ylogp(y)\nabla_{y} \log p(y)YY 的对数密度关于 yy 的梯度 a.k.a. score function
Proof

p(xy)=p(yx)p(x)p(y)p(x|y)=\dfrac{p(y|x)p(x)}{p(y)}
由于 YX=xN(x,σ2I)Y|X=x \sim \mathcal{N}(x, \sigma^2I)

p(yx)=1(2πσ2)d/2exp(12σ2)p(y|x)=\dfrac{1}{(2\pi \sigma^2)^{d/2}} \exp \left(-\frac{1}{2\sigma^{2}}\right)

E[XY=y]=xp(xy)dx=1p(y)xp(yx)p(x)dx\begin{aligned} \mathbb{E}[X|Y=y]&=\int x p(x|y)\,\mathrm{d}x\\ &=\dfrac{1}{p(y)}\int x p(y|x) p(x) \,\mathrm{d}x \end{aligned}

yp(y)=yp(yx)p(x)dx\nabla_{y} p(y)=\int \nabla_{y} p(y|x) p(x) \,\mathrm{d}x

高斯分布梯度公式

yp(yx)=1σ2(yx)p(yx)\nabla_{y} p(y|x)=-\dfrac{1}{\sigma^{2}} (y-x) p(y|x)

代入,得

yp(y)=(1σ2(yx)p(yx))p(x)dx=1σ2(yp(yx)p(x)dxxp(yx)p(x)dx)=1σ2(yp(y)xp(yx)p(x)dx)\begin{aligned} \nabla_{y} p(y)&=\int(-\dfrac{1}{\sigma^{2}}(y-x)p(y|x))p(x)\,\mathrm{d}x\\ &=-\dfrac{1}{\sigma^{2}}\left( y\int p(y|x)p(x)\,\mathrm{d}x-\int xp(y|x)p(x)\,\mathrm{d}x \right) \\ &=-\dfrac{1}{\sigma^{2}}\left( yp(y)-\int xp(y|x)p(x)\,\mathrm{d}x \right) \end{aligned}

整理,得

yp(y)+σ2yp(y)=xp(yx)p(x)dxyp(y)+\sigma^{2}\nabla_{y}p(y)=\int xp(y|x)p(x)\,\mathrm{d}x

对于 E[XY=y]\mathbb{E}[X|Y=y],有

E[XY=y]=xp(yx)p(x)p(y)dx=1p(y)xp(yx)p(x)dx=y+σ2ylogp(y)\mathbb{E}[X|Y=y]=\int x\dfrac{p(y|x)p(x)}{p(y)}\,\mathrm{d}x=\dfrac{1}{p(y)}\int xp(y|x)p(x)\,\mathrm{d}x=y+\sigma^{2}\nabla_{y}\log p(y)

Gaussian Diffusion

xt+1:=xt+ηt,ηtN(0,σ2)x_{t+1}:=x_{t}+\eta_{t}, \eta_{t} \sim \mathcal{N}(0, \sigma^2)

learn to reverse each intermediate step.

DDPM

At time tt, given input zz, (sample from ptp_t), output a sample from conditional distribution p(xt1xt=z)p(x_{t-1}|x_{t}=z)

Learn the mean of p(xt1xt)p(x_{t-1}|x_{t}) is much simpler.

μt1(z):=E[xt1xt=z]\mu_{t-1} (z):= \mathbb{E}[x_{t-1}|x_t=z]\\

    μt1=argminf:RdRdExt,xt1f(xt)xt12=argminfExt1,ηf(xt1+η)xt12\begin{aligned} \implies \mu_{t-1}=&\mathop{\,\operatorname{argmin}\,}_{f:\mathbb{R}^d \to \mathbb{R}^d} \mathbb{E}_{x_{t}, x_{t-1}} ||f(x_{t}) - x_{t-1}||^2\\ =&\mathop{\,\operatorname{argmin}\,}_{f} \mathbb{E}_{x_{t-1}, \eta} || f(x_{t-1} + \eta) - x_{t-1}||^2\\ \end{aligned}

Then the estimate of E[xt1xt]\mathbb{E}[x_{t-1}|x_{t}] can be done by standard regression loss.

Reverse Sampler

A reverse sample for step tt is a function FtF_{t} such that if xtptx_{t}\sim p_{t} then the marginal distribution of Ft(xt)F_{t}(x_{t}) is pt1p_{t-1}

{Ft(z):zpt}pt1\{F_{t}(z):z \sim p_{t}\} \equiv p_{t-1}

The {F:xD}\{ F: x \sim D \} notation means implying a function on a variable xx which follows distribution DD, thus creating a new distribution.

Variance scaling:

p(x,kΔt)=pk(x), where Δt=1Tp(x, k \Delta t)=p_{k}(x)\text{, where }\Delta t=\frac{1}{T}, TT is discretization steps.

If xk=xk1+N(0,σ2)x_{k}=x_{k-1} + \mathcal{N}(0,\sigma^2), then xTN(x0,Tσ2)x_{T} \sim \mathcal{N}(x_{0}, T \sigma^2). So we scale variance by σ=σqΔt\sigma=\sigma_{q} \sqrt{ \Delta t }, σq\sigma_{q} is desired terminal variance.

Notations:
In below, tt will represent a continuous-value in the interval [0,1][0,1], subscripts will indicate time rather than index.

DDPM: Stochastic Sampling

DDPM stands for Denoising Diffusion Probabilistic Models

Stochastic Reverse Sampler

For input xtx_{t} and timestep tt, output x^tΔtμtΔt(xt)+N(0,σq2Δt)\hat{x}_{t-\Delta t} \leftarrow \mu_{t-\Delta t}(x_{t}) + \mathcal{N}(0, \sigma_{q}^2 \Delta t)

Claim

μz , s.t. p(xtΔtxt=z)N(xtΔt;μz,σq2Δt)\exists \mu_{z}\text{ , s.t. }p(x_{t-\Delta t}|x_{t}=z) \approx \mathcal N (x_{t-\Delta t}; \mu_{z}, \sigma_{q}^2 \Delta t)

If constant μz\mu_{z} depends only on zz, we can take

μz:=ExtΔt,xt[xtΔtxt=z]=z+(σq2Δt)logpt(z)\begin{aligned} \mu_{z} &:= \mathbb{E}_{x_{t-\Delta t}, x_{t}}[x_{t - \Delta t} | x_{t}=z]\\ &=z+(\sigma_{q}^2 \Delta t)\nabla \log p_{t}(z) \end{aligned}

Proof

The Bayes rule:

p(xtΔtxt)=p(xtxtΔt)ptΔt(xtΔt)pt(xt)p(x_{t-\Delta t}|x_{t})=\dfrac{p(x_{t}|x_{t-\Delta t})p_{t-\Delta t}(x_{t-\Delta t})}{p_{t}(x_{t})}

Take log on both side:

logp(xtΔtxt)=logp(xtxtΔt)+logptΔt(xtΔt)logpt(xt)Drop constants not involve xtΔt=logp(xtxtΔt)+logpt(xtΔt)+O(Δt)Because ptΔt=pt+Δttpt=12σq2ΔtxtΔtxt2+logpt(xtΔt)Substitute N(xt;xtΔt,σq2Δt)=+logpt(xt)+xlogpt(xt),(xtΔtxt)+O(Δt)Taylor expand,  is inner product=12σq2ΔtxtΔtxtσq2Δtxlogpt(xt)2+C=12σq2ΔtxtΔtxt2\begin{aligned} &\log p(x_{t-\Delta t}|x_{t})\\ =&\log p(x_{t}|x_{t-\Delta t})+\log p_{t-\Delta t}(x_{t-\Delta t})\cancel{-\log p_{t}(x_{t})} \quad\quad&\text{Drop constants not involve }x_{t-\Delta t}\\ =&\log p(x_{t}|x_{t-\Delta t})+\log p_{t}(x_{t-\Delta t})+\mathcal{O}(\Delta t) &\text{Because } p_{t - \Delta t}=p_{t}+\Delta t \frac{ \partial }{ \partial t } p_{t}\\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-x_{t} \rVert ^{2}+\log p_{t}(x_{t-\Delta t})&\text{Substitute } \mathcal{N}(x_{t};\,x_{t-\Delta t},\sigma_{q}^{2}\Delta t)\\ =&-\cdots+\cancel{\log p_{t}(x_{t})}+\langle \nabla_{x}\log p_{t}(x_{t}),(x_{t-\Delta t}-x_{t}) \rangle +\mathcal{O}(\Delta t)&\text{Taylor expand, }\langle \rangle\text{ is inner product} \\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-x_{t}-\sigma_{q}^{2}\Delta t \nabla_{x} \log p_{t}(x_{t}) \rVert ^{2}+C\\ =&-\dfrac{1}{2\sigma_{q}^{2}\Delta t}\lVert x_{t-\Delta t}-x_{t} \rVert ^{2} \end{aligned}

It is the log density of N(xtΔt;μ,σq2Δt)\mathcal{N}(x_{t-\Delta t};\mu,\sigma_q^{2}\Delta t)

The train loss of DDPM:

1
2
3
4
5
6
def train_loss(f_theta, p):
x0 = p.sample()
t = uniform(0, 1).sample()
x = x0 + normal(0, sigma_q**2 * t).sample()
x_ = x + normal(0, sigma_q**2 * dt).sample()
return (f_theta(x_, t + dt) - x)**2

Sampling:

1
2
3
4
5
6
def DDPM(f_theta):
x = normal(0, sigma_q**2).sample()
for t in reversed(range(0, 1, dt)):
eta = normal(0, sigma_q**2 * dt).sample()
x = f_theta(x, t) + eta
return x
1
2
3
4
5
6
def DDIM(f_theta):
x = normal(0, sigma_q**2).sample()
for t in reversed(range(0, 1, dt)):
weight = (t**0.5) / ((t-dt)**0.5 + t**0.5)
x = x + weight * (f_theta(x, t) - x)
return x
Claim

For Gaussian diffusion setting, we have

E[(xtΔtxt)xt]=ΔttE[x0xt]+(1Δtt)xt\mathbb{E}[(x_{t-\Delta t}-x_{t})|x_{t}]=\dfrac{\Delta t}{t}\mathbb{E}[x_{0}|x_{t}]+\left( 1- \dfrac{\Delta t}{t} \right)x_{t}

DDIM: Deterministic Sampling

Deterministic Reverse Sampler

Given joint distribution (x0,xΔt,,x1)(x_{0},x_{\Delta t},\dots,x_{1}) and conditional expectation μt(z):=E[xtxtΔt=z]\mu_{t}(z):=\mathbb{E}[x_{t}|x_{t-\Delta t}=z]
The DDIM sample receives input sample xtx_{t} and step index tt, outputs

x^tΔtxt+λ(μtΔt(xt)xt),λ:=(σtσtΔt+σt)σtσqt\hat x_{t-\Delta t}\leftarrow x_{t}+\lambda(\mu_{t-\Delta t}(x_{t})-x_{t}), \quad \lambda:=\left( \dfrac{\sigma_{t}}{\sigma_{t-\Delta t}+\sigma_{t}} \right),\sigma_{t}\equiv \sigma_q\sqrt{ t }

Flow Matching

flow

A flow is a collection of time-indexed vector fields v={vt}t[0,1]v=\{ v_{t} \}_{t \in[0,1]}vtv_{t}: velocity-field of a gas at each time tt.

For flow vv and initial point x1x_{1}, there has dxtdt=vt(xt)\dfrac{\,\mathrm{d}x_{t}}{\,\mathrm{d}t}=-v_{t}(x_{t})

The Goal of Flow Matching

Learn a flow vv^* transports qq to pp, where pp is the target distribution, qq is some easy-to-sample base distribution (ie. Gaussian)
The DDIM algorithm is a special case of this.

Pointwise Flow

A pointwise flow v[x1,x0]v^{[x_{1},x_{0}]} is a flow {vt}t\{ v_{t} \}_{t} that satisfies dxtdt=vt(xt)\dfrac{\,\mathrm{d}x_{t}}{\,\mathrm{d}t}=-v_{t}(x_{t}), with boundary conditions x1x_{1} and x0x_{0}

Marginal Flow

weighted average of all individual partical velocities vt[x1,x0]v_{t}^{[x_{1},x_{0}]}

Ex0,x1xt[vt[x1,x0](xt)xt]\mathbb{E}_{x_{0},x_{1}|x_{t}} [v_{t}^{[x_{1},x_{0}]}(x_{t})|x_{t}]

The (x1,x0,xt)(x_{1},x_{0},x_{t}) is induced by sampling (x1,x0)Πq,p(x_{1},x_{0}) \sim \Pi_{q,p}, xtRunFlow(v[x1,x0],x1,t)x_{t}\leftarrow\text{RunFlow}(v^{[x_{1},x_{0}]},x_{1},t)

Flow Matching:

vt(xt):=Ex0,x1xt[vt[x1,x0](xt)xt]    vt=argminf:RdRdE(x0,x1,xt)f(xt)vt[x1,x0](xt)2\begin{aligned} &v_{t}^*(x_{t}):= \mathbb{E}_{x_{0},x_{1}|x_{t}} [v_{t}^{[x_{1},x_{0}]} (x_{t})|x_{t}]\\ \implies &v_{t}^*=\mathop{\,\operatorname{argmin}\,}_{f:\mathbb{R}^d\to \mathbb{R}^{d}} \mathbb{E}_{(x_{0},x_{1},x_{t})} \lVert f(x_{t})-v_{t}^{[x_{1},x_{0}]}(x_{t}) \rVert ^{2} \end{aligned}

Train loss of Flow-matching:

1
2
3
4
5
def train_loss(f_theta, q_p_dist, pointwise_flow):
x1, x0 = q_p_dist.sample()
t = uniform(0, 1).sample()
xt = run_flow(pointwise_flow(x1, x0), x1, t)
return (f_theta(xt, t) - vt(x1,x0)(xt)) ** 2
1
2
3
4
5
6
def sample(f_theta, base_dist, step_size):
x1 = base_dist.sample()
x0 = x1
for i in reversed(range(0, 1, step_size)):
x0 = x0 + f_theta(x0, t) * step_size
return x0

VAE

VAE: Variational Auto-Encoder

Latent Variables

Latent Variables zz are variables that we do not observe and hense are not part of training dataset.

Encoder: convert from input xx to latent variables zz.

Decoder: convert from zz to generated vector x^\hat{x}

Variational: 变分,关于在函数上的优化

VAE: search for the optimal probability distributions to describe xx and zz.

Key Distributions

  • p(x)p(x): The true distribution of xx. THE ULTIMATE GOAL of diffusion is to draw a sample from p(x)p(x).
  • p(z)p(z): The distribution of latent variable. Typically it is made to be N(0,I)\mathcal{N}(0,I) Any distribution can be generated by mapping a Gaussian through a sufficiently complicated function.
  • p(zx)p(z|x): The conditional distribution associated with the encoder, the likelihood of zz when given xx.
  • p(xz)p(x|z): decoder, posterior probability of getting xx given zz.
  • qΦ(zx)q_{\Phi}(z|x): The proxy for p(zx)p(z|x) that can be parameterized using deep neural networks. eg.

    (μ,σ2)=EncoderNetworkΦ(x),qΦ(zx)=N(zμ,diag(σ2))(\mu,\sigma^{2})=\text{EncoderNetwork}_{\Phi}(x), q_{\Phi}(z|x) =\mathcal{N}(z|\mu,\text{diag}(\sigma^{2}))

  • pθ(xz)p_{\theta}(x|z): The proxy for p(xz)p(x|z)

![[Pasted image 20250503161111.png]]

ELBO

ELBO: Evidence Lower Bound

ELBO(x)=defEqϕ(zx)[logp(x,z)qϕ(zx)]\text{ELBO}(x)\stackrel{\text{def}}{=} \mathbb{E}_{q_{\phi}}(z|x)\left[ \log \dfrac{p(x,z)}{q_{\phi}(z|x)} \right]

KL-divergence

DKL(PQ)=ExP[lnp(x)q(x)]\mathbb{D}_{\text{KL}}(P\|Q)=\mathop{\mathbb{E}}_{x \sim P}\left[ \ln \dfrac{p(x)}{q(x)} \right]

Decomposition of Log-Likelihood

logp(x)=Eqϕ(zx)[logp(x,z)qϕ(zx)]+DKL(qϕ(zx)p(zx))\log p(x)=\mathbb{E}_{q_{\phi}}(z|x)\left[ \log \dfrac{p(x,z)}{q_{\phi}(z|x)} \right]+\mathbb{D}_{\text{KL}}(q_{\phi}(z|x)\|p(z|x))

Proof

logp(x)=logp(x)×qϕ(zx)dz1=logp(x)×qϕ(zx)dz=Eqϕ(zx)[logp(x)]=Eqϕ(zx)[logp(x,z)p(zx)]Bayes Theorem=Eqϕ(zx)[logp(x,z)p(zx)qϕ(zx)qϕ(zx)]=Eqϕ(zx)[logp(x,z)qϕ(zx)]ELBO+Eqϕ(zx)[logqϕ(zx)p(zx)]DKL(qϕ(zx)p(zx))\begin{aligned} \log p(x)&=\log p(x)\times \underbrace{ \int q_{\phi} (z|x)\,\mathrm{d}z }_{ 1 }\\ &=\int \log p(x) \times q_{\phi}(z|x) \,\mathrm{d}z\\ &=\mathbb{E}_{q_\phi(z|x)}[\log p(x)]\\ &=\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{p(z|x)} \right] \quad &\text{Bayes Theorem}\\ &=\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{p(z|x)} \cdot \dfrac{q_\phi(z|x)}{q_\phi(z|x)} \right]\\ &=\underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{q_\phi(z|x)} \right] }_{ \text{ELBO} }+\underbrace{ \mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{q_\phi(z|x)}{p(z|x)} \right] }_{ \mathbb{D}_{\text{KL}} (q_\phi(z|x)\|p(z|x)) } \quad& \end{aligned}

So the ELBO is a lower bound of logp(x)\log p(x), maximize ELBO can achieve the goal of maximize logp(x)\log p(x).

When the KL-divergence is zero, qϕ(zx)=p(zx)q_\phi(z|x)=p(z|x), since p(zx)p(z|x) is delta function, we have

qϕ(zx)=N(zxμσ,0)=δ(zxμσ)q_\phi(z|x)=\mathcal{N}\left( z \left| \frac{x-\mu}{\sigma},0 \right.\right)=\delta\left( z-\dfrac{x-\mu}{\sigma} \right)

ELBO is still not useful, for it involves p(x,z)p(x,z) that we do not have access.

Theorem

ELBO(x)=Eqϕ(zx)[logpθ(xz)]how good your decoder isDKL(qϕ(zx)p(z))how good your encoder is\text{ELBO}(x)=\underbrace{ \mathbb{E}_{q_\phi(z|x)}[\log p_{\theta}(x|z)] }_{ \text{how good your decoder is} }\quad\underbrace{-\quad \mathbb{D}_{\text{KL}}(q_\phi(z|x)\|p(z)) }_{ \text{how good your encoder is} }

pθ(xz),qϕ(zx),p(z)p_{\theta}(x|z),q_\phi(z|x),p(z) are both Gaussian

Proof

ELBO(x)=defEqϕ(zx)[logp(x,z)qϕ(zx)]=Eqϕ(zx)[logp(xz)p(z)qϕ(zx)]=Eqϕ(zx)[logpθ(xz)]+Eqϕ(zx)[logp(z)qϕ(zx)] \begin{aligned} \text{ELBO}(x)&\stackrel{\text{def}}{=}\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x,z)}{q_\phi(z|x)} \right]\\ &=\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(x|z)p(z)}{q_\phi(z|x)} \right]\\ &=\mathbb{E}_{q_\phi(z|x)}[\log p_{\theta}(x|z)]+\mathbb{E}_{q_\phi(z|x)}\left[ \log \dfrac{p(z)}{q_\phi(z|x)} \right] \end{aligned}

Note that we replaced p(xz)p(x|z) by pθ(xz)p_{\theta}(x|z) since the latter is accessible.

The meaning of each term:

  • Reconstruction:
    Eqϕ(zx)[logpθ(xz)]\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]. It is similar to maximum likelihood where we want to find the model parameter to maximize the likelihood. The expectation is taken w.r.t. samples z\mathbf{z} that is sampled from qϕ(zx)q_\phi(\mathbf{z}|\mathbf{x})
  • Prior Matching:
    KL divergence for encoder. Let encoder to turn xx to a latent vector zz such that the zz vector follows the choice of latent distribution p(z)p(\mathbf{z}).
Example

xp(x)=N(xμ,σ2I)\mathbf{x} \sim p(\mathbf{x})=\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\sigma^{2}\mathbf{I})
zp(z)=N(z0,I)\mathbf{z}\sim p(\mathbf{z})=\mathcal{N}(\mathbf{z}|0,\mathbf{I})
So that zz can be trivial solution xuσ\dfrac{x-u}{\sigma}, and x^=μ+σz\hat{x}=\mu+\sigma z.
p(xz)=δ(x(σz+μ))p(\mathbf{x}|\mathbf{z})=\delta(\mathbf{x}-(\sigma \mathbf{z}+\boldsymbol{\mu}))
p(zx)=δ(zxμσ)p(\mathbf{z}|\mathbf{x})=\delta\left( \mathbf{z}-\dfrac{\mathbf{x}-\boldsymbol{\mu}}{\sigma} \right)

Suppose we don’t know p(x)p(\mathbf{x}) so we need to estimate zz and xx.

(μ^(x),σ^(x)2)=Encoderϕ(x)qϕ(zx)=N(zax+b,t2I)\begin{aligned}(\hat{\boldsymbol{\mu}}(\mathbf{x}),\hat{\sigma}(\mathbf{x})^{2})&=\text{Encoder}_{\phi}(\mathbf{x})\\q_\phi(z|x)&=\mathcal{N}(\mathbf{z}|a\mathbf{x}+\mathbf{b},t^{2}\mathbf{I})\end{aligned}

Assume μ^\hat{\boldsymbol{\mu}} is an affine function of xx.

qϕ(zx)=N(zax+b,t2I)q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}(\mathbf{z}|a\mathbf{x}+\mathbf{b},t^{2}\mathbf{I})

For decoder, we have

pθ(xz)=N(zcx+v,s2I)p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{z}|c\mathbf{x}+\mathbf{v},s^{2}\mathbf{I})

For KL-divergence DKL(qϕ(zx)p(xz))\mathbb{D}_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{x}|\mathbf{z})) to be zero, we have

qϕ(zx)=N(zxμσ,0)=δ(zxμσ)q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \mathbf{z}| \dfrac{x-\mu}{\sigma},0 \right)=\delta\left( \mathbf{z}-\dfrac{x-\mu}{\sigma} \right)

Substitue to Eqϕ(zx)[logpθ(xz)]\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]:

Eqϕ(zx)[logpθ(xz)]=Eqϕ(zx)[logN(xcz+v,s2I)]=12log2πlogsc22s2[xμσxvc2]12log2πlogs\begin{aligned} \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})]&=\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log \mathcal{N}(\mathbf{x}|c\mathbf{z}+\mathbf{v},s^{2}\mathbf{I})]\\ &=-\dfrac{1}{2}\log 2\pi-\log s-\dfrac{c^{2}}{2s^{2}}\left[ \left\lVert \dfrac{x-\mu}{\sigma} -\dfrac{x-v}{c} \right\rVert ^{2} \right]\\ &\leq-\dfrac{1}{2} \log 2\pi-\log s \end{aligned}

When v=μ,c=σ\mathbf{v}=\boldsymbol{\mu},c=\sigma,the equal holds. when s=0s=0, this term reach its maximum. This implies that pθ(xz)=N(xσz+μ,0)=δ(x(σz+μ))p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}|\sigma \mathbf{z}+\boldsymbol{\mu},0)=\delta(\mathbf{x}-(\sigma \mathbf{z}+\boldsymbol{\mu}))

The ELBO have limitations when qϕ(zx)q_\phi(\mathbf{z}|\mathbf{x}) may not equals to p(zx)p(\mathbf{z}|\mathbf{x}), thus ELBO not same to logp(x)\log p(\mathbf{x})

Example

If we don’t know p(zx)p(\mathbf{z}|\mathbf{x}), we need to train VAE by maxing ELBO.

qϕ(zx)=N(z(xμ)σ,t2I)pθ(xz)=N(xσz+μ,s2I)\begin{aligned} &q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \dfrac{\mathbf{z}|(x-\mu)}{\sigma},t^{2}\mathbf{I} \right)\\ &p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}|\sigma \mathbf{z}+\boldsymbol{\mu},s^{2}\mathbf{I}) \end{aligned}

After maximizing DKL(qϕ(zx)p(z))\mathbb{D}_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x})\|p(\mathbf{z})), and Eqϕ(zx)[logpθ(xz)]\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})],we have

qϕ(zx)=N(zxμσ,I)pθ(xz)=N(xσz+μ,σ2I)\begin{aligned} q_\phi(\mathbf{z}|\mathbf{x})=\mathcal{N}\left( \mathbf{z}| \dfrac{\mathbf{x}-\mu}{\sigma},\mathbf{I} \right)\\ p_\theta(\mathbf{x}|\mathbf{z})=\mathcal{N}(\mathbf{x}| \sigma \mathbf{z}+\boldsymbol{\mu},\sigma^{2}\mathbf{I}) \end{aligned}

Compare to former result, the result here contains variance I\mathbf{I} and σ2I\sigma^{2}\mathbf{I}, adds randomness to samples.
Thus we know that maximizing ELBO is not just maximizing logp(x)\log p(x).

Optimizing VAE

Since Monte-Carlo can not sample the gradient of distribution EzPϕ(z)[f(z)]E_{z\sim P_{\phi}(z)}[f(z)] itself (i.e. ϕ{f(z)Pϕ(z)}dzϕ{f(z)}Pϕ(z)dz=1Ni=1Nϕf(zi)\int \nabla_{\phi}\{f(z)P_{\phi}(z)\}\,\mathrm{d}z\neq \int \nabla_{\phi} \{ f(z) \}P_{\phi}(z)\,\mathrm{d}z=\dfrac{1}{N}\sum_{i=1}^N \nabla_{\phi}f(z_{i}))

We need to introduce the reparameterization trick: express zz as some differentiable transformation of another random variable ε\varepsilon which is independent to parameter ϕ\phi.

In the ELBO context, we define a function gg s.t. z=g(ε,ϕ,x)\mathbf{z}=\mathbf{g}(\boldsymbol{\varepsilon},\boldsymbol{\phi},\mathbf{x}) for random var εp(ε)\varepsilon\sim p(\boldsymbol{\varepsilon}) and qϕ(zx)det(zε)=p(ε)q_\phi(\mathbf{z}|\mathbf{x}) \cdot \left\lvert \det\left( \frac{ \partial \mathbf{z} }{ \partial \boldsymbol{\varepsilon} } \right) \right\rvert=p(\boldsymbol{\varepsilon})

Eqϕ(zx)(f(z))=f(z)qϕ(zx)dz=f(g(ε))qϕ(g(ε)x)dg(x)=f(g(ε))qϕ(g(ε)x)det(g(ε)ε)dε(changing variable)=f(z)p(ε)dε=Ep(ε)[f(z)]\begin{aligned} \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}(f(\mathbf{z})) &=\int f(z)\cdot q_\phi(\mathbf{z}|\mathbf{x})\,\mathrm{d}\mathbf{z}\\ &=\int f(g(\boldsymbol{\varepsilon}))\cdot q_\phi(\mathbf{g}(\boldsymbol{\varepsilon})|\mathbf{x}) \, \mathrm{d}\mathbf{g}(\mathbf{x}) \\ &=\int f(g(\varepsilon))\cdot q_\phi(\mathbf{g}(\mathbf{\varepsilon})|\mathbf{x})\cdot \left\lvert \det\left( \frac{ \partial \mathbf{g}(\boldsymbol{\varepsilon}) }{ \partial \boldsymbol{\varepsilon} } \right) \right\rvert \, \mathrm{d}\boldsymbol{\varepsilon}\quad\text{(changing variable)}\\ &=\int f(\mathbf{z})\cdot p(\boldsymbol{\varepsilon}) \, \mathrm{d}\boldsymbol{\varepsilon} \\ &=\mathbb{E}_{p(\mathbf{\varepsilon})}[f(\mathbf{z})] \end{aligned}

ϕEqϕ(zx)[f(z)]=Ep(ε)[ϕf(z)]\nabla_{\phi}\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})]=E_{p(\boldsymbol{\varepsilon})} [\nabla_{\phi}f(\mathbf{z})]

Recall the ELBO formula, we can substitute f(z)=logqϕ(zx)f(\mathbf{z})=-\log q_{\phi}(\mathbf{z}|\mathbf{x})

ϕEqϕ(zx)[logqϕ(zx)]=1Ll=1Lϕ[logdetz(l)ε(l)]\nabla_{\phi}\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[-\log q_\phi(\mathbf{z}|\mathbf{x})]=\dfrac{1}{L}\sum_{l=1}^L \nabla_{\phi}\left[ \log \left\lvert \det \frac{ \partial \mathbf{z}^{(l)}}{ \partial \boldsymbol{\varepsilon}^{(l)} }\right\rvert \right]

Integration by Substitution

Df(z)dz=g1(D)f(g(ε))det(zε)dε\int _{D}f(\mathbf{z}) \, \mathrm{d}\mathbf{z}=\int_{\mathbf{g}^{-1}(D)}f(\mathbf{g}(\varepsilon)) \cdot \left\lvert \det\left( \frac{ \partial \mathbf{z} }{ \partial \boldsymbol{\varepsilon} } \right) \right\rvert \mathrm{d}\boldsymbol{\varepsilon}

Reference

Step-by-Step Diffusion: An Elementary Tutorial
Tutorial on Diffusion Models for Imaging and Vision