Variational Autoencoder: Theory

Dive into the mathematics behind variational autoencoders.

In simple terms, a variational autoencoder is a probabilistic version of autoencoders.

Why?

Because we want to be able to sample from the latent vector (z) space to generate new data, which is not possible with vanilla autoencoders.

Each latent variable z that is generated from the input will now represent a probability distribution (or what we call the posterior distribution, denoted as p(z|x)).

All we need to do is find the posterior p(z|x), or, in other words, solve the inference problem.

In fact, the encoder will try to approximate the posterior by computing another distribution q(z|x), known as the variational posterior.

Note that a probability distribution is fully characterized by its parameters. In the case of the Gaussian, these are the mean μ and the standard deviation σ.

So it is enough to pass the parameters of the normal distribution N(μ, σ), namely the mean μ and the standard deviation σ, to the decoder, instead of simply passing the latent vector z as in the simple autoencoder.

Then, the decoder will receive the distribution parameters, sample a latent vector z from N(μ, σ), and try to reconstruct the input x. However, as stated, this cannot be trained: you cannot compute gradients through a stochastic operation. In other words, you cannot backpropagate through the sampling step. This is exactly the central difficulty in training variational autoencoders.
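To make the setup concrete, here is a minimal sketch of such an encoder, assuming flattened 784-dimensional inputs and a 20-dimensional latent space (the layer sizes and names are illustrative, not taken from the lesson). The last lines show the problem: sampling blocks the gradients.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps an input x to the parameters (mu, log_var) of the variational posterior q(z|x)."""
    def __init__(self, input_dim=784, hidden_dim=256, latent_dim=20):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mean of q(z|x)
        self.log_var = nn.Linear(hidden_dim, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = torch.relu(self.hidden(x))
        return self.mu(h), self.log_var(h)

encoder = Encoder()
x = torch.randn(32, 784)              # a dummy batch of flattened inputs
mu, log_var = encoder(x)
sigma = torch.exp(0.5 * log_var)      # standard deviation from the log-variance

# Naive sampling: .sample() is a stochastic operation, so no gradients flow
# back to mu and sigma, and the encoder cannot be trained through z.
z = torch.distributions.Normal(mu, sigma).sample()
print(z.requires_grad)  # False
```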

Let’s see how we can make it possible. (Hint: Check the reparameterization trick section below.)

Train a variational autoencoder

First things first.

Since our goal is for the variational posterior q(z|x) to be as close as possible to the true posterior p(z|x), the following objective is used to train the model.

L_{\theta,\phi}(x) = E_{q_{\phi}(z|x)} [ \log p_{\theta}(x|z) ] - KL(q_{\phi}(z|x) \parallel p_{\theta}(z))

In the literature, this objective is known as the evidence lower bound (ELBO); it is maximized during training (equivalently, its negative is minimized as the loss), and it can be derived with some fairly involved math.

  • The first term E_{q_{\phi}(z|x)} [ \log p_{\theta}(x|z) ] controls how well the VAE reconstructs a data point x from a sample z of the variational posterior, and it is known as the negative reconstruction error.
  • The second term controls how close the variational posterior q_{\phi}(z|x) is to the prior p_{\theta}(z).

E denotes the expected value or expectation. The expectation of a random variable X is a weighted average of the values X can take and can be thought of as the arithmetic mean of a large number of independent samples of X.
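As a quick illustration (the distribution and numbers here are arbitrary), the mean of many samples approximates the expectation:

```python
import torch

# The expectation of N(2.0, 0.5) is 2.0; the mean of many independent
# samples approximates it (law of large numbers).
samples = torch.distributions.Normal(2.0, 0.5).sample((100_000,))
print(samples.mean())  # tensor close to 2.0
```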

KL refers to the Kullback–Leibler divergence and, in simple terms, is a measure of how different a probability distribution is from a second one: KL(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \log\left(\frac{p(x)}{q(x)}\right) dx
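To connect these definitions to the ELBO above, here is one possible sketch of a single-sample Monte Carlo estimate of the ELBO using torch.distributions. The function name, the decoder argument, and the assumption of Bernoulli-distributed data are illustrative assumptions, not part of the lesson:

```python
import torch
from torch.distributions import Bernoulli, Normal, kl_divergence

def elbo_estimate(x, mu, log_var, decoder):
    """Single-sample Monte Carlo estimate of the ELBO for a batch x.

    mu and log_var parameterize q(z|x); decoder is assumed to map z to
    Bernoulli probabilities over the entries of x.
    """
    q = Normal(mu, torch.exp(0.5 * log_var))               # variational posterior q(z|x)
    p = Normal(torch.zeros_like(mu), torch.ones_like(mu))  # prior p(z) = N(0, I)

    z = q.sample()            # one sample stands in for the expectation E_q[...]
    x_probs = decoder(z)      # Bernoulli parameters for the reconstruction

    log_px_z = Bernoulli(probs=x_probs).log_prob(x).sum(dim=1)  # log p(x|z) per example
    kl = kl_divergence(q, p).sum(dim=1)                         # KL(q(z|x) || p(z)) per example

    return (log_px_z - kl).mean()   # ELBO averaged over the batch
```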

In practice, we use closed-form analytical expressions to compute the ELBO:

The reconstruction term can be shown to be \log p_{\theta}(x_i | z_i) = \sum_{j=1}^{n} [ x_{ij} \log p_{ij} + (1-x_{ij}) \log (1-p_{ij}) ] when the data points are binary (i.e., follow a Bernoulli distribution). Up to its sign, this is the binary cross entropy, and it can be implemented using torch.nn.BCELoss(reduction='sum') in PyTorch.
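For instance, a minimal sketch of this term with the loss mentioned above; the tensor names and shapes are assumptions:

```python
import torch
import torch.nn as nn

reconstruction_loss = nn.BCELoss(reduction='sum')

x = torch.rand(32, 784)        # targets in [0, 1], e.g. flattened MNIST pixels
x_hat = torch.rand(32, 784)    # decoder outputs p_ij (probabilities, e.g. after a sigmoid)

# BCELoss computes -sum[x log p + (1 - x) log(1 - p)], i.e. the negative
# of the log-likelihood term above, summed over pixels and the batch.
recon = reconstruction_loss(x_hat, x)
```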

The KL divergence also has a closed form if we assume that the prior is a standard Gaussian N(0, I) and the variational posterior is Gaussian. It can be written as -\frac{1}{2} \sum_{j=1}^{J} (1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2), where μ_j is the mean and σ_j² is the variance of the j-th latent dimension.
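A minimal sketch of this closed form, assuming mu and log_var are the encoder outputs for a batch:

```python
import torch

def kl_closed_form(mu, log_var):
    """KL(q(z|x) || N(0, I)) summed over the latent dimensions and the batch."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
```

Adding this term to the reconstruction loss above gives the negative ELBO that is minimized during training.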

Given that, can you try and implement ELBO from scratch?
