Variational AutoEncoder
- This is mostly based on "Understanding Variational Autoencoder" and "Variational Autoencoder - Model, ELBO, loss function and maths explained easily".
AutoEncoders
- They are used for:
- converting a higher-dimensional input into a lower-dimensional code (a minimal sketch follows below)
- Representation Learning
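For concreteness, here is a minimal sketch of a plain autoencoder. PyTorch, the 784-dimensional input and the layer sizes are illustrative assumptions, not details from the referenced articles.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Plain autoencoder: compress the input to a low-dimensional code, then reconstruct it."""

    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: higher dimension -> lower dimension (the learned representation)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: lower dimension -> back to the original dimension
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)       # encoded point (a single vector, not a distribution)
        x_hat = self.decoder(z)   # reconstruction
        return x_hat, z

# Training minimizes only a reconstruction loss, e.g. nn.MSELoss()(x_hat, x).
```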
Cons of AutoEncoders
- The encoded values can be used as representations, but deep down they don't carry much meaningful information.
- There is no semantic relationship between encoded data points: a zebra, a cat and a tomato will all get encoded, but similar concepts are not guaranteed to end up close to each other.
- Since everything is encoded to points, not distributions, we cannot sample any new data points.
What do we actually want?
- Instead of encoding to points, encode into some distribution so that we can generate new data points.
- But even if we try to encode to a distribution, the model will again try to memorise things, since it only has to minimize the reconstruction loss.
- We also want variation in the reconstructed images.
- So what we want is:
- There should be some latent distribution that is not just collapsed onto points; it should be wide enough to cover the latent space, so that we have higher variance and thus variation in the reconstructed images.
- The distributions should also not be far apart from each other, so that if we sample any latent it gives us something meaningful rather than random images.
How do we achieve the above requirements?
- A Gaussian distribution helps achieve the above objectives:
- It pushes the mean as close to zero as possible, thus keeping the distributions close to each other.
- It pushes towards unit variance, thus ensuring the distributions keep a higher variance and preventing the network from collapsing them to a point.
- So basically, we want our encoded distribution to look like a Gaussian distribution.
- We can use some metric that keeps the encoded distribution and the Gaussian distribution as similar as possible.
- That metric is the KL-divergence; the corresponding loss term is sketched below.
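The KL-divergence between the encoder's Gaussian N(mu, sigma^2) and the standard normal N(0, 1) has a closed form, and that is what is typically added to the reconstruction loss. A minimal sketch in PyTorch; the function name and the choice to parameterize the variance through log_var are illustrative assumptions.

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian.

    mu, log_var: tensors of shape (batch, latent_dim) produced by the encoder.
    KL = 0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) )
    """
    return 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var, dim=1)

# Example: encodings far from N(0, 1) are penalized more heavily.
mu = torch.tensor([[0.0, 0.0], [3.0, -3.0]])
log_var = torch.zeros(2, 2)                 # sigma = 1 in both rows
print(kl_to_standard_normal(mu, log_var))   # tensor([0., 9.])
```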
Before moving forward, some prerequisites
Probability vs Likelihood
- You can check out In Statistics, Probability is not Likelihood for more on this.
- Consider the weights of mice
Probability
- Let's say we have plotted all the weights, which gives us mean 32 and std 2.5. Now, the probability of having a mouse whose weight is between 32 and 34 is given by the area under the curve:
- P(32 ≤ weight ≤ 34 | mean=32, std=2.5)
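A quick numerical check with scipy, assuming the weights follow a normal distribution with mean 32 and std 2.5:

```python
from scipy.stats import norm

# Probability of a mouse weighing between 32 and 34, given mean=32, std=2.5:
# the area under the pdf between those two points.
p = norm.cdf(34, loc=32, scale=2.5) - norm.cdf(32, loc=32, scale=2.5)
print(p)  # ~0.288
```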
Likelihood
Considering we have fixed data/samples, what is the likelihood of the current mean and std? It is given by the height of the curve (the y-axis value) at the given weight.
Basically, in this case we have a continuous distribution with mean 32 and std 2.5, and we find the likelihood of observing a weight of 34.
L(mean=32, std=2.5 | weight=34)
While training, we try to maximize the likelihood for our data points so that our model can represent the data samples as closely as possible.
During inference, we check the probability of a given sample based on the params (mean and std) we learned while maximizing the likelihood.
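The corresponding likelihood is read off the height of the pdf rather than computed as an area. A quick check with scipy, reusing the N(32, 2.5^2) example:

```python
from scipy.stats import norm

# Likelihood of the parameters (mean=32, std=2.5) given an observed weight of 34:
# the height of the pdf at weight=34.
print(norm.pdf(34, loc=32, scale=2.5))  # ~0.116

# The same observation is less likely under a different mean; maximizing the
# likelihood during training picks the parameters that make the data most likely.
print(norm.pdf(34, loc=30, scale=2.5))  # ~0.044
```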
Some maths
The expectation \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx gives the mean, where x is the value and f(x) \, dx is the probability.
Simply put, it is the long-run average of the continuous random variable X.
Similarly, for a discrete random variable:
- \mathbb{E}[X] = \sum_{x} x \, p(x), where x is value and p(x) is probability.
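A small sanity check of both definitions, using numerical integration for the continuous case (the mouse-weight distribution N(32, 2.5^2) from above) and a plain weighted sum for the discrete case (a fair die, chosen purely for illustration):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Continuous: E[X] = integral of x * f(x) dx for X ~ N(32, 2.5^2) -> 32
# (integrating over +/- 10 standard deviations is numerically sufficient)
e_cont, _ = quad(lambda x: x * norm.pdf(x, loc=32, scale=2.5), 32 - 25, 32 + 25)
print(e_cont)  # ~32.0

# Discrete: E[X] = sum of x * p(x) for a fair six-sided die -> 3.5
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
print(np.sum(values * probs))  # 3.5
```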
Chain rule of probability: p(x, y) = p(x|y)p(y) = p(y|x)p(x)
Bayes’ Theorem
- P(x \mid y) = \frac{P(y \mid x) \, P(x)}{P(y)}
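A tiny numeric check of the chain rule and Bayes' theorem; the probabilities below are made up purely for illustration:

```python
# Illustrative probabilities for two events x and y.
p_x = 0.3            # P(x)
p_y_given_x = 0.8    # P(y | x)
p_y = 0.5            # P(y)

# Chain rule: P(x, y) = P(y | x) * P(x)
p_xy = p_y_given_x * p_x                  # 0.24

# Bayes' theorem: P(x | y) = P(y | x) * P(x) / P(y)
p_x_given_y = p_y_given_x * p_x / p_y     # 0.48
print(p_xy, p_x_given_y)
```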
Kullback–Leibler Divergence
- It measures how different two probability distributions are (strictly speaking it is not a distance, since it is not symmetric).
- D_{\text{KL}}(P \| Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx
- Properties:
- Not symmetric, i.e. D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)
- D_{\text{KL}}(P \| Q) \geq 0
- D_{\text{KL}}(P \| Q) = 0 if and only if P = Q
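These properties are easy to verify on a pair of discrete distributions with scipy (the two distributions below are arbitrary examples):

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.9, 0.05, 0.05])
q = np.array([1 / 3, 1 / 3, 1 / 3])

# scipy.stats.entropy(p, q) computes D_KL(P || Q) for discrete distributions.
print(entropy(p, q))  # ~0.70  (>= 0)
print(entropy(q, p))  # ~0.93  -> different value, so KL is not symmetric
print(entropy(p, p))  # 0.0    -> zero when the distributions are identical
```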
Let's define our model
We can define the likelihood of our data as the marginalization of the joint probability over the latent variable z:
- p(x) = \int p(x, z) dz
But this is intractable because we would need to evaluate the integral over every possible latent variable z (see the toy sketch at the end of this section). Alternatively, we can use the chain rule of probability:
- p(x) = \frac{p(x, z)}{p(z|x)}
We don't have the ground truth p(z|x), which is exactly what we are trying to find.
Intractable Problem: A problem that can be solved in theory (e.g. given large but finite resources, especially time), but for which in practice any solution takes too many resources to be useful, is known as an intractable problem.
- Like guessing your neighbour's Wi-Fi password: in theory you can generate all the possible combinations, but it would take years.
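To make the intractability of p(x) = \int p(x, z) \, dz concrete, here is a naive Monte Carlo estimate under a toy model. The prior p(z) = N(0, I) and the likelihood p(x|z) = N(x; sum(z), 1) are assumptions chosen only so the integral fits in a few lines; in a real VAE the decoder is a neural network, z is high-dimensional, and this kind of brute-force sampling needs far too many samples to be accurate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def naive_marginal_likelihood(x, latent_dim=2, n_samples=10_000):
    """Naive Monte Carlo estimate of p(x) = integral of p(x|z) p(z) dz."""
    z = rng.standard_normal((n_samples, latent_dim))          # z ~ p(z) = N(0, I)
    likelihoods = norm.pdf(x, loc=z.sum(axis=1), scale=1.0)   # toy p(x|z) per sample
    return likelihoods.mean()                                 # averages to ~p(x)

print(naive_marginal_likelihood(x=1.5))  # ~0.16 for this toy model
```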