Variational AutoEncoder
- This is mostly based on "Understanding Variational Autoencoder" and "Variational Autoencoder - Model, ELBO, loss function and maths explained easily".
AutoEncoders
- They are used for:
- converting a higher-dimensional input into a lower-dimensional code (a minimal sketch follows below)
- Representation Learning
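For concreteness, here is a minimal sketch of a plain autoencoder. PyTorch, the 784-dimensional input and the layer sizes are illustrative assumptions, not details from the referenced articles.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Plain autoencoder: compress the input to a low-dimensional code, then reconstruct it."""

    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: higher dimension -> lower dimension (the learned representation)
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: lower dimension -> back to the original dimension
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)       # encoded point (a single vector, not a distribution)
        x_hat = self.decoder(z)   # reconstruction
        return x_hat, z

# Training minimizes only a reconstruction loss, e.g. nn.MSELoss()(x_hat, x).
```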
Cons of AutoEncoders
- The encoded values can be used as representations, but deep down they don't carry much meaningful information.
- There is no semantic relationship between encoded data points: a zebra, a cat and a tomato will all get encoded, but similar concepts are not guaranteed to end up close to each other.
- Since everything is encoded to points, not distributions, we cannot sample any new data points.
What do we actually want?
- Instead of encoding to points, encode into some distribution so that we can generate new data points.
- But even if we try to encode to a distribution, the model will again try to memorise things, since it only has to minimize the reconstruction loss.
- We also want variation in the reconstructed images.
- So what we want is:
- There should be some latent distribution that is not just collapsed onto points; it should be wide enough to cover the latent space, so that we have higher variance and thus variation in the reconstructed images.
- The distributions should also not be far apart from each other, so that if we sample any latent it gives us something meaningful rather than random images.
How do we achieve the above requirements?
- A Gaussian distribution helps achieve the above objectives:
- It pushes the mean as close to zero as possible, thus keeping the distributions close to each other.
- It pushes towards unit variance, thus ensuring the distributions keep a higher variance and preventing the network from collapsing them to a point.
- So basically, we want our encoded distribution to look like a Gaussian distribution.
- We can use some metric that keeps the encoded distribution and the Gaussian distribution as similar as possible.
- That metric is the KL-divergence; the corresponding loss term is sketched below.
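The KL-divergence between the encoder's Gaussian N(mu, sigma^2) and the standard normal N(0, 1) has a closed form, and that is what is typically added to the reconstruction loss. A minimal sketch in PyTorch; the function name and the choice to parameterize the variance through log_var are illustrative assumptions.

```python
import torch

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for a diagonal Gaussian.

    mu, log_var: tensors of shape (batch, latent_dim) produced by the encoder.
    KL = 0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) )
    """
    return 0.5 * torch.sum(log_var.exp() + mu.pow(2) - 1.0 - log_var, dim=1)

# Example: encodings far from N(0, 1) are penalized more heavily.
mu = torch.tensor([[0.0, 0.0], [3.0, -3.0]])
log_var = torch.zeros(2, 2)                 # sigma = 1 in both rows
print(kl_to_standard_normal(mu, log_var))   # tensor([0., 9.])
```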
Before moving forward, some prerequisites
Probability vs Likelihood
- You can check out In Statistics, Probability is not Likelihood for more on this.
- Consider the weights of mice
Probability
- Let's say we have plotted all the weights, which gives us mean 32 and std 2.5. Now, the probability of having a mouse whose weight is between 32 and 34 is given by the area under the curve:
- P(32 ≤ weight ≤ 34 | mean=32, std=2.5)
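A quick numerical check with scipy, assuming the weights follow a normal distribution with mean 32 and std 2.5:

```python
from scipy.stats import norm

# Probability of a mouse weighing between 32 and 34, given mean=32, std=2.5:
# the area under the pdf between those two points.
p = norm.cdf(34, loc=32, scale=2.5) - norm.cdf(32, loc=32, scale=2.5)
print(p)  # ~0.288
```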
Likelihood
Considering we have fixed data/samples, what is the likelihood of the current mean and std? It is given by the height of the curve (the y-axis value) at the given weight.
Basically, in this case we have a continuous distribution with mean 32 and std 2.5, and we find the likelihood of observing a weight of 34.
L(mean=32, std=2.5 | weight=34)
While training, we try to maximize the likelihood for our data points so that our model can represent the data samples as closely as possible.
During inference, we check the probability of a given sample based on the params (mean and std) we learned while maximizing the likelihood.
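The corresponding likelihood is read off the height of the pdf rather than computed as an area. A quick check with scipy, reusing the N(32, 2.5^2) example:

```python
from scipy.stats import norm

# Likelihood of the parameters (mean=32, std=2.5) given an observed weight of 34:
# the height of the pdf at weight=34.
print(norm.pdf(34, loc=32, scale=2.5))  # ~0.116

# The same observation is less likely under a different mean; maximizing the
# likelihood during training picks the parameters that make the data most likely.
print(norm.pdf(34, loc=30, scale=2.5))  # ~0.044
```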
Some maths
The expectation \mathbb{E}[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx gives the mean, where x is the value and f(x) \, dx is the probability.
Simply put, it is the long-run average of the continuous random variable X.
Similarly, for a discrete random variable:
- \mathbb{E}[X] = \sum_{x} x \, p(x), where x is value and p(x) is probability.
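A small sanity check of both definitions, using numerical integration for the continuous case (the mouse-weight distribution N(32, 2.5^2) from above) and a plain weighted sum for the discrete case (a fair die, chosen purely for illustration):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# Continuous: E[X] = integral of x * f(x) dx for X ~ N(32, 2.5^2) -> 32
# (integrating over +/- 10 standard deviations is numerically sufficient)
e_cont, _ = quad(lambda x: x * norm.pdf(x, loc=32, scale=2.5), 32 - 25, 32 + 25)
print(e_cont)  # ~32.0

# Discrete: E[X] = sum of x * p(x) for a fair six-sided die -> 3.5
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
print(np.sum(values * probs))  # 3.5
```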
Chain rule of probability: p(x, y) = p(x|y)p(y) = p(y|x)p(x)
Bayes’ Theorem
- P(x \mid y) = \frac{P(y \mid x) \, P(x)}{P(y)}
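A tiny numeric check of the chain rule and Bayes' theorem; the probabilities below are made up purely for illustration:

```python
# Illustrative probabilities for two events x and y.
p_x = 0.3            # P(x)
p_y_given_x = 0.8    # P(y | x)
p_y = 0.5            # P(y)

# Chain rule: P(x, y) = P(y | x) * P(x)
p_xy = p_y_given_x * p_x                  # 0.24

# Bayes' theorem: P(x | y) = P(y | x) * P(x) / P(y)
p_x_given_y = p_y_given_x * p_x / p_y     # 0.48
print(p_xy, p_x_given_y)
```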
Kullback–Leibler Divergence
- It measures how different two probability distributions are (strictly speaking it is not a distance, since it is not symmetric).
- D_{\text{KL}}(P \| Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx
- Properties:
- Not symmetric, i.e. D_{\text{KL}}(P \| Q) \neq D_{\text{KL}}(Q \| P)
- D_{\text{KL}}(P \| Q) \geq 0
- D_{\text{KL}}(P \| Q) = 0 if and only if P = Q
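These properties are easy to verify on a pair of discrete distributions with scipy (the two distributions below are arbitrary examples):

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.9, 0.05, 0.05])
q = np.array([1 / 3, 1 / 3, 1 / 3])

# scipy.stats.entropy(p, q) computes D_KL(P || Q) for discrete distributions.
print(entropy(p, q))  # ~0.70  (>= 0)
print(entropy(q, p))  # ~0.93  -> different value, so KL is not symmetric
print(entropy(p, p))  # 0.0    -> zero when the distributions are identical
```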
Let's define our model
We can define the likelihood of our data as the marginalization of the joint probability over the latent variable z:
- p(x) = \int p(x, z) dz
But this is intractable because we would need to evaluate the integral over every possible latent variable z (see the toy sketch at the end of this section). Alternatively, we can use the chain rule of probability:
- p(x) = \frac{p(x, z)}{p(z|x)}
We don't have the ground truth p(z|x), which is exactly what we are trying to find.
Intractable Problem: A problem that can be solved in theory (e.g. given large but finite resources, especially time), but for which in practice any solution takes too many resources to be useful, is known as an intractable problem.
- Like guessing your neighbour's Wi-Fi password: in theory you can generate all the possible combinations, but it would take years.
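To make the intractability of p(x) = \int p(x, z) \, dz concrete, here is a naive Monte Carlo estimate under a toy model. The prior p(z) = N(0, I) and the likelihood p(x|z) = N(x; sum(z), 1) are assumptions chosen only so the integral fits in a few lines; in a real VAE the decoder is a neural network, z is high-dimensional, and this kind of brute-force sampling needs far too many samples to be accurate.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def naive_marginal_likelihood(x, latent_dim=2, n_samples=10_000):
    """Naive Monte Carlo estimate of p(x) = integral of p(x|z) p(z) dz."""
    z = rng.standard_normal((n_samples, latent_dim))          # z ~ p(z) = N(0, I)
    likelihoods = norm.pdf(x, loc=z.sum(axis=1), scale=1.0)   # toy p(x|z) per sample
    return likelihoods.mean()                                 # averages to ~p(x)

print(naive_marginal_likelihood(x=1.5))  # ~0.16 for this toy model
```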