# Variational autoencoder (VAE)

[mathjax]

## Why do we need the VAE?

An autoencoder reduces a high-dimensional input to a vector in a lower-dimensional space, then reconstructs the lost dimensions from that compact intermediate representation. The space in which this intermediate vector lives is called the latent space.

An autoencoder is good at tasks like noise filtering. However, it is difficult to make it generate new images, because the latent space is sparsely populated: we can’t sample a random vector from that space and expect it to produce a valid image. Many points in that space produce nothing of value. A variational autoencoder forces the latent space to be continuous, so that we can pick a random vector and get a meaningful image from it. We are going to explore how to do this.

Beyond that case, this article can also be useful if you want to understand why and how to use a custom loss function in Keras.

## How to fill the latent space?

Forcing the VAE to generate only humanly interpretable images can be understood as requiring every point of the latent space to produce an interpretable image. To do that, we need to somehow sample all the points of the latent space and force them to produce the expected result.

Sampling every point of the latent space is of course impossible, because we would need infinitely many of them to fill it. So we’d prefer to use something that has a surface. We could imagine filling the space with squares or disks, but because they have hard edges it would be hard to cover the whole space without gaps or overlaps. It would really be like a game of Tetris. They would also be hard to move, as this would create conflicts with the other pieces.

So, instead of a point or a hard-edged surface, we’ll use a normal distribution, described by its probability density function (PDF). In two dimensions it looks a bit like a disk whose density varies. As we move away from its center it gets softer and softer (less probable), so it’s easier to stitch other PDFs next to it.

We now need to know how to place our PDFs across the latent space to cover it entirely. However, backpropagation optimizes the latent space one sample at a time, each sample occupying some surface rather than a single point. We can’t just lay them out across the space ourselves. The way to do it is to grow each individual element as large as possible, as long as that doesn’t degrade the quality of the results; on the next epoch we do it again and grow the elements even further if possible, until no further improvement is possible.

Pointing the backpropagation algorithm in the right direction is the role of the loss function.

## Defining a loss function

We will define the loss function as the sum of two non-negative real numbers. That way it’s easy to see that, to minimize A+B with A and B both non-negative, we have to make both A and B as small as possible.

For the first part, we simply use the cross-entropy loss function of Keras. This part grows when the output differs from what is expected. It causes the latent space to be organized into clusters of points (the same thing happens in an ordinary autoencoder).
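To get a feeling for how this first term behaves, here is a minimal pure-Python sketch of a binary cross-entropy over a handful of pixels. It is not the Keras implementation: the function name `pixel_crossentropy`, the clipping constant and the four-pixel example are assumptions made for illustration only.

```python
import math

def pixel_crossentropy(target, output, eps=1e-7):
    """Mean binary cross-entropy between target and reconstructed pixels in [0, 1]."""
    total = 0.0
    for t, o in zip(target, output):
        o = min(max(o, eps), 1 - eps)  # clip to avoid log(0)
        total -= t * math.log(o) + (1 - t) * math.log(1 - o)
    return total / len(target)

pixels = [0.0, 1.0, 1.0, 0.0]                                  # hypothetical target image
good = pixel_crossentropy(pixels, [0.1, 0.9, 0.8, 0.2])        # close reconstruction
bad = pixel_crossentropy(pixels, [0.9, 0.2, 0.1, 0.7])         # poor reconstruction
print(good, bad)
```

The closer reconstruction gets the smaller value, which is the behavior we want from the first term of the loss.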

For the second part, we want to cover as much of the latent space as possible. To do that, we have to decide what the latent space will be. Because we are going to sample from a normal distribution, it’s practical to decide that the latent space will follow a normal distribution as well. What is the best cover we can imagine for a normal distribution? The distribution itself, of course. A measure of how much two distributions are alike is the Kullback–Leibler divergence. It’s expressed as:

$$D_{KL}(P\|Q) = E(\log(P/Q))$$

We can observe that the divergence between two identical distributions is E(log(1)) = 0.

In practice, we are interested in the specific case of the divergence between two normal distributions, which the next section works out.

## The KL divergence for the VAE

The KL divergence is the expected value, under some distribution P, of the difference between the quantity of information each event carries under another distribution Q and the quantity it carries under P. It seems complicated, but what it means is that, for each event, we measure the difference between what we expected and what we got, then take the probability-weighted mean of those differences.

$$D_{KL}(P\|Q) = E(\log(P/Q)) = E(\log(P)-\log(Q))$$
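The definition can be tried out on small discrete distributions. The sketch below is only an illustration: the function name `kl_divergence` and the two example distributions are assumptions, and the expectation is taken under P, as in the formula.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions, in nats."""
    # Terms with p_i == 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

kl_pp = kl_divergence(p, p)  # identical distributions
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)  # swapping the arguments changes the value
print(kl_pp, kl_pq, kl_qp)
```

Two identical distributions give exactly zero, and swapping P and Q gives a different number, which is why the KL divergence is not a true distance.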

For the VAE, the distribution we use is the normal distribution. We want the NN to arrange the encodings of X so that they are tightly packed around the origin. So we are going to optimize so that the distribution P looks as much as possible like the N(0,1) distribution (a Gaussian distribution centered on the origin). In the case of the VAE, we are interested in the KL divergence:

$$D_{KL}(N(\mu, \sigma) \| N(0,1))$$

We use its multivariate form (found on Wikipedia):

$$D_{KL}(N((\mu_1,\dots,\mu_k)^T, diag(\sigma_1^2,\dots,\sigma_k^2)) \| N(0,I)) = \frac{1}{2}\sum_{i=1}^k\left(\sigma_i^2+\mu_i^2-\ln(\sigma_i^2)-1\right)$$

The sample code below uses a standard implementation for Keras. It actually implements something slightly different. As the NN can learn to produce any function, it is trained to output $\log(\sigma^2)$ instead of $\sigma^2$, for numerical stability: the variance must be positive, whereas its logarithm can be any real number (variances between 0 and 1 map to the negative numbers), so the network output needs no constraint [stackoverflow].

$$D_{KL} = -\frac{1}{2}\sum_{i=1}^k\left(1+\log(\sigma_i^2)-\mu_i^2-e^{\log(\sigma_i^2)}\right)$$
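As a sanity check, the log-variance form can be compared with a direct Monte Carlo estimate of E(log(P) - log(Q)). This is a sketch using only the Python standard library; the function names, the sample count and the example values of the mean and log-variance are assumptions chosen for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

def kl_monte_carlo(mu, log_var, n_samples=100_000, seed=0):
    """Estimate the same quantity as the mean of log P(z) - log Q(z) with z drawn from P."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        for m, lv in zip(mu, log_var):
            sigma = math.exp(0.5 * lv)
            z = rng.gauss(m, sigma)
            log_p = -0.5 * (math.log(2 * math.pi) + lv + ((z - m) / sigma) ** 2)
            log_q = -0.5 * (math.log(2 * math.pi) + z * z)
            total += log_p - log_q
    return total / n_samples

mu = [0.5, -1.0]                  # hypothetical encoder means
log_var = [0.0, math.log(0.25)]   # hypothetical encoder log-variances

closed_form = kl_to_standard_normal(mu, log_var)
estimate = kl_monte_carlo(mu, log_var)
print(closed_form, estimate)
```

The two numbers should agree up to sampling noise, and the closed form is zero when the mean is 0 and the log-variance is 0, that is, when P is already N(0,I).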

## The reparameterization trick

Backpropagation needs to be able to determine how changing the mean and variance of each latent dimension affects the output of the loss function. However, we can’t just sample from $N(\mu,\sigma)$: sampling is a random operation, so it would not be possible to estimate how a change to the mean or variance would have changed the sample.

Instead of sampling from $N(\mu,\sigma)$, we sample $\epsilon$ from N(0,1), then shift and scale it to its correct position: $z = \mu + \sigma\epsilon$. This turns the problem into a linear function of $\mu$ and $\sigma$, which is differentiable.

For a fixed $\epsilon$, we can now compute how changes to the mean and variance change the loss function.
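The following sketch illustrates the idea for a single latent dimension, parameterized by the log-variance as in the Keras code. With the noise held fixed, the sample is a plain deterministic function, so its derivatives can be checked by finite differences. The function name `sample_z` and the numeric values are assumptions for illustration.

```python
import math
import random

def sample_z(mu, log_var, eps):
    """Reparameterized sample: z = mu + sigma * eps, with sigma = exp(0.5 * log_var)."""
    return mu + math.exp(0.5 * log_var) * eps

rng = random.Random(0)
eps = rng.gauss(0.0, 1.0)  # the only random step; it does not depend on mu or log_var

mu, log_var, h = 0.3, -1.0, 1e-6

# Finite-difference derivatives of z with eps held fixed
dz_dmu = (sample_z(mu + h, log_var, eps) - sample_z(mu - h, log_var, eps)) / (2 * h)
dz_dlog_var = (sample_z(mu, log_var + h, eps) - sample_z(mu, log_var - h, eps)) / (2 * h)

# Analytic derivatives of the same function
expected_dz_dmu = 1.0
expected_dz_dlog_var = 0.5 * math.exp(0.5 * log_var) * eps
print(dz_dmu, dz_dlog_var)
```

Both derivatives are well defined, which is what lets the gradient of the loss flow back through the sampling step to the mean and the variance.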

## Keras implementation

    # Written against the Keras 2 functional API (symbolic add_loss);
    # other Keras versions may need adjustments.
    import numpy as np
    import matplotlib.pyplot as plt
    from optparse import OptionParser
    from keras import backend as K
    from keras.datasets import mnist
    from keras.layers import Dense, Input, Lambda
    from keras.losses import binary_crossentropy
    from keras.models import Model


    # reparameterization trick
    # instead of sampling from Q(z|X), sample eps = N(0,I)
    # z = z_mean + sqrt(var)*eps
    def sampling(args):
        z_mean, z_log_var = args
        batch = K.shape(z_mean)[0]
        dim = K.int_shape(z_mean)[1]
        # by default, random_normal has mean=0 and std=1.0
        epsilon = K.random_normal(shape=(batch, dim))
        return z_mean + K.exp(0.5 * z_log_var) * epsilon


    # MNIST dataset
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train[0:1000]  # train on a small subset only
    image_size = x_train.shape[1]
    original_dim = image_size * image_size
    x_train = np.reshape(x_train, [-1, original_dim])
    x_test = np.reshape(x_test, [-1, original_dim])
    x_train = x_train.astype('float32') / 255
    x_test = x_test.astype('float32') / 255

    # network parameters
    input_shape = (original_dim, )
    intermediate_dim = 128
    batch_size = 100
    latent_dim = 2
    epochs = 40

    # create an encoder model
    inputs = Input(shape=input_shape)
    x = Dense(intermediate_dim, activation='relu')(inputs)
    z_mean = Dense(latent_dim, name='z_mean')(x)
    z_log_var = Dense(latent_dim, name='z_log_var')(x)
    z = Lambda(sampling, output_shape=(latent_dim,), name='z')([z_mean, z_log_var])
    encoder = Model(inputs=inputs, outputs=[z_mean, z_log_var, z], name='encoder')

    # create a decoder model
    i = Input((latent_dim,))
    x = Dense(intermediate_dim, activation='relu')(i)
    # sigmoid keeps each pixel in [0, 1], as binary cross-entropy expects
    x = Dense(original_dim, activation='sigmoid')(x)
    decoder = Model(inputs=i, outputs=x, name='decoder')

    # instantiate VAE model: decode the sampled z (third encoder output)
    outputs = decoder(encoder(inputs)[2])
    vae = Model(inputs, outputs, name='vae_mlp')

    # VAE loss = mse_loss or xent_loss + kl_loss
    reconstruction_loss = binary_crossentropy(inputs, outputs)
    reconstruction_loss *= original_dim
    kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
    kl_loss = K.sum(kl_loss, axis=-1)
    kl_loss *= -0.5
    vae_loss = K.mean(reconstruction_loss + kl_loss)
    vae.add_loss(vae_loss)
    vae.compile(optimizer='adam', loss=None)

    parser = OptionParser()
    parser.add_option("-t", "--train", action="store_true")
    (options, args) = parser.parse_args()

    # train the autoencoder, or reload previously saved weights
    if options.train:
        vae.fit(x_train, epochs=epochs, batch_size=batch_size)
        vae.save_weights('mnist_variational_autoencoder.h5')
    else:
        vae.load_weights('mnist_variational_autoencoder.h5')

    # generate some images from a regular grid of latent vectors
    # (for random vectors instead: lsv = np.random.normal(size=(5, latent_dim)))
    lsv = np.array([[x/10.0, y/10.0] for x in range(-10, 10, 2) for y in range(-10, 10, 2)])
    imgs = decoder.predict(lsv)
    iplot = 1
    for img in imgs:
        img = img.reshape(28, 28)
        plt.subplot(10, 10, iplot)
        iplot += 1
        plt.imshow(img)
    plt.show()

## Reminder of probability and information theory

I needed a few reminders about probability and information theory before I could grasp the intuition behind the KL divergence. Here they are:

• An event of probability $\frac{1}{2^n}$ provides a quantity of information of $-\log_2(\frac{1}{2^n}) = n$ bits. If something has a probability of $\frac{1}{2^3}$, its quantity of information is $-\log_2(\frac{1}{2^3})=3$. The generalization of this is:

$$I = -\log(p(x))$$
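A quick check of the quantity of information in Python, using base-2 logarithms so that the result is in bits. The function name `information_bits` is an assumption for illustration.

```python
import math

def information_bits(p):
    """Quantity of information, in bits, of an event of probability p."""
    return -math.log2(p)

three_bits = information_bits(1 / 2 ** 3)
# The information of two independent events adds up:
# a 1/2 event followed by a 1/4 event is one 1/8 event.
combined = information_bits(0.5) + information_bits(0.25)
print(three_bits, combined)
```

Rare events carry more information than likely ones, and a certain event (probability 1) carries none.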

• Mean and variance of continuous random variables

The mean is the sum (integral) of x times the probability density function at that point.

$$\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx$$

The variance is the sum of the squared distance to the mean times the probability density function at that point.

$$\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx$$
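These two integrals can be approximated numerically. The sketch below uses a normal density as the test case and a simple midpoint rule over a wide finite interval; the helper names, the step count and the parameters 1.5 and 2.0 are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

mu_true, sigma_true = 1.5, 2.0
# Ten standard deviations on each side stand in for the infinite bounds.
lo, hi = mu_true - 10 * sigma_true, mu_true + 10 * sigma_true

def p(x):
    return normal_pdf(x, mu_true, sigma_true)

mass = integrate(p, lo, hi)
mean = integrate(lambda x: x * p(x), lo, hi)
variance = integrate(lambda x: (x - mean) ** 2 * p(x), lo, hi)
print(mass, mean, variance)
```

The density integrates to 1, and the two integrals recover the mean and the squared standard deviation that were used to build the density.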

• The expected value (expectation) of a random variable is the value we could expect to get on average if we repeated some experiment many times. It’s the sum of the values of the events, weighted by the probability of each event occurring.

$$E(X)=\sum_{i=1}^{n} p(x_i)\,x_i$$

• The information entropy is the expected value of the information of a random variable.

$$H(X)=-E(\log(P(X)))=-\sum_{i=1}^{n} p(x_i)\log(p(x_i))$$
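Both quantities are one-liners for a discrete distribution. This is a small sketch with assumed names (`expectation`, `entropy_bits`) and assumed examples (a fair die, a fair coin and a biased coin); entropy is computed in bits.

```python
import math

def expectation(values, probs):
    """Probability-weighted sum of the values."""
    return sum(p * x for x, p in zip(values, probs))

def entropy_bits(probs):
    """Expected quantity of information, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

die_mean = expectation([1, 2, 3, 4, 5, 6], [1 / 6] * 6)
fair_coin = entropy_bits([0.5, 0.5])
biased_coin = entropy_bits([0.9, 0.1])
print(die_mean, fair_coin, biased_coin)
```

The fair coin carries one bit per toss, while the biased coin is more predictable and therefore has a lower entropy.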

For the KL divergence of the VAE, we use the special case of a diagonal multivariate normal and a standard normal distribution, as stated in: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Multivariate_normal_distributions

• The normal distribution

$$f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left (\frac{x-\mu}{\sigma}\right)^2}$$

• The multivariate normal distribution, with mean vector $\mu$ and covariance matrix $\Sigma$

$$p(x)=\frac{1}{(2\pi)^{n/2}\det(\Sigma)^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$
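When the covariance matrix is diagonal, as in the VAE, the determinant is the product of the variances and the quadratic form is a sum of independent terms, so the multivariate density is just the product of univariate normal densities. The sketch below checks this numerically; the function names and the example vectors are assumptions for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def diag_mvn_pdf(x, mu, var):
    """Multivariate normal density when the covariance matrix is diag(var)."""
    n = len(x)
    det = math.prod(var)  # determinant of a diagonal matrix
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mu, var))
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (n / 2) * math.sqrt(det))

x, mu, var = [0.2, -1.0], [0.0, 0.5], [1.0, 4.0]
joint = diag_mvn_pdf(x, mu, var)
product = math.prod(normal_pdf(xi, mi, math.sqrt(vi)) for xi, mi, vi in zip(x, mu, var))
print(joint, product)
```

This factorization is the reason the KL divergence of the VAE can be written as a simple sum over the latent dimensions.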

The standard normal distribution N(0,1) is the normal distribution with $\mu=0$ and $\sigma=1$ (in the multivariate case, a zero mean vector and the identity matrix as covariance).

A few readings that were helpful in understanding the VAE: