Humble YOLO implementation in Keras

This article discusses the details of implementing the YOLO algorithm with Keras. I’m not covering the exact YOLO implementation, but rather how we arrive at YOLO from ML principles. We’ll stay close to the YOLO v1 implementation.

YOLO is a deep learning model that predicts object classes and locations. It belongs to the family of object detection algorithms. Compared to other methods it is simple, fast, and robust. YOLO is simple because it is purely a regression model: it uses deep learning to predict the results directly. It is fast thanks to its single-pass model. It is robust because it uses the whole context of the image to make a prediction, so it is less sensitive to patterns inside a sub-part of the image.

The plan

Since YOLO is a deep learning model, we first need a set of labeled images to train our network. In my own experience, getting good data at the beginning is often the hard part. We’ll examine different ways to obtain a dataset.

Next we’ll dig into the YOLO architecture. The most important thing to learn is how YOLO encodes its predictions. The encoding of the data is key to understanding the model. Then we need to evaluate the confidence of the predicted boxes.

Finally we’ll see how to combine all of the above in practice into the final loss function with Keras. The final code is available on my github. The implementation uses as little code as possible to make it easier to follow. I’ll also discuss the code at the end of the article.

Creating a good dataset

A good dataset determines how well a network will learn. If the dataset ignores some objects, the network will learn to ignore them as well. A good dataset for YOLO has accurate bounding boxes for all objects that appear. I had an extra constraint on processing power, as I only train models on CPU. I don’t recommend it, but it’s still doable if you can accept some concessions on your dataset.

The usual options to get a dataset are: downloading one, building one, or generating one. Let’s take a tour of them.

Downloading a dataset is the easiest option. Kaggle has a lot of them, and Google can help you locate one too. It’s the simplest way, but it’s not always possible to find a good dataset. They are often quite large (gigabytes), so they are not suitable for small tests. With limited power, it’s possible to reprocess one by reducing the image size.

The second possibility is to create one. It’s a rather long and manual process. Experienced people say you should aim for at least 500 examples of objects for each class to train YOLO with good generalisation. More is better, but it’ll also take more time to train. Once you have collected images, you’ll need to label them. BBox-Label-Tool, LabelImg, Yolo Mark and CVAT are all tools that can be used for labelling. It’s a time consuming solution, but the best one if you have unique data. Don’t rush it: the quality of what the network learns depends on the quality of the dataset.

The third possibility, and the one I used here, is to generate a dataset programmatically. This solution certainly produces a less realistic dataset, but it is still a good choice because you can control how complex your data is: how many items, how dense, etc. I used the python object library for this. I generate images with the texts chat and rat randomly placed inside an image. As we generate the images, we write the bounding boxes of the objects to a text file with the same name as the image.

Images generated by the generator :

Data file generated by the generator: each cell predicts a class, one box and a confidence of presence.

; cell (0,0)
0 1 0 ; class of box (the type of the object )
14 22 18 10 ; bounding box of the object
1 ; confidence of an object here
; cell (1,0)
1 0 0
35 23 24 10
; cell (0,1)
0 0 1
16 48 32 32
; cell (1,1)
1 0 0
52 35 24 10

Detecting objects

Let’s dive into the YOLO algorithm itself now. The structure of the data is very important for understanding YOLO: it divides the image into a grid. For each cell, YOLO predicts a class of object in the form of a one-hot vector, a fixed number of boxes, and a confidence score for each box. So with a 2*2 grid, 3 classes of objects and five boxes per cell, the output of our network will be 2*2*(3+5*(4+1)). YOLO v1 itself uses a 7*7 grid with two boxes per cell. Mauricio Menegaz explains that structure really well. I won’t go into that much detail, so I encourage you to read his article. Instead we’ll look more closely at how to implement it in Keras.
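As a quick check of the arithmetic above, here is a small sketch that computes the size of the output for a given grid. The function name `yolo_output_size` and its parameters are mine, they are not part of the YOLO code.

```python
def yolo_output_size(grid_w, grid_h, nb_classes, nb_boxes):
    """Number of values the network outputs for one image."""
    # each box carries x, y, width, height and a confidence score
    values_per_box = 4 + 1
    values_per_cell = nb_classes + nb_boxes * values_per_box
    return grid_w * grid_h * values_per_cell

# the example of the text: a 2*2 grid, 3 classes, five boxes per cell
size = yolo_output_size(grid_w=2, grid_h=2, nb_classes=3, nb_boxes=5)
print(size)
```

The same expression shows up later as the size of the last Dense layer of the model.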

Making a grid mathematically is something that is not described in the YOLO paper and I wasn’t sure of how to implement it. I found another YOLO Keras implementation that made that clear.

The trick is to have the network predict coordinates that are limited in range to a single cell. To limit the values that the network can output, we apply a sigmoid to its output. Sigmoid is a common activation function in deep learning, but here we use it for its property of only taking values between 0 and 1.

Since we only predict values between 0 and 1, we express width and height in the coordinate system of the image. A box as wide as the image has a width of 1.

For positional coordinates, we don’t want the network to predict coordinates outside of the cell, so we use the coordinate system of the cell. X=0 places the center of the box on the left edge of the cell. X=1 places the center of the box on the right edge of the cell. During training we need to add the offset of the cell so that we are in image space. We do this by adding a tensor of offsets to the network prediction in the loss function.
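To make the change of coordinates concrete, here is a minimal sketch in plain Python. The helper `cell_to_image` is hypothetical. Note that the loss function in this article only adds the cell offset and stays in grid units; dividing by the grid size to get a fraction of the image is my addition for the illustration.

```python
def cell_to_image(x_cell, y_cell, col, row, grid_w, grid_h):
    """Convert a box center from cell coordinates to image coordinates.

    x_cell, y_cell are in [0, 1] inside the cell (a sigmoid output).
    col, row is the position of the cell in the grid.
    The result is in [0, 1] relative to the whole image.
    """
    x_img = (col + x_cell) / grid_w
    y_img = (row + y_cell) / grid_h
    return x_img, y_img

# a box centered in the middle of cell (1, 0) of a 2*2 grid
center = cell_to_image(0.5, 0.5, col=1, row=0, grid_w=2, grid_h=2)
print(center)
```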

The bounding box inside the image relative to YOLO cells

A simplified YOLO backend

YOLO uses a backbone of Conv2D, leaky ReLU and max pooling layers for pattern detection, then a prediction head composed of two densely connected layers. The first one is large; the second one is smaller and returns the predictions for each of our cells.

My data was too simple for the full YOLO architecture, so I used a simpler model, similar to YOLO but with fewer layers, that was good enough and trains faster.

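import keras
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Reshape
from keras.models import Model

# img_w, img_h, grid_w, grid_h and nb_boxes are defined elsewhere in the script
i = Input(shape=(img_w,img_h,3))
# feature extraction: two blocks of convolutions, leaky ReLU and max pooling
x = Conv2D(16, (1, 1))(i)
x = Conv2D(32, (3, 3))(x)
x = keras.layers.LeakyReLU(alpha=0.3)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
x = Conv2D(16, (3, 3))(x)
x = Conv2D(32, (3, 3))(x)
x = keras.layers.LeakyReLU(alpha=0.3)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
# prediction head: two dense layers, sigmoid keeps every output in [0, 1]
x = Flatten()(x)
x = Dense(256, activation='sigmoid')(x)
x = Dense(grid_w*grid_h*(3+nb_boxes*5), activation='sigmoid')(x)
# one row per cell: 3 class scores, then 5 values per box
x = Reshape((grid_w*grid_h,(3+nb_boxes*5)))(x)
model = Model(i, x)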

The code above looked like a bunch of magic at the beginning (and YOLO is way deeper). I understood mathematical 2D convolution, but I had misunderstood how it is used as a deep learning layer. I found these notes from the Stanford CS class to be a very good explanation of convolution layers in image recognition. The most important thing to understand is that 2D convolution in Keras actually uses 3D kernels. There is one 3D kernel for each output channel we want. Each kernel convolves all input channels at once and produces one new channel.

Now, let’s see why YOLO picked that specific set of layers.

Strategies of layers combinations

Beyond individual layers, YOLO uses three main techniques to capture information inside images: multiple stacked Conv2D, max pooling, and Conv2D with 1×1 kernels. YOLO uses them respectively to split images into features, to reduce the image size, and to select the best channels. I didn’t find a good explanation of why this specific architecture is the best. The reasons given below for picking each type of layer are my best guess:

The purpose of multiple stacked Conv2D layers is to widen the area from which information is captured without increasing the computation too much. As an example, two stacked Conv2D layers with 3×3 kernels see a 5×5 pixel area around the current pixel. We could use a single 5×5 kernel for the same capture range, but that would require more computation for the convolution layer. Remember that each kernel is actually 3×3×nb_input_channels.
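The saving can be checked by counting weights. This is only a back-of-the-envelope sketch: the helper `conv_weights` is mine, and I assume the same number of channels (32) in and out of every layer.

```python
def conv_weights(kernel_size, in_channels, out_channels):
    """Number of weights in one Conv2D layer (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels

channels = 32  # assumed: same number of channels in and out

# two stacked 3x3 layers versus one 5x5 layer with the same capture range
stacked_3x3 = 2 * conv_weights(3, channels, channels)
single_5x5 = conv_weights(5, channels, channels)
print(stacked_3x3, single_5x5)
```

The stacked version needs 18 weights per pair of channels where the 5×5 kernel needs 25.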

Max pooling has a similar purpose to stacked Conv2D, with a very different process. It shrinks channels by keeping only the most activated value in each small neighbourhood. This helps reduce noise and also speeds up the computation, as the size of each channel is divided by 4.
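Here is a minimal NumPy sketch of what a 2×2 max pooling does to a single channel. The function `max_pool_2x2` is an illustration of the idea, not the Keras implementation, and it assumes the height and width are even.

```python
import numpy as np

def max_pool_2x2(channel):
    """2x2 max pooling with stride 2 on a single 2D channel."""
    h, w = channel.shape
    # group pixels in 2x2 blocks, then keep the max of each block
    blocks = channel.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

channel = np.array([[1, 2, 0, 1],
                    [3, 4, 1, 0],
                    [0, 0, 5, 6],
                    [1, 0, 7, 8]])
pooled = max_pool_2x2(channel)
print(pooled)
```

The 16 input values become 4 output values, one per block.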

Conv2D with a 1×1 kernel has a very different purpose from Conv2D with larger kernels, so I treat it separately. It acts as a filter that selects the best features in a set of channels and groups them into a new set of channels. I had difficulties understanding 1×1 filters at the beginning because I thought that Conv2D used a 2D kernel. It doesn’t: the kernel is 3D, with a third dimension equal to the number of input channels. The reason is that when you stack convolutions one after the other, the next layer’s input is not one image but a stack of images (also called feature maps or channels). Conv2D moves in 2 dimensions but processes a 3-dimensional input. Once you know that, it’s clear how a 1×1 convolution is useful for picking information across channels and combining it into a smaller number of channels.
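One way to convince yourself is to write a 1×1 convolution by hand. The sketch below uses NumPy with random values; the shapes (8×8 pixels, 32 channels reduced to 16) are arbitrary choices of mine, and bias and activation are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 8, 32))   # height, width, input channels
kernel = rng.normal(size=(32, 16))       # one weight vector per output channel

# a 1x1 convolution applies the same matrix to the channel vector of every pixel
out = features @ kernel                  # shape: (8, 8, 16)

# the same result computed by hand for one pixel
pixel = features[3, 5] @ kernel
print(out.shape)
```

Each output channel is a weighted mix of the 32 input channels at the same location, which is why it works as a channel selector.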

Pack everything in a custom loss function

Now we need to write a function that enforces everything we described above. The model can be trained with the mean squared error loss function and it will produce predictions, but that wouldn’t restrict box predictions appropriately to their cell. Still, the YOLO loss is mostly MSE with tweaks to improve the training.

The complete YOLO Loss function equation

The loss function is a function that grows when the result of the network is far from the solution and decreases when it’s close to the solution. The loss function returns a single number no matter how many dimensions go in. It’s just like playing hot and cold in many dimensions.

The equation for YOLO can look intimidating at first. The first thing to note is that we are only summing squares, so as we are studying the growth of the function we can just study each part of the function one at a time.

\lambda_coord, \lambda_obj, \lambda_noobj are coefficients to adjust for a given dataset. Adjusting them might help the network train faster and can improve accuracy a little, but they are not very important.

1_ij^obj is 1 when an object exists in the cell, 0 when there is no object. Nullifying a part of the loss function prevents the network from learning in locations where there is nothing to predict.

Accessing data in the Keras loss function

Keras provides y_true and y_pred as inputs of the loss function. A lot of loss function examples use the whole y_true and y_pred at once, but YOLO does not. We need to split the data to implement the loss function.

We do it with the python slice operator. When doing this, it’s important to understand that y_true and y_pred do not contain a single solution and a single prediction but a complete batch. Tensors operate on the whole batch at once, so it’s generally not very apparent in examples. It makes a lot of sense once you know it, as sending a batch to a GPU is way more efficient if the batch is large.

y_true[...,0]   # the first number in all batches
y_true[...,0:2] # the first two numbers in all batches

We must take great care with dimensions when doing this. The tensor structure tries very hard to work for you, and it often adds things up in very powerful but very strange ways. It’s very important to always know which dimensions you are actually adding or multiplying. If you don’t, it’s easy to just sum all features together, which may result in averaging the error over the whole batch, so that the network doesn’t learn at all, or worse.
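The same slicing and summing rules can be tried out with NumPy before writing the loss with Keras tensors. This is only a sketch with a fake batch; the shape (2 images, 4 cells, 3 features) is an arbitrary choice of mine.

```python
import numpy as np

# a fake batch: 2 images, 4 cells per image, 3 features per cell
batch = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)

first = batch[..., 0]         # shape (2, 4): first feature of every cell
first_two = batch[..., 0:2]   # shape (2, 4, 2): first two features

per_cell = batch.sum(axis=-1)   # shape (2, 4): one value per cell
everything = batch.sum()        # a single number: the whole batch mixed together
print(first.shape, first_two.shape, per_cell.shape, everything)
```

Forgetting the axis argument is the kind of mistake that silently mixes all images of the batch together.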

One of the tools that helps is the TensorFlow tf.Print function. Keras generates a derivative of the computation you make in the loss function and doesn’t call your Python code again after that, so a Python print won’t work within it. tf.Print injects a print command inside the graph so that the content of a tensor is printed while training the network (I suppose it works like that). You can specify how much data and how many samples you want printed. It has been really helpful to me for debugging dimension issues.


Basically the loss is the sum of all squared differences between desired coordinates and predicted coordinates, for all cells and for all boxes. \(\lambda_{coord}\) is a coefficient to balance the training between size and coordinates. \(1_{ij}^{obj}\) is worth 0 if the cell contains no object and 1 if the cell contains an object. Its role is to avoid training coordinates when there is no object.

Below is a raw implementation. I skipped lambda_coord as it’s not essential. 1_obj is y_true_conf.

grid = np.array([ [[float(x),float(y)]]*nb_boxes   for y in range(grid_h) for x in range(grid_w)])

pred_boxes = K.reshape(y_pred[...,3:], (-1,grid_w*grid_h,nb_boxes,5))
true_boxes = K.reshape(y_true[...,3:], (-1,grid_w*grid_h,nb_boxes,5))
y_true_conf = true_boxes[...,4]

y_pred_xy   = pred_boxes[...,0:2] + K.variable(grid)
y_true_xy   = true_boxes[...,0:2]

xy_loss    = K.sum(K.sum(K.square(y_true_xy - y_pred_xy),axis=-1)*y_true_conf, axis=-1)

I first generate a grid of offsets. For a 2*2 grid with two boxes per cell it looks like this: [[[0, 0], [0, 0]], [[1, 0], [1, 0]], [[0, 1], [0, 1]], [[1, 1], [1, 1]]]. The first dimension is the list of cells, the second is the list of boxes, and the last holds the offsets of an individual box. When the grid is added to the predictions, it is broadcast over the leading dimension, the list of images. As you can see there is an offset for each box.
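The grid and its broadcasting can be checked with NumPy alone. The sketch below reuses the expression of the loss function; the fake batch of 3 predictions with every coordinate at 0.5 is mine, only there to show the shapes.

```python
import numpy as np

grid_w, grid_h, nb_boxes = 2, 2, 2
grid = np.array([[[float(x), float(y)]] * nb_boxes
                 for y in range(grid_h) for x in range(grid_w)])
# grid.shape is (4, 2, 2): cells, boxes, (x, y) offset

# a fake batch of 3 predictions with all coordinates at 0.5
pred_xy = np.full((3, grid_w * grid_h, nb_boxes, 2), 0.5)

# broadcasting adds the same offsets to every image of the batch
shifted = pred_xy + grid
print(grid.shape, shifted.shape)
```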

Then I reshape the features from index 3 to the end, grouping them as boxes inside cells with the K.reshape function. A box prediction contains 5 values: X, Y, width, height and confidence. Reshaping the data makes it easier to manipulate coordinates. For example pred_boxes[...,0:2] selects the x,y coordinates for all boxes.

The end is just a sum of squares. I sum over the last axis until I’m down to the dimension of the list of cells. It’s important that all partial losses end up with the same dimensions so that we are able to sum them at the end. Always be careful about which dimensions you are summing, because tensors will happily add different dimensions together as long as they are compatible.


The width and height loss is similar to the position loss. The only difference is that we use the square root of the width and height instead of the width and height themselves. The intuition is that large errors on large boxes are less of a problem than similar errors on much smaller boxes. I’m not sure how YOLO ensures that width and height predictions are always positive so that taking the square root is ok. In my case I used a sigmoid as the output of my network, so that’s not a problem.

y_pred_wh   = pred_boxes[...,2:4]
y_true_wh   = true_boxes[...,2:4]   # the target sizes come from the labels
y_true_conf = true_boxes[...,4]

wh_loss    = K.sum(K.sum(K.square(K.sqrt(y_true_wh) - K.sqrt(y_pred_wh)), axis=-1)*y_true_conf, axis=-1)
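To get a feeling for the effect of the square root, here is a tiny numeric sketch in plain Python. The helper `wh_error` and the box sizes are made up for the illustration.

```python
import math

def wh_error(true_size, pred_size):
    """Squared error on the square roots, as in the width/height loss."""
    return (math.sqrt(true_size) - math.sqrt(pred_size)) ** 2

# the same absolute error of 0.05 on a large box and on a small box
large = wh_error(0.80, 0.85)
small = wh_error(0.05, 0.10)
print(large, small)
```

The same absolute error costs roughly ten times more on the small box than on the large one.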


Class loss is as simple as possible. We compute the square of the difference. That’s it. 1_obj has its usual meaning.

# the first 3 features of each cell are the one-hot class vector
y_true_class = y_true[...,0:3]
y_pred_class = y_pred[...,0:3]
y_true_conf = true_boxes[...,4]

clss_loss  = K.sum(K.square(y_true_class - y_pred_class)*y_true_conf, axis=-1)

Confidence (IOU)

The confidence predicts how sure the network is that a box accurately matches an object. For true boxes, the confidence is of course 1: they are true boxes, so we have absolute confidence that there is an object there. If the network predicts exactly the same box, the confidence is 1. If the network predicts some other box, we want the network to output the proportion of overlap between the true box and the predicted box. If there is no object, we want the confidence to drop to 0.

IOU = the surface of intersection/the surface of the union

In order to train the network for this, we compute the IOU, the intersection over union of the currently predicted box with the true box. For the intersection, I take half the sum of the two sizes minus the distance between the two box centers. We can do this for width and height at once because the operations are the same horizontally and vertically. Actually we even do this for the whole batch at once. We can then compute the intersection and the union.

intersect_wh = K.maximum(K.zeros_like(y_pred_wh), (y_pred_wh + y_true_wh)/2 - K.abs(y_pred_xy - y_true_xy) )
intersect_area = intersect_wh[...,0] * intersect_wh[...,1]
true_area = y_true_wh[...,0] * y_true_wh[...,1]
pred_area = y_pred_wh[...,0] * y_pred_wh[...,1]
union_area = pred_area + true_area - intersect_area
iou = intersect_area / union_area

y_pred_conf = pred_boxes[...,4]   # predicted confidence of each box
conf_loss = K.sum(K.square(y_true_conf*iou - y_pred_conf)*y_true_conf, axis=-1)
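For reference, here is a plain Python sketch of the IOU of two boxes, useful to check values by hand. It is not the code of the loss function: the function `iou` is mine and it works from the corners of the boxes, which also handles the case where one box lies entirely inside the other.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (center_x, center_y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # overlap along each axis, computed from the box corners
    inter_w = min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2)
    inter_h = min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2)
    inter = max(inter_w, 0.0) * max(inter_h, 0.0)
    union = aw * ah + bw * bh - inter
    return inter / union

same = iou((5, 5, 4, 4), (5, 5, 4, 4))      # identical boxes
apart = iou((1, 1, 2, 2), (10, 10, 2, 2))   # no overlap at all
half = iou((2, 2, 4, 4), (4, 2, 4, 4))      # shifted by half a width
print(same, apart, half)
```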

Summing partial loss

And now we are done! We just sum all our partial losses together and return the result. The only issue you may hit at this point is that some losses don’t have the right dimensions. If you have issues, try adding the losses one at a time and fix problems as they show up.

loss =  clss_loss + xy_loss + wh_loss + conf_loss

Predictions of the YOLO algorithm


I hope this article made the details of implementing YOLO in Keras clearer and has been helpful to you. As I said in the introduction, the code for this humble YOLO implementation is on github.

Variational autoencoder (VAE)

Why do we need the VAE

An autoencoder reduces an input of many dimensions to a vector space of fewer dimensions, then recomputes the lost dimensions from that limited number of intermediate values. This intermediate space is called the latent space.

An autoencoder is good at tasks like filtering noise. However, it is difficult to make it generate new images because the latent vector space is sparse. That means we can’t sample randomly from that space and expect a vector that produces a valid image. A lot of points in that space produce nothing of value. A variational autoencoder forces the latent space to be continuous, so that we can pick a random vector and get a meaningful image from it. We are going to explore how to do this.

The sparse latent space has coordinates that represent a mix of numbers

Beyond that case, this article can also be useful if you want to understand why and how to use alternative loss function in Keras.

How to fill the latent space ?

Forcing the VAE to only generate humanly interpretable images can be understood as making every dot in the latent space produce an interpretable image. To do that we need to somehow sample all the dots of the latent space and force them to produce some expected result.

Sampling all the dots of the latent space is of course impossible, because we would need an infinity of them to fill the latent space. So we’d prefer to use something with a surface. We could imagine filling the space with squares or disks, but as they are discrete shapes, it would be hard to cover the whole space. It would really be like a game of Tetris. They would also be hard to move, as this would create conflicts with other pieces.

We want the latent space to be dense, without any space left (inside a circle)

So, instead of a dot or a discrete surface, we’ll use a normal distribution, which is described by a probability density function (PDF). In two dimensions it looks a bit like a disk with a variable density. As we get further from its center, it gets softer and softer (less probable), so it’s easier to stitch other PDFs next to it.

We now need to know how to place our PDFs across the latent space to cover it entirely. However, backpropagation optimizes the latent space one sample at a time, each sample covering some surface. We can’t just distribute them across the space. The way to do it is to grow individual elements as large as possible, as long as it doesn’t break the quality of the results; then we do it again on the next epoch and grow the elements even further if possible, until it’s not possible to do better.

Pointing the backpropagation algorithm in the right direction is the role of the loss function.

Defining a loss function

We will define the loss function as the sum of two positive real numbers. That way it’s easy to see that to minimize A+B, with A and B positive, we have to minimize both A and B as much as possible.

For the first part, we simply use the crossentropy loss function of Keras. This part grows when the output is different from what is expected. This causes the latent space to be organized into clusters of dots (it’s the same thing we do for the ordinary autoencoder).

For the second part, we want to cover as much of the latent space as possible. In order to do that we have to decide what the latent space will be. Because we are going to sample from a normal distribution, it’s practical to decide that the latent space will be a normal distribution as well. What is the best cover we can imagine for a normal distribution? Of course that’s the distribution itself. The measure of how much two distributions are alike is the Kullback–Leibler divergence. It’s expressed as:

$$D_{KL}(P\|Q) = E(log(P/Q))$$

We can observe that the divergence of two identical distributions is E(log(1)) = 0.

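In practice we are interested in the specific case of the divergence between two normal distributions, which in the univariate case is:

$$D_{KL}(N(\mu_1,\sigma_1^2)\|N(\mu_2,\sigma_2^2)) = ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$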

KL divergence grows when the mean or the variance are different.

The KL divergence for the VAE

The KL divergence is the expectation of the difference between the quantity of information of the events under some distribution P and the quantity of information provided by the same events under some other distribution Q. It seems complicated, but what it means is that we measure, for each case, the difference between what we expected and what we got, then we take the weighted mean of it.

$$D_{KL}(P\|Q) = E(log(P/Q)) = E(log(P)-log(Q))$$

For the VAE the distribution we use is the normal distribution. We want the NN to optimize the distribution of X so that the values are more tightly packed around the origin. So we are going to optimize so that the P distribution looks as much as possible like the N(0,1) distribution (a gaussian distribution centered on the origin). In the case of the VAE we are interested in the KL divergence:

$$D_{KL}(N(\mu, \sigma) || N(0,1) )$$

We use the multivariate form of it (found on wikipedia):

$$D_{KL}(N((\mu_1,\ldots,\mu_k)^T, diag(\sigma_1^2,\ldots,\sigma_k^2))\|N(0,I)) = \frac{1}{2}\sum_{i=1}^k\left(\sigma_i^2+\mu_i^2-ln(\sigma_i^2)-1\right)$$

The sample code below uses a standard implementation for Keras. It actually implements something a little different. As the NN can learn to produce any function, it is trained to learn \(log(\sigma^2)\) instead of \(\sigma^2\) for numeric stability: \(\sigma^2\) must stay positive, while \(log(\sigma^2)\) can be any real number, so the network output needs no constraint [stackoverflow].

$$D_{KL} = -\frac{1}{2}\sum_{i=1}^k \left(1 + log(\sigma_i^2) - \mu_i^2 - e^{log(\sigma_i^2)}\right)$$
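The formula can be checked numerically outside of Keras. Below is a minimal NumPy sketch; the function name `kl_to_standard_normal` is mine, and the arguments mirror the z_mean and z_log_var outputs of the encoder.

```python
import numpy as np

def kl_to_standard_normal(z_mean, z_log_var):
    """KL divergence between N(z_mean, exp(z_log_var)) and N(0, I), per sample."""
    terms = 1 + z_log_var - np.square(z_mean) - np.exp(z_log_var)
    return -0.5 * np.sum(terms, axis=-1)

# two samples in a 2D latent space, both with variance 1 (log variance 0)
z_mean = np.array([[0.0, 0.0], [1.0, 0.0]])
z_log_var = np.array([[0.0, 0.0], [0.0, 0.0]])
kl = kl_to_standard_normal(z_mean, z_log_var)
print(kl)
```

The first sample is exactly N(0, I), so its divergence is 0. Moving the mean away from the origin makes the divergence grow.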

The reparameterization trick

Backpropagation needs to be able to determine how changing the mean and variance of each latent dimension affects the output of the loss function. However, we can’t just sample from \(N(\mu,\sigma)\), as it would not be possible to estimate how a change to the mean and variance would have changed the sample.

Instead of sampling from \(N(\mu,\sigma)\), we sample from N(0,1), then reparametrize the sample to its correct position. This turns the problem into a linear function of the mean and standard deviation, which is differentiable.

For a fixed epsilon we can now compute how changes to the mean and variance change the loss function.
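Here is a small NumPy sketch of the trick for a single latent dimension. The function `sample_z` is mine, and the derivative is estimated with a finite difference only to show that, once epsilon is fixed, the sample reacts smoothly to a change of the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(z_mean, z_log_var, epsilon):
    """Reparameterized sample: z = mean + std * epsilon."""
    return z_mean + np.exp(0.5 * z_log_var) * epsilon

epsilon = rng.standard_normal()  # drawn once from N(0,1), then held fixed
z = sample_z(1.0, 0.0, epsilon)

# with epsilon fixed, z is a deterministic function of the parameters
step = 1e-6
dz_dmean = (sample_z(1.0 + step, 0.0, epsilon) - z) / step
print(z, dz_dmean)
```

The randomness is entirely inside epsilon, so the gradient can flow through z_mean and z_log_var.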

Keras implementation

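import numpy as np
import matplotlib.pyplot as plt
from optparse import OptionParser
from keras import backend as K
from keras.datasets import mnist
from keras.layers import Input, Dense, Lambda
from keras.losses import binary_crossentropy
from keras.models import Model

# reparameterization trick
# instead of sampling from Q(z|X), sample eps = N(0,I)
# z = z_mean + sqrt(var)*eps
def sampling(args):
    z_mean, z_log_var = args
    batch = K.shape(z_mean)[0]
    dim = K.int_shape(z_mean)[1]
    # by default, random_normal has mean=0 and std=1.0
    epsilon = K.random_normal(shape=(batch, dim))
    return z_mean + K.exp(0.5 * z_log_var) * epsilon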

# MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train[0:1000]
image_size = x_train.shape[1]
original_dim = image_size * image_size
x_train = np.reshape(x_train, [-1, original_dim])
x_test = np.reshape(x_test, [-1, original_dim])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

# network parameters
input_shape = (original_dim, )
intermediate_dim = 128
batch_size = 128
latent_dim = 2
epochs = 1

# create a encoder model
inputs = Input((784,))
x = Dense(intermediate_dim, activation='relu')(inputs)
z_mean = Dense(latent_dim, name='z_mean')(x)
z_log_var = Dense(latent_dim, name='z_log_var')(x)
z = Lambda(sampling, output_shape=(latent_dim,), name='z')([z_mean, z_log_var])
encoder = Model(inputs=inputs, outputs=[z_mean, z_log_var, z], name='encoder')

# create a decoder model
i = Input((latent_dim,))
x = Dense(intermediate_dim, activation='relu')(i)
# sigmoid gives each pixel its own value in [0, 1], as binary_crossentropy expects
x = Dense(784, activation='sigmoid')(x)
decoder = Model(inputs=i, outputs=x,name='decoder')

# instantiate VAE model
outputs = decoder(encoder(inputs)[2])
vae = Model(inputs, outputs, name='vae_mlp')

# VAE loss = mse_loss or xent_loss + kl_loss
reconstruction_loss = binary_crossentropy(inputs, outputs)
reconstruction_loss *= original_dim
kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
kl_loss = K.sum(kl_loss, axis=-1)
kl_loss *= -0.5
vae_loss = K.mean(reconstruction_loss + kl_loss)

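# attach the custom loss to the model; compile() then needs no loss argument
vae.add_loss(vae_loss)
vae.compile(optimizer='adam', loss=None)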

parser = OptionParser()
parser.add_option("-t", "--train", action="store_true")
(options, args) = parser.parse_args()

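# train the autoencoder
if options.train:
    # assumed call: the body of this branch is missing in the source
    vae.fit(x_train, epochs=epochs, batch_size=batch_size, validation_data=(x_test, None))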

# generate some image from random latent vector
lsv = np.random.normal(size=(5, latent_dim))
lsv = np.array([[x/10.0,y/10.0] for x in range(-10,10,2) for y in range(-10,10,2)])

imgs = decoder.predict(lsv)
iplot = 1
for img in imgs:
    img = img.reshape(28,28)
    plt.subplot(10,10, iplot)
    plt.imshow(img, cmap='gray')
    iplot += 1
plt.show()
Random samples from the latent space produce valid numbers.

Reminder of probability and information theory

I needed a few reminders about probability and information theory before understanding the intuition behind the KL divergence. Here they are:

A bit of information theory and probability :

  • An event of probability \(\frac{1}{2^n}\) provides a quantity of information of \(-log_2(\frac{1}{2^n}) = n\). If something has a probability of \(\frac{1}{2^3}\), its quantity of information is \(-log_2(\frac{1}{2^3})=3\). The generalization of this is:

$$I = -log(p(x))$$

  • Mean and variance of continuous random variables

The mean is the integral of x times the probability density function at that point.

$$\mu = \int_{-\infty}^{\infty} x\,p(x)\,dx$$

The variance is the integral of the squared distance to the mean times the probability density function at that point.

$$\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2\,p(x)\,dx$$

  • The expectation of a random variable is the value we could expect to get on average if we repeat some experiment a lot of times. It’s the sum of the values of the events weighted by the probability of each event occurring.

$$E(X)=\sum_{i=0}^{n} p(x_i) x_i$$

  • The information entropy is the expectation of the information of a random variable.

$$H(X)=-E(log(P(X)))=-\sum_{i=0}^{n} p(x_i)log(p(x_i))$$

We use the special case of a diagonal multivariate normal, and a standard normal distribution as stated in :

  • The normal distribution 

$$f(x)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left (\frac{x-\mu}{\sigma}\right)^2}$$

  • Multivariate normal gaussian

$$p(x)=\frac{1}{(2\pi)^{n/2}det(\Sigma)^{1/2}}exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

The standard normal distribution N(0,1) is the formula above with \(\mu=0\) and \(\sigma=1\).
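The quantity of information and the entropy from the reminders above are easy to compute by hand. Here is a small sketch in plain Python; the function names `information` and `entropy` are mine, and I use base 2 logarithms so that results are in bits.

```python
import math

def information(p):
    """Quantity of information, in bits, of an event of probability p."""
    return -math.log2(p)

def entropy(probabilities):
    """Expected information, in bits, of a discrete distribution."""
    return sum(p * information(p) for p in probabilities if p > 0)

bits = information(1 / 2**3)       # an event of probability 1/8
fair_coin = entropy([0.5, 0.5])
biased_coin = entropy([0.9, 0.1])
print(bits, fair_coin, biased_coin)
```

A biased coin is more predictable than a fair one, so its entropy is lower.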

A few readings that were interesting in understanding the VAE :

Generating MNIST images from an autoencoder model in Keras

Autoencoders are a type of model trained to reconstruct an output identical to the input after reducing it to lower dimensions inside the model. That lower dimension vector space is called the latent space. Dimension reduction allows autoencoders to be good at generalizing tasks like removing noise from images.

A Dense based autoencoder

MNIST images are 28×28 black and white images containing handwritten digits.

Images are best handled by convolution layers, but autoencoders are useful in more than one way. Dense NNs also train faster, so I tried a dense layer network first.

Using only sequential Keras models, we can build an autoencoder by stacking an encoder and a decoder in a sequential model. A model can be used just like a layer in Keras.

As our model is an autoencoder, we use x_train both as the input and as the expected output.

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1,784)
x_test = x_test.reshape(-1,784)

# create a encoder model
encoder = Sequential()
encoder.add(Dense(128, activation='relu', input_shape=(784,)))
encoder.add(Dense(64, activation='relu'))
encoder.add(Dense(64, activation='relu'))

# create a decoder model
decoder = Sequential()
decoder.add(Dense(64, activation='relu', input_shape=(64,)))
decoder.add(Dense(128, activation='relu'))
decoder.add(Dense(784, activation='softmax'))

# models behave like layers. We can build a model from other models
autoencoder = Sequential()
autoencoder.add(encoder)
autoencoder.add(decoder)

metrics=['accuracy']), x_train,

Above I presented how to define the autoencoder using the Keras Sequential API as this is what the Keras documentation explains first and I find it slightly more readable at the beginning.

However most autoencoder tutorials use the functional API to define models. Let’s see how to do it with the functional API. These two pieces of code define the same model, just in different ways. Writing the code both ways helped me understand how the functional API works in Keras.

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1,784)
x_test = x_test.reshape(-1,784)

# create a encoder model
i = Input((784,))
x = Dense(128, activation='relu', input_shape=(784,))(i)
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)
encoder = Model(inputs=i, outputs=x)

# create a decoder model
i = Input((64,))
x = Dense(64, activation='relu', input_shape=(64,))(i)
x = Dense(128, activation='relu')(x)
x = Dense(784, activation='softmax')(x)
decoder = Model(inputs=i, outputs=x)

# models behave like layers. We can build a model from other models
i = Input((784,))
x = encoder(i)
x = decoder(x)
autoencoder = Model(inputs=i, outputs=x)

metrics=['accuracy']), x_train,

Here is what the autoencoder produces as images. The sequential and the functional API of course give similar results. The top images are the input data, which are reduced to a low dimension latent space by the encoder, then reconstructed into the bottom images by the decoder. If we have some latent space vector (64 numbers), we can reconstruct all 784 pixels from those 64 numbers.

A convolutional based autoencoder

In the dense model above, we converted all images to 784 numbers without any spatial structure. By doing that we removed all information about the spatial organisation of pixels in MNIST images. Effectively the network ignores that some pixel is above some other pixel, or to the right or left of it. The output is a function of each pixel of the input.

Convolutional layers are a way to include spatial information into the network. Remember that a network can only predict something if the input contains information about the prediction.

Here is the model with the functional API :

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = x_train.reshape(-1, 28,28,1)
x_test = x_test.reshape(-1,28,28,1)

i = Input(shape=(28, 28, 1))
x = Conv2D(16, (3, 3), activation='relu', padding='same')(i)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2), padding='same')(x)
Encoder = Model(i, x)

i = Input(shape=(4, 4, 8)) # 8 conv2d features
x = Conv2D(8, (3, 3), activation='relu', padding='same')(i)
x = UpSampling2D((2, 2))(x)
x = Conv2D(8, (3, 3), activation='relu', padding='same')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(16, (3, 3), activation='relu')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(1, (3, 3), activation='sigmoid', padding='same')(x)
Decoder = Model(i, x)

# define input to the model:
x = Input(shape=(28, 28, 1))

# make the model
autoencoder = Model(x, Decoder(Encoder(x)))

# compile the model:
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')

# train it: the input images are also the expected output
autoencoder.fit(x_train, x_train,
                validation_data=(x_test, x_test))

Here is what the convolutional autoencoder produces as output. The principle is the same as with the dense autoencoder, but it produces digits that are more sharply defined. The digits are still a bit fuzzy, which is a tendency of autoencoders due to the simplification.
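The layer sizes in the convolutional model can be checked by hand. The following is only a sketch in plain Python (the helper names same_pool, valid_conv and upsample are mine, not Keras functions). It follows the width of the image through the encoder and the decoder, and shows why one decoder convolution is left without padding='same'.

```python
import math

def same_pool(size, pool=2):
    # MaxPooling2D with padding='same' rounds the output size up
    return math.ceil(size / pool)

def valid_conv(size, kernel=3):
    # Conv2D with the default padding='valid' shrinks the size by kernel - 1
    return size - kernel + 1

def upsample(size, factor=2):
    # UpSampling2D repeats every row and column
    return size * factor

# Encoder: the 'same' convolutions keep the size, only the pooling changes it
size = 28
encoder_sizes = [size]
for _ in range(3):
    size = same_pool(size)
    encoder_sizes.append(size)

# Decoder: upsample, upsample, one 'valid' convolution, upsample
size = upsample(size)      # 4 -> 8
size = upsample(size)      # 8 -> 16
size = valid_conv(size)    # 16 -> 14
size = upsample(size)      # 14 -> 28
decoder_output = size

print(encoder_sizes)   # [28, 14, 7, 4]
print(decoder_output)  # 28
```

Without the unpadded convolution the decoder would end at 32 pixels instead of 28, and the output could not be compared with the input image.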

Randomly sampling numbers

Having done that, it seems that it should be possible to sample a random vector from the latent space and use it to generate an image.

Sadly this does not work, because the latent space is not dense: a random vector is unlikely to produce a valid image.

To get there, we need a way to pack the latent space more densely so that any random vector produces a valid digit. This is what a variational autoencoder does.

Understanding Keras tensors

I had a hard time understanding what Keras tensors really were. They are used in a lot of the more advanced uses of Keras, but I couldn’t find a simple explanation of what they mean inside Keras. I am writing this as a way to clarify my understanding.

I could not find a description of Keras tensors; however, Keras is implemented over Tensorflow and shares the same concepts. This paragraph on the Tensorflow website was what made tensors clearer :

A tensor is a generalization of vectors and matrices to potentially higher dimensions.

That one was clear from the beginning. Tensors are matrices of many dimensions. All right. Then I read :

A tf.Tensor object represents a partially defined computation that will eventually produce a value.

That one was what I missed. The tf.Tensor object is the result of a function that is not yet evaluated.

I feel that it is like f(x) in the mathematical function f(x) = x². If we provide some value for the input x, then f(x) evaluates to something. As long as there is no value, f(x) is just a partially defined computation (a function).

TensorFlow programs work by first building a graph of tf.Tensor objects, detailing how each tensor is computed based on the other available tensors and then by running parts of this graph to achieve the desired results.

The tf.Tensor object is not just a matrix of many dimensions, it also links to other tensors through the way it is computed. The way a tf.Tensor is computed is a function that transforms a tensor A into a tensor B. I suppose we recurse from the output tensor until we reach all the necessary inputs, then we evaluate everything forward.
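To make that idea concrete, here is a toy sketch in plain Python. It is not how Tensorflow is implemented; the Node class and its evaluate method are hypothetical names I made up to illustrate a value that is described first and computed later by recursing down to the inputs.

```python
class Node:
    """A value that is described now and computed later."""

    def __init__(self, fn=None, inputs=(), name=None):
        self.fn = fn          # how to compute this node from its inputs
        self.inputs = inputs  # the nodes this node depends on
        self.name = name      # only used by placeholders

    def evaluate(self, feed):
        # A placeholder has no function: its value comes from the feed dict
        if self.fn is None:
            return feed[self.name]
        # Otherwise evaluate the inputs first, then apply the function
        args = [node.evaluate(feed) for node in self.inputs]
        return self.fn(*args)

# Building the graph computes nothing yet
x = Node(name="x")
square = Node(fn=lambda v: v * v, inputs=(x,))
plus_one = Node(fn=lambda v: v + 1, inputs=(square,))

# Values only appear when we feed the placeholder
print(plus_one.evaluate({"x": 3}))  # 10
```

The same graph can be evaluated again with another value of x, which is the role K.function plays in the Keras examples of this article.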

All of this is about Tensorflow, but I feel that it is correct for Keras as well.


Identity function

import numpy as np
from keras import backend as K

i = K.placeholder(shape=(4,), name="input")
f = K.function([i], [i])
ival = np.ones((4,))
print( f([ival]) )

> [array([ 1.,  1.,  1.,  1.], dtype=float32)]

A useless function that takes an input and returns it. i is a tensor. f is a function. It takes its inputs and outputs as tensors. When we evaluate it with f([ival]), the tensor graph is walked from the i output to the i input. Quite easy here :). i is evaluated with its value, then the output is returned by the function.

Square function

import numpy as np
from keras import backend as K

i = K.placeholder(shape=(4,), name="input")
square = K.square(i)
f = K.function([i], [square])
ival = np.ones((4,))*2
print( f([ival]) )

> [array([ 4.,  4.,  4.,  4.], dtype=float32)]

A function that returns the square of each value of the input. It is just the same as previously, but we walk the graph through the square function before getting to the input values.

Multiple outputs function

import numpy as np
from keras import backend as K

i = K.placeholder(shape=(4,), name="input")
square = K.square(i)
mean = K.mean(i)
mean_of_square = K.mean(K.square(i))
f = K.function([i], [i, square, mean, mean_of_square])
ival = np.ones((4,))*2

print( f([ival]) )

> [array([ 2.,  2.,  2.,  2.], dtype=float32), array([ 4.,  4.,  4.,  4.], dtype=float32), 2.0, 4.0]

A function that returns the input, the square, the mean and the mean of the square. We can compose functions.

Gradient function

import numpy as np
from keras import backend as K

i = K.placeholder(shape=(4,), name="input")
square = K.square(i)
grad = K.gradients([square], [i])
f = K.function([i], [i,square] + grad)
ival = np.ones((4,))*3
print( f([ival]) )

> [array([ 3.,  3.,  3.,  3.], dtype=float32), array([ 9.,  9.,  9.,  9.], dtype=float32), array([ 6.,  6.,  6.,  6.], dtype=float32)]

A function that computes the gradient of square relative to the variable i. K.gradients computes a tensor that is the derivative of the composition of all functions between square and i. square(x) = x², so d square(x)/dx = 2x. For the input 3, the derivative is 6. We can compute the derivative between two related tensors.
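The value 6 can be double checked without Keras. This is a small sketch in plain Python that approximates the derivative with a finite difference (numerical_gradient is a helper name of mine, not a Keras function).

```python
def square(x):
    return x * x

def numerical_gradient(f, x, eps=1e-6):
    # Central finite difference: (f(x + eps) - f(x - eps)) / (2 * eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

grad_at_3 = numerical_gradient(square, 3.0)
print(round(grad_at_3, 4))  # 6.0
```

Keras does not work this way: it derives the gradient symbolically from the graph. The finite difference is only a cheap way to verify the result.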


I think a few of those examples might be useful to the Keras documentation. I’ll propose them when my understanding has improved a bit. At least this article will help me avoid the curse of knowledge later and remind me of the difficulties along the way.

A simple natural language category classifier with Keras

NN are a very flexible tool for processing natural language and images. They are very versatile and can do things that imperative code can’t do. I wanted to learn a bit about them. I chose to use Keras as it seems to be the best library to learn about NN. Keras is very close to the high-level layers that NN papers talk about.

So for testing, I looked for a project that would be simple enough. At work, I had a lot of news sorted by categories. It seemed nice to create a NN that would be able to automatically classify individual news items into each category.

It would provide a little bit of value, either by automatically proposing an appropriate category for a feed or by proposing an alternative classification for individual news items for better navigation.

I had read a few NN tutorials, but going from explanations to a working example on my own data was not trivial. As often when you learn a new topic, there is a lot of vocabulary, and articles are a bit painful to read until you master enough of it. One of the things that was difficult too was preparing the data for the NN.

In this article I’m going quickly to the implementation. This is a very beginner-oriented article written mostly to consolidate my own understanding. I’ll start with the NN itself, then we’ll prepare the data for training it.

Description of the NN

The NN we are going to use is defined with Keras that way :

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(512, input_dim=10000))
model.add(Activation('relu'))
model.add(Dense(8))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I didn’t include the inputs and outputs so that we can have a look at how the NN is built first. So what do we have here ?

Sequential model

model = Sequential()

Sequential is the simplest way to define a NN with Keras. It means we are going to stack each layer over the previous one.

First Dense

model.add(Dense(512, input_dim=10000))

Dense is a classical NN layer. Because it is the first layer, it’s also considered the input layer. The input_dim parameter is the dimension of the data the NN is going to expect as input. That’s also called the dimension of the data.

That layer has 512 neurons. Those 512 neurons will output 512 values which are going to be the inputs of the next layer. We don’t need to specify the size of the input of next layer because Keras does it automatically.

512 is also the dimension that the NN is going to use to represent your data internally. Larger values learn more slowly. They are also more subject to overfitting, which means that instead of generalizing the problem, the network over-specialises on the examples provided, which reduces its ability to predict good outputs for unknown inputs. Smaller values may not be enough to generalize the problem.

Eventually, everybody tries different values to find something that performs well.

Role of the first Activation

model.add(Activation('relu'))

Activation is what is called the activation function. NN use activation functions because stacking a linear operation over a linear operation ends up being a linear operation itself, resulting in the network being equivalent to a single layer. Activation takes the type of function to use as activation. For internal layers relu is often a good pick.
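The collapse of stacked linear layers can be seen on a one-neuron toy example. This is a sketch in plain Python with made-up weights, not Keras code: two linear layers give exactly the same result as one, until a relu is put between them.

```python
def linear(w, b):
    # A one-input, one-output "layer": x -> w * x + b
    return lambda x: w * x + b

def relu(x):
    return max(0.0, x)

layer1 = linear(2.0, 1.0)    # 2x + 1
layer2 = linear(3.0, -4.0)   # 3x - 4

def stacked(x):
    return layer2(layer1(x))

# 3 * (2x + 1) - 4 = 6x - 1, so one layer does the same job
single = linear(6.0, -1.0)

def stacked_relu(x):
    return layer2(relu(layer1(x)))

for x in (-2.0, 0.0, 1.5):
    assert stacked(x) == single(x)

# With relu in between, no single linear layer matches any more
print(stacked_relu(-2.0), single(-2.0))  # -4.0 -13.0
```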

The ConvNetJS demo (Classify toy 2D data) is great to see how the activation function changes the shape of the learned data. Try changing the tanh to relu to experiment.

The second Dense layer

model.add(Dense(8))

This is the last layer of the NN, so it’s also its output. As I have 8 categories, the NN needs to output 8 values, so that layer has 8 neurons. It’ll output an array such as [2,6,0,1,0,0,0.5,0].

Softmax Activation

model.add(Activation('softmax'))

The previous Dense layer outputs real numbers. Looking at a single output, it’s not easy to tell what a 6 means, for example. Its interpretation depends on the other values.

Softmax squashes the value of each output between 0 and 1 and normalises the whole vector so that the sum of the probabilities of the classes is 1. For a single specific output, Softmax tells you the probability that that class is true.
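As an illustration, here is softmax written in plain Python and applied to the example array from the second Dense layer. It is a sketch of the formula, not the Keras implementation.

```python
import math

def softmax(values):
    # Subtracting the max before exponentiating avoids overflow
    # and does not change the result
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

raw = [2, 6, 0, 1, 0, 0, 0.5, 0]
probs = softmax(raw)

print([round(p, 3) for p in probs])
print(round(sum(probs), 6))  # 1.0
```

The raw 6 ends up with a probability of about 0.96, which is much easier to interpret than the 6 alone.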

Building the NN

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

We are done defining our NN. categorical_crossentropy is the loss function we use when sorting into categories. It expects the output to be a vector of category probabilities, such as the one softmax produces.
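To get a feeling for what this loss measures, here is a minimal version in plain Python for a single example. It is a sketch of the usual formula; the eps clipping value is my own choice to avoid log(0), not something taken from Keras.

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # Only the predicted probability of the true class contributes,
    # because y_true is 0 everywhere else
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

y_true = [0, 0, 1, 0]                  # the right answer is class 2
confident = [0.01, 0.01, 0.97, 0.01]   # right and sure of it
unsure = [0.25, 0.25, 0.25, 0.25]      # no idea
wrong = [0.97, 0.01, 0.01, 0.01]       # sure, but of the wrong class

print(round(categorical_crossentropy(y_true, confident), 3))  # 0.03
print(round(categorical_crossentropy(y_true, unsure), 3))     # 1.386
print(round(categorical_crossentropy(y_true, wrong), 3))      # 4.605
```

The loss gets smaller as the probability given to the correct category gets closer to 1, which is what training tries to achieve.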

I didn’t research much into the other parameters. Some optimisers may work better on different types of data, but for my use case most of them worked well.


I found model.summary() to be very helpful at the beginning. I didn’t see a lot of tutorials use it, probably because it doesn’t provide a lot of information.

Still, seeing how parameters affect the dimensions of the layers of the NN does help understanding. It also helps with debugging because when you have an error, Keras uses the name of the layer to tell you where the issue is. When I see a 1 or 2 index, I always have difficulty knowing whether the count starts at 0 or 1. It had me scratching my head until I found summary().

Layer (type)                 Output Shape    Param #
dense_1 (Dense)              (None, 512)     5120512
activation_1 (Activation)    (None, 512)     0
dense_2 (Dense)              (None, 8)       4104
activation_2 (Activation)    (None, 8)       0
Total params: 5,124,616
Trainable params: 5,124,616
Non-trainable params: 0
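The parameter counts in that summary can be recomputed by hand: each neuron of a Dense layer has one weight per input plus one bias. A short check in plain Python (dense_params is a helper name of mine):

```python
def dense_params(n_inputs, n_neurons):
    # one weight per (input, neuron) pair, plus one bias per neuron
    return n_inputs * n_neurons + n_neurons

first = dense_params(10000, 512)   # dense_1
second = dense_params(512, 8)      # dense_2

print(first, second, first + second)  # 5120512 4104 5124616
```

The Activation layers have no parameters, which is why they show 0.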

Training the NN

Training is the part where all the coefficients are chosen to have the NN predict the closest output given an input. Keras does it in one line but it’s actually where most of the NN magic happens.

model.fit(x_train, y_train, epochs=2)

The training itself is done by an algorithm named backpropagation. A very clear explanation of how backpropagation works can be found on Matt Mazur’s blog : A Step by Step Backpropagation Example.

x_train and y_train are the inputs and outputs of the NN used for training. Both are multidimensional arrays. The first row of x_train is the first example of input. The first row of y_train is the first example of expected output. Each row of x_train is given to the NN, and the neurons’ coefficients are corrected toward the corresponding value in y_train. Doing that once for all the data is called an epoch.

epochs=2 means doing that process twice. If you have a lot of data, a single epoch may be enough, but most of the time using more epochs makes better use of the data.

batch_size is how many samples are used for one forward pass and one backward pass (from the Keras description). I find that a little bit unclear, so here is my guess, which might be wrong. Batches exist to solve a performance problem : updating the weights of each neuron after each sample is slow. It may not be very slow on CPU, but on GPU that would mean transferring weights to the model after each forward pass. Also, on many-core architectures, it allows passes to run in parallel.

My guess is that Keras does many forward passes and computes the new weights, but only updates them once every batch_size samples. The nicest explanation I found was on machinelearningmastery, which actually calls the algorithm mini-batch : A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size.
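Here is how I picture the splitting of one epoch into batches, as a sketch in plain Python. The function name and the toy data are mine, and the weight update itself is left out; it only shows how many updates one epoch would trigger.

```python
import math

def iterate_minibatches(samples, batch_size):
    # Yield consecutive slices; the last one may be smaller
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))   # stand-in for 10 training rows
batches = list(iterate_minibatches(samples, batch_size=4))

print(batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]

# One weight update per batch, so this many updates per epoch
updates_per_epoch = math.ceil(len(samples) / 4)
print(updates_per_epoch)  # 3
```

With batch_size=1 there would be 10 updates per epoch, and with batch_size=10 only one.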

Reading CSV

CSV is often used as an input format for NN. My data is basically composed of the title and text of the article in the first field and the category in the second field. It’s possible to read them that way :

import pandas

dataframe = pandas.read_csv('articles.csv', header=None, escapechar='\\', na_filter=False)
dataset = dataframe.values
# the first 9000 rows are used for training
texts = dataset[:9000,0]
categories = dataset[:9000,1]
# the remaining rows are kept for validation
test_texts = dataset[9000:,0]
test_categories = dataset[9000:,1]

To evaluate the training, ML practitioners often set aside a part of the data and check against it how good the trained NN is at predicting a result.

With python it’s really easy to do. dataset[:9000] returns the first 9000 rows and dataset[9000:] returns all the rows after them. The first ones are the training data, the last ones are the validation data.
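The direction of the colon is easy to get backwards, so here is the same split on a tiny list (plain Python, with a split point of 3 instead of 9000):

```python
rows = ["a", "b", "c", "d", "e"]
split = 3

train = rows[:split]        # everything before index 3
validation = rows[split:]   # everything from index 3 onwards

print(train)       # ['a', 'b', 'c']
print(validation)  # ['d', 'e']
```

No row is lost or duplicated: the two slices together always give back the full list.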

The input : representing natural language numerically

This was my most head-scratching issue at the beginning. NN expect numbers as inputs, but it wasn’t evident to me how to convert a list of words into a list of numbers.

We could pick anything like a dictionary and assign an index to each word, but that wouldn’t be a great idea (or at least not in this case). Not all transformations are equal.

What we actually want is a transformation that preserves some of the meaning of the information. We also want a transformation that produces vectors of a fixed dimension, since the input of the NN has a fixed size. Sentences as they are can’t be used as input because their length varies. Preserving the order of the words in the sentence is also only useful if we are going to exploit this order. The NN we defined doesn’t know how to use it, so preserving that information is useless.

The representation I chose is to present the sentences as a matrix where each row is one text and each column represents one word. In the simplest version of this, we could flag 1 if the word is present and 0 if it isn’t. Here we use a tf-idf representation, which is a statistically more representative version of the data. I chose that one because it gave better results.

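from keras.preprocessing.text import Tokenizer

tk = Tokenizer(num_words=10000)
tk.fit_on_texts(texts)  # build the word index from the training texts
x_train = tk.texts_to_matrix(texts, mode="tfidf")
x_test = tk.texts_to_matrix(test_texts, mode="tfidf")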
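To show what such a matrix looks like, here is the simplest variant (1 if the word is present, 0 otherwise) written in plain Python. It is only a sketch of the idea with made-up sentences and helper names of mine; the Keras Tokenizer handles punctuation, indexes and tf-idf weighting differently.

```python
def build_vocabulary(texts, num_words):
    # Count the words, then keep the num_words most frequent ones
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # most frequent first, ties broken alphabetically
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ranked[:num_words])}

def text_to_row(text, vocabulary):
    # One column per known word, whatever the length of the text
    row = [0] * len(vocabulary)
    for word in text.lower().split():
        if word in vocabulary:
            row[vocabulary[word]] = 1
    return row

texts = ["new drug approved", "new election results"]
vocab = build_vocabulary(texts, num_words=4)
matrix = [text_to_row(t, vocab) for t in texts]

print(vocab)   # {'new': 0, 'approved': 1, 'drug': 2, 'election': 3}
print(matrix)  # [[1, 1, 1, 0], [1, 0, 0, 1]]
```

Every text becomes a row of the same width, the word order is gone, and the words outside the vocabulary ("results" here) are simply dropped.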

What is important to understand is that the way we present data to the NN is going to orient how the NN is going to learn. There are different ways of representing data. I’ll try to write an article about a few of them. Each representation of the data also requires an appropriate NN structure.

The output : representing categories numerically

As we have exactly one label for each text, we can represent the category as a vector as well. It’s usually called a one-hot vector, which means it’s a vector where each category is represented by a column. Here is a one-hot vector for the category 2 :
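As a sketch in plain Python, assuming the 8 categories are numbered from 0 (one_hot is a helper name of mine):

```python
def one_hot(index, num_classes):
    # All zeros except a 1 in the column of the category
    vector = [0] * num_classes
    vector[index] = 1
    return vector

print(one_hot(2, 8))  # [0, 0, 1, 0, 0, 0, 0, 0]
```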


It is very close to what the keras tokenizer does with texts_to_matrix, but the keras tokenizer reserves the index 0, which adds an extra useless column. Most people prefer to use LabelEncoder from sklearn.

import keras
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# fit_transform learns the category names and maps them to integers
num_categories = encoder.fit_transform(categories)
y_train = keras.utils.to_categorical(num_categories)


That NN, as simple as it is, performed quite well. It is able to classify the category of articles with 75% accuracy, which is not too bad given that there are 8 categories. It could probably be improved by giving it more articles for each category.

One thing I didn’t think about was that the differences between the classification I had and what the NN outputted would be even more interesting than a perfect 100% match. 100% wouldn’t actually teach me anything. Correct matches were often very close to 100% in the correct category. Invalid matches often had more average scores, such as a news item being part current affairs, part health, for example if it was talking about a new drug. Through the softmax, the network actually provides the probability for a news item to be in each category, and that could be helpful for proposing an alternative category.
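Proposing an alternative category could be as simple as keeping the two best scores of the softmax output. This is a sketch in plain Python; the labels and probabilities below are invented for the example, not results from my data.

```python
def top_categories(probabilities, labels, k=2):
    # Sort by probability, highest first, and keep the k best
    ranked = sorted(zip(probabilities, labels), reverse=True)
    return [(label, p) for p, label in ranked[:k]]

labels = ["news", "health", "sport", "tech"]   # hypothetical categories
probs = [0.48, 0.41, 0.03, 0.08]               # hypothetical softmax output

print(top_categories(probs, labels))  # [('news', 0.48), ('health', 0.41)]
```

When the two best scores are that close, the second one is a good candidate for an alternative category.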

Securing API routes with OAuth2 and nginx auth_request

When an API is built in parts, it’s nice to be able to build services without embedding authorization checks inside each of them. It allows a better separation of concerns. In that pattern, the auth server is called to check whether the user is authorized to access the services.

Nginx allows us to do that with auth_request.

I had a bit of pain connecting all the dots, but eventually it ends up adding only two lines of configuration to any route that you want to secure.

  1. Checking if the request is authorized with auth_request.
  2. Returning the error message from the auth server on a 401, by displaying the /auth response as a custom error page with error_page 401.

Both requests are done on a /auth route that proxies the auth server.

Authorize request

auth_request checks the authentication status and returns 401 unauthorized or 200 authorized. We need to pass the Authorization header so that the bearer token is provided to the auth server. We do this with :

proxy_pass_header Authorization;
proxy_set_header Authorization $http_authorization;

We probably don’t need post data when checking auth so we do not forward this content to the auth server with :

proxy_pass_request_body off;
proxy_set_header Content-Length "";
proxy_set_header X-Original-URI $request_uri;

Returning authorization response

If the auth_request response was a 401 then we show the result of the /auth request as an error page.

error_page 401 =401 /auth;
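On the other side of the /auth route, the auth server only has to answer 200 or 401 depending on the Authorization header it receives. Here is a minimal sketch of that contract in Python. It is not my real auth server: the token store, the auth_status function and the AuthHandler class are all hypothetical, and a real OAuth2 server would validate the token properly.

```python
from http.server import BaseHTTPRequestHandler

VALID_TOKENS = {"secret-token"}  # hypothetical token store

def auth_status(authorization_header):
    """Return the HTTP status the auth endpoint would answer with."""
    if not authorization_header:
        return 401
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or token.strip() not in VALID_TOKENS:
        return 401
    return 200

class AuthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # nginx forwards the Authorization header of the original request
        status = auth_status(self.headers.get("Authorization"))
        self.send_response(status)
        self.end_headers()

print(auth_status("Bearer secret-token"))  # 200
print(auth_status("Bearer something-else"))  # 401
print(auth_status(None))  # 401
```

To try it, the handler can be served with HTTPServer(("", 8000), AuthHandler).serve_forever() from the same http.server module, with the proxy_pass of the /auth route pointing at it.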

The full nginx config (with PHP-FPM )

server {
    listen 80 default_server;
    server_name _;

    index index.php;
    rewrite_log on;
    root /var/www/html/public;

    location / {
        auth_request /auth;
        error_page 401 =401 /auth;
        try_files $uri $uri/ /index.php?$query_string;
    }

    location ~ \.php$ {
        # With php5-fpm:
        fastcgi_index index.php;
        fastcgi_connect_timeout 3s;
        fastcgi_read_timeout 60s;
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        fastcgi_param SCRIPT_NAME $fastcgi_script_name;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include /etc/nginx/fastcgi_params;
    }

    location = /auth {
        proxy_pass http://api/v1/oauth2/user;
        proxy_pass_header Authorization;
        proxy_set_header Authorization $http_authorization;
        proxy_pass_request_body off;
        proxy_set_header Content-Length "";
        proxy_set_header X-Original-URI $request_uri;
    }
}

Redirecting NGINX output on docker-compose up when embedded in supervisor

Two things are actually required. First we need to redirect access.log to /dev/stdout and error.log to /dev/stderr. An easy way to do it in docker is by aliasing the nginx logs to stdout and stderr. We can do it by adding this to our Dockerfile :

RUN ln -sf /dev/stdout /var/log/nginx/access.log \
&& ln -sf /dev/stderr /var/log/nginx/error.log

However this is not sufficient, because supervisor captures the output of the processes it manages. Nginx logs now end up in /var/logs/supervisor/nginx-*.log.

We need to tell supervisor to redirect logs it captures to /dev/stdout.

[program:nginx]
command=nginx -g 'daemon off;'
stdout_logfile=/dev/stdout
stdout_logfile_maxbytes=0

We can now run docker-compose up and watch nginx output

New syntax of ES6

Here are a few notes on the new ES6 syntax. You’ll probably get much more information from reading the source than from this. This is mostly a personal summary.


let allows declaring block-scoped variables

Scoped function

Function declarations are limited in scope to the block where they are defined

Spread operator

The spread operator expands an array into individual values :

f(...[1,2,3]); // expands to f(1,2,3)

Default arguments

function f(x=1,y=2) { return x+y; }
f(undefined,3); // returns 4

Destructuring

With arrays :

var [x,y,z] = [1,2,3,4]
var [y,x] = [x,y]
var [,y,z] = [1,2,3,4]
var [x,...y] = [1,2,3,4]

With objects :

var {x,y,z} = {x:1,y:2,z:3}
var {x:a,y:b,z:c} = {x:1,y:2,z:3};
// assigning to existing properties or array slots is an assignment, not a var declaration
({x:o.a,y:o.b,z:o.c} = {x:1,y:2,z:3});
({x:a[0],y:a[1],z:a[2]} = {x:1,y:2,z:3});
var {x=10,y=11,z=12} = {x:1}

Nested :

var [a,{x,y}] = [1,{x:1,y:2}]

Destructuring as function arguments

function f({x,y}) {}

Concise methods

var o = {
  x() {},
  y() {}
}

Take care that this can’t work (not because of the obvious recursion problem) but just because x is not defined inside the function scope :

var o = {
  x() { x(); }
}

Getter and Setter

var o = {
  get id() { return __id; },
  set id(id) { __id = id; }
}

Computed properties name

var o = {
  [prefix+"_var"] : 1,
  [prefix+"_func"]() {}
}



Template literal

`my name is ${name}`
`my name is ${f(name)}`

Arrow functions

Arrow functions are great for short inline functions, but as the function grows they become less and less readable.

var f1 = () => 1
var f2 = x => x
var f3 = (x,y) => { return x+y; }

this is kept pointing to the original object like a bind(this) would, so they are quite practical as callbacks.

var o = {
  listen: function() {
    // "click" is only an example event type
    btn.addEventListener("click", () => { this.hello() })
  },
  hello: function() {
  }
}

But be careful if you chain arrow function notation, as this will be the upper original object.

For…of loop

for(var v of ["a","b","c"]) { console.log(v); } // will print "a","b","c"

for…of also works with iterators, strings, destructured objects


The start of a regex match can be manually positioned in the string with lastIndex, so you don’t have to split a string into pieces to test a regex against a long text. It works with the sticky flag y

Flags can be queried on a regex

var r = /test/gi
console.log(r.flags); // will print "gi".

The order of the flags is always “gimuy”

Octal is explicit in strict mode

0o52 // 42


Unicode now works in regexes, in strings and in variable names. String.codePointAt is the equivalent of String.charCodeAt for unicode characters. Positions in strings will be correctly handled.


Symbols are meant to be used for constants. They are like the :label syntax of Ruby if you know that.


The text inside Symbol is just for the description of the symbol. It’s possible to recall a constant from another place in code with :

var CONSTANT = Symbol.for("APP.CONSTANT")

Exposing port 80 on OSX with VirtualBox

VirtualBox on OSX has that annoying limitation :

Forwarding host ports < 1024 impossible:
 On Unix-based hosts (e.g. Linux, Solaris, Mac OS X) it is not possible to bind to ports below 1024 from applications that are not run by root. As a result, if you try to configure such a port forwarding, the VM will refuse to start.

A quick way to expose a website is to expose port 8080 on the VirtualBox NAT instead, which is not a problem as it is above 1024, then to redirect port 80 to port 8080 with the OSX NAT. It can be done simply that way :

echo "
 rdr pass inet proto tcp from any to any port 80 -> port 8080
 " | sudo pfctl -ef -

You can then disable it with :

sudo pfctl -F all -f /etc/pf.conf