Humble YOLO implementation in Keras

This article covers the details of implementing the YOLO algorithm with Keras. I’m not describing the exact YOLO implementation, but rather how we arrive at YOLO from ML principles. We’ll stay close to the YOLO v1 implementation.

YOLO is a deep learning model that predicts both the class and the location of objects. It belongs to the family of object detection algorithms. Compared to other methods it is simple, fast, and robust. YOLO is simple because it treats detection as a pure regression problem: a single network directly predicts the results. It is fast thanks to its single-pass model. It is robust because it uses the whole context of the image to make a prediction, so it is less sensitive to patterns inside a subpart of the image.

The plan

Since YOLO is a deep learning model, we first need a set of labeled images to train our network. In my own experience, getting good data is often the problematic part at the beginning. We’ll examine different ways to obtain a dataset.

Next we’ll dig into the YOLO architecture. The most important thing to learn is how YOLO encodes its predictions. The encoding of the data is key to understanding the model. Then, we need to evaluate the confidence of the predicted boxes.

Finally we’ll see how to combine all of the above in practice into the final loss function with Keras. The final code is available on my github. The implementation uses a minimum of code for better comprehension. I’ll also discuss the code at the end of this article.

Creating a good dataset

A good dataset determines how well a network will learn. If the dataset ignores some objects, the network will learn to ignore them as well. A good dataset for YOLO is one that has accurate bounding boxes for all objects that appear. I had an extra constraint on processing power, as I only train models on CPU. I don’t recommend it, but it’s still doable if you can accept some concessions on your dataset.

The usual options to get a dataset are: downloading one, building one, or generating one. Let’s take a tour of them.

Downloading a dataset is the easiest option. Kaggle has a lot of them, and Google can help you locate one too. It’s the simplest way, but it’s not always possible to find a good dataset. They are often quite large (gigabytes), so they are not suitable for small tests. With limited computing power, it’s possible to reprocess one by reducing the image size.

The second possibility is to create one. It’s a rather long and manual process. Experienced people say you should aim for at least 500 examples of objects for each class to train YOLO with good generalisation. More is better, but it will also take more time to train. Once you have collected images, you’ll need to label them. BBox-Label-Tool, LabelImg, Yolo Mark and CVAT are all software that can be used for labelling. It’s a time-consuming solution, but the best one if you have unique data. Don’t rush it: the quality of what the network learns depends on the quality of the dataset.

The third possibility, and the one I used here, is to generate a dataset programmatically. This solution certainly produces a less realistic dataset, but it is still a good choice because you can control how complex your data is: how many items, how dense, etc. I used the python object library for this. I generate images with the texts chat and rat randomly placed inside an image. As we generate the images, we write the bounding boxes of the objects in a text file named after the image.

Images generated by the generator :

Data file generated by the generator: each cell holds a class, one box and the confidence that an object is present.

; cell (0,0)
0 1 0 ; class of box (the type of the object )
14 22 18 10 ; bounding box of the object
1 ; confidence of an object here
; cell (1,0)
1 0 0
35 23 24 10
; cell (0,1)
0 0 1
16 48 32 32
; cell (1,1)
1 0 0
52 35 24 10
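
As an illustration of the label format above, here is a minimal sketch of how such a file could be produced. It is not the generator from the repository: the function name `generate_labels`, the box size range, and the reading of the four box numbers as "center x, center y, width, height" in pixels are all assumptions made for the example. It also writes the confidence line for every cell.

```python
import random

def generate_labels(grid_w, grid_h, cell_size, nb_classes, rng):
    """Return label lines: one object per cell, center kept inside the cell."""
    lines = []
    for cy in range(grid_h):
        for cx in range(grid_w):
            one_hot = [0] * nb_classes
            one_hot[rng.randrange(nb_classes)] = 1
            # Box center in pixels, constrained to the current cell (assumption).
            x = cx * cell_size + rng.randrange(cell_size)
            y = cy * cell_size + rng.randrange(cell_size)
            # Arbitrary size range, chosen only for the illustration.
            w = rng.randint(8, cell_size)
            h = rng.randint(8, cell_size)
            lines.append("; cell (%d,%d)" % (cx, cy))
            lines.append(" ".join(str(v) for v in one_hot))
            lines.append("%d %d %d %d" % (x, y, w, h))
            lines.append("1")  # an object is present in this cell
    return lines

labels = generate_labels(2, 2, 32, 3, random.Random(0))
print("\n".join(labels))
```

Keeping the center of each box inside its cell matters, because that is the constraint the network output will be forced to respect later.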

Detecting objects

Let’s dive into the YOLO algorithm itself now. The structure of the data is very important for understanding YOLO: YOLO divides the image into a grid. For each cell YOLO predicts a class of object in the form of a one-hot vector, several boxes, and a confidence score for each box. So with a 2*2 grid, 3 classes of object and five boxes per cell, the output of our network will have 2*2*(3+5*(4+1)) = 112 values. YOLO v1 itself uses a 7*7 grid with two boxes per cell. Mauricio Menegaz explains that structure really well. I won’t go into that much detail, so I encourage you to read his article. Instead we’ll look at the details of how to implement it in Keras.
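
The size of the output can be checked with a few lines of arithmetic. The helper name `yolo_output_size` is only for illustration; the second call uses a single box per cell, the simplest configuration of the small model shown later.

```python
def yolo_output_size(grid_w, grid_h, nb_classes, nb_boxes):
    # Each box carries x, y, width, height and one confidence score.
    values_per_box = 4 + 1
    return grid_w * grid_h * (nb_classes + nb_boxes * values_per_box)

print(yolo_output_size(2, 2, 3, 5))  # 112
print(yolo_output_size(2, 2, 3, 1))  # 32
```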

How to build the grid mathematically is not described in the YOLO paper, and I wasn’t sure how to implement it. I found another YOLO Keras implementation that made it clear.

The trick is to have the network predict coordinates that are limited in range to a single cell. To limit the values that the network can output, we apply a sigmoid to its output. Sigmoid is a common activation function in deep learning, but here we use it for its property of only taking values between 0 and 1.

Since we only predict values between 0 and 1, we express width and height in the coordinate system of the image. A box as wide as the image has a width of 1.

For positional coordinates, we don’t want the network to predict a center outside of its cell, so we use the coordinate system of the cell. X=0 places the center of the box on the left edge of the cell. X=1 places the center of the box on the right edge of the cell. During training we need to add the offset of the cell so that we are in image space. We do this by adding a tensor of offsets to the network prediction in the loss function.
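
To make the two coordinate systems concrete, here is a small sketch that converts one prediction back to pixels. The function `cell_to_image` and its parameter names are hypothetical; it assumes x, y are relative to the cell and w, h are fractions of the whole image, as described above.

```python
def cell_to_image(cell_x, cell_y, x, y, w, h, grid_w, grid_h, img_w, img_h):
    """Convert one predicted box to pixel coordinates (center and size)."""
    # x, y are in [0, 1] inside the cell: add the cell offset, then scale.
    center_x = (cell_x + x) / grid_w * img_w
    center_y = (cell_y + y) / grid_h * img_h
    # w, h are fractions of the whole image.
    return center_x, center_y, w * img_w, h * img_h

# Center of cell (1, 0) in a 2*2 grid over a 64*64 image.
print(cell_to_image(1, 0, 0.5, 0.5, 0.25, 0.5, 2, 2, 64, 64))  # (48.0, 16.0, 16.0, 32.0)
```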

The bounding box inside the image relative to YOLO cells

A simplified YOLO backend

YOLO uses a backbone of Conv2D, leaky ReLU and max pooling layers for pattern detection, followed by a prediction stage composed of two densely connected layers: a large first one, and a smaller second one that returns the predictions for each of our cells.

My data was too simple for the full YOLO architecture, so I used a simpler model, similar to YOLO but with fewer layers, that was good enough and trains faster.

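import keras
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Reshape
from keras.models import Model

# img_w, img_h, grid_w, grid_h and nb_boxes are defined earlier in the script
# note: Keras image tensors are (height, width, channels); this only matters for non-square images
i = Input(shape=(img_w,img_h,3))
# first convolution block
x = Conv2D(16, (1, 1))(i)
x = Conv2D(32, (3, 3))(x)
x = keras.layers.LeakyReLU(alpha=0.3)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
# second convolution block
x = Conv2D(16, (3, 3))(x)
x = Conv2D(32, (3, 3))(x)
x = keras.layers.LeakyReLU(alpha=0.3)(x)
x = MaxPooling2D(pool_size=(2, 2))(x)
# prediction layers: 3 class values and nb_boxes*5 box values per cell
x = Flatten()(x)
x = Dense(256, activation='sigmoid')(x)
x = Dense(grid_w*grid_h*(3+nb_boxes*5), activation='sigmoid')(x)
x = Reshape((grid_w*grid_h,(3+nb_boxes*5)))(x)
model = Model(i, x)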

The code above looked like a bunch of magic at the beginning (and YOLO is way deeper). I understood mathematical 2D convolution, but I had misunderstood how it is used in deep learning layers. I found the notes from the Stanford CS class to be a very good explanation of convolution layers in image recognition. The most important thing to understand is that a 2D convolution in Keras actually uses 3D kernels. There is one 3D kernel for each output channel we want. Each kernel does the convolution on all input channels at once and produces one new channel.
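
The point about 3D kernels can be seen in a naive NumPy version of the operation. This is only a sketch for understanding, not how Keras computes it: it ignores bias, stride and padding, and the function name `conv2d_valid` is made up for the example. The kernel layout (height, width, input channels, output channels) is the one Keras uses for Conv2D weights.

```python
import numpy as np

def conv2d_valid(image, kernels):
    """Naive convolution. image: (H, W, C_in); kernels: (kh, kw, C_in, C_out)."""
    kh, kw, c_in, c_out = kernels.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w, c_out))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw, :]  # (kh, kw, C_in)
            # Each 3D kernel is multiplied with the patch and summed to one value.
            out[i, j, :] = np.tensordot(patch, kernels, axes=3)
    return out

image = np.random.rand(8, 8, 3)
kernels = np.random.rand(3, 3, 3, 16)  # 16 kernels, each of shape 3x3x3
features = conv2d_valid(image, kernels)
print(features.shape)  # (6, 6, 16)
```

Each of the 16 output channels comes from one 3x3x3 kernel that looks at all three input channels at once.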

Now, let’s see why YOLO picked that specific set of layers.

Strategies of layers combinations

Beyond individual layers, YOLO uses three main techniques to capture information inside images: multiple stacked Conv2D, max pooling, and Conv2D with 1×1 kernels. They are used, respectively, to split images into features, to reduce the image size, and to select the best channels. I didn’t find a good explanation of why this specific architecture is the best. The reasons given below for picking each type of layer are my best guess:

The purpose of multiple stacked Conv2D is to widen the area from which information is captured without increasing the computation too much. As an example, two stacked Conv2D with 3×3 kernels cover a 5×5 pixel area centered on the current pixel. We could use a single 5×5 kernel for the same capture range, but that would require more computation in the convolution layer. Remember that kernels actually are 3x3xnb_input_channels.
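
A quick weight count shows the saving. This sketch assumes the same number of channels (32, an arbitrary choice) going in and out of every layer and ignores biases; `conv_weights` is a hypothetical helper.

```python
def conv_weights(kernel_size, c_in, c_out):
    # One 3D kernel of kernel_size x kernel_size x c_in per output channel.
    return kernel_size * kernel_size * c_in * c_out

channels = 32
stacked_3x3 = 2 * conv_weights(3, channels, channels)
single_5x5 = conv_weights(5, channels, channels)
print(stacked_3x3, single_5x5)  # 18432 25600
```

Two 3×3 layers need 18 weights per pair of channels against 25 for a single 5×5 layer, for the same 5×5 capture area.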

Max pooling has a similar purpose to stacked Conv2D, with a very different process. It shrinks channels by keeping only the most activated value in each small region. This helps reduce the noise and also speeds up the computation, as a 2×2 pooling divides the size of each channel by 4.
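
Here is what a 2×2 max pooling does on a single channel, written with NumPy as a sketch. The helper `max_pool_2x2` is made up for the example and assumes even height and width.

```python
import numpy as np

def max_pool_2x2(channel):
    """2x2 max pooling with stride 2 on a single (H, W) channel, H and W even."""
    h, w = channel.shape
    # Split the channel into 2x2 blocks, then keep the largest value of each block.
    blocks = channel.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

channel = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(channel)
print(pooled)  # [[ 5  7] [13 15]]
```

The 16 input values become 4 output values, which is the division by 4 mentioned above.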

Conv2D with a 1×1 kernel has a very different purpose from Conv2D with a larger kernel, so I treat it separately. It acts as a filter that selects the best features in a set of channels and groups them as a new set of channels. I had difficulties understanding 1×1 kernels at the beginning because I thought that Conv2D was using a 2D kernel. It’s not: the kernel is 3D, with a third dimension equal to the number of input channels. The reason is that when you stack convolutions one after the other, the input of the next layer is not one image but a stack of images (also called feature maps). Conv2D moves along 2 dimensions but processes a 3-dimensional input. Once you know that, it’s clear how a 1×1 convolution is useful for picking information across channels and combining it into a smaller number of channels.
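
Seen this way, a 1×1 convolution is just the same weighted sum over channels repeated at every pixel. The NumPy sketch below illustrates this with arbitrary sizes (32 channels reduced to 8) and no bias.

```python
import numpy as np

features = np.random.rand(6, 6, 32)  # H x W x 32 input channels
weights = np.random.rand(32, 8)      # a bank of 1x1 kernels: 32 inputs -> 8 outputs

# The matrix product applies the same channel mix at every pixel.
reduced = features @ weights
print(reduced.shape)  # (6, 6, 8)
```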

Pack everything in a custom loss function

Now we need to write a function that brings together everything we described above. The model can be trained with the mean squared error loss function and it will produce predictions, but that wouldn’t restrict box predictions appropriately to their cell. Still, YOLO training is mostly MSE with tweaks to improve the training.

The complete YOLO Loss function equation

The loss function is a function that grows when the output of the network is far from the solution and decreases when it is close to the solution. The loss function returns a single number no matter how many dimensions go in. It’s just like playing hot and cold in many dimensions.

The equation for YOLO can look intimidating at first. The first thing to note is that we are only summing squares, so to study how the function grows we can study each part one at a time.

\lambda_coord, \lambda_obj, \lambda_noobj are coefficients to adjust for a given dataset. Adjusting them might help the network train faster and can improve accuracy a little, but they are not very important.

1_ij^obj is 1 when an object exists in the cell, 0 when there is no object. Nullifying a part of the loss function prevents the network from learning in locations where there is nothing to predict.
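
For reference, the loss can be written out in LaTeX as follows. This is transcribed from the YOLO v1 paper, so its notation may differ slightly from the one used in this article: S^2 is the number of cells, B the number of boxes per cell, C the confidence, p(c) the class probabilities, and the paper only uses \lambda_coord and \lambda_noobj.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj}
      \left[ (\sqrt{w_i} - \sqrt{\hat{w}_i})^2 + (\sqrt{h_i} - \sqrt{\hat{h}_i})^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{obj} (C_i - \hat{C}_i)^2 \\
 &+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} 1_{ij}^{noobj} (C_i - \hat{C}_i)^2 \\
 &+ \sum_{i=0}^{S^2} 1_{i}^{obj} \sum_{c \in classes} (p_i(c) - \hat{p}_i(c))^2
\end{aligned}
```

The five lines are, in order: position, size, confidence where there is an object, confidence where there is none, and class.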

Accessing data in the Keras loss function

Keras provides y_true and y_pred as inputs of the loss function. A lot of loss function examples use the whole y_true and y_pred all at once, but YOLO does not. We need to split the data to implement the loss function.

We do it with the Python slice operator. When doing this, it’s important to understand that y_true and y_pred do not contain a single solution and a single prediction but a complete batch. Tensor operations work on the whole batch at once, so this is generally not very apparent in examples. It makes a lot of sense once you know it, as sending a batch to a GPU is far more efficient when the batch is large.

y_true[...,0]   # the first value of every cell, for the whole batch
y_true[...,0:2] # the first two values of every cell, for the whole batch

We must take great care of dimensions when doing this. Tensors try very hard to work for you, and they will often combine (broadcast) shapes in very powerful but very surprising ways. It’s very important to always know which dimensions you are actually adding or multiplying. If you don’t, it’s easy to sum all features together, which may average the error over the whole batch and cause the network to not learn at all, or worse.
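
The following NumPy sketch shows the kind of silent mistake this refers to. NumPy follows the same broadcasting rules as the Keras backend; the shapes (a batch of 4 images, 4 cells, 1 box) are chosen for the illustration.

```python
import numpy as np

batch, cells, nb_boxes = 4, 4, 1
sq_err = np.ones((batch, cells, nb_boxes, 2))  # squared x,y errors
conf = np.ones((batch, cells, nb_boxes))       # 1 where an object exists

per_box = sq_err.sum(axis=-1)                  # (batch, cells, nb_boxes)
per_cell = (per_box * conf).sum(axis=-1)       # (batch, cells)

# Pitfall: multiplying before reducing the x,y axis does not fail here,
# it silently broadcasts (4, 4, 1, 2) * (4, 4, 1) into (4, 4, 4, 2).
wrong = sq_err * conf
print(per_cell.shape, wrong.shape)  # (4, 4) (4, 4, 4, 2)
```

No error is raised in the second case, which is why printing shapes while writing the loss is worth the effort.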

One of the tools that helps is the tensorflow tf.Print function (replaced by tf.print in TensorFlow 2). Keras builds a computation graph from the operations you write in the loss function and doesn’t call the Python function again after that, so a Python print won’t work within it. tf.Print injects a print operation inside that graph to print the content of a tensor while training the network (I suppose it works like that). You can specify how much data and how many samples you want printed. It has been really helpful to me for debugging dimension issues.


Basically the position loss is the sum of all squared differences between desired coordinates and predicted coordinates, for all cells and for all boxes. \lambda_coord is a coefficient that weights the coordinate terms (position and size) relative to the other terms of the loss. 1_ij^obj is worth 0 if the cell contains no object and 1 if the cell contains an object. Its role is to avoid training coordinates when there is no object.

Below is a raw implementation. I skipped lambda_coord as it’s not essential. 1_obj is y_true_conf.

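import numpy as np
from keras import backend as K

# offset of each cell, repeated for each box: shape (grid_w*grid_h, nb_boxes, 2)
grid = np.array([ [[float(x),float(y)]]*nb_boxes   for y in range(grid_h) for x in range(grid_w)])

# group the features after the 3 class values as boxes of 5 values inside cells
pred_boxes = K.reshape(y_pred[...,3:], (-1,grid_w*grid_h,nb_boxes,5))
true_boxes = K.reshape(y_true[...,3:], (-1,grid_w*grid_h,nb_boxes,5))
y_true_conf = true_boxes[...,4]

# add the cell offset to the predicted centers
y_pred_xy   = pred_boxes[...,0:2] + K.variable(grid)
y_true_xy   = true_boxes[...,0:2]

# squared distance per box, masked by 1_obj, summed over the boxes of each cell
xy_loss    = K.sum(K.sum(K.square(y_true_xy - y_pred_xy),axis=-1)*y_true_conf, axis=-1)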

I first generate a grid of offsets. For a 2*2 grid with two boxes per cell it looks like this: [[[0, 0], [0, 0]], [[1, 0], [1, 0]], [[0, 1], [0, 1]], [[1, 1], [1, 1]]]. The first dimension is the list of cells, the second is the list of boxes, and the last holds the x,y offset of each box. There is no dimension for the list of images: the grid is broadcast over the whole batch when it is added to the predictions. As you can see there is an offset for each box.

Then I reshape the features from index 3 to the last, grouping them as boxes inside cells with the K.reshape function. Each box prediction contains 5 values: X, Y, width, height and confidence. Reshaping the data makes the coordinates easier to manipulate. For example pred_boxes[...,0:2] selects the x,y coordinates of all boxes.

The end is just a sum of squares. I sum over the last axis until I’m down to the dimension of the list of cells. It’s important that all partial losses end with the same dimensions so that we can add them together at the end. Always be careful about which dimensions you are summing, because tensors will happily combine different dimensions as long as they are compatible.
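
To follow the shapes step by step, here is the same computation on made-up values with NumPy instead of the Keras backend. It is a sketch: the batch contains a single image, each cell has one box, and the target values are invented for the example.

```python
import numpy as np

grid_w, grid_h, nb_boxes = 2, 2, 1
grid = np.array([[[float(x), float(y)]] * nb_boxes
                 for y in range(grid_h) for x in range(grid_w)])  # (cells, boxes, 2)

# One image; every cell predicts its own center (0.5, 0.5 in cell coordinates).
# The grid has no batch axis: it is broadcast over the images.
pred_xy = np.full((1, grid_w * grid_h, nb_boxes, 2), 0.5) + grid
true_xy = pred_xy.copy()
true_xy[0, 3, 0] += [0.1, 0.2]  # shift the target of the last cell
true_conf = np.ones((1, grid_w * grid_h, nb_boxes))
true_conf[0, 0, 0] = 0.0        # pretend the first cell is empty

xy_loss = (np.square(true_xy - pred_xy).sum(axis=-1) * true_conf).sum(axis=-1)
print(xy_loss.shape)  # (1, 4)
```

The result has one value per cell and per image. Only the last cell contributes here, with 0.1² + 0.2².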


The width and height loss is similar to the position loss. The only difference is that we use the square root of the width and height instead of the width and height themselves. The intuition is that large errors on large boxes are less of a problem than relatively smaller errors on much smaller boxes. I’m not sure how YOLO ensures that width and height predictions are always positive so that taking the square root is ok. In my case I used a sigmoid as the output of my network, so that’s not a problem.

y_pred_wh   = pred_boxes[...,2:4]
y_true_wh   = true_boxes[...,2:4]   # the targets come from true_boxes, not pred_boxes
y_true_conf = true_boxes[...,4]

wh_loss    = K.sum(K.sum(K.square(K.sqrt(y_true_wh) - K.sqrt(y_pred_wh)), axis=-1)*y_true_conf, axis=-1)
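
A small numeric check illustrates the intuition behind the square root. The sizes are invented for the example; both predictions are off by the same 0.05, once on a small box and once on a large one.

```python
import math

def sqrt_sq_error(true_size, pred_size):
    # Squared difference of the square roots, as in the size loss.
    return (math.sqrt(true_size) - math.sqrt(pred_size)) ** 2

# Same absolute error of 0.05, on a small box and on a large box.
small = sqrt_sq_error(0.10, 0.15)
large = sqrt_sq_error(0.80, 0.85)
print(small, large)
```

The error on the small box is penalised several times more than the same error on the large box, whereas a plain squared difference would give both the same loss.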


Class loss is as simple as possible. We compute the square of the difference. That’s it. 1_obj has its usual meaning.

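y_true_class = y_true[...,0:3]   # the 3 class values come first
y_pred_class = y_pred[...,0:3]
y_true_conf = true_boxes[...,4]  # shape (batch, cells, nb_boxes)

# this product relies on broadcasting and assumes nb_boxes is 1
clss_loss  = K.sum(K.square(y_true_class - y_pred_class)*y_true_conf, axis=-1)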

Confidence (IOU)

The confidence tells how sure the network is that a box accurately predicts an object. For true boxes, the confidence is of course 1. They are true boxes, so we have absolute confidence that there is an object here. If the network predicts exactly the same box, the confidence is 1. If the network predicts some other box, we want the network to output the proportion of intersection between the true box and the predicted box. If there is no object, we want the confidence to drop to 0.

IOU = the surface of intersection/the surface of the union

In order to train the network for this, we compute the IOU between the currently predicted box and the true box. On each axis, the overlap is half the sum of the two sizes minus the distance between the two centers, clamped at 0 when the boxes do not touch. We can do both axes at once because the operations are the same horizontally and vertically. Actually we even do this for the whole batch at once. We can then compute the intersection and the union.

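# overlap on each axis: half the summed sizes minus the distance between centers, clamped at 0
intersect_wh = K.maximum(K.zeros_like(y_pred_wh), (y_pred_wh + y_true_wh)/2 - K.abs(y_pred_xy - y_true_xy) )
# the overlap cannot exceed the smaller box (case where one box contains the other)
intersect_wh = K.minimum(intersect_wh, K.minimum(y_pred_wh, y_true_wh))
intersect_area = intersect_wh[...,0] * intersect_wh[...,1]
true_area = y_true_wh[...,0] * y_true_wh[...,1]
pred_area = y_pred_wh[...,0] * y_pred_wh[...,1]
union_area = pred_area + true_area - intersect_area
iou = intersect_area / union_area

y_pred_conf = pred_boxes[...,4]
conf_loss = K.sum(K.square(y_true_conf*iou - y_pred_conf)*y_true_conf, axis=-1)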
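
To check IOU values by hand, here is a standalone NumPy sketch. It is a variant of the code above, not the same formula: it computes the overlap from the box edges, which also covers the case where one box lies entirely inside the other. It assumes centers and sizes are expressed in the same unit, and the function name `iou` is chosen for the example.

```python
import numpy as np

def iou(xy_a, wh_a, xy_b, wh_b):
    """IOU of boxes given by center (x, y) and size (w, h); the last axis has size 2."""
    # Overlap on each axis: from the right-most left edge to the left-most right edge.
    low = np.maximum(xy_a - wh_a / 2, xy_b - wh_b / 2)
    high = np.minimum(xy_a + wh_a / 2, xy_b + wh_b / 2)
    intersect_wh = np.maximum(high - low, 0.0)
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    union_area = wh_a[..., 0] * wh_a[..., 1] + wh_b[..., 0] * wh_b[..., 1] - intersect_area
    return intersect_area / union_area

xy = np.array([[0.5, 0.5]])
wh = np.array([[0.4, 0.4]])
same = iou(xy, wh, xy, wh)                             # identical boxes
shifted = iou(xy, wh, xy + np.array([0.2, 0.0]), wh)   # shifted by half a width
print(same, shifted)
```

Identical boxes give an IOU of 1, and shifting one box by half its width drops it to one third (half of the box overlaps, and the union is one and a half boxes).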

Summing partial loss

And now we are done! We just sum all our partial losses together and return the result. The only issue you may hit at this point is that some losses don’t have the right dimensions. If you have issues, try adding the losses one at a time and fix problems as they appear.

loss =  clss_loss + xy_loss + wh_loss + conf_loss

Predictions of the YOLO algorithm


I hope this article made the details of implementing YOLO in Keras clearer and has been helpful to you. As I said in the introduction, the code for this humble YOLO implementation is on github.