A simple natural language category classifier with Keras

Neural networks (NN) are a very flexible tool for processing natural language and images. They are versatile and can do things that imperative code can't. I wanted to learn a bit about them, and I chose Keras as it seems to be the best library to learn about NN with: Keras stays very close to the high-level description of NN that papers use.

So for testing, I looked for a simple enough project. At work, I had a lot of news articles sorted by category. It seemed nice to create a NN that would be able to automatically classify individual articles into each category.

It would provide a little bit of value by automatically proposing an appropriate category for a feed, or by proposing alternative classifications for individual articles for better navigation.

I had read a few NN tutorials, but going from the explanations to a working example on my own data was not trivial. As often when you learn a new topic, there is a lot of vocabulary, and articles are a bit painful to read until you master enough of it. Preparing the data for the NN was one of the difficult parts too.

In this article I'm going to move quickly to the implementation. This is a very beginner-oriented article, written mostly to consolidate my own understanding. I'll start with the NN itself, then we'll prepare the data for training it.

Description of the NN

The NN we are going to use is defined with Keras this way:

from keras.models import Sequential
from keras.layers import Dense, Activation

model = Sequential()
model.add(Dense(512, input_dim=10000))
model.add(Activation('relu'))
model.add(Dense(8))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

I didn't include the inputs and outputs yet, so that we can look at how the NN is built first. So what do we have here?

Sequential model

model = Sequential()

Sequential is the simplest way to define a NN with Keras. It means we are going to stack each layer over the previous one.

First Dense

model.add(Dense(512, input_dim=10000))

Dense is a classical NN layer. Because it is the first layer, it is also considered the input layer. The input_dim parameter is the dimension of the data the NN expects as input.

That layer has 512 neurons. Those 512 neurons output 512 values, which become the inputs of the next layer. We don't need to specify the input size of the next layer because Keras infers it automatically.

512 is also the dimension the NN uses to represent your data internally. Larger values learn more slowly. They are also more subject to overfitting, which means that instead of generalizing the problem, the network over-specializes on the provided examples, reducing its ability to predict good outputs for unknown inputs. Smaller values may not be enough to generalize the problem.

Eventually, everybody tries different values to find something that performs well.

Role of the first Activation

model.add(Activation('relu'))

Activation adds what is called the activation function. NN use activation functions because stacking linear operations on top of linear operations ends up being a linear operation itself, making the network equivalent to a single layer. Activation takes the type of function to use as activation. For internal layers, relu is often a good pick.
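To convince yourself of that, here is a tiny numpy check (just an illustration, not part of the Keras model): two stacked weight matrices with no activation between them collapse into a single matrix product.

import numpy as np

x = np.random.rand(10)        # a fake input of dimension 10
W1 = np.random.rand(512, 10)  # first "layer"
W2 = np.random.rand(8, 512)   # second "layer"

stacked = W2 @ (W1 @ x)       # two linear layers, no activation
collapsed = (W2 @ W1) @ x     # one equivalent linear layer
print(np.allclose(stacked, collapsed))  # True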

The ConvNetJS demo is great for seeing how the activation function changes the shape of the learned data. Try changing the tanh to relu to experiment.

ConvNetJS demo: Classify toy 2D data (cs.stanford.edu)

The second Dense layer

model.add(Dense(8))

This is the last layer of the NN, so it is its output. As I have 8 categories, the NN needs to output 8 values, so that layer has 8 neurons. It'll output an array such as [2, 6, 0, 1, 0, 0, 0.5, 0].

Softmax Activation

model.add(Activation('softmax'))

The previous Dense layer outputs real numbers. Looking at a single output, it's not easy to tell what a 6 means, for example. Its interpretation depends on the other values.

Softmax squashes the value of each output between 0 and 1 and normalises the whole vector so that the probabilities of the classes sum to 1. For a single output, softmax then gives the probability that this class is the right one.
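For example, applying softmax by hand to the output above (a quick numpy sketch, not Keras code):

import numpy as np

v = np.array([2, 6, 0, 1, 0, 0, 0.5, 0])
p = np.exp(v) / np.exp(v).sum()  # softmax: exponentiate, then normalise
print(p.round(2))  # the 6 becomes a probability of about 0.96
print(p.sum())     # 1.0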

Building the NN

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

We are done defining our NN. categorical_crossentropy is the loss function to use when classifying into categories. It expects the output to be a vector of class probabilities, which is exactly what softmax produces.

I didn't research the other parameters much. Some optimizers may work better on certain types of data, but for my use case, most of them worked well.

Summarizing

I found model.summary() to be very helpful at the beginning. I didn't see a lot of tutorials use it, probably because it doesn't provide a lot of information.

Still, it helps understanding to see how parameters affect the dimensions of the layers of the NN. It also helps with debugging: when there is an error, Keras uses the name of the layer to tell you where the issue is, and when I see a 1 or 2 in a layer name, I always have difficulty knowing whether the count starts at 0 or 1. It made me scratch my head until I found summary().
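Calling it is a single line, which prints the table below:

model.summary()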

Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 512)               5120512
_________________________________________________________________
activation_1 (Activation)    (None, 512)               0
_________________________________________________________________
dense_2 (Dense)              (None, 8)                 4104
_________________________________________________________________
activation_2 (Activation)    (None, 8)                 0
=================================================================
Total params: 5,124,616
Trainable params: 5,124,616
Non-trainable params: 0

Training the NN

Training is the part where all the coefficients are chosen so that the NN predicts the closest output for a given input. Keras does it in one line, but it's actually where most of the NN magic happens.

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=2,
          verbose=1,
          validation_split=0.1)

The training itself is done by an algorithm named backpropagation. A very clear explanation of how backpropagation works can be found on Matt Mazur's blog: A Step by Step Backpropagation Example (mattmazur.com)

x_train and y_train are the inputs and outputs of the NN used for training. Both are multidimensional arrays. The first row of x_train is the first example input; the first row of y_train is the corresponding expected output. Each row of x_train is given to the NN and the neuron coefficients are corrected toward the corresponding y_train value. Doing that once for all the data is called an epoch.

epochs=2 means doing that process twice. If you have a lot of data, you may use only 1, but most of the time more epochs make better use of the data.

batch_size is how many samples are used for one forward pass and one backward pass (from the Keras documentation). I find that a little bit unclear; here is my guess, and it might be wrong. Batches exist to solve a performance problem: updating the weights of each neuron after each sample is slow. It may not be very slow on CPU, but on GPU it would mean transferring the weights to the model after each forward pass. Also, on many-core architectures, batches allow passes to be parallelized.

My guess is that Keras does many forward passes, computes the new weights, but only updates them every batch_size passes. The nicest explanation I found is on machinelearningmastery, which actually calls the algorithm mini-batch gradient descent: A Gentle Introduction to Mini-Batch Gradient Descent and How to Configure Batch Size (machinelearningmastery.com)
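To make that guess concrete, here is a toy sketch of mini-batch updates in plain numpy. This is my reading of the idea, not Keras's actual code, and gradient() is a hypothetical function returning the loss gradient for a batch.

import numpy as np

def train_one_epoch(x, y, weights, gradient, batch_size, lr=0.01):
    # Shuffle so that batches differ from one epoch to the next
    order = np.random.permutation(len(x))
    for start in range(0, len(x), batch_size):
        batch = order[start:start + batch_size]
        # One forward + backward pass over the whole batch,
        # then a single weight update
        weights -= lr * gradient(weights, x[batch], y[batch])
    return weights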

Reading CSV

CSV is often used as an input format for NN. My data format is simple: the title and text of the article in the first field, and the category in the second field. It's possible to read it this way:

import pandas

dataframe = pandas.read_csv('articles.csv', header=None, escapechar='\\', na_filter=False)
dataset = dataframe.values
texts = dataset[:9000, 0]
categories = dataset[:9000, 1]
test_texts = dataset[9000:, 0]
test_categories = dataset[9000:, 1]

To evaluate the training, ML practitioners usually split off a part of the data and use it to verify how good the trained NN is at predicting results it has never seen.

With Python it's really easy to do: dataset[:9000] returns the first 9000 rows and dataset[9000:] returns all rows after the 9000th. The first ones are the training data, the last ones are the validation data.
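A quick way to convince yourself of the slicing direction:

import numpy as np

data = np.arange(5)  # array([0, 1, 2, 3, 4])
print(data[:3])      # the first 3 rows: [0 1 2]
print(data[3:])      # everything from row 3 on: [3 4]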

The input: representing natural language numerically

This was my most head-scratching issue at the beginning. A NN expects numbers as inputs, but it wasn't obvious to me how to convert a list of words into a list of numbers.

We could pick anything, like taking a dictionary and assigning an index to each word, but that wouldn't be a great idea (or at least not in this case). Not all transformations are equal.

What we actually want is a transformation that preserves some of the meaning of the information. We also want a transformation with a fixed input dimension. Sentences as they are can't be used as input because their lengths vary. Preserving the order of the words in the sentence is also only useful if we are going to exploit that order; the NN we defined doesn't know how to use it, so preserving that information is pointless.

The representation I chose is to present each sentence as a row of a matrix where each column represents one word. In the simplest version of this, we could flag 1 if the word is present and 0 if it isn't. Here we use a tf-idf representation instead, which is a statistically more representative version of the data. I chose it because it gave better results.

from keras.preprocessing.text import Tokenizer

tk = Tokenizer(num_words=10000)
tk.fit_on_texts(texts)
x_train = tk.texts_to_matrix(texts, mode='tfidf')
x_test = tk.texts_to_matrix(test_texts, mode='tfidf')
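It's worth checking that the resulting matrix matches the input_dim=10000 we gave the first Dense layer:

print(x_train.shape)  # (9000, 10000): one row per article, one column per word
print(x_test.shape)   # (number of test articles, 10000)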

What is important to understand is that the way we present data to the NN orients how the NN learns. There are different ways of representing data; I'll try to write an article about a few of them. Each representation of the data also requires an appropriate NN structure.

The output: representing categories numerically

As we have exactly one label per text, we can represent it as a vector as well, usually called a one-hot vector: a vector where each category is represented by one column, with a 1 in the column of the text's category and 0 everywhere else. Here is the one-hot vector for category 2:

[0,1,0,0,0,0,0,0]

It is very close to what the Keras Tokenizer and texts_to_matrix produce, but the Keras Tokenizer reserves index 0, which adds an extra useless column. Most people prefer to use LabelEncoder from sklearn.

import keras
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(categories)
num_categories = encoder.transform(categories)
y_train = keras.utils.to_categorical(num_categories)
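LabelEncoder also keeps the mapping around, which is handy later for turning the NN's numeric predictions back into category names:

print(encoder.classes_)                # the category names, in encoded order
print(encoder.inverse_transform([1]))  # the name behind column 1 of the one-hot vector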

Results

This NN, as simple as it is, performed quite well. It is able to classify an article's category with 75% accuracy, which is not too bad given that there are 8 categories. It could probably be improved by giving it more articles for each category.

One thing I hadn't thought about is that the differences between my classification and what the NN outputs are even more interesting than a perfect 100% match; 100% wouldn't actually teach me anything. Correct matches often scored very close to 100% in the correct category. Invalid matches often had more average scores, such as a news item being part current events, part health, for example if it was talking about a new drug. Via the softmax, the network actually provides the probability of a news item belonging to each category, and that could be helpful for proposing alternative categories.
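As a sketch of that idea (reusing the trained model, x_test and encoder from above), numpy's argsort extracts the most probable categories for each article:

import numpy as np

probabilities = model.predict(x_test)  # one softmax vector per article
# Indices of the two most probable categories for the first article
top2 = np.argsort(probabilities[0])[-2:]
print(encoder.inverse_transform(top2))  # the two best candidate category names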