In my recent experiments with Keras I decided to try my hand at the classic handwritten digit dataset MNIST. MNIST ("Modified National Institute of Standards and Technology") is considered the standard "hello world" dataset of computer vision. The Digit Recognizer Kaggle competition tasks us with correctly identifying images of handwritten digits.
MNIST dataset
The training set train.csv contains 42000 images of 28x28 pixels (784 pixels in total). Each image is labeled with the digit it represents (i.e., we have 10 classes: 0-9). The test set test.csv consists of 28000 images of the same dimensions, but of course without labels.
Let's first load these files into Python using pandas:
import pandas as pd
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
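A quick sanity check confirms the shapes described above (the training set has one label column plus 784 pixel columns):
print(train.shape)  # (42000, 785): 'label' column + 784 pixel columns
print(test.shape)   # (28000, 784): pixel columns only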
We'll start by splitting up the training images and labels. We will convert the images into a numpy array with dimensions (samples, height, width, channels). Since the images are grayscale, we only have one channel. The labels should be of dimension (samples, label). While we're at it, we'll also reshape the test set to the same dimensions as the training set.
x_train = train.drop(columns=['label']).values.reshape(-1, 28, 28, 1)
y_train = train['label'].values.reshape(-1, 1)
x_test = test.values.reshape(-1, 28, 28, 1)
Before we go any further, we'll also normalize the training and test images to zero mean and unit standard deviation, using statistics computed on the training set only. Clipping the standard deviation ensures we don't divide by zero for pixels that are constant across the training set:
import numpy as np
x_mean = x_train.mean(axis=0)
x_std = np.clip(x_train.std(axis=0), 1e-6, None)
x_train = (x_train - x_mean) / x_std
x_test = (x_test - x_mean) / x_std
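As a quick check that the transformation worked, every pixel should now have a mean close to zero, and every non-constant pixel a standard deviation close to one:
print(np.abs(x_train.mean(axis=0)).max())  # ~0 for every pixel
print(x_train.std(axis=0).max())           # ~1 for non-constant pixels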
Since we're dealing with categorical class labels, we will one-hot encode the training labels. This is necessary for our Keras model, and fortunately Keras provides a function to do this for us:
from keras.utils import to_categorical
y_train = to_categorical(y_train)
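For example, a label of 3 becomes a length-10 vector with a one at index 3:
# to_categorical turns integer labels into one-hot rows:
#   3 -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(y_train.shape)  # one column per class: (42000, 10)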
Easy enough! Now, finally, we split our training set into a training portion and a validation portion; the latter is used to monitor the performance of the model during training. Since the number of training samples is so large, we will only use 5% for validation; this still gives us 2100 validation images, roughly 210 from each class. We will use sklearn for this:
from sklearn.model_selection import train_test_split
x_train, x_val, y_train, y_val = train_test_split(x_train, y_train, test_size=.05)
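The split above picks validation samples uniformly at random, so the per-class counts are only approximate. If you want the class proportions preserved exactly, train_test_split also accepts a stratify argument (a sketch, converting the one-hot labels back to integers for stratification):
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=.05, stratify=y_train.argmax(axis=1))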
CNN
Okay, time to define our model! We will define a simple convolutional neural network (CNN) with six convolution layers, dropout and pooling after each pair of convolution layers, and two densely connected layers on top. The last layer uses a softmax activation function to do our classification; the rest of the layers use relu as their activation function. For the optimizer we will use rmsprop, and as a metric we will use accuracy. The loss function we use is categorical_crossentropy, since we're dealing with a categorical classification problem with more than 2 classes.
import keras.models as km
import keras.layers as kl
optimizer = 'rmsprop'
loss = 'categorical_crossentropy'
metrics = ['accuracy']
model = km.Sequential()
# First convolution pool
model.add(kl.Conv2D(32, 3, padding='same', input_shape=x_train.shape[1:]))
model.add(kl.Activation('relu'))
model.add(kl.Conv2D(32, 3))
model.add(kl.Activation('relu'))
model.add(kl.MaxPooling2D())
model.add(kl.Dropout(.25))
# Second convolution pool
model.add(kl.Conv2D(64, 3, padding='same'))
model.add(kl.Activation('relu'))
model.add(kl.Conv2D(32, 3))
model.add(kl.Activation('relu'))
model.add(kl.MaxPooling2D())
model.add(kl.Dropout(.25))
# Third convolution pool
model.add(kl.Conv2D(64, 3, padding='same'))
model.add(kl.Activation('relu'))
model.add(kl.Conv2D(64, 3))
model.add(kl.Activation('relu'))
model.add(kl.MaxPooling2D())
model.add(kl.Dropout(.25))
# Dense top layers
model.add(kl.Flatten())
model.add(kl.Dense(256))
model.add(kl.Activation('relu'))
model.add(kl.Dense(y_train.shape[1]))
model.add(kl.Activation('softmax'))
model.compile(optimizer, loss, metrics=metrics)
model.summary()
model.summary()
will print a layout of the compiled network.
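As a sanity check on the summary, we can trace the spatial dimensions by hand: a 'same'-padded convolution preserves the size, a 'valid' (the default) 3x3 convolution shaves one pixel off each side, and each 2x2 max-pool halves the size (rounding down):
# 28x28 -> Conv(same) 28x28 -> Conv(valid) 26x26 -> MaxPool 13x13
# 13x13 -> Conv(same) 13x13 -> Conv(valid) 11x11 -> MaxPool  5x5
#  5x5  -> Conv(same)  5x5  -> Conv(valid)  3x3  -> MaxPool  1x1
# Flatten therefore sees 1 * 1 * 64 = 64 features going into the dense layers.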
Data augmentation
In order to reduce overfitting and improve the generalization of our network,
we will use Keras' ImageDataGenerator
to provide us with augmented training
data. We can specify various augmentation parameters such as random rotation
range, zoom range, width/height shift range and more. Once fitted to our
training data, the generator will supply the network with batches of training
data augmented on the fly.
from keras.preprocessing.image import ImageDataGenerator
data_gen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=.3,
    height_shift_range=.3,
    zoom_range=.3
)
data_gen.fit(x_train)
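To get a feel for what the generator produces, we can pull a single batch from it and plot a few augmented digits (a quick sketch):
import matplotlib.pyplot as plt
# Draw one augmented batch of 9 images and display them in a 3x3 grid
batch_x, batch_y = next(data_gen.flow(x_train, y_train, batch_size=9))
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(batch_x[i].reshape(28, 28), cmap='gray')
    plt.axis('off')
plt.show()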
Training the model
Time to train the network! We create two callbacks to be used during training. The first callback will only save the best network (by validation accuracy) acquired during training to disk. The second callback will reduce the learning rate of the network during training when a plateau in the validation loss score is encountered.
import keras.callbacks as kc
checkpoint = kc.ModelCheckpoint('best_model.h5', monitor='val_acc', save_best_only=True)
reduce_lr = kc.ReduceLROnPlateau(monitor='val_loss')
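Both callbacks are used with their default settings here; ReduceLROnPlateau also takes factor and patience arguments if you want to control how aggressively the learning rate drops (the values below are only an illustration):
# Illustrative only: halve the learning rate after 3 stagnant epochs
reduce_lr = kc.ReduceLROnPlateau(monitor='val_loss', factor=.5, patience=3)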
Now we define some learning parameters and call the fit_generator method on our model, using the training data generator as input. With steps_per_epoch=500 and a batch size of 32, each epoch draws 500 batches, i.e. 16000 augmented images:
n_epoch = 50
batch_size = 32
results = model.fit_generator(data_gen.flow(x_train, y_train, batch_size),
                              validation_data=(x_val, y_val),
                              callbacks=[checkpoint, reduce_lr],
                              steps_per_epoch=500,
                              epochs=n_epoch,
                              verbose=1)
After training, we can use the obtained results
object to visualize our accuracy
and loss curves:
import matplotlib.pyplot as plt
plt.figure(figsize=[10, 4])
plt.plot(results.history['acc'], 'r')
plt.plot(results.history['loss'], 'g')
plt.plot(results.history['val_acc'], 'b')
plt.plot(results.history['val_loss'], 'y')
plt.legend(['acc', 'loss', 'val_acc', 'val_loss'])
plt.suptitle('Training curves')
plt.xlabel('Number of epochs')
plt.ylabel('Score')
plt.show()
This is exactly what we like to see! Training and validation accuracy increase, while both losses decrease. This indicates that little overfitting occurs and the network is working well.
Submission time!
Once training is done, the best model found is saved to best_model.h5, as we specified in the checkpoint callback. Now we can load that model and predict the labels of the test set. We write the results into a CSV file which we can submit to Kaggle!
model = km.load_model('best_model.h5')
predictions = model.predict(x_test)
# Pick the highest probability class label
y_pred = predictions.argmax(axis=1)
with open('submission.csv', 'w') as f:
    f.write('ImageId,Label\n')
    idxs = range(y_pred.shape[0])
    f.write('\n'.join([f'{i+1},{y_pred[i]}' for i in idxs]))
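Equivalently, the same file can be written with pandas (a sketch doing the same thing with a DataFrame):
# Build the submission as a DataFrame and let pandas handle the CSV writing
submission = pd.DataFrame({'ImageId': np.arange(1, len(y_pred) + 1),
                           'Label': y_pred})
submission.to_csv('submission.csv', index=False)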
Submitting these predictions should already give you an accuracy upwards of 99%. There is room for improvement; the best submissions on Kaggle have 100% accuracy! This is where you get creative. :)