
This week I've been working on the Facial Keypoints Detection competition hosted by Kaggle. The objective of the competition is to predict keypoint positions on face images.

We start out with three files: the training set training.csv, the test set test.csv, and IdLookupTable.csv, which lists the 27214 keypoints we need to predict.

Let's first load these files into Python:

import numpy as np
import pandas as pd

df = pd.read_csv('training.csv')
testdf = pd.read_csv('test.csv')
lookup = pd.read_csv('IdLookupTable.csv')

Inspecting our training set with df.info(), df.shape, and df.describe() gives us some useful information. There are 31 columns: 30 hold keypoint coordinates, and the last one, Image, is a space-separated string of pixel values, ordered by rows. Also important: quite a few keypoint values are missing, so we'll have to do something about that.
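For example, a quick way to see which keypoints are missing and how often:

# number of missing values per keypoint column
print(df.isnull().sum())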

Given this info, we can do some more preprocessing on our data:

# convert each image string into a numpy array of pixel values
for d in [df, testdf]:
    d['Image'] = d['Image'].apply(lambda im: np.fromstring(im, sep=' '))

# fill missing values with the mean of the respective column
df = df.fillna(df.mean(numeric_only=True))

# create training samples/labels for model fitting
X = np.vstack(df['Image'].values).astype(float)
y = df.drop('Image', axis=1).values
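
As a sanity check, we can reshape a flattened image back into its pixel grid and plot the keypoints on top. A minimal sketch, assuming matplotlib and the 96x96 image size given in the competition's data description:

import matplotlib.pyplot as plt

# restore the row-ordered pixel string to its 96x96 grid
img = X[0].reshape(96, 96)
plt.imshow(img, cmap='gray')
# the label columns alternate x- and y-coordinates
plt.scatter(y[0][0::2], y[0][1::2], c='r', marker='+')
plt.show()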

Voila, we can now use X and y to train a model on our data. We have several options here: neural networks could potentially be very accurate. However, I'm not very proficient with those yet, so I chose a relatively simple regression model, k-nearest neighbors. This model predicts a value by finding a sample's k nearest neighbors in the training set and returning the mean of their target values.
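
To make that concrete, here's a toy sketch (made-up one-dimensional data, not the competition data) of what a prediction with k=3 looks like under the hood:

X_toy = np.array([0.0, 1.0, 2.0, 10.0])
y_toy = np.array([0.0, 1.0, 2.0, 10.0])

dists = np.abs(X_toy - 1.2)        # distance from the query to every sample
nearest = np.argsort(dists)[:3]    # indices of the 3 closest samples
print(y_toy[nearest].mean())       # 1.0, the mean of targets 1.0, 2.0 and 0.0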

from sklearn.neighbors import KNeighborsRegressor as KNR

estim = KNR()
estim.fit(X, y)

This trains a k-nearest neighbors regression model with the default parameters scikit-learn provides. Using this model, we can make our first submission to the competition! We need to predict the keypoints specified in the lookup data; the relevant fields are ImageId (which points to a row in test.csv) and FeatureName (which names the keypoint to predict).

I simply predicted every keypoint for the whole test set and then molded the results into the submission format RowId, Location.

# stack all test images into one numpy array
X_test = np.vstack(testdf['Image'].values).astype(float)

# predict all keypoints for the images in X_test
predictions = estim.predict(X_test)

# now create the result data and write to csv
preddf = pd.DataFrame(predictions, columns=df.columns[:-1])

rows = []
for _, d in lookup.iterrows():
    # ImageId is 1-based, so subtract 1 to index into preddf
    location = preddf.loc[d['ImageId'] - 1, d['FeatureName']]
    rows.append({'RowId': d['RowId'], 'Location': location})

results = pd.DataFrame(rows, columns=['RowId', 'Location'])
results['RowId'] = results['RowId'].astype(int)
results.to_csv('predictions.csv', index=False)
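
Before uploading, a quick sanity check (my own addition; the submission presumably needs exactly one Location per RowId in the lookup table):

# every lookup row should have received exactly one prediction
assert results.shape[0] == lookup.shape[0]
print(results.head())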

And we're done! Submitting this placed me halfway up the leaderboard. Some ways to increase accuracy include experimenting with k (the default is 5) and being smarter about imputation (we only filled empty fields with the column mean; manually labeling the images, or otherwise obtaining more precise values, would be far more accurate).
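
For the first of those, here's a sketch using scikit-learn's GridSearchCV to cross-validate a few candidate values of k (the candidate list is my own arbitrary choice):

from sklearn.model_selection import GridSearchCV

# pick the neighbor count with the best cross-validated squared error
grid = GridSearchCV(KNR(), param_grid={'n_neighbors': [2, 5, 10, 20, 50]},
                    scoring='neg_mean_squared_error', cv=3)
grid.fit(X, y)
print(grid.best_params_)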