This post proposes an idea for rating the performance of a fuzzy binary classifier (yes, I made that name up myself). A friend asked me if I knew of a way to measure accuracy for a classifier that predicts the emotion shown in pictures of human faces as a probability distribution (e.g. 60% sad, 30% angry, 10% happy). The labels for the dataset, however, are binary: if the emotion of face n is sad, the label row has a 1 in the sad column and zeroes in all other columns.

My idea is actually very simple: sum the probabilities reported for the actual emotions, and compare that sum to the total number of pictures in the dataset, which is the highest score a perfect classifier could reach. This way of rating performance takes into account how much weight the classifier put on the correct answer (in other words, the probability it reported for the actual emotion).
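To make the arithmetic concrete, here is a tiny hand-made example. The three rows of probabilities and their labels are made up purely for illustration and do not come from any real classifier.

```python
# Hypothetical predictions for 3 pictures over 3 emotions
# (columns: sad, angry, happy); each row sums to 1.
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
]
# Hypothetical true emotions, stored as column indices.
labels = [0, 2, 2]

# Sum the probability the classifier gave to the true emotion of each picture.
score = sum(row[label] for row, label in zip(probs, labels))

print('Performance: {:.2f}/{}'.format(score, len(labels)))
```

The score here is 0.6 + 0.3 + 0.8 = 1.7 out of a possible 3. A plain hit-or-miss accuracy would count the second picture as simply wrong, while this score still credits the 0.3 the classifier assigned to the right emotion.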

I wrote a little snippet of Python code to demonstrate my thinking. I generated some random test data, and generated labels for that test data based on the following rule: with probability .7, the emotion with the highest probability in the test row is the actual emotion; otherwise a random emotion is chosen. This makes the test labels a little more realistic (since the classifier is not going to be 100% correct every time).

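import numpy as np

# Generate classifier output: 100 pictures, 10 emotions per picture.
X = np.ones((100, 10))
for i in range(X.shape[0]):
    # A sample from a Dirichlet distribution sums to 1, so each row is a
    # valid probability distribution.
    X[i, :] = np.random.dirichlet(X[i, :])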

# Generate labels: with probability .7 the largest score in the output is
# the correct label, and with probability .3 a random label is chosen.
y = np.ones(100)
for i in range(y.shape[0]):
    p = np.random.rand()
    if p < 0.7:
        y[i] = np.argmax(X[i, :])
    else:
        y[i] = np.random.randint(10)

So, in summary: X is a 100x10 matrix of classifier predictions (100 pictures with probabilities for 10 emotions per picture). y is a vector of labels for the test data, where each label is stored as the index of the correct emotion (an index in the range 0-9).

Now we can measure the performance on these two arrays using the following code:

# Calculate performance by summing probabilities of actual emotions
# reported by the classifier.
score = 0
for i in range(y.shape[0]):
    score += X[i, y[i].astype(int)]

print('Performance: {:.2f}/{}'.format(score, 100))

Fairly straightforward! This is not the most sophisticated way of doing things, but it might at least give you an idea of how accurate your classifier actually is.
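If your labels are still in the one-hot form described at the start of this post (a 1 in the column of the actual emotion and zeroes elsewhere), the same score can be computed without a loop. The sketch below is one possible way to do it; the function name `soft_score` and the demo arrays `X_demo` and `Y_demo` are names I made up for this example.

```python
import numpy as np

def soft_score(probs, onehot):
    """Sum of the probabilities assigned to the true class of each row."""
    probs = np.asarray(probs, dtype=float)
    onehot = np.asarray(onehot, dtype=float)
    if probs.shape != onehot.shape:
        raise ValueError('probs and onehot must have the same shape')
    # Multiplying by the one-hot matrix zeroes out every wrong-class
    # probability, so only the true-class probabilities are summed.
    return float((probs * onehot).sum())

# Random demo data (assumption: 100 pictures, 10 emotions, as above).
rng = np.random.default_rng(0)
X_demo = rng.dirichlet(np.ones(10), size=100)  # each row sums to 1
idx = rng.integers(0, 10, size=100)            # random true emotions
Y_demo = np.eye(10)[idx]                       # one-hot label matrix

score = soft_score(X_demo, Y_demo)
n = X_demo.shape[0]
print('Performance: {:.2f}/{} ({:.1%})'.format(score, n, score / n))
```

Dividing the score by the number of pictures gives a value between 0 and 1, which makes it easier to compare results across datasets of different sizes.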