Confusion Matrix Deep Dive

Binary Classification Evaluation

Posted by Fan Gong on Jan 25, 2020

1. Confusion Matrix

A confusion matrix is a table used to measure the performance of binary classification results. It is so confusing (especially the terminology) that I had a hard time remembering it at first. So here I will try to make a clear review.

Let's make a specific example here. Suppose our binary classifier tries to classify patients into a healthy (negative) and an unhealthy (positive) group, and here is the result:

              Predicted Yes    Predicted No
Actual Yes    TP = 1           FN = 39
Actual No     FP = 15          TN = 100
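To make the layout concrete, here is a minimal sketch (assuming scikit-learn is available) that computes this kind of matrix from label arrays; the labels below are hypothetical. Note that sklearn orders the classes as [negative, positive], so the matrix comes back as [[TN, FP], [FN, TP]].

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = unhealthy (positive), 0 = healthy (negative)
y_true = np.array([1, 1, 0, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 0, 0, 0])

# Rows are actual classes, columns are predicted classes, ordered [0, 1],
# so the layout is [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
```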

1.1 {True, False} {Positive, Negative}

True/False tells whether the classification is in fact correct or not; Positive/Negative tells the label our classifier gives. Therefore:

  • True Positive: You predict as positive and in fact it is also positive
  • True Negative: You predict as negative and in fact it is also negative
  • False Positive: You predict as positive but in fact it is negative
  • False Negative: You predict as negative but in fact it is positive

So you can easily see that FP is actually the type I error: we predict positive (i.e. we reject H0, where H0 says the case is negative) but in fact it is negative, so H0 was true.

In the same way, FN corresponds to the type II error.

1.2 Different Metrics

After understanding the basic parts of the confusion matrix, let's move on to some confusing metrics.

Total Level

  • Accuracy: (TP+TN)/Total = 101/155 = 65.2%
  • Classification Error: (FN+FP)/Total = 1 - Accuracy = 34.8%

These two metrics give us a general idea of how our model performs. They tell the correctness at the total level. We use them when we know that our data is roughly balanced.

Since these two metrics have exactly opposite meanings and always sum to one, we can just pick one of them to use.
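As a tiny sketch using the counts from the example matrix above:

```python
# Counts taken from the example confusion matrix above
tp, fn, fp, tn = 1, 39, 15, 100
total = tp + fn + fp + tn

accuracy = (tp + tn) / total   # 101 / 155 ≈ 0.652
error = (fn + fp) / total      # 54 / 155 ≈ 0.348, which equals 1 - accuracy
print(f"accuracy={accuracy:.3f}, error={error:.3f}")
```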

Actual group level

  • True Positive Rate: TP/(TP+FN) = 1/(1+39) = 2.5%, also called recall or sensitivity
  • False Negative Rate: FN/(TP+FN) = 1 - TPR = 97.5%
  • True Negative Rate: TN/(TN+FP) = 100/115 = 87.0%, also called specificity
  • False Positive Rate: FP/(TN+FP) = 1 - TNR = 13.0%

These four metrics give us a more detailed idea of how the classifier performs. They tell the correctness at the actual-group level, meaning how the classifier does within the actual positive group (TPR, FNR) and the actual negative group (TNR, FPR).

Same as at the total level, we only need one metric per group to cover the performance. Here we normally choose TPR (recall/sensitivity) and TNR (specificity).
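A small sketch computing the four rates from the same counts; each pair within a group shares its denominator, which is why one metric per actual group is enough:

```python
# Counts taken from the example confusion matrix above
tp, fn, fp, tn = 1, 39, 15, 100

tpr = tp / (tp + fn)   # recall / sensitivity:   1 / 40  = 0.025
fnr = fn / (tp + fn)   # 1 - TPR:               39 / 40  = 0.975
tnr = tn / (tn + fp)   # specificity:          100 / 115 ≈ 0.870
fpr = fp / (tn + fp)   # 1 - TNR:               15 / 115 ≈ 0.130
print(f"TPR={tpr:.3f}, FNR={fnr:.3f}, TNR={tnr:.3f}, FPR={fpr:.3f}")
```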

Predicted group level

  • Precision: TP/(TP+FP) = 1/(1+15) = 6.3%
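Precision tells the correctness within the predicted positive group. A quick sketch with the same counts (note how low it is here, because most predicted positives are actually negative):

```python
# Counts taken from the example confusion matrix above
tp, fp = 1, 15

precision = tp / (tp + fp)   # 1 / 16 ≈ 0.063
print(f"precision={precision:.3f}")
```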

2. ROC Curve

An ROC curve is the most commonly used way to visualize the performance of a binary classifier across different thresholds, and AUC is a good way to summarize its performance in a single number.

The ROC curve is just the plot of the true positive rate against the false positive rate as the decision threshold varies. It is especially useful for choosing a threshold when the cost of one type of error outweighs the cost of the other.

The ROC curve is insensitive to changes in class distribution, so on a heavily imbalanced dataset it can look overly optimistic; in that case it is better to use the precision-recall curve.
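A minimal sketch of drawing an ROC curve, assuming scikit-learn and matplotlib are available and that the classifier exposes a score or probability for the positive class (the y_true and y_score arrays below are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

# Each candidate threshold on y_score gives one (FPR, TPR) point on the curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)

plt.plot(fpr, tpr, marker="o", label="classifier")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```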

AUC

The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
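This ranking interpretation is easy to check numerically. The sketch below (with the same hypothetical scores as above) compares roc_auc_score against the fraction of (positive, negative) pairs in which the positive instance gets the higher score, counting ties as one half:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

auc = roc_auc_score(y_true, y_score)

# Estimate P(score of a random positive > score of a random negative)
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
rank_prob = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()

print(f"AUC={auc:.3f}, rank probability={rank_prob:.3f}")  # the two numbers agree
```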

Precision-Recall Curve

ROC curves summarize the trade-off between the true positive rate and the false positive rate for a predictive model using different probability thresholds. Precision-recall curves summarize the trade-off between the true positive rate (recall) and the positive predictive value (precision) using different probability thresholds. ROC curves are appropriate when the observations are balanced between the classes, whereas precision-recall curves are appropriate for imbalanced datasets.
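A matching sketch for the precision-recall curve with the same hypothetical scores; average_precision_score is one common single-number summary of this curve:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)

plt.plot(recall, precision, marker="o", label=f"classifier (AP={ap:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```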

F1 score

The F1 score is the harmonic mean of precision and recall: \[F1 = \frac{2}{\frac{1}{\text{precision}}+\frac{1}{\text{recall}}}\]
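A tiny sketch using the precision and recall from the example matrix above; the closed form 2·precision·recall/(precision+recall) gives the same value:

```python
# Counts taken from the example confusion matrix above
tp, fn, fp = 1, 39, 15
precision = tp / (tp + fp)   # 1 / 16
recall = tp / (tp + fn)      # 1 / 40

# Harmonic mean of precision and recall
f1 = 2 / (1 / precision + 1 / recall)
print(f"F1={f1:.3f}")   # ≈ 0.036, dragged down by the very low recall
```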

3. Summary

I think after understanding all the metrics, their differences are quite clear: which metric to use highly depends on your data and your business goal. For example:

  • Total-level metrics give you the overall performance of your classifier, but they do not tell you the performance for each class
  • If the data is imbalanced, then the F1 score gives a good overall measure
  • Group-level metrics tell you the performance for each group. It is easy to see that there is a trade-off between the performance on the two groups, so which metric to use depends on the cost of a false positive versus a false negative. For example, in the medical example above, the cost of a false positive (we predict unhealthy but the patient is actually healthy) is far cheaper than a false negative (we predict healthy but the patient is actually unhealthy), so we want the false negative rate to be as low as possible, or in other words we want recall to be as high as possible. Another example is spam detection, where spam is the positive class. Here a false positive means we wrongly classify an important email as spam, which costs much more than a false negative that lets some spam into the inbox. In that case we care more about FPR or precision.
  • ROC or PR curves can help you choose a threshold based on your business logic (see the sketch after this list)
  • AUC is also a good single-number way to measure your model's performance
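As a rough illustration of the threshold-choice bullet above, this sketch (again with hypothetical labels and scores) picks the highest threshold that still meets a business target of recall ≥ 0.95:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels and scores; suppose the business target is recall >= 0.95
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Keep the highest threshold whose TPR (recall) still meets the target
target_recall = 0.95
candidates = thresholds[tpr >= target_recall]
chosen = candidates.max() if candidates.size else thresholds.min()
print(f"chosen threshold = {chosen}")
```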