Confusion matrix, AUC and ROC curve and Gini clearly explained

Understanding the confusion matrix, AUC and ROC curve with their implementations

Yassine EL KHAL
8 min read · Mar 18, 2021

Machine learning classification metrics are not that hard to reason about when the data are clean, neat and balanced: we can just compute the accuracy by dividing the number of correctly predicted observations by the total number of observations. But this is not the case in general. In fact, a lot of problems in machine learning involve imbalanced data (spam detection, fraud detection, detection of rare diseases …).

Confusion matrix and ROC curve by Hosein Kazazi

Say we want to create a model to detect spam and our dataset has 1000 emails, where 10 are spam and 990 are not. We choose logistic regression for this task and get 99% accuracy. That is a very high accuracy score, right? Well, let me tell you that in terms of model performance it is NOT. Unfortunately, this number doesn't tell us much. How? After all, 95% or 99% sound very high. In fact, just for fun, you and I are going to build a 99% accurate spam detection system right now. It follows a very simple rule: as each email comes through, look at its properties and features and, no matter what they are, say it's not spam. 99 times out of 100 you'll be correct, yet this model is completely useless. So imbalanced data are very tricky in machine learning, and there are good ways to account for this problem, among them the confusion matrix, the ROC curve, AUC and Gini.

To get things started, I have included a working example on GitHub, where I used a dataset to predict customer churn; the classes are churned (1) and didn't churn (0).

Confusion matrix

The confusion matrix is a performance measurement for machine learning classification, in both binary and multi-class settings. In this article we'll tackle the binary case, so we'll have a table with 2 rows and 2 columns that expresses how well the model did. While it's super easy to understand, its terminology can be a bit confusing, so this article aims to clear the fog around this evaluation tool.

To read the matrix correctly, we need to understand what every column and row means. The figure below shows it.

The confusion matrix (Image by author)

Let’s decode this matrix:

  • Since we are working with binary classification, the values positive and negative correspond to the two target classes.
  • Columns represent the actual values of the target and rows represent the predicted values.
  • TP (true positive): the predicted value (positive) matches the actual one.
  • TN (true negative): the predicted value (negative) matches the actual one.
  • FP (false positive, or Type 1 error): the predicted value is positive but the actual one is negative.
  • FN (false negative, or Type 2 error): the predicted value is negative but the actual one is positive. This is usually the error we need to decrease the most.

Note: to comply with the usual convention, the positive label is generally the bad or rare one. For example, if you're working on spam detection you give label 1 to spam and 0 to the rest; if you're working on cancer detection, you attach 1 to patients who have cancer and 0 to patients who don't, etc. So the Type 2 error, which amounts to telling someone who has cancer that he doesn't, is the real danger and we must decrease it as much as possible.
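As a quick illustration, here is a minimal sketch of how the matrix can be computed with scikit-learn; the labels and predictions below are made up for illustration and are not taken from the churn dataset.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = churned (positive class), 0 = didn't churn (negative class)
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

# scikit-learn lays the binary matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")  # TP=3  TN=5  FP=1  FN=1
```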

To demystify this further, there is a well-known example on the internet that illustrates the meaning of every term.

Image from Angers university in this course.

Metrics

Now that we understand the meaning of each term, let's combine them to define accuracy, precision, recall (sensitivity), specificity and F1-score.

Let's start with an easy one: the accuracy metric.

Accuracy: out of all observations, how many we predicted correctly. This metric doesn't work for imbalanced data, but it gives a first impression of the global performance.
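In terms of the confusion matrix entries:

Accuracy = (TP + TN) / (TP + TN + FP + FN)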

Precision: out of the positively predicted cases, how many are actually positive. It's the ability of a classifier not to label a negative case as positive. This metric is important when false positives cost more than false negatives (e.g. video or music recommendation, ads, etc.).
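As a formula:

Precision = TP / (TP + FP)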

Recall: out of all positive cases, how many we predicted correctly. It's also called sensitivity or TPR (true positive rate). It's the ability of a classifier to find all positive instances, and it is the important metric when false negatives cost more than false positives. This is the case for our problem.
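As a formula:

Recall = TPR = TP / (TP + FN)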

F1-score: the harmonic mean of recall and precision. In practice we often have to trade one off against the other, because increasing one tends to decrease the other. The F1-score captures both, so it gives a useful summary when precision and recall matter equally to us. For a given average level of performance, it is highest when precision equals recall.
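As a formula:

F1 = 2 · Precision · Recall / (Precision + Recall)

All four metrics are available in scikit-learn. Here is a minimal sketch reusing the hypothetical labels from the confusion matrix example above:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

print("accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total = 0.8
print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print("recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print("f1       :", f1_score(y_true, y_pred))         # harmonic mean  = 0.75
```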

ROC (Receiver Operating Characteristic)

ROC is one of the most important evaluation tools for checking any classification model's performance. It plots two metrics against each other: TPR (True Positive Rate, or recall) on the y-axis and FPR (False Positive Rate) on the x-axis.

TPR: the recall, i.e. out of all positive cases, how many we predicted correctly.

FPR: out of all negative cases, how many we wrongly predicted as positive.
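Using the confusion matrix entries, the two axes are:

TPR = TP / (TP + FN)
FPR = FP / (FP + TN)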

ROC computes TPR and FPR at various threshold settings. Raising the classification threshold classifies more items as negative, therefore decreasing both false positives and true positives, and vice versa. This means the two metrics move in the same direction as the threshold changes.

Distribution of two classes (Image by Author)
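Here is a minimal sketch of how the curve is obtained in practice with scikit-learn; the synthetic imbalanced dataset below is only a stand-in for the churn data, and logistic regression is just a placeholder classifier.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn dataset (about 90% negatives)
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# roc_curve needs scores/probabilities, not hard 0/1 predictions
y_scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_scores)

plt.plot(fpr, tpr, label="logistic regression")
plt.plot([0, 1], [0, 1], linestyle="--", label="random model")  # diagonal baseline
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```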

AUC (Area Under the Curve)

The ROC curve on its own is not a metric to compute: it's just a curve that shows, for every threshold, TPR and FPR against each other. So, to quantify this curve and compare two models, we need a more explicit metric. This is where AUC comes in. As its name indicates, it measures the entire two-dimensional area underneath the ROC curve. Think of it as integral calculus. This provides an aggregate measure of performance across all classification thresholds.

AUC tells how well our model, regardless of the chosen threshold, is able to distinguish between the two classes. The higher it is, the better the model. In a nutshell, AUC describes the degree of separability that our model achieves. Its value lies between 0 and 1.

Another way to interpret AUC is as the probability that the model ranks a randomly chosen positive observation higher than a randomly chosen negative one. In that sense it behaves like a more sophisticated, threshold-free accuracy.

We should note that AUC isn't directly related to accuracy, precision or recall, because it is classification-threshold-invariant: it exists independently of any threshold. AUC is also scale-invariant: it measures how well predictions are ranked rather than their absolute values, so any transformation of the scores that preserves their relative order has no effect on it.
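Reusing y_test and y_scores from the ROC sketch above, here is a minimal illustration of the AUC and of its scale invariance:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test, y_scores)
print("AUC:", auc)

# Scale invariance: any order-preserving transformation of the scores
# leaves the AUC unchanged, because only the ranking matters.
print("AUC, rescaled scores:", roc_auc_score(y_test, 100 * y_scores - 7))
print("AUC, log scores     :", roc_auc_score(y_test, np.log(y_scores + 1e-12)))
```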

GINI

Interpretation of the different ROC curves

The ROC curve (Image by Author)

The Gini index or coefficient is a way to rescale the AUC so that it's clearer and more meaningful: it's more natural for us to see a perfectly random model score 0, reversing models get a negative sign, and the perfect model score 1. The range of values is now [-1, 1].
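Concretely, the rescaling is:

Gini = 2 · AUC − 1

so AUC = 0.5 (random) maps to Gini = 0, AUC = 1 to Gini = 1 and AUC = 0 to Gini = −1. Continuing the sketch from the AUC section:

```python
gini = 2 * roc_auc_score(y_test, y_scores) - 1
print("Gini:", gini)
```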

GINI and AUC (Image by Author)

Perfectly reversing model

This model does the exact opposite of a perfect model: it predicts every positive observation as negative and vice versa. This means that if we invert all its outputs we get a perfect model. It has Gini = -1 and AUC = 0. If you get a model like this, or any model with a negative Gini, you've surely done something wrong.

Imperfect model

The imperfect model is the worst model we can have in practice: it has no ability to discriminate between the two classes, i.e. it behaves like a perfectly random model. It has Gini = 0 and AUC = 0.5.

Perfect model

The perfect model predicts every observation correctly for both the positive and negative classes. In that case, at every threshold either FPR equals 0 or TPR equals 1, so the ROC curve hugs the top-left corner. This model has AUC = 1 and Gini = 1.

Conclusion

What you need to keep from this article is:

  • Accuracy alone is not enough to judge the performance of a model (in the case of imbalanced data, for example).
  • The confusion matrix is a cross-tabulation of actual values and predicted values.
  • False positives and false negatives are two different errors; we usually work on decreasing the latter first, but we may focus on the former too (as in music recommendation).
  • Precision is the metric to maximize when the false positive error matters most; recall is the one to maximize when the false negative error does.
  • The ROC curve is a graphical representation, over all thresholds, of the tradeoff between predicting more positives (and making more type 1 errors) and predicting fewer positives (and making more type 2 errors).
  • AUC is the area under the ROC curve; it measures how well a model distinguishes between the two classes. The higher, the better.
  • AUC is classification-threshold-invariant and scale-invariant.
  • GINI is just an adjustment to AUC so that a perfectly random model scores 0 and a reversing model has a negative sign.

You can find the source code of this article, written from scratch, here.
