Confusion matrix, ROC curve, AUC and Gini clearly explained
Understanding the confusion matrix, the ROC curve and AUC, with their implementations
Machine learning classification metrics are not hard to reason about when the data are clean, neat and balanced. We can simply compute accuracy by dividing the number of correctly predicted observations by the total number of observations. In general, though, this is not the case: many machine learning problems involve imbalanced data (spam detection, fraud detection, detection of rare diseases, …).
Say we want to build a model to detect spam, and our dataset has 1,000 emails of which 10 are spam and 990 are not. We choose Logistic Regression for the task and get 99% accuracy. That's a very high accuracy score, right? Well, let me tell you that in terms of model performance it's NOT. Unfortunately, this number doesn't tell us much. How? In fact, just for fun, you and I are going to build a 99% accurate spam detection system right now. It follows one very simple rule: as each email comes through, look at its properties and features, and no matter what they are, say it's not spam. 99 times out of 100 you'll be correct, yet this model is completely useless. So imbalanced data are very tricky in machine learning, and there are good ways to…
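The accuracy trap above can be sketched in a few lines. This is a toy illustration with made-up labels mirroring the 10-spam / 990-not-spam example; there is no real model here, just a "classifier" that always answers "not spam":

```python
import numpy as np

# Toy dataset mirroring the example: 1,000 emails,
# 10 spam (label 1) and 990 not spam (label 0).
y_true = np.array([1] * 10 + [0] * 990)

# A "model" that ignores every feature and always predicts "not spam".
y_pred = np.zeros_like(y_true)

# Accuracy = correctly predicted observations / total observations.
accuracy = (y_true == y_pred).mean()
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 99.00%
```

Despite the 99% accuracy, this "model" catches exactly zero of the 10 spam emails, which is why accuracy alone is a misleading metric on imbalanced data.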