Medium Last updated on May 8, 2022, 11:39 p.m.

In real-world ML applications, it’s not enough to empirically validate models’ performance using accuracy metrics. This is especially true when evaluating algorithms that output probabilities of class values; for example, logistic regression, random forest, etc.

In this article, we will:

- Define the AUC - ROC curve and learn to interpret it.
- Figure out how the ROC curve helps with performance tuning.
- Understand how to plot the AUC - ROC curve for multi-class models.
- Learn the difference between the PR (Precision-Recall) curve and the ROC curve.

The ROC (Receiver Operating Characteristic) curve is a good way of visualizing a classifier’s performance in order to select a suitable operating point or decision threshold. However, when comparing a number of different classification schemes, it is often desirable to obtain a single figure as a measure of the classifier’s performance; that is where AUC comes in.

In ROC space, one plots the False Positive Rate (FPR) on the x-axis and the True Positive Rate (TPR) on the y-axis. The FPR measures the fraction of negative examples that are misclassified as positive. The TPR measures the fraction of positive examples that are correctly labeled.
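To make these two rates concrete, here is a minimal Python sketch that sweeps the decision threshold over a set of predicted scores and records the resulting (FPR, TPR) pairs. The helper name `roc_points` and the toy labels and scores are illustrative assumptions, not part of any library or of the article's data.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by sweeping the decision threshold.

    A sample is predicted positive when its score is >= the threshold.
    """
    n_pos = sum(1 for y in y_true if y == 1)
    n_neg = len(y_true) - n_pos
    # A threshold above every score predicts nothing as positive: (0, 0).
    points = [(0.0, 0.0)]
    # Lower the threshold one distinct score at a time.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data (hypothetical values chosen for illustration)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting these pairs with FPR on the x-axis and TPR on the y-axis gives the ROC curve for the toy data.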

The AUC (Area Under the Curve) score measures the total area underneath the ROC curve. It represents the degree of separability, that is, how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0 and 1s as 1. By analogy, the higher the AUC, the better the model is at distinguishing between patients with a disease and patients without it.

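**Note:** The AUC score is scale-invariant (it depends on how the predictions are ranked, not on their absolute values) and threshold-invariant (it summarizes performance across all classification thresholds rather than at a single one).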
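One way to see the scale invariance is to compute AUC as the fraction of (positive, negative) pairs in which the positive example receives the higher score, which is equivalent to the area under the ROC curve. The sketch below is a plain-Python illustration under that definition; the function name `auc_pairwise` and the toy data are assumptions made for this example.

```python
def auc_pairwise(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical values chosen for illustration)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = auc_pairwise(y_true, scores)

# A strictly increasing transform changes the scale of the scores
# but not their ordering, so the AUC stays the same.
cubed = [s ** 3 for s in scores]
auc_cubed = auc_pairwise(y_true, cubed)

print(auc, auc_cubed)
```

Because only the ordering of the scores enters the computation, no threshold has to be chosen to obtain the value.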

Taken together, the AUROC (Area Under the Receiver Operating Characteristic curve) is a good performance measure for classification problems across various threshold settings.

Before we dive into the details, let’s understand some widely used technical terms:

**1. True Positive Rate (TPR)** is the same as sensitivity or recall. It is calculated from the True Positive (TP) and False Negative (FN) values of a confusion matrix. In a medical setting, sensitivity refers to the test’s ability to correctly detect patients who actually have the condition.

$$ \text{TPR} = \text{Sensitivity} = \text{Recall} = \frac{TP}{TP+FN} $$

**2. True Negative Rate (TNR)**, also known as specificity, is the probability of a negative test result given that the subject is truly negative. In a medical setting, specificity tells us the test’s ability to correctly rule out healthy patients who do not have the condition.

$$ \text{TNR} = \text{Specificity} = \frac{TN}{TN+FP} $$

**Note:** For a given model, sensitivity and specificity trade off against each other as the decision threshold moves. In simple terms, raising the threshold typically decreases sensitivity and increases specificity, and vice versa.

**3. False Positive Rate (FPR)**, also called fall-out, follows directly from specificity: $ FPR = 1 - TNR $. This is the same as:

$$ FPR = \frac{FP}{TN+FP} $$

**Note:** Since $ FPR = 1 - Specificity $, lowering the threshold to increase the TPR reduces specificity, so the FPR generally increases as well.
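The three rates above, and the trade-off described in the notes, can be checked with a short Python sketch. The helper names `confusion_counts` and `rates`, the toy data, and the two thresholds are assumptions chosen for illustration.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FN, TN, FP when predicting positive for score >= threshold."""
    tp = fn = tn = fp = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def rates(tp, fn, tn, fp):
    """Apply the formulas for TPR, TNR and FPR given above."""
    return {
        "TPR": tp / (tp + fn),  # sensitivity / recall
        "TNR": tn / (tn + fp),  # specificity
        "FPR": fp / (tn + fp),  # equals 1 - specificity
    }


# Toy data (hypothetical values chosen for illustration)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.6, 0.3, 0.65, 0.4, 0.2, 0.1]

strict = rates(*confusion_counts(y_true, scores, threshold=0.68))
lenient = rates(*confusion_counts(y_true, scores, threshold=0.25))

print("strict :", strict)
print("lenient:", lenient)
```

On this toy data, moving from the strict to the lenient threshold raises the TPR from 0.5 to 1.0, while specificity drops from 1.0 to 0.5 and the FPR rises from 0.0 to 0.5.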

Now, let us take a look at how we can understand the AUC - ROC curve in detail.

An efficient classification model will have an AUC close to 1, i.e., a good measure of separability, whereas an inefficient model will have an AUC close to 0, i.e., a poor measure of separability. A model with an AUC near 0 ranks the classes in reverse: it consistently scores negatives above positives, so it misclassifies nearly every data point. However, if the AUC is approximately 0.5, the model cannot separate the classes at all and does no better than random guessing. Let’s visualize the above statements to understand them better.

Suppose green represents the negative class (patients who do not have COVID) and red represents the positive class (patients who have COVID).

**Case 1: AUC ~ 1.0**

Here, the two distributions do not overlap, which means that the model has good separation capability. In other words, the model can perfectly differentiate between the negative and positive classes.

**Case 2: AUC ~ 0.75**

Here, the distributions overlap. When this happens, we come across Type 1 errors (false positives) and Type 2 errors (false negatives). For a concise explanation of Type 1 and Type 2 errors, please check out this blog: What are Type I and Type II errors? How to avoid them?

**Case 3: AUC ~ 0.5**

In this case, the model cannot differentiate between the positive and the negative class.

**Case 4: AUC ~ 0.0**

In this case, the model will predict a positive class as a negative class and vice versa.
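The four cases can be reproduced with small hand-picked score lists, one for the negative class and one for the positive class. The sketch below is only an illustration: the `auc` helper computes the fraction of positive/negative pairs ranked correctly, and all score values are hypothetical.

```python
def auc(neg_scores, pos_scores):
    """Share of (positive, negative) pairs where the positive scores higher.

    A tie between a positive and a negative score counts as half.
    """
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


# Case 1: no overlap, every positive scores above every negative.
case_1 = auc([0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9])

# Case 2: partial overlap, some negatives outrank some positives.
case_2 = auc([0.1, 0.2, 0.5, 0.7], [0.4, 0.6, 0.65, 0.9])

# Case 3: identical scores, the model carries no ranking information.
case_3 = auc([0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5])

# Case 4: reversed ranking, every negative scores above every positive.
case_4 = auc([0.6, 0.7, 0.8, 0.9], [0.1, 0.2, 0.3, 0.4])

print(case_1, case_2, case_3, case_4)
```

Note that the model in Case 4 is still informative: flipping its predictions would turn it into the model of Case 1.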

In the case of a multi-class classification model, we plot multiple AUC - ROC curves, one per class. For example, if we have 3 classes, namely A, B, and C, we plot one ROC curve for A classified against B and C, another for B classified against A and C, and a third for C classified against A and B. This methodology is called `one vs all` in the domain of applied statistics and can be used for any number of classes ‘n’.
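As a rough sketch of the `one vs all` idea, the Python snippet below treats each class in turn as the positive class, uses the model's predicted probability for that class as the score, and computes one AUC per class. The helper names, the class labels, and the probability table are all hypothetical; averaging the per-class values at the end is one common way to summarize them, not something prescribed above.

```python
def binary_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_all_auc(y_true, proba, classes):
    """Compute one AUC per class, treating that class as positive."""
    result = {}
    for k, cls in enumerate(classes):
        binary_labels = [1 if y == cls else 0 for y in y_true]
        class_scores = [row[k] for row in proba]
        result[cls] = binary_auc(binary_labels, class_scores)
    return result


classes = ["A", "B", "C"]
y_true = ["A", "A", "B", "B", "C", "C"]
# Hypothetical predicted probabilities, columns ordered as in `classes`.
proba = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.3, 0.5],
    [0.1, 0.2, 0.7],
    [0.3, 0.3, 0.4],
]

per_class = one_vs_all_auc(y_true, proba, classes)
macro_auc = sum(per_class.values()) / len(per_class)

print(per_class, round(macro_auc, 3))
```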

In general, ROC curves are recommended when the observations are balanced between the classes, whereas precision-recall curves are more appropriate for skewed datasets.

As we have observed above, the ROC curve plots the True Positive Rate against the False Positive Rate. On an imbalanced dataset with many negatives, it takes a huge change in the number of false positives to produce even a small change in the false positive rate, so the ROC curve barely moves. The PR curve, on the other hand, uses precision, which compares false positives to true positives rather than to true negatives. It therefore captures the effect of the large number of negative examples and is a better measure in this setting.
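A small worked example makes the difference visible. The counts below are hypothetical: 100 positives, 100,000 negatives, and a classifier that keeps 80 true positives while its false positives grow tenfold.

```python
def false_positive_rate(fp, n_neg):
    """FPR = FP / (TN + FP); TN + FP is the total number of negatives."""
    return fp / n_neg


def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


# Hypothetical imbalanced dataset: 1 positive per 1,000 negatives.
n_neg = 100_000
tp = 80  # true positives, held fixed in both scenarios

fpr_before = false_positive_rate(fp=100, n_neg=n_neg)
prec_before = precision(tp=tp, fp=100)

fpr_after = false_positive_rate(fp=1_000, n_neg=n_neg)
prec_after = precision(tp=tp, fp=1_000)

print(f"FPR:       {fpr_before:.3f} -> {fpr_after:.3f}")
print(f"Precision: {prec_before:.3f} -> {prec_after:.3f}")
```

With these counts, 900 extra false positives move the FPR only from 0.001 to 0.010, a shift that is barely visible on a ROC plot, while precision falls from about 0.444 to about 0.074.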

In conclusion, we suggest using the PR curve when the positive class is rare and false positives among the predicted positives matter most, whereas if the classifier is expected to perform well in general, at a variety of different baseline probabilities, use the ROC curve.

**I hope this understanding of the AUC - ROC curve helps you in your journey of acing interviews!**

Also, consider reading the following articles:

- What is F1-Score? Define it in terms of Precision and Recall.
- What is bias variance trade-off? How does it impact model performance?
- What loss functions can be used for regression? Which one is better for outliers?
- What is the 0-1 loss function? Why can’t the 0-1 loss function or classification error be used as a loss function for optimizing a deep neural network?

