Performance Evaluation of Prediction Models with Binary Outcome

Uriah Finkel

Agenda

  • Introducing some of the most common or useful (and sometimes not so common or useful) performance metrics and curves.

  • Discussing how and when to use (and when not to use) the mentioned performance metrics and curves.

Code for Performance Metrics and Curves

All interactive plots in this presentation were created with rtichoke (I am the author πŸ‘‹).

You are also invited to explore the rtichoke blog for reproducible examples and some theory.

Motivation

Why use performance metrics? πŸ₯‡πŸ₯ˆπŸ₯‰

  • Compare different candidate models.
  • Select features.
  • Evaluate whether the prediction model will do more harm than good.

Categories of Performance Metrics and Curves

  • Discrimination πŸ––: Model’s ability to separate events from non-events.

  • Calibration βš–οΈ: Agreement between predicted probabilities and the observed outcomes.

  • Utility πŸ‘Œ: The usefulness of the model in terms of decision-making.


Discrimination πŸ––

True Positives

Infected and Predicted as Infected - GOOD

πŸ’Š
🀒

False Positives

Not-Infected and Predicted as Infected - BAD

πŸ’Š
🀨

False Negatives

Infected and Predicted as Not-Infected - BAD


🀒

True Negatives

Not-Infected and Predicted as Not-Infected - GOOD


🀨

Probability Threshold:

  • When the intervention carries a potential risk, and there is a trade-off between the risk of the intervention and the risk of the outcome, we will use a probability threshold in order to classify each estimated probability as Predicted Negative (Do not Treat) or Predicted Positive (Treat πŸ’Š), as sketched in the code below.

  • This type of dichotomization accommodates individuals with different preferences: a different trade-off between the risks implies a different threshold.
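A minimal sketch in R, using the ten estimated probabilities of the running example below; the threshold (0.25 here) is whatever value encodes the preferred trade-off:

probs <- c(0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72)

# Predicted Positive (Treat) whenever the estimated probability exceeds the threshold
y_hat <- as.integer(probs > 0.25)
y_hat
# 0 0 0 1 1 1 1 1 1 1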

Probability Threshold:

pΜ‚ 0.11 0.15 0.18 0.29 0.31 0.33 0.45 0.47 0.63 0.72
Y 0 0 0 0 1 0 1 0 1 1

🀨 🀨 🀨 🀨 🀒 🀨 🀒 🀨 🀒 🀒

Low Probability Threshold:

Low Probability Threshold means that I’m worried about the outcome:

  • I’m worried about Prostate Cancer πŸ¦€
  • I’m worried about Heart Disease πŸ’”
  • I’m worried about Infection 🀒

Probability Threshold of 0.25

pΜ‚ 0.11 0.15 0.18 0.29 0.31 0.33 0.45 0.47 0.63 0.72
ΕΆ 0 0 0 1 1 1 1 1 1 1
Y 0 0 0 0 1 0 1 0 1 1

🀨 🀨 🀨 πŸ’ŠπŸ€¨ πŸ’ŠπŸ€’ πŸ’ŠπŸ€¨ πŸ’ŠπŸ€’ πŸ’ŠπŸ€¨ πŸ’ŠπŸ€’ πŸ’ŠπŸ€’
TN TN TN FP TP FP TP FP TP TP

High Probability Threshold:

High Probability Threshold means that I’m worried about the Intervention:

  • I’m worried about Biopsy πŸ’‰

  • I’m worried about Statins πŸ’Š

  • I’m worried about Antibiotics πŸ’Š

Probability Threshold of 0.55

pΜ‚ 0.11 0.15 0.18 0.29 0.31 0.33 0.45 0.47 0.63 0.72
ΕΆ 0 0 0 0 0 0 0 0 1 1
Y 0 0 0 0 1 0 1 0 1 1

🀨 🀨 🀨 🀨 🀒 🀨 🀒 🀨 πŸ’ŠπŸ€’ πŸ’ŠπŸ€’
TN TN TN TN FN TN FN TN TP TP

Discrimination - Performance Curves

Curve             Sens  Spec  PPV  PPCR  Lift
ROC               y     x
Lift                               x     y
Precision-Recall  x           y
Gains             y                x


ROC Curve

  • The most famous form of Performance Metrics Visualization

  • Displays Sensitivity (also known as True Positive Rate or Recall) on the y axis.

  • Displays 1 - Specificity (also known as False Positive Rate) on the x axis.
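For reference, a minimal sketch with the {pROC} package, using the running example that appears throughout these slides (note that pROC, somewhat unusually, puts Specificity on a reversed x axis by default rather than 1 - Specificity):

probs <- c(0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72)
reals <- c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1)

# Sensitivity on the y axis, Specificity (reversed) on the x axis
plot(pROC::roc(reals, probs))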

Why I don’t like ROC Curve 😀

Why 1 - Specificity? Why not just Specificity? πŸ™ƒ

Honestly, I didn’t find anywhere why 1 - Specificity is more insightful than just Specificity.


Why I don’t like ROC Curve 😀

Sensitivity and Specificity do not respect the flow of time πŸ•°οΈ



Sensitivity: \(\begin{aligned} \ {\scriptsize \frac{\text{TP}}{\text{TP + FN}} = \text{Prob( Predicted Positive | Real Positive )}}\end{aligned}\)


Specificity: \(\begin{aligned} \ {\scriptsize \frac{\text{TN}}{\text{TN + FP}} = \text{Prob( Predicted Negative | Real Negative )} } \end{aligned}\)


We do not know the condition of the conditional probability: at prediction time we know neither the number of future Real Positives nor the number of future Real Negatives.

Why I don’t like ROC Curve 😀

Sensitivity and Specificity do not respect the flow of time πŸ•°οΈ


PPV: \(\begin{aligned} \ {\scriptsize \frac{\text{TP}}{\text{TP + FP}} = \text{Prob( Real Positive | Predicted Positive )}}\end{aligned}\)


NPV: \(\begin{aligned} \ {\scriptsize \frac{\text{TN}}{\text{TN + FN}} = \text{Prob( Real Negative | Predicted Negative )} } \end{aligned}\)

We know the condition of the Conditional Probability: The number of Predicted Positives and the number of Predicted Negatives.

Why I don’t like ROC Curve 😀

You don’t care about AUROC, you care about the c-statistic

  • Generally speaking, more area under a curve made of two β€œgood” performance metrics means a better model. Other than that there is no context, and performance metrics with no context might lead to ambiguity and bad decisions.

  • Another Curve: Precision-Recall is made of PPV (Precision) and Sensitivity (Recall). How much PRAUC is enough?

Why I don’t like ROC Curve 😀

You don’t care about AUROC, you care about the c-statistic

  • Why not calculate GAINSAUC? Or the area under any combination of two good performance metrics? With Sensitivity, Specificity, NPV and PPV we can get 6 AUC metrics. Do they provide any meaningful insight besides a vague β€œthe more the better”?

What is the AUROC of the following Models?


Why I don’t like ROC Curve 😀

You don’t care about AUROC, you care about the c-statistic

  • High Ink-to-information ratio 😡

  • One might suggest that the visual aspect is useful, but as human beings we are really bad at interpreting round shapes (that’s why pie charts are considered bad practice).

  • Yet, the AUROC is valuable because of the equivalence to the c-statistic and it might provide good intuition about the performance of the model.

Why I don’t like ROC Curve 😀

You don’t care about AUROC, you care about the c-statistic

  • If you randomly take one event and one non-event, the probability that the event was estimated with a higher probability than the non-event is exactly the AUROC.

  • AUROC = p( pΜ‚(🀨) < pΜ‚(🀒) )

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.47 0 🀨
0.45 1 🀒
0.33 0 🀨
0.31 1 🀒
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨
pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\#\text{Concordants}}{\#\text{Concordants} + \#\text{Nonconcordants}}}\end{aligned}\)

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ pΜ‚ πŸ––
0.72 🀒 🀨 0.47
0.72 🀒 🀨 0.33
0.72 🀒 🀨 0.29
0.72 🀒 🀨 0.18
0.72 🀒 🀨 0.15
0.72 🀒 🀨 0.11
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\#\text{Concordants}}{\#\text{Concordants} + \#\text{Nonconcordants}}}\end{aligned}\)

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ pΜ‚ πŸ––
0.72 🀒 > 🀨 0.47 πŸ‘
0.72 🀒 > 🀨 0.33 πŸ‘
0.72 🀒 > 🀨 0.29 πŸ‘
0.72 🀒 > 🀨 0.18 πŸ‘
0.72 🀒 > 🀨 0.15 πŸ‘
0.72 🀒 > 🀨 0.11 πŸ‘
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\#\text{Concordants}}{\#\text{Concordants} + \#\text{Nonconcordants}} = \frac{\text{6 +}}{\text{6 +}}}\end{aligned}\)

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ pΜ‚ πŸ––
0.63 🀒 🀨 0.47
0.63 🀒 🀨 0.33
0.63 🀒 🀨 0.29
0.63 🀒 🀨 0.18
0.63 🀒 🀨 0.15
0.63 🀒 🀨 0.11
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\#\text{Concordants}}{\#\text{Concordants} + \#\text{Nonconcordants}} = \frac{\text{6 +}}{\text{6 +}}}\end{aligned}\)


pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ pΜ‚ πŸ––
0.63 🀒 > 🀨 0.47 πŸ‘
0.63 🀒 > 🀨 0.33 πŸ‘
0.63 🀒 > 🀨 0.29 πŸ‘
0.63 🀒 > 🀨 0.18 πŸ‘
0.63 🀒 > 🀨 0.15 πŸ‘
0.63 🀒 > 🀨 0.11 πŸ‘
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\#\text{Concordants}}{\#\text{Concordants} + \#\text{Nonconcordants}} = \frac{\text{6 + 6 +}}{\text{6 + 6 +}}}\end{aligned}\)

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ pΜ‚ πŸ––
0.45 🀒 🀨 0.47
0.45 🀒 🀨 0.33
0.45 🀒 🀨 0.29
0.45 🀒 🀨 0.18
0.45 🀒 🀨 0.15
0.45 🀒 🀨 0.11
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\#\text{Concordants}}{\#\text{Concordants} + \#\text{Nonconcordants}} = \frac{\text{6 + 6 +}}{\text{6 + 6 +}}}\end{aligned}\)

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ pΜ‚ πŸ––
0.45 🀒 < 🀨 0.47 πŸ‘Ž
0.45 🀒 > 🀨 0.33 πŸ‘
0.45 🀒 > 🀨 0.29 πŸ‘
0.45 🀒 > 🀨 0.18 πŸ‘
0.45 🀒 > 🀨 0.15 πŸ‘
0.45 🀒 > 🀨 0.11 πŸ‘
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\#\text{Concordants}}{\#\text{Concordants} + \#\text{Nonconcordants}} = \frac{\text{6 + 6 + 5 +}}{\text{6 + 6 + 6 +}}}\end{aligned}\)

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ pΜ‚ πŸ––
0.31 🀒 🀨 0.47
0.31 🀒 🀨 0.33
0.31 🀒 🀨 0.29
0.31 🀒 🀨 0.18
0.31 🀒 🀨 0.15
0.31 🀒 🀨 0.11
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\#\text{Concordants}}{\#\text{Concordants} + \#\text{Nonconcordants}} = \frac{\text{6 + 6 + 5 +}}{\text{6 + 6 + 6 +}}}\end{aligned}\)

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ pΜ‚ πŸ––
0.31 🀒 < 🀨 0.47 πŸ‘Ž
0.31 🀒 < 🀨 0.33 πŸ‘Ž
0.31 🀒 > 🀨 0.29 πŸ‘
0.31 🀒 > 🀨 0.18 πŸ‘
0.31 🀒 > 🀨 0.15 πŸ‘
0.31 🀒 > 🀨 0.11 πŸ‘
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\#\text{Concordants}}{\#\text{Concordants} + \#\text{Nonconcordants}} = \frac{\text{6 + 6 + 5 + 4}}{\text{6 + 6 + 6 + 6}}}\end{aligned}\)

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\text{21}}{\text{24}} = 0.875}\end{aligned}\)

Why I don’t like ROC Curve 😀

You don’t care about AUROC, you care about the c-statistic

probs <- c(0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72)
reals <- c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1)

pROC::auc(reals, probs)
Area under the curve: 0.875
probs_events <- probs[reals == 1]
probs_nonevents <- probs[reals == 0]

# Monte Carlo estimate of P(pΜ‚(event) > pΜ‚(non-event)):
# draw 10,000 event / non-event pairs with replacement and compare
prop.table(
  table(
    sample(probs_events, replace = TRUE, size = 10000) >
    sample(probs_nonevents, replace = TRUE, size = 10000)
  )
)

FALSE  TRUE 
0.121 0.879 

Why I don’t like ROC Curve 😀

You don’t care about AUROC, you care about the c-statistic

import numpy as np
import random

probs = np.array([0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72])
reals = np.array([0, 0, 0, 0, 1, 0, 1, 0, 1, 1])

probs_events = probs[reals == 1]
probs_nonevents = probs[reals == 0]

# Monte Carlo estimate of P(p̂(event) > p̂(non-event)):
# draw 10,000 event / non-event pairs with replacement and compare
event_prob_greater_than_nonevent_prob = np.greater(
  random.choices(probs_events, k = 10000),
  random.choices(probs_nonevents, k = 10000)
)

unique_elements, counts_elements = np.unique(
  event_prob_greater_than_nonevent_prob, return_counts=True)

counts_elements / 10000
array([0.13, 0.87])

Good AUROC does not necessarily mean a Good model

Age 7 6 49 56 64 54 72 68 91 86
pΜ‚ 0.11 0.15 0.18 0.29 0.31 0.33 0.45 0.47 0.63 0.72
Y 0 0 0 0 1 0 1 0 1 1

🀨 🀨 🀨 🀨 🀒 🀨 🀒 🀨 🀒 🀒

AUROC shows how well your model discriminates between events and non-events given a target population.

Good AUROC does not necessarily mean a Good model

Age 7 6 49 56 64 54 72 68 91 86
pΜ‚ 0.11 0.15 0.18 0.29 0.31 0.33 0.45 0.47 0.63 0.72
Y 0 0 0 0 1 0 1 0 1 1

πŸ§’ πŸ§’ 🀨 🀨 🀒 🀨 🀒 🀨 πŸ‘΅ πŸ‘΅

This model has AUROC = 0.875, but the number is misleading:
The Target Population is not well defined.

Bad AUROC does not necessarily mean a Bad model

Age 49 56 64 54 72 68
pΜ‚ 0.18 0.29 0.31 0.33 0.45 0.47
Y 0 0 1 0 1 0

🀨 🀨 🀒 🀨 🀒 🀨

This model has AUROC = 0.625, but the number is misleading:
The Target Population is well defined.

pΜ‚ Y
0.72 1 🀒
0.63 1 🀒
0.45 1 🀒
0.31 1 🀒
pΜ‚ Y
0.47 0 🀨
0.33 0 🀨
0.29 0 🀨
0.18 0 🀨
0.15 0 🀨
0.11 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\text{21}}{\text{24}} = 0.875}\end{aligned}\)

AGE pΜ‚ Y
86 0.72 1 πŸ‘΅
91 0.63 1 πŸ‘΅
72 0.45 1 🀒
64 0.31 1 🀒
AGE pΜ‚ Y
68 0.47 0 🀨
54 0.33 0 🀨
56 0.29 0 🀨
49 0.18 0 🀨
6 0.15 0 πŸ§’
7 0.11 0 πŸ§’

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\text{21}}{\text{24}} = 0.875}\end{aligned}\)

AGE pΜ‚ Y
72 0.45 1 🀒
64 0.31 1 🀒
AGE pΜ‚ Y
68 0.47 0 🀨
54 0.33 0 🀨
56 0.29 0 🀨
49 0.18 0 🀨

\(\begin{aligned} \ {\scriptsize \text{C-index} = \frac{\text{5}}{\text{8}} = 0.625}\end{aligned}\)

Lift Curve

Curve             Sens  Spec  PPV  PPCR  Lift
ROC               y     x
Lift                               x     y
Precision-Recall  x           y
Gains             y                x

Prevalence

Real Positives    4 (40%)
Real Negatives    6 (60%)
Total            10 (100%)

\[\frac{\sum \text{Real-Positives}}{\sum \text{Observations}} = \frac{4}{10}\]

1 1 0 1 0 1 0 0 0 0
🀒 🀒 🀨 🀒 🀨 🀒 🀨 🀨 🀨 🀨

PPCR (Predicted Positives Conditional Rate):

\(\begin{aligned} \ {\scriptsize \frac{\text{TP + FP}}{\text{TP + FP + TN + FN}}}\end{aligned} = \begin{aligned} \ {\scriptsize \frac{\text{Predicted Positives}}{\text{Total Population}}}\end{aligned}\)

  • Sometimes we will classify each observation according to the ranking of the risk, in order to prioritize high-risk patients regardless of their absolute risk (sketched in the code below).

  • The implied assumption is that the highest-risk patients will gain the highest benefit from the treatment, and that the treatment does not carry a significant potential risk.

  • This type of dichotomization is used when the organization faces a resource constraint; in healthcare we also call it risk percentile.
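A minimal sketch of rank-based dichotomization in R on the running example, with a PPCR of 0.2:

probs <- c(0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72)

ppcr <- 0.2
risk_rank <- rank(-probs)   # 1 = highest estimated risk
# treat the top 20% riskiest patients, regardless of their absolute risk
y_hat <- as.integer(risk_rank <= ppcr * length(probs))
y_hat
# 0 0 0 0 0 0 0 0 1 1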

PPCR

Predicted Positives    2 (20%)
Predicted Negatives    8 (80%)
Total                 10 (100%)

\[\frac{\sum \text{Predicted-Positives}}{\sum \text{Observations}} = \frac{2}{10}\]

1 1 0 0 0 0 0 0 0 0
πŸ’Š πŸ’Š
😷 😷 😷 😷 😷 😷 😷 😷 😷 😷

PPCR of 0.2

pΜ‚ 0.11 0.15 0.18 0.29 0.31 0.33 0.45 0.47 0.63 0.72
Y 0 0 0 0 1 0 1 0 1 1

🀨 🀨 🀨 🀨 🀒 🀨 🀒 🀨 🀒 🀒

PPCR of 0.2

pΜ‚ 0.11 0.15 0.18 0.29 0.31 0.33 0.45 0.47 0.63 0.72
R 10 9 8 7 6 5 4 3 2 1
ΕΆ 0 0 0 0 0 0 0 0 1 1
Y 0 0 0 0 1 0 1 0 1 1

🀨 🀨 🀨 🀨 🀒 🀨 🀒 🀨 πŸ’ŠπŸ€’ πŸ’ŠπŸ€’
TN TN TN TN FN TN FN TN TP TP

Lift Curve

\(\begin{aligned} \text{Lift} = \frac{\text{PPV}}{\text{Prevalence}} = \frac{\cfrac{\text{TP}}{\text{TP + FP}}}{\cfrac{\text{TP + FN}}{\text{TP + FP + TN + FN}}} \end{aligned}\)

Lift Curve

  • Lift Curve displays Lift on the y axis and PPCR (Predicted Positives Conditional Rate) on the x axis.

  • In other words, Lift shows how much better than a random guess the prediction does in terms of PPV (see the sketch below).

  • The reference line stands for a random guess: the Lift is equal to 1 (PPV = Prevalence).

  • The Curve is not defined if there are no Predicted Positives (probability threshold is too high or PPCR = 0).
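A minimal sketch of the Lift calculation on the running example at a PPCR of 0.2 (the table below walks the confusion matrix through every PPCR step):

probs <- c(0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72)
reals <- c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1)

prevalence <- mean(reals)                  # 0.4
predicted_positive <- rank(-probs) <= 2    # top 2 of 10: PPCR = 0.2
ppv <- mean(reals[predicted_positive])     # both Predicted Positives are events: PPV = 1
ppv / prevalence                           # Lift = 1 / 0.4 = 2.5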

The eleven confusion matrices behind the Lift Curve of the running example, one per PPCR step (n = 10, Prevalence = 0.4):

PPCR  TP       FP       FN       TN
0.0   0 (0%)   0 (0%)   4 (40%)  6 (60%)
0.1   1 (10%)  0 (0%)   3 (30%)  6 (60%)
0.2   2 (20%)  0 (0%)   2 (20%)  6 (60%)
0.3   2 (20%)  1 (10%)  2 (20%)  5 (50%)
0.4   3 (30%)  1 (10%)  1 (10%)  5 (50%)
0.5   3 (30%)  2 (20%)  1 (10%)  4 (40%)
0.6   4 (40%)  2 (20%)  0 (0%)   4 (40%)
0.7   4 (40%)  3 (30%)  0 (0%)   3 (30%)
0.8   4 (40%)  4 (40%)  0 (0%)   2 (20%)
0.9   4 (40%)  5 (50%)  0 (0%)   1 (10%)
1.0   4 (40%)  6 (60%)  0 (0%)   0 (0%)

Precision-Recall Curve

Curve             Sens  Spec  PPV  PPCR  Lift
ROC               y     x
Lift                               x     y
Precision-Recall  x           y
Gains             y                x

Precision-Recall Curve

  • Precision-Recall Curve displays PPV on the y axis and Sensitivity on the x axis.

  • The reference line stands for a random guess: the PPV is equal to the Prevalence, the Sensitivity depends on the Probability Threshold or PPCR.

  • The Curve is not defined if there are no Predicted Positives (probability threshold is too high or PPCR = 0).

Gains Curve

Curve             Sens  Spec  PPV  PPCR  Lift
ROC               y     x
Lift                               x     y
Precision-Recall  x           y
Gains             y                x

Gains Curve

  • Gains Curve displays Sensitivity on the y axis and PPCR on the x axis.

  • Gains shows the Sensitivity for a given PPCR.

  • Reference Line for a Random Guess: The sensitivity is equal to the proportion of predicted positives.

  • Reference Line for a Perfect Prediction: All Predicted Positives are Real Positives until there are no more Real Positives (PPCR = Prevalence, Sensitivity = 1).
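A minimal sketch of the Gains calculation on the running example, computing the Sensitivity at every possible PPCR:

probs <- c(0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72)
reals <- c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1)

# outcomes sorted by decreasing estimated probability
reals_sorted <- reals[order(probs, decreasing = TRUE)]

# Sensitivity at PPCR = k / 10: the share of all events captured among the top k
cumsum(reals_sorted) / sum(reals_sorted)
# 0.25 0.50 0.50 0.75 0.75 1.00 1.00 1.00 1.00 1.00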

Calibration βš–οΈ

Calibration βš–οΈ

  • How well the model is β€œcalibrated”: Patients with a probability of about 0.2 are expected to have a proportion of about 0.2 observed events.

  • In order to assess calibration we need to use quantiles of estimated probabilities (a discrete version of calibration) or some kind of smoothing algorithm.

  • The main idea is to visually inspect for similarity with the 45-degree line.

  • Visual inspection might be problematic, but from our experience it is a good-enough practice.

Calibration βš–οΈ

Why should we care?

  • An accurate model in terms of discrimination might produce uncalibrated estimated probabilities, which will lead to poor decisions.

  • Logistic Regression is calibrated by default, but if you use fancy ML prediction models they might not be calibrated.

Calibration βš–οΈ

What if the model is not calibrated?

  • Uncalibrated models can be fixed by remodeling the predictions with a simple logistic regression (recalibration).

  • Python users might use sklearn.calibration.CalibratedClassifierCV.

  • Optimizing the log-loss function will produce a calibrated model.
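A minimal sketch of recalibration in R on simulated data (the data and the overconfidence factor are made up for illustration): remodel the outcome on the log-odds of the uncalibrated predictions with a simple logistic regression.

set.seed(1)
lp <- rnorm(1000)                      # true linear predictor
reals <- rbinom(1000, 1, plogis(lp))   # outcomes drawn from the true risk
uncalibrated_probs <- plogis(2 * lp)   # a hypothetical overconfident model

# logistic regression of the outcome on the log-odds of the predictions
recalibration_model <- glm(reals ~ qlogis(uncalibrated_probs), family = binomial)
calibrated_probs <- predict(recalibration_model, type = "response")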

Discrete Calibration βš–οΈ

pΜ‚ 0.11 0.15 0.18 0.29 0.31 0.33 0.45 0.47 0.63 0.72
R 10 9 8 7 6 5 4 3 2 1
Y 0 0 0 0 1 0 1 0 1 1

🀨 🀨 🀨 🀨 🀒 🀨 🀒 🀨 🀒 🀒

Discrete Calibration βš–οΈ

pΜ‚ 0.11 0.15 0.18 0.29
R 10 9 8 7
Y 0 0 0 0

🀨 🀨 🀨 🀨

\[\begin{aligned} \scriptsize{ \\\text{Observed: }\frac{\text{0}}{\text{4}} = 0} \end{aligned}\]

\[\begin{aligned} \scriptsize{ \\\text{Predicted: }\frac{\text{0.11 + 0.15 + 0.18 + 0.29}}{\text{4}} = 0.1825} \end{aligned}\]

Discrete Calibration βš–οΈ

pΜ‚ 0.31 0.33 0.45
R 6 5 4
Y 1 0 1

🀒 🀨 🀒

\[\begin{aligned} \scriptsize{ \\\text{Observed: }\frac{\text{2}}{\text{3}} = 0.66'} \end{aligned}\]

\[\begin{aligned} \scriptsize{ \\\text{Predicted: }\frac{\text{0.31 + 0.33 + 0.45}}{\text{3}} = 0.363'} \end{aligned}\]

Discrete Calibration βš–οΈ

pΜ‚ 0.47 0.63 0.72
R 3 2 1
Y 0 1 1

🀨 🀒 🀒

\[\begin{aligned} \scriptsize{ \\\text{Observed: }\frac{\text{2}}{\text{3}} = 0.66'} \end{aligned}\]

\[\begin{aligned} \scriptsize{ \\\text{Predicted: }\frac{\text{0.47 + 0.63 + 0.72}}{\text{3}} = 0.607} \end{aligned}\]
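A minimal sketch of the discrete calibration computation above, reproducing the same three bins in base R:

probs <- c(0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72)
reals <- c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1)

# three bins over the ranks of the estimated probabilities (sizes 4, 3, 3)
bins <- cut(rank(probs), breaks = 3, labels = FALSE)

data.frame(
  observed  = tapply(reals, bins, mean),   # 0.0000 0.6667 0.6667
  predicted = tapply(probs, bins, mean)    # 0.1825 0.3633 0.6067
)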


Smooth Calibration βš–οΈ

A different approach is to use a smoothing algorithm; in {rtichoke} I use gam for large samples and lowess for small samples.

Smooth Calibration βš–οΈ

If you use smooth calibration, take a moment to explore the ranges of the curve!

It might look bad if you don’t zoom in to the reasonable range of estimated probabilities. That’s why you will often see Histograms or Rug-plots under the Calibration Curve.
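A minimal sketch of a smooth calibration curve in base R (lowess, the small-sample choice mentioned above), with a rug plot marking where the estimates actually live:

probs <- c(0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72)
reals <- c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1)

plot(lowess(probs, reals, iter = 0), type = "l",
     xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Estimated Probability", ylab = "Observed Proportion")
abline(0, 1, lty = 2)   # the 45-degree line of perfect calibration
rug(probs)              # the reasonable range of the estimated probabilities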

Utility πŸ‘Œ

Utility πŸ‘Œ

\[\begin{aligned} \\{\text{Net Benefit}} = \frac{\text{TP}}{\text{N}} - \frac{\text{FP}}{\text{N}} * {\frac{{p_{t}}}{{1 - p_{t}}}} \end{aligned}\]

  • In order to make a decision, we need to optimize utility. This requires some kind of price from the clinicians: this price is the odds of the probability threshold.

  • Unlike other performance metrics, Net Benefit is based on decision-making theory (which is reasonable, because we want to make better decisions).

  • Always consider two baseline approaches: Treat All and Treat None.

Utility πŸ‘Œ

\[\begin{aligned} \scriptsize{ \\{\text{Net Benefit}} = \frac{\text{TP}}{\text{N}} - \frac{\text{FP}}{\text{N}} * {\frac{{p_{t}}}{{1 - p_{t}}}}} \end{aligned}\]

\[\begin{aligned} \scriptsize{ \text{Net Benefit Treat All} = {\text{Prevalence}} - {\text{(1 - Prevalence)}} *{\frac{{p_{t}}}{{1 - p_{t}}}}} \end{aligned}\]

\[\begin{aligned} \scriptsize{ \text{Net Benefit Treat None} = {\text{0}} } \end{aligned}\]
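A minimal sketch of the Net Benefit calculation on the running example at a probability threshold of 0.25, next to the two baseline strategies:

probs <- c(0.11, 0.15, 0.18, 0.29, 0.31, 0.33, 0.45, 0.47, 0.63, 0.72)
reals <- c(0, 0, 0, 0, 1, 0, 1, 0, 1, 1)

p_t <- 0.25
n   <- length(reals)
tp  <- sum(probs > p_t & reals == 1)   # 4
fp  <- sum(probs > p_t & reals == 0)   # 3

tp / n - fp / n * p_t / (1 - p_t)                    # model:     0.3
mean(reals) - (1 - mean(reals)) * p_t / (1 - p_t)    # treat all: 0.2
# treat none: 0 by definition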

Utility πŸ‘Œ

I will be indifferent 😐 for having 1 TP for 4 FP \[p_t = \frac{1}{1 + 4} = 0.2\] \[\frac{p_t}{1 - p_t} = \frac{0.2}{1 - 0.2} = \frac{1}{4}\]

\[\begin{aligned}[t] {\text{Net Benefit}} &= {\frac{\text{1}}{\text{5}} - \frac{\text{4}}{\text{5}} * {\frac{1}{4}} = 0} \end{aligned}\]

Utility πŸ‘Œ

I will be sad πŸ™ for having 1 TP for 5 FP \[p_t = \frac{1}{1 + 4} = 0.2\] \[\frac{p_t}{1 - p_t} = \frac{0.2}{1 - 0.2} = \frac{1}{4}\] \[\begin{aligned}[t] {\text{Net Benefit}} &= {\frac{\text{1}}{\text{6}} - \frac{\text{5}}{\text{6}} * {\frac{1}{4}} = -0.04166'} \end{aligned}\]

Utility πŸ‘Œ

I will be happy πŸ™‚ for having 1 TP for 3 FP \[p_t = \frac{1}{1 + 4} = 0.2\] \[\frac{p_t}{1 - p_t} = \frac{0.2}{1 - 0.2} = \frac{1}{4}\]

\[\begin{aligned}[t] {\text{Net Benefit}} &= {\frac{\text{1}}{\text{4}} - \frac{\text{3}}{\text{4}} * {\frac{1}{4}} = 0.0625} \end{aligned}\]

Utility - Decision Curve πŸ‘Œ

  • Decision Curve displays Net Benefit on the y axis and Probability Threshold on the x axis.

  • Reference Line for Treat All Strategy.

  • Reference line for Treat None Strategy.


Thank You! πŸ‘‹