24 Classification
- Classification
- Sensitivity and specificity
- Area under the ROC curve
We may use the logistic regression model as a technique for classification. For example:
- To predict whether a loan will be returned (1) or not (0)
- To predict whether an email is spam (1) or not (0)
- To predict whether a tumor is malignant (1) or not (0)
Let us suppose we want to build a model to classify loan applicants as eligible or not eligible for a loan.
We have a dataset on past applicants containing the following information: a response variable, loan returned (1 if the loan was returned and 0 if not), and the covariates income, loan amount, loan term, credit history, gender, married, dependents, education, self-employed and property area.
Let \(Y=1\) if the loan is returned and 0 otherwise, and let \(x_1, \ldots, x_p\) be covariates used to predict the probability \(\pi\) of returning the loan. We can fit a logistic model for \(\pi\) using the information in the covariates.
\[Y \sim Binomial(1, \pi)\] \[logit(\pi) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p\]
Then, we can fit the model using the data of past applicants, obtain the estimates \(\hat \beta_0, \ldots, \hat \beta_p\), and so have a model to predict the probability of returning the loan for new applicants.
Then, we can use a probability threshold \(c\) to classify individuals as \(\hat Y = 1\) if \(\hat \pi > c\) and \(\hat Y = 0\) if \(\hat \pi \leq c\).
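As a minimal sketch of this workflow in R (the data frame loans, its column names, and new_applicants are hypothetical stand-ins for the loan data described above):
# Sketch only: loans, its columns and new_applicants are hypothetical stand-ins.
fit <- glm(returned ~ income + loan_amount + credit_history,
           family = binomial, data = loans)                          # fit on past applicants
pi_hat <- predict(fit, newdata = new_applicants, type = "response")  # predicted probabilities
thr <- 0.5                                                           # probability threshold c
y_hat <- as.numeric(pi_hat > thr)                                    # classify: 1 if pi_hat > c, else 0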
We would like our model to have high discrimination ability: observations with \(Y=1\) should be assigned high predicted probabilities, and those with \(Y=0\) should be assigned low predicted probabilities.
In this example with a single predictor \(x\), observations are classified as 1 if their predicted probability is at least 0.5. The predicted probability curve crosses the horizontal line at 0.5 at an \(x\) value of 1.95, so any data point with an \(x\) value at or above 1.95 has a predicted probability of at least 0.5 and is classified as a 1, while data points with \(x\) values below 1.95 are classified as 0s.
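The value 1.95 comes from the fitted coefficients of that model: with a single predictor, the predicted probability equals 0.5 exactly where the linear predictor is zero,
\[\hat\pi(x) = 0.5 \iff logit(\hat\pi(x)) = \hat\beta_0 + \hat\beta_1 x = 0 \iff x = -\hat\beta_0/\hat\beta_1,\]
which for that fit gives \(x \approx 1.95\).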
24.1 Sensitivity and specificity
Sensitivity (True Positive rate/Recall) measures the proportion of positives that are correctly identified (i.e. the proportion of those who have some condition (affected, Y=1) who are correctly identified as having the condition).
Specificity (True Negative rate) measures the proportion of negatives that are correctly identified (i.e. the proportion of those who do not have the condition (unaffected, Y=0) who are correctly identified as not having the condition).
Sensitivity = TP/(TP + FN) = (Number of true positive assessments)/(Number of all positives)
Specificity = TN/(TN + FP) = (Number of true negative assessments)/(Number of all negatives)
1 - specificity = FP/(TN + FP) is called the False Positive Rate.
PPV: positive predictive value (or precision) = TP/(TP + FP) = proportion of those with a POSITIVE test that have the condition
NPV: negative predictive value = TN/(TN + FN) = proportion of those with a NEGATIVE test that do not have the condition
Accuracy = (TN + TP)/(TN + TP + FN + FP) = (Number of correct assessments)/(Number of all assessments)
Our model is perfect at classifying observations if it has 100% sensitivity and 100% specificity. Unfortunately in practice this is (usually) not attainable.
We want our model to have high sensitivity and low values of (1 - specificity).
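As a small sketch, these measures can be computed in R from the four counts of a classification table (the function class_metrics below is ours, not from any package):
# Sketch: compute the measures above from the counts of a 2x2 classification table.
class_metrics <- function(tp, fp, tn, fn) {
  c(sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    ppv         = tp / (tp + fp),
    npv         = tn / (tn + fn),
    accuracy    = (tp + tn) / (tp + fp + tn + fn))
}
For instance, class_metrics(tp = 27, fp = 9, tn = 3545, fn = 340) reproduces the sensitivity and specificity asked for in the exercise below.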
24.2 Exercise
For our data set picking a threshold of 0.75 gives us the following results:
FN = 340, TP = 27
TN = 3545, FP = 9
What are the sensitivity and specificity for this particular decision rule?
Sensitivity = TP/(TP + FN) = 27/(27 + 340) = 0.073
Specificity = TN/(FP + TN) = 3545/(9 + 3545) = 0.997
24.3 Example
Reference (PDF): Sensitivity and specificity explained, Cochrane UK trainees' blog, https://uk.cochrane.org/news/sensitivity-and-specificity-explained-cochrane-uk-trainees-blog
Sensitivity includes all individuals with disease so it is not affected by the prevalence of the disease.
Specificity includes all individuals without disease so it is not affected by the prevalence of the disease.
The calculation of PPV and NPV includes observations with \(Y=1\) and \(Y=0\), so both are affected by the prevalence of the disease. We therefore need to ensure that the same population is used (or that the prevalence is the same between populations) when comparing the PPV and NPV of different tests.
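To see the dependence on prevalence explicitly, writing \(\text{prev} = P(Y=1)\), Bayes' theorem gives
\[PPV = \frac{\text{sens} \cdot \text{prev}}{\text{sens} \cdot \text{prev} + (1 - \text{spec})(1 - \text{prev})}, \qquad NPV = \frac{\text{spec} \cdot (1 - \text{prev})}{\text{spec} \cdot (1 - \text{prev}) + (1 - \text{sens}) \cdot \text{prev}},\]
so two tests with identical sensitivity and specificity can give very different PPV and NPV in populations with different prevalence.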
24.4 Classification rule
We can create a classification rule by choosing a threshold value \(c\), and classifying those observations with a fitted probability above \(c\) as positive and those at or below \(c\) as negative.
For a particular threshold value \(c\), we can estimate the sensitivity by the proportion of observations with Y=1 which have a predicted probability above \(c\), and similarly we can estimate specificity by the proportion of Y=0 observations with a predicted probability at or below \(c\).
If we increase the threshold value \(c\), fewer observations will be predicted as positive. This will mean that fewer of the Y=1 observations will be predicted as positive (reduced sensitivity), but more of the Y=0 observations will be predicted as negative (increased specificity). In picking the threshold value, there is thus an intrinsic trade off between sensitivity and specificity.
In terms of the formula: if we increase the threshold \(c\), fewer observations are classified as positive, so the numerator TP of Sensitivity = TP/(TP + FN) decreases, while the denominator TP + FN, the total number of truly positive observations, stays the same (each lost TP becomes an FN). Hence sensitivity decreases, as the sketch below illustrates.
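A small sketch of this trade-off (y is a hypothetical 0/1 outcome vector and p the corresponding fitted probabilities from a logistic model):
# Sketch: y is a hypothetical 0/1 outcome vector, p the fitted probabilities.
sens_spec <- function(cut, y, p) {
  c(threshold   = cut,
    sensitivity = mean(p[y == 1] > cut),    # proportion of Y = 1 predicted positive
    specificity = mean(p[y == 0] <= cut))   # proportion of Y = 0 predicted negative
}
# Increasing the threshold lowers sensitivity and raises specificity:
# t(sapply(c(0.25, 0.50, 0.75), sens_spec, y = y, p = p))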
24.5 The receiver operating characteristic (ROC) curve
The ROC curve is a plot of sensitivity (y-axis) against 1 - specificity (x-axis), evaluated at values of the threshold \(c\) from 0 to 1.
A model with high discrimination ability will have high sensitivity and specificity, leading to an ROC curve which goes close to the top left corner of the plot. A model with no discrimination ability (random classifier) will have an ROC curve which is the 45 degree diagonal line.
A perfect classifier achieves Sensitivity = 1 and Specificity = 1 (i.e. 1 - Specificity = 0) at some threshold, so its ROC curve passes through the top left corner.
A poor model will have an ROC curve close to the bottom right corner (AUC near 0). This is the worst discrimination of all, as such a model is predicting 0s as 1s and 1s as 0s.
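As a sketch, the ROC curve can be traced by hand by sweeping the threshold over the fitted probabilities (y and p again being hypothetical outcome and fitted-probability vectors):
# Sketch: sweep the threshold over the observed fitted probabilities.
cuts <- sort(unique(p), decreasing = TRUE)
tpr <- sapply(cuts, function(cut) mean(p[y == 1] > cut))   # sensitivity
fpr <- sapply(cuts, function(cut) mean(p[y == 0] > cut))   # 1 - specificity
plot(fpr, tpr, type = "l", xlab = "1 - Specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)   # 45 degree line: a classifier with no discrimination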
24.6 Area under the ROC curve
A way of summarizing the discrimination ability of a model is to report the area under the ROC curve.
AUC provides an aggregate measure of performance of a model across all possible classification thresholds.
The area under the curve ranges from 0.5, corresponding to a model with no discrimination ability (ROC curve along the 45 degree diagonal), to 1, corresponding to perfect discrimination (ROC curve reaching the top left-hand corner).
The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
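Equivalently, the AUC is the probability that a randomly chosen positive observation receives a higher predicted probability than a randomly chosen negative one (ties counted as 1/2). A sketch of this check, with y and p hypothetical as before:
# Sketch: rank interpretation of the AUC.
pos <- p[y == 1]; neg <- p[y == 0]
auc_hand <- mean(outer(pos, neg, ">")) + 0.5 * mean(outer(pos, neg, "=="))
auc_hand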
24.7 Multiclass classification
ROC curves are for binary classification problems (0/1).
This can be extended to a multiclass classification problem by using the one-versus-rest technique.
For example, if we have three classes 0/1/2. The ROC for class 0 will be generated as classifying 0 against not 0 (i.e., 1 or 2). The ROC for class 1 will be generated as classifying 1 against not 1 (i.e. 0 or 2) and so on.
By using this one-versus-rest technique for each class, we will have the same number of curves as classes. The AUC can also be calculated for each class individually.
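A sketch with pROC, assuming a factor y3 with three classes and a matrix prob3 of predicted class probabilities with one column per class, named after the classes (both objects are hypothetical):
# Sketch: one-versus-rest AUC per class; y3 and prob3 are hypothetical.
library(pROC)
aucs <- sapply(levels(y3), function(cl) {
  auc(roc(response = as.numeric(y3 == cl),   # class cl versus the rest
          predictor = prob3[, cl]))
})
aucs   # one AUC per class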
25 Example 1
The aSAH data set summarizes several clinical variables and one laboratory variable for 113 patients with an aneurysmal subarachnoid hemorrhage.
library(pROC)
data(aSAH)
head(aSAH)
gos6 outcome gender age wfns s100b ndka
29 5 Good Female 42 1 0.13 3.01
30 5 Good Female 37 1 0.14 8.54
31 5 Good Female 42 1 0.10 8.09
32 5 Good Female 27 1 0.04 10.42
33 1 Poor Female 42 3 0.13 17.40
34 1 Poor Male 48 2 0.10 12.75
res <- glm(outcome ~ gender + age + s100b + ndka, family = binomial(link = "logit"), data = aSAH)
summary(res)
Call:
glm(formula = outcome ~ gender + age + s100b + ndka, family = binomial(link = "logit"),
data = aSAH)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.33222 1.01104 -3.296 0.000981 ***
genderFemale -1.10249 0.50492 -2.183 0.029001 *
age 0.03350 0.01826 1.834 0.066631 .
s100b 4.99552 1.27576 3.916 9.01e-05 ***
ndka 0.02904 0.01679 1.729 0.083739 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 148.04 on 112 degrees of freedom
Residual deviance: 113.08 on 108 degrees of freedom
AIC: 123.08
Number of Fisher Scoring iterations: 5
resROC <- pROC::roc(aSAH$outcome ~ res$fitted)
Sensitivity and specificity values for different threshold values. We see sensitivity decreases and specificity increases as the cutoff increases.
tt <- cbind(resROC$thresholds, resROC$sensitivities, resROC$specificities, 1 - resROC$specificities)[c(5, 10, 45, 50, 100, 110), ]
colnames(tt) <- c("threshold", "sensitivity", "specificity", "1-specificity")
tt
threshold sensitivity specificity 1-specificity
[1,] 0.08014323 1.00000000 0.05555556 0.94444444
[2,] 0.09506841 0.97560976 0.11111111 0.88888889
[3,] 0.21858875 0.85365854 0.52777778 0.47222222
[4,] 0.24394243 0.82926829 0.58333333 0.41666667
[5,] 0.76931946 0.31707317 0.98611111 0.01388889
[6,] 0.90552529 0.09756098 1.00000000 0.00000000
plot(resROC, print.auc = TRUE, legacy.axes = TRUE)
resROC$auc
Area under the curve: 0.8096
Optimal threshold: Youden's J statistic (see the Wikipedia entry on Youden's J statistic).
coords(resROC, "best")
threshold specificity sensitivity
1 0.3242963 0.7916667 0.7317073
coords(resROC, x = "best", input = "threshold", best.method = "youden") # Same than last line
threshold specificity sensitivity
1 0.3242963 0.7916667 0.7317073
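The same "best" threshold can be obtained directly from the ROC object by maximizing Youden's J = sensitivity + specificity - 1:
# Youden's J computed by hand from the ROC object.
j <- resROC$sensitivities + resROC$specificities - 1
resROC$thresholds[which.max(j)]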
26 Example 2
We use the Default data from the ISLR package. The goal is to classify individuals as defaulters based on student status, credit card balance, and income.
library(ISLR)
library(tibble)
as_tibble(Default)
# A tibble: 10,000 × 4
default student balance income
<fct> <fct> <dbl> <dbl>
1 No No 730. 44362.
2 No Yes 817. 12106.
3 No No 1074. 31767.
4 No No 529. 35704.
5 No No 786. 38463.
6 No Yes 920. 7492.
7 No No 826. 24905.
8 No Yes 809. 17600.
9 No No 1161. 37469.
10 No No 0 29275.
# ℹ 9,990 more rows
To avoid an overly optimistic assessment due to overfitting, we use a hold-out validation approach: we split the data into training and test sets, fit the model with the training data, and assess the predictive ability of the model on the test data, which was not used to fit the model.
set.seed(42)
default_idx <- sample(nrow(Default), 5000)
default_trn <- Default[default_idx, ]
default_tst <- Default[-default_idx, ]
Write a function which allows us to make predictions based on different probability thresholds (cut).
get_logistic_pred <- function(mod, data, pos = 1, neg = 0, cut = 0.5) {
  probs <- predict(mod, newdata = data, type = "response")
  ifelse(probs > cut, pos, neg)
}
Let us use this to obtain predictions using low, medium, and high thresholds (0.1, 0.5, and 0.9).
\(\hat C(x) = 1\) if \(\hat p(x) > c\) and \(\hat C(x) = 0\) if \(\hat p(x) \leq c\)
# Fit model with training data
model_glm <- glm(default ~ balance, data = default_trn, family = "binomial")

# Assess accuracy, sensitivity, and specificity with test data.
# Classify using three thresholds (0.1, 0.5, 0.9)
test_pred_10 <- get_logistic_pred(model_glm, data = default_tst, pos = "Yes", neg = "No", cut = 0.1)
test_pred_50 <- get_logistic_pred(model_glm, data = default_tst, pos = "Yes", neg = "No", cut = 0.5)
test_pred_90 <- get_logistic_pred(model_glm, data = default_tst, pos = "Yes", neg = "No", cut = 0.9)
Now we evaluate accuracy, sensitivity, and specificity for these classifiers.
test_tab_10 <- table(predicted = test_pred_10, actual = default_tst$default)
test_tab_50 <- table(predicted = test_pred_50, actual = default_tst$default)
test_tab_90 <- table(predicted = test_pred_90, actual = default_tst$default)
test_tab_10
actual
predicted No Yes
No 4547 46
Yes 290 117
test_tab_50
actual
predicted No Yes
No 4814 112
Yes 23 51
test_tab_90
actual
predicted No Yes
No 4837 156
Yes 0 7
library(caret)
# confusionMatrix() calculates a cross-tabulation of observed and predicted classes with associated statistics.
test_con_mat_10 <- confusionMatrix(test_tab_10, positive = "Yes")
test_con_mat_50 <- confusionMatrix(test_tab_50, positive = "Yes")
test_con_mat_90 <- confusionMatrix(test_tab_90, positive = "Yes")
metrics <- rbind(
  c(test_con_mat_10$overall["Accuracy"],
    test_con_mat_10$byClass["Sensitivity"],
    test_con_mat_10$byClass["Specificity"]),
  c(test_con_mat_50$overall["Accuracy"],
    test_con_mat_50$byClass["Sensitivity"],
    test_con_mat_50$byClass["Specificity"]),
  c(test_con_mat_90$overall["Accuracy"],
    test_con_mat_90$byClass["Sensitivity"],
    test_con_mat_90$byClass["Specificity"])
)
rownames(metrics) <- c("c = 0.10", "c = 0.50", "c = 0.90")
metrics
Accuracy Sensitivity Specificity
c = 0.10 0.9328 0.71779141 0.9400455
c = 0.50 0.9730 0.31288344 0.9952450
c = 0.90 0.9688 0.04294479 1.0000000
We see sensitivity decreases as the cutoff is increased. Conversely, specificity increases as the cutoff increases.
library(pROC)
test_prob <- predict(model_glm, newdata = default_tst, type = "response")
roc(default_tst$default ~ test_prob, plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Setting levels: control = No, case = Yes
Setting direction: controls < cases
Call:
roc.formula(formula = default_tst$default ~ test_prob, plot = TRUE, legacy.axes = TRUE, print.auc = TRUE)
Data: test_prob in 4837 controls (default_tst$default No) < 163 cases (default_tst$default Yes).
Area under the curve: 0.9493
27 Exercise
Datasets
We will use cancer data found in wisc-trn.csv and wisc-tst.csv, which contain the train and test data respectively. This is a modification of the Breast Cancer Wisconsin (Diagnostic) dataset from the UCI Machine Learning Repository.
Consider an additive logistic regression with only two predictors, radius and symmetry, to predict the probability of cancer.
Report test sensitivity, test specificity, and test accuracy for three classifiers, each using a different cutoff for predicted probability:
\(\hat C(x) =\) Malignant if \(\hat p(x) > c\) and \(\hat C(x) =\) Benign if \(\hat p(x) \leq c\)
\(c = 0.1, 0.5, 0.9\)
We will consider class M (malignant) to be the "positive" class when calculating sensitivity and specificity. Summarize these results using a single well-formatted table.
# import data
<- read.csv("data/wisc-trn.csv")
wisc_trn <- read.csv("data/wisc-tst.csv")
wisc_tst # coerce to factor
$class <- as.factor(wisc_trn$class)
wisc_trn$class <- as.factor(wisc_tst$class)
wisc_tst# fit model
<- glm(class ~ radius + symmetry, data = wisc_trn, family = "binomial")
wisc_glm # helper function, specific to glm function
<- function(mod, data, pos = 1, neg = 0, cut = 0.5) {
get_pred_glm = predict(mod, newdata = data, type = "response")
probs ifelse(probs > cut, pos, neg)
}# get predicted values for each cutoff
<- get_pred_glm(wisc_glm, wisc_tst, cut = 0.1, pos = "M", neg = "B")
pred_10 <- get_pred_glm(wisc_glm, wisc_tst, cut = 0.5, pos = "M", neg = "B")
pred_50 <- get_pred_glm(wisc_glm, wisc_tst, cut = 0.9, pos = "M", neg = "B")
pred_90 # create table for each cutoff
<- table(predicted = pred_10, actual = wisc_tst$class)
tab_10 <- table(predicted = pred_50, actual = wisc_tst$class)
tab_50 <- table(predicted = pred_90, actual = wisc_tst$class)
tab_90 # coerce tables to be confusion matrices
<- confusionMatrix(tab_10, positive = "M")
con_mat_10 <- confusionMatrix(tab_50, positive = "M")
con_mat_50 <- confusionMatrix(tab_90, positive = "M")
con_mat_90 = rbind(
metrics c(con_mat_10$overall["Accuracy"],
$byClass["Sensitivity"],
con_mat_10$byClass["Specificity"]),
con_mat_10c(con_mat_50$overall["Accuracy"],
$byClass["Sensitivity"],
con_mat_50$byClass["Specificity"]),
con_mat_50c(con_mat_90$overall["Accuracy"],
$byClass["Sensitivity"],
con_mat_90$byClass["Specificity"])
con_mat_90
)
rownames(metrics) <- c("c = 0.10", "c = 0.50", "c = 0.90")
metrics
Accuracy Sensitivity Specificity
c = 0.10 0.86 0.950 0.8000000
c = 0.50 0.91 0.825 0.9666667
c = 0.90 0.82 0.575 0.9833333