1 Introduction

In this session we will learn about benchmarks, which are experiments designed to measure the performance of algorithms, methodologies or statistical models. In particular, we will focus on classification problems, where data instances are assigned to a set of classes. Note that methods that produce numeric scores can often be turned into classifiers by adding rules that map scores to classes. Neurons in neural nets use sigmoid functions for this, see here (in Spanish).
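
For instance, a minimal sketch in R of mapping scores to classes (the sigmoid function and the 0.5 cut-off below are just illustrative choices, not taken from the text above):

sigmoid <- function(x) 1 / (1 + exp(-x))    # squashes any number into (0,1)

scores  <- sigmoid(c(-2.3, -0.4, 0.1, 1.8)) # numeric scores produced by some model
classes <- ifelse(scores > 0.5, "positive", "negative") # rule mapping scores to classes
classes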

1.1 Definitions

Data instance: an individual observation within a dataset. Example: a phenotype measurement from a plant, or from a single leaf.

Training set: Set of data used to train a model or algorithm. You can read about how deep learning machines are trained here (in Spanish).

Validation set: Fraction of the data used to check the trained model during cross-validation.

Test set: Independent set of data used to measure the performance of a trained model or algorithm. If only one dataset is available, the test and validation sets might be the same.

The next figure from Km121220 (2020) illustrates two possible ways to partition a dataset for benchmarking. On top, dataset A uses only a training set and a test set to evaluate the trained model. Below, dataset B is split in three parts, with training + validation sets used during cross-validation and a test set to evaluate the final model:

Cross-validation: model validation procedure based on repeatedly splitting a dataset into training and validation subsets. Partitions can be done in various ways, such as 80% training and 20% validation, or 50%/50%. An independent test set can still be used to measure the performance of the final model.
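
A minimal sketch of such repeated 80%/20% splits, using the built-in iris data frame and 3 repetitions purely for illustration, could look like this:

set.seed(1234)                              # make the random splits reproducible
mydata <- iris                              # any data frame would do here
for (i in 1:3) {
  idx <- sample(1:nrow(mydata), size = round(0.8 * nrow(mydata)))
  training   <- mydata[idx, ]               # 80% used to fit the model
  validation <- mydata[-idx, ]              # remaining 20% used to check it
  cat("iteration", i, ":", nrow(training), "training rows,",
      nrow(validation), "validation rows\n")
}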

Imbalanced dataset: Occurs when the number of data instances in each class of the training dataset is not balanced. Severe imbalance, where the distribution of instances is uneven by a large amount (e.g. 1:100 or more), is particularly challenging for training.
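
A quick way to inspect class balance is to tabulate the class labels; the 10:90 label vector below is made up for illustration:

labels <- factor(c(rep("A", 10), rep("B", 90)))  # hypothetical class labels
table(labels)              # instances per class
prop.table(table(labels))  # proportions, revealing the 1:9 imbalance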

Over-training: Occurs when performance on the training set keeps improving while getting worse on the validation set. It is a form of over-fitting the training data. Learn more here (in Spanish).

Positive and negative cases: When classifying data instances, positives are cases that truly belong to a class and negatives those that do not belong to it. The following figure, taken from (Walber 2014), summarizes how predictors or models handle instances and how the four possible outcomes (TP,FP,TN,FN) can be used to compute specificity and sensitivity:

Sensitivity or recall: also called the True Positive Rate (TPR) \[ sens = \frac{TP}{TP+FN} \]

Specificity: also called the True Negative Rate (TNR); note that it is not the same as precision, which is \(TP/(TP+FP)\).

\[ spec = \frac{TN}{TN+FP} \] Note that \(1 - specificity\) is the False Positive Rate (FPR), typically used in ROC curves, as shown below.
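
As a small worked example, these formulas can be computed directly from the four outcome counts (the TP, FN, TN and FP values below are hypothetical):

# hypothetical confusion-matrix counts
TP <- 40; FN <- 10; TN <- 45; FP <- 5

sens <- TP / (TP + FN)   # sensitivity / recall / True Positive Rate
spec <- TN / (TN + FP)   # specificity / True Negative Rate
FPR  <- 1 - spec         # False Positive Rate, the x-axis of a ROC curve

c(sensitivity = sens, specificity = spec, FPR = FPR)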

Read more about the interpretation of sensitivity and specificity here (in Spanish).

1.2 A benchmark example

In this example, adapted from (MilesH 2018), we will be using a dataset of red wines first published by (Cortez et al. 2009):

library(knitr)  # provides kable() for formatting tables
wines <- read.csv(file="./test_data/red_wine_quality.csv", sep=";")
kable(wines[1:5,])
fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol quality
7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5
7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20 0.68 9.8 5
7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26 0.65 9.8 5
11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16 0.58 9.8 6
7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51 0.56 9.4 5

The last column (quality) is the dependent variable in this example: the one we will try to predict from the others and use to classify wines as high-quality (HQ) or low-quality (LQ). To convert the quality estimates into HQ/LQ binary values, the original 3-8 scores are mapped to 1 when \(quality > 5\) and to 0 otherwise.

In order to visualize performance and compare different models we will plot Receiver Operating Characteristic (ROC) curves, which show sensitivity (y-axis) against 1-specificity (x-axis). Note that ROC plots usually include a diagonal line marking the performance of a random classifier, so the further a curve rises above it, the larger the gain over random guessing. The Area Under the Curve (AUC) allows comparing alternative models, with higher AUC values meaning better performance.

Two different models are trained in this example, a random forest and a logistic regression. Thus, the code shown below requires the following R packages:

  • pROC
  • randomForest

# uncomment to install dependencies
# install.packages(c("pROC", "randomForest"))

# load the data, originally from
# https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/
df = read.csv(file="./test_data/red_wine_quality.csv", sep=";")

# transform the quality column to a 1 (if quality > 5) or a 0
df$quality = ifelse(df$quality>5,1,0)

# change the quality column to a factor type
df$quality = as.factor(df$quality)

# split the dataframe into train (80%) and test (20%) sets;
# call set.seed() beforehand if you want a reproducible split
index = sample(1:nrow(df), size = round(0.8*nrow(df)))
train <- df[index,]
test <- df[-index,]

# build the random forest model and test it
library(randomForest)
rf_model <- randomForest(quality ~., data = train)
rf_prediction <- predict(rf_model, test, type = "prob")

# build the logistic regression model and test it
lr_model <- glm(quality ~., data = train, family = "binomial")
lr_prediction <- predict(lr_model, test, type = "response")

# ROC curves
library(pROC)
ROC_rf <- roc(test$quality, rf_prediction[,2])
ROC_lr <- roc(test$quality, lr_prediction)

# Area Under Curve (AUC) for each ROC curve 
ROC_rf_auc <- auc(ROC_rf)
ROC_lr_auc <- auc(ROC_lr)

# plot ROC curves
plot(ROC_rf, col="green", legacy.axes=T,
    main="Random Forest (green) vs Logistic Regression (red)")
lines(ROC_lr, col = "red")

# print performance summary
paste("Accuracy % of random forest: ", 
    mean(test$quality == round(rf_prediction[,2], digits = 0)))
[1] "Accuracy % of random forest:  0.84375"
paste("Accuracy % of logistic regression: ", 
    mean(test$quality == round(lr_prediction, digits = 0)))
[1] "Accuracy % of logistic regression:  0.746875"
paste("Area under curve of random forest: ", ROC_rf_auc)
[1] "Area under curve of random forest:  0.90450471605808"
paste("Area under curve of logistic regression: ", ROC_lr_auc)
[1] "Area under curve of logistic regression:  0.82509490822277"

You can read more about ROC curves here.

1.3 Exercise

You should write an R Markdown report with the source code and answers to the following questions:

  1. How many data instances are there in the dataset?

  2. After transforming the quality variable to a binary type,

  • how balanced is the dataset, i.e. how many wines are HQ and how many are LQ?
  • what quality cutoff produces the most balanced dataset? By default the example uses 5.
  3. Do a cross-validation benchmark with 3 iterations as follows.
  • Randomly split the data in two parts: 80% cross-validation (dataCV) + 20% test (dataT)
  • With dataCV compute 3 iterations of cross-validation
    • 80% training and 20% validation
    • compute AUC for both models
  • With dataT do a final benchmark and produce a table with AUC values from all iterations plus the final test
  4. Test and benchmark the final trained model with the dataset of white wines at test_data/white_wine_quality.csv.
  • How many True and False Positives do you get?
  • Add the obtained AUC value to the table of task 3 and comment on the results.

References

Cortez, Paulo, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. 2009. “Modeling Wine Preferences by Data Mining from Physicochemical Properties.” Decision Support Systems 47 (4): 547–53. https://doi.org/10.1016/j.dss.2009.05.016.
Km121220. 2020. “Training, Validation, and Test Sets.” https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets.
MilesH. 2018. “Receiver Operating Characteristic (ROC) Curve in R.” https://www.kaggle.com/milesh1/receiver-operating-characteristic-roc-curve-in-r.
Walber. 2014. “Sensitivity and Specificity.” https://en.wikipedia.org/wiki/Sensitivity_and_specificity.