## 7.7 Cost-Sensitive Classification

In regular classification, the aim is to minimize the misclassification rate, and thus all types of misclassification errors are deemed equally severe. A more general setting is cost-sensitive classification, which does not assume that the costs caused by different kinds of errors are equal. The objective of cost-sensitive classification is to minimize the expected costs.
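Formally, writing $$C(j, k)$$ for the cost incurred when class $$j$$ is predicted and class $$k$$ is the true class (a standard formalization added here for illustration; the notation is ours, not mlr3's), the quantity to minimize is the expected cost

$$\mathbb{E}[C] = \sum_{j} \sum_{k} C(j, k) \; P(\hat{y} = j,\, y = k).$$

Regular classification is the special case where all off-diagonal costs are 1 and all diagonal costs are 0, so the expected cost reduces to the misclassification rate.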

Imagine you are an analyst for a big credit institution. Let’s also assume that a correct decision of the bank would result in a profit of 35% at the end of a specific period. A correct decision means that the bank predicts that a customer will pay their bills (hence would obtain a loan), and the customer indeed has good credit. On the other hand, a wrong decision means that the bank predicts that the customer’s credit is in good standing, but the opposite is true. This would result in a loss of 100% of the given loan.

|                           | Good Customer (truth) | Bad Customer (truth) |
|---------------------------|-----------------------|----------------------|
| Good Customer (predicted) | + 0.35                | - 1.0                |
| Bad Customer (predicted)  | 0                     | 0                    |

Expressed as costs (instead of profit), we can write down the cost-matrix as follows:

costs = matrix(c(-0.35, 0, 1, 0), nrow = 2)
dimnames(costs) = list(response = c("good", "bad"), truth = c("good", "bad"))
print(costs)
##         truth
## response  good bad
##     good -0.35   1
##     bad   0.00   0

An exemplary data set for such a problem is the German Credit task:

library("mlr3")
task = tsk("german_credit")
table(task$truth())
##
## good  bad
##  700  300

The data has 70% customers who are able to pay back their credit, and 30% bad customers who default on the debt. A manager who does not have any model could decide either to give everybody a credit or to give nobody a credit. The resulting costs for the German credit data are:

# nobody:
(700 * costs[2, 1] + 300 * costs[2, 2]) / 1000
## [1] 0

# everybody
(700 * costs[1, 1] + 300 * costs[1, 2]) / 1000
## [1] 0.055

If the average loan is $20,000, the credit institute would lose more than one million dollars if it granted everybody a credit:

# average costs per customer * average loan * number of customers
0.055 * 20000 * 1000
## [1] 1100000

Our goal is to find a model which minimizes the costs (and thereby maximizes the expected profit).

### 7.7.1 A First Model

For our first model, we choose an ordinary logistic regression (implemented in the add-on package mlr3learners). We resample the model using 10-fold cross-validation and extract the resulting confusion matrix:

library("mlr3learners")
learner = lrn("classif.log_reg")
rr = resample(task, learner, rsmp("cv"))

confusion = rr$prediction()$confusion
print(confusion)
##         truth
## response good bad
##     good  604 154
##     bad    96 146

To calculate the average costs like above, we can simply multiply the elements of the confusion matrix element-wise by the corresponding elements of the previously introduced cost matrix, and sum up the values of the resulting matrix:

avg_costs = sum(confusion * costs) / 1000
print(avg_costs)
## [1] -0.0574

With an average loan of $20,000, the logistic regression yields the following costs:

avg_costs * 20000 * 1000
## [1] -1148000

Instead of losing over $1,000,000, the credit institute can now expect a profit of more than $1,000,000.

### 7.7.2 Cost-sensitive Measure

Our natural next step would be to further improve the modeling step in order to maximize the profit. For this purpose we first create a cost-sensitive classification measure which calculates the costs based on our cost matrix. This allows us to conveniently quantify and compare modeling decisions. Fortunately, there already is a predefined Measure for this purpose: MeasureClassifCosts:

cost_measure = msr("classif.costs", costs = costs)
print(cost_measure)
## <MeasureClassifCosts:classif.costs>
## * Packages: -
## * Range: [-Inf, Inf]
## * Minimize: TRUE
## * Properties: requires_task
## * Predict type: response

If we now call resample() or benchmark(), the cost-sensitive measure will be evaluated. We compare the logistic regression to a simple featureless learner and to a random forest from package ranger:

learners = list(
  lrn("classif.log_reg"),
  lrn("classif.featureless"),
  lrn("classif.ranger")
)
cv3 = rsmp("cv", folds = 3)
bmr = benchmark(benchmark_grid(task, learners, cv3))
bmr$aggregate(cost_measure)
##    nr      resample_result       task_id          learner_id resampling_id
## 1:  1 <ResampleResult[21]> german_credit     classif.log_reg            cv
## 2:  2 <ResampleResult[21]> german_credit classif.featureless            cv
## 3:  3 <ResampleResult[21]> german_credit      classif.ranger            cv
##    iters classif.costs
## 1:     3      -0.05501
## 2:     3       0.05501
## 3:     3      -0.05115

As expected, the featureless learner performs considerably worse. The logistic regression and the random forest work equally well.

### 7.7.3 Thresholding

Although we now correctly evaluate the models in a cost-sensitive fashion, the models themselves are unaware of the classification costs. They assume the same costs for both wrong classification decisions (false positives and false negatives). Some learners natively support cost-sensitive classification. However, we will concentrate on a more generic approach which works for all models that can predict probabilities for class labels: thresholding.

Most learners can calculate the probability $$p$$ for the positive class. If $$p$$ exceeds the threshold $$0.5$$, they predict the positive class, and the negative class otherwise.
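As a small illustration of this decision rule (a sketch with made-up probabilities, independent of the credit data; mlr3 itself applies the rule via p$set_threshold(), used below):

prob_good = c(0.9, 0.6, 0.4)  # hypothetical predicted probabilities for the positive class "good"
threshold = 0.5
ifelse(prob_good >= threshold, "good", "bad")
## [1] "good" "good" "bad"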

For our binary classification case of the credit data, we primarily want to minimize the errors where the model predicts “good” but the truth is “bad” (i.e., the number of false positives), as this is the more expensive error. If we now increase the threshold to values $$> 0.5$$, we reduce the number of false positives. Note that we increase the number of false negatives simultaneously, or, in other words, we are trading false negatives for false positives.

# fit models with probability prediction
learner = lrn("classif.log_reg", predict_type = "prob")
rr = resample(task, learner, rsmp("cv"))
p = rr$prediction()
print(p)
## <PredictionClassif> for 1000 observations:
##  row_ids truth response prob.good prob.bad
##       13  good     good   0.84539  0.15461
##       70  good     good   0.85194  0.14806
##       79  good     good   0.74997  0.25003
## ---
##      973   bad      bad   0.05756  0.94244
##      977  good     good   0.93482  0.06518
##      992  good     good   0.70396  0.29604

# helper function to try different threshold values interactively
with_threshold = function(p, th) {
  p$set_threshold(th)
  list(confusion = p$confusion, costs = p$score(measures = cost_measure, task = task))
}

with_threshold(p, 0.5)
## $confusion
##         truth
## response good bad
##     good  601 148
##     bad    99 152
##
## $costs
## classif.costs
##      -0.06235
with_threshold(p, 0.75)
## $confusion
##         truth
## response good bad
##     good  468  80
##     bad   232 220
##
## $costs
## classif.costs
##       -0.0838
with_threshold(p, 1.0)
## $confusion
##         truth
## response good bad
##     good    1   0
##     bad   699 300
##
## $costs
## classif.costs
##      -0.00035
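To get an overview of this trade-off, we can evaluate the costs over a whole grid of thresholds and plot them (a small sketch reusing the with_threshold() helper defined above; note that set_threshold() modifies the prediction object in place):

# evaluate the cost measure for a grid of threshold values
ths = seq(0.5, 1, by = 0.01)
cost_by_th = vapply(ths, function(th) with_threshold(p, th)$costs, numeric(1))
plot(ths, cost_by_th, type = "l", xlab = "threshold", ylab = "average costs")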

Instead of manually trying different threshold values, we can use optimize() to find a good threshold value w.r.t. our performance measure:

# simple wrapper function which takes a threshold and returns the resulting model performance
# this wrapper is passed to optimize() to find its minimum for thresholds in [0.5, 1]
f = function(th) {
  with_threshold(p, th)$costs
}

best = optimize(f, c(0.5, 1))
print(best)
## $minimum
## [1] 0.8476
##
## $objective
## classif.costs
##       -0.0872

# optimized confusion matrix:
with_threshold(p, best$minimum)$confusion
##         truth
## response good bad
##     good  372  43
##     bad   328 257

Note that the function optimize() is intended for unimodal functions and therefore may converge to a local optimum here. See below for better alternatives to find good threshold values.

### 7.7.4 Threshold Tuning

Before we start, we have to load all required packages:

library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)
## Loading required package: paradox

### 7.7.5 Adjusting thresholds: Two strategies

Currently mlr3pipelines offers two main strategies for adjusting classification thresholds. We can either expose the thresholds as a hyperparameter of the Learner by using PipeOpThreshold. This allows us to tune the thresholds via an outside optimizer from mlr3tuning. Alternatively, we can use PipeOpTuneThreshold, which automatically tunes the threshold after each learner fit. In this section, we go through both strategies.

### 7.7.6 PipeOpThreshold

PipeOpThreshold can be put directly after a Learner. A simple example would be:

gr = lrn("classif.rpart", predict_type = "prob") %>>% po("threshold")
l = as_learner(gr)

Note that predict_type = "prob" is required for po("threshold") to have any effect.

The thresholds are now exposed as a hyperparameter of the GraphLearner we created:

l$param_set
## <ParamSetCollection>
##                               id    class lower upper nlevels        default
##  1:       classif.rpart.minsplit ParamInt     1   Inf     Inf             20
##  2:      classif.rpart.minbucket ParamInt     1   Inf     Inf <NoDefault[3]>
##  3:             classif.rpart.cp ParamDbl     0     1     Inf           0.01
##  4:     classif.rpart.maxcompete ParamInt     0   Inf     Inf              4
##  5:   classif.rpart.maxsurrogate ParamInt     0   Inf     Inf              5
##  6:       classif.rpart.maxdepth ParamInt     1    30      30             30
##  7:   classif.rpart.usesurrogate ParamInt     0     2       3              2
##  8: classif.rpart.surrogatestyle ParamInt     0     1       2              0
##  9:           classif.rpart.xval ParamInt     0   Inf     Inf             10
## 10:     classif.rpart.keep_model ParamLgl    NA    NA       2          FALSE
## 11:         threshold.thresholds ParamUty    NA    NA     Inf <NoDefault[3]>
##     value
##  1:
##  2:
##  3:
##  4:
##  5:
##  6:
##  7:
##  8:
##  9:     0
## 10:
## 11:   0.5
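Since the threshold is now an ordinary hyperparameter, we could also just set it manually before training (a minimal sketch; the value 0.7 is chosen only for illustration):

l$param_set$values$threshold.thresholds = 0.7
l$train(tsk("german_credit"))
# predict on the training data just to inspect the resulting confusion matrix
l$predict(tsk("german_credit"))$confusion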

We can now tune those thresholds from the outside as follows:

Before tuning, we have to define which hyperparameters we want to tune over. In this example, we only tune over the thresholds parameter of the threshold PipeOp. As you can easily imagine, we could also jointly tune over additional hyperparameters, e.g. rpart’s cp parameter.

As the Task we aim to optimize for is a binary task, we can simply specify the threshold parameter:

library(paradox)
ps = ps(threshold.thresholds = p_dbl(lower = 0, upper = 1))

We now create an AutoTuner, which automatically tunes the supplied learner over the ParamSet we defined above.

at = AutoTuner$new(
  learner = l,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("evals", n_evals = 5L),
  tuner = tnr("random_search")
)
at$train(tsk("german_credit"))
## INFO  [09:38:05.412] [bbotk] Starting to optimize 1 parameter(s) with '<OptimizerRandomSearch>' and '<TerminatorEvals> [n_evals=5]'
## INFO  [09:38:05.463] [bbotk] Evaluating 1 configuration(s)
## INFO  [09:38:05.795] [bbotk] Result of batch 1:
## INFO  [09:38:05.797] [bbotk]  threshold.thresholds classif.ce                                uhash
## INFO  [09:38:05.797] [bbotk]                0.7293     0.3139 361acc72-0230-43e9-acac-2096a20e289e
## INFO  [09:38:05.800] [bbotk] Evaluating 1 configuration(s)
## INFO  [09:38:06.082] [bbotk] Result of batch 2:
## INFO  [09:38:06.084] [bbotk]  threshold.thresholds classif.ce                                uhash
## INFO  [09:38:06.084] [bbotk]                0.3261      0.263 0b44382a-f4a4-48fe-ac82-3a95a0358463
## INFO  [09:38:06.087] [bbotk] Evaluating 1 configuration(s)
## INFO  [09:38:06.366] [bbotk] Result of batch 3:
## INFO  [09:38:06.368] [bbotk]  threshold.thresholds classif.ce                                uhash
## INFO  [09:38:06.368] [bbotk]                 0.693      0.267 a35c4ca0-cd9a-45a5-ba70-641ad3c1d8d3
## INFO  [09:38:06.371] [bbotk] Evaluating 1 configuration(s)
## INFO  [09:38:06.658] [bbotk] Result of batch 4:
## INFO  [09:38:06.659] [bbotk]  threshold.thresholds classif.ce                                uhash
## INFO  [09:38:06.659] [bbotk]                0.3037      0.267 4e6dd9f8-3772-46e3-90a8-460b41e484b1
## INFO  [09:38:06.662] [bbotk] Evaluating 1 configuration(s)
## INFO  [09:38:06.943] [bbotk] Result of batch 5:
## INFO  [09:38:06.945] [bbotk]  threshold.thresholds classif.ce                                uhash
## INFO  [09:38:06.945] [bbotk]                0.4205      0.263 b8c8f998-2eb8-4b0b-9a23-b421c19266b5
## INFO  [09:38:06.955] [bbotk] Finished optimizing after 5 evaluation(s)
## INFO  [09:38:06.956] [bbotk] Result:
## INFO  [09:38:06.957] [bbotk]  threshold.thresholds learner_param_vals  x_domain classif.ce
## INFO  [09:38:06.957] [bbotk]                0.3261          <list[2]> <list[1]>      0.263
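The chosen threshold and its estimated performance can be inspected from the trained AutoTuner, which can then be used for prediction like any other learner (a minimal sketch):

# inspect the tuning result and predict with the tuned threshold
at$tuning_result
at$predict(tsk("german_credit"))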


One drawback of this strategy is that it requires fitting a new model for each threshold setting. While setting a threshold and computing the resulting performance is relatively cheap, fitting the learner is often much more computationally demanding. A better strategy is therefore often to optimize the thresholds separately after each model fit.

### 7.7.7 PipeOpTuneThreshold

PipeOpTuneThreshold, on the other hand, works together with PipeOpLearnerCV: it directly optimizes the thresholds on the cross-validated predictions made by this PipeOp. This is done in order to avoid over-fitting the threshold tuning.

A simple example would be:

gr = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>% po("tunethreshold")
l2 = as_learner(gr)

Note that predict_type = "prob" is required for po("tunethreshold") to work. Additionally, note that this time no threshold parameter is exposed; it is automatically tuned internally.

l2\$param_set
## <ParamSetCollection>
##                                         id    class lower upper nlevels
##  1:        classif.rpart.resampling.method ParamFct    NA    NA       2
##  2:         classif.rpart.resampling.folds ParamInt     2   Inf     Inf
##  3: classif.rpart.resampling.keep_response ParamLgl    NA    NA       2
##  4:                 classif.rpart.minsplit ParamInt     1   Inf     Inf
##  5:                classif.rpart.minbucket ParamInt     1   Inf     Inf
##  6:                       classif.rpart.cp ParamDbl     0     1     Inf
##  7:               classif.rpart.maxcompete ParamInt     0   Inf     Inf
##  8:             classif.rpart.maxsurrogate ParamInt     0   Inf     Inf
##  9:                 classif.rpart.maxdepth ParamInt     1    30      30
## 10:             classif.rpart.usesurrogate ParamInt     0     2       3
## 11:           classif.rpart.surrogatestyle ParamInt     0     1       2
## 12:                     classif.rpart.xval ParamInt     0   Inf     Inf
## 13:               classif.rpart.keep_model ParamLgl    NA    NA       2
## 14:           classif.rpart.affect_columns ParamUty    NA    NA     Inf
## 15:                  tunethreshold.measure ParamUty    NA    NA     Inf
## 16:                tunethreshold.optimizer ParamUty    NA    NA     Inf
## 17:                tunethreshold.log_level ParamUty    NA    NA     Inf
##            default      value
##  1: <NoDefault[3]>         cv
##  2: <NoDefault[3]>          3
##  3: <NoDefault[3]>      FALSE
##  4:             20
##  5: <NoDefault[3]>
##  6:           0.01
##  7:              4
##  8:              5
##  9:             30
## 10:              2
## 11:              0
## 12:             10          0
## 13:          FALSE
## 14:  <Selector[1]>
## 15: <NoDefault[3]> classif.ce
## 16: <NoDefault[3]>      gensa
## 17:  <function[1]>       warn
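A graph learner built this way can be resampled or benchmarked like any other learner, for example (a minimal sketch using 3-fold cross-validation):

rr = resample(tsk("german_credit"), l2, rsmp("cv", folds = 3))
rr$aggregate(msr("classif.ce"))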

Note that we can set resampling.method = "insample" for "learner_cv" in order to evaluate predictions on the “training” data. This is generally not advised, as it might lead to over-fitting on the thresholds, but it can significantly reduce runtime.
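A minimal sketch of this configuration, using the parameter id shown in the listing above:

# evaluate the threshold on the resubstituted (in-sample) predictions
l2$param_set$values$classif.rpart.resampling.method = "insample"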

For more information, see the post on Threshold Tuning on the mlr3 gallery.