Goal
We will go beyond resampling single learners. We will learn how to
compare a large number of different models using benchmarking. In this
exercise, we will not show you how to tune a learner. Instead, we will
compare identical learners with different hyperparameters that are set
manually. In particular, we will learn how to set up benchmarking
instances in mlr3.
German Credit Data
We create the task as in the resampling exercise. Again, we make use of our workhorse: the German Credit data set.
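A minimal sketch of this step, assuming the mlr3 package is installed and that the task is taken from mlr3's built-in task dictionary under the key "german_credit" (the variable name task is our choice):

```r
library(mlr3)

# Retrieve the German Credit classification task shipped with mlr3
task = tsk("german_credit")

# Quick look at the task: target, features, number of observations
print(task)
```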
Exercise: Benchmark multiple learners
We are going to compare a range of KNN models with \(k\) from 3 to 30. In addition, we want to assess the performance of a logistic regression.
Create the learners
Create a logistic regression learner and many KNN learners. You
should cover all KNNs with a \(k\) between 3 and 30. Save all learners
in a list. Give the KNN learners an appropriate id that reflects their \(k\).
Show Hint 1:
Use the lapply function or a for-loop to create the list of
learners with \(k\) between 3 and 30.
Don’t forget to also include the logistic regression learner in your
list (the append function might be helpful here to extend an existing list).
The lrn function has an argument id that can be used to change the name of
the learner (here, you should give the KNN learners an appropriate id
that reflects their value of \(k\) to be able to distinguish the learners).
Show Hint 2:
To create a list of KNN learners, you can use this template: lapply(..., function(i) lrn("classif.kknn", k = i, id = paste0("classif.knn", i)))
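One possible way to fill in the template is sketched below. It assumes the mlr3learners package (which provides classif.log_reg and classif.kknn) and the kknn package are installed. The variable names knn_learners and learners are our own choice.

```r
library(mlr3)
library(mlr3learners)  # provides classif.log_reg and classif.kknn

# One KNN learner per k in 3..30, each with an id that encodes k
knn_learners = lapply(3:30, function(i) {
  lrn("classif.kknn", k = i, id = paste0("classif.knn", i))
})

# Put the logistic regression first, then append the KNN learners
learners = append(list(lrn("classif.log_reg")), knn_learners)

# Check that every learner has a distinguishable id
sapply(learners, function(l) l$id)
```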
Create the resampling
Create a 4-fold cross-validation resampling. Create a list that only
contains this resampling (this is needed later for the benchmark_grid
function).
Show Hint:
See the previous resampling use case.
Create a benchmarking design
To design your benchmark experiment consisting of tasks, learners and
resampling technique, you can use the benchmark_grid function from mlr3.
Here, we will use only one task and one resampling technique but
multiple learners. Use the previously created task (German Credit),
learners (the list of many KNN learners and a single logistic regression
learner) and resampling (4-fold CV) as input.
Show Hint 1:
Also make sure that the task is wrapped in a list, as the arguments of the
benchmark_grid function require lists as input.
Show Hint 2:
benchmark_grid(...)
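The resampling and design steps can be sketched as follows. This is one possible solution, assuming mlr3, mlr3learners and kknn are installed. The variable names task, learners, resamplings and design are our own choice.

```r
library(mlr3)
library(mlr3learners)

task = tsk("german_credit")
learners = c(
  list(lrn("classif.log_reg")),
  lapply(3:30, function(i) lrn("classif.kknn", k = i, id = paste0("classif.knn", i)))
)

# 4-fold cross-validation, wrapped in a list for benchmark_grid
resamplings = list(rsmp("cv", folds = 4))

# Full cross product: 1 task x 29 learners x 1 resampling
design = benchmark_grid(
  tasks = list(task),
  learners = learners,
  resamplings = resamplings
)
print(design)
```

The design is a table with one row per combination of task, learner and resampling, so here it has 29 rows.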
Run the benchmark
Now you still need to run all experiments specified in the design. Do
so by using the benchmark function. This may take some time (still less
than a minute). Make sure to store the benchmark in a new object called
bmr, as you will reuse and inspect the benchmark result in the
subsequent exercises.
Show Hint 1:
bmr = benchmark(...)
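A shortened sketch of this step is given below. To keep the run fast it uses a reduced design with only three values of \(k\); this reduction and the seed value are our assumptions, and the exercise itself asks for all \(k\) from 3 to 30.

```r
library(mlr3)
library(mlr3learners)

set.seed(1)  # arbitrary seed so the CV folds are reproducible

# Reduced design (assumption: three k values only, to keep the run short)
learners = c(
  list(lrn("classif.log_reg")),
  lapply(c(3, 15, 30), function(i) lrn("classif.kknn", k = i, id = paste0("classif.knn", i)))
)
design = benchmark_grid(
  tasks = list(tsk("german_credit")),
  learners = learners,
  resamplings = list(rsmp("cv", folds = 4))
)

# Train and predict every learner on every fold
bmr = benchmark(design)
print(bmr)
```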
Evaluate the benchmark
Choose two appropriate metrics to evaluate the different learners'
performance on the task. Compute these metrics and also visualize at
least one of them using the autoplot function.
Show Hint 1:
The previously stored benchmark object has a method $aggregate(...),
just like the objects created with the resample function from the
previous use case.
Show Hint 2:
autoplot(..., measure = msr(...))
Solution
In a credit use case, the number of false negatives (classif.fn) may be interesting to study alongside the accuracy.
res = bmr$aggregate(measures = c(msr("classif.fn"), msr("classif.acc")))
head(res)
## nr task_id learner_id resampling_id iters classif.fn classif.acc
## <int> <char> <char> <char> <int> <num> <num>
## 1: 1 german_credit classif.log_reg cv 4 26.00 0.750
## 2: 2 german_credit classif.knn3 cv 4 33.00 0.692
## 3: 3 german_credit classif.knn4 cv 4 33.00 0.692
## 4: 4 german_credit classif.knn5 cv 4 24.00 0.712
## 5: 5 german_credit classif.knn6 cv 4 23.25 0.712
## 6: 6 german_credit classif.knn7 cv 4 23.25 0.712
## Hidden columns: resample_result
autoplot(bmr, measure = msr("classif.acc"))
Interpret the results
Interpret the plot. Which \(k\) seems to work well given the task? Would you prefer a logistic regression over a KNN learner?
Solution
A \(k\) of approx. 15 seems to perform best (in terms of accuracy). A too small \(k\) overfits, a too large one underfits. Without knowing a good \(k\) in advance, a logistic regression seems preferable: if \(k\) is too small, the average performance of the logistic regression is much better. However, with an optimal \(k\), the accuracy of KNN is comparable to that of the logistic regression but with a lower variance. (Note that this is somewhat seed-dependent.)
Extra: Parallelize your efforts
Benchmarking is embarrassingly parallel. That means
it is very easy to run the experiments of the benchmark on different
machines or cores. In many cases (not all!), this can significantly
reduce computation time. We recommend using the future::plan function
when parallelizing mlr3 benchmarks.
Show Hint 1:
You need to use the plan function twice: once to set up a
multisession, then to go back to sequential.
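A minimal sketch of the parallel run, assuming the future package is installed in addition to mlr3, mlr3learners and kknn. The small two-learner design is only an illustration, not the full design from the exercise.

```r
library(mlr3)
library(mlr3learners)

# Small illustrative design (assumption: two learners only)
design = benchmark_grid(
  tasks = list(tsk("german_credit")),
  learners = list(
    lrn("classif.log_reg"),
    lrn("classif.kknn", k = 15, id = "classif.knn15")
  ),
  resamplings = list(rsmp("cv", folds = 4))
)

# Switch to parallel execution in background R sessions
future::plan("multisession")
bmr = benchmark(design)

# Switch back to sequential execution
future::plan("sequential")
```

For a design this small, the overhead of starting the background sessions may outweigh the gain, which is one of the cases where parallelization does not pay off.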
Summary
We learnt how to set up benchmarks in mlr3. While we only
looked at a single task and a single resampling, the procedure easily
applies to more complex benchmarks with many tasks. Additionally, we
learnt how to understand benchmark results. Last but not least, you may
have parallelized your benchmark if you still had some time left.