Goal
Our goal for this exercise sheet is to learn how we can tune a nonlinear pipeline consisting of multiple PipeOps generated with mlr3pipelines. The underlying mechanism is that we transform our pipeline into a proper learner. As such, we can tune the hyperparameters of the learner jointly with the hyperparameters of each preprocessing step.
German Credit Dataset (dirty)
We use a dirty version of the German credit dataset donated by Prof. Dr. Hans Hofmann (University of Hamburg) in 1994, which contains 1000 data points describing bank customers. The dataset is available at the UCI repository as the Statlog (German Credit Data) Data Set. We artificially introduced missing values into the numeric features.
library("mlr3verse")
library("data.table")
task = tsk("german_credit")
dt = task$data()
set.seed(2023)
# Artificially introduce 100 missing values into each numeric feature
dt = dt[sample(task$row_ids, size = 100), duration := NA]
dt = dt[sample(task$row_ids, size = 100), amount := NA]
dt = dt[sample(task$row_ids, size = 100), age := NA]
task = as_task_classif(dt, id = "german_credit_NA", target = task$target_names)
task$missings()
## credit_risk age amount credit_history
## 0 100 100 0
## duration employment_duration foreign_worker housing
## 100 0 0 0
## installment_rate job number_credits other_debtors
## 0 0 0 0
## other_installment_plans people_liable personal_status_sex present_residence
## 0 0 0 0
## property purpose savings status
## 0 0 0 0
## telephone
## 0
Exercises:
Below, we build a GraphLearner consisting of an ML pipeline that first preprocesses the data by imputing the missing values (with one of three possible imputation methods: constant mean imputation, random sampling imputation, and model-based imputation using the decision tree regr.rpart), then filters features according to the information gain, and finally applies a random forest classif.ranger learner. These steps are reflected in the following ML pipeline (we use branching to select only one imputation method):
library("mlr3verse")
filter = po("filter", filter = flt("information_gain"))
impute = list(
"imputemean" = po("imputemean"),
"imputesample" = po("imputesample"),
"imputerpart" = po("imputelearner", learner = lrn("regr.rpart"))
)
ranger = lrn("classif.ranger", num.trees = 100)
graph = ppl("branch", impute) %>>% filter %>>% ranger
# Visualize the ML pipeline graph
plot(graph)
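To tune the pipeline, we wrap the graph into a GraphLearner so that it behaves like a single learner (we call it glrn here, which is the name the hints below refer to):
# Wrap the graph into a GraphLearner so it can be used like any learner
glrn = as_learner(graph)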
Note that the GraphLearner also combines the hyperparameters of the learner and of all preprocessing methods. You can run the code above as is; we want to use it in order to automatically tune the hyperparameters of the GraphLearner and benchmark it against some other learners.
Create a Search Space
The elements of the above graph have different hyperparameters which can be tuned (see the output of glrn$param_set for the names of the hyperparameters). Set up a search space for
- the number of features to filter, information_gain.filter.nfeat, by allowing values between 2L and 20L
- the imputation method, branch.selection, by allowing all 3 values: imputemean, imputesample, imputerpart
Hint 1:
The names of the hyperparameters can be extracted from graph$param_set. Use ps() to create a search space. Use p_int() to define the search range for integer hyperparameters, as required by information_gain.filter.nfeat, and p_fct() for categorical hyperparameters, as required by branch.selection.
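A minimal sketch of such a search space (one possible solution outline; the object name search_space is our choice):
# Search space over the number of filtered features and the imputation branch
search_space = ps(
  information_gain.filter.nfeat = p_int(lower = 2L, upper = 20L),
  branch.selection = p_fct(levels = c("imputemean", "imputesample", "imputerpart"))
)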
Find the best Hyperparameters
Use the defined search space to automatically tune the number of features for filtering and the imputation method of the GraphLearner by setting up an AutoTuner object with
- grid search as the tuner, with a resolution of 8, meaning that (if possible) up to 8 equidistant values are produced per hyperparameter
- the classification error msr("classif.ce") as performance measure
- 3-fold CV as resampling strategy
Set a seed for reproducibility (e.g., set.seed(2023)).
Recap AutoTuner:
The AutoTuner has the advantage over tuning via TuningInstanceSingleCrit or TuningInstanceMultiCrit that we do not need to extract information on the best hyperparameter settings at the end. Instead, the learner is automatically trained on the whole dataset with the best hyperparameter setting after tuning.
The AutoTuner wraps a learner and augments it with an automatic tuning process for a given set of hyperparameters. Because the AutoTuner itself inherits from the Learner base class, it can be used like any other learner. The only difference is that train() triggers the whole tuning process.
Hint 1:
With auto_tuner() a new AutoTuner instance can be initialized. The initialization method requires the following as input (see the sketch after this list):
- the GraphLearner from the previous exercise
- the hyperparameter search space (which we have already set up)
- a resampling instance initialized with rsmp(), i.e., the 3-fold cross-validation
- a performance measure, i.e., the classification error
- a termination criterion, which is trm("none") in our case since we specify the resolution in the tuner and the grid is exhausted after a fixed number of evaluations
- the tuner, i.e., grid search with its resolution
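A minimal sketch of the setup, assuming glrn and search_space are the objects from the previous exercises (the object name at is our choice):
set.seed(2023)
at = auto_tuner(
  tuner = tnr("grid_search", resolution = 8),
  learner = glrn,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  search_space = search_space,
  terminator = trm("none")
)
# train() triggers the whole tuning process on the task
at$train(task)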
Visualize Tuning Process
Visualize the tuning process using a ggplot for each of the two tuned hyperparameters.
Hint 1:
The performance results of the 3-fold CV for each configuration can be viewed via the $archive$data field of the AutoTuner. Use, e.g., the ggplot() function to analyze the relationship between the hyperparameter values and the performance values classif.ce.
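A minimal sketch, assuming at is the trained AutoTuner from the previous exercise:
library("ggplot2")
results = at$archive$data
# Classification error vs. number of filtered features
ggplot(results, aes(x = information_gain.filter.nfeat, y = classif.ce)) +
  geom_point()
# Classification error per imputation method
ggplot(results, aes(x = branch.selection, y = classif.ce)) +
  geom_boxplot()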
Extract the best HPs
Which hyperparameter combination was the best performing one?
Hint:
You can either inspect the plots from the previous exercise or have a look at the $tuning_result field of the trained AutoTuner.
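For example, assuming at is the trained AutoTuner:
# Best hyperparameter configuration and its estimated performance
at$tuning_result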
Benchmark
Benchmark the previous AutoTuner (which automatically sets the best hyperparameters of the ML pipeline) against a decision tree (using its default hyperparameter values). Use 3-fold cross-validation.
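A minimal sketch, assuming at is the AutoTuner from above; the decision tree classif.rpart is used with its default hyperparameters:
set.seed(2023)
design = benchmark_grid(
  tasks = task,
  learners = list(at, lrn("classif.rpart")),
  resamplings = rsmp("cv", folds = 3)
)
bmr = benchmark(design)
# Aggregate the classification error across folds
bmr$aggregate(msr("classif.ce"))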
Summary
In this exercise sheet, we learned how to tune a whole pipeline such that the preprocessing as well as the model fitting can be optimized for the task at hand. We set up an AutoTuner object that combines the GraphLearner with the Tuner and can be used as a proper mlr3 learner. We compared the AutoTuner with some other learners.
Of course, we only saw a selection of the full functionality of the mlr3pipelines package - if you want to learn more, have a look at the mlr3book.