Our goal for this exercise sheet is to learn how we can tune a
nonlinear pipeline consisting of multiple PipeOps generated with
mlr3pipelines. The underlying mechanism is that we transform our
pipeline into a proper learner. As such we can tune the
hyperparameters of the learner jointly with the hyperparameters of each
preprocessing step.
German Credit Dataset (dirty)
We use a dirty version of the German credit dataset provided in 1994 by Prof. Dr. Hans Hofmann of the University of Hamburg. It contains 1000 data points reflecting bank customers. The dataset is available at the UCI repository as Statlog (German Credit Data) Data Set. We artificially introduced missing values to the numeric features duration, amount, and age (100 missing values each).
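library(mlr3verse)

task = tsk("german_credit")
dt = task$data()
# Set 100 randomly chosen rows to NA in each of the three numeric features
dt = dt[sample(task$row_ids, size = 100), duration := NA]
dt = dt[sample(task$row_ids, size = 100), amount := NA]
dt = dt[sample(task$row_ids, size = 100), age := NA]
task = as_task_classif(dt, id = "german_credit_NA", target = task$target_names)
# Count the missing values per column
task$missings()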
##             credit_risk                     age                  amount          credit_history
##                       0                     100                     100                       0
##                duration     employment_duration          foreign_worker                 housing
##                     100                       0                       0                       0
##        installment_rate                     job          number_credits           other_debtors
##                       0                       0                       0                       0
## other_installment_plans           people_liable     personal_status_sex       present_residence
##                       0                       0                       0                       0
##                property                 purpose                 savings                  status
##                       0                       0                       0                       0
##               telephone
##                       0
Below, we build a GraphLearner which consists of an ML pipeline that
first preprocesses the data by imputing the missing values (with one of
three possible imputation methods: constant mean imputation, random
sampling imputation, and model-based imputation using the decision tree
regr.rpart), then filters features according to the information gain,
and finally applies a random forest learner. These steps can be
reflected in the following ML pipeline (we use branching to select only
one imputation method):
filter = po("filter", filter = flt("information_gain"))
impute = list(
  "imputemean" = po("imputemean"),
  "imputesample" = po("imputesample"),
  "imputerpart" = po("imputelearner", learner = lrn("regr.rpart"))
)
ranger = lrn("classif.ranger", num.trees = 100)
# Branch over the imputation methods, then filter, then fit the forest
graph = ppl("branch", impute) %>>% filter %>>% ranger
# Visualize the ML pipeline graph
plot(graph)
# Turn the graph into a proper learner
glrn = as_learner(graph)
Note that the GraphLearner also combines the hyperparameters of the
learner and of all preprocessing methods. Run the code above, since we
will use it to automatically tune the hyperparameters of the
GraphLearner and to benchmark it with some other learners.
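To make the combined hyperparameter set concrete, the following is a minimal sketch that rebuilds the pipeline and lists a few of its hyperparameter IDs. It assumes that mlr3verse, ranger, rpart, and FSelectorRcpp are installed; the grep() pattern is only an illustrative choice for narrowing down the output.

```r
library(mlr3verse)

# Same pipeline as above: branch over imputation methods, filter, random forest
impute = list(
  "imputemean" = po("imputemean"),
  "imputesample" = po("imputesample"),
  "imputerpart" = po("imputelearner", learner = lrn("regr.rpart"))
)
graph = ppl("branch", impute) %>>%
  po("filter", filter = flt("information_gain")) %>>%
  lrn("classif.ranger", num.trees = 100)
glrn = as_learner(graph)

# Each hyperparameter ID is prefixed with the ID of the PipeOp it belongs to
ids = glrn$param_set$ids()

# Show only the IDs related to branching, filtering, and the number of trees
grep("branch|nfeat|num.trees", ids, value = TRUE)
```

The prefixes are what allow the learner and every preprocessing step to be tuned through one shared parameter set.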
Create a Search Space
The elements of the above graph have different hyperparameters which
can be tuned (see the output of glrn$param_set for the names of the
hyperparameters). Set up a search space for
- the number of features to filter, by allowing integer values from 2L upward
- the imputation method, by allowing all 3 values: imputemean, imputesample, and imputerpart
Hint 1:
The names of the hyperparameters can be extracted from glrn$param_set.
Use ps() to create a search space. Use p_int() to define the search
range for integer hyperparameters, as required by the number of
features to filter, and p_fct() for categorical hyperparameters, as
required by the imputation method.
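One possible way to write such a search space is sketched below. The hyperparameter IDs are the ones that the pipeline above is expected to expose (they can be checked against glrn$param_set$ids()). The upper bound for the number of features is an assumption of this sketch: it is set to the number of features in the task, which is the largest value that can have any effect.

```r
library(mlr3)
library(paradox)

# Assumption: use the number of features in the task as the upper bound
n_features = length(tsk("german_credit")$feature_names)

search_space = ps(
  # Integer hyperparameter: how many features the filter keeps
  information_gain.filter.nfeat = p_int(lower = 2L, upper = n_features),
  # Categorical hyperparameter: which imputation branch is active
  branch.selection = p_fct(levels = c("imputemean", "imputesample", "imputerpart"))
)

search_space
```

The names used inside ps() must match the IDs in the parameter set of the GraphLearner exactly, otherwise the tuner cannot map the sampled values onto the pipeline.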
Find the best Hyperparameters
Use the defined search space to automatically tune the number of
features for filtering and the imputation method of the GraphLearner
by setting up an AutoTuner with
- grid search as the tuner, with a resolution of 8, meaning that if possible up to 8 equidistant values are produced per hyperparameter
- the classification error as performance measure
- 3-fold CV as resampling strategy
Set a seed for reproducibility (e.g., via set.seed()).
Recap AutoTuner
The AutoTuner has the advantage over tuning with a standalone tuning
instance that we do not need to extract information on the best
hyperparameter setting at the end. Instead, the learner is
automatically trained on the whole dataset with the best hyperparameter
setting after tuning.
The AutoTuner wraps a learner and augments it with an automatic tuning
process for a given set of hyperparameters. Because the AutoTuner
itself inherits from the Learner base class, it can be used like any
other learner. The only difference is that train() triggers the whole
tuning process.
Hint 1:
With auto_tuner() a new AutoTuner can be initialized. The
initialization method requires the following as input:
- the GraphLearner from the previous exercise
- the hyperparameter search space (which we have already set up)
- a resampling instance initialized with rsmp(): the 3-fold cross-validation
- a performance measure: the classification error
- a termination criterion, which is none in our case since we specify the resolution in the tuner
- the tuner, i.e., grid search with its resolution
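The ingredients listed above could be put together roughly as follows. This is a self-contained sketch, not the reference solution: it assumes a recent mlr3tuning version in which auto_tuner() takes a tuner argument, the seed value is arbitrary, and the upper bound of the filter range (the number of features in the task) is an assumption of this sketch.

```r
library(mlr3verse)

set.seed(1)  # arbitrary seed, chosen only for this sketch

# Task with artificially introduced missing values (as above)
task = tsk("german_credit")
dt = task$data()
for (col in c("duration", "amount", "age")) {
  data.table::set(dt, i = sample(nrow(dt), 100), j = col, value = NA)
}
task = as_task_classif(dt, id = "german_credit_NA", target = "credit_risk")

# Pipeline: branch over imputation methods, filter, random forest
impute = list(
  "imputemean" = po("imputemean"),
  "imputesample" = po("imputesample"),
  "imputerpart" = po("imputelearner", learner = lrn("regr.rpart"))
)
graph = ppl("branch", impute) %>>%
  po("filter", filter = flt("information_gain")) %>>%
  lrn("classif.ranger", num.trees = 100)
glrn = as_learner(graph)

# Search space; the upper bound is an assumption (number of features)
search_space = ps(
  information_gain.filter.nfeat = p_int(lower = 2L, upper = length(task$feature_names)),
  branch.selection = p_fct(levels = names(impute))
)

at = auto_tuner(
  tuner = tnr("grid_search", resolution = 8),
  learner = glrn,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  search_space = search_space,
  terminator = trm("none")  # the grid itself is finite, so no extra stopping rule
)

# train() runs the full grid search and then refits on all data
at$train(task)
at$tuning_result
```

Because the grid is exhausted after all combinations have been evaluated, the "none" terminator is sufficient here; with a random search, a budget-based terminator would be needed instead.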
Visualize Tuning Process
Visualize the tuning process using a ggplot for each of the two tuned
hyperparameters.
Hint 1:
The performance results of the 3-fold CV for each configuration can be
viewed via the $archive$data field of the trained AutoTuner. Use e.g.
the ggplot() function to analyze the relationship between the
hyperparameter values and the performance values classif.ce.
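A possible plotting helper is sketched below. The function name plot_tuning_archive() is hypothetical, and the column names are assumed to match the hyperparameter IDs used in the search space (information_gain.filter.nfeat and branch.selection) plus the measure column classif.ce.

```r
library(ggplot2)

# Hypothetical helper: expects the tuning archive as a data.frame/data.table
# with columns information_gain.filter.nfeat, branch.selection, and classif.ce
plot_tuning_archive = function(archive_dt) {
  # Classification error against the number of filtered features,
  # one line per imputation method
  p_nfeat = ggplot(archive_dt, aes(
    x = information_gain.filter.nfeat,
    y = classif.ce,
    color = branch.selection
  )) +
    geom_point() +
    geom_line()

  # Distribution of the classification error per imputation method
  p_impute = ggplot(archive_dt, aes(x = branch.selection, y = classif.ce)) +
    geom_boxplot()

  list(nfeat = p_nfeat, impute = p_impute)
}

# Usage with a trained AutoTuner `at` (not created in this block):
# plots = plot_tuning_archive(as.data.table(at$archive))
# plots$nfeat
# plots$impute
```

Returning both plots in a list keeps the helper free of side effects, so each plot can be printed or saved separately.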
Extract the best HPs
Which hyperparameter combination performed best? You can either
inspect the plots from the previous exercise or have a look at the
$tuning_result field of the trained AutoTuner.
Benchmark the previous AutoTuner (which automatically sets the best
hyperparameters of the ML pipeline) against a decision tree (using its
default hyperparameter values). Use 3-fold cross-validation.
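One way to set up such a comparison is sketched below. The helper compare_with_tree() is hypothetical; it accepts any classification learner, so the trained or untrained AutoTuner can be passed in. Note that benchmarking an AutoTuner results in nested resampling: the tuning runs again inside every outer training fold.

```r
library(mlr3verse)

# Hypothetical helper: benchmark a learner against a default decision tree
compare_with_tree = function(task, learner, folds = 3) {
  design = benchmark_grid(
    tasks = task,
    learners = list(learner, lrn("classif.rpart")),
    resamplings = rsmp("cv", folds = folds)
  )
  bmr = benchmark(design)
  # Mean classification error over the outer folds, one row per learner
  bmr$aggregate(msr("classif.ce"))
}

# Usage with the task and AutoTuner from the previous exercises
# (neither is created in this block):
# compare_with_tree(task, at)
```

The decision tree can be trained directly on the task with missing values because rpart handles them internally, so no extra imputation step is needed for the baseline.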
In this exercise sheet, we learned how to tune the whole pipeline
such that preprocessing as well as model fitting can be optimized for
the task at hand. We set up an AutoTuner object that combines the
GraphLearner with the Tuner and can be used as a proper mlr3 learner.
We compared the AutoTuner with another learner, a decision tree.
Of course, we only saw a selection of the full functionality of these
packages. If you want to learn more, have a look at the mlr3book.