Goals
Learn how to apply preprocessing steps directly on an mlr3 Task object and how to combine a preprocessing step with a learner to create a simple linear ML pipeline that first applies the preprocessing and then trains the learner.
Recap mlr3 Tasks
An mlr3 Task encapsulates data with meta-information, such as the name of the target variable and the type of the learning problem (in our example, a classification task, where the target is a factor label with relatively few distinct values).
library(mlr3)
task = tsk("german_credit")
task
## <TaskClassif:german_credit> (1000 x 21): German Credit
## * Target: credit_risk
## * Properties: twoclass
## * Features (20):
## - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
## other_installment_plans, people_liable, personal_status_sex, property, purpose, savings,
## status, telephone
## - int (3): age, amount, duration
## - ord (3): installment_rate, number_credits, present_residence
The print() method gives a short summary of the task: it has 1000 observations and 21 columns, of which 20 are features. Of these features, 17 are categorical (i.e., factors) and 3 are integer.
By using the $data() method, we get access to the data (in the form of a data.table):
str(task$data())
## Classes 'data.table' and 'data.frame': 1000 obs. of 21 variables:
## $ credit_risk : Factor w/ 2 levels "good","bad": 1 2 1 1 2 1 1 1 1 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ credit_history : Factor w/ 5 levels "delay in paying off in the past",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ employment_duration : Factor w/ 5 levels "unemployed","< 1 yr",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ foreign_worker : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ housing : Factor w/ 3 levels "for free","rent",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ installment_rate : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 2 2 3 2 3 2 2 4 ...
## $ job : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ number_credits : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 2 1 1 1 2 1 1 1 1 2 ...
## $ other_debtors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ people_liable : Factor w/ 2 levels "0 to 2","3 or more": 1 1 2 2 2 2 1 1 1 1 ...
## $ personal_status_sex : Factor w/ 4 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ present_residence : Ord.factor w/ 4 levels "< 1 yr"<"1 <= ... < 4 yrs"<..: 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "unknown / no property",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ purpose : Factor w/ 11 levels "others","car (new)",..: 4 4 7 3 1 7 3 2 4 1 ...
## $ savings : Factor w/ 5 levels "unknown/no savings account",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ status : Factor w/ 4 levels "no checking account",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ telephone : Factor w/ 2 levels "no","yes (under customer name)": 2 1 1 1 1 2 1 2 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
Note that an mlr3 Task object comes with plenty of functionality in the form of fields, methods, and active bindings (see ?Task). For example, to get a summary of all feature names, you can use:
task$feature_names
## [1] "age" "amount" "credit_history" "duration"
## [5] "employment_duration" "foreign_worker" "housing" "installment_rate"
## [9] "job" "number_credits" "other_debtors" "other_installment_plans"
## [13] "people_liable" "personal_status_sex" "present_residence" "property"
## [17] "purpose" "savings" "status" "telephone"
To obtain information about the feature types of the task (similar to the data dictionary above), we can inspect the active binding fields of the task object (see ?Task):
task$feature_types
## Key: <id>
## id type
## <char> <char>
## 1: age integer
## 2: amount integer
## 3: credit_history factor
## 4: duration integer
## 5: employment_duration factor
## 6: foreign_worker factor
## 7: housing factor
## 8: installment_rate ordered
## 9: job factor
## 10: number_credits ordered
## 11: other_debtors factor
## 12: other_installment_plans factor
## 13: people_liable factor
## 14: personal_status_sex factor
## 15: present_residence ordered
## 16: property factor
## 17: purpose factor
## 18: savings factor
## 19: status factor
## 20: telephone factor
## id type
Exercises
Exercise 1: Preprocess a Task (with One-Hot Encoding)
Use the one-hot encoding PipeOp to convert all categorical features of the german_credit task, producing a preprocessed task that contains 0-1 indicator variables for each category level instead of the original categorical features.
Hint 1:
Load the mlr3pipelines package and get an overview of the available PipeOps for different preprocessing steps by printing mlr_pipeops or the first two columns of the corresponding table, as.data.table(mlr_pipeops)[, 1:2]. Look for a factor encoding and pass the corresponding key to the po() function (see also the help page ?PipeOpEncode). Then, use the $train() method of the PipeOp object, which expects a list containing the task to be converted as input and produces a list containing the converted task.
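For orientation, a minimal sketch following this hint could look as follows (assuming, as documented in ?PipeOpEncode, that the factor-encoding PipeOp has the key "encode" and supports the method "one-hot"):
library(mlr3pipelines)
# create the factor-encoding PipeOp; "one-hot" creates 0-1 indicator columns
poe = po("encode", method = "one-hot")
# $train() expects a list of inputs and returns a list of outputs
task_encoded = poe$train(list(task))[[1]]
task_encoded$feature_types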
Exercise 2: Create a Simple ML Pipeline (with One-Hot Encoding)
Some learners cannot handle categorical features, such as the xgboost learner (which gives an error message when applied to a task containing categorical features):
library(mlr3verse)
lrnxg = lrn("classif.xgboost")
lrnxg$train(task)
## Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered
lrnxg$predict(task)
## Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered
Combine the xgboost learner with a preprocessing step that applies one-hot encoding to create an ML pipeline that first converts all categorical features to 0-1 indicator variables and then applies the xgboost learner. Train the ML pipeline on the german_credit task and make predictions on the training data.
Hint 1:
You can create a Graph that combines a PipeOp object with a learner object (or further PipeOp objects) by concatenating them using the %>>% operator. The Graph contains all the information of a sequential ML pipeline. Convert the Graph into a GraphLearner to be able to run the whole ML pipeline like a usual learner object, with which we can train, predict, resample, and benchmark as we have learned. See also the help page ?GraphLearner.
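One possible sketch along these lines (assuming as_learner() is used to convert the Graph; see ?GraphLearner for the constructor-based alternative):
library(mlr3verse)
# chain the one-hot encoding PipeOp and the xgboost learner into a Graph
graph = po("encode", method = "one-hot") %>>% lrn("classif.xgboost")
# wrap the Graph into a GraphLearner so it behaves like a usual learner
glrn = as_learner(graph)
glrn$train(task)
glrn$predict(task)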
Summary
We learned how to apply preprocessing steps such as factor encoding directly on a task. Furthermore, we have also seen how to create a GraphLearner that applies an ML pipeline to a task: it first performs all preprocessing steps defined in the Graph and then trains a learner on the preprocessed task.