Goals
Learn how to apply preprocessing steps directly to an mlr3
Task object and how to combine a preprocessing step with a
learner to create a simple linear ML pipeline that first applies the
preprocessing and then trains the learner.
Recap mlr3 Tasks
An mlr3 Task encapsulates data together with
meta-information, such as the name of the target variable and the type
of the learning problem (in our example, a
classification task, where the target is a factor
with relatively few distinct values).
library(mlr3)
task = tsk("german_credit")
task
## <TaskClassif:german_credit> (1000 x 21): German Credit
## * Target: credit_risk
## * Properties: twoclass
## * Features (20):
## - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
## other_installment_plans, people_liable, personal_status_sex, property, purpose, savings,
## status, telephone
## - int (3): age, amount, duration
## - ord (3): installment_rate, number_credits, present_residence
The print() method gives a short summary of the task: it
has 1000 observations and 21 columns, of which 20 are features. 17
features are categorical (14 unordered and 3 ordered factors) and 3 features are integer.
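The numbers in the printed summary can also be read off the task programmatically. The following is a minimal sketch that only uses the nrow, ncol and feature_types fields of the task; the use of table() for counting is just one convenient choice.

```r
library(mlr3)

task = tsk("german_credit")

# Number of observations and number of columns (features plus target)
task$nrow
task$ncol

# Number of features per storage type (factor, integer, ordered)
table(task$feature_types$type)
```

Factors and ordered factors together make up the categorical features mentioned above.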
By using the $data() method, we get access to the data
(in the form of a data.table):
str(task$data())
## Classes 'data.table' and 'data.frame': 1000 obs. of 21 variables:
## $ credit_risk : Factor w/ 2 levels "good","bad": 1 2 1 1 2 1 1 1 1 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ credit_history : Factor w/ 5 levels "delay in paying off in the past",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ employment_duration : Factor w/ 5 levels "unemployed","< 1 yr",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ foreign_worker : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ housing : Factor w/ 3 levels "for free","rent",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ installment_rate : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 2 2 3 2 3 2 2 4 ...
## $ job : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ number_credits : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 2 1 1 1 2 1 1 1 1 2 ...
## $ other_debtors : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ people_liable : Factor w/ 2 levels "0 to 2","3 or more": 1 1 2 2 2 2 1 1 1 1 ...
## $ personal_status_sex : Factor w/ 4 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ present_residence : Ord.factor w/ 4 levels "< 1 yr"<"1 <= ... < 4 yrs"<..: 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "unknown / no property",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ purpose : Factor w/ 11 levels "others","car (new)",..: 4 4 7 3 1 7 3 2 4 1 ...
## $ savings : Factor w/ 5 levels "unknown/no savings account",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ status : Factor w/ 4 levels "no checking account",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ telephone : Factor w/ 2 levels "no","yes (under customer name)": 2 1 1 1 1 2 1 2 1 1 ...
## - attr(*, ".internal.selfref")=<externalptr>
Note that an mlr3 Task object comes with
plenty of functionality in the form of fields, methods and active
bindings (see ?Task). For example, to get all feature
names, you can use:
task$feature_names
## [1] "age" "amount" "credit_history" "duration"
## [5] "employment_duration" "foreign_worker" "housing" "installment_rate"
## [9] "job" "number_credits" "other_debtors" "other_installment_plans"
## [13] "people_liable" "personal_status_sex" "present_residence" "property"
## [17] "purpose" "savings" "status" "telephone"
To obtain information about the feature types of the task
(similar to the str() output above), we can inspect the feature_types
active binding of the task object (see ?Task):
task$feature_types
## Key: <id>
## id type
## <char> <char>
## 1: age integer
## 2: amount integer
## 3: credit_history factor
## 4: duration integer
## 5: employment_duration factor
## 6: foreign_worker factor
## 7: housing factor
## 8: installment_rate ordered
## 9: job factor
## 10: number_credits ordered
## 11: other_debtors factor
## 12: other_installment_plans factor
## 13: people_liable factor
## 14: personal_status_sex factor
## 15: present_residence ordered
## 16: property factor
## 17: purpose factor
## 18: savings factor
## 19: status factor
## 20: telephone factor
## id type
Exercises
Exercise 1: Preprocess a Task (with One-Hot Encoding)
Use the one-hot encoding PipeOp to convert all
categorical features from the german_credit task into a
preprocessed task containing 0-1 indicator variables for each category
level instead of categorical features.
Hint 1:
Load the mlr3pipelines package and get an overview of the
available PipeOps that can be used for different
preprocessing steps by printing mlr_pipeops or the first
two columns of the corresponding table,
as.data.table(mlr_pipeops)[,1:2]. Look for the PipeOp that performs factor
encoding and pass its key to
the po() function (see also the help page
?PipeOpEncode). Then, use the $train() method
of the PipeOp object, which expects a list
containing the task to be converted as input and returns a
list containing the converted task.
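The steps described in the hint could look roughly as follows. This is a sketch of one possible solution, not the only one: the variable names poe and task_encoded are made up for illustration, and method = "one-hot" is set explicitly even though it is, to our knowledge, the default of PipeOpEncode.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("german_credit")

# Overview of available PipeOps: key and class name
as.data.table(mlr_pipeops)[, 1:2]

# PipeOp for factor encoding; "one-hot" creates one 0-1 column per factor level
poe = po("encode", method = "one-hot")

# $train() expects a list of inputs and returns a list of outputs,
# so the converted task is the first (and only) list element
task_encoded = poe$train(list(task))[[1]]

# The converted task should no longer contain factor or ordered features
task_encoded$feature_types
```

Note that the original task object is left unchanged; the PipeOp returns a new, converted task.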
Exercise 2: Create a Simple ML Pipeline (with One-Hot Encoding)
Some learners, such as the
xgboost learner, cannot handle categorical features (it gives an error message when applied
to a task containing categorical features):
library(mlr3verse)
lrnxg = lrn("classif.xgboost")
lrnxg$train(task)
## Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered
lrnxg$predict(task)
## Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered
Combine the xgboost learner with a preprocessing step
that applies one-hot encoding to create an ML pipeline that first
converts all categorical features to 0-1 indicator variables and then
applies the xgboost learner. Train the ML pipeline on the
german_credit task and make predictions on the training
data.
Hint 1:
You can create a Graph that combines a
PipeOp object with a learner object (or further
PipeOp objects) by concatenating them using the
%>>% operator. The Graph contains all
information of a sequential ML pipeline. Convert the Graph
into a GraphLearner to run the whole ML pipeline
like a regular learner object, which we can train, predict with, resample,
and benchmark as we have learned. See also
the help page ?GraphLearner.
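A possible way to put the hint into practice is sketched below. The names graph, glrn and pred are made up for illustration, and nrounds = 10 is an assumption chosen only to keep the run short; it is not part of the exercise. as_learner() is used here to convert the Graph; GraphLearner$new(graph) should work as well.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3learners)  # provides the classif.xgboost learner

task = tsk("german_credit")

# Sequential pipeline: one-hot encoding first, then the xgboost learner
# (nrounds = 10 is an arbitrary choice to keep training fast)
graph = po("encode", method = "one-hot") %>>% lrn("classif.xgboost", nrounds = 10)

# Wrap the Graph so that it behaves like a regular learner
glrn = as_learner(graph)

# Train on the task and predict on the training data
glrn$train(task)
pred = glrn$predict(task)

# Training error; this is an optimistic estimate of the generalization error
pred$score(msr("classif.ce"))
```

Because the encoding is part of the GraphLearner, it is applied automatically both during training and during prediction, so the task with categorical features can be passed in directly.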
Summary
We learned how to apply preprocessing steps such as factor encoding
directly to a task. Furthermore, we have seen how to create a
GraphLearner, which applies an ML pipeline to a task: it
first performs all preprocessing steps defined in the Graph and
then trains a learner on the preprocessed task.