Supervised Learning II

Categorical Feature Encoding

Goals

Learn how to apply preprocessing steps directly to an mlr3 Task object, and how to combine a preprocessing step with a learner into a simple linear ML pipeline that first applies the preprocessing and then trains the learner.

Recap: mlr3 Tasks

An mlr3 Task encapsulates data together with meta-information, such as the name of the target variable and the type of the learning problem. In our example this is a classification task, where the target is a factor with relatively few distinct values.

library(mlr3)
task = tsk("german_credit")
task
## <TaskClassif:german_credit> (1000 x 21): German Credit
## * Target: credit_risk
## * Properties: twoclass
## * Features (20):
##   - fct (14): credit_history, employment_duration, foreign_worker, housing, job, other_debtors,
##     other_installment_plans, people_liable, personal_status_sex, property, purpose, savings,
##     status, telephone
##   - int (3): age, amount, duration
##   - ord (3): installment_rate, number_credits, present_residence

The print() method gives a short summary of the task: it has 1000 observations and 21 columns, of which 20 are features. 17 features are categorical (14 unordered factors and 3 ordered factors) and 3 features are integer.
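The same summary numbers can also be queried programmatically. The following is a minimal sketch that only uses fields documented in ?Task (nrow, ncol, feature_names, target_names, task_type); the variable name meta is chosen here for illustration.

```r
library(mlr3)

task = tsk("german_credit")

# Collect some of the meta-information stored in the task
meta = list(
  n_obs      = task$nrow,                  # number of observations
  n_cols     = task$ncol,                  # target + features
  n_features = length(task$feature_names), # features only
  target     = task$target_names,          # name of the target column
  type       = task$task_type              # type of the learning problem
)
str(meta)
```

The values should agree with the printed summary above (1000 observations, 21 columns, 20 features, target credit_risk).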

By using the $data() method, we get access to the data (in the form of a data.table):

str(task$data())
## Classes 'data.table' and 'data.frame':   1000 obs. of  21 variables:
##  $ credit_risk            : Factor w/ 2 levels "good","bad": 1 2 1 1 2 1 1 1 1 2 ...
##  $ age                    : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ amount                 : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ credit_history         : Factor w/ 5 levels "delay in paying off in the past",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ duration               : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ employment_duration    : Factor w/ 5 levels "unemployed","< 1 yr",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ foreign_worker         : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ housing                : Factor w/ 3 levels "for free","rent",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ installment_rate       : Ord.factor w/ 4 levels ">= 35"<"25 <= ... < 35"<..: 4 2 2 2 3 2 3 2 2 4 ...
##  $ job                    : Factor w/ 4 levels "unemployed/unskilled - non-resident",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ number_credits         : Ord.factor w/ 4 levels "1"<"2-3"<"4-5"<..: 2 1 1 1 2 1 1 1 1 2 ...
##  $ other_debtors          : Factor w/ 3 levels "none","co-applicant",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ other_installment_plans: Factor w/ 3 levels "bank","stores",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ people_liable          : Factor w/ 2 levels "0 to 2","3 or more": 1 1 2 2 2 2 1 1 1 1 ...
##  $ personal_status_sex    : Factor w/ 4 levels "male : divorced/separated",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ present_residence      : Ord.factor w/ 4 levels "< 1 yr"<"1 <= ... < 4 yrs"<..: 4 2 3 4 4 4 4 2 4 2 ...
##  $ property               : Factor w/ 4 levels "unknown / no property",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ purpose                : Factor w/ 11 levels "others","car (new)",..: 4 4 7 3 1 7 3 2 4 1 ...
##  $ savings                : Factor w/ 5 levels "unknown/no savings account",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ status                 : Factor w/ 4 levels "no checking account",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ telephone              : Factor w/ 2 levels "no","yes (under customer name)": 2 1 1 1 1 2 1 2 1 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>
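The $data() method also accepts rows and cols arguments, which is handy when only a slice of the data is needed. A small sketch (the selected rows and columns are arbitrary choices for illustration):

```r
library(mlr3)

task = tsk("german_credit")

# Only the first three observations and three selected columns
small = task$data(rows = 1:3, cols = c("credit_risk", "age", "amount"))
small
```

The result is again a data.table, here with 3 rows and 3 columns.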

Note that an mlr3 Task object comes with plenty of functionality in the form of fields, methods and active bindings (see ?Task). For example, to list all feature names, you can use:

task$feature_names
##  [1] "age"                     "amount"                  "credit_history"          "duration"               
##  [5] "employment_duration"     "foreign_worker"          "housing"                 "installment_rate"       
##  [9] "job"                     "number_credits"          "other_debtors"           "other_installment_plans"
## [13] "people_liable"           "personal_status_sex"     "present_residence"       "property"               
## [17] "purpose"                 "savings"                 "status"                  "telephone"

To obtain the type of each feature (similar to the str() output above), we can inspect the feature_types active binding of the task object (see ?Task):

task$feature_types
## Key: <id>
##                          id    type
##                      <char>  <char>
##  1:                     age integer
##  2:                  amount integer
##  3:          credit_history  factor
##  4:                duration integer
##  5:     employment_duration  factor
##  6:          foreign_worker  factor
##  7:                 housing  factor
##  8:        installment_rate ordered
##  9:                     job  factor
## 10:          number_credits ordered
## 11:           other_debtors  factor
## 12: other_installment_plans  factor
## 13:           people_liable  factor
## 14:     personal_status_sex  factor
## 15:       present_residence ordered
## 16:                property  factor
## 17:                 purpose  factor
## 18:                 savings  factor
## 19:                  status  factor
## 20:               telephone  factor
##                          id    type
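Because feature_types is an ordinary table with the columns id and type, it can be used to pick out the categorical features that will later need encoding. A minimal sketch using base R subsetting (the names ft and cat_features are chosen here for illustration):

```r
library(mlr3)

task = tsk("german_credit")
ft = task$feature_types

# Names of all categorical features (unordered and ordered factors)
cat_features = ft$id[ft$type %in% c("factor", "ordered")]
length(cat_features)

# Number of features per type
table(ft$type)
```

The count of categorical features should match the 17 reported in the task summary.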

Exercises

Exercise 1: Preprocess a Task (with One-Hot Encoding)

Use the one-hot encoding PipeOp to convert all categorical features from the german_credit task into a preprocessed task containing 0-1 indicator variables for each category level instead of categorical features.
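To make clear what one-hot encoding produces, here is a small illustration in base R on a hypothetical toy data frame. It deliberately does not use mlr3pipelines, so it is not the solution to the exercise; model.matrix() is only used here to show the resulting 0-1 indicator columns for a single factor.

```r
# Hypothetical toy data: one categorical feature with three levels
toy = data.frame(housing = factor(c("rent", "own", "for free", "rent")))

# One indicator column per level; "- 1" drops the intercept so that
# no level of this single factor is absorbed into a reference category
onehot = model.matrix(~ housing - 1, data = toy)
onehot
```

Each row contains exactly one 1, in the column of the level that the observation belongs to.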

Hint 1:

Load the mlr3pipelines package and get an overview of the available PipeOps for different preprocessing steps by printing mlr_pipeops, or the first two columns of the corresponding table with as.data.table(mlr_pipeops)[, 1:2]. Look for a factor encoding PipeOp and pass its key to the po() function (see also the help page ?PipeOpEncode). Then use the $train() method of the PipeOp object, which expects a list containing the task to be converted as input and returns a list containing the converted task.
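The list-in, list-out mechanics of $train() can be sketched with a different PipeOp on a different task, so that the exercise itself stays open. The example below assumes the scale PipeOp (key "scale") and the built-in iris task; both are illustrative choices and not part of the exercise.

```r
library(mlr3)
library(mlr3pipelines)

task_iris = tsk("iris")  # example task, not the one from the exercise
pos = po("scale")        # PipeOp that centers and scales numeric features

# $train() takes a list of inputs and returns a named list of outputs
result = pos$train(list(task_iris))
names(result)

# The preprocessed task is stored in the "output" element
scaled_task = result$output
feature_means = colMeans(scaled_task$data(cols = scaled_task$feature_names))
round(feature_means, 10)
```

After scaling, the feature means should be zero up to numerical precision, which confirms that the returned task contains the transformed data.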

Hint 2:

library(mlr3pipelines)
# Create a PipeOp object that applies one-hot encoding
poe = po(...)
# Train the PipeOp on the input task to obtain the encoded task
encoded_task = poe$train(input = ...)$output
str(...$data())

Exercise 2: Create a Simple ML Pipeline (with One-Hot Encoding)

Some learners, such as the xgboost learner, cannot handle categorical features and give an error message when applied to a task that contains them:

library(mlr3verse)
lrnxg = lrn("classif.xgboost")
lrnxg$train(task)
## Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered
lrnxg$predict(task)
## Error: <TaskClassif:german_credit> has the following unsupported feature types: factor, ordered
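Instead of waiting for the error, the mismatch can be checked up front by comparing the feature types a learner supports with those present in the task. This is a minimal sketch, assuming the learner's feature_types field lists its supported types; the name unsupported is chosen here for illustration.

```r
library(mlr3verse)

task = tsk("german_credit")
lrnxg = lrn("classif.xgboost")

# Feature types the learner can handle
lrnxg$feature_types

# Feature types that occur in the task but are not supported by the learner
unsupported = setdiff(unique(task$feature_types$type), lrnxg$feature_types)
unsupported
```

The unsupported types should be the same ones named in the error message above.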

Combine the xgboost learner with a preprocessing step that applies one-hot encoding to create an ML pipeline that first converts all categorical features to 0-1 indicator variables and then applies the xgboost learner. Train the ML pipeline on the german_credit task and make predictions on the training data.

Hint 1:

You can create a Graph that combines a PipeOp object with a learner object (or further PipeOp objects) by concatenating them with the %>>% operator. The Graph contains all information of a sequential ML pipeline. Convert the Graph into a GraphLearner to run the whole ML pipeline like a regular learner object, which we can train, predict with, resample, and benchmark as we have learned. See also the help page ?GraphLearner.
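The general pattern can be sketched with a different preprocessing step and learner, so that the exercise itself stays open. The example below assumes the scale PipeOp, the classif.rpart learner and the built-in iris task; all three are illustrative choices and not part of the exercise.

```r
library(mlr3)
library(mlr3pipelines)

task_iris = tsk("iris")  # example task, not the one from the exercise

# Concatenate a preprocessing PipeOp and a learner into a Graph
graph = po("scale") %>>% lrn("classif.rpart")

# Wrap the Graph so that it behaves like a regular learner
glrn = as_learner(graph)

# Train and predict exactly as with any other learner
glrn$train(task_iris)
pred = glrn$predict(task_iris)
acc = pred$score(msr("classif.acc"))
acc
```

Since the predictions are made on the training data, the accuracy is an optimistic estimate; resample() or benchmark() can be applied to glrn in the same way as to any other learner.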

Hint 2:

library(mlr3verse)
lrnxg = lrn("classif.xgboost")
poe = po(...)
graph = ...

glrn = as_learner(...) # Alternative: glrn = GraphLearner$new(...)
...$train(...)
...$predict(...)

Summary

We learned how to apply preprocessing steps such as factor encoding directly to a task. We also saw how to create a GraphLearner, which runs an ML pipeline on a task: it first applies all preprocessing steps defined in the Graph and then trains a learner on the preprocessed task.