library(mlr3verse) # All the mlr3 things
lgr::get_logger("mlr3")$set_threshold("error")

# Spam Task setup
spam_task <- tsk("spam")
Feature Selection
Goals of this part:
- Introduce feature selection
- Introduce the auto_fselector, analogous to auto_tuner
1 Feature Selection
There is a lot more to cover than we have time for here, see e.g. the feature selection chapter of the mlr3book.
Selecting features with {mlr3} is similar to parameter tuning: we need to set a budget (e.g. 20 evaluations like before) and a criterion (like the AUC) together with a resampling strategy (here holdout, for simplicity).
The feature selection instance defines our search:
fselect_instance = fsi(
  task = spam_task,
  learner = lrn("classif.rpart", predict_type = "prob"),
  resampling = rsmp("holdout"),
  measure = msr("classif.auc"),
  terminator = trm("evals", n_evals = 20)
)
fselect_instance
#> <FSelectInstanceBatchSingleCrit>
#> * State: Not optimized
#> * Objective: <ObjectiveFSelectBatch:classif.rpart_on_spam>
#> * Terminator: <TerminatorEvals>
There are multiple feature selection methods available:
- Random Search ("random_search"): Randomly try combinations of features until our budget is exhausted
- Exhaustive Search ("exhaustive_search"): Try all possible subsets of features. Can take a trillion years. Or 10 minutes
- Sequential Search ("sequential"): Forwards- (default) or backwards-selection
- Recursive Feature Elimination ("rfe"): Recursively eliminates features with low $importance score (if the Learner supports it!)
as.data.table(mlr_fselectors)
#> Key: <key>
#> key label
#> <char> <char>
#> 1: design_points Design Points
#> 2: exhaustive_search Exhaustive Search
#> 3: genetic_search Genetic Search
#> 4: random_search Random Search
#> 5: rfe Recursive Feature Elimination
#> 6: rfecv Recursive Feature Elimination
#> 7: sequential Sequential Search
#> 8: shadow_variable_search Shadow Variable Search
#> properties packages
#> <list> <list>
#> 1: dependencies,single-crit,multi-crit mlr3fselect,bbotk
#> 2: single-crit,multi-crit mlr3fselect
#> 3: single-crit mlr3fselect,genalg
#> 4: single-crit,multi-crit mlr3fselect
#> 5: single-crit,requires_model mlr3fselect
#> 6: single-crit,requires_model mlr3fselect
#> 7: single-crit mlr3fselect
#> 8: single-crit mlr3fselect
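Each of these can be constructed with fs() and configured through its own control parameters. A few sketches (parameter names as documented in {mlr3fselect}; treat them as assumptions and check the docs before relying on them):
# Limit exhaustive search to subsets of at most 3 features
fs("exhaustive_search", max_features = 3)

# Backwards- instead of the default forwards-selection
fs("sequential", strategy = "sbs")

# RFE: drop a fraction of low-importance features per step until 10 remain
fs("rfe", n_features = 10, feature_fraction = 0.5)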
As you might be able to imagine, doing an exhaustive search is often not feasible when we're working with a lot of features: every feature can either be included or excluded, so a dataset with just 10 features already yields 2^10 = 1024 subsets to evaluate. You can imagine how feasible that approach would be for genome-wide studies with thousands of variables.
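To put rough numbers on that growth (plain arithmetic, nothing mlr3-specific):
2^10 # 1024 candidate subsets for 10 features
2^57 # roughly 1.4e17 subsets for the 57 features of the spam task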
Random search it is, then!
fselector <- fs("random_search")
fselector$optimize(fselect_instance)
#> address addresses all business capitalAve capitalLong capitalTotal
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: FALSE TRUE TRUE TRUE TRUE TRUE TRUE
#> charDollar charExclamation charHash charRoundbracket charSemicolon
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: FALSE FALSE FALSE TRUE FALSE
#> charSquarebracket conference credit cs data direct edu email
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
#> font free george hp hpl internet lab labs mail make
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
#> meeting money num000 num1999 num3d num415 num650 num85 num857 order
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE FALSE
#> original our over parts people pm project re receive remove
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE
#> report table technology telnet will you your
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE TRUE TRUE TRUE FALSE TRUE
#> features n_features
#> <list> <int>
#> 1: addresses,all,business,capitalAve,capitalLong,capitalTotal,... 36
#> classif.auc
#> <num>
#> 1: 0.9202298
Here we have picked a selection strategy (ultimately also just an optimization problem) and used it on our selection problem.
We can look at the results, much like we did with tuning results:
fselect_instance$result_feature_set
#> [1] "addresses" "all" "business"
#> [4] "capitalAve" "capitalLong" "capitalTotal"
#> [7] "charRoundbracket" "charSquarebracket" "data"
#> [10] "edu" "email" "font"
#> [13] "free" "george" "hp"
#> [16] "internet" "mail" "make"
#> [19] "meeting" "money" "num415"
#> [22] "num650" "num85" "original"
#> [25] "our" "over" "parts"
#> [28] "pm" "receive" "remove"
#> [31] "report" "table" "technology"
#> [34] "telnet" "will" "your"
fselect_instance$result_y
#> classif.auc
#> 0.9202298
We can also look at the (somewhat unwieldy) archive, which shows us all of the feature combinations we tried out, where TRUE indicates a feature that was included in this particular evaluation and FALSE one that was omitted.
as.data.table(fselect_instance$archive)[1:5, ]
#> address addresses all business capitalAve capitalLong capitalTotal
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> 2: FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> 3: TRUE FALSE TRUE FALSE TRUE FALSE TRUE
#> 4: TRUE TRUE FALSE FALSE TRUE TRUE TRUE
#> 5: FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> charDollar charExclamation charHash charRoundbracket charSemicolon
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE TRUE TRUE TRUE
#> 2: TRUE FALSE FALSE TRUE FALSE
#> 3: FALSE FALSE TRUE FALSE TRUE
#> 4: FALSE FALSE TRUE FALSE FALSE
#> 5: FALSE FALSE FALSE FALSE FALSE
#> charSquarebracket conference credit cs data direct edu email
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> 2: FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
#> 3: TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE
#> 4: TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
#> 5: FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#> font free george hp hpl internet lab labs mail make
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
#> 2: FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
#> 3: FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
#> 4: FALSE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
#> 5: FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
#> meeting money num000 num1999 num3d num415 num650 num85 num857 order
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> 2: FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
#> 3: TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE
#> 4: FALSE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
#> 5: FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
#> original our over parts people pm project re receive remove
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1: TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
#> 2: FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
#> 3: FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
#> 4: FALSE TRUE TRUE FALSE TRUE FALSE TRUE TRUE TRUE FALSE
#> 5: FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> report table technology telnet will you your classif.auc
#> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <num>
#> 1: TRUE TRUE TRUE TRUE TRUE FALSE TRUE 0.9054711
#> 2: FALSE FALSE FALSE FALSE TRUE FALSE FALSE 0.9002972
#> 3: FALSE FALSE TRUE TRUE FALSE FALSE FALSE 0.8629631
#> 4: TRUE TRUE TRUE FALSE TRUE FALSE TRUE 0.8813560
#> 5: FALSE FALSE FALSE TRUE FALSE TRUE FALSE 0.8194007
#> runtime_learners timestamp batch_nr warnings errors
#> <num> <POSc> <int> <int> <int>
#> 1: 0.032 2025-07-15 19:21:04 1 0 0
#> 2: 0.010 2025-07-15 19:21:04 1 0 0
#> 3: 0.021 2025-07-15 19:21:04 1 0 0
#> 4: 0.018 2025-07-15 19:21:04 1 0 0
#> 5: 0.009 2025-07-15 19:21:04 1 0 0
#> features
#> <list>
#> 1: address,addresses,all,business,capitalAve,capitalLong,...
#> 2: charDollar,charRoundbracket,conference,email,george,hp,...
#> 3: address,all,capitalAve,capitalTotal,charHash,charSemicolon,...
#> 4: address,addresses,capitalAve,capitalLong,capitalTotal,charHash,...
#> 5: cs,hp,lab,labs,money,num3d,...
#> n_features resample_result
#> <list> <list>
#> 1: 54 <ResampleResult>
#> 2: 10 <ResampleResult>
#> 3: 25 <ResampleResult>
#> 4: 32 <ResampleResult>
#> 5: 9 <ResampleResult>
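Since the archive converts to a plain data.table, we can also summarize it ourselves, e.g. ranking the evaluated feature subsets by AUC (a sketch using only the columns shown above):
archive <- as.data.table(fselect_instance$archive)
archive[order(-classif.auc), .(n_features, classif.auc)][1:5, ]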
Similar to the auto_tuner we used for parameter tuning, there's also an auto_fselector which basically works the same way, giving us a "self-tuning" learner as a result:
fselected_rpart <- auto_fselector(
  learner = lrn("classif.rpart", predict_type = "prob"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 20),
  fselector = fs("random_search")
)
fselected_rpart
#> <AutoFSelector:classif.rpart.fselector>
#> * Model: list
#> * Packages: mlr3, mlr3fselect, rpart
#> * Predict Type: prob
#> * Feature Types: logical, integer, numeric, factor, ordered
#> * Properties: importance, missings, multiclass, selected_features,
#> twoclass, weights
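Once constructed, the AutoFSelector can be trained and used like any other learner. A minimal sketch, not run here (assuming the fselect_result field, the analogue of the auto_tuner's tuning_result; check the docs if the field name differs):
split <- partition(spam_task) # simple train/test split
fselected_rpart$train(spam_task, row_ids = split$train)
fselected_rpart$fselect_result$features # feature set chosen during training
fselected_rpart$predict(spam_task, row_ids = split$test)$score(msr("classif.auc"))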
And of course it's worth comparing our variable-selected learner with a learner that uses all variables, just to make sure we're not wasting our time:
design <- benchmark_grid(
  tasks = spam_task,
  learners = list(
    fselected_rpart,
    lrn("classif.rpart", predict_type = "prob")
  ),
  resamplings = rsmp("cv", folds = 3)
)

bmr <- benchmark(design)
bmr$aggregate(msr("classif.auc"))
#> nr task_id learner_id resampling_id iters classif.auc
#> <int> <char> <char> <char> <int> <num>
#> 1: 1 spam classif.rpart.fselector cv 3 0.8923973
#> 2: 2 spam classif.rpart cv 3 0.8981915
#> Hidden columns: resample_result
Of course this is essentially another form of tuning, and doing feature selection with untuned learners is not going to give you the best possible performance in each iteration, but it gives you a good set of features to start your actual hyperparameter tuning with.
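For example, one way to carry the result forward into tuning is to reduce the task to the selected features (a sketch using the Task's $select() method; the feature count matches the result above):
# Keep only the features found by the selection run, then tune on the reduced task
task_reduced <- spam_task$clone()
task_reduced$select(fselect_instance$result_feature_set)
length(task_reduced$feature_names) # 36 instead of the original 57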
1.1 Your turn! (For some other time)
- Try out the bike sharing task (tsk("bike_sharing"))
- Read the docs to see the meaning of each feature
- Try out different feature selection approaches!
Note that this task has a few more observations, so it’s going to take a bit longer.
We don’t want to spend the in-person session staring overheating laptops, so you can try this out in your own time!