Feature Selection

Last Updated

July 16, 2025

library(mlr3verse) # All the mlr3 things
lgr::get_logger("mlr3")$set_threshold("error") # Only log errors, silence info/progress output

# Spam Task setup
spam_task <- tsk("spam")

Goals of this part:

  1. Introduce feature selection
  2. Introduce the auto_fselector analogous to auto_tuner

1 Feature Selection

There is a lot more to cover than we have time for here; see, for example, the feature selection chapter of the mlr3 book.

Selecting features with {mlr3} is similar to parameter tuning: we need a budget (e.g. 20 evaluations, as before), a performance criterion (like the AUC), and a resampling strategy (here holdout, for simplicity).

The feature selection instance defines our search:

fselect_instance = fsi(
  task = spam_task,
  learner = lrn("classif.rpart", predict_type = "prob"),
  resampling = rsmp("holdout"),
  measure = msr("classif.auc"),
  terminator = trm("evals", n_evals = 20)
)

fselect_instance
#> <FSelectInstanceBatchSingleCrit>
#> * State:  Not optimized
#> * Objective: <ObjectiveFSelectBatch:classif.rpart_on_spam>
#> * Terminator: <TerminatorEvals>

There are multiple feature selection methods available:

  • Random Search ("random_search"): Randomly try combinations of features until our budget is exhausted
  • Exhaustive Search ("exhaustive_search"): Try all possible subsets of features. Can take a trillion years. Or 10 minutes.
  • Sequential Search ("sequential"): Forwards- (default) or backwards-selection
  • Recursive Feature Elimination ("rfe"): Recursively eliminates features with low $importance score (if the Learner supports it!); a sketch follows after the table below

as.data.table(mlr_fselectors)
#> Key: <key>
#>                       key                         label
#>                    <char>                        <char>
#> 1:          design_points                 Design Points
#> 2:      exhaustive_search             Exhaustive Search
#> 3:         genetic_search                Genetic Search
#> 4:          random_search                 Random Search
#> 5:                    rfe Recursive Feature Elimination
#> 6:                  rfecv Recursive Feature Elimination
#> 7:             sequential             Sequential Search
#> 8: shadow_variable_search        Shadow Variable Search
#>                             properties           packages
#>                                 <list>             <list>
#> 1: dependencies,single-crit,multi-crit  mlr3fselect,bbotk
#> 2:              single-crit,multi-crit        mlr3fselect
#> 3:                         single-crit mlr3fselect,genalg
#> 4:              single-crit,multi-crit        mlr3fselect
#> 5:          single-crit,requires_model        mlr3fselect
#> 6:          single-crit,requires_model        mlr3fselect
#> 7:                         single-crit        mlr3fselect
#> 8:                         single-crit        mlr3fselect
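
Each of these is constructed with fs(), just like the random search we use below. As a rough sketch only (the n_features value is a placeholder; the rfe settings follow the mlr3fselect documentation), recursive feature elimination with the same decision tree, which does report $importance, could look like this:

# RFE repeatedly drops the least important features, so it needs a Learner
# with the "importance" property -- classif.rpart qualifies.
rfe_instance <- fsi(
  task = spam_task,
  learner = lrn("classif.rpart", predict_type = "prob"),
  resampling = rsmp("holdout"),
  measure = msr("classif.auc"),
  # RFE stops on its own once n_features is reached, so no budget is needed
  terminator = trm("none")
)

# Keep eliminating until 10 features are left (placeholder value)
fs("rfe", n_features = 10)$optimize(rfe_instance)
rfe_instance$result_feature_set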

As you might imagine, an exhaustive search is often not feasible when we’re working with many features. For a dataset with just 10 features, examining every possible subset already means over 1000 candidate models to evaluate. You can imagine how feasible that approach would be for genome-wide studies with thousands of variables.
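
To put numbers on that: with p features there are 2^p - 1 non-empty subsets, so the number of candidate models explodes quickly:

# Number of non-empty feature subsets for p features is 2^p - 1
2^10 - 1                               # 10 features: 1023 subsets
2^length(spam_task$feature_names) - 1  # spam's 57 features: roughly 1.4e17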

Random search it is, then!

fselector <- fs("random_search")

fselector$optimize(fselect_instance)
#>    address addresses    all business capitalAve capitalLong capitalTotal
#>     <lgcl>    <lgcl> <lgcl>   <lgcl>     <lgcl>      <lgcl>       <lgcl>
#> 1:   FALSE      TRUE   TRUE     TRUE       TRUE        TRUE         TRUE
#>    charDollar charExclamation charHash charRoundbracket charSemicolon
#>        <lgcl>          <lgcl>   <lgcl>           <lgcl>        <lgcl>
#> 1:      FALSE           FALSE    FALSE             TRUE         FALSE
#>    charSquarebracket conference credit     cs   data direct    edu  email
#>               <lgcl>     <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:              TRUE      FALSE  FALSE  FALSE   TRUE  FALSE   TRUE   TRUE
#>      font   free george     hp    hpl internet    lab   labs   mail   make
#>    <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>   <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:   TRUE   TRUE   TRUE   TRUE  FALSE     TRUE  FALSE  FALSE   TRUE   TRUE
#>    meeting  money num000 num1999  num3d num415 num650  num85 num857  order
#>     <lgcl> <lgcl> <lgcl>  <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:    TRUE   TRUE  FALSE   FALSE  FALSE   TRUE   TRUE   TRUE  FALSE  FALSE
#>    original    our   over  parts people     pm project     re receive remove
#>      <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>  <lgcl> <lgcl>  <lgcl> <lgcl>
#> 1:     TRUE   TRUE   TRUE   TRUE  FALSE   TRUE   FALSE  FALSE    TRUE   TRUE
#>    report  table technology telnet   will    you   your
#>    <lgcl> <lgcl>     <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:   TRUE   TRUE       TRUE   TRUE   TRUE  FALSE   TRUE
#>                                                          features n_features
#>                                                            <list>      <int>
#> 1: addresses,all,business,capitalAve,capitalLong,capitalTotal,...         36
#>    classif.auc
#>          <num>
#> 1:   0.9202298

Here we have picked a selection strategy (ultimately also just an optimization problem) and applied it to our selection problem.

We can look at the results, also similar to tuning results:

fselect_instance$result_feature_set
#>  [1] "addresses"         "all"               "business"         
#>  [4] "capitalAve"        "capitalLong"       "capitalTotal"     
#>  [7] "charRoundbracket"  "charSquarebracket" "data"             
#> [10] "edu"               "email"             "font"             
#> [13] "free"              "george"            "hp"               
#> [16] "internet"          "mail"              "make"             
#> [19] "meeting"           "money"             "num415"           
#> [22] "num650"            "num85"             "original"         
#> [25] "our"               "over"              "parts"            
#> [28] "pm"                "receive"           "remove"           
#> [31] "report"            "table"             "technology"       
#> [34] "telnet"            "will"              "your"
fselect_instance$result_y
#> classif.auc 
#>   0.9202298

We can also look at the (somewhat unwieldy) archive, which shows us all of the feature combinations we tried out, where TRUE marks features included in a particular evaluation and FALSE those omitted.

as.data.table(fselect_instance$archive)[1:5, ]
#>    address addresses    all business capitalAve capitalLong capitalTotal
#>     <lgcl>    <lgcl> <lgcl>   <lgcl>     <lgcl>      <lgcl>       <lgcl>
#> 1:    TRUE      TRUE   TRUE     TRUE       TRUE        TRUE         TRUE
#> 2:   FALSE     FALSE  FALSE    FALSE      FALSE       FALSE        FALSE
#> 3:    TRUE     FALSE   TRUE    FALSE       TRUE       FALSE         TRUE
#> 4:    TRUE      TRUE  FALSE    FALSE       TRUE        TRUE         TRUE
#> 5:   FALSE     FALSE  FALSE    FALSE      FALSE       FALSE        FALSE
#>    charDollar charExclamation charHash charRoundbracket charSemicolon
#>        <lgcl>          <lgcl>   <lgcl>           <lgcl>        <lgcl>
#> 1:       TRUE            TRUE     TRUE             TRUE          TRUE
#> 2:       TRUE           FALSE    FALSE             TRUE         FALSE
#> 3:      FALSE           FALSE     TRUE            FALSE          TRUE
#> 4:      FALSE           FALSE     TRUE            FALSE         FALSE
#> 5:      FALSE           FALSE    FALSE            FALSE         FALSE
#>    charSquarebracket conference credit     cs   data direct    edu  email
#>               <lgcl>     <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:              TRUE       TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#> 2:             FALSE       TRUE  FALSE  FALSE  FALSE  FALSE  FALSE   TRUE
#> 3:              TRUE       TRUE  FALSE   TRUE   TRUE  FALSE  FALSE   TRUE
#> 4:              TRUE       TRUE  FALSE   TRUE  FALSE  FALSE   TRUE  FALSE
#> 5:             FALSE      FALSE  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE
#>      font   free george     hp    hpl internet    lab   labs   mail   make
#>    <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>   <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:   TRUE   TRUE   TRUE   TRUE   TRUE     TRUE   TRUE  FALSE   TRUE   TRUE
#> 2:  FALSE  FALSE   TRUE   TRUE  FALSE    FALSE  FALSE  FALSE  FALSE  FALSE
#> 3:  FALSE  FALSE  FALSE   TRUE  FALSE     TRUE  FALSE   TRUE  FALSE  FALSE
#> 4:  FALSE   TRUE  FALSE   TRUE   TRUE     TRUE  FALSE  FALSE  FALSE   TRUE
#> 5:  FALSE  FALSE  FALSE   TRUE  FALSE    FALSE   TRUE   TRUE  FALSE  FALSE
#>    meeting  money num000 num1999  num3d num415 num650  num85 num857  order
#>     <lgcl> <lgcl> <lgcl>  <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>
#> 1:    TRUE   TRUE   TRUE    TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE
#> 2:   FALSE  FALSE  FALSE   FALSE  FALSE   TRUE  FALSE  FALSE  FALSE  FALSE
#> 3:    TRUE   TRUE  FALSE   FALSE  FALSE   TRUE   TRUE  FALSE   TRUE   TRUE
#> 4:   FALSE   TRUE  FALSE    TRUE   TRUE  FALSE   TRUE   TRUE  FALSE   TRUE
#> 5:   FALSE   TRUE  FALSE   FALSE   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE
#>    original    our   over  parts people     pm project     re receive remove
#>      <lgcl> <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>  <lgcl> <lgcl>  <lgcl> <lgcl>
#> 1:     TRUE   TRUE   TRUE   TRUE   TRUE  FALSE    TRUE   TRUE    TRUE   TRUE
#> 2:    FALSE  FALSE  FALSE  FALSE   TRUE  FALSE   FALSE   TRUE   FALSE  FALSE
#> 3:    FALSE   TRUE  FALSE  FALSE   TRUE  FALSE   FALSE   TRUE   FALSE  FALSE
#> 4:    FALSE   TRUE   TRUE  FALSE   TRUE  FALSE    TRUE   TRUE    TRUE  FALSE
#> 5:    FALSE   TRUE  FALSE  FALSE  FALSE  FALSE   FALSE  FALSE   FALSE  FALSE
#>    report  table technology telnet   will    you   your classif.auc
#>    <lgcl> <lgcl>     <lgcl> <lgcl> <lgcl> <lgcl> <lgcl>       <num>
#> 1:   TRUE   TRUE       TRUE   TRUE   TRUE  FALSE   TRUE   0.9054711
#> 2:  FALSE  FALSE      FALSE  FALSE   TRUE  FALSE  FALSE   0.9002972
#> 3:  FALSE  FALSE       TRUE   TRUE  FALSE  FALSE  FALSE   0.8629631
#> 4:   TRUE   TRUE       TRUE  FALSE   TRUE  FALSE   TRUE   0.8813560
#> 5:  FALSE  FALSE      FALSE   TRUE  FALSE   TRUE  FALSE   0.8194007
#>    runtime_learners           timestamp batch_nr warnings errors
#>               <num>              <POSc>    <int>    <int>  <int>
#> 1:            0.032 2025-07-15 19:21:04        1        0      0
#> 2:            0.010 2025-07-15 19:21:04        1        0      0
#> 3:            0.021 2025-07-15 19:21:04        1        0      0
#> 4:            0.018 2025-07-15 19:21:04        1        0      0
#> 5:            0.009 2025-07-15 19:21:04        1        0      0
#>                                                              features
#>                                                                <list>
#> 1:          address,addresses,all,business,capitalAve,capitalLong,...
#> 2:         charDollar,charRoundbracket,conference,email,george,hp,...
#> 3:     address,all,capitalAve,capitalTotal,charHash,charSemicolon,...
#> 4: address,addresses,capitalAve,capitalLong,capitalTotal,charHash,...
#> 5:                                     cs,hp,lab,labs,money,num3d,...
#>    n_features  resample_result
#>        <list>           <list>
#> 1:         54 <ResampleResult>
#> 2:         10 <ResampleResult>
#> 3:         25 <ResampleResult>
#> 4:         32 <ResampleResult>
#> 5:          9 <ResampleResult>

Similar to the auto_tuner we used for parameter tuning, there’s also an auto_fselector, which works basically the same way and gives us a “self-tuning” learner as a result:

fselected_rpart <- auto_fselector(
  learner = lrn("classif.rpart", predict_type = "prob"),
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  terminator = trm("evals", n_evals = 20),
  fselector = fs("random_search")
)

fselected_rpart
#> <AutoFSelector:classif.rpart.fselector>
#> * Model: list
#> * Packages: mlr3, mlr3fselect, rpart
#> * Predict Type: prob
#> * Feature Types: logical, integer, numeric, factor, ordered
#> * Properties: importance, missings, multiclass, selected_features,
#>   twoclass, weights

And of course it is worth comparing our feature-selected learner against a learner that uses all variables, just to make sure we’re not wasting our time:

design <- benchmark_grid(
  tasks = spam_task,
  learners = list(
    fselected_rpart,
    lrn("classif.rpart", predict_type = "prob")
  ),
  resamplings = rsmp("cv", folds = 3)
)

bmr <- benchmark(design)
bmr$aggregate(msr("classif.auc"))
#>       nr task_id              learner_id resampling_id iters classif.auc
#>    <int>  <char>                  <char>        <char> <int>       <num>
#> 1:     1    spam classif.rpart.fselector            cv     3   0.8923973
#> 2:     2    spam           classif.rpart            cv     3   0.8981915
#> Hidden columns: resample_result

Of course this is essentially another form of tuning. Doing feature selection with untuned learners is not going to give you the best possible performance in each iteration, but it gives you a good set of features to start your actual hyperparameter tuning with, as sketched below.
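
For example, once the AutoFSelector is trained, we could pull out the selected feature set and restrict the task to it before starting the real tuning. A minimal sketch, assuming the selection instance is stored (the default):

# Train the self-selecting learner once on the full task
fselected_rpart$train(spam_task)

# Restrict a copy of the task to the selected features for subsequent tuning
spam_task_selected <- spam_task$clone()
spam_task_selected$select(fselected_rpart$fselect_instance$result_feature_set)
spam_task_selected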

1.1 Your turn! (For some other time)

  • Try out the bike sharing task (tsk("bike_sharing"))
  • Read the docs to see the meaning of each feature
  • Try out different feature selection approaches!

Note that this task has a few more observations, so it’s going to take a bit longer.
We don’t want to spend the in-person session staring at overheating laptops, so you can try this out in your own time!
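
If you want a starting point, a minimal sketch for this regression setting could look like the following (regr.rpart and regr.rmse are just one reasonable choice; the defensive select() is there because, depending on the mlr3data version, the task may contain feature types the tree learner does not support):

bike_task <- tsk("bike_sharing")
bike_learner <- lrn("regr.rpart")

# Drop feature types the learner cannot handle (e.g. date columns)
bike_task$select(
  bike_task$feature_types[type %in% bike_learner$feature_types, id]
)

bike_instance <- fsi(
  task = bike_task,
  learner = bike_learner,
  resampling = rsmp("holdout"),
  measure = msr("regr.rmse"),
  terminator = trm("evals", n_evals = 10)
)

fs("random_search")$optimize(bike_instance)
bike_instance$result_feature_set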
