3.1 Tasks

Learning tasks encapsulate the data set and additional meta information about a machine learning problem, for example the name of the target variable for supervised problems.

3.1.1 Task Types

To manually create a task from a data.frame() or data.table(), you must first determine the task type to select the respective constructor:

  • Classification Task: The target column contains labels (stored as character()/factor()) with only a few distinct values.
    \(\Rightarrow\) TaskClassif
  • Regression Task: Target column is numeric (stored as integer()/double()).
    \(\Rightarrow\) TaskRegr
  • Survival Task: Target is the (right-censored) time to event.
    \(\Rightarrow\) TaskSurv in add-on package mlr3survival
  • Ordinal Regression Task: Target is ordinal.
    \(\Rightarrow\) TaskOrdinal in add-on package mlr3ordinal
  • Cluster Task: You don’t have a target but want to identify similarities in the feature space.
    \(\Rightarrow\) Not yet implemented
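
The decision rule in the list above can be sketched as a small helper. Note that guess_task_type() is a hypothetical function written for illustration and is not part of mlr3. It only inspects the class of the target column and does not check how many distinct values a label column has.

```r
# Hypothetical helper (not part of mlr3): suggest a task type
# from the storage type of the target column.
guess_task_type = function(data, target) {
  y = data[[target]]
  if (is.ordered(y)) {
    # ordered factors must be checked first, they are also factors
    "ordinal"
  } else if (is.factor(y) || is.character(y)) {
    "classif"
  } else if (is.numeric(y)) {
    # covers both integer() and double()
    "regr"
  } else {
    stop("unsupported target type: ", class(y)[1])
  }
}

guess_task_type(iris, "Species")
guess_task_type(mtcars, "mpg")
```

The returned string only names the task type; picking the matching constructor (TaskClassif, TaskRegr, ...) is still left to the caller.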

3.1.2 Task Creation

Let’s assume we want to create a simple regression task using the mtcars data set from the package datasets to predict the column "mpg" (miles per gallon). To keep the output in the following examples compact, we keep only the target and the first two features, "cyl" and "disp".

data("mtcars", package = "datasets")
data = mtcars[, 1:3]
str(data)
## 'data.frame':    32 obs. of  3 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...

Next, we create the task by providing the following information:

  1. id: identifier for the task, used in plots and summaries.
  2. backend: here, we simply provide the data.frame(), which is internally converted to a DataBackendDataTable. For more fine-grained control over how the data is stored internally, we could also construct a DataBackend manually.
  3. target: name of the target column for the regression problem.

task_mtcars = TaskRegr$new(id = "cars", backend = data, target = "mpg")
print(task_mtcars)
## <TaskRegr:cars> (32 x 3)
## Target: mpg
## Features (2):
## * dbl (2): cyl, disp

The print() method gives a short summary of the task: it has 32 observations and 3 columns, 2 of which are features.
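
The manual backend construction mentioned in the description of the backend argument could look roughly like the sketch below. It assumes that the converter as_data_backend() is exported by the installed mlr3 version. Depending on the version, the original row names may or may not be kept as row ids.

```r
library(mlr3)

data("mtcars", package = "datasets")
data = mtcars[, 1:3]

# Assumption: as_data_backend() is available in your mlr3 version.
# It converts the data.frame into a DataBackend explicitly.
backend = as_data_backend(data)

# Pass the backend object instead of the raw data.frame
task = TaskRegr$new(id = "cars", backend = backend, target = "mpg")
task$nrow
task$feature_names
```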

We can also plot the task using the mlr3viz package:

library(mlr3viz)
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
autoplot(task_mtcars, type = "pairs")
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

3.1.3 Predefined Tasks

mlr3 ships with some predefined machine learning tasks. These are stored in a Dictionary, a simple key-value store, named mlr3::mlr_tasks. We can obtain an overview of all stored tasks by converting the dictionary to a data.table():

as.data.table(mlr_tasks)
##               key task_type            measures nrow
## 1: boston_housing      regr            regr.mse  506
## 2:  german_credit   classif german_credit_costs 1000
## 3:           iris   classif          classif.ce  150
## 4:         mtcars      regr            regr.mse   32
## 5:           pima   classif          classif.ce  768
## 6:          sonar   classif          classif.ce  208
## 7:           spam   classif          classif.ce 4601
## 8:           wine   classif          classif.ce  178
## 9:            zoo   classif          classif.ce  101
##    ncol lgl int dbl chr fct ord
## 1:   19   0   3  13   0   2   0
## 2:   21   0   0   7   0  12   1
## 3:    5   0   0   4   0   0   0
## 4:   11   0   0  10   0   0   0
## 5:    9   0   0   8   0   0   0
## 6:   61   0   0  60   0   0   0
## 7:   58   0   0  57   0   0   0
## 8:   14   0   2  11   0   0   0
## 9:   17  15   1   0   0   0   0

For illustration purposes, we now retrieve the popular iris data set from mlr_tasks as a classification task:

task_iris = mlr_tasks$get("iris")
print(task_iris)
## <TaskClassif:iris> (150 x 5)
## Target: Species
## Features (4):
## * dbl (4): Petal.Length, Petal.Width,
##   Sepal.Length, Sepal.Width
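
Since mlr_tasks is a key-value store, it can also be queried by key before retrieving a task. The following is a minimal sketch. It assumes that the dictionary exposes a keys() method, and that the keys "iris" and "mtcars" from the table above are present in the installed version.

```r
library(mlr3)

# all keys currently stored in the dictionary
keys = mlr_tasks$keys()

# check for a key before retrieving the task
"iris" %in% keys

# retrieve another predefined task by its key
task = mlr_tasks$get("mtcars")
task$target_names
```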

3.1.4 Task API

The task properties and characteristics can be queried using the task’s public member values and methods (see Task). Most of them should be self-explanatory, e.g.,

task_iris = mlr_tasks$get("iris")

# public member values
task_iris$nrow
## [1] 150
task_iris$ncol
## [1] 5
# public member methods
task_iris$head(n = 3)
##    Species Petal.Length Petal.Width Sepal.Length
## 1:  setosa          1.4         0.2          5.1
## 2:  setosa          1.4         0.2          4.9
## 3:  setosa          1.3         0.2          4.7
##    Sepal.Width
## 1:         3.5
## 2:         3.0
## 3:         3.2

3.1.4.1 Retrieve Data

In mlr3, each row (observation) has a unique identifier which can be either integer() or character(). These can be used to select specific rows.

# iris uses integer row_ids
head(task_iris$row_ids)
## [1] 1 2 3 4 5 6
# retrieve data for rows with ids 1, 51, and 101
task_iris$data(rows = c(1, 51, 101))
##       Species Petal.Length Petal.Width Sepal.Length
## 1:     setosa          1.4         0.2          5.1
## 2: versicolor          4.7         1.4          7.0
## 3:  virginica          6.0         2.5          6.3
##    Sepal.Width
## 1:         3.5
## 2:         3.2
## 3:         3.3
# mtcars uses the rownames of the original data set
head(task_mtcars$row_ids)
## [1] "AMC Javelin"        "Cadillac Fleetwood"
## [3] "Camaro Z28"         "Chrysler Imperial" 
## [5] "Datsun 710"         "Dodge Challenger"
# retrieve data for rows with id "Datsun 710"
task_mtcars$data(rows = "Datsun 710")
##     mpg cyl disp
## 1: 22.8   4  108

Note that the method $data() is only an accessor and does not modify the underlying data/task.
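
A short check of this behaviour: the sketch below overwrites a column in the table returned by $data() and then queries the task again. It uses only the accessors shown above.

```r
library(mlr3)

task = mlr_tasks$get("iris")

# $data() returns a new table for the requested rows
d = task$data(rows = 1:3)

# overwrite a column in the returned table only
d$Sepal.Length = 0

# querying the task again still yields the original values
task$data(rows = 1:3)$Sepal.Length
```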

Analogously, each column has an identifier, usually just called the column name. These are stored in the public fields feature_names and target_names:

task_iris$feature_names
## [1] "Petal.Length" "Petal.Width"  "Sepal.Length"
## [4] "Sepal.Width"
task_iris$target_names
## [1] "Species"
# retrieve data for rows 1, 51, and 101 and only select column "Species"
task_iris$data(rows = c(1, 51, 101), cols = "Species")
##       Species
## 1:     setosa
## 2: versicolor
## 3:  virginica
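
The two fields can be passed straight to the cols argument, for example to split the data into a feature table and a target vector. This is a small sketch; the variable names x and y are chosen for illustration only.

```r
library(mlr3)

task = mlr_tasks$get("iris")

# features only, without the target, for the first two rows
x = task$data(rows = 1:2, cols = task$feature_names)
names(x)

# target only, extracted as a plain vector
y = task$data(rows = 1:2, cols = task$target_names)[[1]]
y
```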

To retrieve the complete data set, e.g. for a closer inspection, convert to a data.table():

summary(as.data.table(task_iris))
##        Species    Petal.Length   Petal.Width 
##  setosa    :50   Min.   :1.00   Min.   :0.1  
##  versicolor:50   1st Qu.:1.60   1st Qu.:0.3  
##  virginica :50   Median :4.35   Median :1.3  
##                  Mean   :3.76   Mean   :1.2  
##                  3rd Qu.:5.10   3rd Qu.:1.8  
##                  Max.   :6.90   Max.   :2.5  
##   Sepal.Length   Sepal.Width  
##  Min.   :4.30   Min.   :2.00  
##  1st Qu.:5.10   1st Qu.:2.80  
##  Median :5.80   Median :3.00  
##  Mean   :5.84   Mean   :3.06  
##  3rd Qu.:6.40   3rd Qu.:3.30  
##  Max.   :7.90   Max.   :4.40

3.1.4.2 Roles

It is possible to assign special roles to (subsets of) rows and columns.

For example, the previously constructed mtcars task has the following column roles:

task_mtcars$col_roles
## $feature
## [1] "cyl"  "disp"
## 
## $target
## [1] "mpg"
## 
## $label
## character(0)
## 
## $order
## character(0)
## 
## $groups
## character(0)
## 
## $weights
## character(0)

Now, we want the original rownames() of mtcars to be a regular feature column. Thus, we first pre-process the data.frame and then re-create the task.

library("data.table")
# with `keep.rownames`, data.table stores the row names in an extra column "rn"
data = as.data.table(mtcars[, 1:3], keep.rownames = TRUE)
task = TaskRegr$new(id = "cars", backend = data, target = "mpg")

# we now have integer row_ids
task$row_ids
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
# there is a new "feature" called "rn"
task$feature_names
## [1] "cyl"  "disp" "rn"

The column “rn” is now a regular feature. Because it is a unique string column, most machine learning algorithms cannot process it without some kind of preprocessing. However, we might still want to carry rn around, e.g., to use the row names in plots or to associate outliers with them. We therefore change the role of the row names column rn and remove it from the set of active features.

task$feature_names
## [1] "cyl"  "disp" "rn"
task$set_col_role("rn", new_roles = "label")

# "rn" not listed as feature any more
task$feature_names
## [1] "cyl"  "disp"
# also vanished from "data" and "head"
task$data(rows = 1:2)
##    mpg cyl disp
## 1:  21   6  160
## 2:  21   6  160
task$head(2)
##    mpg cyl disp
## 1:  21   6  160
## 2:  21   6  160

Note that this operation does not copy the underlying data. Changing roles only changes the view on the data, not the data itself.

Just like columns, it is also possible to assign different roles to rows. Rows can have two different roles:

  1. Role "use": Rows that are generally available for model fitting (although they may also be used as test set in resampling). This is the default role.
  2. Role "validation": Rows that are held back (see below). Rows which have missing values in the target column upon task creation are automatically moved to the validation set.

There are several reasons to hold some observations back or treat them differently:

  1. It is often good practice to validate the final model on an external validation set to uncover possible overfitting.
  2. Some observations may be unlabeled, e.g. in data mining cups or Kaggle competitions. These observations cannot be used for training a model, but you can still predict labels.
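
The idea behind row roles can be illustrated without mlr3. The following is a conceptual sketch in base R only and does not show mlr3 internals; the toy data set and the list row_roles are made up for illustration. Rows with a missing target are assigned to "validation", and the data itself is never modified.

```r
# Conceptual sketch only: a fixed data set plus a list of row roles
data = data.frame(
  x = c(1.0, 2.0, 3.0, 4.0),
  y = c("a", "b", NA, "a")  # row 3 is unlabeled
)

# rows with a missing target go to "validation", all others to "use"
row_roles = list(
  use = which(!is.na(data$y)),
  validation = which(is.na(data$y))
)

# the "view" used for model fitting is just an index into the data
train_view = data[row_roles$use, ]

# the unlabeled row can still be used for prediction later
predict_view = data[row_roles$validation, "x", drop = FALSE]
```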

3.1.4.3 Task Mutators

The methods set_col_role() and set_row_role() change the view on the data and can be used to subset the task. For convenience, the method filter() subsets the task based on row ids, and select() subsets the task based on feature names. None of these operations copies the data; they only change the view on it. They do, however, modify the task in place.

task = mlr_tasks$get("iris")
task$select(c("Sepal.Width", "Sepal.Length")) # keep only these features
task$filter(1:3) # keep only these rows
task$head()
##    Species Sepal.Length Sepal.Width
## 1:  setosa          5.1         3.5
## 2:  setosa          4.9         3.0
## 3:  setosa          4.7         3.2
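
Because the mutators work in place, the original task is changed as well. If the unmodified task is still needed, one option is to clone it first. The sketch below relies on the generic clone() method that R6 objects provide; the variable name small is chosen for illustration.

```r
library(mlr3)

task = mlr_tasks$get("iris")

# work on a clone so that the original task keeps all rows
small = task$clone(deep = TRUE)
small$filter(1:3)

# the original task is unchanged, the clone is subsetted
task$nrow
small$nrow
```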

Additionally, the methods rbind() and cbind() add extra rows and columns to a task, respectively. The method replace_features() is a convenience wrapper around select() and cbind(). Again, the original data set stored in the original mlr3::DataBackend is not altered in any way.

task$cbind(data.table(foo = letters[1:3])) # add column foo
task$head()
##    Species Sepal.Length Sepal.Width foo
## 1:  setosa          5.1         3.5   a
## 2:  setosa          4.9         3.0   b
## 3:  setosa          4.7         3.2   c