3  Tasks

Tasks are objects that contain the (usually tabular) data and additional meta-data to define a machine learning problem. The meta-data is, for example, the name of the target variable for supervised machine learning problems, or the type of the dataset (e.g. a spatial or survival task). This information is used by specific operations that can be performed on a task.

3.1 Task Types

To create a task from a data.frame(), data.table() or Matrix(), you first need to select the right task type:

  • Classification Task: The target is a label (stored as character or factor) with only relatively few distinct values → TaskClassif.

  • Regression Task: The target is a numeric quantity (stored as integer or numeric) → TaskRegr.

  • Survival Task: The target is the (right-censored) time to an event. More censoring types are currently in development → mlr3proba::TaskSurv in add-on package mlr3proba.

  • Density Task: An unsupervised task to estimate the density → mlr3proba::TaskDens in add-on package mlr3proba.

  • Cluster Task: An unsupervised task type; there is no target and the aim is to identify similar groups within the feature space → mlr3cluster::TaskClust in add-on package mlr3cluster.

  • Spatial Task: Observations in the task have spatio-temporal information (e.g. coordinates) → mlr3spatiotempcv::TaskRegrST or mlr3spatiotempcv::TaskClassifST in add-on package mlr3spatiotempcv.

  • Ordinal Regression Task: The target is ordinal → TaskOrdinal in add-on package mlr3ordinal (still in development).
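As a rough illustration of how the task type follows from the target column, the sketch below converts two datasets that ship with R. The choice of iris and mtcars is only for illustration; the converters as_task_classif() and as_task_regr() come from mlr3.

```r
library("mlr3")

# factor target with few distinct values -> classification task
task_c = as_task_classif(iris, target = "Species")

# numeric target -> regression task
task_r = as_task_regr(mtcars, target = "mpg")

# the task type is stored in the task itself
task_c$task_type
task_r$task_type
```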

3.2 Task Creation

As an example, we will create a regression task using the mtcars data set from package datasets. It contains characteristics for different types of cars, along with their fuel consumption. We predict the numeric target variable "mpg" (miles per gallon). We only consider the first two features in the dataset for brevity.

First, we load and prepare the data and then print its structure to get a better idea of what it looks like.

data("mtcars", package = "datasets")
data = mtcars[, 1:3]
str(data)
'data.frame':   32 obs. of  3 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...

Next, we create a regression task, i.e. we construct a new instance of the R6 class TaskRegr. Formally, the intended way to initialize an R6 object is to call the constructor TaskRegr$new(). Here instead, we are calling the converter as_task_regr() to convert our data.frame() stored in the object data to a task and provide the following information:

  1. x: Object to convert. Works for rectangular data formats such as data.frame(), data.table(), or tibble(). Internally, the data is converted and stored in an abstract DataBackend. This allows connecting to out-of-memory storage systems like SQL servers via the extension package mlr3db.
  2. target: The name of the prediction target column for the regression problem, here miles per gallon ("mpg").
  3. id (optional): An arbitrary identifier for the task, used in plots and summaries. If not provided, the deparsed name of x will be used.

library("mlr3")

task_mtcars = as_task_regr(data, target = "mpg", id = "cars")
print(task_mtcars)
<TaskRegr:cars> (32 x 3)
* Target: mpg
* Properties: -
* Features (2):
  - dbl (2): cyl, disp

The print() method gives a short summary of the task: It has 32 observations and 3 columns, of which 2 are features stored in double-precision floating point format.
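For comparison, the same task could presumably also be built with the R6 constructor mentioned above. This is a minimal sketch, assuming that TaskRegr$new() accepts a data.frame() as backend and converts it internally; the variable names are chosen for illustration only.

```r
library("mlr3")

data("mtcars", package = "datasets")
data = mtcars[, 1:3]

# converter, as used in the text
task_via_converter = as_task_regr(data, target = "mpg", id = "cars")

# constructor; here the id has to be given explicitly
task_via_constructor = TaskRegr$new(id = "cars", backend = data, target = "mpg")

# both tasks should expose the same features and target
identical(task_via_converter$feature_names, task_via_constructor$feature_names)
identical(task_via_converter$target_names, task_via_constructor$target_names)
```

The converter is usually the more convenient choice because it can derive the id from the name of the object being converted.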

We can also plot the task using the mlr3viz package, which gives a graphical summary of its properties:

library("mlr3viz")
autoplot(task_mtcars, type = "pairs")

Tip

Instead of loading all the extension packages individually, it is often more convenient to load the mlr3verse package. mlr3verse imports the namespace of most mlr3 packages and re-exports functions that are used for common machine learning and data science tasks.

3.3 Predefined Tasks

mlr3 includes a few predefined machine learning tasks. All tasks are stored in an R6 Dictionary (a key-value store) named mlr_tasks. Printing it gives the keys (the names of the datasets):

mlr_tasks
<DictionaryTask> with 11 stored values
Keys: boston_housing, breast_cancer, german_credit, iris, mtcars,
  penguins, pima, sonar, spam, wine, zoo

We can get a more informative summary of the example tasks by converting the dictionary to a data.table() object:

as.data.table(mlr_tasks)
               key                   label task_type nrow ncol properties lgl
 1: boston_housing   Boston Housing Prices      regr  506   19              0
 2:  breast_cancer Wisconsin Breast Cancer   classif  683   10   twoclass   0
 3:  german_credit           German Credit   classif 1000   21   twoclass   0
 4:           iris            Iris Flowers   classif  150    5 multiclass   0
 5:         mtcars            Motor Trends      regr   32   11              0
 6:       penguins         Palmer Penguins   classif  344    8 multiclass   0
 7:           pima    Pima Indian Diabetes   classif  768    9   twoclass   0
 8:          sonar  Sonar: Mines vs. Rocks   classif  208   61   twoclass   0
 9:           spam       HP Spam Detection   classif 4601   58   twoclass   0
10:           wine            Wine Regions   classif  178   14 multiclass   0
11:            zoo             Zoo Animals   classif  101   17 multiclass  15
    int dbl chr fct ord pxc
 1:   3  13   0   2   0   0
 2:   0   0   0   0   9   0
 3:   3   0   0  14   3   0
 4:   0   4   0   0   0   0
 5:   0  10   0   0   0   0
 6:   3   2   0   2   0   0
 7:   0   8   0   0   0   0
 8:   0  60   0   0   0   0
 9:   0  57   0   0   0   0
10:   2  11   0   0   0   0
11:   1   0   0   0   0   0

Above, the columns "lgl" (logical), "int" (integer), "dbl" (double), "chr" (character), "fct" (factor), "ord" (ordered factor) and "pxc" (POSIXct time) show the number of features in the task of the respective type.
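Because the result is an ordinary data.table(), it can be filtered like any other table. The following sketch lists only the regression tasks; the column names are taken from the output above, and data.table is loaded explicitly so that its subsetting syntax is available.

```r
library("mlr3")
library("data.table")

tab = as.data.table(mlr_tasks)

# keep only regression tasks and a few descriptive columns
regr_tasks = tab[task_type == "regr", .(key, task_type, nrow, ncol)]
print(regr_tasks)
```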

To get a task from the dictionary, use the $get() method of the mlr_tasks dictionary and assign the return value to a new variable. As getting a task from a dictionary is a very common operation, mlr3 provides the shortcut function tsk(). Here, we retrieve the Palmer penguins classification task, which is provided by the package palmerpenguins:

task_penguins = tsk("penguins")
print(task_penguins)
<TaskClassif:penguins> (344 x 8): Palmer Penguins
* Target: species
* Properties: multiclass
* Features (7):
  - int (3): body_mass, flipper_length, year
  - dbl (2): bill_depth, bill_length
  - fct (2): island, sex

Note

Loading extension packages can add entries to dictionaries such as mlr_tasks. For example, mlr3data adds some more example and toy tasks for regression and classification, and mlr3proba adds survival and density estimation tasks.
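The two ways of retrieving a task described above can be compared side by side. This is a small sketch, assuming the palmerpenguins package is installed; the variable names are illustrative.

```r
library("mlr3")

# explicit dictionary access
task_a = mlr_tasks$get("penguins")

# shortcut function
task_b = tsk("penguins")

# both calls should yield tasks with the same features
identical(task_a$feature_names, task_b$feature_names)

# the keys of the dictionary are also available as a character vector
head(mlr_tasks$keys())
```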

To get more information about a particular task, it is easiest to use the $help() method that all mlr3 objects come with:

task_penguins$help()

Alternatively, the corresponding man page can be found under mlr_tasks_[id], e.g. mlr_tasks_penguins:

help("mlr_tasks_penguins")

3.4 Task API

All properties and characteristics of tasks can be queried using the task’s public fields and methods (see Task). Methods can also be used to change the stored data and the behavior of the task.

3.4.1 Retrieving Data

The Task object primarily represents a tabular dataset, combined with meta-data about which columns of that data should be used to predict which other columns in what way, as well as some more information about column data types.

Various fields can be used to retrieve meta-data about a task. The dimensions can, for example, be retrieved using $nrow and $ncol:

task_mtcars$nrow
[1] 32
task_mtcars$ncol
[1] 3

The names of the feature and target columns are stored in the $feature_names and $target_names fields, respectively. Here, “target” refers to the variable we want to predict and “feature” to the predictors for the task.

task_mtcars$feature_names
[1] "cyl"  "disp"
task_mtcars$target_names
[1] "mpg"

For the most common tasks, regression and classification, the target will only be a single name. Tasks with other task types, such as for survival estimation, may have more than one target column:

library("mlr3proba")
tsk("unemployment")$target_names
[1] "spell"   "censor1"

While the columns of a task have unique character-valued names, their rows are identified by unique natural numbers, their row IDs. They can be accessed through the $row_ids field:

head(task_mtcars$row_ids)
[1] 1 2 3 4 5 6

Warning

Although the row IDs are typically just the sequence from 1 to nrow(data), they are only guaranteed to be unique natural numbers. It is possible that they do not start at 1, that they are not increasing by 1 each, or that they are not even in increasing order. Keep that in mind, especially if you work with data stored in a real database management system (see backends).
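A short sketch of why this matters: after restricting a task to a few rows, the remaining row IDs no longer start at 1, and the rows argument of $data() refers to IDs rather than positions. The example uses the predefined mtcars task; the chosen IDs are arbitrary.

```r
library("mlr3")

task = tsk("mtcars")
task$filter(c(3, 7, 10))  # keep the rows with IDs 3, 7 and 10

# the remaining IDs are not the sequence 1, 2, 3
task$row_ids

# rows = 7 addresses row ID 7, not the seventh row of the filtered task
task$data(rows = 7, cols = "mpg")
```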

The data contained in a task can be accessed through $data(), which returns a data.table object. It has optional rows and cols arguments to specify subsets of the data to retrieve. When a database backend is used, then this avoids loading unnecessary data into memory, making it more efficient than retrieving the entire data first and then subsetting it using [<rows>, <cols>].

task_mtcars$data()
     mpg cyl  disp
 1: 21.0   6 160.0
 2: 21.0   6 160.0
 3: 22.8   4 108.0
 4: 21.4   6 258.0
 5: 18.7   8 360.0
 6: 18.1   6 225.0
 7: 14.3   8 360.0
 8: 24.4   4 146.7
 9: 22.8   4 140.8
10: 19.2   6 167.6
11: 17.8   6 167.6
12: 16.4   8 275.8
13: 17.3   8 275.8
14: 15.2   8 275.8
15: 10.4   8 472.0
16: 10.4   8 460.0
17: 14.7   8 440.0
18: 32.4   4  78.7
19: 30.4   4  75.7
20: 33.9   4  71.1
21: 21.5   4 120.1
22: 15.5   8 318.0
23: 15.2   8 304.0
24: 13.3   8 350.0
25: 19.2   8 400.0
26: 27.3   4  79.0
27: 26.0   4 120.3
28: 30.4   4  95.1
29: 15.8   8 351.0
30: 19.7   6 145.0
31: 15.0   8 301.0
32: 21.4   4 121.0
     mpg cyl  disp

# retrieve data for rows 1, 5, and 10 and select column "mpg"
task_mtcars$data(rows = c(1, 5, 10), cols = "mpg")
    mpg
1: 21.0
2: 18.7
3: 19.2

To extract the complete data from the task, one can also convert it to a data.table:

# show summary of entire data
summary(as.data.table(task_mtcars))
      mpg             cyl             disp      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8  
 Median :19.20   Median :6.000   Median :196.3  
 Mean   :20.09   Mean   :6.188   Mean   :230.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0  

3.4.2 Task Mutators

It is often necessary to create tasks that encompass subsets of other tasks’ data, for example to manually create train-test-splits, or to fit models on a subset of given features. Restricting tasks to a given set of features can be done by calling $select() with the desired feature names. Restriction to rows is done with $filter() with the row-IDs.

task_penguins_small = tsk("penguins")
task_penguins_small$select(c("body_mass", "flipper_length")) # keep only these features
task_penguins_small$filter(2:4) # keep only these rows
task_penguins_small$data()
   species body_mass flipper_length
1:  Adelie      3800            186
2:  Adelie      3250            195
3:  Adelie        NA             NA

These methods are so-called mutators: they modify the given Task in-place. If you want to keep an unmodified version of the task, you need to use the $clone() method to create a copy first.

task_penguins_smaller = task_penguins_small$clone()
task_penguins_smaller$filter(2)
task_penguins_smaller$data()
   species body_mass flipper_length
1:  Adelie      3800            186
task_penguins_small$data()  # this task is unmodified
   species body_mass flipper_length
1:  Adelie      3800            186
2:  Adelie      3250            195
3:  Adelie        NA             NA

Note also how the last call to $filter(2) did not select the second row of task_penguins_small, but selected the row with ID 2, which is the first row of task_penguins_small.
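The need for $clone() follows from the reference semantics of R6 objects: a plain assignment does not copy a task, it only creates a second name for the same object. The following sketch illustrates the difference using the predefined mtcars task; the variable names are illustrative.

```r
library("mlr3")

task = tsk("mtcars")

task_alias = task          # no copy, both names refer to the same object
task_copy = task$clone()   # independent copy

# the mutator is called on the alias ...
task_alias$filter(1:5)

# ... but the original task is affected as well
task$nrow

# the clone still contains all rows
task_copy$nrow
```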

While the methods above allow us to subset the data, the methods $rbind() and $cbind() allow adding extra rows and columns to a task.

task_penguins_smaller$rbind(  # add another row
  data.frame(body_mass = 1e9, flipper_length = 1e9, species = "GigaPeng")
)
task_penguins_smaller$cbind(data.frame(letters = letters[2:3])) # add column letters
task_penguins_smaller$data()
    species  body_mass flipper_length letters
1:   Adelie       3800            186       b
2: GigaPeng 1000000000     1000000000       c

3.4.3 Roles (Rows and Columns)

We have seen that certain columns are designated as “targets” and “features” during task creation. These designations are the columns’ “roles”: the target is the variable (or variables) we want to predict, and the features are the predictors (also called covariates) for the target. Besides these two, there are other possible roles for columns; see the documentation of Task. These roles affect the behavior of the task for different operations.

The previously-constructed task_penguins_small task, for example, has the following column roles:

task_penguins_small$col_roles
$feature
[1] "body_mass"      "flipper_length"

$target
[1] "species"

$name
character(0)

$order
character(0)

$stratum
character(0)

$group
character(0)

$weight
character(0)

Columns can have multiple roles. It is also possible for a column to have no role at all, in which case it is ignored. This is, in fact, how $select() and $filter() operate: they unassign the "feature" (for columns) or "use" (for rows) role, while the underlying data is still present in the backend:

task_penguins_small$backend
<DataBackendDataTable> (344x9)
 species    island bill_length bill_depth flipper_length body_mass    sex year
  Adelie Torgersen        39.1       18.7            181      3750   male 2007
  Adelie Torgersen        39.5       17.4            186      3800 female 2007
  Adelie Torgersen        40.3       18.0            195      3250 female 2007
  Adelie Torgersen          NA         NA             NA        NA   <NA> 2007
  Adelie Torgersen        36.7       19.3            193      3450 female 2007
  Adelie Torgersen        39.3       20.6            190      3650   male 2007
 ..row_id
        1
        2
        3
        4
        5
        6
[...] (338 rows omitted)

There are two main ways to manipulate the column roles of a Task:

  1. Use the Task method $set_col_roles() (recommended).
  2. Simply modify the field $col_roles, which is a named list of vectors of column names. Each vector in this list corresponds to a column role, and the column names contained in that vector have that role.

Just like $select()/$filter(), these are in-place operations, so the task object itself is modified. To retain another unmodified version of a task, use $clone().

Changing the column or row roles, whether by $select()/$filter() or directly, does not change the underlying data; it just updates the view on it. Because the underlying data is still there (and accessible through $backend), we can add the "bill_length" column back into the task by setting its column role to "feature".

task_penguins_small$set_col_roles("bill_length", roles = "feature")
task_penguins_small$feature_names  # bill_length is now a feature again
[1] "body_mass"      "flipper_length" "bill_length"   
task_penguins_small$data()
   species body_mass flipper_length bill_length
1:  Adelie      3800            186        39.5
2:  Adelie      3250            195        40.3
3:  Adelie        NA             NA          NA

Supported column roles can be found in the manual of Task, or just by printing the names of the field $col_roles:

# supported column roles, see ?Task
names(task_penguins_small$col_roles)
[1] "feature" "target"  "name"    "order"   "stratum" "group"   "weight" 
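The second approach listed above, modifying the $col_roles field directly, might look like the following sketch. It uses the predefined mtcars task instead of the penguins task so that no extra package is needed; the choice of columns is arbitrary.

```r
library("mlr3")

task = tsk("mtcars")
task$select(c("cyl", "disp"))  # "hp" loses its feature role
task$feature_names

# assign the "feature" role to "hp" again by editing the named list
task$col_roles$feature = union(task$col_roles$feature, "hp")
task$feature_names
```

Editing the list directly is flexible, but $set_col_roles() states the intent more explicitly, which is presumably why it is the recommended option.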

Just like columns, it is also possible to assign different roles to rows. Rows can have two different roles:

  1. Role use: Rows that are generally available for model fitting (although they may also be used as test set in resampling). This role is the default role. The $filter() call changes this role, in the same way that $select() changes the "feature" role.
  2. Role validation: Rows that are not used for training. Rows that have missing values in the target column during task creation are automatically set to the validation role.

There are several reasons to hold some observations back or treat them differently:

  1. It is often good practice to validate the final model on an external validation set to identify possible overfitting.
  2. Some observations may be unlabeled, e.g. in competitions like Kaggle.

These observations cannot be used for training a model, but can be used to get predictions.
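The effect of $filter() on the row roles can be inspected through the $row_roles field. This is only a sketch: the set of available row roles depends on the installed mlr3 version, so the example looks at the "use" role alone.

```r
library("mlr3")

task = tsk("mtcars")
task$filter(2:4)

# only the filtered row IDs keep the "use" role
task$row_roles$use

# the backend still holds all rows
task$backend$nrow
```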

3.4.4 Binary Classification

Classification problems with a target variable with only two classes are called binary classification tasks. They are special in the sense that one of these classes is denoted positive and the other one negative. You can specify the positive class within the classification task object during task creation. If not explicitly set during construction, the positive class defaults to the first level of the target variable.

# during construction
data("Sonar", package = "mlbench")
task = as_task_classif(Sonar, target = "Class", positive = "R")

# switch positive class to level 'M'
task$positive = "M"
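To see the default and the effect of switching without installing mlbench, the following sketch builds a tiny two-class task by hand. The data frame, its column names and its values are made up for illustration; $positive, $negative and $class_names are fields of TaskClassif.

```r
library("mlr3")

# toy data with a two-level factor target; "no" is the first level
toy = data.frame(
  x = c(1.2, 3.4, 0.5, 2.2),
  y = factor(c("no", "yes", "no", "yes"), levels = c("no", "yes"))
)

task_toy = as_task_classif(toy, target = "y")

# without an explicit setting, the first level is the positive class
task_toy$positive

# switch the positive class and inspect the result
task_toy$positive = "yes"
task_toy$positive
task_toy$negative
```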

3.5 Plotting Tasks

The mlr3viz package provides plotting facilities for many classes implemented in mlr3. The available plot types depend on the class, but all plots are returned as ggplot2 objects which can be easily customized.

For classification tasks (inheriting from TaskClassif), see the documentation of mlr3viz::autoplot.TaskClassif for the implemented plot types. Here are some examples to get an impression:

library("mlr3viz")

# get the pima indians task
task = tsk("pima")

# subset task to only use the first three features
task$select(head(task$feature_names, 3))

# default plot: class frequencies
autoplot(task)

# pairs plot (requires package GGally)
autoplot(task, type = "pairs")

# duo plot (requires package GGally)
autoplot(task, type = "duo")

Of course, you can do the same for regression tasks (inheriting from TaskRegr) as documented in mlr3viz::autoplot.TaskRegr:

library("mlr3viz")

# get the complete mtcars task
task = tsk("mtcars")

# subset task to only use the first three features
task$select(head(task$feature_names, 3))

# default plot: boxplot of target variable
autoplot(task)

# pairs plot (requires package GGally)
autoplot(task, type = "pairs")
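Since the default plots are ggplot2 objects, they can be extended with the usual ggplot2 layers. The following is a minimal sketch; the title text and the theme are arbitrary choices. Note that the pairs plots are produced via GGally and may need to be customized differently.

```r
library("mlr3")
library("mlr3viz")
library("ggplot2")

task = tsk("mtcars")

# default plot for a regression task
p = autoplot(task)

# add a title and switch the theme, as with any other ggplot2 object
p_custom = p + ggtitle("Distribution of mpg in mtcars") + theme_minimal()
print(p_custom)
```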