3.2 Tasks

Tasks are objects that include the data set and additional meta information about a machine-learning problem. This could be the name of the target variable for supervised problems or whether the data set belongs to a specific community of datasets (e.g. a spatial or survival dataset). This information can be used at different points of the workflow to account for specific characteristics of these dataset types.

3.2.1 Task Types

To create a task from a data.frame() or data.table() object, the task type needs to be selected:

  • Classification Task: Target column contains labels (stored as character()/factor()) with only a few distinct values.
    \(\Rightarrow\) mlr3::TaskClassif
  • Regression Task: Target column is numeric (stored as integer()/double()).
    \(\Rightarrow\) mlr3::TaskRegr
  • Survival Task: Target is the (right-censored) time to event.
    \(\Rightarrow\) mlr3survival::TaskSurv in add-on package mlr3survival
  • Ordinal Regression Task: Target is ordinal.
    \(\Rightarrow\) mlr3ordinal::TaskOrdinal in add-on package mlr3ordinal
  • Cluster Task: You don’t have a target but want to identify similarities in the feature space.
    \(\Rightarrow\) Not yet implemented
  • Spatial Task: The observations come with spatio-temporal information (e.g. coordinates).
    \(\Rightarrow\) Not yet implemented in add-on package mlr3spatiotemporal

3.2.2 Task Creation

Let’s assume we want to create a simple regression task using the mtcars data set from the package datasets to predict the column "mpg" (miles per gallon). For this showcase we only take the first two features to keep the output in the following examples compact.
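The data preparation described above can be sketched as follows. The variable name `data` is chosen only for illustration; the first three columns of mtcars are the target "mpg" plus the two features "cyl" and "disp".

```r
# Load mtcars from the datasets package and keep the target plus two features
data("mtcars", package = "datasets")
data = mtcars[, 1:3]

# Inspect the structure of the reduced data set
str(data)
```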

Next, we create the task by providing the following information:

  1. id: Identifier for the task, used in plots and summaries.
  2. backend: Here, we simply provide the dataset. It is internally converted to a DataBackendDataTable. For more fine-grained control over how the data is stored internally, we could also construct a DataBackend manually.
  3. target: Column name of the target column for the regression problem.
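A minimal sketch of the construction with these three arguments, assuming the mlr3 package is installed. The id "cars" and the variable name `task_mtcars` are illustrative choices, not prescribed by the text.

```r
library(mlr3)

# Reduced mtcars data: target "mpg" plus the features "cyl" and "disp"
data("mtcars", package = "datasets")
data = mtcars[, 1:3]

# Create the regression task from id, backend and target
task_mtcars = TaskRegr$new(id = "cars", backend = data, target = "mpg")

# Short summary of the task
print(task_mtcars)
```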

The print() method gives a short summary of the task: It has 32 observations and 3 columns, 2 of which are features.

We can also plot the task using the mlr3viz package:
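A possible way to do this is sketched below. It assumes that the add-on package mlr3viz (and its dependency ggplot2) is installed and that its autoplot() method for regression tasks is used with default arguments; the exact plot type may differ between versions.

```r
library(mlr3)
library(mlr3viz)

# Same regression task as before (id "cars" is an illustrative choice)
data("mtcars", package = "datasets")
task_mtcars = TaskRegr$new(id = "cars", backend = mtcars[, 1:3], target = "mpg")

# autoplot() returns a ggplot object that can be printed or modified further
p = autoplot(task_mtcars)
print(p)
```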

3.2.3 Predefined tasks

mlr3 ships with some predefined machine-learning tasks. These are stored in an R6 Dictionary, which is a simple key-value storage named mlr3::mlr_tasks. If we simply print it out, we see that it has 9 entries:
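Printing the dictionary can be sketched as follows. The number and names of the entries depend on the installed mlr3 version and on which add-on packages are loaded, so the output may not show exactly the count given above.

```r
library(mlr3)

# Print the dictionary of predefined tasks
print(mlr_tasks)

# The keys can also be retrieved as a character vector
mlr_tasks$keys()
```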

We can obtain a summarizing overview of all stored tasks by converting the dictionary to a data.table() object:
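A short sketch of this conversion; the variable name `tasks_overview` is illustrative, and the set of columns in the result may vary with the mlr3 version.

```r
library(mlr3)
library(data.table)

# One row per stored task, with meta information such as key and task type
tasks_overview = as.data.table(mlr_tasks)
print(tasks_overview)
```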

To create a task from the dictionary (think of it as a book shelf), we use the $get() method of the mlr_tasks dictionary and assign the result to a new object.

For example, if we would like to use the iris data set for classification:
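Retrieving the task by its key "iris" could look like this; `task_iris` is an illustrative variable name.

```r
library(mlr3)

# Fetch the predefined iris classification task from the dictionary
task_iris = mlr_tasks$get("iris")
print(task_iris)
```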

3.2.4 Task API

All task properties and characteristics can be queried using the task’s public member values and methods (see Task).

  • Member values: Values stored in the object that can be queried by the user
  • Member methods: Functions from the object that can accept arguments and return information stored in the object

3.2.4.1 Retrieve Data

In mlr3, each row (observation) has a unique identifier which can be either integer or character. These can be used to select specific rows.

The iris dataset uses integer row_ids:
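As a sketch, the row ids can be inspected through the `row_ids` field and passed to `$data()` to select rows. The three ids used here are arbitrary examples.

```r
library(mlr3)

task_iris = mlr_tasks$get("iris")

# The first few row ids of the iris task
head(task_iris$row_ids)

# Retrieve the data for three specific rows
selected = task_iris$data(rows = c(1L, 51L, 101L))
print(selected)
```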

The mtcars dataset, in contrast, uses names for its row_ids, encoded as character:
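The row ids of the predefined mtcars task can be inspected in the same way. This is only a sketch: whether the car names are kept as character row ids, as described above, or replaced by integer ids is an assumption that depends on the installed mlr3 version.

```r
library(mlr3)

task_mtcars = mlr_tasks$get("mtcars")

# Row ids of the mtcars task; their type may differ between mlr3 versions
head(task_mtcars$row_ids)
```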

Note that the method $data() is only an accessor and does not modify the underlying data/task.

Analogously, each column has an identifier, which is often just called “column name”. These are stored in the public slots feature_names and target_names. Here “target” refers to the response variable and “feature” to the predictor variables of the task.
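For the iris task, the two fields can be queried like this:

```r
library(mlr3)

task_iris = mlr_tasks$get("iris")

# Names of the predictor columns
task_iris$feature_names

# Name of the response column
task_iris$target_names
```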

Row ids and column names can be combined for subsetting:
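A minimal sketch using the `rows` and `cols` arguments of `$data()` together; the selected ids and the column are arbitrary examples.

```r
library(mlr3)

task_iris = mlr_tasks$get("iris")

# Retrieve only the target column for three specific rows
subset_dt = task_iris$data(rows = c(1L, 51L, 101L), cols = "Species")
print(subset_dt)
```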

To extract the complete dataset from the task, we can simply convert the task to a data.table:
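The conversion can be sketched as follows; `iris_dt` is an illustrative variable name.

```r
library(mlr3)
library(data.table)

task_iris = mlr_tasks$get("iris")

# Convert the task to a data.table containing target and feature columns
iris_dt = as.data.table(task_iris)
summary(iris_dt)
```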

3.2.4.2 Roles (Rows and Columns)

It is possible to assign special meanings (aka “roles”) to (subsets of) rows and columns.

For example, the previously constructed mtcars task has the following column roles:
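The column roles are available in the `col_roles` field, a named list with one entry per role. The sketch below rebuilds the small mtcars task first so that it is self-contained; which additional (empty) roles are listed depends on the mlr3 version.

```r
library(mlr3)

# Regression task on the first three columns of mtcars (id is illustrative)
data("mtcars", package = "datasets")
task_mtcars = TaskRegr$new(id = "cars", backend = mtcars[, 1:3], target = "mpg")

# Named list mapping each role to the columns that hold it
task_mtcars$col_roles
```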

Now, we want the original rownames() of mtcars to be a regular feature column. Thus, we first preprocess the data.frame and then re-create the task.
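One way to do this preprocessing, sketched under the assumption that data.table's `keep.rownames = TRUE` is used to turn the row names into a column named "rn":

```r
library(mlr3)
library(data.table)

# Convert to data.table and keep the row names as a regular column "rn"
data("mtcars", package = "datasets")
data = as.data.table(mtcars[, 1:3], keep.rownames = TRUE)

# Re-create the task; "rn" is now picked up as a feature
task = TaskRegr$new(id = "cars", backend = data, target = "mpg")
task$feature_names
```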

The column “rn” is now a regular feature. As this is a unique string column, most machine-learning algorithms will have trouble processing this feature without some kind of preprocessing. However, we still might want to carry rn around for different reasons. For example, we can use the row names in plots or to associate outliers with the row names. That said, we need to change the role of the row names column rn and remove it from the set of active features.
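The sketch below removes rn from the active features by assigning to the `col_roles` field. This is a swapped-in approach chosen because it does not rely on a particular mutator method; the availability of an assignable `col_roles` field is an assumption about the installed mlr3 version. The column stays in the backend and is only hidden from the feature set.

```r
library(mlr3)
library(data.table)

# Task with the row names stored in the column "rn"
data("mtcars", package = "datasets")
data = as.data.table(mtcars[, 1:3], keep.rownames = TRUE)
task = TaskRegr$new(id = "cars", backend = data, target = "mpg")

# Drop "rn" from the feature role (assumption: col_roles is assignable)
task$col_roles$feature = setdiff(task$col_roles$feature, "rn")

# "rn" is no longer an active feature ...
task$feature_names

# ... but it is still stored in the underlying backend
"rn" %in% task$backend$colnames
```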

Note that this operation does not copy the underlying data. By changing roles, only the view on the data is changed, not the data itself.

Just like columns, it is also possible to assign different roles to rows. Rows can have two different roles:

  1. Role "use": Rows that are generally available for model fitting (although they may also be used as test set in resampling). This is the default role.
  2. Role "validation": Rows that are held back (see below). Rows which have missing values in the target column upon task creation are automatically moved to the validation set.

There are several reasons to hold some observations back or treat them differently:

  1. It is often good practice to validate the final model on an external validation set to uncover possible overfitting.
  2. Some observations may be unlabeled, e.g. in data mining cups or Kaggle competitions. These observations cannot be used for training a model, but you can still predict labels.

3.2.4.3 Task Mutators

Task methods .$set_col_role() and .$set_row_role() change the view on the data and can be used to subset the task. For convenience, method .$filter() subsets the task based on row ids, and .$select() subsets the task based on feature names. All of these operations modify the task in-place, but they only change the view on the data and do not create a copy of it.
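A brief sketch of `$select()` and `$filter()` on the iris task; the chosen feature names and row ids are arbitrary examples. Because both methods modify the task in-place, no assignment is needed.

```r
library(mlr3)

task = mlr_tasks$get("iris")

# Keep only two of the four features
task$select(c("Sepal.Length", "Sepal.Width"))

# Keep only the rows with ids 1, 2 and 3
task$filter(1:3)

# The task now exposes 3 rows: the target plus the two selected features
task$data()
```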

Additionally, methods .$rbind() and .$cbind() allow adding extra rows and columns to a task, respectively. Again, the original data set stored in the original DataBackend is not altered in any way.
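Both methods can be sketched on two fresh copies of the iris task. The new rows reuse existing iris observations so that column names and types match the task, and the extra column `foo` is a hypothetical feature added purely for illustration.

```r
library(mlr3)
library(data.table)

# Add two rows; the new data must provide the target and all features
task_r = mlr_tasks$get("iris")
task_r$rbind(iris[1:2, ])
task_r$nrow

# Add one column with one value per row of the task
task_c = mlr_tasks$get("iris")
task_c$cbind(data.table(foo = seq_len(task_c$nrow)))
task_c$feature_names
```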