31  Density Estimation


The {mlr3proba} package is currently in a fragile state following its removal from CRAN. Hence most code examples on this page will not work for the time being. We are actively working on a solution!

Density estimation is the task of learning the unknown distribution from which an i.i.d. data set is generated. We interpret this broadly: the distribution need not be continuous, so it may possess a mass rather than a density. The conditional case, in which a distribution is predicted conditional on covariates, is known as ‘probabilistic supervised regression’ and will be implemented in mlr3proba in the near future. Unconditional density estimation is viewed as an unsupervised task. For a good overview of density estimation see Density estimation for statistics and data analysis (Silverman 1986).

The package mlr3proba extends mlr3 with the following objects for density estimation:

  • TaskDens to define density tasks
  • LearnerDens as base class for density estimators
  • PredictionDens as specialized class for Prediction objects
  • MeasureDens as specialized class for performance measures

In this example we demonstrate the basic functionality of the package on the precip data from the datasets package. This data ships as the pre-defined TaskDens "precip" with mlr3proba.


task = tsk("precip")
task
<TaskDens:precip> (70 x 1): Annual Precipitation
* Target: -
* Properties: -
* Features (1):
  - dbl (1): precip
# histogram and density plot
autoplot(task, type = "overlay")

Unconditional density estimation is an unsupervised method. Hence, TaskDens is an unsupervised task which inherits directly from Task, unlike TaskClassif and TaskRegr. However, TaskDens still has a target argument and a $truth field, both of which refer to the single column of data whose density is to be estimated.

31.1 Train and Predict

Density learners have train and predict methods, though as the task is unsupervised, ‘prediction’ is really ‘estimation’. In training, a distr6 object is created; see the distr6 documentation for full tutorials on how to access the probability density function, pdf, the cumulative distribution function, cdf, and other important fields and methods. The predict method is simply a wrapper around self$model$pdf and, if available, self$model$cdf, i.e. it evaluates the pdf/cdf at given points. Note that in prediction, the points at which to evaluate the pdf and cdf are determined by the target column of the TaskDens object used for testing.
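Since mlr3proba itself may not be installable at the moment (see the note above), this train/‘predict’ pattern can be illustrated with a plain-Python sketch. The class name HistDensity and its methods are hypothetical, not part of any package: train() fits a histogram model from the training points, and predict() merely evaluates the stored pdf at new points.

```python
class HistDensity:
    """Toy density 'learner': train() builds an equal-width histogram,
    predict() evaluates the fitted pdf at new points (hypothetical API)."""

    def train(self, x, bins=8):
        lo, hi = min(x), max(x)
        width = (hi - lo) / bins
        counts = [0] * bins
        for v in x:
            # clamp the top edge into the last bin
            i = min(int((v - lo) / width), bins - 1)
            counts[i] += 1
        n = len(x)
        self.edges = [lo + k * width for k in range(bins + 1)]
        # normalise so the histogram integrates to one
        self.dens = [c / (n * width) for c in counts]
        return self

    def pdf(self, v):
        if v < self.edges[0] or v > self.edges[-1]:
            return 0.0
        width = self.edges[1] - self.edges[0]
        i = min(int((v - self.edges[0]) / width), len(self.dens) - 1)
        return self.dens[i]

    def predict(self, new_x):
        # 'prediction' is just evaluation of the estimated pdf
        return [self.pdf(v) for v in new_x]
```

For example, `HistDensity().train(train_points).predict(test_points)` mirrors the `learner$train(task)` / `learner$predict(task)` calls below, with the test task's target column supplying the evaluation points.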

# create task and learner

task_faithful = TaskDens$new(id = "eruptions",
  backend = data.frame(eruptions = datasets::faithful$eruptions))
learner = lrn("dens.hist")

# train/test split
train_set = sample(task_faithful$nrow, 0.8 * task_faithful$nrow)
test_set = setdiff(seq_len(task_faithful$nrow), train_set)

# fit the histogram estimator and inspect the model
learner$train(task_faithful, row_ids = train_set)

learner$model

$breaks
[1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5

$counts
[1] 35 38  4  6 26 56 48  4

$density
[1] 0.32258065 0.35023041 0.03686636 0.05529954 0.23963134 0.51612903 0.44239631
[8] 0.03686636

$mids
[1] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25

$xname
[1] "dat"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"

learner$id
[1] "dens.hist"
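The printed $density values are simply the bin counts normalised so that the histogram integrates to one: density = count / (n × binwidth). A quick check in plain Python, with the counts and the 0.5 bin width copied from the output above (the counts sum to 217 training observations):

```python
counts = [35, 38, 4, 6, 26, 56, 48, 4]   # bin counts from the model above
width = 0.5                               # bin width implied by the breaks
n = sum(counts)                           # 217 training observations
density = [c / (n * width) for c in counts]
# density[0] reproduces 0.32258..., the first printed $density value
```

Summing density × width over all bins gives exactly 1, confirming the estimate is a valid pdf.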
# make predictions for new data
prediction = learner$predict(task_faithful, row_ids = test_set)

Every PredictionDens object can estimate:

  • pdf - probability density function

Some learners can estimate:

  • cdf - cumulative distribution function
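For learners that also expose a cdf, the histogram case illustrates the relationship between the two: the cdf at a point is the accumulated bin area up to that point. A plain-Python sketch (a hypothetical helper, not mlr3proba code):

```python
def hist_cdf(edges, density):
    """CDF of a histogram density: cumulative area of the bins,
    accumulating a partial area within the bin containing v."""
    def cdf(v):
        total = 0.0
        for left, right, d in zip(edges, edges[1:], density):
            if v >= right:
                total += d * (right - left)   # whole bin below v
            elif v > left:
                total += d * (v - left)       # partial bin up to v
        return min(total, 1.0)
    return cdf
```

With edges [0, 1, 2] and density [0.5, 0.5], for instance, the cdf rises linearly from 0 at 0 to 1 at 2.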

31.2 Benchmark Experiment

Finally, we conduct a small benchmark study on the precip task using some of the integrated density learners:

# some integrated learners
learners = lrns(c("dens.hist", "dens.kde"))
learners
<LearnerDensHistogram:dens.hist>: Histogram Density Estimator
* Model: -
* Parameters: list()
* Packages: mlr3, mlr3proba, distr6
* Predict Type: pdf
* Feature types: integer, numeric
* Properties: -

<LearnerDensKDE:dens.kde>: Kernel Density Estimator
* Model: -
* Parameters: kernel=Epan, bandwidth=silver
* Packages: mlr3, mlr3proba, distr6
* Predict Type: pdf
* Feature types: integer, numeric
* Properties: missings
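The defaults printed above, kernel=Epan and bandwidth=silver, refer to the Epanechnikov kernel and Silverman's rule of thumb. The combination can be sketched in plain Python (a simplified illustration, not mlr3proba's implementation; Silverman's rule is taken here as 0.9 · min(sd, IQR/1.34) · n^(-1/5)):

```python
import statistics

def silverman_bandwidth(x):
    # Silverman's rule of thumb: 0.9 * min(sd, IQR/1.34) * n^(-1/5)
    n = len(x)
    sd = statistics.stdev(x)
    q1, _, q3 = statistics.quantiles(x, n=4)
    return 0.9 * min(sd, (q3 - q1) / 1.34) * n ** (-0.2)

def epanechnikov(u):
    # Epanechnikov kernel: 3/4 * (1 - u^2) on [-1, 1], else 0
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0

def kde_pdf(x_train, h):
    # KDE: average of kernels centred at the training points, scaled by h
    n = len(x_train)
    def pdf(v):
        return sum(epanechnikov((v - xi) / h) for xi in x_train) / (n * h)
    return pdf
```

Because the Epanechnikov kernel has compact support, the resulting pdf is exactly zero far away from the data, and it integrates to one over the support of the training points.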
# Logloss for probabilistic predictions
measure = msr("dens.logloss")
measure
<MeasureDensLogloss:dens.logloss>: Log Loss
* Packages: mlr3, mlr3proba
* Range: [0, Inf]
* Minimize: TRUE
* Average: macro
* Parameters: eps=1e-15
* Properties: -
* Predict type: pdf
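The log loss for density estimation is the mean negative log density at the held-out points, with the eps parameter above guarding against log(0) when a learner assigns zero density to a test point. A plain-Python sketch (hypothetical function name):

```python
import math

def dens_logloss(pdf_values, eps=1e-15):
    # Mean negative log density over the test points; pdf values are
    # clipped below at eps so that log(0) cannot occur.
    return -sum(math.log(max(p, eps)) for p in pdf_values) / len(pdf_values)
```

Lower values are better: a pdf value of 1 at every test point gives a log loss of 0, while a zero pdf value is penalised at -log(eps) ≈ 34.5 rather than infinity.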
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
INFO  [21:39:10.478] [mlr3] Running benchmark with 6 resampling iterations 
INFO  [21:39:10.587] [mlr3] Applying learner 'dens.kde' on task 'precip' (iter 3/3) 
INFO  [21:39:10.636] [mlr3] Applying learner 'dens.kde' on task 'precip' (iter 1/3) 
INFO  [21:39:10.663] [mlr3] Applying learner 'dens.hist' on task 'precip' (iter 3/3) 
INFO  [21:39:10.680] [mlr3] Applying learner 'dens.hist' on task 'precip' (iter 1/3) 
INFO  [21:39:10.697] [mlr3] Applying learner 'dens.kde' on task 'precip' (iter 2/3) 
INFO  [21:39:10.724] [mlr3] Applying learner 'dens.hist' on task 'precip' (iter 2/3) 
INFO  [21:39:10.745] [mlr3] Finished benchmark 
bmr$aggregate(measure)
   nr      resample_result task_id learner_id resampling_id iters dens.logloss
1:  1 <ResampleResult[22]>  precip  dens.hist            cv     3     4.396138
2:  2 <ResampleResult[22]>  precip   dens.kde            cv     3     4.817715
autoplot(bmr, measure = measure)

The results of this experiment show that the kernel density estimator does not outperform the baseline histogram here: the histogram achieves the lower (i.e. better) logloss on the precip task.