7.2 Density Estimation
Density estimation is the task of learning the unknown distribution from which an i.i.d. data set is assumed to be generated. We interpret this broadly: the distribution need not be continuous (it may possess a probability mass function rather than a density). The conditional case, in which a distribution is predicted conditional on covariates, is known as 'probabilistic supervised regression' and will be implemented in mlr3proba in the near future. Unconditional density estimation is viewed as an unsupervised task. For a good overview of density estimation see Density estimation for statistics and data analysis (Silverman 1986).
The package mlr3proba extends mlr3 with the following objects for density estimation:
- TaskDens to define density tasks
- LearnerDens as base class for density estimators
- PredictionDens as specialized class for Prediction objects
- MeasureDens as specialized class for performance measures
In this example we demonstrate the basic functionality of the package on the precip data from the datasets package. This task ships as a pre-defined TaskDens with mlr3proba.
library("mlr3")
library("mlr3proba")
task = tsk("precip")
print(task)
## <TaskDens:precip> (70 x 1)
## * Target: -
## * Properties: -
## * Features (1):
## - dbl (1): precip
# histogram and density plot
library("mlr3viz")
autoplot(task, type = "overlay")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Unconditional density estimation is an unsupervised method. Hence, TaskDens is an unsupervised task which inherits directly from Task, unlike TaskClassif and TaskRegr. However, TaskDens still has a target argument and a $truth field defined by:

- target - the name of the variable in the data for which to estimate the density
- $truth - the values of the target column (which are the observed data, not the true density, which is always unknown)
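As a minimal sketch of the two fields above, assuming TaskDens accepts a one-column data.frame as backend (as it accepts a plain numeric vector later in this section), the hypothetical task "toy" below exposes its observed values via $truth():

```r
library("mlr3proba")

# toy task from a single numeric column; that column is the target
task = TaskDens$new(id = "toy", backend = data.frame(x = c(1.2, 3.4, 2.2)))

# the observed values of the target column -- not the true density
task$truth()
```

Note that $truth() returns the raw observations; the density itself is only available after training a learner.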
7.2.1 Train and Predict
Density learners have train and predict methods, though being unsupervised, 'prediction' is actually 'estimation'. During training, a distr6 object is created; see the distr6 documentation for full tutorials on how to access the probability density function, pdf, the cumulative distribution function, cdf, and other important fields and methods.
The predict method is simply a wrapper around self$model$pdf and, if available, self$model$cdf, i.e. it evaluates the pdf/cdf at given points. Note that in prediction the points at which to evaluate the pdf and cdf are determined by the target column in the TaskDens object used for testing.
# create task and learner
task_faithful = TaskDens$new(id = "eruptions", backend = datasets::faithful$eruptions)
learner = lrn("dens.hist")

# train/test split
train_set = sample(task_faithful$nrow, 0.8 * task_faithful$nrow)
test_set = setdiff(seq_len(task_faithful$nrow), train_set)

# fit the histogram estimator and inspect the model
learner$train(task_faithful, row_ids = train_set)
learner$model
## $distr
## Histogram
##
## $hist
## $breaks
## [1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
##
## $counts
## [1] 41 31 4 6 23 58 50 4
##
## $density
## [1] 0.37788 0.28571 0.03687 0.05530 0.21198 0.53456 0.46083 0.03687
##
## $mids
## [1] 1.75 2.25 2.75 3.25 3.75 4.25 4.75 5.25
##
## $xname
## [1] "dat"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
##
## attr(,"class")
## [1] "dens.hist"
class(learner$model)
## [1] "dens.hist"
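The fitted distr6 distribution stored in the model (the $distr element shown in the output above) can be queried directly. The following self-contained sketch refits the histogram estimator on the full eruptions data and evaluates the estimated pdf and cdf at a point; the evaluation point x = 3 is chosen arbitrarily for illustration:

```r
library("mlr3")
library("mlr3proba")

# refit the histogram estimator on all faithful eruptions data
task = TaskDens$new(id = "eruptions", backend = datasets::faithful$eruptions)
learner = lrn("dens.hist")$train(task)

# the fitted distr6 distribution exposes pdf() and cdf() methods
fit = learner$model$distr
fit$pdf(3)  # estimated density at x = 3
fit$cdf(3)  # estimated P(X <= 3)
```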
# make predictions for new data
prediction = learner$predict(task_faithful, row_ids = test_set)
Every PredictionDens object can estimate:

- pdf - probability density function

Some learners can additionally estimate:

- cdf - cumulative distribution function
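Assuming the pdf and cdf estimates are stored per test point in the $pdf and $cdf fields of the prediction, the self-contained sketch below repeats the train/predict step with a fixed split (rows 1-200 for training, the remaining 72 for testing) and inspects the first few estimates:

```r
library("mlr3")
library("mlr3proba")

task = TaskDens$new(id = "eruptions", backend = datasets::faithful$eruptions)
learner = lrn("dens.hist")$train(task, row_ids = 1:200)
prediction = learner$predict(task, row_ids = 201:272)

prediction$pdf[1:3]  # per-point density estimates
prediction$cdf[1:3]  # available since dens.hist also estimates a cdf
```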
7.2.2 Benchmark Experiment
Finally, we conduct a small benchmark study on the precip task using some of the integrated density learners:
# some integrated learners
learners = lrns(c("dens.hist", "dens.kde"))
print(learners)
## [[1]]
## <LearnerDensHistogram:dens.hist>
## * Model: -
## * Parameters: list()
## * Packages: distr6
## * Predict Type: pdf
## * Feature types: integer, numeric
## * Properties: -
##
## [[2]]
## <LearnerDensKDE:dens.kde>
## * Model: -
## * Parameters: kernel=Epan, bandwidth=silver
## * Packages: distr6
## * Predict Type: pdf
## * Feature types: integer, numeric
## * Properties: missings
# Logloss for probabilistic predictions
measure = msr("dens.logloss")
print(measure)
## <MeasureDensLogloss:dens.logloss>
## * Packages: -
## * Range: [0, Inf]
## * Minimize: TRUE
## * Properties: -
## * Predict type: pdf
set.seed(1)
bmr = benchmark(benchmark_grid(task, learners, rsmp("cv", folds = 3)))
bmr$aggregate(measure)
## nr resample_result task_id learner_id resampling_id iters dens.logloss
## 1: 1 <ResampleResult[21]> precip dens.hist cv 3 4.396
## 2: 2 <ResampleResult[21]> precip dens.kde cv 3 4.818
autoplot(bmr, measure = measure)
The results of this experiment show that the Kernel Density Estimator does not outperform the baseline Histogram on this task: the histogram achieves a consistently better (i.e. lower) logloss.