7 Preprocessing
The last chapter (Chapter 6) gave a technical introduction to mlr3pipelines. This chapter will show important concepts and typical problems when preprocessing data for machine learning, namely factor encoding, imputation of missing values, feature and target transformations, as well as feature extraction. Feature selection is an important aspect of preprocessing as well, but was already introduced in Chapter 5. In this chapter we will use mlr3pipelines on a dataset, introduced below, that possesses most of the mentioned difficulties. We will not be able to cover every concept and we refer the reader to relevant further resources.
But first we will briefly define data preprocessing in machine learning. There exist various definitions of preprocessing, data cleaning and feature engineering. To clarify, preprocessing here refers to everything that happens with the data before it is used to fit the model, while postprocessing encompasses everything that occurs with predictions after the model is fitted. Data cleaning is an important aspect of preprocessing as it involves the removal of errors, noise, and redundancy in the data. There is usually little ambiguity in what needs to be done here, and it is generally straightforward. Feature engineering, on the other hand, covers all other transformations of data before it is fed to the machine learning model. It involves the creation of useful features from possibly unstructured data, such as written text, sequences or images. We will focus less on data cleaning, as this process does not necessitate mlr3pipelines and is intertwined with exploratory data analysis (EDA).
Although deep learning has shown promising results in automating feature engineering, its effectiveness depends on the complexity and nature of the data being processed, as well as the specific problem being addressed. Typically, it is applicable to natural language processing (NLP) and computer vision (CV) problems, while standard tabular data often lacks the structure needed for deep learning models to extract meaningful features automatically. Furthermore, different problems require different features to be extracted, and deep learning models may not always be able to identify the most relevant features for a given problem without human guidance. The goal of feature engineering is to prepare the data so that a model can be trained on it, and/or to further improve predictive performance. It is important to note that feature engineering helps mostly for simpler algorithms, while highly complex models usually gain little from it and require little data preparation to be trained. Common difficulties in data include features with (highly) skewed distributions, high-cardinality categorical features, missing observations, high dimensionality and imbalanced classes in classification tasks.
7.1 Ames Housing Data
The data we will be using is an adapted version of the Ames housing data, initially collected by (de2011ames?). This data was collected as an alternative to the Boston Housing data for an end-of-semester regression project and is used in various other books on feature engineering1 and machine learning2 with R. Raw and processed versions of the data can be directly loaded from the AmesHousing package.
The dataset encompasses data related to 2,930 residential properties situated in Ames, Iowa, sold between 2006 and 2010. It contains 81 features on various aspects of the house (basement, garage, size, fence, pool, porch, bedrooms, etc.), size and shape of the lot as well as information about condition and quality. The prediction target is the sale price in USD, hence it is a regression task (Section 2.1).
We read in the data from the web and quickly peek at the target distribution.
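A minimal sketch of the loading step is shown below; the URL is a placeholder and should be replaced with the actual location of the adapted dataset.
# Placeholder URL: point this at the actual location of the adapted dataset.
library(data.table)
ames = fread("https://example.com/ames_adapted.csv", stringsAsFactors = TRUE)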
library(ggplot2)
ggplot(ames, aes(x = Sale_Price)) +
  geom_histogram()
As expected we see a skewed distribution of prices with a few outliers of very expensive properties.
7.2 Data Cleaning and Exploratory Data Analysis
In a first step we explore the data and look for simple problems such as constant or duplicated features. This can be done quite efficiently with a package like DataExplorer or skimr, which create a large number of plots and summaries.
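For example, such an overview could be generated like this (a sketch, not evaluated here):
# Compact per-column summary (types, missingness, distributions).
skimr::skim(ames)
# Full HTML report with histograms, bar plots and missing-value plots.
DataExplorer::create_report(ames)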
Here we just summarize the important findings:
summary(ames$Misc_Feature_2)
Othr
2930
Misc_Feature_2 is a factor with only a single level, Othr.
identical(ames$Condition_2, ames$Condition_3)
[1] TRUE
Condition_2 and Condition_3 are identical.
cor(ames$Lot_Area, ames$Lot_Area_m2)
[1] 1
Both features represent the lot area, just on different scales. For all three cases simple removal is sufficient.
to_remove = c("Lot_Area_m2", "Condition_3", "Misc_Feature_2")
Next to constant or identical features, further typical problems that should be checked for are:
- ID columns, i.e., columns that are unique for every observation and should be removed or tagged.
- NAs not correctly encoded, e.g., as "NA" or "".
- Semantic errors in the data, e.g., a negative Lot_Area.
- Numeric features encoded as categorical, for learners that cannot handle such features.
- …
We will check most of these in the next sections.
Before we continue with feature engineering, we create a task and choose a suitable performance measure and resampling strategy.
library(mlr3verse)
ames_task = TaskRegr$new(
backend = ames,
target = "Sale_Price",
id = "ames"
)
# drop the redundant features identified above
ames_task$select(setdiff(ames_task$feature_names, to_remove))
measure = msr("regr.mae")  # mean absolute error, in USD
cv10 = rsmp("cv", folds = 10)  # 10-fold cross-validation
cv10$instantiate(ames_task)
print(ames_task)
<TaskRegr:ames> (2930 x 79)
* Target: Sale_Price
* Properties: -
* Features (78):
- fct (45): Alley, Bldg_Type, BsmtFin_Type_1, BsmtFin_Type_2,
Bsmt_Cond, Bsmt_Exposure, Bsmt_Qual, Central_Air, Condition_1,
Condition_2, Electrical, Exter_Cond, Exter_Qual, Exterior_1st,
Exterior_2nd, Fence, Fireplace_Qu, Foundation, Functional,
Garage_Cond, Garage_Finish, Garage_Qual, Garage_Type, Heating,
Heating_QC, House_Style, Land_Contour, Land_Slope, Lot_Config,
Lot_Shape, MS_SubClass, MS_Zoning, Mas_Vnr_Type, Misc_Feature,
Neighborhood, Overall_Cond, Overall_Qual, Paved_Drive, Pool_QC,
Roof_Matl, Roof_Style, Sale_Condition, Sale_Type, Street, Utilities
- int (33): Bedroom_AbvGr, BsmtFin_SF_1, BsmtFin_SF_2,
Bsmt_Full_Bath, Bsmt_Half_Bath, Bsmt_Unf_SF, Enclosed_Porch,
Fireplaces, First_Flr_SF, Full_Bath, Garage_Area, Garage_Cars,
Garage_Yr_Blt, Gr_Liv_Area, Half_Bath, Kitchen_AbvGr, Lot_Area,
Lot_Frontage, Low_Qual_Fin_SF, Mas_Vnr_Area, Misc_Val, Mo_Sold,
Open_Porch_SF, Pool_Area, Screen_Porch, Second_Flr_SF,
Three_season_porch, TotRms_AbvGrd, Total_Bsmt_SF, Wood_Deck_SF,
Year_Built, Year_Remod_Add, Year_Sold
Lastly we want to compute a simple featureless baseline, i.e., always predicting the median sale price.
baseline_lrn = lrn("regr.featureless", robust = TRUE)
baseline_res = resample(ames_task, baseline_lrn, cv10)
baseline_res$aggregate(measure)
regr.mae
56081.57
We see that simply predicting the median house price results in a mean absolute error of a bit more than $56,000.
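As a quick sanity check (a sketch, computed in-sample and therefore slightly optimistic compared to the cross-validated estimate), the MAE of always predicting the overall median can be computed directly:
# In-sample MAE of predicting the median sale price for every house.
mean(abs(ames$Sale_Price - median(ames$Sale_Price)))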
7.3 Factor Encoding
In machine learning, categorical features are variables that can take on a limited set of values, such as Paved_Drive with the possible values Dirt_Gravel, Partial_Pavement and Paved in the Ames housing data.
However, many machine learning algorithms such as support vector machines (SVM) require numerical inputs to function properly.
As shown in the Introduction (?sec-lrns-add-list), we can easily list learners that directly support categorical features.
as.data.table(mlr_learners)[task_type == "regr" &
sapply(feature_types, function(x) "factor" %in% x)]
key label task_type
1: regr.IBk K-nearest neighbour regr
2: regr.M5Rules Rule-based Algorithm regr
3: regr.bart Bayesian Additive Regression Trees regr
4: regr.catboost Gradient Boosting regr
5: regr.cforest Conditional Random Forest regr
---
24: regr.ranger <NA> regr
25: regr.rfsrc Random Forest regr
26: regr.rpart Regression Tree regr
27: regr.rsm Response Surface Model regr
28: regr.rvm Relevance Vector Machine regr
4 variables not shown: [feature_types, packages, properties, predict_types]
For other learners we need to convert categorical features into numerical ones through encoding, otherwise the training process will fail. The popular eXtreme Gradient Boosting learner (Chen and Guestrin 2016), for example, does not handle categorical features:
xgboost = lrn("regr.xgboost", nrounds = 100)
xgboost$train(ames_task)
Error: <TaskRegr:ames> has the following unsupported feature types: factor
Categorical features can be distinguished by their cardinality, which refers to the number of levels they contain. There are three types of categorical features: binary, low-cardinality, and high-cardinality.
Generally, no universal threshold exists for when a feature should be considered high-cardinality.
In the EDA section we see the number of levels for each categorical feature. For the Ames housing data, we consider Exterior_1st, Exterior_2nd, MS_SubClass and Neighborhood to be high-cardinality, assuming a threshold of 10. Generally, this threshold can be considered a hyperparameter and can be tuned jointly with all other hyperparameters.
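For example, the level counts can be inspected directly on the raw data (a quick sketch):
# Number of levels per categorical feature; features above the assumed
# threshold of 10 are treated as high-cardinality below.
n_levels = sapply(Filter(is.factor, ames), nlevels)
sort(n_levels[n_levels > 10], decreasing = TRUE)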
Some learners that support categorical features out of the box can still crash for high-cardinality features, since they internally apply encodings that are only suitable for low-cardinality features, such as one-hot encoding.
Low-cardinality features can be handled by one-hot encoding. One-hot encoding is the process of converting categorical features into a binary representation, where each possible category is represented as a separate binary feature. Theoretically, it is sufficient to create one binary feature less than the number of levels, as setting all binary features to zero is also a valid representation. This is typically called dummy or treatment encoding and is required if the learner is a generalized linear model (GLM) or generalized additive model (GAM).
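To illustrate the difference between the two encodings, here is a small sketch on a toy task that mimics the Paved_Drive feature (the data is made up for illustration):
library(mlr3)
library(mlr3pipelines)
toy = as_task_regr(data.frame(
  y = rnorm(6),
  drive = factor(c("Dirt_Gravel", "Paved", "Paved",
    "Partial_Pavement", "Dirt_Gravel", "Paved"))
), target = "y")
# one-hot: one binary column per level (three columns here)
po("encode", method = "one-hot")$train(list(toy))[[1]]$head()
# treatment: one column less; the first level acts as reference
po("encode", method = "treatment")$train(list(toy))[[1]]$head()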
For high-cardinality features, impact encoding is a reasonable approach.
Impact encoding is a type of encoding that converts categorical features into numeric values based on the impact of the feature on the target. The idea behind impact encoding is to use the target feature to create a mapping between the categorical feature and a numerical value that reflects its importance in predicting the target feature. Impact encoding involves the following steps:
- Group the target variable by the categorical feature.
- Compute the mean of the target variable for each group.
- Compute the global mean of the target variable.
- Compute the impact score for each group as the difference between the mean of the target variable for the group and the global mean of the target variable.
- Replace the categorical feature with the impact scores.
By using impact encoding, we can preserve the information of the categorical feature while also creating a numerical representation that reflects its importance in predicting the target. The main advantage compared to one-hot encoding is that only a single numeric feature is created regardless of the number of levels of the categorical feature. Hence, it is especially useful for high-cardinality features. Since information from the target is now used to compute the impact scores, it is crucial that the encoding process is embedded in the cross-validation process to avoid label leakage.
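As an illustration only (not part of the benchmark below), the steps above can be computed by hand with data.table for the Neighborhood feature; the po("encodeimpact") operator used later does this internally, with additional smoothing, and is fitted within each resampling fold:
library(data.table)
dt = as.data.table(ames)
# global mean of the target
global_mean = mean(dt$Sale_Price)
# per-level impact: group mean minus global mean
impact = dt[, .(impact = mean(Sale_Price) - global_mean), by = Neighborhood]
head(impact)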
The binary features Alley, Central_Air and Street can simply be encoded as 1 or 0, which loses no information.
First we collapse all levels that occur less than 1% of the time and build a simple pipeline that encodes categorical features based on their cardinality using selector_cardinality_greater_than().
factor_pipeline =
  # remove constant features (safety step)
  po("removeconstants") %>>%
  # collapse levels that occur in less than 1% of observations
  po("collapsefactors", no_collapse_above_prevalence = 0.01) %>>%
  # impact encoding for high-cardinality factors (more than 10 levels)
  po("encodeimpact", affect_columns = selector_cardinality_greater_than(10),
    id = "high_card_enc") %>>%
  # one-hot encoding for the remaining low-cardinality factors
  po("encode", method = "one-hot",
    affect_columns = selector_cardinality_greater_than(2),
    id = "low_card_enc") %>>%
  # treatment encoding for the remaining binary factors
  po("encode", method = "treatment", affect_columns = selector_type("factor"),
    id = "binary_enc")
factor_pipeline$plot()
While many implementations of tree-based algorithms, such as rpart or ranger, are able to handle categorical features, xgboost cannot and requires a preprocessing step.
xgboost_impact = GraphLearner$new(
factor_pipeline %>>% xgboost,
id = "regr.xgboost_impact"
)
xgboost_one_hot = GraphLearner$new(
po("encode") %>>% xgboost,
id = "regr.xgboost_one_hot"
)
learners = list(
baseline = baseline_lrn,
tree = lrn("regr.rpart"),
xgboost_impact = xgboost_impact,
xgboost_one_hot = xgboost_one_hot
)
design = benchmark_grid(ames_task, learners = learners, cv10)
bmr = benchmark(design)
bmr$aggregate(measures = measure)[, .(learner_id, regr.mae)]
learner_id regr.mae
1: regr.featureless 56081.57
2: regr.rpart 27393.71
3: regr.xgboost_impact 15494.41
4: regr.xgboost_one_hot 15638.63
We see that the gradient boosted trees with impact encoding result in the best model, even though the improvement over one-hot encoding is quite small.
For further reading and a benchmark study on different encoding strategies we refer the reader to Pargent et al. (2022).
7.4 Missing Values
In the EDA section we can see that most features contain no missing values, while 7 features contain a substantial amount of missing values.
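A quick way to verify this on the raw data (a sketch):
# Count missing values per feature and show only the affected ones.
n_missing = sapply(ames, function(x) sum(is.na(x)))
sort(n_missing[n_missing > 0], decreasing = TRUE)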
While both rpart and xgboost from the previous section can handle missing values automatically, ranger and lm, which we would like to consider next, do not.
For simple imputation techniques, missing values can be replaced with the mean, median, mode, or a sample from the empirical distribution of the feature. The po("imputehist") operator computes the histogram of a feature and samples from it to impute missing values. Constant imputation with the mean of a feature can be done by po("imputemean"), with the median by po("imputemedian") and with the mode by po("imputemode"). For categorical features, missing values can easily be replaced by a new separate level, e.g. called .MISSING. The information whether an observation contained a missing value in a feature might itself be meaningful for the model, so we keep track of the imputation by adding binary indicator features that are 1 if the feature was missing for an observation and 0 if it was present.
Ding and Simonoff (2010) show that for binary classification and tree-based models a "separate class is clearly the best method to use when the testing set has missing values and the missingness is related to the response variable". For numeric features this means that encoding missing values out-of-range, e.g., as two times the largest observed value, is a reasonable approach.
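As a small illustration (a sketch on made-up data), po("imputeoor") performs exactly this kind of out-of-range imputation for numeric features and adds a new level for factors:
library(mlr3)
library(mlr3pipelines)
toy = as_task_regr(data.frame(
  y = c(1, 2, 3, 4),
  x = c(10, 20, NA, 40)
), target = "y")
# The NA in x is replaced by a value outside the observed range of x.
po("imputeoor")$train(list(toy))[[1]]$data()
We now combine histogram imputation and missingness indicators for the integer features with out-of-range imputation for the factor features in a single pipeline: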
impute_histogram = list(
  # add binary indicator columns for integer features with missing values
  po("missind",
    type = "integer",
    affect_columns = selector_type("integer")
  ),
  # impute integer features by sampling from their histogram
  po("imputehist",
    affect_columns = selector_type("integer")
  )
) %>>%
  # combine the indicator columns with the imputed features
  po("featureunion") %>>%
  # impute factors with a new (out-of-range) level
  po("imputeoor",
    affect_columns = selector_type("factor")
  )
impute_histogram$plot()
ranger_impute_histogram = GraphLearner$new(
impute_histogram %>>%
lrn("regr.ranger"),
id = "regr.ranger_imp_histogram"
)
ranger_impute_oor = GraphLearner$new(
po("imputeoor") %>>%
lrn("regr.ranger"),
id = "regr.ranger_imp_oor"
)
design = benchmark_grid(ames_task,
learners = list(
ranger_impute_histogram,
ranger_impute_oor
),
cv10
)
bmr_new = benchmark(design)
bmr$combine(bmr_new)
bmr$aggregate(measures = measure)[, .(learner_id, regr.mae)]
learner_id regr.mae
1: regr.featureless 56081.57
2: regr.rpart 27393.71
3: regr.xgboost_impact 15494.41
4: regr.xgboost_one_hot 15638.63
5: regr.ranger_imp_histogram 15845.71
6: regr.ranger_imp_oor 15799.53
We see that the out-of-range imputation worked slightly better for the random forest, but the difference is again very small. Overall, our gradient tree boosting with impact encoding is still the best model so far.
Many more advanced imputation strategies exist. For example, in model-based imputation, additional machine learning models are trained to predict missing values from other features. Multiple imputation resamples the data and imputes each value multiple times to attain more robust estimates. These more advanced techniques rarely improve the model substantially and the simple imputation techniques introduced before are usually sufficient.
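For example, mlr3pipelines offers po("imputelearner") for model-based imputation; a sketch (not included in the benchmark) could look like this:
# Model-based imputation: a regression tree predicts missing integer values
# from the other features (rpart itself tolerates missing predictors).
impute_model = po("imputelearner",
  lrn("regr.rpart"),
  affect_columns = selector_type("integer")
)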
7.5 Pipeline Robustify
mlr3pipelines offers a simple and reusable pipeline for (among other things) imputation and factor encoding called pipeline_robustify.
robustify = mlr3pipelines::pipeline_robustify()
Generally, this is a very sensible default that should be used most of the time instead of constructing the preprocessing manually as in the previous two sections.
robustify$plot()
The robustify pipeline performs the following steps:
- Constant features are removed as they do not contain any information.
- Character features are cast to categorical features as many learners require factors.
- Date/time features are encoded as numeric features as most learners cannot handle these types.
- Ordinal features are encoded as categorical features as most learners do not support ordinal features beyond treating them as categoricals.
- Numeric features are imputed by sampling from the empirical distribution and missingness indicators are added, a reasonable imputation strategy as discussed above.
- Missing values of categorical features are encoded with a new level.
- Factor levels of categorical features are fixed such that they are the same during prediction as during training, possibly dropping empty training factor levels beforehand.
- Since the previous step can introduce new missing values at prediction time due to factor levels unseen during training, these are imputed with random factor levels.
- Categorical feature levels are collapsed (starting from the rarest factors in the training data) until there are no more than 1000 levels, in order to handle high-cardinality categorical features without impact encoding.
- Categorical features are encoded by one-hot encoding to retain all information.
- Constant features that might have been created in the previous steps are removed again.
lm_preproc = GraphLearner$new(
robustify %>>%
lrn("regr.lm"),
id = "regr.lm_preproc"
)
design = benchmark_grid(ames_task,
learners = lm_preproc,
cv10
)
bmr_new = benchmark(design)
bmr$combine(bmr_new)
bmr$aggregate(measures = measure)[, .(learner_id, regr.mae)]
learner_id regr.mae
1: regr.featureless 56081.57
2: regr.rpart 27393.71
3: regr.xgboost_impact 15494.41
4: regr.xgboost_one_hot 15638.63
5: regr.ranger_imp_histogram 15845.71
6: regr.ranger_imp_oor 15799.53
7: regr.lm_preproc 15684.59
We see that while gradient boosting is still the best model, a simple linear model with the robustify preprocessor is not much worse.
7.6 Scaling Features and Targets
Simple transformations of features and the target can be beneficial for certain learners.
Log transformation can help in making the distribution of the target more symmetrical. This is typically useful when the data is skewed, as some machine learning algorithms assume a normal distribution of the target. Thus, log transformation can help reduce the impact of outliers and improve the accuracy of the model. Similarly, log transformation of skewed features can help to reduce the influence of outliers and very large values.
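A quick look at the log-transformed target (a sketch) shows a much more symmetric distribution:
# Histogram of the log-transformed sale price.
ggplot(ames, aes(x = log(Sale_Price))) +
  geom_histogram()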
For distance-based methods such as k-nearest neighbor models or regularized parametric models such as the lasso or elastic net, normalization is required, as otherwise features with a larger scale would have a higher impact. Luckily, most implementations internally scale the data if required by the algorithm, so most of the time we do not need to do this manually in preprocessing.
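One way to build such a graph (a sketch, assuming we reuse the robustify graph from the previous section and only log-transform the target) is via ppl("targettrafo"), which applies the transformation before fitting and maps predictions back to the original scale with exp():
# Sketch (assumed construction): log-transform the target before fitting,
# back-transform the predictions; features are handled by the robustify graph.
log_lm_preproc = ppl("targettrafo",
  graph = robustify %>>% lrn("regr.lm")
)
log_lm_preproc$param_set$values$targetmutate.trafo =
  function(x) log(x)
log_lm_preproc$param_set$values$targetmutate.inverter =
  function(x) list(response = exp(x$response))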
log_lm_preproc$plot()
log_lm_preproc = GraphLearner$new(
log_lm_preproc,
id = "regr.log_lm_preproc"
)
design = benchmark_grid(ames_task,
learners = log_lm_preproc,
cv10
)
bmr_new = benchmark(design)
bmr$combine(bmr_new)
bmr$aggregate(measures = measure)[, .(learner_id, regr.mae)]
learner_id regr.mae
1: regr.featureless 56081.57
2: regr.rpart 27393.71
3: regr.xgboost_impact 15494.41
4: regr.xgboost_one_hot 15638.63
5: regr.ranger_imp_histogram 15845.71
6: regr.ranger_imp_oor 15799.53
7: regr.lm_preproc 15684.59
8: regr.log_lm_preproc 15198.30
We see that with the target transformation a simple linear regression is the best model we have seen so far.
7.7 Feature Extraction
The quality of the kitchen appliances might give us additional information about the sale price. Unfortunately, there is no feature in the data that represents this information. Luckily, we have a second dataset of the power consumption in the kitchen. Each row of the dataset represents one house and each feature is the power consumption at a given time. The consumption is measured in 2-minute intervals, which results in 720 features. The curves are the sum of the power consumption of multiple small and large kitchen appliances, taken from Bagnall et al. (2017).
plot(as.numeric(energy_data[1, ]), type = "l",
  ylab = "Power Consumption",
  xlab = "2-Minute Interval")
Just adding these 720 features to our data is probably not a good idea, as each feature by itself does not provide any meaningful information. Instead, we can extract information about the curves to gain insights into the kitchen's overall energy usage. For example, we can look at the maximum wattage used, the overall wattage used, the number of peaks, and similar features. This is what is typically called feature extraction.
To extract some features we write our own simple PipeOp that inherits from PipeOpTaskPreprocSimple as shown in Chapter 6. The operator is quite simple, as we hardcode all operations. First we extract the functional features and save them in a new data.table called ffeats, then apply our extractors mean, min, max and var, which are appended to the data. Finally, the original functional features are removed as we do not need them anymore.
library(R6)
PipeOpFuncExtract = R6Class("PipeOpFuncExtract",
  inherit = mlr3pipelines::PipeOpTaskPreprocSimple,
  private = list(
    .transform_dt = function(dt, levels) {
      # names of the 720 functional (power consumption) features
      ffeat_names = paste0("att", 1:720)
      ffeats = dt[, ..ffeat_names]
      # row-wise summary statistics of each consumption curve
      dt[, energy_means := apply(ffeats, 1, mean)]
      dt[, energy_mins := apply(ffeats, 1, min)]
      dt[, energy_maxs := apply(ffeats, 1, max)]
      dt[, energy_vars := apply(ffeats, 1, var)]
      # drop the original functional features
      dt[, (ffeat_names) := NULL]
      return(dt)
    }
  )
)
Finally, we rerun our benchmark with all candidate models to see how much each learner improves with the feature extraction. First, we create the extended task and apply the extractor.
ames_task_ext = cbind(ames, energy_data)
ames_task_ext = TaskRegr$new(
backend = ames_task_ext,
target = "Sale_Price",
id = "ames_extended"
)
ames_task_ext$select(setdiff(ames_task_ext$feature_names, to_remove))
func_extractor = PipeOpFuncExtract$new("energy_extract")
ames_task_ext = func_extractor$train(list(ames_task_ext))[[1]]
ames_task_ext$head()[, .(energy_means,
energy_mins,
energy_maxs,
energy_vars)]
energy_means energy_mins energy_maxs energy_vars
1: 1.0615583 0.0142683350 21.97755 3.708473
2: 0.5021107 0.0213241350 26.30673 2.704132
3: 0.7915833 0.0089580140 19.72245 3.204720
4: 1.0222786 0.0126302520 21.93370 4.071867
5: 0.6051893 0.0588933810 21.55263 3.550878
6: 0.7254133 0.0009512968 23.22744 3.496427
Since this is a deterministic transformation of each observation separately, we do not need to add the PipeOp to each learner; instead, we can simply apply it once to the whole dataset to create the new features.
learners = list(
baseline = baseline_lrn,
tree = lrn("regr.rpart"),
xgboost_impact = xgboost_impact,
ranger_impute_oor = ranger_impute_oor,
lm_preproc = lm_preproc,
log_lm_preproc = log_lm_preproc
)
design = benchmark_grid(list(ames_task_ext, ames_task),
learners = learners,
cv10
)
bmr_final = benchmark(design)
bmr_final$aggregate(measures = measure)
nr task_id learner_id resampling_id iters regr.mae
1: 1 ames_extended regr.featureless cv 10 56081.57
2: 2 ames_extended regr.rpart cv 10 26634.54
3: 3 ames_extended regr.xgboost_impact cv 10 14042.64
4: 4 ames_extended regr.ranger_imp_oor cv 10 13784.84
5: 5 ames_extended regr.lm_preproc cv 10 14492.94
6: 6 ames_extended regr.log_lm_preproc cv 10 13570.04
7: 7 ames regr.featureless cv 10 56081.57
8: 8 ames regr.rpart cv 10 27393.71
9: 9 ames regr.xgboost_impact cv 10 15494.41
10: 10 ames regr.ranger_imp_oor cv 10 15810.88
11: 11 ames regr.lm_preproc cv 10 15719.98
12: 12 ames regr.log_lm_preproc cv 10 15210.84
Hidden columns: resample_result
We observe that feature extraction helped all models, except the featureless baseline for obvious reasons.