9.1 House Price Prediction in King County
We use the house_sales_prediction dataset contained in this book to provide a use case for applying mlr3 to real-world data.
9.1.1 Exploratory Data Analysis
To get a quick impression of our data, we perform some initial exploratory data analysis. This gives us a feeling for the dataset and may suggest additional features that can help with predicting the house prices.
We can get a quick overview using R’s summary function:
summary(house_sales_prediction)
##        id                 date               price            bedrooms
##  Min.   :1000102     Length:21613       Min.   :  75000   Min.   : 0.00
##  1st Qu.:2123049194  Class :character   1st Qu.: 321950   1st Qu.: 3.00
##  Median :3904930410  Mode  :character   Median : 450000   Median : 3.00
##  Mean   :4580301520                     Mean   : 540088   Mean   : 3.37
##  3rd Qu.:7308900445                     3rd Qu.: 645000   3rd Qu.: 4.00
##  Max.   :9900000190                     Max.   :7700000   Max.   :33.00
##    bathrooms     sqft_living      sqft_lot          floors
##  Min.   :0.00   Min.   :  290   Min.   :    520   Min.   :1.00
##  1st Qu.:1.75   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.00
##  Median :2.25   Median : 1910   Median :   7618   Median :1.50
##  Mean   :2.12   Mean   : 2080   Mean   :  15107   Mean   :1.49
##  3rd Qu.:2.50   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.00
##  Max.   :8.00   Max.   :13540   Max.   :1651359   Max.   :3.50
##    waterfront          view         condition        grade         sqft_above
##  Min.   :0.0000   Min.   :0.000   Min.   :1.00   Min.   : 1.00   Min.   : 290
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:3.00   1st Qu.: 7.00   1st Qu.:1190
##  Median :0.0000   Median :0.000   Median :3.00   Median : 7.00   Median :1560
##  Mean   :0.0075   Mean   :0.234   Mean   :3.41   Mean   : 7.66   Mean   :1788
##  3rd Qu.:0.0000   3rd Qu.:0.000   3rd Qu.:4.00   3rd Qu.: 8.00   3rd Qu.:2210
##  Max.   :1.0000   Max.   :4.000   Max.   :5.00   Max.   :13.00   Max.   :9410
##  sqft_basement     yr_built     yr_renovated       zipcode           lat
##  Min.   :   0   Min.   :1900   Min.   :   0.0   Min.   :98001   Min.   :47.2
##  1st Qu.:   0   1st Qu.:1951   1st Qu.:   0.0   1st Qu.:98033   1st Qu.:47.5
##  Median :   0   Median :1975   Median :   0.0   Median :98065   Median :47.6
##  Mean   : 292   Mean   :1971   Mean   :  84.4   Mean   :98078   Mean   :47.6
##  3rd Qu.: 560   3rd Qu.:1997   3rd Qu.:   0.0   3rd Qu.:98118   3rd Qu.:47.7
##  Max.   :4820   Max.   :2015   Max.   :2015.0   Max.   :98199   Max.   :47.8
##       long       sqft_living15    sqft_lot15
##  Min.   :-123   Min.   : 399   Min.   :   651
##  1st Qu.:-122   1st Qu.:1490   1st Qu.:  5100
##  Median :-122   Median :1840   Median :  7620
##  Mean   :-122   Mean   :1987   Mean   : 12768
##  3rd Qu.:-122   3rd Qu.:2360   3rd Qu.: 10083
##  Max.   :-121   Max.   :6210   Max.   :871200

dim(house_sales_prediction)
## [1] 21613    21
Our dataset has 21613 observations and 21 columns.
The variable we want to predict is the price.
In addition to the price column, we have several other columns:
id: A unique identifier for every house.
date: A date column, indicating when the house was sold. This column is currently not encoded as a date and requires some preprocessing.
zipcode: A column indicating the ZIP code. This is a categorical variable with many factor levels.
long, lat: The longitude and latitude of the house.
...several other numeric columns providing information about the house, such as the number of rooms, square footage, etc.
Before we continue with the analysis, we preprocess some features so that they are stored in the correct format.
First we convert the date column to numeric so that it can be treated as a numeric feature. Afterwards, we convert the zip code to a factor and add a new column, renovated, indicating whether a house was renovated at some point. Additionally, we convert the price from dollars to units of 1000 dollars to improve readability.
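A minimal sketch of these preprocessing steps using data.table syntax; the exact code is an assumption, including the date format ("20141013T000000", as in the original King County data):

library(data.table)
setDT(house_sales_prediction)  # ensure we can use data.table syntax

# convert the date string to a numeric feature (days since 1970-01-01)
house_sales_prediction[, date := as.numeric(as.Date(substr(date, 1, 8), "%Y%m%d"))]

# encode the zip code as a factor
house_sales_prediction[, zipcode := as.factor(zipcode)]

# indicator whether the house was renovated at some point
house_sales_prediction[, renovated := as.numeric(yr_renovated > 0)]

# convert the price to units of 1000 dollars
house_sales_prediction[, price := price / 1000]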
We can now plot the density of the price to get a first impression of its distribution.
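A possible sketch of this plot using ggplot2:

library(ggplot2)

# density of the (rescaled) sale price
ggplot(house_sales_prediction, aes(x = price)) +
  geom_density() +
  labs(x = "Price (in 1000 dollars)", y = "Density")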
We can see that the prices for most houses lie between 75,000 and 1.5 million dollars. There are a few extreme values of up to 7.7 million dollars.
Feature engineering allows us to incorporate additional knowledge about the data and the underlying processes, which can often greatly enhance predictive performance.
A simple example: a house with yr_renovated == 0 has not been renovated yet.
Additionally, we want to drop features which should not have any influence on the prediction (such as the id column).
After those initial manipulations, we load all required packages and create a Task containing our data.
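A sketch of this step; the task id "sales" matches the resampling output shown later, the rest is an assumption about how the task is set up:

library(mlr3)
library(mlr3viz)   # provides autoplot() for tasks
library(ggplot2)

# drop the identifier, which should not influence the prediction
house_sales_prediction[, id := NULL]

# create a regression task with price as the target
tsk = TaskRegr$new(id = "sales", backend = house_sales_prediction, target = "price")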
We can inspect associations between variables using the autoplot function in order to get a good first impression of our data.
Note that this in no way prevents us from using other plotting functions of our choice on the original data.
Distribution of the price
The outcome we want to predict is the price variable.
The autoplot function provides a good first glimpse of our data.
As the resulting object is a
ggplot2 object, we can use
faceting and other functions from ggplot2 in order to enhance plots.
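A sketch of how such plots could be produced; autoplot() visualizes the target of a regression task by default, and for the comparisons below we fall back to plain ggplot2 on the task's data (one possible way, not the exact original code):

library(mlr3viz)
library(ggplot2)

# default autoplot for a regression task: distribution of the target
autoplot(tsk)

# compare prices across features using the task's data directly
d = tsk$data()

ggplot(d, aes(x = factor(renovated), y = price)) +
  geom_boxplot() +
  labs(x = "Renovated", y = "Price (in 1000 dollars)")

ggplot(d, aes(x = factor(condition), y = price)) +
  geom_boxplot() +
  labs(x = "Condition", y = "Price (in 1000 dollars)")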
We can observe that renovated houses seem to achieve higher sale prices, so this might be a relevant feature.
Additionally, we can for example look at the condition of the house. Again, we can clearly see that the price rises with increasing condition.
Association between variables
In addition to the association with the target variable, the association between the features can also lead to interesting insights.
We investigate this using variables associated with the quality and size of the house.
Note that we clone the task and use $select() to keep only a subset of the features for the autoplot function, as autoplot by default uses all features. The task is cloned before selecting features in order to keep the original task intact.
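A sketch of such a pairs-style plot over a few size- and quality-related features; the particular feature subset is an assumption:

library(mlr3viz)
library(GGally)  # autoplot(..., type = "pairs") builds on GGally

# clone the task so the original stays intact, then keep only a few features
tsk_sub = tsk$clone()$select(c("grade", "condition", "sqft_living", "sqft_above"))
autoplot(tsk_sub, type = "pairs")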
9.1.2 Splitting into train and test data
In mlr3, we do not create separate train and test data sets, but instead keep a vector of train and test indices.
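One way to create such index vectors; the 70/30 split ratio and the seed are assumptions, not fixed by the text:

set.seed(4411)  # hypothetical seed for reproducibility

n = tsk$nrow
train.idx = sample(seq_len(n), size = floor(0.7 * n))
test.idx = setdiff(seq_len(n), train.idx)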
9.1.3 A first model: Decision Tree
Decision trees can not only be used as a powerful tool for predictive modeling, but also for exploratory data analysis.
In order to fit a decision tree, we first get the regr.rpart learner from the mlr_learners dictionary by using the sugar function lrn().
For now we leave out the zipcode variable, as we also have the longitude and latitude of each house.
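A sketch of fitting and visualizing the tree; dropping zipcode via $select() and plotting with rpart.plot are assumptions about how this was done:

library(rpart.plot)  # one option for drawing the fitted tree

lrn_rpart = lrn("regr.rpart")

# drop zipcode for this first model
tsk_nozip = tsk$clone()$select(setdiff(tsk$feature_names, "zipcode"))

# train on the training indices only and plot the resulting tree
lrn_rpart$train(tsk_nozip, row_ids = train.idx)
rpart.plot(lrn_rpart$model)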
The learned tree relies on several variables in order to distinguish between cheaper and pricier houses. The features we split on are grade and sqft_living, but also some features related to the location (longitude and latitude).
We can visualize the price across different regions in order to get more insight:
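A possible sketch of these two plots, mapping longitude and latitude and coloring the points by price and by zip code respectively:

library(ggplot2)

d = tsk$data()

# price across regions
ggplot(d, aes(x = long, y = lat, color = price)) +
  geom_point(alpha = 0.3) +
  labs(x = "Longitude", y = "Latitude", color = "Price")

# zip codes across regions
ggplot(d, aes(x = long, y = lat, color = zipcode)) +
  geom_point(alpha = 0.3) +
  theme(legend.position = "none") +
  labs(x = "Longitude", y = "Latitude")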
Comparing the two plots, we can see that the price is clearly associated with the zipcode. As a result, we might indeed want to use the zipcode column in our future endeavours.
9.1.4 A first baseline: Decision Tree
After getting an initial idea of our data, we might want to construct a first baseline, in order to see what a simple model can already achieve.
We use 3-fold cross-validation on our training data in order to get a reliable estimate of the algorithm’s performance on future data.
Before we start defining and training learners, we create a Resampling in order to make sure that we always compare on exactly the same data.
For the cross-validation we only use the training data by cloning the task and selecting only observations from the training set.
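A minimal sketch of this step; instantiating the resampling on the training task fixes the folds, so that all learners are evaluated on identical splits:

# 3-fold cross-validation, reused for all subsequent comparisons
cv3 = rsmp("cv", folds = 3)
cv3$instantiate(tsk$clone()$filter(train.idx))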
lrn_rpart = lrn("regr.rpart")
res = resample(task = tsk$clone()$filter(train.idx), lrn_rpart, cv3)
res$score(msr("regr.mse"))
##          task task_id            learner learner_id     resampling
## 1: <TaskRegr>   sales <LearnerRegrRpart> regr.rpart <ResamplingCV>
## 2: <TaskRegr>   sales <LearnerRegrRpart> regr.rpart <ResamplingCV>
## 3: <TaskRegr>   sales <LearnerRegrRpart> regr.rpart <ResamplingCV>
##    resampling_id iteration prediction regr.mse
## 1:            cv         1     <list>    45416
## 2:            cv         2     <list>    49599
## 3:            cv         3     <list>    42518

sprintf("RMSE of the simple rpart: %s", round(sqrt(res$aggregate()), 2))
## [1] "RMSE of the simple rpart: 214.11"
9.1.5 Many Trees: Random Forest
We might be able to improve upon the RMSE using more powerful learners.
We first load the mlr3learners package, which contains the regr.ranger learner (based on the ranger package, an implementation of the Random Forest algorithm).
lrn_ranger = lrn("regr.ranger", num.trees = 15L)
res = resample(task = tsk$clone()$filter(train.idx), lrn_ranger, cv3)
res$score(msr("regr.mse"))
##          task task_id             learner  learner_id     resampling
## 1: <TaskRegr>   sales <LearnerRegrRanger> regr.ranger <ResamplingCV>
## 2: <TaskRegr>   sales <LearnerRegrRanger> regr.ranger <ResamplingCV>
## 3: <TaskRegr>   sales <LearnerRegrRanger> regr.ranger <ResamplingCV>
##    resampling_id iteration prediction regr.mse
## 1:            cv         1     <list>    23837
## 2:            cv         2     <list>    22343
## 3:            cv         3     <list>    18913

sprintf("RMSE of the simple ranger: %s", round(sqrt(res$aggregate()), 2))
## [1] "RMSE of the simple ranger: 147.3"
Tuning random forests often does not increase predictive performance substantially. If time permits, it can nonetheless lead to improvements and should thus be performed. In this case, we instead tune a different kind of model: Gradient Boosted Decision Trees from the package xgboost.
9.1.6 A better baseline: AutoTuner
Tuning can often improve the performance further. In this case, we tune the xgboost learner to see whether this holds here as well.
For the AutoTuner, we have to specify a termination criterion (how long the tuning should run), a Tuner (which tuning method to use), and a ParamSet (the space we want to search through).
For now we do not use the zipcode column, as xgboost cannot naturally deal with categorical features.
When resampled, the AutoTuner automatically performs nested cross-validation.
set.seed(444L)
library(mlr3tuning)
library(paradox)

lrn_xgb = lrn("regr.xgboost")

# Define the ParamSet
ps = ParamSet$new(
  params = list(
    ParamDbl$new(id = "eta", lower = 0.2, upper = .4),
    ParamDbl$new(id = "min_child_weight", lower = 1, upper = 20),
    ParamDbl$new(id = "subsample", lower = .7, upper = .8),
    ParamDbl$new(id = "colsample_bytree", lower = .9, upper = 1),
    ParamDbl$new(id = "colsample_bylevel", lower = .5, upper = .7),
    ParamInt$new(id = "nrounds", lower = 1L, upper = 25)
  ))

# Define the Terminator
terminator = TerminatorEvaluations$new(10)

cv3 = rsmp("cv", folds = 3)
at = AutoTuner$new(lrn_xgb, cv3, measures = msr("regr.mse"), ps, terminator,
  tuner = TunerRandomSearch, tuner_settings = list())
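The AutoTuner can then be resampled like any other learner. The following is a sketch of how this evaluation could look, mirroring the earlier resample calls; the names tsk_xgb and outer_cv3 are ours, not fixed by the text:

# xgboost cannot handle the zipcode factor, so we drop it for this learner
tsk_xgb = tsk$clone()$select(setdiff(tsk$feature_names, "zipcode"))

# an outer resampling around the AutoTuner yields nested cross-validation;
# store_models = TRUE keeps the fitted AutoTuners so their tuning results can be inspected
outer_cv3 = rsmp("cv", folds = 3)
res = resample(task = tsk_xgb$clone()$filter(train.idx), at, outer_cv3, store_models = TRUE)

sprintf("RMSE of the tuned xgboost: %s", round(sqrt(res$aggregate()), 2))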
We can obtain the resulting hyperparameters in the respective splits by accessing the models stored in the resample result.
NOTE: To keep runtime low, we only tune parts of the hyperparameter space of xgboost in this example. Additionally, we only allow for \(10\) random search iterations, which is usually too little for real-world applications. Nonetheless, we are able to obtain an improved performance compared to the ranger model.
In order to further improve our results we have several options:
- Find or engineer better features
- Remove Features to avoid overfitting
- Obtain additional data (often prohibitive)
- Try more models
- Improve the tuning
- Increase the tuning budget
- Enlarge the tuning search space
- Use a more efficient tuning algorithm
- Stacking and Ensembles (see Pipelines)
Below we explore some of those possibilities and investigate whether they improve performance.
9.1.7 Engineering Features: Mutating ZIP-Codes
In order to better cluster the zip codes, we compute a new feature, med_price: the median price within each zip code. This might help our model improve the prediction.
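One possible way to construct this feature; computing the medians only on the training rows is our choice, to limit leakage from the test set:

library(data.table)

# median price per zip code, computed on the training observations
# (zip codes absent from the training rows would get NA here)
data = tsk$data()
med_prices = data[train.idx, .(med_price = median(price)), by = zipcode]

# join the medians back by zip code without changing the row order
data[med_prices, med_price := i.med_price, on = "zipcode"]

# rebuild the task with the additional feature
tsk = TaskRegr$new(id = "sales", backend = data, target = "price")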
Again, we run
resample and compute the RMSE.
lrn_ranger = lrn("regr.ranger")
res = resample(task = tsk$clone()$filter(train.idx), lrn_ranger, cv3)
res$score(msr("regr.mse"))
##          task task_id             learner  learner_id     resampling
## 1: <TaskRegr>   sales <LearnerRegrRanger> regr.ranger <ResamplingCV>
## 2: <TaskRegr>   sales <LearnerRegrRanger> regr.ranger <ResamplingCV>
## 3: <TaskRegr>   sales <LearnerRegrRanger> regr.ranger <ResamplingCV>
##    resampling_id iteration prediction regr.mse
## 1:            cv         1     <list>    22704
## 2:            cv         2     <list>    20376
## 3:            cv         3     <list>    17783

sprintf("RMSE of ranger with med_price: %s", round(sqrt(res$aggregate()), 2))
## [1] "RMSE of ranger with med_price: 142.43"
9.1.8 Obtaining a sparser model
In many cases, we might want to have a sparse model.
For this purpose we can use a Filter implemented in mlr3filters.
This can prevent our learner from overfitting and make it easier for humans to interpret models, as fewer variables influence the resulting prediction.
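A sketch of one way to obtain such a reduced task (called tsk_ftsel below); the choice of the importance filter is an assumption, and any other Filter from mlr3filters could be used in the same way:

library(mlr3filters)

# model-based importance filter (impurity importance from a random forest)
filter = flt("importance", learner = lrn("regr.ranger", importance = "impurity"))
filter$calculate(tsk$clone()$filter(train.idx))

# keep the 12 highest-scoring features in a new task (scores are sorted decreasingly)
tsk_ftsel = tsk$clone()$select(head(names(filter$scores), 12))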
The resulting RMSE is slightly higher, and at the same time we only use \(12\) features.
lrn_ranger = lrn("regr.ranger")
res = resample(task = tsk_ftsel$clone()$filter(train.idx), lrn_ranger, cv3)
res$score(msr("regr.mse"))
##          task task_id             learner  learner_id     resampling
## 1: <TaskRegr>   sales <LearnerRegrRanger> regr.ranger <ResamplingCV>
## 2: <TaskRegr>   sales <LearnerRegrRanger> regr.ranger <ResamplingCV>
## 3: <TaskRegr>   sales <LearnerRegrRanger> regr.ranger <ResamplingCV>
##    resampling_id iteration prediction regr.mse
## 1:            cv         1     <list>    34703
## 2:            cv         2     <list>    29052
## 3:            cv         3     <list>    25912

sprintf("RMSE of ranger with filtering: %s", round(sqrt(res$aggregate()), 2))
## [1] "RMSE of ranger with filtering: 172.88"