The DALEX package X-rays any predictive model and helps to explore, explain and visualize its behaviour. The package implements a collection of methods for Explanatory Model Analysis. It is based on a unified grammar summarised in Figure 8.1.
In the following sections, we present the methods available in the DALEX package, using a random forest model trained on the FIFA 20 data to predict the worth of football players. We show methods that analyze the model both at the level of a single prediction and at the global level, for the whole data set.
The structure of this chapter is the following:
- In section 8.2.2 we introduce the FIFA 20 dataset and then in section 8.2.3 we train a random regression forest using the ranger package.
- Section 8.2.4 introduces the general logic behind DALEX explainers.
- Section 8.2.5 introduces methods for dataset level model exploration.
- Section 8.2.6 introduces methods for instance-level model exploration.
8.2.2 Read data: FIFA
Examples presented in this chapter are based on data retrieved from the FIFA video game. We use data scraped from the sofifa website; the raw data is available on Kaggle. After some basic data cleaning, the processed data for the top 5000 football players is available in the DALEX package.
##                   value_eur age height_cm nationality attacking_crossing
## L. Messi           95500000  32       170   Argentina                 88
## Cristiano Ronaldo  58500000  34       187    Portugal                 84
For every player, we have 42 features available.
##  5000 42
In the table below we overview these 42 features for three selected players.
One of the features, called value_eur, is the worth of the footballer in euros. In the next section, we build a prediction model that estimates the worth of a player based on the other player characteristics.
| | Lionel Messi | Cristiano Ronaldo | Neymar Junior |
|---|---|---|---|
| value_eur | 95 500 000 | 58 500 000 | 105 500 000 |
To get a more stable model, we remove four variables from the data.
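The data preparation described above might look like the following sketch. It assumes the data set is exported by DALEX as `fifa`. The text does not name the four removed variables, so the sketch assumes they are `nationality`, `overall`, `potential` and `wage_eur`; treat that choice as a guess rather than the authors' exact code.

```r
library("DALEX")

# Assumption: the processed FIFA 20 data ships with DALEX as `fifa`.
fifa <- DALEX::fifa

# Peek at two players and check the size of the data.
print(fifa[1:2, c("value_eur", "age", "height_cm", "nationality", "attacking_crossing")])
print(dim(fifa))

# Assumption: these are the four variables dropped for a more stable model.
dropped <- c("nationality", "overall", "potential", "wage_eur")
fifa[, dropped] <- NULL

# Convert the remaining columns to numeric; `fifa[] <-` keeps the
# row names, which hold the player names.
fifa[] <- lapply(fifa, as.numeric)
print(dim(fifa))
```

After this step the data frame should have 38 columns, which agrees with the number of columns reported when the explainer is created later in the chapter.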
8.2.3 Train a model: Ranger
The DALEX package works for any model regardless of its internal structure. Here we demonstrate it on a random forest model implemented in the ranger package.
We use the mlr3 package to build a predictive model.
First, let’s load the required packages.
Then we can define the regression task: prediction of the value_eur variable.
Finally, we train mlr3’s ranger learner with 250 trees. Note that, for brevity, we do not split the data into training and test sets in this example. The model is built on the whole data.
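The three steps above (load packages, define the task, train the learner) could be written roughly as below. The object names `fifa_task` and `fifa_ranger` are illustrative, and the four dropped columns are an assumption, since the text does not name them.

```r
library("DALEX")
library("mlr3")
library("mlr3learners")  # provides the regr.ranger learner

# Setup. Assumption: these four columns are the ones removed earlier.
fifa <- DALEX::fifa
fifa[, c("nationality", "overall", "potential", "wage_eur")] <- NULL
# Keep all columns numeric so later predictions on modified data
# use consistent column types.
fifa[] <- lapply(fifa, as.numeric)

# Regression task: predict value_eur from the remaining columns.
fifa_task <- TaskRegr$new(id = "fifa", backend = fifa, target = "value_eur")

# Random forest from the ranger package with 250 trees.
fifa_ranger <- lrn("regr.ranger", num.trees = 250)

# Train on the whole data set (no train/test split, as noted above).
fifa_ranger$train(fifa_task)
print(fifa_ranger)
```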
## <LearnerRegrRanger:regr.ranger>
## * Model: ranger
## * Parameters: num.trees=250
## * Packages: ranger
## * Predict Type: response
## * Feature types: logical, integer, numeric, character, factor, ordered
## * Properties: importance, oob_error, weights
8.2.4 The general workflow
Working with explanations in the DALEX package always consists of three steps schematically shown in the pipe below.
model %>%
  explain_mlr3(data = ..., y = ..., label = ...) %>%
  model_parts() %>%
  plot()
All functions in the DALEX package can work with models of any structure. This is possible because in the first step we create an adapter that allows the downstream functions to access the model in a consistent fashion. In general, such an adapter is created with the DALEX::explain() function, but for models created in the mlr3 package it is more convenient to use the DALEXtra::explain_mlr3() function.
Explanations are determined by functions such as DALEX::model_parts(), DALEX::predict_parts() and DALEX::predict_profile(). Each of these functions takes the model adapter as its first argument. The other arguments describe how the function works. We will present them in the following sections.
Explanations can be visualized with the generic function plot() or summarised with the generic function print(). Each explanation is a data frame with an additional class attribute. The plot() function creates graphs using the ggplot2 package, so they can be easily modified with the usual ggplot2 functions.
We show this cascade of functions based on the FIFA example.
To get started with the exploration of the model behaviour we need to create an explainer.
The DALEX::explain() function works for all types of predictive models. The DALEXtra package provides specialized versions for the most common ML frameworks. Among them, the DALEXtra::explain_mlr3() function works for mlr3 models.
This function performs a series of internal checks, so the output is a bit verbose. Set the verbose = FALSE argument to make it less wordy.
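Creating the explainer might look like the sketch below. The setup lines are repeated so that the block runs on its own; the four dropped columns and the object names are assumptions, while the label "Ranger RF" is taken from the output shown in this chapter.

```r
library("DALEX")
library("DALEXtra")
library("mlr3")
library("mlr3learners")

# Setup repeated so the sketch runs on its own.
# Assumption: the text does not name the four dropped columns.
fifa <- DALEX::fifa
fifa[, c("nationality", "overall", "potential", "wage_eur")] <- NULL
fifa[] <- lapply(fifa, as.numeric)
fifa_ranger <- lrn("regr.ranger", num.trees = 250)
fifa_ranger$train(TaskRegr$new(id = "fifa", backend = fifa, target = "value_eur"))

# Wrap the trained learner in an explainer (the model adapter).
# `data` and `y` are used later to compute predictions and residuals.
ranger_exp <- explain_mlr3(
  fifa_ranger,
  data     = fifa,
  y        = fifa$value_eur,
  label    = "Ranger RF",
  colorize = FALSE   # plain-text diagnostics without colour codes
)
```

Passing `verbose = FALSE` to the same call suppresses the diagnostic messages.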
## Preparation of a new explainer is initiated
##   -> model label       : Ranger RF
##   -> data              : 5000 rows 38 cols
##   -> target variable   : 5000 values
##   -> predict function  : yhat.LearnerRegr will be used ( default )
##   -> predicted values  : numerical, min = 482969 , mean = 7471978 , max = 89376367
##   -> model_info        : package mlr3 , ver. 0.3.0 , task regression ( default )
##   -> residual function : difference between y and yhat ( default )
##   -> residuals         : numerical, min = -7815433 , mean = 1309 , max = 18463093
## A new explainer has been created!
8.2.5 Dataset level exploration
##              variable mean_dropout_loss     label
## 1        _full_model_           1389133 Ranger RF
## 2           value_eur           1389133 Ranger RF
## 3 goalkeeping_kicking           1451134 Ranger RF
## 4           height_cm           1458067 Ranger RF
## 5    movement_balance           1458209 Ranger RF
## 6           weight_kg           1461027 Ranger RF
Results can be visualized with the generic plot() function. A chart for all 38 variables would be unreadable, so we use the max_vars argument to limit the number of variables on the plot.
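The variable-importance step discussed above could be reproduced roughly as follows. `model_parts()` computes permutation-based importance, reported as the mean loss after a variable is shuffled. The value `max_vars = 12` is an arbitrary choice for illustration, and the setup (including the four dropped columns) is an assumption repeated so the block runs on its own.

```r
library("DALEX")
library("DALEXtra")
library("mlr3")
library("mlr3learners")

# Setup repeated so the sketch runs on its own.
# Assumption: the text does not name the four dropped columns.
fifa <- DALEX::fifa
fifa[, c("nationality", "overall", "potential", "wage_eur")] <- NULL
fifa[] <- lapply(fifa, as.numeric)
fifa_ranger <- lrn("regr.ranger", num.trees = 250)
fifa_ranger$train(TaskRegr$new(id = "fifa", backend = fifa, target = "value_eur"))
ranger_exp <- explain_mlr3(fifa_ranger, data = fifa, y = fifa$value_eur,
                           label = "Ranger RF", verbose = FALSE)

# Permutation-based variable importance for the whole data set.
fifa_vi <- model_parts(ranger_exp)
print(head(fifa_vi))

# Show only the 12 most important variables (an arbitrary limit).
plot(fifa_vi, max_vars = 12, show_boxplots = FALSE)
```

Because the permutations are random, the exact loss values will differ from run to run unless a seed is set.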
Once we know which variables are most important, we can use Partial Dependence Plots to show how the model's prediction changes, on average, with changes in selected variables. In this example, they show the average relation between particular variables and players’ value.
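A sketch of how such partial dependence profiles might be computed is given below. It uses `model_profile()` and keeps only the aggregated profiles; the four selected variables are the same ones used for the Ceteris Paribus example later in this chapter. As before, the setup and the dropped columns are assumptions.

```r
library("DALEX")
library("DALEXtra")
library("mlr3")
library("mlr3learners")

# Setup repeated so the sketch runs on its own.
# Assumption: the text does not name the four dropped columns.
fifa <- DALEX::fifa
fifa[, c("nationality", "overall", "potential", "wage_eur")] <- NULL
fifa[] <- lapply(fifa, as.numeric)
fifa_ranger <- lrn("regr.ranger", num.trees = 250)
fifa_ranger$train(TaskRegr$new(id = "fifa", backend = fifa, target = "value_eur"))
ranger_exp <- explain_mlr3(fifa_ranger, data = fifa, y = fifa$value_eur,
                           label = "Ranger RF", verbose = FALSE)

selected_variables <- c("age", "movement_reactions",
                        "skill_ball_control", "skill_dribbling")

# model_profile() returns individual and aggregated profiles;
# the aggregated ones are the partial dependence profiles.
fifa_pd <- model_profile(ranger_exp, variables = selected_variables)$agr_profiles
print(fifa_pd)
plot(fifa_pd)
```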
## Top profiles :
##              _vname_   _label_ _x_  _yhat_ _ids_
## 1 skill_ball_control Ranger RF   5 6722486     0
## 2    skill_dribbling Ranger RF   7 7322826     0
## 3    skill_dribbling Ranger RF  11 7319045     0
## 4    skill_dribbling Ranger RF  12 7315331     0
## 5    skill_dribbling Ranger RF  13 7312057     0
## 6    skill_dribbling Ranger RF  14 7307721     0
Again, the result of the explanation can be presented with the generic function plot().
The general trend for most player characteristics is the same: the higher the skills, the higher the player’s worth. The single exception is the variable age.
8.2.6 Instance level explanation
It is time to see how the model behaves for a single observation, that is, a single player. This can be done for any player, but in this example we will use Cristiano Ronaldo.
The predict_parts function is an instance-level version of the model_parts function introduced in the previous section. For the background behind that method, see the Introduction to Break Down.
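A Break Down explanation for one player might be obtained as in the sketch below. It assumes that player names are stored as row names of the data, so that a single row can be selected by name; the setup and the dropped columns are, as elsewhere, assumptions.

```r
library("DALEX")
library("DALEXtra")
library("mlr3")
library("mlr3learners")

# Setup repeated so the sketch runs on its own.
# Assumption: the text does not name the four dropped columns.
fifa <- DALEX::fifa
fifa[, c("nationality", "overall", "potential", "wage_eur")] <- NULL
fifa[] <- lapply(fifa, as.numeric)
fifa_ranger <- lrn("regr.ranger", num.trees = 250)
fifa_ranger$train(TaskRegr$new(id = "fifa", backend = fifa, target = "value_eur"))
ranger_exp <- explain_mlr3(fifa_ranger, data = fifa, y = fifa$value_eur,
                           label = "Ranger RF", verbose = FALSE)

# Select the observation to explain by its row name.
ronaldo <- fifa["Cristiano Ronaldo", ]

# Break Down is the default type of predict_parts().
ronaldo_bd_ranger <- predict_parts(ranger_exp, new_observation = ronaldo)
print(head(ronaldo_bd_ranger))
plot(ronaldo_bd_ranger)
```

The result is a data frame with an extra class attribute, so `print()` and `plot()` dispatch to the Break Down methods.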
##                                        contribution
## Ranger RF: intercept                        7471978
## Ranger RF: movement_reactions = 96         11979127
## Ranger RF: skill_ball_control = 92          6413054
## Ranger RF: attacking_finishing = 94         5083462
## Ranger RF: mentality_positioning = 95       5741357
## Ranger RF: mentality_composure = 95         5138201
The generic plot() function shows the estimated contribution of each variable to the final prediction.
Cristiano is a striker, so the characteristics that influence his worth are those related to attack, like skill_dribbling. Only one variable has a negative attribution.
Another way to inspect the local behaviour of the model is to use SHapley Additive exPlanations (SHAP). It locally shows the contribution of variables to a single observation, just like Break Down.
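A SHAP explanation for the same player could be requested by changing the `type` argument of `predict_parts()`, as in the sketch below. The value `B = 10` (the number of random variable orderings that are averaged) is chosen only to keep the run time short, and the setup with its dropped columns is an assumption.

```r
library("DALEX")
library("DALEXtra")
library("mlr3")
library("mlr3learners")

# Setup repeated so the sketch runs on its own.
# Assumption: the text does not name the four dropped columns.
fifa <- DALEX::fifa
fifa[, c("nationality", "overall", "potential", "wage_eur")] <- NULL
fifa[] <- lapply(fifa, as.numeric)
fifa_ranger <- lrn("regr.ranger", num.trees = 250)
fifa_ranger$train(TaskRegr$new(id = "fifa", backend = fifa, target = "value_eur"))
ranger_exp <- explain_mlr3(fifa_ranger, data = fifa, y = fifa$value_eur,
                           label = "Ranger RF", verbose = FALSE)

ronaldo <- fifa["Cristiano Ronaldo", ]

# SHAP values estimated by averaging Break Down contributions
# over B random orderings of the variables.
ronaldo_shap_ranger <- predict_parts(ranger_exp, new_observation = ronaldo,
                                     type = "shap", B = 10)
plot(ronaldo_shap_ranger)
```

A larger `B` gives more stable estimates at the cost of more model evaluations.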
In the previous section, we introduced a global explanation: Partial Dependence Plots. Ceteris Paribus is the instance-level version of that plot. It shows the response of the model for an observation when we change only one variable while the others stay unchanged. The blue dot marks the original value.
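library("ggplot2")  # needed for scale_y_continuous()

# ranger_exp is the explainer created earlier; ronaldo is the one-row
# data frame holding Cristiano Ronaldo's characteristics.
selected_variables <- c("age", "movement_reactions",
                        "skill_ball_control", "skill_dribbling")

# Ceteris Paribus profiles: vary one variable at a time for this player.
ronaldo_cp_ranger <- predict_profile(ranger_exp, ronaldo,
                                     variables = selected_variables)

plot(ronaldo_cp_ranger, variables = selected_variables) +
  scale_y_continuous("Estimated value of Cristiano Ronaldo",
                     labels = scales::dollar_format(suffix = "€", prefix = ""))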