12  Fairness

Author: Florian Pfisterer

Affiliation: Ludwig-Maximilians-Universität München

Abstract
TBD

12.1 Fairness

In this chapter, we explore fairness in automated decision making and how to build better systems that make (automated) decisions about human individuals. Such systems range from banking (credit scoring) and hiring (applicant scoring) to systems that assist doctors with medical decisions. Methods for auditing models for bias and for building fairer models are provided by the mlr3fairness package. This chapter borrows heavily from the paper accompanying the package (Pfisterer et al. 2023); the interested reader is referred to this manuscript for a more in-depth introduction.

Automated decision-making systems based on data-driven models are becoming increasingly common, and studies have found that they often outperform human experts, especially in high-stakes scenarios, leading to more efficient and accurate decisions. However, without proper auditing, these models can have negative consequences for individuals, especially those from underprivileged groups. The proliferation of such systems in everyday life makes it important to address their potential for bias. For instance, historical and sampling biases in the training data can be replicated and thus perpetuated by a model, and inadequate representation of unprivileged populations can lead to models that perform well on some groups but worse on others. Biases in the measurement of labels and features, as well as feedback loops, are further sources of bias that need to be addressed. As ML-driven systems are used for highly influential decisions, it is vital to develop capabilities to analyze and assess these models not only with respect to their robustness and predictive performance but also with respect to potential biases.

12.1.1 What is bias?

When we speak of bias in the context of fairness, we usually refer to disparities in how a model treats individuals or groups. To understand this better, we first discuss which disparities we might want to consider in this context. We then discuss how to detect and quantify those disparities in machine learning models, and finally how they can be translated into so-called fairness metrics.

12.1.1.1 Notions of fairness

In this chapter, we concentrate on a subset of bias definitions, so-called group fairness. As a scenario, imagine we develop a system that decides whether a patient should be considered for a new treatment. Our goal might be that those decisions are fair across groups defined by a sensitive attribute, e.g., gender, race or age. We now present two different perspectives on fairness (Barocas, Hardt, and Narayanan 2019; Wachter, Mittelstadt, and Russell 2020).

The first group of fairness notions, bias-preserving fairness notions, also called Separation, requires that the model's prediction is independent of the sensitive attribute given the true label. In other words, the model should make roughly the same number of errors (or correct predictions) in each group. Several metrics fall under this category, such as “Equalized Odds”, which requires the same true positive and false positive rates across all groups.

The second group, bias-transforming fairness notions, also called Independence, only requires that the model's prediction is independent of the sensitive attribute. This group includes the concept of “Demographic Parity”, which requires that the proportion of positive predictions is equal across all groups.
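Formally, writing \(A\) for the sensitive attribute, \(Y\) for the true label and \(\hat{Y}\) for the model's prediction, the two notions can be written as (conditional) independence statements:

\[ \text{Independence: } \hat{Y} \perp A, \qquad \text{Separation: } \hat{Y} \perp A \mid Y. \]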

Which fairness notion should be chosen depends on the decision our model makes and its societal context. It is important to note that these metrics condense a large variety of societal issues into a single number and are therefore limited in their ability to identify biases that may exist in the data. Similarly, they do not easily translate to legal principles. For example, if societal biases lead to disparities in an observed quantity (such as SAT scores) for individuals with the same underlying ability, these metrics may not identify existing biases. The metrics often extend naturally to more complex scenarios, such as multi-class classification, regression, or survival analysis.

Additional fairness notions beyond statistical group fairness include individual fairness, which assesses fairness at an individual level based on the principle of treating similar cases similarly and different cases differently, and causal fairness notions which incorporate causal relationships in the data and propose metrics based on a directed acyclic graph (Barocas, Hardt, and Narayanan 2019; Mitchell et al. 2021).

12.1.1.2 Choosing fairness notions

Selecting a fairness notion requires careful consideration of the context in which the model will be used. Bias-preserving metrics, such as equalized odds and equality of opportunity, require that the errors made by a model are equal across groups, but they might not account for bias in the labels. Bias-transforming methods, on the other hand, do not depend on the labels and can help detect biases arising from different base rates across populations, but enforcing them might induce a shift in the data distribution, possibly leading to feedback loops.

The exact metric then depends on the decision our model makes: depending on whether the decision is, e.g., punitive or assistive, we might want to focus on false positive or false discovery rates instead of true positive rates. In addition, we need to consider the legal and moral principles we aim to encode with the selected fairness metric.

12.1.1.3 Translating notions of fairness into code

In order to translate the independence requirements stated above into a fairness metric, we investigate differences between groups. In the following, we denote the sensitive attribute by \(A\) and, for simplicity, assume that it is binary and can only take the two values 0 and 1. For a metric \(M\), e.g., the true positive rate (TPR), we then calculate the difference in the metric between the two groups:

\[ \Delta_{M} = M_{A=0} - M_{A=1}. \]

To provide an example, denoting \({P}\left(\hat{Y} = 1 \mid A = \star, Y = 1\right)\) with \(TPR_{A=\star}\), we calculate the difference in TPR between the two groups: \[ \Delta_{TPR} = TPR_{A=0} - TPR_{A=1}. \] When \(\Delta_{TPR}\) now significantly deviates from \(0\), the prediction \(\hat{Y}\) violates the requirement for equality of opportunity formulated above.

It is important to note that in practice, we might not be able to perfectly satisfy a given metric, e.g., due to stochasticity in data and labels. Instead, to provide a binary conclusion regarding fairness, a model could be considered fair if \(|\Delta_{TPR}| < \epsilon\) for a given threshold \(\epsilon > 0\), e.g., \(\epsilon = 0.05\). This allows for small deviations from perfect fairness due to variance in the estimation of \(TPR_{A=\star}\) or additional sources of bias.
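To make this concrete, the following sketch computes \(\Delta_{TPR}\) by hand for small hypothetical vectors of labels, predictions and group membership (all values and names are purely illustrative):

# hypothetical binary labels, predictions and sensitive attribute
y    = c(1, 1, 0, 1, 0, 1, 1, 0, 1, 1)
yhat = c(1, 0, 0, 1, 1, 1, 0, 0, 1, 1)
A    = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# true positive rate within group a: P(yhat = 1 | A = a, Y = 1)
tpr = function(a) mean(yhat[A == a & y == 1])

delta_tpr = tpr(0) - tpr(1)
delta_tpr

# binary fairness conclusion for a tolerance epsilon = 0.05
abs(delta_tpr) < 0.05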

This approach allows us to construct a fairness Measure from an arbitrary metric by supplying a base_measure:

library("mlr3fairness")
msr("fairness",
  base_measure = msr("classif.tpr"),
  range = c(0, 1)
)
<MeasureFairness:fairness.tpr>
* Packages: mlr3, mlr3fairness
* Range: [0, 1]
* Minimize: TRUE
* Average: macro
* Parameters: list()
* Properties: requires_task
* Predict type: response

For convenience, we have implemented a variety of fairness Measures that are made available when the mlr3fairness package is loaded. Fairness measures can then be constructed via msr() like any other measure in mlr3.

Tip

Implemented fairness metrics can be constructed using the prefix "fairness." followed by the name of the base measure, e.g., "tpr". Simply calling msr() without any arguments returns a list of all available measures, including the fairness metrics.
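As a sketch, the keys of all fairness measures registered in the measure dictionary can, for example, be listed as follows:

# list the keys of all measures whose id starts with "fairness."
keys = as.data.table(mlr_measures)$key
keys[startsWith(keys, "fairness.")]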

12.1.1.4 Example: The adult dataset

In the following chunk, we retrieve the binary classification task with id "adult_train" from the package. It contains a part of the Adult data set (Dua and Graff 2017).

The task is to predict whether an individual earns more than $50,000 per year. The column "sex" is already set as a binary sensitive attribute with levels "Female" and "Male".

library("mlr3fairness")
task = tsk("adult_train")
print(task)
<TaskClassif:adult_train> (30718 x 13)
* Target: target
* Properties: twoclass
* Features (12):
  - fct (7): education, marital_status, occupation, race, relationship,
    sex, workclass
  - int (5): age, capital_gain, capital_loss, education_num,
    hours_per_week
* Protected attribute: sex

12.1.1.5 Setting a sensitive attribute

For a given task, we can select one or multiple sensitive attributes. In mlr3, the sensitive attribute is identified via the column role pta (short for protected attribute) and can be set as follows:

task$set_col_roles("sex", add_to = "pta")

This example sets the "sex" column of our task as a sensitive attribute (for "adult_train", this was already done when the task was constructed, as indicated by the print output above). This information is then automatically passed along when the task is used, e.g., when computing fairness metrics. If more than one sensitive attribute is specified, metrics are computed based on the intersecting groups formed by those columns.
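As a sketch (on a separate copy of the task, so that the rest of this chapter keeps working with "sex" only), a second sensitive attribute such as "race" could be added as follows:

# add "race" as a second sensitive attribute on a copy of the task;
# metrics would then be computed over the intersecting groups of "sex" and "race"
task2 = tsk("adult_train")
task2$set_col_roles("race", add_to = "pta")
task2$col_roles$pta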

12.1.1.6 Auditing a model for bias

We can now fit any Learner on this task and score the resulting Prediction.

learner = lrn("classif.rpart", predict_type = "prob")
idx = partition(task)
learner$train(task, idx$train)
prediction = learner$predict(task, idx$test)

We then employ a fairness measure, here the discrepancy in true positive rates between groups:

measure = msr("fairness.tpr")
prediction$score(measure, task = task)
fairness.tpr 
  0.05452256 

This reports the difference in true positive rates between the two groups identified by the pta column role of the task. To determine whether this result is fair, we need to consider the decision's context, e.g., which real-world quantities this disparity encodes. On a technical level, we can similarly use the metric to score a ResampleResult or BenchmarkResult, e.g., for the comparison of multiple algorithms.
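For example, a sketch of scoring a resampling experiment with the same learner and fairness measure (the resampling settings are illustrative) could look like this:

# 3-fold cross-validation, then aggregate accuracy and the fairness metric over folds
rr = resample(task, learner, rsmp("cv", folds = 3))
rr$aggregate(msrs(c("classif.acc", "fairness.tpr")))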

12.1.1.7 Visualizing differences

Visualizations can help us better understand discrepancies between groups or differences between models. For an in-depth look at the available visualizations, please consult the Visualization vignette.

To showcase available visualizations, we will again use the adult dataset (Dua and Graff 2017) and train a random forest on it.

task = tsk("adult_train")
test = tsk("adult_test")
learner = lrn("classif.ranger", predict_type = "prob")
learner$train(task)
prediction = learner$predict(test)
library(patchwork)
# prediction density per group (left) and metric comparison (right)
p1 = fairness_prediction_density(prediction, task = test)
p2 = compare_metrics(prediction,
  msrs(c("fairness.fpr", "fairness.tpr", "fairness.eod")),
  task = test
)

(p1 + xlab("") + p2) * 
  theme_bw() *
  scale_fill_viridis_d(end = 0.8, alpha = 0.8) *
  theme(
    axis.text.x = element_text(angle = 15, hjust = .7),
    legend.position = "bottom"
  )


Figure 12.1: Fairness prediction density plot (left) showing the density of predictions for the positive class, split into “Male” and “Female” individuals. The metric comparison plot (right) depicts the model’s scores across the specified metrics as bars.

In this example, we can see a difference between the prediction densities of the positive class for Male and Female individuals, which might indicate the presence of a small bias. This is also reflected in the metric comparison, where small discrepancies in the fairness metrics (up to roughly 0.05) can be observed.

12.1.2 Fair Machine Learning

If we detect that our model is unfair, a natural next step is to try to mitigate such biases. The mlr3fairness package comes with several options to address biases in models, which broadly fall into three categories: data pre-processing, employing fair models, and adapting model predictions. An overview is provided, for example, by Caton and Haas (2020). These methods often come at a slight cost in predictive performance, so we might want to compare which of the available approaches best balances predictive performance and fairness.

Pre- and post-processing schemes can be connected to learners with the help of mlr3pipelines. If you are not yet familiar with pipelines, see Chapter 6. We provide two examples below: first, pre-processing that balances observation weights with po("reweighing_wts"), and second, post-processing of predictions using po("EOd"). The latter enforces the equalized odds fairness definition by stochastically flipping specific predictions.

library("mlr3pipelines")
library("mlr3learners")
l1 = po("reweighing_wts") %>>% lrn("classif.rpart")
l2 = po("learner_cv", lrn("classif.rpart")) %>>%
  po("EOd")

Similarly, mlr3fairness also comes with learners that incorporate fairness considerations directly, e.g. a generalized linear model with fairness constraints is available as "classif.fairzlrm".
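Such a learner is constructed like any other learner (a minimal sketch; the underlying implementation requires an additional backend package to be installed):

# fairness-constrained generalized linear model; could, e.g., also be
# added to the benchmark below
flrn = lrn("classif.fairzlrm")
print(flrn)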

Note

Algorithmic interventions can often enforce fairness in suboptimal ways. It is therefore important to try and address biases at their root cause instead of relying solely on algorithmic interventions.

We are now ready to compare the two pipelines constructed above, together with an unmodified decision tree as a baseline, using the benchmark() function.

lrns = list(lrn("classif.rpart"), l1, l2)
grd = benchmark_grid(task, lrns, rsmp("cv", folds = 2L))
bmr = benchmark(grd)
bmr$aggregate(msrs(c("classif.acc", "fairness.eod")))
   nr     task_id                   learner_id resampling_id iters classif.acc
1:  1 adult_train                classif.rpart            cv     2   0.8410053
2:  2 adult_train reweighing_wts.classif.rpart            cv     2   0.8414936
3:  3 adult_train            classif.rpart.EOd            cv     2   0.8321831
1 variable not shown: [fairness.equalized_odds]
Hidden columns: resample_result

We can now study the result using built-in plot functions, e.g. the fairness_accuracy_tradeoff function.

fairness_accuracy_tradeoff(bmr, msr("fairness.eod")) +
  scale_color_viridis_d("Learner") +
  theme_minimal()

The resulting plot depicts the results of the individual cross-validation folds along with their aggregate. Employing the EOd post-processing strategy improves the model with respect to the equalized odds difference metric at the cost of some accuracy, whereas the reweighing strategy seems to have a negligible effect.

Combining mlr3fairness with mlr3pipelines and mlr3tuning allows for tuning over complex pipelines and, e.g., simultaneously optimizing for performance and fairness. This is described in more detail in the mlr3fairness paper (Pfisterer et al. 2023).
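A minimal sketch of such a setup is given below; the tuned hyperparameter, the resampling and the budget are illustrative choices, and passing two measures results in a multi-objective tuning instance whose result is the set of Pareto-optimal configurations:

library("mlr3tuning")

# tune the complexity parameter of the tree inside the reweighing pipeline
# with respect to both accuracy and equalized odds
glrn = as_learner(po("reweighing_wts") %>>%
  lrn("classif.rpart", cp = to_tune(1e-4, 0.1, logscale = TRUE)))

instance = tune(
  tuner = tnr("random_search"),
  task = tsk("adult_train"),
  learner = glrn,
  resampling = rsmp("cv", folds = 3),
  measures = msrs(c("classif.acc", "fairness.eod")),
  term_evals = 10
)

# Pareto-optimal configurations trading off accuracy against fairness
instance$result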

12.1.3 Documentation

Because fairness aspects cannot always be investigated based on fairness metrics alone, it is important to document the data collection process, the resulting data, and the models trained on this data. Informing auditors about these aspects of a deployed model can lead to a better assessment of the model's fairness. Questionnaires for ML models and data sets have previously been proposed in the literature and are readily available from mlr3fairness as automated report templates based on R Markdown, both for data sets and for ML models. In addition, we provide a template for a fairness report inspired by the Aequitas Toolkit, which includes many fairness metrics and visualizations and provides a good starting point for generating a fairness report. A preview of the different reports can be obtained from the Reports vignette.

The functions below create templates for a data sheet and a model card that can help with documenting machine learning data and models. Running a function creates the corresponding markdown template in the specified folder, here the "docs" folder. The user can then further customize the report by answering the included questions or adding visualizations and other information.

report_datasheet("docs/sheet")
report_modelcard("docs/card")

12.1.4 Fairness: Concluding remarks

The functionality introduced above aims to help users investigate their models for potential biases and to potentially mitigate them. Deciding whether a model is fair, however, requires additional investigation, such as determining what the measured quantities actually mean for an individual in the real world and what other biases might exist in the data, e.g., in how covariates or the label are measured.

Note

Fairness metrics cannot be used to prove or guarantee fairness. They serve as a diagnostic tool to detect disparities and as a basis for model selection and for making fairer decisions in practice. However, fairness metrics are reductions of complex societal processes into mathematical objectives and require abstraction steps that can invalidate the metric. Additionally, practitioners should look beyond the model and consider the data used for training and the process of data and label acquisition. Fairness metrics should therefore be used for exploratory purposes only, and practitioners should not rely on them alone to decide whether to employ an ML model or to assess whether a system is fair.

We hope that pairing the functionality available in mlr3fairness with additional exploratory data analysis, a solid understanding of the societal context in which a decision is made, and additional tools, e.g., from interpretable machine learning, will help to mitigate or diminish unfairness in systems deployed in the future.

12.2 Exercises

Trying to achieve fairness is often tricky, especially in the context of complex fairness metrics or when multiple protected attributes exist. In the following exercises, we reuse the "adult_train" data from above but optimize for different fairness metrics.

  1. Load the "adult_train" task and try to build a first model. Train a simple model and evaluate it on the "adult_test" task that is also available with mlr3fairness.

  2. Assume our goal is to achieve parity in false omission rates. Construct a fairness metric that encodes this and again evaluate your model. In order to get a deeper understanding, look at the groupwise_metrics function to obtain the performance in each group.

  3. Now try to improve your model, e.g. by employing pipelines that use pre- or post-processing methods for fairness. Evaluate your model along the two metrics and visualize the resulting metrics. Compare the different models using an appropriate visualization.

  4. Now we want to add a second sensitive attribute to the dataset, in this case "race". Add this information to your task and evaluate the initial model again. What changes? Again, study the groupwise_metrics.