Now that we understand the basics of the ML pipeline, we can dive into how things can go terribly wrong. In this lesson, we’ll follow this pipeline and highlight key areas where problems frequently occur. Throughout the remainder of this course, we’ll cover each of these topics in detail and provide real-world case studies that highlight both consequences and solutions.

Sources of bias

The ML pipeline we covered in the last lesson is a good summary of a data scientist’s journey, but it doesn’t include key ideas such as where the data came from or how the model will be used once deployed. Essentially, there are several hidden steps before and after the pipeline itself that contain potential sources of harm and elevate ML risk. These steps can be simply categorized as pre-pipeline and post-pipeline.

Pre-pipeline steps come before the data is obtained and answer questions like:

  • How was the data collected?

  • How was the population defined?

  • What metrics were used and how were they defined?

Post-pipeline steps come after the model is deployed and answer questions like:

  • How do humans interpret model outputs?

  • What kinds of feedback loops are implemented?

Sources of harm: pre-pipeline

The following diagram from MIT researchers Harini Suresh and John Guttag is a great summary of pre-pipeline sources of harm.

Pre-pipeline sources of harm

Historical bias

Historical bias is the idea that even if our data is perfectly sampled and measured, it can still encode societal norms of the past that harm certain populations. As an example, consider word embeddings.

In the natural language space, word embeddings are a computational method of representing words as a series of numbers in a vector. Essentially, a word like “doctor” would be stored as a d-dimensional vector that captures a bit of the meaning of the word. With this approach, it becomes easy to create word associations (phrases like “a is to b as c is to d”). For example, this association could be written mathematically as $a - b = c - d$.
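
To make this arithmetic concrete, here is a minimal sketch using tiny, made-up 3-dimensional vectors (real embeddings such as word2vec or GloVe are learned from text and have hundreds of dimensions); the vectors and the `closest_word` helper are purely illustrative.

```python
import numpy as np

# Toy 3-dimensional embeddings; the values are made up for illustration.
embeddings = {
    "man":   np.array([0.9, 0.1, 0.3]),
    "woman": np.array([0.9, 0.1, 0.9]),
    "king":  np.array([0.8, 0.9, 0.3]),
    "queen": np.array([0.8, 0.9, 0.9]),
}

def closest_word(vector, vocab):
    """Return the word whose embedding is most similar (by cosine) to `vector`."""
    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(vocab, key=lambda w: cosine(vector, vocab[w]))

# "man is to woman as king is to ?": from man - woman = king - d,
# solve for d = king - man + woman.
d = embeddings["king"] - embeddings["man"] + embeddings["woman"]
print(closest_word(d, embeddings))  # prints "queen" for these toy vectors
```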

When trying to create these word embeddings, data scientists in the past have used online sources to compile their data (such as Google News or Wikipedia), but in doing so they also capture centuries of historical bias that comes with the language. Gender bias is particularly pervasive in English-language text. Words like “computer programmer,” for example, are much more commonly associated with “he/him” pronouns than with “she/her.” When a model learns word embeddings, it begins to assign a vector representation for “computer programmer” that’s much closer to “man” or “him” than to “woman” or “her.” When these embeddings are subsequently used in models, gender bias can easily creep in. In a paper released in 2016 titled “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings,” researchers used word embeddings obtained from a model trained on Google News articles and found that the following word association actually held true: $\text{man} - \text{woman} = \text{computer programmer} - \text{homemaker}$ (Bolukbasi T., et al., 2016). The model treated “computer programmer” as a man’s profession and “homemaker” as a woman’s profession. It’s easy to see that models that use these word embeddings as inputs can perpetuate these stereotypes.
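
The kind of probe the paper describes can be sketched by projecting occupation words onto a “he minus she” direction and seeing which way they lean. The 2-dimensional vectors below are fabricated stand-ins, not the Google News embeddings the authors actually used.

```python
import numpy as np

# Fabricated 2-dimensional embeddings standing in for vectors learned from text.
embeddings = {
    "he":         np.array([ 0.9, 0.2]),
    "she":        np.array([-0.9, 0.2]),
    "programmer": np.array([ 0.5, 0.6]),
    "homemaker":  np.array([-0.5, 0.6]),
    "doctor":     np.array([ 0.1, 0.8]),
}

# The "gender direction" is the normalized difference between the pronouns.
gender_direction = embeddings["he"] - embeddings["she"]
gender_direction /= np.linalg.norm(gender_direction)

# Positive projections lean toward "he", negative toward "she".
for word in ["programmer", "homemaker", "doctor"]:
    projection = np.dot(embeddings[word], gender_direction)
    print(f"{word:>10s}: {projection:+.2f}")
```

A debiasing step, loosely in the spirit of the paper, would then shrink these projections for words that should be gender-neutral.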

Representation bias

Representation bias is similar to sampling bias in statistics. It has to do with whether the sample represents the target population, and it typically arises from one of three problems:

  1. The data originates from a population that is not a good representative population for the use case. For example, foot traffic data obtained from shopping complexes in Asia is not representative of foot traffic for similar areas in North America.

  2. The covered population contains underrepresented groups. For example, building a model on data collected from the United States Census will work better for majority groups than for minority groups because there’s more data on the former.

  3. Sampling methods are uneven. For example, using data from college admissions to predict applicant success is already biased because there are only success metrics for candidates that have been accepted (there’s no information on whether a rejected candidate succeeded at the college).

These are considerations that a data scientist must take into account prior to using a dataset.
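
One simple pre-modeling check is to compare group proportions in the sample against a reference distribution for the population the model is meant to serve. The column name and reference shares below are hypothetical.

```python
import pandas as pd

# Hypothetical sample; in practice the reference shares might come from census
# figures or another description of the population you intend to serve.
sample = pd.DataFrame({"group": ["A"] * 800 + ["B"] * 150 + ["C"] * 50})
reference_share = {"A": 0.60, "B": 0.25, "C": 0.15}

observed_share = sample["group"].value_counts(normalize=True)
for group, expected in reference_share.items():
    observed = observed_share.get(group, 0.0)
    print(f"group {group}: observed {observed:.2f} vs expected {expected:.2f}")
```

Large gaps here are a signal to re-collect, re-weight, or at least document the mismatch before modeling.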

Measurement bias

This bias has to do with the simplification of complex problems via proxy measurements and metrics. Models (and humans) can’t possibly make decisions based on extremely complex circumstances, so we tend to simplify problems and create metrics that summarize a target and reduce dimensionality. For example, college admissions metrics like grade point average (GPA) and SAT scores are by no means comprehensive measures of student success (and, frankly, are often criticized as opaque and weakly predictive), but they are used because they simplify the admissions process significantly. However, these simplifications often carry implicit biases that can dangerously affect certain populations.

A common companion to this oversimplification is that the method of measurement often varies across groups. Data collection in prisons, for example, is performed very differently from data collection elsewhere. The same metric (e.g., “blood pressure”) may be significantly skewed in prisons because the measurement is taken in a high-stress setting, often by a prison guard or prison doctor.

Similarly, the accuracy of a measurement itself can shift across population groups. Consider misdiagnosis rates in the United States. A 2019 study showed that African-Americans are up to 2.4 times more likely to be misdiagnosed with schizophrenia than White individuals (Schwartz E., et al., 2019). Data collected on schizophrenia rates in the United States will therefore make it appear that African-Americans are more likely to develop schizophrenia. Models trained on data like this will naturally be skewed and will draw conclusions that reflect the measurement error as much as the underlying reality.
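
A small simulation, with entirely made-up rates, shows how a group-dependent misdiagnosis rate makes the recorded prevalence diverge from the true prevalence even when the underlying rates are identical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Assume, purely for illustration, the same true prevalence in both groups
# but a higher false-positive diagnosis rate for group B.
true_prevalence = 0.01
false_positive_rate = {"A": 0.005, "B": 0.012}

for group, fpr in false_positive_rate.items():
    truly_ill = rng.random(n) < true_prevalence
    # Everyone truly ill is recorded as ill; some healthy people are misdiagnosed.
    recorded_ill = truly_ill | (rng.random(n) < fpr)
    print(f"group {group}: true rate {truly_ill.mean():.4f}, "
          f"recorded rate {recorded_ill.mean():.4f}")
```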

Sources of harm: pipeline and beyond

Continuing on in our journey, we reach the pipeline itself. Here, we encounter biases related to how we partition, simplify, and transform the data, as well as inherent biases that models carry. We also briefly discuss what happens beyond the pipeline, since things can still go wrong after ML models are deployed. As a preview, here’s the second part of Harini Suresh and John Guttag’s “Sources of Harm” diagram.

Post-pipeline sources of harm

Selection bias

When simplifying and transforming a dataset, data scientists often perform a key step called feature selection, in which we choose the features that best explain the data. For example, in the mortgage lending dataset (reproduced below), we might decide that “Credit Score” best explains the approval decision.

Mortgage Loan Dataset

| CustomerID | Sex | Amount (in Thousands) | Credit Score | Income (in Thousands) | Approved? |
| --- | --- | --- | --- | --- | --- |
| 1 | Male | 14 | 790 | 185 | Y |
| 2 | Male | 7 | NA | NA | N |
| ... | ... | ... | ... | ... | ... |
| 1,000 | Other | 4 | 621 | 87 | N |

In choosing “Credit Score” as one of the variables in our final model, we also implicitly import all of the measurement bias associated with credit scoring (of which there is plenty). Considering that credit scoring is historically biased against minority populations, our model now learns associations between minority status and loan approvals.
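
A lightweight sanity check before committing to a feature is to look at how the feature and the outcome vary across protected groups. The rows below are fabricated and only loosely mirror the table above; a real check would run on the full dataset.

```python
import pandas as pd

# Fabricated toy rows loosely mirroring the mortgage table.
df = pd.DataFrame({
    "sex":          ["Male", "Male", "Female", "Female", "Other", "Male"],
    "credit_score": [790, 710, 640, 655, 621, 765],
    "approved":     [1, 1, 0, 1, 0, 1],
})

# If the selected feature differs sharply by group, it can act as a proxy
# for the protected attribute even when that attribute is dropped.
print(df.groupby("sex")["credit_score"].mean())
print(df.groupby("sex")["approved"].mean())
```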

Learning bias

In ML, there is often a stark trade-off between model performance and model fairness. There are circumstances where it can be avoided, but in general it holds. ML models work by optimizing a target metric (e.g., accuracy, cross-entropy loss, etc.), but the choice of performance metric often sacrifices fairness metrics (e.g., demographic parity, equalized odds, etc.). This can lead to undesirable outcomes if not properly monitored. For now, consider that in order to perform well over a dataset, it makes sense to generalize, but generalizing often favors majority groups within the data at the cost of minority groups. In the context of the mortgage lending data, optimizing for White male applicants (who are generally better represented in the data) yields better overall accuracy, so that is effectively what the model does. For minority applicants, the model may not work nearly as well.
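
As a sketch of what that monitoring can look like, the snippet below reports overall accuracy alongside a simple fairness metric, the demographic parity gap (the difference in positive-prediction rates between groups). The labels, predictions, and group assignments are invented.

```python
import numpy as np

# Invented ground truth, model predictions, and group membership.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0, 0, 0])
group  = np.array(["maj"] * 5 + ["min"] * 5)

accuracy = (y_true == y_pred).mean()

# Demographic parity compares positive-prediction rates across groups;
# a large gap means the model treats the groups very differently.
rate_maj = y_pred[group == "maj"].mean()
rate_min = y_pred[group == "min"].mean()

print(f"overall accuracy:         {accuracy:.2f}")
print(f"positive rate (majority): {rate_maj:.2f}")
print(f"positive rate (minority): {rate_min:.2f}")
print(f"demographic parity gap:   {abs(rate_maj - rate_min):.2f}")
```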

Aggregation bias

Aggregation bias is very similar to representation bias. Succinctly, when a single model trained over a large population (containing both majority and minority groups) is used to make predictions on a minority group, the results will be suboptimal, since the model is essentially fit to the dominant population. A model trained on mortgage lending data without any controls or fixes will often perform suboptimally for minority groups.
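
A quick synthetic example: fit one pooled model to data where the two groups follow different relationships, then evaluate it per group. The group sizes and slopes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: the two groups follow different input-output relationships.
x_maj = rng.normal(size=900)
y_maj = 2.0 * x_maj + rng.normal(scale=0.1, size=900)
x_min = rng.normal(size=100)
y_min = -1.0 * x_min + rng.normal(scale=0.1, size=100)

# A single line fit on the pooled data is dominated by the majority group.
slope, intercept = np.polyfit(np.concatenate([x_maj, x_min]),
                              np.concatenate([y_maj, y_min]), 1)

def mse(x, y):
    return np.mean((y - (slope * x + intercept)) ** 2)

print(f"pooled-model error on majority group: {mse(x_maj, y_maj):.2f}")
print(f"pooled-model error on minority group: {mse(x_min, y_min):.2f}")
```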

Evaluation bias

In order to conclude that a model is or isn’t working, it must be compared to some benchmark. For example, in research, lending algorithm performance might be evaluated on large, public datasets like those released under the Home Mortgage Disclosure Act (HMDA). However, HMDA data also contains some level of historical bias due to a long history of mortgage lending being unfair to minorities. When newer models compare their performance against HMDA data, they’re measuring against a standard that is itself misrepresentative. Benchmarks are a great idea when comparing models against one another, but evaluation should also be done internally, i.e., on the datasets that will actually be used in production. If we're building a lending algorithm and we want to compare it on real data, we need to backtest it on a dataset that represents the use case as closely as possible.
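
Back-of-the-envelope arithmetic makes the point: if a model’s per-group accuracy is fixed (the numbers below are made up), the headline score depends heavily on how the evaluation set is composed.

```python
# Hypothetical per-group accuracy of a lending model.
accuracy = {"majority": 0.95, "minority": 0.60}

def expected_accuracy(majority_share):
    """Headline accuracy when the evaluation set has this majority share."""
    return (majority_share * accuracy["majority"]
            + (1 - majority_share) * accuracy["minority"])

# A benchmark dominated by the majority group flatters the model;
# a backtest that mirrors the production population does not.
print(f"benchmark, 90% majority:  {expected_accuracy(0.90):.3f}")   # 0.915
print(f"production, 50% majority: {expected_accuracy(0.50):.3f}")   # 0.775
```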

Models trained over minority and majority populations in a representative fashion perform well

Deployment bias

This source of harm tends to be the most financially and reputationally damaging, and is also (in many experts’ opinions) the hardest to catch because it occurs after the pipeline, at a stage where data scientists no longer pay as much attention. Deployment bias is the misalignment between how a model is used pre-deployment and post-deployment. Essentially, models are tested and used in isolation during the pipeline, but once deployed, they become part of a larger "sociotechnical system" with lots of moving parts, both technological and human. Often, model outputs have to be interpreted by humans in order to make decisions; confirmation bias and other psychological tendencies come into play here. Many famous examples of ML products failing end users in disastrous ways trace back to deployment bias.