Let’s explore the theoretical solutions for various types of data bias.

Non-ML approaches

These solutions are useful for data scientists to know because they’re generally cost- and time-effective tools and procedures that, applied properly, can greatly improve the quality of the data. In many cases, they are simplified versions of the more sophisticated fixes that ML debiasers provide, but they are still very much worth knowing.

Oversampling and undersampling

One very simple approach is to change the sampling structure of the underlying data. In essence, we either duplicate rows of the minority group until it matches the size of the majority group (oversampling), or randomly remove rows of the majority group until it matches the size of the minority group (undersampling).
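To make the two operations concrete, here is a minimal sketch using pandas. The helper names `oversample` and `undersample`, the column names, and the toy data are all assumptions made for illustration; this is random resampling in its simplest form, not a definitive implementation.

```python
import pandas as pd


def oversample(df: pd.DataFrame, group_col: str, seed: int = 0) -> pd.DataFrame:
    """Keep every row, and duplicate random rows of smaller groups
    until each group is as large as the largest one."""
    target = df[group_col].value_counts().max()
    parts = []
    for _, group in df.groupby(group_col):
        parts.append(group)
        shortfall = target - len(group)
        if shortfall > 0:
            # Sampling with replacement produces duplicate rows.
            parts.append(group.sample(n=shortfall, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)


def undersample(df: pd.DataFrame, group_col: str, seed: int = 0) -> pd.DataFrame:
    """Randomly drop rows of larger groups until each group
    is as small as the smallest one."""
    target = df[group_col].value_counts().min()
    parts = [
        group.sample(n=target, replace=False, random_state=seed)
        for _, group in df.groupby(group_col)
    ]
    return pd.concat(parts, ignore_index=True)


# Hypothetical toy data: 8 rows in group 0 (majority), 2 rows in group 1 (minority).
df = pd.DataFrame({"group": [0] * 8 + [1] * 2, "value": range(10)})

over = oversample(df, "group")
under = undersample(df, "group")

print(over["group"].value_counts().to_dict())
print(under["group"].value_counts().to_dict())
```

Oversampling keeps all the original information but repeats minority rows, while undersampling discards majority rows, so the choice usually depends on how much data we can afford to lose.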

Oversampling

Let's consider a dataset with three variables: age, credit score, and race. We’ll use a binary race variable for simplicity. We’ll also set the prior distribution so that race is drawn as 0 for 80% of the rows. That way, we can quickly calculate the change in representation rate.
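One way such a dataset could be simulated is sketched below with NumPy and pandas. The sample size, the random seed, and the value ranges for age and credit score are assumptions chosen for illustration; only the three variables and the 80% share for race 0 come from the description above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # assumed seed, for reproducibility only
n = 1_000                        # assumed sample size

data = pd.DataFrame({
    # Assumed ranges: ages 18-79 and credit scores 300-850.
    "age": rng.integers(18, 80, size=n),
    "credit_score": rng.integers(300, 851, size=n),
    # Binary race variable: 0 is drawn with probability 0.8, 1 with probability 0.2.
    "race": rng.choice([0, 1], size=n, p=[0.8, 0.2]),
})

# Representation rate: the share of rows belonging to each group.
rep_rate = data["race"].value_counts(normalize=True)
print(rep_rate)
```

Because the draws are random, the observed shares will be close to, but not exactly, 80% and 20%. With an exact 80/20 split, oversampling the minority group to parity would raise its representation rate from 20% to 50%.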
