Practice of Data Bias Mitigation

Explore some products that allow for proper sourcing and observability to mitigate biases.

We'll cover the following

Data sourcing
- Synthetic sources
- Real sources
Data labeling
Data debiasing
Data observability
Ethical AI ecosystem and EAIDB

Theory is nice to have, but professionals often use advanced tooling and software products to handle data quality issues. This lesson covers some of these products and where they excel. We also discuss their disadvantages and where their gaps lie. These services are useful for controlling and mitigating data risk, but nothing is a guaranteed solution. The best and only way to truly reduce data risk is to produce ethically curated data and have domain expertise on how the various factors interrelate with protected attributes. Everything else is a shortcut.

Data sourcing

It’s not typically straightforward to get our own data. In the past, we'd have needed to conduct in-person surveys or collect feedback over long periods of time to amass enough to use for decision-making. In today’s world, this is no longer the case—but sourcing good data is still difficult.

To source safe, ethical, and legal data, there are two paths available: real data and synthetic data. Each has its pros and cons.

Synthetic sources

Synthetic sources are just what they sound like. These are usually very niche companies that provide synthetic datasets or a creation API to generate safe samples. Here are a few examples:

Datagen: This is a data provider for computer vision datasets. They use a mixture of 3D art modeling and GANs to create artificial, diverse (from a racial or gender perspective) data—particularly for faces, bodies, and some other areas.
Datomize: This is a tabular synthetic data provider, with the additional guarantee of model performance. They use several generating methods (in particular, some types of advanced GANs) to create digital twins of any type of data field (including free text, which is quite difficult to do).
Syntric AI: Much like Datagen, this creates localized synthetic data, but specializes in faces. They are the only product on the market that can create infinite skin types (with blemishes, birthmarks, etc.). They provide an API that allows developers to customize items like skin tone and even orientation (roll/pitch/yaw).

Synthetic data is fast, cheap, and quite good if the problem we’re solving is a) common and b) confined to a single kind of data. Syntric AI, for example, creates facial datasets but can’t go outside this realm to create body datasets. Datagen, by contrast, has made a business out of expanding horizontally to multiple types of dataset generation.

Some industries prohibit synthetic data because it’s not a “real” sample from a “real” population; it’s a generated mock-up that exhibits the same statistical properties. Using generative models to produce synthetic data also raises issues, because these models tend to decrease the variance between data points and omit important abnormalities such as outliers and rare events.
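
One way to check for the variance shrinkage and missing outliers described above is to compare simple summary statistics of a synthetic sample against the real one. The sketch below is a minimal illustration using only the Python standard library. The toy data, the function names, and the 3-standard-deviation cutoff are assumptions made for this example and aren't taken from any vendor's product.

```python
import random
import statistics


def tail_fraction(values, center, spread, k=3.0):
    """Fraction of values lying more than k * spread away from center."""
    return sum(abs(v - center) > k * spread for v in values) / len(values)


def compare_to_real(real, synthetic, k=3.0):
    """Compare the spread and tail coverage of synthetic data to real data."""
    mu = statistics.mean(real)
    sigma = statistics.stdev(real)
    return {
        # Below 1.0 means the synthetic data is less spread out than the real data.
        "variance_ratio": statistics.variance(synthetic) / statistics.variance(real),
        # Both tails are measured against the REAL mean and standard deviation.
        "real_tail": tail_fraction(real, mu, sigma, k),
        "synthetic_tail": tail_fraction(synthetic, mu, sigma, k),
    }


# Toy data (assumption): the real sample contains a small cluster of rare events,
# while a hypothetical generator has collapsed toward the bulk of the distribution.
rng = random.Random(0)
real = [rng.gauss(50, 10) for _ in range(5000)] + [rng.gauss(150, 5) for _ in range(50)]
synthetic = [rng.gauss(50, 8) for _ in range(5050)]

report = compare_to_real(real, synthetic)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

A variance ratio well below 1, or a synthetic tail fraction far smaller than the real one, suggests the generator is smoothing away rare cases that may matter for the task.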

Real sources

For most cases that cannot be solved with synthetic data, the only solution is to acquire real data. There are ways to expedite this process, however. Instead of performing representation checks ourselves (to make sure we’re not convenience sampling and have a diverse set), we can employ the work of other companies.
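
For reference, a representation check of the kind mentioned above can be as simple as comparing each group's share of the sample with its share of a reference population. The sketch below is a minimal version under stated assumptions: the group names, the reference shares, and the 5-percentage-point tolerance are all hypothetical.

```python
from collections import Counter


def representation_gaps(sample_groups, reference_shares):
    """Return (sample share - reference share) for every reference group."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    return {
        group: counts.get(group, 0) / total - share
        for group, share in reference_shares.items()
    }


def underrepresented(sample_groups, reference_shares, tolerance=0.05):
    """List groups whose sample share falls short of the reference by more than tolerance."""
    gaps = representation_gaps(sample_groups, reference_shares)
    return sorted(group for group, gap in gaps.items() if gap < -tolerance)


# Hypothetical reference population shares and a convenience sample.
reference = {"A": 0.5, "B": 0.3, "C": 0.2}
sample = ["A"] * 70 + ["B"] * 27 + ["C"] * 3

flagged = underrepresented(sample, reference)
print(flagged)
```

Here group C makes up 3% of the sample against a 20% reference share, so it's flagged, while group B's 3-point shortfall stays within the tolerance.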

It’s important to note that most synthetically generated datasets either contain the same amount of bias as the input data or contain no bias at all. With real data, there’s no such guarantee that conditions will remain the same—all data that comes in contact with real people contains bias.

Citibeats: This is a data-sourcing company that essentially builds “social understanding tools.” They build dashboards and have a data feed containing information around the world on particular topics. They also monitor opinions and sentiments around these topics in a way that allows for greater visibility of minority perspectives. This data can be used to make decisions or as a reliable source of information.
vAIsual: This is a repository of ethically sourced, legally compliant facial data. The company has a plug-and-play approach where, after the data is purchased, it’s immediately ready for use.

Data labeling

For cases where most of the data is unlabeled, data-labeling services provide flexible, functional labeling designed to limit bias, using their own ML capabilities.

Snorkel AI: This is a well-established startup offering (among other things) data-labeling services. Their models can be as simple or as complicated as needed, but they specialize in “distilling domain experts’ knowledge” into their labeling algorithms. Essentially, they apply techniques from weak supervision and programmatic labeling.

For those who are curious, weak supervision is the idea that a “labeler” can be trained with heuristics and rules based on a limited amount of pure, labeled data. This labeler can then apply what it has learned to semi-accurately label other unlabeled data entries. Naturally, this isn’t as good as hand-labeled data, but it’s much more scalable and cost-effective.
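
To make the idea of programmatic labeling concrete, here is a minimal sketch in plain Python. Each labeling function encodes one heuristic and may abstain, and the votes are combined by simple majority. The spam/ham task, the function names, and the rules themselves are hypothetical. Production systems typically go further and weight each function by an estimate of how reliable it is; majority voting is swapped in here for brevity, and this isn't any vendor's API.

```python
from collections import Counter

ABSTAIN = None  # a labeling function returns this when its rule doesn't apply


# Hypothetical heuristics for a toy spam/ham task.
def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN


def lf_contains_meeting(text):
    return "ham" if "meeting" in text.lower() else ABSTAIN


def lf_many_exclamations(text):
    return "spam" if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free, lf_contains_meeting, lf_many_exclamations]


def weak_label(text):
    """Combine labeling-function votes by majority; abstain on no votes or a tie."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting heuristics with equal support
    return ranked[0][0]


unlabeled = [
    "Claim your FREE prize now!!",
    "Agenda for tomorrow's meeting",
    "Free slots for the meeting",
    "See you at lunch",
]
labels = [weak_label(text) for text in unlabeled]
print(labels)
```

Entries where the heuristics abstain or conflict are left unlabeled, which is a natural place to hand examples to a human reviewer.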

Humans in the Loop: This is a hybrid approach that embraces the human-in-the-loop ideology. While they leverage smart labeling algorithms (weak supervision again), they provide a way to loop in humans with domain expertise for edge cases and other alerts. They also provide the ability to give feedback to the labeling systems with reinforcement learning.

Data debiasing

There are also several companies that create technology meant to “debias” or at least remove correlations that could lead to preferential behavior in the inner workings of a model.

Illumr: This is a data debiasing approach using adversarial algorithms (adapted from the literature) to identify and fix biased data. Essentially, one algorithm is trained to recognize bias and another is trained to remove it. These algorithms attempt to outwit one another, and with each iteration, the bias remover does a better job of removing bias.

AIF360: This is a free, open-source toolkit developed by IBM with several operations related to algorithmic fairness. In particular, the library contains 15 data debiasing/bias mitigating algorithms developed prior to 2019 and also provides access to fairness metrics. It’s important to note, however, that research (and industry now) in AI moves at a breakneck pace. Already, many of these methods are out-of-date, but this is one of the only free toolkits that provide this functionality. Covered algorithms include adversarial debiasing, rich subgroup fairness, and many more.
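
As an illustration of what a data-level mitigation can look like, the sketch below implements reweighing, a preprocessing technique that is among those AIF360 includes. It's written in plain Python and deliberately doesn't call the AIF360 API; the toy groups and labels, and the helper names, are assumptions for this example. Each sample gets the weight P(group) × P(label) / P(group, label), which makes the label statistically independent of the protected attribute in the weighted data.

```python
from collections import Counter


def reweighing_weights(groups, labels):
    """Per-sample weights: P(group) * P(label) / P(group, label)."""
    n = len(labels)
    group_counts = Counter(groups)
    label_counts = Counter(labels)
    joint_counts = Counter(zip(groups, labels))
    return [
        (group_counts[g] * label_counts[y]) / (n * joint_counts[(g, y)])
        for g, y in zip(groups, labels)
    ]


def positive_rate(groups, labels, weights, group):
    """Weighted share of positive labels within one group."""
    positives = sum(w for g, y, w in zip(groups, labels, weights) if g == group and y == 1)
    total = sum(w for g, w in zip(groups, weights) if g == group)
    return positives / total


# Toy data (assumption): group "a" receives the positive label far more often than "b".
groups = ["a"] * 6 + ["b"] * 4
labels = [1, 1, 1, 1, 0, 0] + [1, 0, 0, 0]

uniform = [1.0] * len(labels)
weights = reweighing_weights(groups, labels)

# Statistical parity difference: positive rate of "a" minus positive rate of "b".
gap_before = positive_rate(groups, labels, uniform, "a") - positive_rate(groups, labels, uniform, "b")
gap_after = positive_rate(groups, labels, weights, "a") - positive_rate(groups, labels, weights, "b")
print(f"before: {gap_before:.3f}, after: {gap_after:.3f}")
```

The weights only remove the association between group and label in the training data. Whether a model trained on the weighted data behaves fairly still has to be measured separately.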

LatticeFlow: This is a computer vision data debiaser. In earlier lessons, we mentioned “spurious correlations,” where black-box vision algorithms (like CNNs) pick up on something in images that should not be taken into consideration (e.g., lighting conditions). LatticeFlow offers proprietary software that identifies and fixes this issue. Their product is cutting-edge, and the entire team comes directly from prior research on the same topic done at ETH Zurich.

Data observability

Part of knowing our dataset comes from the experimentation that data scientists perform before using the data. Some companies provide advanced insights that let anyone understand a dataset much more quickly and efficiently.

Galileo: This is a machine learning-based error identifier. They provide “data intelligence” for computer vision data with an algorithm that sifts through unlabeled and unstructured data and automatically finds error patterns and error-prone subdata.

Ethical AI ecosystem and EAIDB

The ecosystem that contains all of these various elements and companies is sometimes referred to as the “ethical AI startup ecosystem.”
