Reidentification Example

Learn how attackers can reidentify users by linking leaked data with public datasets.

To better illustrate how dangerous reidentification is, we examine a relevant example in the financial context. We’ll take the recent Experian data breaches as inspiration.

Setup

Imagine we have three datasets: the Netflix ratings dataset (made public for research and competitions), the IMDb ratings dataset (always public), and credit data from Experian (obtained and released through a major data breach or leak).

Summary

Here’s a quick summary of the steps we’ll use to illustrate reidentification:

  1. Build the Netflix dataset (an anonymized user gives a movie a rating), the IMDb dataset (a user identified by an email gives a movie a rating), and the Experian dataset (a user with an email and a name has a credit_score and an annual_income).

  2. Match the Netflix table with the IMDb dataset to see if we can find users with near-identical ratings across the two different sites.

  3. Look up the emails of the matched IMDb users in the Experian dataset to retrieve their names, credit scores, and incomes.

  4. This information can then be used for predatory advertising or simply sold on the dark web.
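
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lesson's actual implementation: all records are made up, and the helper names (rating_distance, link_users, reidentify), the RMS-difference similarity measure, and the min_overlap and max_distance thresholds are hypothetical choices made for this example.

```python
from math import sqrt

# Step 1: toy datasets (all values are invented for illustration).
# Netflix: anonymous user ID -> {movie: rating}
netflix = {
    "u1": {"Inception": 5, "Titanic": 2, "Alien": 4, "Up": 3},
    "u2": {"Inception": 1, "Titanic": 5, "Alien": 2, "Up": 5},
}
# IMDb: public email -> {movie: rating}
imdb = {
    "alice@example.com": {"Inception": 5, "Titanic": 2, "Alien": 4, "Up": 4},
    "bob@example.com": {"Inception": 1, "Titanic": 5, "Alien": 2, "Up": 5},
    "carol@example.com": {"Inception": 3, "Titanic": 3, "Alien": 3},
}
# Experian (leaked): email -> name, credit_score, annual_income
experian = {
    "alice@example.com": {"name": "Alice A.", "credit_score": 710, "annual_income": 85000},
    "bob@example.com": {"name": "Bob B.", "credit_score": 640, "annual_income": 52000},
}


def rating_distance(a, b, min_overlap=3):
    """RMS rating difference over shared movies; None if too few are shared."""
    shared = a.keys() & b.keys()
    if len(shared) < min_overlap:
        return None
    return sqrt(sum((a[m] - b[m]) ** 2 for m in shared) / len(shared))


def link_users(netflix, imdb, max_distance=0.6):
    """Step 2: map each Netflix ID to its closest IMDb email, if close enough."""
    links = {}
    for user_id, ratings in netflix.items():
        best = None  # (email, distance) of the closest IMDb profile so far
        for email, other in imdb.items():
            d = rating_distance(ratings, other)
            if d is not None and (best is None or d < best[1]):
                best = (email, d)
        if best is not None and best[1] <= max_distance:
            links[user_id] = best[0]
    return links


def reidentify(netflix, imdb, experian):
    """Step 3: join the linked emails against the leaked credit records."""
    links = link_users(netflix, imdb)
    return {uid: experian[email] for uid, email in links.items() if email in experian}


for user_id, record in reidentify(netflix, imdb, experian).items():
    print(user_id, record["name"], record["credit_score"], record["annual_income"])
```

In this toy data, the "anonymous" Netflix user u1 differs from one IMDb profile by a single rating point on one movie, which is enough to link the two accounts and then attach a name, credit score, and income. A real attack would need a more robust similarity measure and far more data, but the join logic stays the same.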
