Data Privacy

Learn about the risks of reidentification and a well-known historical example of it.

Data is very revealing when it’s collected on a mass scale, as it is today. Many will be familiar with data breaches or leaks where personally identifiable information (PII) is hacked, leaked, or stolen from databases and either held for ransom or published on the internet. The gravity of situations like this is why governments around the world impose such heavy fines for violations of data privacy laws. In Europe, complying with the General Data Protection Regulation (GDPR) is of the utmost importance for any company handling potentially sensitive data. Clearly, data privacy is important, but why exactly should we be concerned with it?

Motivation

There are many risks to holding private data. Let’s discuss some of these risks and the consequences that may result from them.

Reidentification

In a famous case (Narayanan A., Shmatikov V., “Robust De-anonymization of Large Sparse Datasets,” The University of Texas at Austin), researchers at the University of Texas at Austin showed that they could identify individuals in an anonymized database of Netflix subscriber ratings. The original dataset contained the ratings of subscribers who watched and rated movies on Netflix. The researchers combined it with an external dataset, public IMDb ratings, to match ratings made by the same individual. For example, if someone rated Captain America: The Winter Soldier 4.5/5 on Netflix and also rated it the same (or similarly) on IMDb, a match can potentially be made between the two records. Because IMDb ratings are public, the individual can then be identified. Of course, this only works when users rate movies on both Netflix and IMDb, but consider what additional information can be gained and then used predatorily.

If we know an individual watches lots of “rags to riches” movies and rates them highly, we might assume that they not only enjoy this genre but are also lower- to middle-class and dream of their own rags-to-riches story. We could then launch predatory advertising campaigns or otherwise use this information for personal gain. In this manner, sexual orientation, political leanings, and other sensitive attributes can be inferred just by combining the two datasets. This is only possible because the match reveals the full identity of the individual. Let’s recreate a toy example with fake Netflix and IMDb datasets.
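The linkage idea described above can be sketched in a few lines of Python. Everything here is made up for illustration: the anonymous IDs, usernames, ratings, the `link_records` helper, and the `tolerance` and `min_matches` thresholds are all assumptions. The scoring is a simple count of similarly rated shared movies, not the method used in the cited paper.

```python
from collections import defaultdict

# Fabricated "anonymized" Netflix ratings: (anonymous_id, movie, rating).
netflix = [
    (101, "Captain America: The Winter Soldier", 4.5),
    (101, "Rocky", 5.0),
    (101, "The Pursuit of Happyness", 4.0),
    (102, "Rocky", 2.0),
    (102, "Slumdog Millionaire", 3.5),
    (103, "Slumdog Millionaire", 5.0),
]

# Fabricated public IMDb ratings: (username, movie, rating).
imdb = [
    ("jane_doe", "Captain America: The Winter Soldier", 4.5),
    ("jane_doe", "Rocky", 5.0),
    ("jane_doe", "The Pursuit of Happyness", 4.5),
    ("movie_fan_42", "Rocky", 2.0),
    ("movie_fan_42", "Slumdog Millionaire", 3.5),
    ("movie_fan_42", "Captain America: The Winter Soldier", 1.0),
]


def by_user(rows):
    """Group (user, movie, rating) rows into {user: {movie: rating}}."""
    profiles = defaultdict(dict)
    for user, movie, rating in rows:
        profiles[user][movie] = rating
    return dict(profiles)


def link_records(netflix_rows, imdb_rows, tolerance=0.5, min_matches=2):
    """Link each anonymous ID to the public user with the most similar ratings.

    A link is kept only if at least `min_matches` shared movies were rated
    within `tolerance` of each other and the best candidate is unambiguous.
    Both thresholds are illustrative assumptions.
    """
    netflix_profiles = by_user(netflix_rows)
    imdb_profiles = by_user(imdb_rows)
    links = {}
    for anon_id, anon_ratings in netflix_profiles.items():
        # Score each public user by the number of similarly rated shared movies.
        scores = {}
        for name, public_ratings in imdb_profiles.items():
            shared = set(anon_ratings) & set(public_ratings)
            scores[name] = sum(
                1 for movie in shared
                if abs(anon_ratings[movie] - public_ratings[movie]) <= tolerance
            )
        if not scores:
            continue
        best = max(scores, key=scores.get)
        runner_up = max((s for n, s in scores.items() if n != best), default=0)
        if scores[best] >= min_matches and scores[best] > runner_up:
            links[anon_id] = best
    return links


links = link_records(netflix, imdb)
print(links)  # {101: 'jane_doe', 102: 'movie_fan_42'}
```

Anonymous user 103 stays unlinked because they have no similarly rated movie in common with any public profile, which mirrors the caveat that the attack only works for people who rate on both sites. Once 101 is tied to a public username, though, every other rating in that Netflix record is attached to a known identity.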
