Privacy

Anonymization vs. pseudonymization: The lingering data risk

Joe Bosso, 29 Sep 2021

Make sure you’re aware of the types of data you share with companies and how it will be used

In today’s environment, many people are aware that companies are tracking their online activity and mining that data for a variety of business purposes. These purposes range from personalization and advertising to analyzing on-site behavior to inform site design changes. Many users are all right with this data collection because they have been assured that the data will be anonymized and will not be linked to them personally. But is that really the case?

Before we get into the effectiveness of data anonymization, let’s first discuss what exactly it is. The European Union’s headline data protection law (the GDPR) describes anonymous data as “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.” This is the gold standard of data protection measures: it fully removes the link between the data and the person to whom it originally related. Because of that, truly anonymized data is no longer subject to most data protection rules. What this could mean for you, for example, is that if a company collects some of your personal data (like your name, gender, zip code, and date of birth), it can try to anonymize it by removing enough identifying information, such as your name, from the data set to remove the possibility of you being identified. The company thereby seeks to eliminate the risk to you if a data breach occurs and the other data points are leaked. The thinking behind this is that the data is harmless because it isn’t tied to an individual (e.g. 48 years old, female, located in New York City), so companies will sometimes share the data or make it publicly available.

This can, however, continue to be a significant privacy concern because researchers have shown that it’s often relatively easy to combine one of these data sets with another in order to (re-)identify individuals. Imperial College London conducted research that concluded "once bought, the data can often be reverse-engineered using machine learning to re-identify individuals, despite the anonymization techniques. This could expose sensitive information about personally identified individuals and allow buyers to build increasingly comprehensive personal profiles of individuals. The research demonstrates for the first time how easily and accurately this can be done — even with incomplete datasets. In the research, 99.98% of Americans were correctly reidentified in any available, “anonymized” dataset by using just 15 characteristics, including age, gender, and marital status." That’s a shockingly high accuracy which highlights that, in some cases, data anonymization is not at all effective at protecting personal data.
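To make the idea of combining data sets concrete, here is a minimal sketch of a linkage attack in Python. It uses simple exact matching on shared attributes rather than the machine learning approach described in the Imperial College London research, and every record, name, and field in it is made up for illustration.

```python
# Hypothetical "anonymized" records: names removed, other attributes kept.
anonymized = [
    {"zip": "10001", "dob": "1973-04-12", "gender": "F", "diagnosis": "asthma"},
    {"zip": "10002", "dob": "1988-09-30", "gender": "M", "diagnosis": "diabetes"},
]

# Hypothetical public data set (for example, a voter roll) that includes names.
public = [
    {"name": "Alice Example", "zip": "10001", "dob": "1973-04-12", "gender": "F"},
    {"name": "Bob Example", "zip": "10002", "dob": "1988-09-30", "gender": "M"},
    {"name": "Carol Example", "zip": "10003", "dob": "1990-01-01", "gender": "F"},
]

# Attributes that appear in both data sets and can be used to join them.
SHARED_KEYS = ("zip", "dob", "gender")


def link(anonymized_rows, public_rows, keys=SHARED_KEYS):
    """Pair each anonymized row with a name when exactly one public row matches."""
    index = {}
    for row in public_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row["name"])

    matches = []
    for row in anonymized_rows:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # an unambiguous match re-identifies the person
            matches.append((names[0], row))
    return matches


reidentified = link(anonymized, public)
for name, row in reidentified:
    print(name, "->", row["diagnosis"])
```

In this toy example both "anonymized" records are tied back to a named person, even though neither one contained a name.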

In this sort of situation, in which the individuals, or “data subjects”, can be re-identified relatively easily, what has taken place is not true “anonymization” by the standards of the GDPR (which is very hard to achieve), because ultimately the individuals were still “identifiable”. When determining whether someone is still identifiable, you need to take into account all methods reasonably likely to be used by anyone to (re-)identify that person, directly or indirectly. This includes merging datasets, as discussed above.

In these sorts of cases, it may be better to consider the data as “pseudonymous”, rather than fully anonymized. The GDPR defines pseudonymization as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information”, as long as that additional information is kept separate. Although pseudonymization has many uses, it should be distinguished from anonymization. This is because, in many cases, it provides only limited protection for the identity of data subjects, since it still allows identification by indirect means. Where a pseudonym is used, it is often possible to identify the data subject by analyzing the underlying or related data. Therefore, personal data can only be considered truly anonymized when it’s no longer reasonable to expect that individuals could be identified. With advances in data analytics and machine learning, it’s getting harder and harder to ensure this is the case.
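As an illustration of what pseudonymization can look like in practice, the sketch below replaces a name with a keyed hash (HMAC-SHA256 from Python's standard library). This is only one possible approach, not a method prescribed by the GDPR. The secret key plays the role of the “additional information” that must be kept separate, and the record and the function names `make_key` and `pseudonymize` are hypothetical.

```python
import hashlib
import hmac
import secrets


def make_key() -> bytes:
    """Generate a random secret key; it must be stored apart from the data."""
    return secrets.token_bytes(32)


def pseudonymize(identifier: str, key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (the pseudonym)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


key = make_key()

# Hypothetical record, for illustration only.
record = {"name": "Alice Example", "zip": "10001", "dob": "1973-04-12", "gender": "F"}

pseudonymous = {
    "subject_id": pseudonymize(record["name"], key),
    "zip": record["zip"],
    "dob": record["dob"],
    "gender": record["gender"],
}
print(pseudonymous)
```

Note that the zip code, date of birth, and gender are still present in the pseudonymous record, which is exactly why this kind of data can often be re-identified by indirect means.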

How can companies and consumers protect their data?

Rajesh Parthasarathy points out to Forbes that, to prevent this kind of re-identification from occurring, companies should take additional steps to protect the data that they are collecting. First, they should be aware of the re-identification risks of their data sets. For example, researchers have shown that the combination of zip code, date of birth, and gender can act as a “quasi-identifier” that uniquely identifies 87% of the US population.
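One simple way for a company to gauge this risk is to count how many records in a data set are unique on a given combination of attributes. The following Python sketch does that for a handful of made-up records. The function name `unique_fraction` and the data are assumptions for illustration, and a real assessment would need to consider far more than exact uniqueness.

```python
from collections import Counter


def unique_fraction(rows, keys):
    """Return the share of rows whose combination of `keys` appears only once."""
    if not rows:
        return 0.0
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if counts[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


# Hypothetical records, for illustration only.
rows = [
    {"zip": "10001", "dob": "1973-04-12", "gender": "F"},
    {"zip": "10001", "dob": "1973-04-12", "gender": "F"},
    {"zip": "10002", "dob": "1988-09-30", "gender": "M"},
    {"zip": "10003", "dob": "1990-01-01", "gender": "F"},
]

fraction = unique_fraction(rows, ("zip", "dob", "gender"))
print(f"{fraction:.0%} of records are unique on zip, date of birth, and gender")
```

The higher this share, the more records could be singled out by anyone holding another data set that contains the same attributes alongside names.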

Companies should therefore alter their data so that it cannot be re-identified if it’s leaked or if they plan on sharing the data with others. One technique for this is differential privacy, which adds carefully calibrated random noise to the results of queries so that no single person’s presence in the data set can be confidently inferred. However, these companies also need to preserve the data well enough that it is still usable (and valuable) for their analysis. It’s a balancing act for them to weigh the costs and benefits when handling data.
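To give a flavor of how differential privacy works, here is a minimal sketch of the Laplace mechanism applied to a counting query, written in Python. The function name `dp_count`, the parameter values, and the records are assumptions for illustration. It is a teaching sketch rather than a production implementation, which would need a vetted library and careful handling of randomness and privacy budgets.

```python
import random


def dp_count(rows, predicate, epsilon, rng=None):
    """Return a noisy count of rows matching `predicate` (Laplace mechanism)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    # Adding or removing one person changes a count by at most 1,
    # so the noise scale is 1 / epsilon.
    rate = epsilon
    # The difference of two exponential samples follows a Laplace distribution.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


def is_over_40(person):
    return person["age"] > 40


# Hypothetical records, for illustration only.
patients = [
    {"age": 48, "gender": "F"},
    {"age": 35, "gender": "M"},
    {"age": 52, "gender": "F"},
    {"age": 29, "gender": "F"},
    {"age": 61, "gender": "M"},
]

noisy = dp_count(patients, is_over_40, epsilon=0.5)
print(f"Noisy count of patients over 40: {noisy:.2f}")
```

A smaller epsilon means more noise and stronger privacy but less accurate answers, which is the balancing act between protection and usefulness in miniature.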

As a consumer, you are placed in the unenviable position of relying on companies to be good stewards of your data, even when doing so could hurt their bottom line. Companies may think they get a free pass after a data breach if the data has undergone some anonymization techniques. However, unless the data was truly, irreversibly anonymized, their claims that the data cannot be tied back to individuals are misleading, and there should be more accountability and awareness of the risks associated with partially anonymized datasets.

Due to this uncertainty around the safety of your data, we urge you to exercise caution and make sure you’re aware of the types of data you share with companies and how it will be used.

