Understanding scraping in the Facebook, LinkedIn, and Clubhouse data leaks

Christopher Budd 27 Apr 2021

When it comes to your data, it’s best to remember that public is always public

Over the past few weeks, data on millions of Facebook, LinkedIn, and Clubhouse users has leaked online. Taken together, the three incidents cover information on more than 1 billion users. But you may have also heard all three companies say there wasn’t a hack. So what’s the deal?

If you’re concerned about your personal information, you might be wondering what’s going on, since these two things seem contradictory. On the one hand, the information of over a billion users of these services is suddenly out there. On the other hand, all three services are saying there was “no hack” behind it. Can both be true? If so, how?

They are both true. They’re true because the issue behind all three of these events is something called “scraping.” All three services have ultimately blamed scraping for the collection of the data, and scraping is different from a hack or an attack.

The difference between “scraping” and “hacking” may not matter to you: If your data is out there from any or all of these events and you don’t want it out there, the end result is the same. And, unfortunately, protecting your information against scraping is your job. No one else is going to do it for you.

Because of that, it’s important to understand what scraping is and how it works. That way, you can take steps to better protect your personal information from situations like this in the future.

What is scraping?

“Scraping” is a shortening of “screen scraping.” Screen scraping is when a program or script takes information from a web page or service and copies it, basically “scraping” the information from the screen.


Further reading:
Everything you should know about social media scraping
The Facebook data leak: What you should do today
Simple steps to take back your privacy from Facebook


For example, if you have a public website that lists names and phone numbers for people in different departments on separate web pages, someone can build a program or script to “scrape” that website, gather all of those names and phone numbers from the separate pages, and put them into one list.
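To make the department-page example concrete, here is a minimal sketch in Python. Everything specific in it is an assumption made for illustration: the page paths, the HTML layout (names and phone numbers in spans with class "name" and "phone"), and the made-up contacts. The pages are inlined as strings so the sketch runs without touching a real site; a real scraper would download each page first.

```python
from html.parser import HTMLParser

# Hypothetical department pages, inlined so the sketch runs without a network.
PAGES = {
    "/sales": (
        '<ul><li><span class="name">Ana Ruiz</span> '
        '<span class="phone">555-0101</span></li>'
        '<li><span class="name">Ben Cho</span> '
        '<span class="phone">555-0102</span></li></ul>'
    ),
    "/support": (
        '<ul><li><span class="name">Dee Park</span> '
        '<span class="phone">555-0201</span></li></ul>'
    ),
}


class ContactParser(HTMLParser):
    """Collects (name, phone) pairs from spans with class "name" / "phone"."""

    def __init__(self):
        super().__init__()
        self.contacts = []   # finished (name, phone) pairs
        self._field = None   # which span we are currently inside, if any
        self._name = None    # name still waiting for its phone number

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            cls = dict(attrs).get("class")
            if cls in ("name", "phone"):
                self._field = cls

    def handle_data(self, data):
        if self._field == "name":
            self._name = data.strip()
        elif self._field == "phone" and self._name is not None:
            self.contacts.append((self._name, data.strip()))
            self._name = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def scrape(pages):
    """Visit every page and merge all contacts into one list."""
    combined = []
    for path, html in pages.items():
        parser = ContactParser()
        parser.feed(html)
        combined.extend((path, name, phone) for name, phone in parser.contacts)
    return combined


all_contacts = scrape(PAGES)
for row in all_contacts:
    print(row)
```

Nothing here bypasses a login or a protection. The script only reads what any visitor to those pages could already see and copies it into a single list.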

In some cases, scraping can actually be useful because it pulls together dispersed data into one place. Having a single list of names and phone numbers is easier to use than looking across multiple webpages. 

Most importantly, scraping gathers data that’s already accessible. In our example, it’s a public website, so all that has happened is that information that was already public has been gathered together and made easier to access and use. If scraping gathered information that wasn’t already accessible, that would be a hack or an attack. But scraping itself doesn’t gather data that’s been hidden or protected: it’s just collecting data that was already publicly out there.

All three companies have indicated that the information that’s out there was already publicly available and is the result of scraping. In other words, people have written scripts or programs that copy and gather information that was already public on their services in order to create these massive lists of data.

What makes the results of scraping scary isn’t that there’s new data that’s been leaked; it’s that information that was already public is now gathered in a different form that is easier to store, catalog, and search.
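A short sketch shows why the gathered form is easier to search. The rows below are invented, and the function name build_index is hypothetical; the point is only that once records sit in one list, a few lines of code can turn them into a lookup table keyed by phone number.

```python
from collections import defaultdict

# Hypothetical rows as a scraper might have collected them: (source, name, phone).
rows = [
    ("/sales", "Ana Ruiz", "555-0101"),
    ("/support", "Dee Park", "555-0201"),
    ("/events", "Ana Ruiz", "555-0101"),
]


def build_index(rows):
    """Group every record under its phone number for instant lookup."""
    by_phone = defaultdict(list)
    for source, name, phone in rows:
        by_phone[phone].append((name, source))
    return by_phone


index = build_index(rows)

# One lookup now answers "whose number is this, and where was it listed?"
print(index["555-0101"])
```

Before the data was gathered, answering that question meant visiting page after page. Afterward, it is a single lookup.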

What you can do about scraping

Odds are that when you signed up for any of these services and put your information out there publicly, you were okay with that information being seen on a Facebook, LinkedIn, or Clubhouse page. You may not have expected that the same information could end up in big, publicly available data lists like these. It’s one thing to know that someone can find your phone number by navigating to a Facebook page and finding it there. It’s another to know that the information is now in big, searchable files on the internet.

This is where the “hack” versus “scraping” distinction should matter most to you.

These companies are right: they haven’t been hacked; the information was already public. But if you’re not okay with your information ending up in this format, you’ll have to take matters into your own hands.

First, it’s important to understand that any information that is public is always at risk of scraping. Whether it’s a web page or a social media platform, it’s best to remember that, where data is concerned, public is always public. When data is public, you have no control over who copies it and what they do with it. If it’s public, it can be on the internet outside of your control forever.

Second, the only way to ensure that your public data isn’t scraped or used in ways you might not expect is simply to not make it public. If you’re not comfortable with the information potentially ending up in lists like this, either protect it by using privacy controls (if they’re available) or, better yet, don’t put it out there at all.

And if you’ve got data out there now that you don’t want out there, what can you do about that? Unfortunately, nothing. That’s why it’s so important to be sure that the information you put out there is information you’re okay losing control of. Because once it’s out there, there’s no getting it back.

In the end, these data leaks are a reminder that public information is public information, and that if you want to protect your information, the only way to be sure is to not make it public. This is important to understand because the odds are good that there will be more scraping incidents like this in the future. And as we’ve seen in these cases, the only one who can and will protect your data against scraping is you. And once the data is out there, there’s no getting it back.
