On October 16, 2019, Bob Diachenko and Vinny Troia discovered a wide-open Elasticsearch server containing an unprecedented 4 billion user accounts spanning more than 4 terabytes of data.

The total count of unique people across all data sets exceeded 1.2 billion, making this one of the largest data leaks from a single source organization in history. The leaked data contained names, email addresses, phone numbers, and LinkedIn and Facebook profile information.

What makes this data leak unique is that it contains data sets that appear to originate from 2 different data enrichment companies.

How Does Data Enrichment Work?

For a very low price, data enrichment companies allow you to take a single piece of information on a person (such as a name or email address) and expand (or enrich) that user profile to include hundreds of additional data points. As seen with the Exactis data breach, the information collected on a single person can include household size, finances and income, political and religious preferences, and even a person’s preferred social activities.

Each time a company chooses to “enrich” a user profile, they are also agreeing to provide what they know about the person to the enriching organization (thereby increasing the validity of the organization’s future results). Despite efforts from social media organizations like Facebook, the resulting data continues to be compounded, creating a situation with no oversight that ultimately allows all of a person’s social and personal information to be easily downloaded.

The Open Elasticsearch Server

The discovered Elasticsearch server containing all of the information was unprotected and accessible via web browser at http://35.199.58.125:9200. No password or authentication of any kind was needed to access or download all of the data.

Elasticsearch stores its data in indexes, each of which is roughly analogous to a database. The following is a screenshot of the different indexes (databases) available on the discovered server.

[Image: Elasticsearch data leak 4tb]

The majority of the data spanned 4 separate indexes, labeled “PDL” and “OXY”, with information on roughly 1 billion people per index. Each user record within the databases carried a “source” field set to either PDL or Oxy, matching the index it was stored in.

Company 1: People Data Labs (PDL)

Based on our analysis of the data, we believe the data in the PDL indexes originated from People Data Labs, a data aggregator and enrichment company.

De-duplicating the nearly 3 billion PDL user records revealed roughly 1.2 billion unique people and 650 million unique email addresses, which is in line with the statistics provided on their website. The data within the three PDL indexes also varied slightly: some focused on scraped LinkedIn information, email addresses, and phone numbers, while others provided information on individual social media profiles such as a person’s Facebook, Twitter, and GitHub URLs.
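
As a rough illustration of what a de-duplication pass like this involves, the sketch below counts unique people and unique email addresses across a few records. It assumes records shaped like the sample shown in this article, with a comma-separated "emails" string and a "url_profile" field used as the person key. The key and the normalization rules are assumptions made for this example; the exact method used in the analysis is not described here.

```python
def normalize_email(raw):
    """Lowercase and trim an email so trivial variants compare equal."""
    return raw.strip().lower()


def split_emails(field):
    """Split the comma-separated `emails` string into a set of addresses."""
    if not field:
        return set()
    return {normalize_email(e) for e in field.split(",") if e.strip()}


def count_unique(records, person_key="url_profile"):
    """Return (unique_people, unique_emails) for an iterable of dict records.

    `person_key` is an assumption: records sharing the same value are
    treated as the same person. Records without that key are skipped
    when counting people but still contribute emails.
    """
    people = set()
    emails = set()
    for record in records:
        key = record.get(person_key)
        if key:
            # Ignore case and a trailing slash when comparing profile URLs.
            people.add(key.strip().lower().rstrip("/"))
        emails |= split_emails(record.get("emails"))
    return len(people), len(emails)


# Made-up records for illustration only (not taken from the leaked data).
sample = [
    {"url_profile": "https://www.linkedin.com/in/example-a",
     "emails": "a@example.com, A@Example.com"},
    {"url_profile": "https://www.linkedin.com/in/example-a/",
     "emails": "a@example.com,a.work@example.org"},
    {"url_profile": "https://www.linkedin.com/in/example-b",
     "emails": None},
]
print(count_unique(sample))
```

At the scale described here (billions of records), in-memory sets would not be practical; the same idea would more likely be expressed as an aggregation inside the data store or with an approximate distinct counter.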

According to their website, the PDL application can be used to search:

  • Over 1.5 Billion unique people, including close to 260 million in the US.
  • Over 1 billion personal email addresses. Work email for 70%+ decision makers in the US, UK, and Canada.
  • Over 420 million Linkedin urls
  • Over 1 billion facebook urls and ids.
  • 400 million+ phone numbers. 200 million+ US-based valid cell phone numbers.

Attribution to PDL

After notifying PDL, we were informed that the server in question does not belong to them. This is consistent with our research, as the server in question resided on Google Cloud, while the PDL API appears to use Amazon Web Services.

To test whether the data belonged to PDL, we created a free account on their website, which provides users with 1,000 free people lookups per month.

The following is a partially redacted sample of my personal record, downloaded from the 35.199.58.125 server.

{
  "id": null,
  "status": "created",
  "guid": null,
  "positions": [{
    "id": null,
    "title": "security evangelist, hacker, principal consultant",
    "description": null,
    "location": "saint louis, missouri, united states",
    "position_type": "Current",
    "company_name": "night lion security",
    "company_url": "twitter.com/nightlion",
    "start_date_year": 2015,
    "end_date_year": null,
    "start_date_month": 9,
    "end_date_month": null,
    "company_website": "nightlionsecurity.com",
    "company_size": "1-10",
    "company_industry": "information technology and services"
  }],
  "source": "PDL",
  "scheduled": null,
  "full_name": "vinny troia",
  "first_name": "vinny",
  "last_name": "troia",
  "url_profile": "https://www.linkedin.com/in/vinnytroia",
  "id_external_profile": "vinnytroia",
  "short_bio": "ceo, federal cyber / risk mgmt pro, hacker, problem solver, boundary breaker - featured: fox / cnbc / abc at night lion security. ceo, it risk management pro, hacker, problem solver, boundary breaker - featured: fox / cnbc / abc. cyber security pro | fedramp, fisma, nist guru | ethical hacker, hacking forensic investigator. cyber security pro | hacking forensic investigator | risk management, nist, fedramp. hacker, phd, cyber evangelist, keynote speaker, nist csf dissertation author. hacker, cybersecurity keynote speaker, osint, dfir, security evangelist. hacker, cyber evangelist, keynote speaker, nist csf dissertation author. health, environment and safety. greater st. louis area.",
  "is_deleted": false,
  "created_id": 1111,
  "created_dt": 1565870400000,
  "updated_id": 1111,
  "updated_dt": null,
  "timezone_id": null,
  "timezone_name": null,
  "timezone_geocoding_latitude": null,
  "timezone_geocoding_longitude": null,
  "lip_location": "ballwin, missouri, united states",
  "is_tc": null,
  "is_payment": null,
  "headline": null,
  "industry": "computer & network security",
  "linkedin_recruiter_profile_url": null,
  "location_shape": {
    "coordinates": [-90.54, 38.59],
    "type": "point"
  },
  "location_level": null,
  "emails": "vinnytroia@*, vinny@****, vt@***",
  "phone_numbers": "314*******,941*******,3146696569,1-636-825-2744",
  "experience_years": 4,
  "is_scheduled": null
}

Almost 100% Data Match

The data discovered on the open Elasticsearch server was an almost complete match to the data returned by the People Data Labs API. The only difference was that the data returned by the PDL API also contained education histories; there was no education information in any of the data downloaded from the server. Everything else was exactly the same, including accounts with multiple email addresses and multiple phone numbers.

To confirm, we randomly tested 50 other users and the results were always consistent.
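
A field-by-field comparison of this kind could be sketched as follows. This is a hypothetical helper, not the tooling used for the tests above. The two toy records are made up, and the "education" field name is assumed for illustration rather than confirmed as the name used by the PDL API. The "emails" and "phone_numbers" fields are treated as comma-separated lists because that is how they appear in the sample record.

```python
# Fields stored as comma-separated strings in the sample record.
LIST_FIELDS = {"emails", "phone_numbers"}


def normalize(field, value):
    """Make two values comparable regardless of case, spacing, or list order."""
    if isinstance(value, str):
        if field in LIST_FIELDS:
            return sorted(p.strip().lower() for p in value.split(",") if p.strip())
        return value.strip().lower()
    return value


def compare_records(leaked, api):
    """Report fields present in only one record and fields whose values differ."""
    leaked_keys, api_keys = set(leaked), set(api)
    shared = leaked_keys & api_keys
    return {
        "only_in_leaked": sorted(leaked_keys - api_keys),
        "only_in_api": sorted(api_keys - leaked_keys),
        "mismatched": sorted(
            k for k in shared if normalize(k, leaked[k]) != normalize(k, api[k])
        ),
    }


# Made-up records for illustration only.
leaked_record = {
    "full_name": "jane doe",
    "emails": "jane@example.com, jd@example.org",
    "industry": "computer & network security",
}
api_record = {
    "full_name": "Jane Doe",
    "emails": "jd@example.org,jane@example.com",
    "industry": "computer & network security",
    "education": [{"school": "example university"}],  # hypothetical field
}
print(compare_records(leaked_record, api_record))
```

With these toy inputs the only reported difference is the extra "education" field on the API side, which mirrors the pattern described above.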

An Interesting and Unique Match

One of the phone numbers returned for my profile was 1-636-825-2744. I do not remember ever having this phone number, so I decided to look into it. Roughly 10 years ago I was given a landline as part of an AT&T TV bundle. The landline was never used and never given to anyone – I never actually owned a phone, yet somehow this information appears in my profile.

When I checked my account on PeopleDataLabs.com, the returned results were identical – including that phone number.
Since I have never seen this phone number appear in any of my previously breached/leaked records, this is a very good indication that the leaked database originated from PDL.

Company 2: OxyData.io (OXY)

After some basic sleuthing, I came across OxyData.io, another data enrichment company. OxyData’s website claims to have 4TB of user data (exactly the amount discovered), but only 380 million people profiles.

OxyData Analysis

Analysis of the “Oxy” database revealed an almost complete scrape of LinkedIn data, including recruiter information.
Upon contacting OxyData, I was also informed that the server did not belong to them. Oxy was not willing to give me access to their API to test/compare profiles, but they were nice enough to send me a copy of my own record for analysis. The data they sent consisted mostly of scraped LinkedIn profile information, and appears to be a match for the leaked data.

Who is Accountable?

This is an incredibly tricky and unusual situation. The lion’s share of the data is marked as “PDL”, indicating that it originated from People Data Labs. However, as far as we can tell, the server that leaked the data is not associated with PDL. This raises a number of other questions. First, how did this mystery organization get the data? Are they a current or former customer? If so, the data discovered on the server indicates that this company is a customer of both People Data Labs and OxyData.

If this was a customer that had normal access to PDL’s data, then it would indicate the data was not actually “stolen”, but rather misused. This unfortunately does not ease the troubles of any of the 1.2 billion people who had their information exposed.

If this was not a breach, then who is accountable for this exposure?

The Problem of Attribution

Identification of exposed/nameless servers is one of the most difficult parts of an investigation. In this case, all we can tell from the IP address (35.199.58.125) is that it is (or was) hosted with Google Cloud.
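
To illustrate how an IP address can be tied to a cloud provider, the sketch below checks an address against a list of published network prefixes. It assumes a JSON document with a "prefixes" list whose entries carry an "ipv4Prefix" or "ipv6Prefix" plus a "scope", which is the shape we understand Google's published cloud range file to have. The toy data uses reserved documentation ranges, not real Google Cloud prefixes, and the function name is hypothetical.

```python
import ipaddress


def find_prefix(ip, ranges_doc):
    """Return the first prefix entry containing `ip`, or None.

    `ranges_doc` is a parsed JSON document shaped like
    {"prefixes": [{"ipv4Prefix": "...", "scope": "..."}, ...]}.
    """
    address = ipaddress.ip_address(ip)
    for entry in ranges_doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        # An IPv4 address is never "in" an IPv6 network (and vice versa),
        # so mixed-version entries are simply skipped by the membership test.
        if cidr and address in ipaddress.ip_network(cidr):
            return entry
    return None


# Toy document using reserved documentation ranges (not real provider data).
toy_ranges = {
    "prefixes": [
        {"ipv6Prefix": "2001:db8::/32", "scope": "example-region-1"},
        {"ipv4Prefix": "203.0.113.0/24", "scope": "example-region-2"},
    ]
}
print(find_prefix("203.0.113.7", toy_ranges))
print(find_prefix("198.51.100.1", toy_ranges))
```

A match of this kind only identifies the hosting provider and, at best, a region. It says nothing about which customer was operating the server, which is the limitation discussed here.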

Because of obvious privacy concerns, cloud providers will not share any information on their customers, making this a dead end.
Agencies like the FBI can request this information through legal process (a type of official Government request), but they have no authority to force the identified organization to disclose the breach.

One could argue that because PDL’s data was misused, it is up to them to notify their customers. One could also argue that the owner of 35.199.58.125 is responsible and liable for any potential damages. But legally, we have no way of knowing who that is without a court order.

Due to the sheer amount of personal information included, combined with the complexities of identifying the data owner, this has the potential to raise questions about the effectiveness of our current privacy and breach notification laws.

About Shadowbyte

Shadowbyte is a next-generation threat intelligence platform, providing organizations, investigators, and law enforcement with the ability to search across thousands of data breaches, with full historical visibility into private, deep, and darkweb hacker channels, pastes, and forums. Shadowbyte is designed for both brand monitoring and threat actor intelligence research. For more information on how we can help in data breach and cyber criminal investigations, please contact us.