These Breach Data Will Make You Rethink Data Security

Emy Cohen
Sep 4, 2020 · 5 min read

Data is one of the most valuable assets for companies. It is used to make informed decisions and forecasts. In many scenarios, the value of the data grows exponentially as the size of the data grows.

As companies profit from this data, attackers are also actively trying to steal it for profit. The scale of data breaches has also been getting increasingly large. I analyzed the breach data from https://haveibeenpwned.com/ to answer the following questions:

Q1: What are the top incidents?

Q2: Do the number of incidents and the breach scale increase over time?

Q3: Does the number of leaked accounts increase as the number of incidents increases?

Q4: Do any months have more incidents or more leaks than others across the years?

Q5: What is the ranking of the asset classes in these incidents?

Q1: What are the top incidents?

The top 9 data breaches ranked by the number of compromised accounts are shown above. The number of accounts leaked ranges from 0.17 to 0.78 billion. Some breaches happened to well-known companies such as LinkedIn, Zynga, and MySpace. Others may be less well known.

The largest data breach is from Verifications.io, a self-described “big data email verification platform.” The company’s main business is to verify a customer’s email campaign list, making sure that the addresses are valid and will not bounce. See below for in-depth coverage of the incident. https://www.wired.com/story/email-marketing-company-809-million-records-exposed-online/

The second largest data breach is from the Onliner Spambot, a spam operation that sends spam to victims. See below for in-depth coverage of the incident.
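A ranking like this can be sketched in a few lines of Python. This is only an illustration, not the code behind the chart: it assumes each record carries a `Name` and a `PwnCount` field, as in the Have I Been Pwned breach model, and the records below are made-up placeholders rather than real incidents.

```python
# Toy records shaped like Have I Been Pwned breach entries.
# Names and counts are made up for illustration only.
breaches = [
    {"Name": "ExampleA", "PwnCount": 1_200_000},
    {"Name": "ExampleB", "PwnCount": 45_000_000},
    {"Name": "ExampleC", "PwnCount": 300_000},
    {"Name": "ExampleD", "PwnCount": 9_800_000},
]


def top_breaches(records, n=9):
    """Return the n records with the most compromised accounts, largest first."""
    return sorted(records, key=lambda r: r["PwnCount"], reverse=True)[:n]


for rank, breach in enumerate(top_breaches(breaches, n=3), start=1):
    # Report the size in billions of accounts, the unit used in the article.
    print(f"{rank}. {breach['Name']}: {breach['PwnCount'] / 1e9:.4f} billion accounts")
```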

https://www.zdnet.com/article/onliner-spambot-largest-ever-malware-campaign-millions/#:~:text=The%20spambot%2C%20dubbed%20%22Onliner%2C,%2Dboggling%20amount%20of%20data.%22

Did you notice anything interesting here? The two largest data breaches both come from email-centric data aggregation operations. One was operated by a legitimate company and the other by the bad guys.

Q2: Do the number of incidents and the breach scale increase over time?

The graph above shows the number of breaches by year, from which we can make the following observations:

  1. The number of breaches increases over time from 2007.
  2. The number of breaches reaches its peak in 2016.

Keep in mind that the data contains the number of breaches disclosed, which is different from the total number of breaches that happened. In reality, some data breaches are never discovered, or are discovered after the year in which they happen. Some data breaches are discovered but not disclosed. Over the years, data breach notification laws have been changing, and each state has its own requirements. See below for more details:

One reasonable inference, given the order-of-magnitude increase from the 2007–2010 to the 2014–2020 time frame, is that the number of data breaches is likely increasing and that legal requirements are also likely tightening, forcing companies to disclose these breaches.

The graph above shows the number of accounts leaked by year. It is no surprise that as the number of incidents increases over the years, the scale of the breaches also increases, although the absolute scale of the breaches is worrisome.
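Both yearly series, the number of breaches and the number of accounts leaked, come from the same grouping step. The sketch below shows one way to do it with the standard library. It assumes records with an ISO-formatted `BreachDate` and a `PwnCount`, as in the Have I Been Pwned breach model; the sample records are invented, and the function name is hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

# Toy records; dates and counts are made up for illustration only.
breaches = [
    {"Name": "ExampleA", "BreachDate": "2015-06-12", "PwnCount": 1_000},
    {"Name": "ExampleB", "BreachDate": "2015-11-03", "PwnCount": 4_000},
    {"Name": "ExampleC", "BreachDate": "2016-01-20", "PwnCount": 250_000},
]


def summarize_by_year(records):
    """Return (breach count per year, accounts leaked per year)."""
    counts = Counter()
    totals = defaultdict(int)
    for record in records:
        year = date.fromisoformat(record["BreachDate"]).year
        counts[year] += 1
        totals[year] += record["PwnCount"]
    return dict(counts), dict(totals)


counts, totals = summarize_by_year(breaches)
for year in sorted(counts):
    print(year, counts[year], totals[year])
```

Note that this groups by the date of the breach itself. Grouping by the date a breach was disclosed or added to the dataset would give a different picture, for the reasons discussed above.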

Q3: Does the number of leaked accounts increase as the number of incidents increases?

I aggregated the total breaches by year and the total leaked accounts by year. We can see that the total number of leaked accounts does not necessarily increase monotonically with the number of breaches. This is intuitive, since some entities have aggregated much more data than others.
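One simple way to test the monotonic relationship is to order the years by breach count and check whether the leaked totals ever go down. The sketch below is a hedged illustration with made-up numbers; the function name and the input shape (year mapped to a pair of breach count and accounts leaked) are assumptions for this example, not the method used for the plot.

```python
def leaks_rise_with_breach_count(per_year):
    """per_year maps year -> (number_of_breaches, accounts_leaked).

    Returns True if, after ordering years by breach count, the leaked
    totals never decrease. Years with equal breach counts are ordered
    by their leaked totals, so ties do not count against monotonicity.
    """
    ordered = sorted(per_year.values())
    leaked = [accounts for _, accounts in ordered]
    return all(a <= b for a, b in zip(leaked, leaked[1:]))


# Made-up yearly aggregates for illustration only.
steady = {2015: (1, 10), 2016: (2, 20), 2017: (3, 30)}
uneven = {2015: (3, 500), 2016: (5, 200), 2017: (4, 900)}

print(leaks_rise_with_breach_count(steady))
print(leaks_rise_with_breach_count(uneven))
```

In the `uneven` example, the year with the most breaches has the fewest leaked accounts, which is the kind of pattern a single very large aggregator can produce.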

Q4: Do any months have more incidents or more leaks than others across the years?

The facet plot above shows the number of data breaches per month for each year. Although it is difficult to draw a universal conclusion, we can still see that in most years between 2011 and 2020 (note that the 2020 data covers January to August, since this article was written in September 2020), the number of incidents increases in the June and July time frame. In a few other years, peaks happen in the December and January time frame. One possible reason is that many attackers pick times when company employees go on vacation. This would make sense: the IT security team has fewer people on duty, and other employees are typically either more relaxed, which can make them more susceptible to social engineering, or not using their devices while on vacation.
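The monthly view boils down to counting breaches per (year, month) pair and then looking for the busiest month in each year. Here is a minimal sketch, assuming records with an ISO-formatted `BreachDate` as in the Have I Been Pwned breach model. The dates are invented, and the helper names are hypothetical.

```python
from collections import Counter
from datetime import date

# Toy records; dates are made up for illustration only.
breaches = [
    {"BreachDate": "2019-06-01"},
    {"BreachDate": "2019-06-15"},
    {"BreachDate": "2019-12-03"},
    {"BreachDate": "2020-01-10"},
    {"BreachDate": "2020-07-02"},
    {"BreachDate": "2020-07-20"},
]


def monthly_counts(records):
    """Count breaches for each (year, month) pair."""
    counts = Counter()
    for record in records:
        d = date.fromisoformat(record["BreachDate"])
        counts[(d.year, d.month)] += 1
    return counts


def peak_month_by_year(counts):
    """Return {year: busiest month}; the earliest month wins a tie."""
    peaks = {}
    for (year, month), n in sorted(counts.items()):
        if year not in peaks or n > counts[(year, peaks[year])]:
            peaks[year] = month
    return peaks


print(peak_month_by_year(monthly_counts(breaches)))
```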

Q5: What is the ranking of the asset classes in these incidents?

Among 121 asset categories, I plotted the top 10 that appear in the incidents. This shows the types of data attackers are aiming for the most. We can see that email addresses, usernames, and passwords are among the top targets. Personal information such as names, dates of birth, phone numbers, addresses, and gender is also frequently targeted. It is interesting that web activity is also among the top 10 categories. There are several reasons why attackers may be interested in it:

  1. Given web activity, attackers can potentially build a personal profile for each person, which can be used to create password-cracking dictionaries for the user’s other accounts.
  2. The profile can also be used to create social engineering campaigns such as phishing emails that the users are more likely to be interested in.
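Counting how often each asset category appears is a straightforward tally. The sketch below assumes each record lists its categories in a `DataClasses` field, as in the Have I Been Pwned breach model. The records are made-up placeholders, and the function name is hypothetical.

```python
from collections import Counter

# Toy records; names and category lists are made up for illustration only.
breaches = [
    {"Name": "ExampleA", "DataClasses": ["Email addresses", "Passwords"]},
    {"Name": "ExampleB", "DataClasses": ["Email addresses", "Usernames", "Passwords"]},
    {"Name": "ExampleC", "DataClasses": ["Email addresses", "Web activity"]},
]


def top_data_classes(records, n=10):
    """Return the n most frequent asset categories as (category, breach count) pairs."""
    counts = Counter()
    for record in records:
        # Use a set so a category is counted at most once per breach.
        counts.update(set(record["DataClasses"]))
    return counts.most_common(n)


for category, count in top_data_classes(breaches, n=2):
    print(f"{category}: {count}")
```

This counts the number of breaches in which a category appears, not the number of accounts affected. Weighting each category by the size of the breach would be a different, equally reasonable ranking.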

Conclusion

As our lives get more and more digitized, data becomes one of the most valuable assets for companies, and attackers today are highly financially motivated. This results in a higher number and larger scale of data breaches. Attackers also strategically pick their time to attack, and they collect not only data that they can sell but also data that can improve their intelligence.

I hope to see more technical, legal, and business innovations that can create a more secure online world!

What else would you like to see in breach data analysis? Write your comments below.
