The Big Brazil Data Leak

Article by Felipe Daragon and Syhunt Icy Team. February 12, 2021

Thanks to the more than 8,000 companies that contacted us after the initial article and requested more information about how they were exposed by the leak. While we wait for the authorities' next moves, we continue to monitor news and updates regarding the leak. Below you can find our key findings and analysis.


Our Analysis

Through our expert analysis and participation in a series of articles by the media, we helped highlight the scale of the mega leak that exposed data from almost all Brazilians in January 2021.

We concluded that between 673GB and 873GB (nearly 1TB) of data about Brazilian companies and individuals was stolen in 2020 and compiled into a single archive, likely from multiple leaks that occurred over time. As a result, key details of a staggering 223 million Brazilian individuals and 40 million Brazilian companies were exposed and are being actively sold by cybercriminals on Internet forums and the Dark Web.

Following a request by the Estadão newspaper, we analyzed the case together with the newspaper. We published a numerical breakdown of the leak, among other relevant details, to inform the public and businesses and to raise awareness of the troubling scale of the leak.

Days later, the publication of a second article by Estadão prompted the Brazilian Supreme Federal Court to order an investigation and the blocking of access to the cybercriminal's posts and links. Since then, an investigation by the Brazilian authorities has been underway.

Additional analyses by Syhunt in partnership with the newspaper revealed that (1) the face pictures in the cybercriminal's archive were actually copied from DivulgaCand and (2) half a million corporate mobile numbers were exposed.

The leak in numbers

At Estadão's request, we processed the catalog and small samples published by the cybercriminal, simulated individual CSV exports and performed a series of calculations to check the cybercriminal's claims, build an overall picture and closely estimate the full size of the leak and of the databases in the cybercriminal's hands:

  • Total exposed: 223M Brazilian individuals; 40M Brazilian companies; 104M vehicles
  • Information categories: 37 (people); 17 (companies); 1 with 48 columns (vehicles)
  • Uncompressed database size: 650GB est. (people); 200GB est. (business); 23GB (vehicles)
  • Data size per record: 3KB est. per person (without face picture); 4KB est. per company; approx. 200 bytes per vehicle
  • Records in leaked samples: 39,645 people; 22,983 companies; 288,167 vehicles
  • Face pictures in database: 1.1M, 16GB est. (people); N/A (companies); N/A (vehicles)
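
The simulated CSV exports mentioned above can be sketched roughly as follows. This is a minimal illustration, not the tooling Syhunt used: the column names are loosely based on the Basic business data set described in this article, every value is a made-up placeholder, and the helper names csv_export_size and average_export_size are hypothetical.

```python
import csv
import io


def csv_export_size(columns, row):
    """Size in bytes of a one-record CSV export (header line + data line), UTF-8 encoded."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerow(row)
    return len(buffer.getvalue().encode("utf-8"))


def average_export_size(columns, rows):
    """Average export size in bytes over several simulated records."""
    sizes = [csv_export_size(columns, row) for row in rows]
    return sum(sizes) / len(sizes)


# Hypothetical columns and placeholder records (no real data).
columns = ["cnpj", "corporate_name", "trade_name", "foundation_date", "employees"]
rows = [
    ["00000000000000", "EXAMPLE COMPANY LTDA", "EXAMPLE", "2000-01-01", "10"],
    ["11111111111111", "ANOTHER PLACEHOLDER SA", "PLACEHOLDER", "1995-06-30", "250"],
]

print(f"average simulated export: {average_export_size(columns, rows):.1f} bytes")
```

Averaging over several simulated records, rather than measuring a single one, smooths out differences in field lengths between individual entries.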


The leak in numbers: Phone numbers

  • Brazilians with phone details exposed (mobile/landline): 159M (159,845,321)
  • Companies with phone details exposed (mobile/landline): 28M (28,695,845)
  • Mobile phone numbers in leaked samples: 6,945
  • Corporate mobile phone numbers in leaked samples: 532,696

Total of Leaked Corporate Mobile Phone Numbers in Samples - By State

  • Paraná: 205,640
  • São Paulo: 202,829
  • Minas Gerais: 19,801
  • Rio Grande do Sul: 17,802
  • Rio de Janeiro: 14,721
  • Distrito Federal: 6,513
  • Santa Catarina: 5,134
  • Espírito Santo: 4,080
  • Mato Grosso: 836
  • Tocantins: 439
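
As a rough reading aid for the by-state figures, the short sketch below computes each listed state's share of the corporate mobile numbers found in the samples. It assumes that the 532,696 sample total quoted earlier covers all states, including any not listed here; the variable names are illustrative only.

```python
SAMPLE_TOTAL = 532_696  # corporate mobile numbers in leaked samples (from the article)

by_state = {
    "Paraná": 205_640,
    "São Paulo": 202_829,
    "Minas Gerais": 19_801,
    "Rio Grande do Sul": 17_802,
    "Rio de Janeiro": 14_721,
    "Distrito Federal": 6_513,
    "Santa Catarina": 5_134,
    "Espírito Santo": 4_080,
    "Mato Grosso": 836,
    "Tocantins": 439,
}

listed_total = sum(by_state.values())
# Assumption: whatever is left over belongs to states not shown in the list.
unlisted_total = SAMPLE_TOTAL - listed_total

for state, count in by_state.items():
    print(f"{state}: {count:,} ({count / SAMPLE_TOTAL:.1%})")
print(f"Listed states: {listed_total:,}; other states: {unlisted_total:,}")
```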

The source of the leak

We named this leak BLB20 (Big Leak of Brasil 2020) because the cybercriminal's data is up to date as of 2020. Much has been speculated about the source of the leak, and we will likely learn more about it as the investigations by the Brazilian authorities, information security companies and media organizations progress.

Part of this data breach may have been an inside job, carried out deliberately and maliciously by an employee of a firm, an opinion shared by many security researchers. We believe that cybercriminals or some analytics company compiled various leaks that happened over the years into a single archive. We concluded, and Estadão later confirmed, that the face pictures in the database were copied from TSE's DivulgaCand, which appears to confirm the compilation of data from multiple leaks and sources.

The cybercriminal referred to his archive as the Serasa Experian database. Serasa Experian is a major Brazilian credit research firm, but the company stated that it carried out an internal investigation and that the data in the leaked archive does not match the data found in the company's database.

Another mega leak? On January 10, reports of a second mega leak emerged. Although they come from a credible source that also alerted about the first leak, the lack of references means we have not been able to confirm the new leak. This analysis and article are about the first leak only.

The Leaked Information Categories

The following are the categories of information revealed in the leak and the estimated size of each individual database:

Business / Legal Entity Data

Data Set Name: Description (Estimated Size)

  • 01 - Basic: CNPJ, corporate name, trade name, registration (head office / branch, situation), date of foundation, number of employees, size, legal nature (8.3GB)
  • 02 - Email (2.9GB)
  • 03 - Telephone: area code, number, operator, plan, line type (fixed, prepaid, postpaid), installation date (48.2GB)
  • 04 - Address: street address, number, neighborhood, city, state, zip code, type (Residential / Commercial), latitude and longitude (8.5GB)
  • 05 - Mosaic: targeting group and subgroup (1.7GB)
  • 06 - Business: name and CPF of the company’s partners, participation (shares and %), date of entry into the company (45.9GB)
  • 07 - IRS: foundation date, registration status (Active / Downloaded / Inept) (5.5GB)
  • 08 - Credit Score: risk score, risk level (Low / Medium / High) (2.2GB)
  • 09 - Legal Representative: CPF and name of representative, registration status (Active / Downloaded / Unfit) (2.0GB)
  • 10 - Checks without Funds: bank code and branch, reason (No funds / Account closed) (0.1GB)
  • 11 - Operating Class: hours of operation (24h, commercial 9 am to 6 pm, lunch, night etc.), type of distribution (physical retail, online retail, physical wholesale) (0.2GB)
  • 12 - National Simple and SIMEI: situation (Opt / Non-opt) (4.3GB)
  • 13 - Legal Nature: corporation, individual entrepreneur, cooperative, public agency, etc. (2.6GB)
  • 14 - Share Capital Value (1.7GB)
  • 15 - Debtors: type (principal, co-responsible), responsible unit, registration, type of credit (fine, IRPJ, COFINS, CSLL etc.), amount (9.5 - 20GB)
  • 16 - Sintegra: state registration number, activity start date, registration status (1.4GB)
  • 17 - CNAE (3.8GB)

All Data Sets - Approx. Total Size: 150 - 200GB
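
As a rough cross-check, the per-data-set estimates for business data can be added up and compared with the stated total. The sketch below uses only the sizes listed above; the variable names are illustrative. The sum comes to roughly 149 to 159 GB, which sits at the lower end of the 150 - 200GB range given for all business data sets.

```python
# Estimated sizes in GB for each business data set, as listed in the article.
fixed_sizes_gb = {
    "01 - Basic": 8.3,
    "02 - Email": 2.9,
    "03 - Telephone": 48.2,
    "04 - Address": 8.5,
    "05 - Mosaic": 1.7,
    "06 - Business": 45.9,
    "07 - IRS": 5.5,
    "08 - Credit Score": 2.2,
    "09 - Legal Representative": 2.0,
    "10 - Checks without Funds": 0.1,
    "11 - Operating Class": 0.2,
    "12 - National Simple and SIMEI": 4.3,
    "13 - Legal Nature": 2.6,
    "14 - Share Capital Value": 1.7,
    "16 - Sintegra": 1.4,
    "17 - CNAE": 3.8,
}

# "15 - Debtors" is given as a range, so it is kept separate.
debtors_range_gb = (9.5, 20.0)

fixed_total_gb = sum(fixed_sizes_gb.values())
low_total_gb = fixed_total_gb + debtors_range_gb[0]
high_total_gb = fixed_total_gb + debtors_range_gb[1]

print(f"Sum of listed data sets: {low_total_gb:.1f} - {high_total_gb:.1f} GB")
```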

Personal Data

  • 01 - Basic: person's name, CPF, gender, date of birth, father’s name, mother’s name, marital status (married, single, divorced, widowed, others)
  • 02 - Email
  • 03 - Telephone: Area code, number, operator, plan, line type (fixed, prepaid, postpaid), installation date
  • 04 - Address: street address, number, neighborhood, city, state, zip code, type (residential / commercial), latitude and longitude; households: CPF of householder, number of persons, income bracket, full address; schooling: level (illiterate / elementary / technical / higher etc.)
  • 05 - Mosaic: targeting group and subgroup
  • 06 - Occupation: position, number CBO (Brazilian Classification of Occupations)
  • 07 - Credit Score: credit activity, risk score, risk level (Low / Medium / High)
  • 08 - RG (Identity Card)
  • 09 - Voter Title: registration number, zone, section, address, county, state
  • 10 - Education
  • 11 - Business: name of the partner of a company, participation (shares and %), corporate name and trade name of the company, CNPJ, date of entry into the company
  • 12 - IRS: cadastral situation (Regular / Suspended / Canceled / Deceased Holder)
  • 13 - Social Class: A1, A2, B1, B2, C1, C2, D, E
  • 14 - Marital Status: married, single, divorced, widowed, others
  • 15 - Job: CNPJ and corporate name of the employer, PIS / PASEP / NIT number, CTPS number, type of employment (CLT, self-employed, server, apprentice etc.), date of admission, salary, hours of work per week
  • 16 - Affinity: accuracy level, percentile
  • 17 - Analytical Model: predicts chance of consumer having affinity to buy a product or service
  • 18 - Purchasing Power: level (low, medium, high), income, salary
  • 19 - Photos of Faces: 1,176,157 JPEG images with dates between 2012 and 2020; the file name is the CPF of the corresponding person
  • 20 - Public Servants: job description, capacity, exercise, gross income, status, bond, removal (Yes / No)
  • 21 - Checks without Funds: bank code and branch, reason (No funds / Account closed)
  • 22 - Debtors: name, type of debtor (principal, co-responsible), situation (active, in collection, filed), type of debt (fine, income tax, PIS etc.), amount, did it end up in court? (Yes / No)
  • 23 - Family Grant: amount, status of benefit (Released / Blocked), status of benefit (Active / Inactive), number and name of dependents, NIS (Social Identification Number)
  • 24 - University / College Students: 1,643,105 people with college name, course, year of entry and year of completion
  • 25 - Advisers: 2,260,960 people who provide consultancy in the public or private sphere, including situation, specialty and occupation code
  • 26 - Households: all the people who share the same address
  • 27 - Family Bond: categorizes people according to a first degree (mother, father, son, daughter, brother, sister, spouse) or second degree (grandfather, grandson, uncle, nephew, cousin, etc.)
  • 28 - LinkedIn: 5,051,553 social network profiles with ID number and access URL
  • 29 - Salary: value, type (monthly, biweekly, weekly, etc.), hours per week
  • 30 - Income: monthly amount (includes salary, rent, interest, etc.), social class (low, medium, high), income range
  • 31 - Deceased: date of death, age, date of death certificate, name and address of the registry office.
  • 32 - IRPF (Income Tax): bank institution name, branch code, refund lot
  • 33 - INSS: insured’s name, benefit number, start date, type (retirement, pension, maternity salary, etc.)
  • 34 - FGTS: PIS number
  • 35 - CNS (National Health Card)
  • 36 - NIS (Social Identification Number)
  • 37 - PIS / PASEP

All Data Sets - Approx. Total Size: 500 - 650GB

Vehicles Data

  • ID: internal database number
  • Kind of Person: physical or legal
  • Update Date: varies from 1993 to 2020
  • License Plate: in old or new format
  • Municipality and UF (state) of the license plate
  • Vehicle Situation
  • Restrictions: without restriction, restricted by theft, pledge, fiduciary alienation, etc.
  • Chassis Number
  • Chassis Situation: Normal, Restricted
  • Engine Number
  • Gearbox Number (if applicable)
  • Body Number (if applicable)
  • Body Type: open, closed, jeep, van, double cab, motorcycle etc.
  • Invoiced Document Type
  • Billed UF
  • Billed: contains sequence of numbers related to the invoiced document, such as invoice
  • Brand and Model: there are 37 thousand different models
  • Model Year
  • Year of Manufacture
  • Vehicle Color
  • Vehicle Type: bicycle, moped, scooter, motorcycle, automobile, bus, truck, etc.
  • Kind of Vehicle: passenger, cargo, mixed, traction, collection etc.
  • Fuel: gasoline, alcohol, diesel, natural gas, electric, etc.
  • Power: power in HP
  • Displacement
  • Maximum Traction Capacity
  • Total Gross Weight
  • Battery Capacity
  • Number of Passengers
  • Number of Axles
  • Nationality: domestic or imported
  • DI: import declaration
  • Importer’s Identity
  • Type of document of the importer

How we got to the numbers

We produced the above analyses and estimates in collaboration with the media, including Estadão, Folha de São Paulo and Tecnoblog. As long-time information security researchers and professionals, we acted responsibly throughout: we did not seek to contact the cybercriminal or to purchase data sets from him, and we did not obtain a copy of the archive whose full size we estimated above. We also did not seek to profit financially from the leak in any way.

  • Est. Data Size Per Person (Without Face Pic) and Est. Data Size Per Company: based on the samples and data set catalog provided by the cybercriminal, we simulated CSV exports of the data of single individuals and companies. We concluded, for example, that the leaked business data about Syhunt itself amounted to around 7.33 KB of text. After examining the size of multiple simulated exports, we estimated the data size per person and per company.
  • Approx. Data Size Per Vehicle: the usual size of each line of the leaked vehicles archive.
  • Total of Face Pictures in Database - 20GB est: we divided the size in bytes of the sample photo archive (17.3 MB) by the 1,334 JPEG files it contains to get an average size per picture, then multiplied that average by the number of face pictures in the full archive (1.1M, or to be more exact 1,176,157).
  • Est. Uncompressed People Database Size: we multiplied the estimated data size per person in bytes by the total number of Brazilian individuals exposed, then added the estimated uncompressed size of the face pictures in the database.
  • Est. Uncompressed Business Database Size: we multiplied the estimated data size per company in bytes by the total number of Brazilian legal entities exposed. We also processed the cybercriminal's catalog information with software, per data set column, to generate the per-data-set estimates listed under The Leaked Information Categories.
  • Est. Uncompressed Database Size (All Databases) - Nearly 1 TB: the sum of the People, Business and Vehicles estimated database sizes.
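
The steps above can be sketched as a short calculation. This is an illustrative reconstruction under stated assumptions, not Syhunt's actual code: it assumes binary units (1 KB = 1,024 bytes, 1 GB = 1,024^3 bytes), which the article does not specify, and all names are hypothetical. Under these assumptions the people estimate lands close to the 650GB figure and the face pictures at roughly 15GB; the last step adds the low and high ends of the per-database ranges and reproduces the 673GB to 873GB span quoted in the analysis.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3  # assumption: the article's "GB" is treated as binary here

# Figures taken from the article.
PEOPLE = 223_000_000
COMPANIES = 40_000_000
FACE_PICTURES = 1_176_157
SAMPLE_ARCHIVE_BYTES = 17.3 * MIB  # sample photo archive
SAMPLE_PICTURES = 1_334            # JPEG files in the sample archive
BYTES_PER_PERSON = 3 * KIB         # est., without face picture
BYTES_PER_COMPANY = 4 * KIB        # est.


def to_gb(size_in_bytes):
    return size_in_bytes / GIB


# Face pictures: average sample picture size times pictures in the full archive.
avg_picture_bytes = SAMPLE_ARCHIVE_BYTES / SAMPLE_PICTURES
face_pictures_gb = to_gb(avg_picture_bytes * FACE_PICTURES)

# People and business databases: size per record times number of records.
people_gb = to_gb(PEOPLE * BYTES_PER_PERSON) + face_pictures_gb
business_gb = to_gb(COMPANIES * BYTES_PER_COMPANY)

# Overall range: sum of the stated per-database ranges (in GB).
people_range = (500, 650)
business_range = (150, 200)
vehicles_size = 23
total_low = people_range[0] + business_range[0] + vehicles_size
total_high = people_range[1] + business_range[1] + vehicles_size

print(f"Face pictures: {face_pictures_gb:.1f} GB")
print(f"People database: {people_gb:.1f} GB")
print(f"Business database: {business_gb:.1f} GB")
print(f"All databases: {total_low} - {total_high} GB")
```

Switching the unit constants to decimal (1 KB = 1,000 bytes) shifts the individual estimates by a few percent but leaves the overall picture of nearly 1 TB unchanged.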

Conclusion

This is the biggest and most serious data leak that Brazil has ever experienced. Syhunt recommends real, immediate and continuous efforts, by the government and private sector, to vigorously respond to this leak, which must include, among other things:

  • Accelerate response to this leak and future leaks.
  • Suppress the selling of the leaked information.
  • Prevent the leaked data from being actively exploited by criminals.
  • Create new mechanisms to detect, monitor and report leaks.
  • Cooperate internationally with other law enforcement agencies.
  • Discuss, and put in place, concrete countermeasures with the help of key information security companies and professionals.

About Syhunt Security

With next-generation assessment technology, Syhunt established itself as a leading player in the web application security field, delivering its assessment tools to a range of organizations across the globe, from the SMB to the enterprise. Syhunt products help organizations defend against the wide range of sophisticated cyberattacks currently taking place at the Web application layer.

Syhunt proactively detects vulnerabilities and weaknesses that lead to data leaks or breaches. Syhunt tools focus on the many angles and views that can be used for evaluating the security state of a web application, such as its live version (through dynamic analysis / DAST), source code (SAST), server log (proactive forensics) and configuration (hardening).

Syhunt's founder Felipe Daragon started his career working as a security consultant for government organizations and corporations in the 90s. At the beginning of his career he worked for leading information security firms in Brazil. Daragon's last 22 years in the information security industry have been dedicated to proactively defending companies and government agencies from attacks, and to raising awareness about pressing security issues and new cyber attack trends.

References & Thanks

Special Thanks

  1. Thanks to Paulo R. Santos (Jump2) and Mario C. Fialho for participating in the analyses together with Syhunt and the newspapers.
  2. Thanks to Felipe Ventura for the first detailed analyses of the leak, which were posted by Tecnoblog as part of two articles and started to highlight the scale of the leak. Thanks to Renato Kopke for sending me the links to the articles.
  3. Thanks to Roberto F. Marc (Syhunt) for reviewing the math calculations.

References

  1. Megavazamento de dados de janeiro expôs mais de 500 mil celulares corporativos, Gizmodo, February 11, 2021
  2. Megavazamento de janeiro fez meio milhão de celulares corporativos circularem na internet, Estadão, February 10, 2021
  3. Fotos de megavazamento são de políticos que se candidataram entre 2012 e 2020, Canaltech, February 5, 2021
  4. Fotos em megavazamento de dados são de candidatos nas eleições entre 2012 e 2020, Estadão, February 4, 2021
  5. PF investiga venda de dados de Bolsonaro e de ministros do STF, CNN, February 3, 2021
  6. Após megavazamento, dados de ministros do Supremo são postos à venda, Conjur, February 2, 2021
  7. Dados vazados podem render R$ 80,8 milhões ao criminoso, Folha de São Paulo, February 2, 2021
  8. Dados de Bolsonaro e ministros do STF estão à venda na internet após megavazamento, Estadão, February 1, 2021
  9. Após vazamento, dados de 40 mil pessoas já circulam na internet, CNN (via Estadão), January 29, 2021
  10. Após megavazamento, dados de 40 mil brasileiros já circulam na internet, Estadão, January 28, 2021
  11. O que há no vazamento que afetou 40 milhões de CNPJs, Tecnoblog, January 22, 2021
  12. Vazamento que expôs 220 milhões de brasileiros é pior do que se pensava, Tecnoblog, January 22, 2021

References Translated (In English)

  1. Details of the leak on 100 million vehicles in Brazil, January 25, 2021
  2. Leak that exposed 220 million Brazilians is worse than previously thought, January 22, 2021
  3. What's in the leak that affected 40 million CNPJs, January 22, 2021
