The 5 Vs of big data for machine learning

Martin Bálek, 16 August 2017

How Avast uses big data and machine learning to protect you

Most of today’s malware is modified, upgraded, and redeployed automatically, and so frequently that machine learning has become a vital component of any security solution. Machine learning allows a system to learn from data and observation automatically. It is most effective when it learns from big data: the more information we feed our machines, the more accurately they identify trends and create models. This is true not only in security, but in every area that uses machine learning.

So how do we know if information is actually “big data”? Big data can be characterized by 5 traits: volume, velocity, variety, variability, and veracity. These are regarded as the five pillars of big data, and they define the dynamic level of data that is required for truly useful learning in the fight against malware.


Volume

Without vast amounts of data, our machines wouldn’t be able to learn. At Avast, thanks to more than 400 million customers worldwide, we see about one million executable files a day. Each customer’s machine acts as a sensor, feeding us detailed information about these files, down to the code’s smallest nuance. Our system processes this massive amount of data (roughly 330 TiB), analyzing, learning, and classifying each file as malicious or clean. Machine learning enables the system to make intelligent decisions when it encounters files it has never seen before.


Velocity

As we mentioned, malware spreads and morphs rapidly, so detection has to be as quick as possible. Most threats are short-lived; some actually exist for only a few minutes. Before they can be detected, threats try to morph into something else. The only way we can keep up is through fast, automated systems. Instant, correct decision-making by those systems cannot happen without well-designed, well-trained machine learning that is regularly informed by, or “fed,” big data.


Variety

The type and nature of the data are also essential. We need to feed our engines both clean and malicious files to let them learn how to distinguish between the two. The more diverse the files our system analyzes, the smarter it becomes. Large amounts of contextual data result in more accurate threat detection, as malicious behavior within files becomes more easily recognizable.


Variability

Every file received by our system is categorized as clean, potentially unwanted, or malicious. A file’s class, however, can change over time, which can cause our machines to classify it incorrectly. When a clean file is classified as malicious, it’s called a “false positive,” and when a malicious file is classified as clean, it’s called a “false negative” (or “miss” in antivirus tests). Our goal is a “false negative” rate of zero: no misses, meaning our machines catch every malicious file. We also want as few “false positives” as possible, so that legitimate files are not blocked. PUPs (potentially unwanted programs), in particular, can cause “false positives,” as they fall into a gray area between clean and malicious. Ultimately, data variability doesn’t pose a big challenge to Avast because we’ve developed our systems to produce almost no misses and relatively low “false positive” rates.
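
To make the two error types concrete, here is a minimal Python sketch of how “false positive” and “false negative” rates could be computed from a set of verdicts. It is an illustration only, not a description of Avast’s systems: the function name `error_rates`, the two-label simplification (clean vs. malicious, leaving out potentially unwanted programs), and the sample data are all assumptions made for this example.

```python
from collections import Counter

def error_rates(true_labels, predicted_labels):
    """Return (false_positive_rate, false_negative_rate).

    Assumption for this sketch: every label is either "clean" or "malicious".
    """
    counts = Counter(zip(true_labels, predicted_labels))
    fp = counts[("clean", "malicious")]      # clean file flagged as malicious
    fn = counts[("malicious", "clean")]      # malicious file missed
    tn = counts[("clean", "clean")]          # clean file correctly passed
    tp = counts[("malicious", "malicious")]  # malicious file correctly caught
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr

# Made-up verdicts: four clean files and two malicious ones.
truth = ["clean", "clean", "clean", "clean", "malicious", "malicious"]
preds = ["clean", "malicious", "clean", "clean", "malicious", "clean"]

fpr, fnr = error_rates(truth, preds)
print(f"false positive rate: {fpr:.2f}")  # 1 of 4 clean files blocked -> 0.25
print(f"false negative rate: {fnr:.2f}")  # 1 of 2 malicious files missed -> 0.50
```

In this toy data set one clean file is wrongly blocked and one malicious file slips through, so neither rate is zero. A system that meets the goal described above would drive the second number to zero while keeping the first as low as possible.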


Veracity

The data we receive comes with a lot of noise that could potentially influence how our machines detect files. For example, we sometimes see hardware failures, such as faulty memory or hard drives, which can yield a wrongly calculated SHA-256 (the file’s unique fingerprint). We also see a lot of damaged files, which occur when the client is not able to complete a download, or when files are not correctly uploaded to our cloud. To cope with situations like these, we’ve built machine learning systems robust enough to distinguish signal from noise.
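
As a small illustration of the fingerprint mentioned above, the following Python sketch computes a SHA-256 hash with the standard `hashlib` module and uses it to spot a damaged upload. This is not Avast’s pipeline: the helper names `sha256_hex` and `is_intact`, the sample bytes, and the assumption that a client reports a hash alongside the file are hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 fingerprint of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def is_intact(payload: bytes, reported_sha256: str) -> bool:
    """Check whether a received payload matches the hash reported with it."""
    return sha256_hex(payload) == reported_sha256.lower()

# Hypothetical file content and the hash a healthy client would report.
original = b"MZ example executable bytes"
reported = sha256_hex(original)

# Simulate an upload that was cut short.
truncated = original[:-5]

print(is_intact(original, reported))   # True
print(is_intact(truncated, reported))  # False
```

A mismatch alone does not say which side is wrong: the file may be damaged, or the hash may have been miscalculated on faulty hardware. Either way, the record can be treated as noise rather than as a trustworthy sample.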

At Avast, our big data encompasses these 5 Vs. Furthermore, this big data fuels our machine learning, which in turn arms us with the knowledge we need to remain the largest threat-detection network in the world. This is exciting work, and we enjoy finding defensive solutions against the most nefarious malware out there. Handling big data is just one part of our overall mission to build the most robust and reliable infrastructure that can consistently deliver top-notch protection to our customers, wherever they are.