The Avast Virus Lab uses supercomputing processing power to leverage an amazing amount of information quickly to automatically classify malware samples.
These are both features inside the product (such as FileRep and autosandboxing, including all of its recent development) and components that run on our backend – i.e. things that users don’t necessarily see but that are equally important for the overall quality of the product.
In fact, working on the backend stuff takes up more of their time these days, as more and more intelligence in Avast is moving to the cloud and/or is being delivered in almost real time via the avast! streaming update technology.
The Avast backend classifiers use a number of techniques, but the two hot ones that the team has been working on hard recently are things that we call Malware Similarity Search and Evo-Gen.
Malware Similarity Search is an important feature that allows us to categorize a large number of incoming samples almost instantly. That is, for any file, it is able to say whether the file looks similar to an already seen malware file (or a whole cluster of malware files) and whether it’s similar to a known clean file (or a cluster of these). This may sound like an easy problem to solve, but in practice it is pretty difficult. Of course, the secret sauce here is how you define the metric (to be able to talk about similarity at all) and what you take into account when representing a file. At Avast we take into account both the static properties of the file and the outcome of a dynamic analysis (i.e. basically logs gathered during the execution of the file).
Now, a technology like this is obviously very valuable as it allows us to make fast decisions about files that we have never seen before. For example, if a file is very similar to a cluster of known malware samples, and at the same time it is not similar to any clean files, we categorize it immediately as malware. Believe it or not, we’re seeing thousands of files like this every day.
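To make the decision rule above a bit more concrete, here is a minimal sketch. It is not our production classifier: the Jaccard metric, the feature names, the function names and both thresholds are assumptions chosen purely for illustration, since the real metric is the secret sauce.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(sample, malware, clean, hi=0.7, lo=0.3):
    """Label a sample by its closest known malware and clean neighbours.

    `hi` and `lo` are hypothetical thresholds, not tuned values.
    """
    best_mal = max((jaccard(sample, m) for m in malware), default=0.0)
    best_clean = max((jaccard(sample, c) for c in clean), default=0.0)
    # Very close to known malware and far from anything clean: call it malware.
    if best_mal >= hi and best_clean <= lo:
        return "malware"
    # The mirror case: very close to clean files and far from malware.
    if best_clean >= hi and best_mal <= lo:
        return "clean"
    # Anything in between is left for other classifiers to decide.
    return "unknown"


# Hypothetical features mixing static properties and dynamic-analysis events.
known_malware = [{"packed", "writes_autorun", "injects_process", "dns_dga"}]
known_clean = [{"signed", "has_gui", "reads_config"}]
new_file = {"packed", "writes_autorun", "injects_process", "dns_dga", "deletes_self"}
verdict = classify(new_file, known_malware, known_clean)
```

The point of the sketch is only the shape of the rule: a verdict is issued when the file is close to one side and far from the other, and withheld otherwise.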
The second technology I mentioned, Evo-Gen, is somewhat similar but a bit subtler in nature. It is about finding descriptions of large sets of malware samples that are as short and as generic as possible. Say you take a set of 1,000,000 malware samples (and 1,000,000 clean files) and give the algorithm the following task: find as few and as brief descriptions as possible that together cover as many samples in the malware set as possible, without describing any file in the clean set. Evo-Gen is a genetic algorithm that we have developed just for that. It often happens to find some real gems for us – e.g. a description of an apparently random set of tens of thousands of malware files scattered somewhat randomly across our virus sets. And the size of the description? 8 bytes.
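For readers who have not met genetic algorithms before, here is a toy sketch of the idea. It is not Evo-Gen itself: treating a "description" as a fixed 8-byte substring, the (sample, offset) encoding, the population size and the mutation rates are all assumptions made up for this example.

```python
import random


def window(samples, idx, off, n):
    """Return the n-byte window of samples[idx] at offset off (clamped to fit)."""
    s = samples[idx]
    off = max(0, min(off, len(s) - n))
    return s[off:off + n]


def fitness(pattern, malware, clean):
    """Number of malware samples containing the pattern; 0 if any clean file does."""
    if any(pattern in c for c in clean):
        return 0
    return sum(pattern in m for m in malware)


def evolve(malware, clean, n=8, pop_size=30, generations=40, seed=0):
    """Search for an n-byte pattern covering many malware samples and no clean ones.

    Assumes every malware sample is at least n bytes long.
    """
    rng = random.Random(seed)

    def random_individual():
        idx = rng.randrange(len(malware))
        return idx, rng.randrange(len(malware[idx]) - n + 1)

    def score(ind):
        return fitness(window(malware, ind[0], ind[1], n), malware, clean)

    pop = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the better half survives unchanged (elitism).
        pop.sort(key=score, reverse=True)
        parents = pop[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])  # crossover: sample from one parent, offset from the other
            if rng.random() < 0.3:  # mutation: slide the window by one byte
                child = (child[0], child[1] + rng.choice((-1, 1)))
            if rng.random() < 0.1:  # occasional fresh individual to keep diversity
                child = random_individual()
            children.append(child)
        pop = parents + children
    best = max(pop, key=score)
    pattern = window(malware, best[0], best[1], n)
    return pattern, fitness(pattern, malware, clean)


# Tiny made-up corpus: three "malware" files share a marker, clean files do not.
malware_set = [b"aaaaEVILCODEbbbb", b"ccEVILCODEdddddd", b"eeeeeeEVILCODEff"]
clean_set = [b"the quick brown fox jumps", b"over the lazy dog again"]
pattern, covered = evolve(malware_set, clean_set)
```

The fitness function carries the whole task statement: a candidate earns nothing if it matches a clean file, and otherwise scores by how many malware samples it covers. A real system would also have to evolve several descriptions at once and reward shorter ones.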
Now, if you think about this for a while, you will find out that both of these algorithms have something in common. I mean, for both of them it’s necessary to have super-fast access to our vast sets of clean and malware files. Forget about sequential access (or any kind of processing of the files one by one). Even reading the samples off the disks takes hours.
For this purpose, the team has developed another great piece of technology that we call MDE. It’s basically an in-memory database that works on top of indexed data and allows heavily parallel access.
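As a rough mental model only, an in-memory store on top of indexed data can be pictured as an inverted index that many queries hit at once. The sketch below is an assumption-laden illustration and says nothing about how MDE is actually built: the class, the method names and the feature strings are all hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class FeatureIndex:
    """Toy in-memory inverted index: feature -> ids of the samples that have it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, sample_id, features):
        for feature in features:
            self._postings[feature].add(sample_id)

    def candidates(self, features):
        """Count shared features per indexed sample without scanning every sample."""
        hits = defaultdict(int)
        for feature in features:
            for sample_id in self._postings.get(feature, ()):
                hits[sample_id] += 1
        return dict(hits)


def query_many(index, queries, workers=4):
    """Run read-only lookups concurrently; results come back in query order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(index.candidates, queries))


index = FeatureIndex()
index.add("m1", {"packed", "dns_dga"})
index.add("m2", {"dns_dga", "deletes_self"})
results = query_many(index, [{"dns_dga", "deletes_self"}, {"signed"}])
```

Note that Python threads only illustrate the access pattern here; they do not give real CPU parallelism. The useful part is that a lookup touches only the postings for the queried features, so nothing has to be read sample by sample.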
Traditionally, we have been running these things on classic server hardware. For the most part, we use standard Dell servers based on Intel Xeon CPUs. However, the performance has never been great and we always thought we should be doing better.
The real breakthrough came when we started experimenting with GPUs. For starters, modern GPUs (both from NVIDIA and AMD) are not limited to high-end graphics or gaming. The good thing about them is that they are massively parallel – while today’s high-end Intel CPUs contain 6, 8 or maybe 10 cores, the high-end gaming GPUs contain thousands of cores. True, each of those cores is not that powerful on its own, but if you can unleash their potential with some good parallel algorithms, the resulting power is insane.
So, with MDE, we’re now in the process of transitioning to a GPU-based “supercomputing” farm.
It’s not a rackmount server but a workstation. A hell of a workstation, I should say. With an Intel i7 E3820 4C 3.6GHz CPU and 32 GB of DDR3 RAM, it’s not a bad start, but what’s really cool about the box is the four NVIDIA GPU-based graphics cards, each with 3 GB of RAM and connected to each other by a hose for external water cooling. The whole beast is powered by a 1,500W power supply, but in case that’s not enough, we are ready to add one more.
While we haven’t put these systems in production yet, we will likely do so soon. And I’m truly looking forward to that – as doing so will allow us to serve you, our users, even better. You never know – if this proves to be as useful as we think it will be, we may end up building something like the Titan one day…
(Now, my job in the meantime will be to keep the gamers out of the server room :-)).