Is it malware or clean? Well, it depends on a plethora of diverse features

Avast researchers use a general feature-blind learning framework for fast detection of novel malware based on diverse data sources

This post was written by the following Avast researchers:

Viliam Lisý, Avast Principal AI Scientist
Branislav Bošanský, Avast Principal AI Scientist
Karel Horak, Avast Senior AI Researcher
Matej Racinsky, Avast AI Researcher
Petr Somol, Avast Director AI Research

Every day, antivirus systems all over the world inspect billions of files in order to detect potential threats. For most files, these systems can easily decide whether the file is malware or clean based on its reputation or on common patterns identified in known malware families. However, there is still a considerable portion of files which isn’t easy to classify based on the known patterns. These files are commonly uploaded to massive backends of antivirus systems in the cloud, where they are thoroughly analyzed using a wide variety of methods, such as static analysis, dynamic analysis, behavioral analysis, or queries to third-party knowledge bases. Each such analysis produces a rich, diverse, and often changing set of features that indicate whether the file is malware or clean.

The amount of relevant data is large and the decision must be quick. For example, the WannaCry ransomware outbreak spread from one to more than 200,000 computers in just a little more than seven hours. Every minute of delay in detecting such threats can mean thousands of newly infected computers. Therefore, the detection of novel malware must be automated, typically using a machine learning (ML) model that considers features extracted from the binary or produced by some other preprocessing tool. Standard ML approaches require ML engineers to understand the information contained in the reports, determine how indicative it is of the analyzed file being malware, and implement routines that encode the most important information into the fixed-size vector representations required by most machine learning algorithms. If the format or the information contained in the reports is updated or extended, the engineer must understand the differences and adapt these routines. If a new data source is added, the engineer must go through this whole process from scratch: understand the data, implement feature-extraction routines, and train the classifier. Such changes in reports happen very often in the malware detection domain, since all the preprocessing tools are actively developed to discover important features of new binaries.

In our previous blog post, we introduced a generic framework that allows automating these tasks, traditionally performed by machine learning engineers. With our implementation of the framework, which we call ReportGrinder, adding a new data source means simply adding a pointer to the new training set of analysis reports. If the reports change arbitrarily but the problem of distinguishing malware from clean files remains, no human intervention is necessary and the system can simply be automatically retrained using the new reports.

In this post, we will show how we deployed the ReportGrinder framework for fast detection of malware in new, previously unseen files based on diverse data sources. Each new file is analyzed by several backend systems to extract static features, provide behavioral analysis, and query third-party intelligence. The raw output of these systems, in the form of JSON reports, is used as the input for a machine learning model trained on hundreds of millions of files that we have classified in the past. We use an ensemble model to assess the confidence of the classification. For 85% of the most difficult files that our clients send to the backend, this new model makes a confident decision on its own in less than one minute after receiving them. Extending Avast backend decision systems with this new model has reduced the processing time of new files by a whopping 50%. Moreover, any new feature in the reports from the analysis systems will be automatically incorporated into the model without additional human intervention.
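The post does not say how the ensemble turns its members' outputs into a confidence estimate. As a minimal sketch of one common approach, assume each ensemble member outputs a malware probability, and that a decision counts as confident only when the members agree and their average is far from 0.5. The function name, the thresholds, and the agreement rule are all hypothetical and are not taken from ReportGrinder.

```python
from statistics import mean, pstdev

def ensemble_decision(member_probs, threshold=0.9, max_spread=0.1):
    """Combine per-member malware probabilities into a verdict.

    Returns "malware", "clean", or "undecided". Both thresholds are
    illustrative assumptions, not values used by Avast.
    """
    avg = mean(member_probs)       # ensemble estimate of P(malware)
    spread = pstdev(member_probs)  # disagreement between members
    if spread > max_spread:
        return "undecided"         # members disagree: not confident
    if avg >= threshold:
        return "malware"
    if avg <= 1.0 - threshold:
        return "clean"
    return "undecided"             # members agree, but near 0.5

print(ensemble_decision([0.97, 0.95, 0.99]))
print(ensemble_decision([0.95, 0.20, 0.90]))
```

In such a scheme, files that come back "undecided" would be left to the other backend systems instead of being decided by the model alone.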

A quick classification of novel malware

When an antivirus system encounters a file, its hash is usually checked in a reputation database to determine whether or not it is clean. A small fraction of files will have never been seen before, because they contain, for example, polymorphic malware or a personalized installer. These files are then scanned using client-side detection methods that search for known patterns in the binary of the file and possibly even run some short emulations. For a small portion of these files, even this check is inconclusive, so the file is sent to the cloud for analysis by antivirus backends. At this point, the user already starts experiencing some delays and may be waiting for the desired new application to start for the first time. Therefore, the speed of the following steps is very important.

Relevant data sources

When the most difficult files arrive at the backend, a plethora of computationally expensive systems can be executed in parallel to provide additional information about the suspicious sample:

  • Tools for extracting static features from the binary (such as RetDec or LIEF, example report)
  • Tools for executing the sample in a safe and controlled environment to provide behavioral analysis (such as Cuckoo or Cape, example report)
  • The file can be unpacked
  • The validity, reputation and other properties of digital signatures may be obtained
  • External data sources may be queried for additional data
  • The similarity of the file to existing file clusters may be reported

It is important to consider the wide variety of data sources because malware can manage to avoid one type of analysis, but the avoidance often makes it easier to detect by a complementary method. Each data source produces a structured report in JSON format, ideal for processing by our ReportGrinder.
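Running the analyses in parallel and collecting one JSON report per data source could look roughly like the sketch below. The three analyzer functions are hypothetical stand-ins that return hard-coded dictionaries. Real tools such as RetDec, LIEF, or Cuckoo have their own interfaces and report formats, which are not reproduced here.

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the real analysis systems.
def static_analysis(sample: bytes) -> dict:
    return {"size": len(sample), "sections": [{"name": ".text", "entropy": 6.1}]}

def behavioral_analysis(sample: bytes) -> dict:
    return {"api_calls": ["CreateFileW", "WriteFile"], "network": []}

def signature_check(sample: bytes) -> dict:
    return {"signed": False}

ANALYZERS = {
    "static": static_analysis,
    "behavior": behavioral_analysis,
    "signature": signature_check,
}

def collect_reports(sample: bytes) -> dict:
    """Run every analyzer concurrently and return one JSON string per source."""
    with ThreadPoolExecutor(max_workers=len(ANALYZERS)) as pool:
        futures = {name: pool.submit(fn, sample) for name, fn in ANALYZERS.items()}
        return {name: json.dumps(f.result()) for name, f in futures.items()}

reports = collect_reports(b"MZ\x90\x00")
print(sorted(reports))
```

Because each source is keyed by name, a new analyzer can be added to the dictionary without touching the collection logic.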

Using HMIL for malware classification

Using Hierarchical Multiple Instance Learning (HMIL) through ReportGrinder for malware classification is rather straightforward. We collect all reports for a large dataset of hundreds of millions of files. Then, the general sequence of steps that we introduced before is performed automatically.

  1. ReportGrinder automatically derives the schema of each data source. 

  2. Based on the schema, all basic data types, such as strings and numbers, are encoded into a vector representation. 

  3. A neural network following the structure of the schema is automatically derived so that it aggregates an arbitrarily large and variable report into a fixed vector representation. 

  4. The vector representations of all relevant data sources can then be concatenated and followed by several feed-forward layers and a suitable output layer with a corresponding loss function. In the case of malware classification, it can be just a softmax output layer trained by optimizing cross entropy.
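The four steps above can be illustrated with a toy, untrained sketch in plain Python. It is not ReportGrinder's implementation. It assumes that each JSON field has a consistent type across reports, encodes strings as hashed character trigrams (an arbitrary choice), and aggregates lists by averaging. A real HMIL model would put learned neural layers around each of these operations. All function names are hypothetical.

```python
import math
import zlib

DIM = 8  # width of the hashed string encoding (arbitrary for this sketch)

def derive_schema(values):
    """Step 1: infer a schema from parsed JSON values of one field."""
    values = [v for v in values if v is not None]
    if not values:
        return {"type": "empty"}
    first = values[0]
    if isinstance(first, dict):
        keys = sorted({k for v in values for k in v})
        return {"type": "dict",
                "keys": {k: derive_schema([v.get(k) for v in values]) for k in keys}}
    if isinstance(first, list):
        return {"type": "list", "item": derive_schema([x for v in values for x in v])}
    return {"type": "string" if isinstance(first, str) else "number"}

def width(schema):
    """Length of the vector that embed() produces for this schema."""
    t = schema["type"]
    if t == "dict":
        return sum(width(s) for s in schema["keys"].values())
    if t == "list":
        return width(schema["item"])
    return {"string": DIM, "number": 1, "empty": 0}[t]

def embed(value, schema):
    """Steps 2 and 3: encode leaves, then aggregate along the schema."""
    t = schema["type"]
    if value is None or t == "empty":
        return [0.0] * width(schema)           # missing field
    if t == "dict":                            # concatenate child vectors
        return [x for k, sub in schema["keys"].items() for x in embed(value.get(k), sub)]
    if t == "list":                            # bag of instances: average them
        if not value:
            return [0.0] * width(schema)
        rows = [embed(item, schema["item"]) for item in value]
        return [sum(col) / len(rows) for col in zip(*rows)]
    if t == "string":                          # hashed character trigrams
        vec = [0.0] * DIM
        for i in range(max(len(value) - 2, 1)):
            vec[zlib.crc32(value[i:i + 3].encode()) % DIM] += 1.0
        return vec
    return [float(value)]

def softmax(logits):
    """Step 4 output layer: turn class scores into probabilities."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return [e / sum(exps) for e in exps]

def cross_entropy(probs, label):
    """Step 4 loss: negative log-probability of the true class."""
    return -math.log(probs[label])

# Two made-up reports; the top-level keys play the role of data sources.
reports = [
    {"static": {"imports": ["CreateFileW", "WriteFile"], "size": 4096},
     "behavior": {"files": [{"path": "C:\\a.tmp", "written": 10}]}},
    {"static": {"imports": ["RegSetValueW"], "size": 1024, "packer": "UPX"},
     "behavior": {"files": []}},
]
schema = derive_schema(reports)
vectors = [embed(r, schema) for r in reports]
print(len(vectors[0]), len(vectors[1]))
```

Both reports map to vectors of the same length even though one lacks the `packer` field and the other has an empty `files` list, which is the property that lets a standard classifier head sit on top.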

Deployment results

We have deployed ReportGrinder in Avast CyberCapture to classify Windows executable files based on both static and dynamic analysis reports. This feature receives tens of thousands of unseen suspicious executables from Avast users every day. Even before deploying ReportGrinder, these files were classified as malware, potentially unwanted programs (PUPs), or clean, based on a diverse combination of classifiers using machine learning, the reputation of individual file components, hand-written rules, external intelligence, and so on.

If none of these systems could make a conclusive decision, the file was reanalyzed after some time because many of the classifiers are continually adapting with each new file analyzed by Avast. Before deploying ReportGrinder, approximately 20% of files incoming to CyberCapture were not conclusively decided upon within several hours. We further refer to such files as “expired”.

Expired files

The initial deployment of ReportGrinder to process files of Avast’s 435 million users was conservative, but it still led to substantial improvements. ReportGrinder’s decision is used only after some of the well-established, pre-existing classifiers have failed to classify the file.
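This conservative ordering can be sketched as a simple fallback chain. The sketch assumes that every classifier returns either a verdict or `None` when it is inconclusive. Which pre-existing classifiers run before ReportGrinder is not specified in the post, and the sketch ignores the periodic reanalysis that happens before a file actually expires. All names are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A classifier returns a verdict ("malware", "PUP", "clean") or None.
Classifier = Callable[[dict], Optional[str]]

def decide(reports: dict,
           established: Sequence[Classifier],
           report_grinder: Classifier) -> Tuple[Optional[str], str]:
    """Return (verdict, source of the decision)."""
    for clf in established:            # pre-existing classifiers go first
        verdict = clf(reports)
        if verdict is not None:
            return verdict, "pre-existing"
    verdict = report_grinder(reports)  # consulted only as a fallback
    if verdict is not None:
        return verdict, "ReportGrinder"
    return None, "expired"             # nobody could decide

inconclusive = lambda reports: None
print(decide({}, [inconclusive], lambda reports: "malware"))
```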

The breakdown of CyberCapture decisions by the classifier that made the final decision is shown in Figure 1. We can see that in the two weeks before deploying the system, 24% of the files expired. In the two weeks after deploying ReportGrinder classifiers, though, only 6% of the files expired, while a large proportion of the files that would otherwise have expired were classified by the HMIL classifiers built into the ReportGrinder framework.

Figure 1: A breakdown of the CyberCapture decisions by different internal systems two weeks before and after deploying ReportGrinder.

Processing speed

The reduction in expired files is very important for user experience: instead of waiting a few hours for a decision, users can continue their work after the one minute that ReportGrinder classifiers need. Even the files that would eventually be decided upon by the pre-existing systems can be decided upon by ReportGrinder within one minute. Therefore, the deployment led to a substantial reduction in the time files spend in CyberCapture. Figure 2 presents the (relative) average processing time before and after ReportGrinder was deployed.

Figure 2: The average CyberCapture processing time before and after ReportGrinder was deployed.

Conclusion

Avast researchers turned their theoretical framework for processing complex security data without feature engineering into a practical application. They have built a system that consumes reports from static, as well as dynamic, analysis of executable files in their raw form and decides whether the corresponding files are malware or clean. The system is regularly trained on over 100 million files, and it has halved the average analysis time of the most complex previously unseen files arriving at Avast backends.
