CROWDSOURCE: AN OPEN SOURCE, CROWD TRAINED MACHINE LEARNING MODEL FOR MALWARE CAPABILITY DETECTION

Because the number of unique malware binaries on the Internet is exploding and manual analysis of these binaries is slow, security practitioners today have only limited visibility into the functionality implemented by the global population of malware. To date, little work has focused explicitly on quickly and automatically detecting the broad range of high-level malware functionality, such as the ability to take screenshots, communicate via IRC, or surreptitiously operate users’ webcams.

To address this gap, we debut CrowdSource, an open source, machine-learning-based reverse engineering tool. CrowdSource approaches the problem of malware capability identification in a novel way: by training a malware capability detection engine on millions of technical documents from the web. Our intuition is that malware reverse engineers already rely heavily on the web “crowd” (performing web searches to discover the purpose of obscure function calls and byte strings, for example), so automated, machine-learning-based approaches should also take advantage of this rich and as-yet untapped data source.
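To make the intuition concrete, the following is a minimal sketch of the manual workflow it describes: pull printable strings out of a binary and look them up in technical documents. It is not CrowdSource's algorithm, which this abstract does not specify. The function names, the two-document toy corpus, and the example.org URLs are all hypothetical, and simple substring matching stands in for whatever a trained model would do.

```python
import re

def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII at least min_len bytes long,
    similar in spirit to the Unix `strings` utility."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, data)]

# Hypothetical stand-in for a corpus of web technical documents.
TOY_CORPUS = {
    "https://example.org/docs/irc":
        "An IRC client sends JOIN and PRIVMSG commands to the server.",
    "https://example.org/docs/screenshot":
        "Call BitBlt on the desktop device context to capture a screenshot.",
}

def documents_mentioning(symbol: str, corpus: dict[str, str]) -> list[str]:
    """Return the URLs of documents whose text contains the symbol."""
    return [url for url, text in corpus.items() if symbol in text]

# Fake "binary": two meaningful strings surrounded by non-printable bytes.
sample = b"\x00\x01PRIVMSG\x00\xff\xfeBitBlt\x00ab\x00"
strings = extract_strings(sample)
hits = {s: documents_mentioning(s, TOY_CORPUS) for s in strings}

for symbol, urls in hits.items():
    print(symbol, "->", urls)
```

Each extracted string is mapped to the documents that mention it, which is roughly what an analyst does by hand with a search engine.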

As a novel malware capability detection approach, CrowdSource does the following:

  • Generates a list of detected software capabilities for novel malware samples (such as the ability of malware to communicate via a particular protocol, perform a given data exfiltration activity, or load a device driver);
  • Provides traceable output for capability detections by including “citations” to the web technical documents that detections are based on;
  • Provides probabilistic malware capability detections when appropriate: e.g., system output may read, “given the following web documents as evidence, it is 80% likely the sample uses IRC as a C2 channel, and 70% likely that it also encrypts this traffic.”
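The output properties listed above (a capability, a probability, and citations to supporting documents) could be represented along the following lines. This is only an illustrative sketch: the `Detection` class, its field names, the `render` helper, and the example.org URL are assumptions made for this example, not CrowdSource's actual output format.

```python
from dataclasses import dataclass, field

@dataclass
class Detection:
    """One detected capability (hypothetical record layout)."""
    capability: str                 # human-readable capability description
    probability: float              # estimated likelihood, 0.0 to 1.0
    citations: list[str] = field(default_factory=list)  # supporting documents

def render(detection: Detection) -> str:
    """Format a detection as a short report with its evidence listed below."""
    lines = [f"{detection.probability:.0%} likely: {detection.capability}"]
    lines += [f"  evidence: {url}" for url in detection.citations]
    return "\n".join(lines)

report = render(
    Detection(
        capability="uses IRC as a C2 channel",
        probability=0.8,
        citations=["https://example.org/docs/irc"],  # placeholder URL
    )
)
print(report)
```

Keeping the citations on each detection record is what makes the result traceable: an analyst can follow every reported capability back to the documents it was based on.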

CrowdSource is funded under the DARPA Cyber Fast Track initiative, is being developed by the machine learning and malware analysis group at Invincea Labs, and is scheduled for a beta, open source release to the security community this October. In this presentation, we will give complete details on our algorithm for CrowdSource as it stands, including compelling results demonstrating that CrowdSource can already rapidly reverse engineer a variety of currently active malware variants.

Presented by