How to build a malware classifier [that doesn’t suck on real-world data]

Machine learning is the latest trend in malware classification. It’s easy enough that anyone can now spin up a malware crawler, extract some features from the files, build some machine learning models, and publish the results in a reputable journal. However, many of these models overfit: their accuracy drops significantly on real-world data. We’ll give a full introduction to using machine learning to classify malware, including the features currently in use, which models seem to work best, and what datasets exist for this purpose. We’ll also demonstrate that even the best public malware classifiers overfit to their original training sets. We’ll end the talk with forays into avoiding this issue, so you can build malware classifiers that don’t suck on real-world data.
