How to build a malware classifier [that doesn’t suck on real-world data]

SecTor 2016

Presented by: John Seymour
Date: Tuesday October 18, 2016
Time: 10:15 - 11:15
Location: 801B
Track: Tech

Machine learning is the latest trend in malware classification. It’s easy enough that anyone can now spin up a malware crawler, extract some features from the files, build some machine learning models, and publish their research in a reputable journal. However, many of these models overfit: their accuracy drops significantly on real-world data. We’ll give a full introduction to using machine learning to classify malware, including the features that are currently used, which models seem to work best, and what datasets exist for this purpose. We’ll also demonstrate that even the best public malware classifiers overfit to their original training sets. We’ll end the talk with forays into avoiding this issue and creating malware classifiers that don’t suck on real-world data.

John Seymour

John Seymour is a Data Scientist at ZeroFOX, Inc. by day and a Ph.D. student at the University of Maryland, Baltimore County by night. He researches the intersection of machine learning and InfoSec in both roles. He’s mostly interested in avoiding, and helping others avoid, some of the major pitfalls in machine learning, especially in dataset preparation (seriously, do people still use malware datasets from 1998?). He has spoken at both DEFCON and BSides, and aims to add BlackHat to the list in the near future.

