Labeling the VirusShare Corpus: Lessons Learned

BSidesLV 2016

Presented by: John Seymour
Date: Wednesday August 03, 2016
Time: 11:35 - 12:30
Location: Florentine F
Track: Ground Truth

A machine learning researcher needs a good dataset to work with, but all of the publicly available malware datasets have major issues. We'll start by reviewing the basics of machine learning on malware: what works, what doesn't, and what data is out there. We'll introduce the VirusShare dataset, show how we fixed its labeling issue (using VirusTotal) so that it can be used for supervised machine learning, and discuss why this corpus should be used as a standard for machine learning research. Finally, we'll look at PySpark and how it can be used both to summarize the corpus and to find which chunks have high concentrations of particular malware families.
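To make the last two ideas concrete, here is a minimal sketch in plain Python (not PySpark, and not the speaker's actual pipeline). It assumes each sample already has per-vendor family labels taken from a scan report, collapses them to one label by simple majority vote, and then ranks chunks by the fraction of their samples assigned to a given family. The chunk names, hashes, vendor names, toy data, and the helper names majority_label, family_concentration, and chunks_ranked_by are all hypothetical; the majority-vote rule is an assumption chosen for illustration.

```python
from collections import Counter, defaultdict

# Made-up toy records: (chunk name, sample hash, {vendor: family label}).
# Vendor names and hashes are placeholders, not real scan results.
SAMPLES = [
    ("chunk_000", "a1", {"VendorA": "zbot", "VendorB": "zbot", "VendorC": "upatre"}),
    ("chunk_000", "a2", {"VendorA": "zbot", "VendorB": "zbot"}),
    ("chunk_000", "a3", {"VendorA": "upatre", "VendorB": "upatre"}),
    ("chunk_001", "b1", {"VendorA": "upatre", "VendorB": "upatre", "VendorC": "upatre"}),
    ("chunk_001", "b2", {"VendorA": "upatre"}),
    ("chunk_001", "b3", {"VendorA": "zbot", "VendorB": "upatre"}),  # tie
]


def majority_label(vendor_labels):
    """Return the family most vendors agree on, or None if empty or tied."""
    if not vendor_labels:
        return None
    ranked = Counter(vendor_labels.values()).most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # no clear winner, so leave the sample unlabeled
    return ranked[0][0]


def family_concentration(samples):
    """Map chunk -> {family: fraction of all samples in that chunk}."""
    totals = Counter()
    per_chunk = defaultdict(Counter)
    for chunk, _sample_hash, vendor_labels in samples:
        totals[chunk] += 1
        label = majority_label(vendor_labels)
        if label is not None:
            per_chunk[chunk][label] += 1
    return {
        chunk: {fam: n / totals[chunk] for fam, n in per_chunk[chunk].items()}
        for chunk in totals
    }


def chunks_ranked_by(samples, family):
    """List (chunk, fraction) pairs, highest concentration of `family` first."""
    conc = family_concentration(samples)
    pairs = [(chunk, fams.get(family, 0.0)) for chunk, fams in conc.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for chunk, fraction in chunks_ranked_by(SAMPLES, "upatre"):
        print(f"{chunk}: {fraction:.2f}")
```

On a corpus-sized dataset the same group-and-count logic would be expressed as a distributed aggregation rather than an in-memory loop; the sketch only shows the shape of the computation.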

John Seymour

John Seymour is a Data Scientist at ZeroFOX, Inc. by day, and a Ph.D. student at the University of Maryland, Baltimore County by night. He researches the intersection of machine learning and InfoSec in both roles. He's mostly interested in avoiding, and helping others avoid, some of the major pitfalls in machine learning, especially in dataset preparation (seriously, do people still use malware datasets from 1998?). He has spoken at both DEF CON and his local BSides, and plans to add Black Hat USA, BSidesLV, and SecTor to the list in the near future.

