Labeling the VirusShare Corpus: Lessons Learned

A machine learning researcher needs a nice dataset to work with, but all of the publicly available malware datasets have major issues. We'll start by reviewing the basics of machine learning on malware: what works, what doesn't, and what data is out there. We'll introduce the VirusShare dataset, show how we fixed the labels issue (using VirusTotal) so that it may be used for supervised machine learning, and discuss why this corpus should be used as a standard for machine learning research. Finally, we'll look at pyspark, and how it can be used to both summarize the corpus and to help us find which chunks have high concentrations of particular families of malware.

Presented by