We've already raised our malicious data collection skills to an art form, but to get good verdicts (most importantly, a low false-positive rate), our benign (or White) data must enjoy the same level of confidence as the malicious (or Black) data. When dealing with Machine Learning algorithms, the certainty of the White data is usually taken for granted, but in reality, establishing that certainty is far from simple. In this talk, we will focus on the collection of White data: where do we get it from, and how do we collect it?
The talk is based on research we performed over the past year, during which we developed a methodology for collecting clean data and building repositories of it. We will share this methodology with the audience.