Crawling for APIs

As client machines become more powerful and JavaScript becomes more ubiquitous, servers are increasingly serving up code for browsers to execute, rather than the display-ready pages of the past. This changes the face of web scraping dramatically: simply wget'ing and parsing the response from a URL becomes useless without executing bulky JavaScript with third-party plugins, reading through code logic manually, and/or digging through piles of browser junk.
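
To see the problem concretely, here is a minimal Python sketch (the URL is a hypothetical placeholder for a JavaScript-heavy single-page app, and it requires the requests and beautifulsoup4 packages). Fetching such a page over plain HTTP returns plenty of bytes but almost no display-ready text:

    # Minimal sketch: a plain HTTP fetch of a JS-heavy page yields little
    # parseable content. https://example.com/app is a hypothetical URL.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/app", timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")

    # Strip <script> tags to see how little visible text is actually served
    for script in soup.find_all("script"):
        script.decompose()
    visible_text = soup.get_text(strip=True)

    print(f"Response size: {len(resp.text)} bytes")
    print(f"Visible text after removing scripts: {len(visible_text)} bytes")
    # On a typical single-page app, visible_text is nearly empty -- the real
    # content only appears after a browser executes the bundled JavaScript.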

However, moving page logic to the client side can also create data vulnerabilities, as companies leave internal APIs exposed to the world so that their client-side code can make use of them. I'll show some examples of this practice on traditionally "impossible to scrape" pages, along with tools I've developed to crawl domains and discover and document these hidden APIs in an automated way. While many bot-prevention measures focus on traditional page scraping and site manipulation, scripts that crawl sites through API calls, rather than in a "human-like" way through URLs, may present unique security challenges that modern web development practices do not sufficiently address.
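
As a rough illustration of the idea (not the actual tooling from the talk; the URL and the endpoint regex are assumptions made for the sketch), a crawler can pull down a page's JavaScript bundles and scan them for strings that look like internal API routes:

    # Rough sketch of automated API discovery. The crawl logic and the
    # endpoint pattern are illustrative assumptions, not the talk's tooling.
    import re
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    # Matches quoted paths that look like internal API routes, e.g. "/api/v2/users"
    API_PATTERN = re.compile(r'["\'](/(?:api|v\d+)/[\w/.-]*)["\']')

    def discover_endpoints(page_url: str) -> set[str]:
        """Fetch a page, pull down its script bundles, and scan for API paths."""
        endpoints = set()
        resp = requests.get(page_url, timeout=10)
        soup = BeautifulSoup(resp.text, "html.parser")

        # Collect inline scripts plus external bundles referenced by <script src=...>
        sources = [s.string or "" for s in soup.find_all("script") if not s.get("src")]
        for tag in soup.find_all("script", src=True):
            bundle = requests.get(urljoin(page_url, tag["src"]), timeout=10)
            sources.append(bundle.text)

        for js in sources:
            endpoints.update(API_PATTERN.findall(js))
        return endpoints

    if __name__ == "__main__":
        # Hypothetical target URL, as above
        for path in sorted(discover_endpoints("https://example.com/app")):
            print(path)

Replaying the discovered endpoints with ordinary HTTP requests then yields structured data directly, with no browser, no DOM parsing, and no "human-like" navigation through URLs.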
