Hadoop: Apache's Open Source Implementation of Google's MapReduce Framework

This presentation will begin with a brief overview of Google's MapReduce framework, which is built to analyze extremely large datasets. We will first look at what Mappers and Reducers are, the inputs they take, and the outputs they generate. From there, we will look at the open-source, Java-based implementation of MapReduce by the Apache Software Foundation's Hadoop project. Although Hadoop itself is written in Java, its Streaming interface lets Mappers and Reducers be written in other languages, so we will then look at using Hadoop to build Mappers and Reducers in Python, as well as to run Mappers written in the AWK scripting language. A brief comparison of the compile times and efficiency of the three approaches will be shown, along with the results from running our code on ASU's Saguaro Cluster. After that, we will touch on HBase, the Hadoop equivalent of Google's BigTable: a non-relational, distributed database designed to work with MapReduce. Finally, we will look at some demo code, including a machine learning algorithm run against the Netflix Prize dataset, a 2-gigabyte collection of movie ratings released by Netflix.
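As a rough sketch of the Hadoop Streaming model mentioned above, the two hypothetical Python scripts below compute the average rating per movie. They assume a simplified input of "movie_id,rating" lines rather than the Netflix Prize data's actual file layout, and all names here are illustrative, not code from the presentation. Streaming pipes input lines to the Mapper on stdin, sorts the Mapper's tab-separated output by key, and feeds the sorted stream to the Reducer on stdin.

mapper.py:

    #!/usr/bin/env python
    # Hypothetical Streaming Mapper: assumes each input line is "movie_id,rating".
    import sys

    for line in sys.stdin:
        fields = line.strip().split(",")
        if len(fields) < 2:
            continue  # skip malformed lines
        movie_id, rating = fields[0], fields[1]
        try:
            float(rating)  # validate the rating before emitting
        except ValueError:
            continue
        # Streaming convention: tab-separated key/value pairs on stdout.
        print("%s\t%s" % (movie_id, rating))

reducer.py:

    #!/usr/bin/env python
    # Hypothetical Streaming Reducer: input arrives sorted by key, so all
    # ratings for one movie are contiguous and can be averaged when the
    # key changes.
    import sys

    current_id = None
    total = 0.0
    count = 0

    for line in sys.stdin:
        movie_id, _, rating = line.rstrip("\n").partition("\t")
        if not rating:
            continue  # skip malformed lines
        if movie_id != current_id and current_id is not None:
            print("%s\t%.3f" % (current_id, total / count))
            total, count = 0.0, 0
        current_id = movie_id
        total += float(rating)
        count += 1

    if current_id is not None and count > 0:
        print("%s\t%.3f" % (current_id, total / count))

A job like this would be launched with something along the lines of "hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -file reducer.py -reducer reducer.py -input ratings -output averages", with the exact jar path depending on the Hadoop installation.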

We will also include, though not present on, source code from other teams' projects: MapReduce programs for image analysis and recognition, for analyzing air traffic data, and for analyzing package delivery systems for use with swarm theory, as well as a MapReduce program that analyzes patterns in large bodies of literature as a response to "The Bible Code". Most of these use public datasets as inputs.

Presented by