Thursday, November 6, 2014

Data Challenge #1: Map Reduce for NYSE

In October I became a Cloudera Certified Developer for Apache Hadoop.  In addition to earning my certification, I led the study groups for the other engineers at my company who wanted to obtain their certifications, too.  I'm happy to say that all of those engineers are now certified as well.  The most useful tool for preparing for this exam was writing up a series of "Data Challenges" that required members of the study group to use what they learned from the Cloudera Hadoop study guide to solve a Big Data problem.  I've decided to share those data challenges on my blog for other Big Data enthusiasts.

Today it is common for engineers and business analysts to use tools such as Pig and Hive when querying big datasets in Hadoop.  Both tools offer an abstraction layer that hides the complexity of map reduce.  However, for any engineer who wants to understand how map reduce works, it's best to start by writing map reduce jobs in code and only then learn tools such as Pig and Hive.  This data challenge assumes that you've spent some time learning about map reduce; it is simply an exercise you can use for practice.

For this data challenge, you are to work with a 12 MB dataset from the New York Stock Exchange (NYSE).  The dataset contains the following data fields for every stock traded on the NYSE for every day from 1/1/2000 through 12/31/2001:
  • stock_symbol
  • date
  • stock_price_open
  • stock_price_high
  • stock_price_low
  • stock_price_close
The image below is a sample of the dataset for the stock "ASP" from 12/17/2001 through 12/31/2001.
This is a subset view of the input file for the stock ASP
You are to create a map reduce job in Java that returns the date on which each stock reached its highest price and the date on which it reached its lowest price.  Your output file should be in an easy-to-read format that allows me to quickly see the information that I need.
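If you're not sure where to begin, a reasonable first step is getting one line of the input into a usable shape before you worry about the mapper and reducer.  The snippet below is only a minimal sketch of that step in plain Java (no Hadoop classes), not part of the solution.  It assumes the file is comma-separated with the six fields in the order listed above; check the actual file before relying on that.  The names NyseRecordSketch, Quote, and parse are made up for illustration, and the sample line uses made-up values rather than real NYSE data.

```java
public class NyseRecordSketch {

    /** One parsed row; only the fields this challenge needs are kept. */
    static final class Quote {
        final String symbol;
        final String date;   // kept as text; the real date format may differ
        final double high;
        final double low;

        Quote(String symbol, String date, double high, double low) {
            this.symbol = symbol;
            this.date = date;
            this.high = high;
            this.low = low;
        }
    }

    /**
     * Parses one input line, ASSUMING comma-separated values in the order:
     * stock_symbol, date, open, high, low, close.
     */
    static Quote parse(String line) {
        String[] f = line.split(",");
        if (f.length != 6) {
            throw new IllegalArgumentException(
                    "Expected 6 fields, got " + f.length + ": " + line);
        }
        return new Quote(
                f[0].trim(),
                f[1].trim(),
                Double.parseDouble(f[3].trim()),   // stock_price_high
                Double.parseDouble(f[4].trim()));  // stock_price_low
    }

    public static void main(String[] args) {
        // Made-up sample values for illustration only, not real NYSE data.
        Quote q = parse("ASP,2001-12-31,12.55,12.80,12.42,12.80");
        System.out.println(q.symbol + " " + q.date
                + " high=" + q.high + " low=" + q.low);
    }
}
```

In a real job, logic like this would typically live inside your mapper's map method.  Deciding what to emit as the key and value from there is the part of the challenge I'm leaving to you.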
Extra Credit:
Modify your map reduce job so that it returns one output file for the year 2000 and another output file for the year 2001.
When you're done, zip up your output file(s) and email them to me at collindcouch@gmail.com.  I'll compare your output to the solution and let you know how you did.
Good luck, and have fun!
-Collin Couch
