Thursday, November 6, 2014

Data Challenge #2: Inverted Indexing Inaugural Speeches

In October I became a Cloudera Certified Developer for Apache Hadoop.  In addition to gaining my certification, I led the study groups for the other engineers at my company who wanted to obtain their certifications, too.  I'm happy to say that all of those engineers are also now certified.  The most useful tool for preparing for this exam was writing up a series of "Data Challenges" that required members of the study group to use what they learned from the Cloudera Hadoop study guide to solve a Big Data problem.  I've decided to share those data challenges on my blog for other Big Data enthusiasts.

For this data challenge, you'll be creating an inverted index. An inverted index is a data structure common to nearly all information retrieval systems. Let us consider the following text:
1: i love big data
2: and hadoop is what i use for big data
3: hdfs and map reduce make up hadoop
An inverted index is a data structure used by search engines and databases to map search terms to the files or documents in which they appear.  There are two main types of inverted index: the record-level inverted index and the word-level inverted index.  For this exercise we will create a variation of a record-level inverted index, in which we build a collection of postings lists, one associated with each unique term in the collection. Let's treat each line in the above sample data as if it were a "document". The complete inverted index would look like this:
and : 2 : (2, 1), (3, 1)
big : 2 : (1, 1), (2, 1)
data : 2 : (1, 1), (2, 1)
for : 1 : (2, 1)
hadoop : 2 : (2, 1), (3, 1)
hdfs : 1 : (3, 1)
i : 2 : (1, 1), (2, 1)
is : 1 : (2, 1)
love : 1 : (1, 1)
make : 1 : (3, 1)
map : 1 : (3, 1)
reduce : 1 : (3, 1)
up : 1 : (3, 1)
use : 1 : (2, 1)
what : 1 : (2, 1)
As you can see, we have a posting list for each word that appears in the collection. Let us look at the list corresponding to the term hadoop in a bit more detail:
hadoop : 2 : (2, 1), (3, 1)
The number directly after the term is its document frequency, or df for short. The df specifies the number of documents that contain this term. Since hadoop appears in two documents, its df is 2. Although the df can easily be derived by counting the postings, we store it explicitly in the inverted index. The postings list itself contains a number of postings, each of which is a (docno, tf) tuple. The docno is simply a unique identifier for the document (one through three, in this case). The tf, which stands for term frequency, is the number of times the term appears in the document. The term hadoop appears once in document 2 and once in document 3.

The Challenge

Write a MapReduce program that builds an inverted index (as described above). Each postings list should explicitly store the df, as well as all the individual postings. Postings should be sorted by ascending docno (postings corresponding to smaller docnos should precede postings corresponding to larger docnos).
Run the inverted indexer on the attached sample input, which is the collection of all inaugural speeches from US Presidents through 2012. As with the above case, treat each line as if it were an individual "document". When you map over a plain text file using TextInputFormat in Hadoop, the key passed to the mapper contains the byte offset of the line from the beginning of the file, while the value contains the text of the line. Use this offset value as the unique docno.
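If you want a starting point, here is a minimal sketch of one way to structure the job (each class would go in its own .java file). It assumes the org.apache.hadoop.mapreduce API; the class names and the "docno:tf" value encoding are illustrative choices of mine, not requirements of the challenge.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: counts term frequencies within a single line ("document") and
// emits one (term, "docno:tf") pair per unique term, using the byte
// offset supplied by TextInputFormat as the docno.
public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        Map<String, Integer> tfs = new HashMap<String, Integer>();
        StringTokenizer tok = new StringTokenizer(value.toString().toLowerCase());
        while (tok.hasMoreTokens()) {
            String term = tok.nextToken().replaceAll("[^a-z]", "");
            if (term.isEmpty()) continue;
            Integer tf = tfs.get(term);
            tfs.put(term, tf == null ? 1 : tf + 1);
        }
        for (Map.Entry<String, Integer> e : tfs.entrySet()) {
            ctx.write(new Text(e.getKey()), new Text(key.get() + ":" + e.getValue()));
        }
    }
}

// Reducer: collects the postings for one term, sorts them by ascending
// docno, and writes "term <tab> df : (docno, tf), (docno, tf), ...".
public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> postings, Context ctx)
            throws IOException, InterruptedException {
        TreeMap<Long, Integer> sorted = new TreeMap<Long, Integer>();
        for (Text p : postings) {
            String[] parts = p.toString().split(":");
            sorted.put(Long.parseLong(parts[0]), Integer.parseInt(parts[1]));
        }
        StringBuilder sb = new StringBuilder();
        sb.append(sorted.size()).append(" :");  // the df comes first
        boolean first = true;
        for (Map.Entry<Long, Integer> e : sorted.entrySet()) {
            sb.append(first ? " (" : ", (").append(e.getKey())
              .append(", ").append(e.getValue()).append(")");
            first = false;
        }
        ctx.write(term, new Text(sb.toString()));
    }
}
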
Questions:
  1. Look up the postings corresponding to the term "coherence". There should only be one line in the entire collection that contains the term. What is that line? What's its docno (i.e., byte offset)?
  2. Look up the postings corresponding to the term "war". Generate a histogram of tf values. That is, in how many lines does "war" appear once, twice, three times, etc.? (One way to tally this is sketched after these questions.)
  3. Do the same for the terms "employment" and "listen".
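For question 2, a small helper can tally the histogram straight from the indexer's output. This sketch is just one way to do it; it assumes the output format shown above, with the term and the rest of the line separated by a tab (TextOutputFormat's default).

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Scans the indexer's output for one term's postings list and tallies
// how many postings have tf = 1, tf = 2, and so on.
public class TfHistogram {
    public static void main(String[] args) throws Exception {
        String term = args[0];  // e.g. "war"
        String file = args[1];  // e.g. "part-r-00000"
        Pattern posting = Pattern.compile("\\((\\d+), (\\d+)\\)");
        TreeMap<Integer, Integer> histogram = new TreeMap<Integer, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(file));
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.startsWith(term + "\t")) continue;
            Matcher m = posting.matcher(line);
            while (m.find()) {
                int tf = Integer.parseInt(m.group(2));  // group 1 is the docno
                Integer count = histogram.get(tf);
                histogram.put(tf, count == null ? 1 : count + 1);
            }
        }
        in.close();
        for (Map.Entry<Integer, Integer> e : histogram.entrySet()) {
            System.out.println("tf=" + e.getKey() + " : " + e.getValue() + " line(s)");
        }
    }
}
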
When you're done, zip up your output file(s) and email them to me at collindcouch@gmail.com.  I'll compare your output to the solution and let you know how you did.
Good luck, and have fun!
-Collin Couch

Data Challenge #1: Map Reduce for NYSE

In October I became a Cloudera Certified Developer for Apache Hadoop.  In addition to gaining my certification, I led the study groups for the other engineers at my company who wanted to obtain their certifications, too.  I'm happy to say that all of those engineers are also now certified.  The most useful tool for preparing for this exam was writing up a series of "Data Challenges" that required members of the study group to use what they learned from the Cloudera Hadoop study guide to solve a Big Data problem.  I've decided to share those data challenges on my blog for other Big Data enthusiasts.

Today it is common for engineers and business analysts to use tools such as Pig and Hive when querying big datasets in Hadoop.  Both tools offer an abstraction layer that encapsulates the complexity of MapReduce.  However, any engineer who wants to understand how MapReduce works is best served by writing MapReduce jobs in code first and then learning tools such as Pig and Hive.  This data challenge assumes that you've spent some time learning about MapReduce; it's simply an exercise you can use for practice.

For this data challenge, you are to work with a 12 MB dataset from the New York Stock Exchange (NYSE).  The dataset contains the following data fields for every stock traded on the NYSE, for every trading day from 1/1/2000 through 12/31/2001.
  • stock_symbol
  • date
  • stock_price_open
  • stock_price_high
  • stock_price_low
  • stock_price_close
The image below is a sample of the dataset for the stock "ASP" from 12/17/2001 - 12/31/2001.
This is a subset view of the input file for the stock ASP
You are to create a MapReduce job in Java that returns the date that each stock reached its highest price and the date it reached its lowest price.  Your output file should be in an easy-to-read format that allows me to quickly see the information I need.
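If you need a nudge, here is a minimal sketch of the shape such a job could take. It assumes the input is comma-separated in the field order listed above; the class names and output wording are mine, not part of the challenge.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: parses one CSV record and emits (symbol, "date,high,low").
public class HighLowMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
        String[] f = value.toString().split(",");
        if (f.length < 6) return;  // skip malformed lines
        try {
            Double.parseDouble(f[3]);
            Double.parseDouble(f[4]);
        } catch (NumberFormatException e) {
            return;  // skip a header row, if the file has one
        }
        ctx.write(new Text(f[0]), new Text(f[1] + "," + f[3] + "," + f[4]));
    }
}

// Reducer: tracks the maximum high and minimum low for each symbol,
// remembering the date on which each occurred.
public class HighLowReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text symbol, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double maxHigh = Double.NEGATIVE_INFINITY, minLow = Double.POSITIVE_INFINITY;
        String highDate = "", lowDate = "";
        for (Text v : values) {  // each value is "date,high,low"
            String[] f = v.toString().split(",");
            double high = Double.parseDouble(f[1]);
            double low = Double.parseDouble(f[2]);
            if (high > maxHigh) { maxHigh = high; highDate = f[0]; }
            if (low < minLow)   { minLow = low;  lowDate = f[0]; }
        }
        ctx.write(symbol, new Text("highest " + maxHigh + " on " + highDate
                + ", lowest " + minLow + " on " + lowDate));
    }
}
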
Extra Credit:
Modify your map reduce job so that it returns one output file for the year 2000 and another output file for the year 2001.
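One approach, sketched below under my own assumptions: have the mapper emit keys of the form "SYMBOL,YYYY" (so each stock reduces separately per year) and use Hadoop's MultipleOutputs in the reducer to route each result to a per-year file.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Reducer variant for the extra credit. The driver must register one
// named output per year, e.g.:
//   MultipleOutputs.addNamedOutput(job, "y2000", TextOutputFormat.class, Text.class, Text.class);
//   MultipleOutputs.addNamedOutput(job, "y2001", TextOutputFormat.class, Text.class, Text.class);
public class YearSplitReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> out;

    @Override
    protected void setup(Context ctx) {
        out = new MultipleOutputs<Text, Text>(ctx);
    }

    @Override
    protected void reduce(Text symbolYear, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        // Same high/low logic as before; values are "date,high,low".
        double maxHigh = Double.NEGATIVE_INFINITY, minLow = Double.POSITIVE_INFINITY;
        String highDate = "", lowDate = "";
        for (Text v : values) {
            String[] f = v.toString().split(",");
            double high = Double.parseDouble(f[1]);
            double low = Double.parseDouble(f[2]);
            if (high > maxHigh) { maxHigh = high; highDate = f[0]; }
            if (low < minLow)   { minLow = low;  lowDate = f[0]; }
        }
        // Route the record to the file for its year, e.g. "2000/part-r-00000".
        String year = symbolYear.toString().split(",")[1];
        out.write("y" + year, symbolYear, new Text("highest " + maxHigh + " on "
                + highDate + ", lowest " + minLow + " on " + lowDate), year + "/part");
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        out.close();
    }
}
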
When you're done, zip up your output file(s) and email them to me at collindcouch@gmail.com.  I'll compare your output to the solution and let you know how you did.
Good luck, and have fun!
-Collin Couch

Thursday, March 27, 2014

Gesture-Based Technology


Minority Report
I've been doing research on emerging technologies for 2014, and the one that stands out most is gesture-based technology.  When the movie Minority Report came out in 2002, I saw how cool the future could be as Tom Cruise saved the world (I think), interacting with the machines around him using gestures.  Based on what I've learned, director Steven Spielberg may have been on to something.

Making life easier
We've all experienced the situation where you're in the kitchen cooking, hands covered with ingredients, and your cell phone rings.  Swiping your finger across the screen only gets the phone dirty, so by the time you wash your hands and answer, the call has gone to voicemail.  Soon, some of life's minor inconveniences with the machines we've come to rely on will be handled with gesture-based technology.  You'll be able to answer that annoying phone call and put it into speaker mode with a wave of the hand.

Products
Some gesture-based products gaining momentum in the market today are Microsoft's Kinect and the gesture-based computing product Leap.  I believe gesture-based remote control for TV will be mainstream in a few years.  There are more products on the way serving industries such as education, military, travel, gaming, and telecommunications.


Education
Having spent the last 10 years developing products for the education industry, I have a strong interest in what role this technology will play in education.  The thought of students using gestures to interact with lessons and materials seems far more engaging than the existing model of listening to a teacher lecture.  I would imagine that team project work utilizing gesture-based technology would improve student outcomes as well.  These are just hypotheses that need to be proven, but the idea seems encouraging.

Where do we go from here?
I believe industries of all kinds are going to find ways to leverage gesture-based technology in the coming years, and this will open huge opportunities for start-ups everywhere to provide the technological products and platforms that make it all work.