In October I became a Cloudera Certified Developer for Apache Hadoop. In addition to earning my certification, I led the study groups for the other engineers at my company who wanted to obtain their certifications, too. I'm happy to say that all of those engineers are now certified as well. The most useful tool for preparing for this exam was writing a series of "Data Challenges" that required members of the study group to apply what they learned from the Cloudera Hadoop study guide to solve a Big Data problem. I've decided to share those data challenges on my blog for other Big Data enthusiasts.

For this data challenge, you'll be creating an inverted index. An inverted index is a data structure common to nearly all information retrieval systems. Let us consider the following text:
1: i love big data
2: and hadoop is what i use for big data
3: hdfs and map reduce make up hadoop
An inverted index is a data structure used by search engines and databases to map search terms to the files or documents in which they appear. There are two main types of inverted index: the record-level inverted index and the word-level inverted index. For this exercise we will create a variation of a record-level inverted index in which we build a collection of postings lists, one for each unique term in the collection. Let's treat each line in the above sample data as if it were a "document". The complete inverted index would look something like this:
and : 2 : (2, 1), (3, 1)
big : 2 : (1, 1), (2, 1)
data : 2 : (1, 1), (2, 1)
for : 1 : (2, 1)
hadoop : 2 : (2, 1), (3, 1)
hdfs : 1 : (3, 1)
i : 2 : (1, 1), (2, 1)
is : 1 : (2, 1)
love : 1 : (1, 1)
make : 1 : (3, 1)
map : 1 : (3, 1)
reduce : 1 : (3, 1)
up : 1 : (3, 1)
use : 1 : (2, 1)
what : 1 : (2, 1)
As you can see, we have a posting list for each word that appears in the collection. Let us look at the list corresponding to the term hadoop in a bit more detail:
hadoop : 2 : (2, 1), (3, 1)
The number directly after the term is its document frequency, or df for short. The df specifies the number of documents that contain this term. Since hadoop appears in two documents, its df is 2. Although the df can easily be recalculated by counting the number of postings in the list, we store it explicitly in the inverted index. The postings list itself contains a number of entries, each of which is a (docno, tf) tuple. The docno is simply a unique identifier for the document (one through three, in this case). The tf, which stands for term frequency, is the number of times the term appears in the document. The term hadoop appears once in document 2 and once in document 3.
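To make the structure concrete, here is a minimal Java sketch of a postings list as described above: (docno, tf) pairs kept in ascending docno order, with the df stored explicitly even though it can be recomputed. The class and method names are my own, not part of the challenge.

import java.util.TreeMap;

public class PostingsList {
  // docno -> tf; a TreeMap keeps postings sorted by ascending docno
  private final TreeMap<Long, Integer> postings = new TreeMap<>();

  public void add(long docno, int tf) {
    postings.put(docno, tf);
  }

  // The df is simply the number of postings, but we emit it explicitly in the index.
  public int df() {
    return postings.size();
  }

  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder(df() + " :");
    postings.forEach((docno, tf) ->
        sb.append(" (").append(docno).append(", ").append(tf).append(")"));
    return sb.toString();
  }
}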
The Challenge
Write a MapReduce program that builds an inverted index (as described above). Each postings list should explicitly store the df, as well as all the individual postings. Postings should be sorted by ascending docno (postings corresponding to smaller docnos should precede postings corresponding to larger docnos).
Run the inverted indexer on the attached sample input, which is the collection of all US presidential inaugural speeches through 2012. As with the example above, treat each line as if it were an individual "document". When you map over a plain text file using TextInputFormat in Hadoop, the key passed to the mapper contains the byte offset of the line from the beginning of the file, while the value contains the text of the line. Use this offset value as the unique docno.
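Here is one possible shape for such a job, written against the newer org.apache.hadoop.mapreduce API. The class names, the "docno:tf" intermediate encoding, and the output formatting are my own choices for this sketch, not part of the challenge; note also that TextOutputFormat separates the term from its postings list with a tab by default rather than the " : " shown above.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {

  // Mapper: for each line ("document"), count term frequencies locally and emit
  // (term, "offset:tf") pairs. The byte offset key from TextInputFormat serves as the docno.
  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      Map<String, Integer> tf = new HashMap<>();
      for (String token : line.toString().toLowerCase().split("\\W+")) {
        if (!token.isEmpty()) {
          tf.merge(token, 1, Integer::sum);
        }
      }
      for (Map.Entry<String, Integer> e : tf.entrySet()) {
        context.write(new Text(e.getKey()), new Text(offset.get() + ":" + e.getValue()));
      }
    }
  }

  // Reducer: collect all postings for a term, sort them by ascending docno,
  // and emit "df : (docno, tf), (docno, tf), ..." as the value.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      TreeMap<Long, Integer> sorted = new TreeMap<>();
      for (Text value : values) {
        String[] parts = value.toString().split(":");
        sorted.put(Long.parseLong(parts[0]), Integer.parseInt(parts[1]));
      }
      StringBuilder out = new StringBuilder(sorted.size() + " :");
      boolean first = true;
      for (Map.Entry<Long, Integer> e : sorted.entrySet()) {
        out.append(first ? " " : ", ");
        out.append("(").append(e.getKey()).append(", ").append(e.getValue()).append(")");
        first = false;
      }
      context.write(term, new Text(out.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "inverted index");
    job.setJarByClass(InvertedIndex.class);
    job.setMapperClass(IndexMapper.class);
    job.setReducerClass(IndexReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}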
Questions:
- Look up the postings corresponding to the term "coherence". There should only be one line in the entire collection that contains the term. What is that line? What's its docno (i.e., byte offset)?
- Look up the postings corresponding to the term "war". Generate a histogram of tf values. That is, in how many lines does "war" appear once, twice, three times, etc.?
- Do the same for the terms "employment" and "listen".
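For the histogram questions, one approach is to pull the relevant term's line out of the index output, parse the (docno, tf) tuples, and tally the tf values. The sketch below is illustrative only; the input line and its numbers are made up, not taken from the actual speech collection.

import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TfHistogram {
  public static void main(String[] args) {
    // Illustrative postings line for a term; replace with a real line from your output.
    String line = "war : 3 : (1024, 2), (2048, 1), (4096, 1)";
    Pattern posting = Pattern.compile("\\(\\s*(\\d+)\\s*,\\s*(\\d+)\\s*\\)");
    TreeMap<Integer, Integer> histogram = new TreeMap<>();
    Matcher m = posting.matcher(line);
    while (m.find()) {
      int tf = Integer.parseInt(m.group(2)); // second number in each tuple is the tf
      histogram.merge(tf, 1, Integer::sum);  // count how many lines have this tf
    }
    histogram.forEach((tf, count) ->
        System.out.println("tf=" + tf + " -> " + count + " line(s)"));
  }
}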
When you're done, zip up your output file(s) and email them to me at collindcouch@gmail.com. I'll compare your output to the solution and let you know how you did.
Good luck, and have fun!
-Collin Couch