Introduction to Hadoop


Quick Introduction to Hadoop

Hadoop is a project under the umbrella of the Apache Software Foundation. It is a system for the parallel processing of massive data sets distributed across many nodes (computers). Many nodes join together to form a Hadoop cluster. A cluster can even have nodes hosted in different geographical regions, though the normal practice is to keep the nodes in a cluster close together. Hadoop speeds up large computations and hides I/O latency through increased concurrency.

Hadoop is designed for large data processing tasks such as searching user-generated logs, indexing websites, and similar massive workloads. It shines at processing data sets running into terabytes or petabytes, which traditional systems cannot effectively store or process in any reasonable amount of time. Hadoop splits the large data set into chunks and stores each chunk on a different node in the cluster. It stores data chunks locally on the computational node(s) rather than moving data from a central point to the individual nodes for computation, which makes the computation faster because the data to process is already available on the local node. This distributed file system, HDFS (the Hadoop Distributed File System), is the backbone of Hadoop and its effectiveness.

Hadoop uses the MapReduce programming model to split a large computational problem into small units, work on the different units in parallel, and finally collate the individual results into the final result.
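To get a feel for that flow before touching any Hadoop APIs, here is a toy, single-process sketch in plain Java (this is not Hadoop code, and the input chunks are made up): the input is pre-split into chunks, each chunk is processed independently in parallel, and the per-chunk results are collated at the end.

    import java.util.Arrays;
    import java.util.List;

    public class SplitAndCollate {
        public static void main(String[] args) {
            // Made-up input, pre-split into chunks the way Hadoop splits a file.
            List<int[]> chunks = Arrays.asList(
                    new int[]{1, 2, 3},
                    new int[]{4, 5, 6},
                    new int[]{7, 8, 9});

            // "Map": each chunk is summed independently, in parallel.
            // "Collate": the per-chunk sums are combined into one final total.
            int total = chunks.parallelStream()
                              .mapToInt(c -> Arrays.stream(c).sum())
                              .sum();

            System.out.println(total); // 45
        }
    }

Real Hadoop does the same thing at a much larger scale, with the chunks living on different machines.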

Hadoop itself is written in Java, but Hadoop programs can be written in many languages.

Data Structures

Hadoop operates on <key, value> pairs, or two-tuples. All data that your Hadoop programs produce must be formatted as such pairs.

For instance, take the example of calculating how many times users have visited each URL on a site. In this case the output of a Hadoop program might look like this:

  • /index 500
  • /tutorial 100
  • /java 1000
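As a hedged sketch of how such a program might begin, here is a map task written against Hadoop's Java MapReduce API. The class name UrlCountMapper and the log format (the requested path assumed to be the second whitespace-separated field) are illustrative assumptions; the reduce phase would then sum the 1s per URL.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits (url, 1) for every log line; the reduce phase sums the 1s per URL.
    public class UrlCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed log format: "<user> <path> ..." -- the path is field 2.
            String[] fields = line.toString().split("\\s+");
            if (fields.length >= 2) {
                url.set(fields[1]);
                context.write(url, ONE); // e.g. ("/index", 1)
            }
        }
    }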

The key point is that Hadoop always operates on key/value pairs: the map phase emits sets of key/value pairs, and the reduce phase works on all the data that share the same key. To make this clearer, take the example of counting all the words in a book. We want output like:

  • the 10000
  • apple 100
  • hadoop 100

Imagine we have a big book, or all the books in a library, available to us. In such a case normal computation is a challenge and will not finish in any reasonable amount of time, so we have concluded that we need Hadoop here. When we configure Hadoop to process this data, the system can be set up to process one line at a time in hundreds of map tasks running in parallel: one map task gets a line, another gets the next line, and so on. So if we have configured 1000 map tasks, we are processing 1000 lines at a time in parallel.

Now let us see what each map task produces as output. A map task takes a line, breaks it into words, and emits a count for each word. For example, if map task 1 is processing the line "Hadoop is good. Hadoop is nice", it will output the following:

  • Hadoop 1
  • is 1
  • good 1
  • Hadoop 1
  • is 1
  • nice 1

You may be wondering why it emits the same key more than once. For example, it outputs Hadoop twice with value 1; why can't it just sum them and output Hadoop 2? That is possible, but it does not really matter whether we do the sum here or not, because that is the job of the reduce task. In this way we have thousands of map tasks emitting key/value pairs. The Hadoop system collects every key and its values and gives each reduce task a key together with all the values for that key, something like [key, list[value]].
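Here is a sketch of such a map task using Hadoop's Java MapReduce API (the class name is illustrative). Each call to map() handles one line, breaks it into words, and emits a (word, 1) pair for every word, exactly as in the list above.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // One map() call per input line: split it into words, emit (word, 1) each time.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // "Hadoop 1", "is 1", ...
            }
        }
    }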

So if we consider the earlier word "Hadoop", the reduce task will get something like [Hadoop, [1, 1, 1, 1, …]]. In the reduce task, we just need to do a sum; alternatively, counting the entries in the values list gives the total occurrences of the word Hadoop. We can configure many reduce tasks (say, 100) to further improve computational efficiency. Hadoop guarantees that a reduce task will get a key and all the values for that key; the same key will never go to more than one reduce task.
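A matching sketch of the reduce task (again, the class name is illustrative): Hadoop hands it one key together with an iterable over all the values for that key, and it simply sums them up.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives (word, [1, 1, 1, ...]) -- one key and ALL its values -- and sums them.
    public class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            context.write(word, total); // e.g. ("Hadoop", 4)
        }
    }

Configuring 100 reduce tasks simply partitions the key space across them; each key still arrives, with all of its values, at exactly one reduce task.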

Introduction to Hadoop – Video (Courtesy: Stanford University, Cloudera)