
In today’s data-driven market, MapReduce is one of the hottest technologies, along with Big Data, Hadoop and NoSQL database engines. But what if I told you that MapReduce was really invented by Julius Caesar, the famous Roman general?

Julius Caesar was a Roman politician, military general, and historian who played a critical role in the events that led to the demise of the Roman Republic and the rise of the Roman Empire. He became known for several quotes attributed to him, such as “alea iacta est” or “veni, vidi, vici”.

Another famous quote is “divide et impera” (divide and conquer), a military and political strategy based on breaking the opponent into smaller, isolated groups in order to weaken the enemy. The same strategy applies to any field where a big problem is easier to handle once split into multiple smaller ones.

In the days of the Roman Empire, the Caesar issued a decree that a census should be taken of the entire Roman world (Luke 2:1). One of the main reasons Rome went to the trouble of conducting a census was to make sure that people were paying the taxes Rome demanded.

MapReduce in the Roman Empire's Census

The empire was far too large for a single person to carry out the census alone. Instead, the Caesar had each proconsul and propraetorian legate conduct a census of their own province and send the results to Rome, where they were consolidated into the final census.

In other words, Julius Caesar divided a large set of input data (the people in the whole Roman Empire) into several smaller chunks (the people in each province), which were then “processed” in parallel, making the census much faster and returning the results in a shorter time.

Using MapReduce to process huge amounts of data

MapReduce is a programming paradigm designed to allow parallel, distributed processing of large sets of data, converting them to sets of tuples and then combining and reducing those tuples into smaller sets of tuples. In layman’s terms, MapReduce was designed to take Big Data and use parallel, distributed computing to turn it into little- or regular-sized data.

Processing takes place on a large number of nodes (computers), either in a cluster (if all nodes are on the same local network and use similar hardware) or in a grid (if the nodes are spread across geographically and administratively distributed systems and use more heterogeneous hardware).

How does MapReduce Work?

The basic unit of information used in MapReduce is a (key, value) pair. All types of structured and unstructured data need to be translated into this basic unit before the data can be fed to the MapReduce model. As the name suggests, the MapReduce model consists of two separate routines: a map function and a reduce function. This article will help you understand the step-by-step functionality of the MapReduce model. The computation on an input (i.e. on a set of (key, value) pairs) in the MapReduce model occurs in three stages:

  1. The map stage
  2. The shuffle stage
  3. The reduce stage

Semantically, the map and shuffle phases distribute the data, and the reduce phase performs the computation.
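
To make these stages concrete, here is the classic word-count task traced by hand (a hypothetical example; using line numbers as the input keys is an arbitrary choice):

    Input pairs:    (1, "to be or"), (2, "not to be")
    After map:      ("to", 1), ("be", 1), ("or", 1), ("not", 1), ("to", 1), ("be", 1)
    After shuffle:  ("be", [1, 1]), ("not", [1]), ("or", [1]), ("to", [1, 1])
    After reduce:   ("be", 2), ("not", 1), ("or", 1), ("to", 2)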

MapReduce Steps

The map stage

Unlike many other Big Data processing frameworks, MapReduce logic is not restricted to structured datasets; it has an extensive capability to handle unstructured data as well.

In the map stage, the mapper takes a single (key, value) pair as input and produces any number of (key, value) pairs as output. It is important to think of the map operation as stateless, that is, its logic operates on a single pair at a time (even if in practice several input pairs are delivered to the same mapper).

To summarize, for the map phase, the user simply designs a map function that maps an input (key, value) pair to any number (even none) of output pairs. Most of the time, the map phase is simply used to specify the desired location of the input value by changing its key.
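
As a minimal sketch, a word-count mapper in Python could look like the following (the name map_function and the word-count task itself are illustrative assumptions, not part of any particular framework):

    def map_function(key, value):
        # key: e.g. a line number (unused here); value: one line of text.
        # Emit a (word, 1) pair for every word in the line. The mapper is
        # stateless: it looks at a single input pair at a time.
        for word in value.split():
            yield (word.lower(), 1)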

The shuffle stage

Between the map processing and the reduce processing, a shuffle step sorts all map output values with the same key into a single reduce input (key, value-list) pair, where the ‘value’ is a list of all values sharing the same key. Thus, the input to a reduce task is actually a set of (key, value-list) pairs.

The shuffle stage is handled automatically by the MapReduce framework, i.e. the developer does not have to write any code for this stage. The underlying system routes all of the values associated with an individual key to the same reducer. This makes the next stage much faster: all the data related to the same key is already on the same reducer node, so there is little to no network delay when accessing it.
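
Although the framework performs the shuffle for you, a minimal single-machine sketch of what it accomplishes might look like this (it assumes all pairs fit in memory on one machine, which is precisely what a real distributed shuffle avoids):

    from collections import defaultdict

    def shuffle(mapped_pairs):
        # Group every (key, value) pair by key into (key, value-list) pairs.
        groups = defaultdict(list)
        for key, value in mapped_pairs:
            groups[key].append(value)
        # Sorting by key mimics the sort that real frameworks perform.
        return sorted(groups.items())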

The reduce stage

In the reduce stage, the reducer takes all of the values associated with a single key and outputs any number of (key, value) pairs. This highlights one of the sequential aspects of MapReduce computation: all of the Map nodes need to finish before the reduce stage can begin (to ensure that all the required data is already present on the reducer nodes).

Since the reducer has access to all the values with the same key, it can perform sequential computations on these values. In the reduce step, the parallelism is exploited by observing that reducers operating on different keys can be executed simultaneously.

To summarize, for the reduce phase, the user designs a function that takes as input a list of values associated with a single key and outputs any number of (key, value) pairs.
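
Continuing the word-count sketch, the matching reducer can be a one-liner (again, the name reduce_function is an illustrative assumption):

    def reduce_function(key, values):
        # values: the list of all values that share this key.
        # For word count, reducing just sums the ones emitted by the mapper.
        yield (key, sum(values))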

Overall, a program in the MapReduce paradigm can consist of many rounds (usually called jobs) of different map and reduce functions, performed sequentially one after another.
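
Putting the pieces together, a toy single-process driver that reuses the map_function, shuffle, and reduce_function sketched above could run one such round end to end (purely illustrative; a real framework distributes each stage across many nodes):

    def map_reduce(inputs, mapper, reducer):
        mapped = [pair for k, v in inputs for pair in mapper(k, v)]
        return [pair for k, vs in shuffle(mapped) for pair in reducer(k, vs)]

    inputs = [(1, "to be or"), (2, "not to be")]
    print(map_reduce(inputs, map_function, reduce_function))
    # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]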

MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation that has support for distributed shuffles is part of Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology, but has since been genericized.

Advantages of MapReduce

Traditional systems tend to use a centralized server for storing and retrieving data. In the case of Big Data, such huge amounts of data cannot be accommodated by standard database servers. Centralized systems also create too much of a bottleneck when processing multiple files simultaneously.

Google came up with MapReduce to solve such bottleneck issues. MapReduce divides the task into small parts and processes each part independently by assigning them to different systems. After all the parts are processed and analyzed, the output of each computer is collected in a single location and an output dataset is prepared for the given problem. The two biggest advantages of MapReduce are:

Parallel Processing

In MapReduce, the job is divided among multiple nodes, and each node works on its part of the job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm, which lets us process the data using different machines. As the data is processed by multiple machines in parallel instead of by a single machine, the time taken to process it is reduced tremendously.
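
As a rough, hypothetical stand-in for a cluster, the map stage of the earlier sketch could be spread across local worker processes with Python’s multiprocessing module (reusing map_function and inputs from above):

    from multiprocessing import Pool

    def run_mapper(pair):
        # Wrapper so a whole input pair can be shipped to a worker process.
        return list(map_function(*pair))

    if __name__ == "__main__":
        with Pool() as pool:  # one worker per CPU core by default
            # Each input pair is mapped in its own worker, in parallel.
            mapped = [p for out in pool.map(run_mapper, inputs) for p in out]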

Data Locality

Instead of moving data to the processing unit, the MapReduce framework moves the processing unit to the data. Traditional systems bring the data to a central processing unit and process it there, but as datasets grew very large, bringing all of that data to the processing unit posed the following issues:

  • Moving huge amounts of data to the processing unit is costly and degrades network performance.
  • Processing is slow, since a single unit handles all the data and becomes the bottleneck.
  • The master node can get overburdened and may fail.

MapReduce overcomes these issues by bringing the processing unit to the data: the data is distributed among multiple nodes, and each node processes the part of the data residing on it. This gives us the following advantages:

  • Moving the processing unit to the data is very cost-effective.
  • Processing time is reduced, as all the nodes work on their part of the data in parallel.
  • Every node gets only a part of the data to process, so no single node gets overburdened.

Conclusion

The walkthrough above follows the sequential flow of MapReduce, and the examples were deliberately small and basic, meant to introduce MapReduce at a beginner’s level. The reason it is admired so much is its capability for parallelism and for producing output through key-value pair analysis.

It can work wonders with Big Data: for huge datasets and real-world problems, MapReduce is a great choice for easily processing practically any volume of data. If Big Data is what you are heading into, MapReduce should be among the first things that come to mind.

