How to conceptualize MapReduce
November 14, 2023

Cloud computing makes it possible to use thousands of commodity computers to carry out large scale tasks. However, to explore the full power of this combined computational capacity, there is need for supporting mechanisms to write programs that are capable of utilising the full potential. Such programmes should distribute the subtasks across the different computers and use every opportunity to engage each computer in parallel.

Map Reduce provides a means to engage many computers in parallel to carry out a given task. The critical issues that are decided on by map reduce are how to sub-divide the task into small tasks for each computer to carry out independently as much as possible. How dow we add-up the results from the sub-tasks into one single result. From the computing discipline we call these decisions and others a “framework”, and so Map Reduce is a framework that facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel commodity servers — commodity means computers with average specifications.

Take a simple example where we want find the the richest person in a given country. So we have dispatched our data collectors to record the owner of each property. The recorded data captures properties such as cars, land, buildings etc against the national-id of the owner. Assuming we end up with 2TB of data in say 70 different files. For ease of understanding, a combination should be conceptualised as a representation in which items are grouped by key and placed in different bags. Each bag will only contain items of the same key.

At our disposal we have 150 computers to work with. The starting point is to split the 70 files into chunks of data and give each computer a chunk to work on as independently as possible. When each of the computers finish, we then combine the results to extract the richest person. Conceptually, each computer should put properties each person in a separate bag. So we have a bag with properties for every person found in their data chunk. The next step is to get bags of the same person coming from each computer and add them up. After this round we shared have one bag per son containing the properties. We can then sort to find the richest.

That is how MapReduce precisely works. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS. In computing terms, MapReduce works into major steps, the map and reduce. The MapReduce allows your to write the programs to map and another to reduce, then the framework will call these functions to do what you programmed. The map function splits the data into chunks, emits pairs. Once again the choice of the key and values is the duty of the programmer based on the task at hand. In our our rich-man-task, a suitable key is the national-id, and the value is the properties. For processing purposes, the map task breaks its chunk into input formats defined by pairs. This first set of pairs used to process input are defined by the MapReduce framework with an option for custom definition. The input formats include FileInputFormat, TextInputFormat, KeyFileTextInputFormat and many others.

Take a case of FileInputFormat, which provides with a path containing files to read. FileInputFormat will read all files and divides these files into one or more InputSplits. TextInputFormat treats each line of each input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records like log files. For each line, the Key – is the byte offset of the beginning of the line within the file (not whole file just one split), so it will be unique if combined with the file name, and the Value –  is the contents of the line, excluding line terminators. The output is the an another pairs where in our case the key is the national-id, and value will be the extract properties of the person that we find in the this data-chuck being handled by one computer

In the second step, bags belonging the same person encountered by different mappers are sent to same reducer. Now the reducer can do final processing for that person. So each reducer receives values for one key form different mappers

The types of keys and values differ based on the use case. All inputs and outputs are stored in the HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce function is optional.

<k1, v1> -> Map() -> list(<k2, v2>)

<k2, list(v2)> -> Reduce() -> list(<k3, v3>)

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application. With MapReduce, rather than sending data to where the application or logic resides, the logic is executed on the server where the data already resides, to expedite processing. Data access and storage is disk-based—the input is usually stored as files containing structured, semi-structured, or unstructured data, and the output is also stored in files.

A Mapper is a Hadoop servers that runs the Map function, while Reducers are servers carryout the reduce function. It doesn’t matter if these are the same or different servers.

Map
The input data is first split into smaller blocks. Each block is then assigned to a mapper for processing.

Reduce
After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to the reducers. A reducer cannot start while a mapper is still in progress. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.

Combine and Partition
These are two intermediate steps between Map and Reduce. While Combine is an optional process, the combiner is a reducer that runs individually on each mapper server. It reduces the data on each mapper further to a simplified form before passing it downstream. The

Partition is the process that translates the pairs resulting from mappers to another set of pairs to feed into the reducer. It decides how the data has to be presented to the reducer and also assigns it to a particular reducer. The default partitioner determines the hash value for the key, resulting from the mapper, and assigns a partition based on this hash value. There are as many partitions as there are reducers. So, once the partitioning is complete, the data from each partition is sent to a specific reducer.

So MapReduce is a programming model initially introduced by Google. It provides a scalable and fault-tolerant approach to process massive amounts of data in parallel across a distributed cluster of computers. The model is inspired by functional programming principles and leverages the power of parallelism to achieve high-performance data processing. Map Reduce has become a fundamental tool in AI/ML and Data Science due to its ability to process vast amounts of data efficiently. It allows practitioners to tackle complex Data Analysis tasks that involve large-scale datasets

Loading

Leave a Reply

Your email address will not be published. Required fields are marked *