How to conceptualize MapReduce

Cloud computing makes it possible to use thousands of commodity computers to carry out large-scale tasks. However, to exploit the full power of this combined computational capacity, we need supporting mechanisms for writing programs that can utilise it. Such programs should distribute subtasks across the different computers and use every opportunity to engage each computer in parallel.

MapReduce provides a means to engage many computers in parallel to carry out a given task. The critical issues that MapReduce decides are how to subdivide the task into small tasks that each computer can carry out as independently as possible, and how to add up the results from the subtasks into one final result. In the computing discipline we call these decisions, and others like them, a "framework", and so MapReduce is a framework that facilitates concurrent processing by splitting petabytes of data into smaller chunks and processing them in parallel on commodity servers (commodity means computers with average specifications).

Take a simple example where we want to find the richest person in a given country. We have dispatched data collectors to record the owner of each property. The recorded data captures properties such as cars, land, buildings and so on against the national-id of the owner. Assume we end up with 2TB of data in, say, 70 different files. For ease of understanding, conceptualise combining as grouping items by key and placing them in different bags, where each bag contains only items with the same key.

At our disposal we have 150 computers to work with. The starting point is to split the 70 files into chunks of data and give each computer a chunk to work on as independently as possible. When each of the computers finishes, we combine the results to extract the richest person. Conceptually, each computer should put the properties of each person in a separate bag, so each computer ends up with one bag of properties for every person found in its data chunk. The next step is to take the bags for the same person coming from the different computers and add them up. After this round we should have one bag per person containing all of that person's properties. We can then sort to find the richest.

That is precisely how MapReduce works. The first step is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the order of the name MapReduce implies, the reduce job is always performed after the map job.

Initially, the data for a MapReduce task is stored in input files, which typically reside in HDFS. In computing terms, MapReduce works in two major steps, map and reduce. The framework lets you write one program to map and another to reduce, then calls these functions to do what you programmed. The framework splits the data into chunks, and the map function processes each chunk and emits key/value pairs. The choice of key and value is the duty of the programmer, based on the task at hand. In our rich-person task, a suitable key is the national-id, and the value is the properties. For processing purposes, the map task first breaks its chunk into records defined as key/value pairs. This first set of pairs used to represent the input is defined by the MapReduce framework, with an option for custom definitions. The input formats include FileInputFormat, TextInputFormat, KeyValueTextInputFormat and many others.
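
As a minimal sketch (in Python, not the actual Hadoop Java API), a map function for the rich-person task might look like the following. The comma-separated record layout and the function name map_properties are illustrative assumptions, not taken from any real data set.

```python
# Minimal sketch of a map function for the rich-person task.
# Assumes each input line looks like: "national_id,property_type,value"
# (an illustrative format, not the real data set).

def map_properties(line):
    """Emit a (national_id, property_value) pair for one input record."""
    national_id, property_type, value = line.strip().split(",")
    # The key is the national-id; the value is the worth of the property.
    return [(national_id, float(value))]

# One record from a mapper's chunk:
print(map_properties("UG-1234,land,50000"))  # [('UG-1234', 50000.0)]
```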

Take the case of FileInputFormat, which is provided with a path containing files to read. FileInputFormat reads all the files and divides them into one or more InputSplits. TextInputFormat treats each line of each input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records like log files. For each line, the key is the byte offset of the beginning of the line within the file (not the whole file, just one split), so it is unique when combined with the file name, and the value is the contents of the line, excluding line terminators. The output of the map function is another set of pairs where, in our case, the key is the national-id and the value is the properties of the person that we find in the data chunk being handled by one computer.
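
To make the byte-offset keys concrete, here is a small Python sketch that mimics what TextInputFormat produces for one split. The function text_input_format is a hypothetical stand-in, not Hadoop's actual reader.

```python
# Each line of the split becomes a (byte_offset_of_line_start, line) record,
# with line terminators excluded from the value, as described above.

def text_input_format(split_text):
    """Yield (byte offset of line start, line without terminator) pairs."""
    offset = 0
    for line in split_text.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)

split = "UG-1234,land,50000\nUG-5678,car,12000\n"
for key, value in text_input_format(split):
    print(key, value)
# 0 UG-1234,land,50000
# 19 UG-5678,car,12000
```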

In the second step, bags belonging to the same person encountered by different mappers are sent to the same reducer. Now the reducer can do the final processing for that person. So each reducer receives the values for one key from different mappers.
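
Continuing the sketch, one reduce call for one person could look like this; reduce_wealth and its totalling logic are illustrative assumptions for the rich-person task.

```python
# A reducer receives one key (a national-id) together with all the values
# collected for that key from every mapper, and combines them.

def reduce_wealth(national_id, values):
    """Add up all property values for one person."""
    return (national_id, sum(values))

print(reduce_wealth("UG-1234", [50000.0, 12000.0, 300000.0]))
# ('UG-1234', 362000.0)
```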

The types of keys and values differ based on the use case. All inputs and outputs are stored in the HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce function is optional.

<k1, v1> -> Map() -> list(<k2, v2>)

<k2, list(v2)> -> Reduce() -> list(<k3, v3>)
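
Putting the two signatures together, here is a single-machine Python simulation of the whole flow for the rich-person task, under the same assumed comma-separated record layout as the earlier sketches.

```python
from collections import defaultdict

records = [
    "UG-1234,land,50000",
    "UG-5678,car,12000",
    "UG-1234,building,300000",
]

# Map: <k1, v1> -> list(<k2, v2>), i.e. (offset, line) -> (national_id, value).
# The line index stands in for the byte offset here.
mapped = []
for offset, line in enumerate(records):
    national_id, _, value = line.split(",")
    mapped.append((national_id, float(value)))

# Shuffle: group all values by key, producing <k2, list(v2)>.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: <k2, list(v2)> -> list(<k3, v3>), here (national_id, total_wealth).
totals = [(key, sum(values)) for key, values in groups.items()]

# Sort/scan the reduced output to find the richest person.
print(max(totals, key=lambda kv: kv[1]))  # ('UG-1234', 350000.0)
```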

As described above, MapReduce processes the smaller chunks in parallel on Hadoop commodity servers and, in the end, aggregates all the data from the multiple servers to return a consolidated output back to the application. Rather than sending data to where the application or logic resides, the logic is executed on the server where the data already resides, to expedite processing. Data access and storage are disk-based: the input is usually stored as files containing structured, semi-structured, or unstructured data, and the output is also stored in files.

A Mapper is a Hadoop server that runs the map function, while Reducers are servers that carry out the reduce function. It doesn't matter whether these are the same or different servers.

Map
The input data is first split into smaller blocks. Each block is then assigned to a mapper for processing.

Reduce
After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to the reducers. A reducer cannot start while a mapper is still in progress. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.

Combine and Partition
These are two intermediate steps between map and reduce. Combine is an optional process: the combiner is a reducer that runs individually on each mapper server. It reduces the data on each mapper further to a simplified form before passing it downstream.
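
As a sketch, a combiner for the rich-person task could pre-total each person's properties on the mapper server before anything is shuffled across the network; the combine function below is illustrative, not Hadoop's combiner interface.

```python
from collections import defaultdict

def combine(mapper_output):
    """Pre-aggregate values per key on one mapper server (a mini reduce)."""
    local = defaultdict(float)
    for national_id, value in mapper_output:
        local[national_id] += value
    return list(local.items())

one_mapper = [("UG-1234", 50000.0), ("UG-1234", 300000.0), ("UG-5678", 12000.0)]
print(combine(one_mapper))
# [('UG-1234', 350000.0), ('UG-5678', 12000.0)], fewer pairs to shuffle
```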

Partition is the process that translates the key/value pairs resulting from the mappers to another set of key/value pairs to feed into the reducers. It decides how the data has to be presented to the reducer and also assigns it to a particular reducer. The default partitioner computes a hash value for each key coming out of a mapper and assigns a partition based on this hash value. There are as many partitions as there are reducers, so once the partitioning is complete, the data from each partition is sent to a specific reducer.
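
A minimal Python sketch of that default rule follows; Hadoop's default HashPartitioner applies the equivalent logic in Java. Note that Python salts string hashes per process, so the exact assignments below vary from run to run.

```python
def partition(key, num_reducers):
    """Assign a key, and hence all its values, to one reducer."""
    # Hash the key and take it modulo the number of reducers/partitions.
    return hash(key) % num_reducers

for key in ["UG-1234", "UG-5678", "UG-9999"]:
    print(key, "-> reducer", partition(key, 3))
```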

So MapReduce is a programming model originally introduced by Google. It provides a scalable and fault-tolerant approach to processing massive amounts of data in parallel across a distributed cluster of computers. The model is inspired by functional programming principles and leverages the power of parallelism to achieve high-performance data processing. MapReduce has become a fundamental tool in AI/ML and data science due to its ability to process vast amounts of data efficiently, allowing practitioners to tackle complex data analysis tasks that involve large-scale datasets.


Understanding the key differences between procedural and functional languages

The distinction between functional and procedural programs remains a confusing concept. In fact, many programmers use the terms function and procedure interchangeably. For this reason, I am going to use the term pure function when I want to refer to a functional program.

A procedure in a program is a named block of code that can be used several times by name to carry out a task. When a function or procedure is "called" to do a task, the task may involve returning an output when completed, modifying 'something' while doing its task, or both. When the task involves modifying 'something' outside its environment, we call that a side effect. A side effect affects something outside the scope of the function, such as printing something to the screen or changing the value of a variable. A function can have many side effects before it's done. For example, it might display values in the interpreter, modify a file, or produce graphics before it completes. The built-in function print() in many languages creates a side effect by printing to the screen.
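
A small Python sketch of the distinction; deposit_with_receipt is a hypothetical function that both produces a side effect and returns a value.

```python
def deposit_with_receipt(amount, balance):
    print(f"Deposited {amount}")  # side effect: visible on the screen,
                                  # but not part of the returned output
    return balance + amount       # the returned value the caller receives

new_balance = deposit_with_receipt(500, 2000)  # prints "Deposited 500"
print(new_balance)                             # 2500
```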

While side effects have their place in programming, the challenge is that, on completion of the task, a side effect is NOT returned as an output sent directly to the caller. This means that if a procedure does not return a value at the end of its task, we cannot assign it to a variable. It only makes sense to write int bankBalance = bankSomeMoney(500) if the procedure returns a value, because what would be the value of bankBalance if bankSomeMoney(500) did not return one? While this procedure might do many things, side effects are different from returned values because they are not the output, and many side effects can occur in one procedure.

In programming, when something evaluates to a single value we call it an expression; if something does not evaluate to a value, we call it a statement. So a call to a procedure that returns a single value is an expression. Importantly, returned values can be used in future computations; side effects cannot. Function calls are expressions, since they evaluate to a single value. That means we can nest them, the same way we can nest basic operations.
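
A quick Python illustration of nesting: each inner call evaluates to a single value that feeds the outer call.

```python
# abs(-7) evaluates to 7, round(2.6) evaluates to 3, max(7, 3) evaluates to 7.
print(max(abs(-7), round(2.6)))  # 7
```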

When a program relies on side effects to produce the final result, the order in which the side effects are produced and modified is very important, or you might end up with unexpected results. Thus every action must be carried out immediately.

Now, if we demand that our procedures make no side effects and always return a single value, then we can eliminate many challenges and make other things possible. First, we can avoid state and mutable data. That is, once a variable has been given a value it should not be changed (mutated); state means the value of every variable used in the program.

So in computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast with the procedural programming style that emphasizes changes in state. This is because in mathematics, a function works only with its inputs to produce an output.

Thus a pure function, in the sense of a functional language, always evaluates to the same output given the same input. And since pure functions evaluate to a value, in technical terms they are expressions. Procedures, on the other hand, do not have to evaluate to a value; they might return nothing and just change internal state, and in that case they are NOT expressions but statements.

Our bankSomeMoney(500) procedure is capable of returning different results depending on the current bank balance, so its output does not depend only on its inputs. You can modify it to bankSomeMoney(500, 2000), where 2000 is the current balance; in this version, the same inputs will always return the same output.
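
Here is that idea sketched in Python; the function names mirror the hypothetical bankSomeMoney from the text.

```python
current_balance = 2000  # hidden state the impure version depends on

def bank_some_money_impure(amount):
    global current_balance
    current_balance += amount   # side effect: mutates external state
    return current_balance      # result depends on state, not just the input

def bank_some_money_pure(amount, balance):
    return balance + amount     # same inputs always give the same output

print(bank_some_money_impure(500))      # 2500 now, 3000 on the next call
print(bank_some_money_pure(500, 2000))  # always 2500
```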

Because functional programmers do not expect anything to be changed "behind the curtains", we can evaluate a function to its value at the point when we need the value. Until that time, the function itself is what is passed around. And when we have many functions ready for evaluation, we can decide on the most effective evaluation strategy for combining them. This property, evaluating a computation when its result is needed rather than sequentially where it is called, is known as "laziness": values are computed when they are needed.
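
One way to sketch laziness in Python is to pass a zero-argument function (a "thunk") around and only call it when the value is needed; expensive_total is an illustrative example.

```python
def expensive_total():
    print("computing...")        # side effect included only to show when it runs
    return sum(range(1_000_000))

thunk = expensive_total          # the function itself is passed around, unevaluated
print("nothing computed yet")
value = thunk()                  # the value is computed only at this point
print(value)
```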

We are now ready to point out the key differences between procedural and functional languages. In summary, functional programming focuses on expressions while procedural programming focuses on statements. Expressions have values. A functional program is an expression whose value is a sequence of instructions for the computer to carry out.

Procedural:

  • The output of a routine does not always have a direct correlation with the input.
  • Everything is done in a specific order.
  • Execution of a routine may have side effects.
  • Tends to emphasize implementing solutions in a linear fashion.

Functional:

  • Always returns the same output for a given input.
  • Order of evaluation is usually undefined.
  • Must be stateless, i.e. no operation can have side effects.
  • Good fit for parallel execution – each function can ignore the rest of the universe and focus on what it needs to do, and when combined, functions are guaranteed to work the same as they would in isolation (see the sketch after this list).
  • May have the feature of lazy evaluation – which means a function is not executed until its value is needed.
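
A minimal Python sketch of the parallel-execution point: because square below is pure, its calls can safely run in separate processes without interfering with one another through shared state.

```python
from multiprocessing import Pool

def square(n):   # pure: same input, same output, no side effects
    return n * n

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(square, range(10)))
    # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```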

Are there purely functional and purely procedural programming languages?

Many languages have both functional and procedural capabilities. So it is better to think of a language as being more functional or more procedural based on how much it encourages the use of statements versus expressions.

You can still write in a functional style in a language which encourages the procedural paradigm and vice versa. It’s just harder and/or more awkward to write in a paradigm which isn’t encouraged by the language.

The Lisp family, the ML family, Haskell, and Erlang are toward the "purely functional" side, while many earlier languages like assembly, C, Pascal, and Fortran are on the procedural side.
