How to conceptualize MapReduce

Cloud computing makes it possible to use thousands of commodity computers to carry out large scale tasks. However, to explore the full power of this combined computational capacity, there is need for supporting mechanisms to write programs that are capable of utilising the full potential. Such programmes should distribute the subtasks across the different computers and use every opportunity to engage each computer in parallel.

Map Reduce provides a means to engage many computers in parallel to carry out a given task. The critical issues that are decided on by map reduce are how to sub-divide the task into small tasks for each computer to carry out independently as much as possible. How dow we add-up the results from the sub-tasks into one single result. From the computing discipline we call these decisions and others a “framework”, and so Map Reduce is a framework that facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel commodity servers — commodity means computers with average specifications.

Take a simple example where we want find the the richest person in a given country. So we have dispatched our data collectors to record the owner of each property. The recorded data captures properties such as cars, land, buildings etc against the national-id of the owner. Assuming we end up with 2TB of data in say 70 different files. For ease of understanding, a combination should be conceptualised as a representation in which items are grouped by key and placed in different bags. Each bag will only contain items of the same key.

At our disposal we have 150 computers to work with. The starting point is to split the 70 files into chunks of data and give each computer a chunk to work on as independently as possible. When each of the computers finish, we then combine the results to extract the richest person. Conceptually, each computer should put properties each person in a separate bag. So we have a bag with properties for every person found in their data chunk. The next step is to get bags of the same person coming from each computer and add them up. After this round we shared have one bag per son containing the properties. We can then sort to find the richest.

That is how MapReduce precisely works. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS. In computing terms, MapReduce works into major steps, the map and reduce. The MapReduce allows your to write the programs to map and another to reduce, then the framework will call these functions to do what you programmed. The map function splits the data into chunks, emits pairs. Once again the choice of the key and values is the duty of the programmer based on the task at hand. In our our rich-man-task, a suitable key is the national-id, and the value is the properties. For processing purposes, the map task breaks its chunk into input formats defined by pairs. This first set of pairs used to process input are defined by the MapReduce framework with an option for custom definition. The input formats include FileInputFormat, TextInputFormat, KeyFileTextInputFormat and many others.

Take a case of FileInputFormat, which provides with a path containing files to read. FileInputFormat will read all files and divides these files into one or more InputSplits. TextInputFormat treats each line of each input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records like log files. For each line, the Key – is the byte offset of the beginning of the line within the file (not whole file just one split), so it will be unique if combined with the file name, and the Value –  is the contents of the line, excluding line terminators. The output is the an another pairs where in our case the key is the national-id, and value will be the extract properties of the person that we find in the this data-chuck being handled by one computer

In the second step, bags belonging the same person encountered by different mappers are sent to same reducer. Now the reducer can do final processing for that person. So each reducer receives values for one key form different mappers

The types of keys and values differ based on the use case. All inputs and outputs are stored in the HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce function is optional.

<k1, v1> -> Map() -> list(<k2, v2>)

<k2, list(v2)> -> Reduce() -> list(<k3, v3>)

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application. With MapReduce, rather than sending data to where the application or logic resides, the logic is executed on the server where the data already resides, to expedite processing. Data access and storage is disk-based—the input is usually stored as files containing structured, semi-structured, or unstructured data, and the output is also stored in files.

A Mapper is a Hadoop servers that runs the Map function, while Reducers are servers carryout the reduce function. It doesn’t matter if these are the same or different servers.

Map
The input data is first split into smaller blocks. Each block is then assigned to a mapper for processing.

Reduce
After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to the reducers. A reducer cannot start while a mapper is still in progress. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.

Combine and Partition
These are two intermediate steps between Map and Reduce. While Combine is an optional process, the combiner is a reducer that runs individually on each mapper server. It reduces the data on each mapper further to a simplified form before passing it downstream. The

Partition is the process that translates the pairs resulting from mappers to another set of pairs to feed into the reducer. It decides how the data has to be presented to the reducer and also assigns it to a particular reducer. The default partitioner determines the hash value for the key, resulting from the mapper, and assigns a partition based on this hash value. There are as many partitions as there are reducers. So, once the partitioning is complete, the data from each partition is sent to a specific reducer.

So MapReduce is a programming model initially introduced by Google. It provides a scalable and fault-tolerant approach to process massive amounts of data in parallel across a distributed cluster of computers. The model is inspired by functional programming principles and leverages the power of parallelism to achieve high-performance data processing. Map Reduce has become a fundamental tool in AI/ML and Data Science due to its ability to process vast amounts of data efficiently. It allows practitioners to tackle complex Data Analysis tasks that involve large-scale datasets

Loading

Understanding the key differences between procedural and functional languages

The distinction between functional and procedural programs remains a confusing concept. In fact, many programmers use the terms function and procedure interchangeably. For this reason am going use the term pure function when i want to refer to a functional program.

A procedure in a program is a named block of code that can use several times by name to carry out a task. When a function or procedure is “called” to do a task. The task to be carried out may involve returning the output of the task when completed, modification of ‘something’ doings its task or both. When task involves modifying ’something’ outside its environment, we call that a side effect. A side effect will affect something outside the scope of the function, such as printing something to the screen, changing the value of a variable. There can be many side effects of a function before it’s done. For example, it might display values in the interpreter, or modify a file, or produce graphics before it completes. The built-in function print() in many languages creates a side effect by printing to the screen.

While side effects have their place in programming, the challenge is that side-effects, on completion of the task, the side-effect is NOT returned as an output that is sent directly to the caller. This means that if a procedure does not return value at the end of its task, we cannot assign it to a variable. For it only makes sense to have int bankBalance = bankSomeMoney(500), because what would be the value of bankBalance if the procedure bankSomeMoney(500) does not return a value? While this procedure might do so many things, side effects are different from returned values because they are not the output, and many side effects can occur in one procedure.

In programming, when something evaluates to a single value we call it an expression. if something does not evaluate to a value, we call it a statement. So the a call to a procedure that returns a single value is therefore an expression. Importantly, returned values can be used in future computations. Side effects cannot. Function calls are expressions, since they evaluate to a single value. That means we can nest them, the same way we can nest basic operations.

When a program relies on side-effects to produce the final result, then the order in which the side effects are produced and modified is very important, lest you might end-up with un-expected results. Thus every action must be carried out immediately.

Now, if we demand that our procedures do not make any side-effects, and must always return a single value, then we can eliminate many challenges and make other things possible. First, we can avoid state and mutable data. That is once a variable has be given a value it should note be changed (mutated), and state means the value of every variable used in the program.

So in computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast with the procedural programming style that emphasizes changes in state. Because in math, the functional works with only it inputs to produce an out.

Thus a pure a function in the sense of a functional language always evaluate to the same output given the same input. And since they evaluate to a value, in technical terms they are expressions. On the other hand, procedures don’t evaluate to a value, they might not return anything, just change internal state and therefore they are NOT expressions but statements.

Our bankSomeMoney(500) procedure is capable of return differing results depending on the current bank balance. So its output does not rely on only the inputs. You can modify it to bankSomeMoney(500, 2000) where 2000 is the current balance, in this case, for the same inputs it will return the same output.

Because functional programmers do not expect any thing to be changed “behind the curtains”, we can value the function to its value at the point when we need the value. Until that time the function itself is what is passed around. And when we have many functions ready for evaluation, we can decide on the most effective optimal evaluation strategy combining the functions.This property, evaluating a computation when its result is needed rather than sequentially where it’s called, is known as “laziness”.;values are computed when they are needed.

We are now ready to point out the key differences between procedural and functional languages. In summary, functional programming focuses on expressions while procedural programming focuses on statements. Expressions have values. A functional program is an expression whose value is a sequence of instructions for the computer to carry out.

Procedural:

  • The output of a routine does not always have a direct correlation with the input.
  • Everything is done in a specific order.
  • Execution of a routine may have side effects.
  • Tends to emphasize implementing solutions in a linear fashion.

Functional:

  • Always returns the same output for a given input.
  • Order of evaluation is usually undefined.
  • Must be stateless. i.e. No operation can have side effects.
  • Good fit for parallel execution – Each function can ignore the rest of the universe and focus on what it needs to do. When combined, functions are guaranteed to work the same as they would in isolation.
  • May have the feature of Lazy Evaluation. – Which means a function is not executed until the value is needed. .

Are there pure functional and Procedural Programmings languages?

Man languages have both functional and procedural capabilities. So it is better to think in terms of a languages can be classified as more functional or more procedural based on how much they encourage the use of statements versus expressions.

You can still write in a functional style in a language which encourages the procedural paradigm and vice versa. It’s just harder and/or more awkward to write in a paradigm which isn’t encouraged by the language.

Lisp family and ML family,Haskell, Erlang, are on the side of “purely functional” while many early languages like assembly, Asm, lC, Pascal, Fortran and on the procedural side

Loading

Turning ‘software’ into a ‘software service’

This article looks at how to turn an  existing software that is not service oriented into a service that can be used in a service oriented architecture. We need to know  exactly-  what  is a service?  We are assuming the resulting service will provide  identical functionality. So the  only difference between a software service and other software components is at the interfaces. The interfaces define how the service can be used individually or as part of a larger system. In summary a service needs to achieve the following properties :-

  • is self contained, highly modular, and can be independently deployed. A service can do something useful in its own right.
  • is distributed component, accessible over the network or locator other than the absolute network address.
  • has a published interface, so users only need to see the interface and need not to know the internal details of the implementation.
  • is discoverable, meaning users can look it up in a special directory service where all the services are registered. Services designed for public use require to be discoverable, otherwise potential users may never learn about the service.
  • stresses inter-operability such that users and providers use different implementation languages and platforms. That is any software can be turned on a service for use with other services regardless of the languages in which they are implemented.
  • is dynamically bound, which signifies that the service is located and bound at runtime. Therefore service users do not need to have the service implementation at build time.

Therefore, turning  a software system into  a service consists of encapsulating the software such that it is  exposed  to the web via well defined and  flexible network accessible application programming interface (API).  This can only happen using a set of inter-related technologies. Currently web services provide a technology suite that can provide the above listed characteristics.

Our next article will relate the technologies in web service to   the properties of a service.

Loading

Problem solving -Define the problem

“Until the problem is well defined and articulated it is impossible to arrive at a solution”

The first step to solving any software engineering problem is to define the problem. Articulate the problem and eliminate all unnecessary terminologies and jargons. Start by reading the problem completely at least twice. Read and establish the context of each key word. If time allows, research about the problem.

Ensure that there is agreement on the problem to be solved. Try to restate the problem in you own understanding.  Find out from the person who posed the problem whether the restated problem is the same as the original problem. Identify instances of the problem and see it is possible to solve an instance or example problem A solution to the example problem may lead to insights about how to solve the general problem or bring about any remaining misunderstanding.

Look at the problem from multiple perspectives. Each perspective may reveal additional information about the problem. The problem should be distinguished from its symptoms such that the root cause properly identified and stated.

The output of this step is a well-defined and articulated problem that focuses on what is required for its solution.

Loading

Software programming and problem solving

Programming is the process of planning a sequence of steps called instructions for the computer to follow. The fact that you are reading this post you already know that computers lack common sense and cannot make any judgment. So the computer will do as instructed by the programmer through the computer program. Programming is more about problem solving than coding.

A problem is the difference between things as perceived and things as desired. A solution will move the situation from the things as perceived to the things as desired.

Programmers are problem solvers and need to improve the art and science of problem solving. On one hand problem solving involves an element of art in that experience, judgment and common sense can help deliver smart solutions. On the other hand problems solving is a science involving scientific means of arriving at solutions. Overall, there are several steps to be taken to solve problems

  • Define the problem
  • Analyse the problem
  • List/Identify  alternative solutions
  • Select the best solution
  • List instructions that lead to the solution using the selected solution
  • Evaluate the solution

in the next post, more details on each step shall be discussed

Loading