How to conceptualize MapReduce

Cloud computing makes it possible to use thousands of commodity computers to carry out large scale tasks. However, to explore the full power of this combined computational capacity, there is need for supporting mechanisms to write programs that are capable of utilising the full potential. Such programmes should distribute the subtasks across the different computers and use every opportunity to engage each computer in parallel.

Map Reduce provides a means to engage many computers in parallel to carry out a given task. The critical issues that are decided on by map reduce are how to sub-divide the task into small tasks for each computer to carry out independently as much as possible. How dow we add-up the results from the sub-tasks into one single result. From the computing discipline we call these decisions and others a “framework”, and so Map Reduce is a framework that facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel commodity servers — commodity means computers with average specifications.

Take a simple example where we want find the the richest person in a given country. So we have dispatched our data collectors to record the owner of each property. The recorded data captures properties such as cars, land, buildings etc against the national-id of the owner. Assuming we end up with 2TB of data in say 70 different files. For ease of understanding, a combination should be conceptualised as a representation in which items are grouped by key and placed in different bags. Each bag will only contain items of the same key.

At our disposal we have 150 computers to work with. The starting point is to split the 70 files into chunks of data and give each computer a chunk to work on as independently as possible. When each of the computers finish, we then combine the results to extract the richest person. Conceptually, each computer should put properties each person in a separate bag. So we have a bag with properties for every person found in their data chunk. The next step is to get bags of the same person coming from each computer and add them up. After this round we shared have one bag per son containing the properties. We can then sort to find the richest.

That is how MapReduce precisely works. The first is the map job, which takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). The reduce job takes the output from a map as input and combines those data tuples into a smaller set of tuples. As the sequence of the name MapReduce implies, the reduce job is always performed after the map job.

Initially, the data for a MapReduce task is stored in input files, and input files typically reside in HDFS. In computing terms, MapReduce works into major steps, the map and reduce. The MapReduce allows your to write the programs to map and another to reduce, then the framework will call these functions to do what you programmed. The map function splits the data into chunks, emits pairs. Once again the choice of the key and values is the duty of the programmer based on the task at hand. In our our rich-man-task, a suitable key is the national-id, and the value is the properties. For processing purposes, the map task breaks its chunk into input formats defined by pairs. This first set of pairs used to process input are defined by the MapReduce framework with an option for custom definition. The input formats include FileInputFormat, TextInputFormat, KeyFileTextInputFormat and many others.

Take a case of FileInputFormat, which provides with a path containing files to read. FileInputFormat will read all files and divides these files into one or more InputSplits. TextInputFormat treats each line of each input file as a separate record and performs no parsing. This is useful for unformatted data or line-based records like log files. For each line, the Key – is the byte offset of the beginning of the line within the file (not whole file just one split), so it will be unique if combined with the file name, and the Value –  is the contents of the line, excluding line terminators. The output is the an another pairs where in our case the key is the national-id, and value will be the extract properties of the person that we find in the this data-chuck being handled by one computer

In the second step, bags belonging the same person encountered by different mappers are sent to same reducer. Now the reducer can do final processing for that person. So each reducer receives values for one key form different mappers

The types of keys and values differ based on the use case. All inputs and outputs are stored in the HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce function is optional.

<k1, v1> -> Map() -> list(<k2, v2>)

<k2, list(v2)> -> Reduce() -> list(<k3, v3>)

MapReduce facilitates concurrent processing by splitting petabytes of data into smaller chunks, and processing them in parallel on Hadoop commodity servers. In the end, it aggregates all the data from multiple servers to return a consolidated output back to the application. With MapReduce, rather than sending data to where the application or logic resides, the logic is executed on the server where the data already resides, to expedite processing. Data access and storage is disk-based—the input is usually stored as files containing structured, semi-structured, or unstructured data, and the output is also stored in files.

A Mapper is a Hadoop servers that runs the Map function, while Reducers are servers carryout the reduce function. It doesn’t matter if these are the same or different servers.

The input data is first split into smaller blocks. Each block is then assigned to a mapper for processing.

After all the mappers complete processing, the framework shuffles and sorts the results before passing them on to the reducers. A reducer cannot start while a mapper is still in progress. All the map output values that have the same key are assigned to a single reducer, which then aggregates the values for that key.

Combine and Partition
These are two intermediate steps between Map and Reduce. While Combine is an optional process, the combiner is a reducer that runs individually on each mapper server. It reduces the data on each mapper further to a simplified form before passing it downstream. The

Partition is the process that translates the pairs resulting from mappers to another set of pairs to feed into the reducer. It decides how the data has to be presented to the reducer and also assigns it to a particular reducer. The default partitioner determines the hash value for the key, resulting from the mapper, and assigns a partition based on this hash value. There are as many partitions as there are reducers. So, once the partitioning is complete, the data from each partition is sent to a specific reducer.

So MapReduce is a programming model initially introduced by Google. It provides a scalable and fault-tolerant approach to process massive amounts of data in parallel across a distributed cluster of computers. The model is inspired by functional programming principles and leverages the power of parallelism to achieve high-performance data processing. Map Reduce has become a fundamental tool in AI/ML and Data Science due to its ability to process vast amounts of data efficiently. It allows practitioners to tackle complex Data Analysis tasks that involve large-scale datasets


Understanding the key differences between procedural and functional languages

The distinction between functional and procedural programs remains a confusing concept. In fact, many programmers use the terms function and procedure interchangeably. For this reason am going use the term pure function when i want to refer to a functional program.

A procedure in a program is a named block of code that can use several times by name to carry out a task. When a function or procedure is “called” to do a task. The task to be carried out may involve returning the output of the task when completed, modification of ‘something’ doings its task or both. When task involves modifying ’something’ outside its environment, we call that a side effect. A side effect will affect something outside the scope of the function, such as printing something to the screen, changing the value of a variable. There can be many side effects of a function before it’s done. For example, it might display values in the interpreter, or modify a file, or produce graphics before it completes. The built-in function print() in many languages creates a side effect by printing to the screen.

While side effects have their place in programming, the challenge is that side-effects, on completion of the task, the side-effect is NOT returned as an output that is sent directly to the caller. This means that if a procedure does not return value at the end of its task, we cannot assign it to a variable. For it only makes sense to have int bankBalance = bankSomeMoney(500), because what would be the value of bankBalance if the procedure bankSomeMoney(500) does not return a value? While this procedure might do so many things, side effects are different from returned values because they are not the output, and many side effects can occur in one procedure.

In programming, when something evaluates to a single value we call it an expression. if something does not evaluate to a value, we call it a statement. So the a call to a procedure that returns a single value is therefore an expression. Importantly, returned values can be used in future computations. Side effects cannot. Function calls are expressions, since they evaluate to a single value. That means we can nest them, the same way we can nest basic operations.

When a program relies on side-effects to produce the final result, then the order in which the side effects are produced and modified is very important, lest you might end-up with un-expected results. Thus every action must be carried out immediately.

Now, if we demand that our procedures do not make any side-effects, and must always return a single value, then we can eliminate many challenges and make other things possible. First, we can avoid state and mutable data. That is once a variable has be given a value it should note be changed (mutated), and state means the value of every variable used in the program.

So in computer science, functional programming is a programming paradigm that treats computation as the evaluation of mathematical functions and avoids state and mutable data. It emphasizes the application of functions, in contrast with the procedural programming style that emphasizes changes in state. Because in math, the functional works with only it inputs to produce an out.

Thus a pure a function in the sense of a functional language always evaluate to the same output given the same input. And since they evaluate to a value, in technical terms they are expressions. On the other hand, procedures don’t evaluate to a value, they might not return anything, just change internal state and therefore they are NOT expressions but statements.

Our bankSomeMoney(500) procedure is capable of return differing results depending on the current bank balance. So its output does not rely on only the inputs. You can modify it to bankSomeMoney(500, 2000) where 2000 is the current balance, in this case, for the same inputs it will return the same output.

Because functional programmers do not expect any thing to be changed “behind the curtains”, we can value the function to its value at the point when we need the value. Until that time the function itself is what is passed around. And when we have many functions ready for evaluation, we can decide on the most effective optimal evaluation strategy combining the functions.This property, evaluating a computation when its result is needed rather than sequentially where it’s called, is known as “laziness”.;values are computed when they are needed.

We are now ready to point out the key differences between procedural and functional languages. In summary, functional programming focuses on expressions while procedural programming focuses on statements. Expressions have values. A functional program is an expression whose value is a sequence of instructions for the computer to carry out.


  • The output of a routine does not always have a direct correlation with the input.
  • Everything is done in a specific order.
  • Execution of a routine may have side effects.
  • Tends to emphasize implementing solutions in a linear fashion.


  • Always returns the same output for a given input.
  • Order of evaluation is usually undefined.
  • Must be stateless. i.e. No operation can have side effects.
  • Good fit for parallel execution – Each function can ignore the rest of the universe and focus on what it needs to do. When combined, functions are guaranteed to work the same as they would in isolation.
  • May have the feature of Lazy Evaluation. – Which means a function is not executed until the value is needed. .

Are there pure functional and Procedural Programmings languages?

Man languages have both functional and procedural capabilities. So it is better to think in terms of a languages can be classified as more functional or more procedural based on how much they encourage the use of statements versus expressions.

You can still write in a functional style in a language which encourages the procedural paradigm and vice versa. It’s just harder and/or more awkward to write in a paradigm which isn’t encouraged by the language.

Lisp family and ML family,Haskell, Erlang, are on the side of “purely functional” while many early languages like assembly, Asm, lC, Pascal, Fortran and on the procedural side


Problem solving Strategies for Software Programmers

Programming is a problem solving activity. When you write a program, you are actually writing an instruction for a computer to solve some problem. Overtime,  there are several strategies that have been developed and applied to solve problems. Problem solving is the processing of transforming problem from initial state to a desired  state. Some techniques are more effective while others are less. Here I outline some of the common strategies

Trial and Error – This is also known as solving problems using guess and check or generate and test. While it is certainly true that we don’t want to simply guess random answers as a means of solving problems, there are instances when educated guesses are important, valid and useful. For instance estimating the time an activity will end is an example of an informed guess. This techniques works like this:

  • Form an educated guess
  • Check your solution to see if it works and solves the problem
  • If not, revise your guess based on whether it is too high or too low

Root Cause analysis – a sequence of cause and effect is investigated until the origin of the problem is identified. Root Cause Analysis (RCA) is a systematic concept that involves a set of problem-solving approaches used to determine the underlying cause of an issue. In most cases, when a problem occurs, it creates other problems and resulting problems create others. For instance, in one of the software systems we discovered that some parts of the system were becoming very slow. On further analysis, the page was loading too much data. On further analysis the users where not closing the visiting, leaving many data points to be queried. So a possible solution was to close the visits programmatically after some time. The alternative solution could have been to add more RAM and processing power to the computers. Tools that can help in carrying out effective root cause analysis include the 5 WHY and the Fish Bone Diagram.

Algorithms – in this approach one defines set of step-by-step procedures that provides the correct answer to a particular problem. By following the instructions correctly, you are guaranteed to arrive at the right answer. An algorithm provides specific rules that guarantee a solution.

Brain Storming – Here the methods works by collecting a large number of ideas until one works. Some of these ideas can be crafted into original, creative solutions to a problem, while others can spark even more ideas.

Analogies – Here we compare parallels and make analogies to some other fields where the problem can easily be understood. An analogy is an abstract parallel between two quite different things. For example, you might analogize driving to project management. In both cases it helps to have a map (i.e., a plan) for where you’re going. An analog is a comparison between two objects, or systems of objects that highlights aspects in which they are thought to be similar. Analogical reasoning is any type of thinking that relies upon an analogy. Note that analogy is a cognitive process in which the problem solvers reason through the relationship between the prior experience and the current problem. There are three steps to

  • Mapping step
  • Application Step (Inference Step)
  • Learning Step

Challenges with this approach include ability to find relevant analogies and ability to resist false counter-suggestions that would destroy the analogy.

Working backwards – Working backwards is to start with the final solution and work back one step at a time to get to the beginning. This process will include the following

  • Work back through the logic of what is causing the problem, using the 5WHY’s process or any information that may be relevant, to the ‘resources’ that are driving it.
  • Look at the history of the events that have brought the situation to its current level.
  • Sketch out how you think a solution for the future might work, by changing the input flows and working through what could happen to input and output levels.

This technique works well when

  • The final result is clear and the initial portion of a problem is obscure.
  • A problem proceeds from being complex initially to being simple at the end.
  • A direct approach involves a complicated equation.
  • A problem involves a sequence of reversible actions.

Means End analysis – In this technique aims to apply sequence of transformations that directly target the end state. As described, a problem exists in a current state (initial state) that must be transformed to arrive at given final state. So one might look at the current state, identify the differences between the current state and final state and then keep on providing solutions to the differences. For instance, start at initial state and then create every possible permutation from my initial state. The next step is to calculate the difference in the states just made and end state. In summary:-

  • Identify your current state,
  • Identify where you want to be (your goal state),
  • Identify the means that will get you there.

Brute force – Here we systematically try all possible solutions until one of them works. For instance if I know that pin number to unlock a phone is 4 digits, then I can try all the possible 4 digit combinations because the pin is one of them. This approach works where the solution space is well known and can be traversed in reasonable amount of time. The approach also requires checking each of the possible solution whether it is correct or not.

Hill Climbing – This technique involving choosing any available option that moves you closer to the solution. One challenge with this approach is that the chosen move may appear to move closer to the solution but is incapable of progressing to final solution. We call this getting stuck at local maxima. Local maxima are states that are closer to a goal than any neighboring state but they are not a goal state.

In conclusion, the different strategies outlined above, fall under two broad categories of Algorithmic approaches and Heuristic approaches. Hill climbing, brute force, trial and error, means ends analysis, working backwards all belong to the heuristic strategies because they lack systematic step by step procedures that guarantee a solution all the time. Algorithmic problem solving is more common in computer programming and several algorithms such as bubble sort and binary search among others that solve specific problems.


Problem solving for computer programmers – Well and Ill-defined problems

In practice, a problem occurs when a problem solver has a goal but initially does not know how to reach the goal. So, the problem begins in a given state (current state) and the problem solver wants the problem in another state (goal state). Problem solving is required to transform the current state to the goal state. Therefore problem solving is application of ideas, skills, or actual information to achieve the solution to a problem or reach a desired outcome.

In light of the above, a well-defined problem provides clear description of the initial state, allowed operations and the goal. Well-defined problems  have clear goals or solution and its problem solving strategies are easily developed. On the other hand, ill-defined problems do provide specific guidance in terms of what is expected.

As an example, most of you might be familiar with the wolf, goat and cabbage farmer river-crossing problem. It is usually told that once upon a time, a farmer went to a market and purchased a wolf, goat and cabbage. On his way home, the farmer came to the bank of a river and rented a boat. But crossing the river by boat, the farmer could carry only himself and a single one of his purchases: the wolf, the goat, or the cabbage. If left unattended together, the wolf would eat the goat, or the goat would eat the cabbage. The farmer’s challenge was to carry himself and his purchases to the far bank of the river, leaving each purchase intact. How can this be achieved?

The above, is a well-stated problem with a clear initial state, allowed operations and final state. In dealing, with programming, which is largely a problem-solving job, because the role of the programmer is to give solutions to the computer for it to execute. Remember that computers just follow the programmed solutions.

“Write a computer program for music” is an example of an ill-defined problem. It neither states the initial state nor the final goal. Generally, ill-defined problems come out as ambiguous, provoke several interpretations and it is not obvious when a solution has been reached.  In addition, ill-defined problems are are unclear, abstract, or confusing and do not have clear problem solving strategies. One strategy to solved ill-defined problems is to add constraints for which the solution will be valid. Such constraints are called operational constraints. In the music playing example, one may add that the program should play, pause, and stop play songs format mp3.

In conclusion, as a way of understanding a problem, problem solvers and particularly programmers should identify the three aspects of the problem for any meaningful and acceptable solution to be accepted.


Don’t compete with anybody – set your standards

As I was talking to a former a student, he asked me what kind of advice I can give him to advance his software engineering career. I quickly told him, do not compete with anybody set your standards!

Many times, students tend to compare themselves with their friends in terms of what they are able to achieve including scores. Little do they factor in their own desires and what exactly they want to do with their skills.

The bottom line is to  identify your strengths and weaknesses, and act upon them to full potential. This attitude of competing with others especially in terms of scores can lead to a case of “an eyed man in a sea of blind men”, you might appear better, but in relative terms against a very weak point of reference.

The next question, is what are the most appropriate standards. In the case of software engineering, i would advise to look at some of the current software projects and aspire to produce better projects. There are standards also available from computing associations that can be a good point of reference. Also look at the job or business you want to perfect, and aim for it.


Generating alternative Solutions to Problems

In the article about problem solving in software engineering, i highlighted the major problem solving steps as:-

  • Define the problem
  • Analyse the problem
  • List/Identify  alternative solutions
  • Select the best solution
  • List instructions that lead to the solution using the selected solution
  • Evaluate the solution

In this post, i will focus on how to generate alternative solutions.The first solution you are arrive at may not be best of all possible options. It is important that we generate as many alternatives as possible. This will allow us to choose the most effective solution to the problem.

To generate alternative solutions, you can look at the problem in different ways. You are argued to find a new perspective that you have not yet thought about. One technique is to quickly list different solutions including those that do not look viable and then try to eliminate one by one and see where they fail. Try combinations of different parts of solutions.

You can also engage stakeholders. Usually stakeholders see problems from completely different perspectives. If you are a developer, involve users, involve sales people and other stakeholders.

Within the same group, brainstorming sessions tend to generate different solutions.  In general, the  more alternative solutions at hand, the final solution will be cheaper, elegant and easy to implement



Inspiring quotes on computing

“It is not only the violin that shapes the violinist, we are all shaped by the tools we train ourselves to use, and in this respect programming languages have a devious influence: they shape our thinking habits.”
Edsger W. Dijkstra
“Program testing can be used to show the presence of bugs, but never to show their absence!”
Edsger W. Dijkstra


Microservices is not more than a specialized case of SOA

Over the last couple of years, there has been renewed energy in the subject of service oriented architectures. The invention of term ‘micro services’ has warranted a immense interest in the subject of service oriented architectures.

Despite the initial contention on the difference between SOA and the already mature Component Based Software Engineering (CBSE), there is consensus that SOA improves the key attributes of software such as agility, cost to market, flexibility, inter-operability and many others. A service in SOA is now understood as an autonomous software unit that can be used programmatically across the network. Services can interact regardless of the underlying technologies for as long they expose their interfaces for use by other entities.

The size of a service — how much functionality can be bundled into a single service unit is a issue of design left to programmers and software architects. Of course services are compositional – i.e. two or more services may be combined into a new service in its own right. Practice and programmatic recommendations suggest that a service should bundle a business functionality. And SOA in is traditional form has been largely conceived to be applicable at the business level.

Microservice advocate a new way of building systems with more fine grained service units. Simple low-level core functionality can now be exposed as services for use in building high-level systems. A micro service may be as large as a function in a functional language or method in Object oriented Language. Microservices should not be equated to functions or methods because a service is an architectural unit.

Whereas micro services raise the level of flexibility, they increase the number of ‘moving’ parts – each service is autonomous, deployeable in an independent process. This comes with well known challenges of distributed computing. Obviously we have come to learn that in software architecture there is need for trade-offs. It may be very hard to optimize both flexibility and reliability.

The level of granuality and new usage scenarios for microservices, require new support tools for designing, monitoring, provisioning, management of microservice based systems.


Feed Yourself With News Anytime Anywhere via Internet Feeds

Let the news, latest music, videos and whatever you like follow you to your mobile phone, laptop, or to the desktop of your office/home computer. Gone are the days where you had to check the Internet for the latest updates about scholarships or scores of your favorite football club. Perhaps I should also point out that most Internet users settle for a set of favorite sites that they check regularly. Without the Feeds, typically you would regularly have to check the different sites for new content and news.

RSS, also known as Feeds, is a short form for Really Simple Syndication and provides a technology where you choose the news you want to receive. You can filter the news by category and the frequency to be updated. Some Feeds deliver the entire content to you while others deliver teaser text. A teaser text is a summary of the news to entice you to read the full article. Feeds not only provide you with a smart way of surfing the Internet, but also keep you updated while saving a lot of time. That is, instead of looking for the same information on a regular basis, the information actually looks for you.

Although not yet widely adopted, all modern sites support Feeds where you can subscribe. Take a simple example of news website such as BBC, CNN or SOCCERNET, whenever the news are updated, a summarized version of the news, called a teaser is sent to you. Similarly, when you subscribe to facebook, or Photo sharing site, the new photos are downloaded to you. Unlike other technologies, you have the news delivered to you.

But the greatest advantage of Feeds comes from the ability to choose and combine news from multiple sources. This capability is called aggregation and there are specialized tools called Aggregators. Aggregators allow users to combine and let content follow them anytime anywhere. This is opposed to checking 50 or 100 sites every day. Feeds provide an effective way to filter and organize the vast amount of information from the Internet. With Aggregators, you can set parameters such as how often they should be updated. Once you have aggregated your Feeds, you simply check the Aggregator or Reader for your customized news.

Feeds have rich applications in other areas such as distance education. Imagine a situation where students are notified on their phones whenever new content is posted. This form of interaction improves the e-learning experience. The Feed technology can also be extended to agriculture

The experience with Feeds is that you do not miss your favorite updates and are able to save time, generally keeping you informed anytime anywhere. Moreover, Feed technologies are cheaper compared to Short Messages Services, where you subscribe to a certain short code number and your are charged for each content.

*This article  first appeared in the newvision, the  Ugandan government owned newspaper  on may 26  2009 titled “Feed yourself with news anytime”*


Turning ‘software’ into a ‘software service’

This article looks at how to turn an  existing software that is not service oriented into a service that can be used in a service oriented architecture. We need to know  exactly-  what  is a service?  We are assuming the resulting service will provide  identical functionality. So the  only difference between a software service and other software components is at the interfaces. The interfaces define how the service can be used individually or as part of a larger system. In summary a service needs to achieve the following properties :-

  • is self contained, highly modular, and can be independently deployed. A service can do something useful in its own right.
  • is distributed component, accessible over the network or locator other than the absolute network address.
  • has a published interface, so users only need to see the interface and need not to know the internal details of the implementation.
  • is discoverable, meaning users can look it up in a special directory service where all the services are registered. Services designed for public use require to be discoverable, otherwise potential users may never learn about the service.
  • stresses inter-operability such that users and providers use different implementation languages and platforms. That is any software can be turned on a service for use with other services regardless of the languages in which they are implemented.
  • is dynamically bound, which signifies that the service is located and bound at runtime. Therefore service users do not need to have the service implementation at build time.

Therefore, turning  a software system into  a service consists of encapsulating the software such that it is  exposed  to the web via well defined and  flexible network accessible application programming interface (API).  This can only happen using a set of inter-related technologies. Currently web services provide a technology suite that can provide the above listed characteristics.

Our next article will relate the technologies in web service to   the properties of a service.