Introduction to Map Reduce
This blog consists of fundamentals of MapReduce and its significance in Hadoop development services. Map Reduce is a programming model that performs parallel and distributed processing of large data sets. Following are the topics that will be covered in this tutorial blog:
- Map Reduce distributed processing: Traditional way
- What is Map Reduce with Example?
- Commands used in Map Reduce
- Map Reduce Program Example
1. Map Reduce distributed processing: Traditional way
We can take football statistics as an example. For example, if I want to know the top scorer. First, I'm going to try to get all the data. Then I'm going to turn it into small blocks or sections. I'm going to collect it in different machines and try to get my response. So essentially, I'm going to get my results, but during the process we can see several challenges.
- Critical path issue: the amount of time required to complete the job without missing the next milestone or actual completion date. Therefore, if any of the machines postpone the job, it will delay the entire work.
- Reliability issue: What if any of the machines that operate with some information fails? It becomes a challenge to handle this bulk failure.
- Equal split issue: How do I split the data into smaller chunks so that even part of the data can be used for each device. In other words, how to split the data equally so that no one device is overloaded or under-used.
- Single split can fail: if any device fails to provide performance, I will not be able to determine the outcome. There should therefore be a mechanism to ensure the system's ability to tolerate this fault.
- Result aggregation: The result generated by each machine should be aggregated to produce the final output.