MapReduce Explained with an Example | System Design

Understand how big data processing works with MapReduce, using a real-world website-log example.

Hayk Simonyan
Level Up Coding


The Problem: How to Analyze Massive Datasets

Imagine you have terabytes of website logs tracking every single visitor interaction, and you want to extract insights from them: which pages are most popular, where visitors drop off in your purchase funnel, and so on.

Traditional tools and databases are simply not designed for datasets of this scale. That’s where MapReduce comes in.

What is MapReduce?

MapReduce is a programming model designed specifically to handle the challenges of processing enormous amounts of data that just won’t fit on a single computer. It was introduced by Google in 2004 to tackle exactly these kinds of scenarios. Let’s see how it works through our website log example…

How MapReduce Handles Big Data

MapReduce operates in two primary phases: the Map phase and the Reduce phase.

Map Phase

In the Map phase, we first split these huge logs into smaller, manageable chunks. These chunks are then sent to different worker computers in a cluster.

Think of each worker as a separate server that handles its assigned chunk. Each worker runs a Map function that extracts the key information: in our case, it maps a key (the specific webpage visited) to a value (a count of 1 for each visit, since we are counting visits).
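To make this concrete, here's a minimal sketch of such a Map function in Python. The log format is an assumption (space-separated date, time, page path, and status code); adjust the parsing to match your actual logs.

    def map_function(log_line):
        """Emit a (page, 1) pair for one log line.

        Assumed line format: "2024-01-15 10:32:01 /products/shoes 200"
        (date, time, page path, status code) -- adjust for real logs.
        """
        fields = log_line.split()
        page = fields[2]   # the page path is the third field in our assumed format
        return (page, 1)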

Reduce Phase

Then, we enter the Reduce phase, where all the key-value pairs generated by the Map phase are sorted and grouped by webpage (the key).

We forward those groups to the Reduce function. For each unique webpage, it adds up the '1' values to find the total number of visits. The same pattern can tackle more complex questions, such as average time spent per page or visitor demographics.
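Continuing the sketch, the group-and-sum step might look like this in plain Python. In a real cluster, the framework performs this shuffle and sort across machines; this single-machine version just illustrates the logic.

    from collections import defaultdict

    def shuffle(mapped_pairs):
        """Group (page, count) pairs by page -- the sort/group step
        between the Map and Reduce phases."""
        groups = defaultdict(list)
        for page, count in mapped_pairs:
            groups[page].append(count)
        return groups

    def reduce_function(page, counts):
        """Sum the counts for one page to get its total visits."""
        return page, sum(counts)

    # Tie it together on one machine (a real cluster distributes each step).
    # "access.log" is a hypothetical file in the format assumed above.
    with open("access.log") as f:
        pairs = [map_function(line) for line in f]
    totals = [reduce_function(page, counts)
              for page, counts in shuffle(pairs).items()]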

And now we can visualize this information on screen with charts and other visuals.

Benefits of MapReduce for Log Analysis

We get a couple of benefits when processing data with MapReduce:

  • Parallel Power: Distributing the work makes processing much faster than a single computer could manage.
  • Scalability: Got even more log data? Just add more computers to the cluster and MapReduce can keep up.
  • Fault Tolerance: If a computer fails during a job, MapReduce automatically reassigns its work to other computers in the network. This ensures that all the tasks are completed successfully without interruption.

Batch vs. Stream Processing

To understand where MapReduce fits, let's quickly touch on batch versus stream processing:

Batch Processing

Batch processing deals with data in large chunks that have already been collected. For example, if you search for a word in a large file in Google Docs or Microsoft Word, the data is all available upfront, so it can be processed immediately.

This is useful for large datasets where immediate results aren’t essential, such as when generating monthly sales reports, analyzing customer purchase history, or training machine learning models on data.

Stream Processing

Stream Processing handles data as it arrives in a continuous flow. For example, when watching a YouTube video, you hit ‘play’, and it starts almost immediately. That’s because tiny pieces of video are sent to your computer in a continuous flow, letting you watch while the rest of the video is still being transmitted.

Streaming is ideal for situations requiring immediate action on data streams, such as when you want to identify suspicious activity in financial transactions or when you need real-time analytics for social media feeds.

Micro-batch Processing

We also have micro-batch processing, which is a hybrid approach that bridges the gap between traditional batch processing and stream processing.

Instead of processing all of your data in one huge batch, micro-batch processing breaks data down into very small batches. These batches are processed at short, fixed intervals (often in seconds or minutes).

Micro-batching is often the preferred method for scenarios demanding faster results than traditional batch, but where full-on streaming isn’t necessary.
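As a rough illustration, a micro-batch loop might look like the sketch below. The source.drain() call is hypothetical; it stands in for whatever buffer collects incoming events between intervals.

    import time
    from collections import Counter

    def micro_batch_loop(source, interval_seconds=5):
        """Process incoming events in small batches at a fixed interval."""
        while True:
            batch = source.drain()   # hypothetical API: events buffered since the last call
            if batch:
                page_counts = Counter(event["page"] for event in batch)
                print(page_counts.most_common(3))   # e.g., the top 3 pages in this batch
            time.sleep(interval_seconds)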

What is the processing method used by MapReduce?

MapReduce is a batch-processing model because it operates on data that is already stored, not on a live continuous stream of incoming data. Input data needs to be divided and distributed before the Map phase of MapReduce even begins.

As you can imagine, batch processing is slower than stream processing because data accumulates before processing begins. But it's generally simpler to set up and manage, while stream processing can be more complex due to the constant flow of data and the potential for errors or inconsistencies.

MapReduce Limitations and Modern Alternatives

While MapReduce was revolutionary, it has limitations in terms of speed and flexibility for iterative and complex data processing tasks. This is where tools like Apache Spark come in.

Apache Spark

Spark leverages in-memory processing, meaning it keeps data in RAM for very fast calculations compared to MapReduce’s reliance on disk storage. It handles a wider range of tasks, including SQL queries, machine learning, and real-time data processing (streaming).
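For comparison, here's roughly what our page-visit count looks like in PySpark. This is a sketch: it assumes the same space-separated log format as before and a local Spark installation.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PageVisitCounts").getOrCreate()

    lines = spark.sparkContext.textFile("access.log")   # same hypothetical log file
    visit_counts = (
        lines
        .map(lambda line: (line.split()[2], 1))   # (page, 1) pairs, as in our Map phase
        .reduceByKey(lambda a, b: a + b)          # sum per key, as in our Reduce phase
    )
    print(visit_counts.take(10))

Because Spark keeps these intermediate pairs in memory rather than writing them to disk between steps, chains of operations like this run much faster than the equivalent MapReduce job.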

Apache Flink

Apache Flink is another powerful framework used for real-time data processing (stream processing). It offers similar capabilities to Spark Streaming, allowing for immediate analysis of data as it arrives. This is a specialized tool for scenarios requiring real-time data analysis, often used alongside Spark for a complete big data processing toolkit.

Hadoop

Hadoop is a broader ecosystem that provides the foundation for tools like Spark and MapReduce to run. It includes a distributed file system (HDFS) for storing large datasets across multiple machines and a resource management system (YARN) that allocates resources (CPU, memory) to applications like Spark or MapReduce.

Think of it as the underlying infrastructure that Spark and other tools use to manage and store big data.

Cloud-based Services (AWS, Azure, GCP)

Cloud providers like AWS, Azure, and Google offer managed data processing solutions that often streamline the use of MapReduce frameworks. These include AWS EMR (which supports Hadoop), Azure HDInsight, and Google Cloud Dataflow (Google’s successor to classic MapReduce, built for both batch and streaming data).

In Conclusion

While MapReduce was a breakthrough, Spark has largely taken its place for most modern big data batch processing tasks. However, understanding MapReduce is still important because it gives you a solid foundation for understanding how these powerful tools work.

If you’re new here, I’m Hayk. I help web developers secure their first tech jobs or advance to senior roles at the Web Dev Mastery community.

For weekly insights on web development that you won’t want to miss, subscribe to My Newsletter.
