pk.org: CS 417/Lecture Notes

Scalable Distributed Computation

Study Guide

Paul Krzyzanowski – 2026-04-07

Distributed computation frameworks solve the same core problem in different ways: how to divide a large computation across many machines, move data where it needs to go, recover from failures, and keep the overall job efficient. The main goal of this topic is to understand the execution model each framework introduces and the kinds of workloads that model supports well.

MapReduce

MapReduce is a batch-processing framework built around a master-worker architecture. The master coordinates the job, assigns map and reduce tasks to workers, tracks their progress, and reassigns work when failures occur. The programmer supplies a map function that emits intermediate key-value pairs and a reduce function that processes one key together with all values associated with that key.
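As a concrete illustration, here is a minimal word-count sketch in Python (illustrative only; the function names and the sequential driver are hypothetical stand-ins, not Google's or Hadoop's actual API):

```python
from collections import defaultdict

# User-supplied map function: emit a (word, 1) pair for every word.
def map_fn(document):
    for word in document.split():
        yield (word, 1)

# User-supplied reduce function: receives one key and all of its values.
def reduce_fn(word, counts):
    yield (word, sum(counts))

# A tiny sequential driver standing in for the framework's shuffle and sort.
def run_mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)      # partition and group by key
    result = {}
    for key in sorted(groups):             # reducer input is sorted by key
        for k, v in reduce_fn(key, groups[key]):
            result[k] = v
    return result
```

In a real deployment the map calls run on many workers in parallel and the grouping step is the cluster-wide shuffle; the logic of the two user functions is unchanged.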

The execution model has a fixed structure. Input is divided into map shards, and map workers process those shards in parallel, emitting intermediate key-value pairs that are partitioned by key. The framework then performs the shuffle and sort: each reducer fetches its assigned partitions from the map workers, so that all records with the same key reach the same reducer, and the reducer's input is sorted and grouped by key before reduction begins. Reduce workers then process each key in turn and produce the final output.

Key concepts in MapReduce include:

Failure handling follows the structure of the job. If a map worker fails, its map tasks are re-executed on another worker, regenerating the intermediate output that was lost with the failed machine. If a reduce worker fails, its reduce task is rerun, fetching the required intermediate partitions again. Stragglers are handled through speculative execution: a slow task may be launched redundantly on another worker, and whichever copy finishes first is used.

MapReduce is best suited for large batch jobs. It is a poor fit for iterative workloads because each stage typically writes its output to storage before the next stage begins.

BSP

Bulk Synchronous Parallel, or BSP, organizes a distributed computation into supersteps. Each superstep has three parts: local computation, communication, and barrier synchronization. Messages sent during one superstep become available in the next.

The key idea is that BSP gives computation a round-based structure. That makes communication easier to reason about and creates natural points for synchronization and checkpointing. The cost is that fast workers must wait for slow workers at each barrier.

Key concepts in BSP include:

BSP itself is a general model of round-based computation, not a specific graph-processing system with a built-in rule such as vote to halt. In a BSP-style program, the stopping condition is defined by the algorithm or the framework built on top of BSP. A computation may stop after a fixed number of rounds or when no worker has any further useful work to do.
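The superstep loop and a message-based stopping condition can be sketched in a few lines of sequential Python (a hypothetical driver interface, assuming `compute` returns the list of messages a worker sends; real systems run the computation phase in parallel and implement the barrier with actual synchronization):

```python
# Minimal sequential sketch of a BSP driver. Each worker computes on the
# messages delivered to it; the barrier makes messages sent during one
# superstep visible only at the start of the next.
def run_bsp(workers, compute, initial_messages, max_supersteps=100):
    inbox = dict(initial_messages)          # worker id -> list of messages
    for superstep in range(max_supersteps):
        outbox = {}
        for w in workers:                   # local computation phase
            for dest, msg in compute(w, superstep, inbox.get(w, [])):
                outbox.setdefault(dest, []).append(msg)   # communication
        inbox = outbox                      # barrier between supersteps
        if not inbox:                       # no pending work: stop
            return superstep + 1
    return max_supersteps

# Example: pass a token down a chain of three workers; worker i forwards
# any message it receives to worker i + 1 until the end of the chain.
def forward(worker, superstep, messages):
    return [(worker + 1, m) for m in messages if worker + 1 < 3]
```

Here the computation stops after three supersteps, when no worker has any remaining messages; a fixed round count would instead use `max_supersteps`.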

BSP is a better fit than MapReduce for iterative algorithms because it keeps repeated rounds of computation explicit.

Pregel and Giraph

Pregel applies the BSP model to graph processing. Its computation is vertex-centric: each vertex receives messages, updates its state, sends messages to other vertices, and may vote to halt. The graph remains present across iterations rather than being reconstructed as key-value data at each round.

In Pregel, vote to halt means that a vertex declares that it currently has no more work to do. The vertex becomes inactive and is skipped in later supersteps unless a new message arrives for it. If a message does arrive, the vertex becomes active again.

This structure is a natural fit for graph algorithms such as shortest paths and PageRank, where information repeatedly propagates along edges. A vertex becomes active when it has work to do, may become inactive when it does not, and may become active again if it later receives a message. The computation terminates only when every vertex has voted to halt, so that all vertices are inactive, and no messages remain in transit anywhere in the system.
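A sequential simulation of single-source shortest paths makes these rules concrete (a sketch under simplifying assumptions, not Pregel's or Giraph's actual API; here a vertex is "active" exactly when it has incoming messages, and sending nothing is its vote to halt):

```python
import math

# Vertex-centric SSSP in the Pregel style, simulated sequentially.
# graph maps each vertex to a list of (neighbor, edge_weight) pairs.
def pregel_sssp(graph, source):
    dist = {v: math.inf for v in graph}     # per-vertex state
    inbox = {source: [0]}                   # initial message to the source
    while inbox:                            # each loop iteration = superstep;
        outbox = {}                         # empty inbox = all vertices halted
        for v, msgs in inbox.items():       # only active vertices compute
            best = min(msgs)
            if best < dist[v]:              # improvement: update state and
                dist[v] = best              # notify neighbors
                for u, w in graph[v]:
                    outbox.setdefault(u, []).append(best + w)
            # otherwise the vertex sends nothing and votes to halt
        inbox = outbox                      # barrier between supersteps
    return dist
```

A vertex that has halted is reactivated automatically in this sketch simply by appearing in a later superstep's inbox, mirroring the message-driven reactivation rule described above.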

Key concepts in Pregel and Giraph include:

Giraph is an Apache open-source system based on the Pregel model.

Spark

Spark was designed for computations that require more flexibility than MapReduce, especially multi-stage and iterative workloads. Its original core abstraction is the Resilient Distributed Dataset, or RDD, which represents a partitioned collection of data distributed across a cluster.

Spark includes several major architectural components:

Spark organizes computation as a dataflow graph. Transformations create new RDDs lazily, while actions trigger execution. This design allows Spark to support multi-stage pipelines without forcing every stage to materialize its output before the next one begins.
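The laziness can be illustrated with a toy class in plain Python (a hypothetical stand-in, not PySpark's API; a real RDD is partitioned across the cluster and records lineage per RDD rather than a flat plan):

```python
# Toy stand-in for an RDD: transformations only record a plan;
# an action executes the whole recorded pipeline.
class ToyRDD:
    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan                   # recorded transformations

    def map(self, f):                       # transformation: lazy
        return ToyRDD(self._data, self._plan + (("map", f),))

    def filter(self, pred):                 # transformation: lazy
        return ToyRDD(self._data, self._plan + (("filter", pred),))

    def collect(self):                      # action: triggers execution
        items = list(self._data)
        for kind, f in self._plan:
            if kind == "map":
                items = [f(x) for x in items]
            else:
                items = [x for x in items if f(x)]
        return items

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; rdd.collect() runs the whole pipeline.
```

Because nothing runs until the action, the system can see the entire pipeline before executing it, which is what lets Spark plan multi-stage jobs without materializing every intermediate result.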

An RDD has several important properties:

Spark performance depends heavily on the dependencies between partitions. With narrow dependencies, each output partition depends on a small, fixed set of parent partitions, so computation can usually proceed without redistributing data. With wide dependencies, an output partition may depend on data spread across many parent partitions, which usually requires a shuffle. In Spark, a shuffle means that data is moved across workers and reorganized into new partitions so that related records end up together for the next stage; it is therefore both communication across the cluster and repartitioning of the data.
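The core of a shuffle is hash partitioning by key, which can be sketched as (hypothetical code, not Spark's implementation; integer keys are used so the partition assignment is deterministic):

```python
# Sketch of what a shuffle accomplishes: records are hash-partitioned by
# key, so every value for a given key lands in the same output partition.
def shuffle(records, num_partitions):
    # records: iterable of (key, value) pairs from the old partitions
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in records:
        p = hash(key) % num_partitions      # deterministic key -> partition
        partitions[p].setdefault(key, []).append(value)
    return partitions
```

In a real cluster each of these output partitions lives on a different worker, so building them requires moving most of the data across the network, which is why wide dependencies are expensive.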

Key concepts in Spark include:

Distributed Machine Learning

Distributed machine learning introduces a workload in which data is used repeatedly to refine a model over many rounds. Training therefore involves repeated passes over data, ongoing parameter updates, and coordination among workers throughout the computation.

Two common strategies are data parallelism and model parallelism. In data parallelism, the dataset is partitioned across workers while each worker holds a copy of the model. In model parallelism, the model itself is partitioned across workers because it is too large to fit comfortably on a single machine or device.
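One synchronous data-parallel training step can be sketched for a one-parameter least-squares model (a minimal illustration with hypothetical function names; real systems compute the shard gradients in parallel and average them with an all-reduce):

```python
# Data-parallel SGD for the 1-D model y = w * x with squared-error loss.
# Each worker holds a replica of w and one shard of the data.
def local_gradient(w, shard):
    # gradient of mean squared error over this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.05):
    grads = [local_gradient(w, s) for s in shards]   # parallel in practice
    avg = sum(grads) / len(grads)                    # all-reduce: average
    return w - lr * avg                              # identical update on
                                                     # every replica

# Train toward w = 2 on data generated from y = 2x, split across 2 workers.
shards = [[(1, 2), (2, 4)], [(3, 6), (4, 8)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

The averaging step is the coordination point: it is repeated every round, which is why communication cost and synchronization dominate the design of distributed training systems.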

The main ideas to know are:

The essential point is that distributed training requires repeated coordination, not just one pass over a dataset.

Comparing the Frameworks

Each framework organizes distributed computation around a different core abstraction.

The main goal is to understand what each framework makes easy, what costs it exposes, and what kinds of workloads it handles best.

What You Do Not Need to Study

Focus on the core abstractions, execution models, and major tradeoffs of each framework. You do not need to study details that are outside that scope.

You should not need to study:

