pk.org: CS 417/Lecture Notes

Scalable Distributed Computation

Study Guide

Paul Krzyzanowski – 2026-04-07

Distributed computation frameworks solve the same core problem in different ways: how to divide a large computation across many machines, move data where it needs to go, recover from failures, and keep the overall job efficient. The main goal of this topic is to understand the execution model each framework introduces and the kinds of workloads that model supports well.

MapReduce

MapReduce is a batch-processing framework built around a master-worker architecture. The master coordinates the job, assigns map and reduce tasks to workers, tracks their progress, and reassigns work when failures occur. The programmer supplies a map function that emits intermediate key-value pairs and a reduce function that processes one key together with all values associated with that key.
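As a concrete illustration, here is a minimal word-count sketch in Python (illustrative only; the function names and the sequential driver are hypothetical stand-ins, not Google's or Hadoop's actual API):

```python
from collections import defaultdict

# User-supplied map function: emit a (word, 1) pair for every word.
def map_fn(document):
    for word in document.split():
        yield (word, 1)

# User-supplied reduce function: receives one key and all of its values.
def reduce_fn(word, counts):
    yield (word, sum(counts))

# A tiny sequential driver standing in for the framework's shuffle and sort.
def run_mapreduce(documents):
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)      # partition and group by key
    result = {}
    for key in sorted(groups):             # reducer input is sorted by key
        for k, v in reduce_fn(key, groups[key]):
            result[k] = v
    return result
```

In a real deployment the map calls run on many workers in parallel and the grouping step is the cluster-wide shuffle; the logic of the two user functions is unchanged.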

The execution model has a fixed structure. Input is divided into map shards, and map workers process those shards in parallel, emitting intermediate key-value pairs that are partitioned by key. The framework then performs the shuffle and sort: each reducer fetches its assigned partitions from the map workers, so that all records with the same key reach the same reducer, and the reducer's input is sorted and grouped by key before reduction begins. Reduce workers then process each key in turn and produce the final output.

Key concepts in MapReduce include:

Failure handling follows the structure of the job. If a map worker fails, its map tasks are re-executed on another worker, regenerating the intermediate output that was lost with the failed machine. If a reduce worker fails, its reduce task is rerun, fetching the required intermediate partitions again. Stragglers are handled through speculative execution: a slow task may be launched redundantly on another worker, and whichever copy finishes first is used.

MapReduce is best suited for large batch jobs. It is a poor fit for iterative workloads because each stage typically writes its output to storage before the next stage begins.

BSP

Bulk Synchronous Parallel, or BSP, organizes a distributed computation into supersteps. Each superstep has three parts: local computation, communication, and barrier synchronization. Messages sent during one superstep become available in the next.

The key idea is that BSP gives computation a round-based structure. That makes communication easier to reason about and creates natural points for synchronization and checkpointing. The cost is that fast workers must wait for slow workers at each barrier.

Key concepts in BSP include:

BSP itself is a general model of round-based computation, not a specific graph-processing system with a built-in rule such as vote to halt. In a BSP-style program, the stopping condition is defined by the algorithm or the framework built on top of BSP. A computation may stop after a fixed number of rounds or when no worker has any further useful work to do.
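The superstep loop and a message-based stopping condition can be sketched in a few lines of sequential Python (a hypothetical driver interface, assuming `compute` returns the list of messages a worker sends; real systems run the computation phase in parallel and implement the barrier with actual synchronization):

```python
# Minimal sequential sketch of a BSP driver. Each worker computes on the
# messages delivered to it; the barrier makes messages sent during one
# superstep visible only at the start of the next.
def run_bsp(workers, compute, initial_messages, max_supersteps=100):
    inbox = dict(initial_messages)          # worker id -> list of messages
    for superstep in range(max_supersteps):
        outbox = {}
        for w in workers:                   # local computation phase
            for dest, msg in compute(w, superstep, inbox.get(w, [])):
                outbox.setdefault(dest, []).append(msg)   # communication
        inbox = outbox                      # barrier between supersteps
        if not inbox:                       # no pending work: stop
            return superstep + 1
    return max_supersteps

# Example: pass a token down a chain of three workers; worker i forwards
# any message it receives to worker i + 1 until the end of the chain.
def forward(worker, superstep, messages):
    return [(worker + 1, m) for m in messages if worker + 1 < 3]
```

Here the computation stops after three supersteps, when no worker has any remaining messages; a fixed round count would instead use `max_supersteps`.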

BSP is a better fit than MapReduce for iterative algorithms because it keeps repeated rounds of computation explicit.

Pregel and Giraph

Pregel applies the BSP model to graph processing. Its computation is vertex-centric: each vertex receives messages, updates its state, sends messages to other vertices, and may vote to halt. The graph remains present across iterations rather than being reconstructed as key-value data at each round.

In Pregel, vote to halt means that a vertex declares that it currently has no more work to do. The vertex becomes inactive and is skipped in later supersteps unless a new message arrives for it. If a message does arrive, the vertex becomes active again.

This structure is a natural fit for graph algorithms such as shortest paths and PageRank, where information repeatedly propagates along edges. A vertex becomes active when it has work to do, may become inactive when it does not, and may become active again if it later receives a message. The computation terminates only when every vertex has voted to halt, so that all vertices are inactive, and no messages remain in transit anywhere in the system.
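A sequential simulation of single-source shortest paths makes these rules concrete (a sketch under simplifying assumptions, not Pregel's or Giraph's actual API; here a vertex is "active" exactly when it has incoming messages, and sending nothing is its vote to halt):

```python
import math

# Vertex-centric SSSP in the Pregel style, simulated sequentially.
# graph maps each vertex to a list of (neighbor, edge_weight) pairs.
def pregel_sssp(graph, source):
    dist = {v: math.inf for v in graph}     # per-vertex state
    inbox = {source: [0]}                   # initial message to the source
    while inbox:                            # each loop iteration = superstep;
        outbox = {}                         # empty inbox = all vertices halted
        for v, msgs in inbox.items():       # only active vertices compute
            best = min(msgs)
            if best < dist[v]:              # improvement: update state and
                dist[v] = best              # notify neighbors
                for u, w in graph[v]:
                    outbox.setdefault(u, []).append(best + w)
            # otherwise the vertex sends nothing and votes to halt
        inbox = outbox                      # barrier between supersteps
    return dist
```

A vertex that has halted is reactivated automatically in this sketch simply by appearing in a later superstep's inbox, mirroring the message-driven reactivation rule described above.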

Key concepts in Pregel and Giraph include:

Giraph is an Apache open-source system based on the Pregel model.

Spark

Spark was designed for computations that require more flexibility than MapReduce, especially multi-stage and iterative workloads. Its original core abstraction is the Resilient Distributed Dataset, or RDD, which represents a partitioned collection of data distributed across a cluster.

Spark includes several major architectural components:

Spark organizes computation as a dataflow graph. Transformations create new RDDs lazily, while actions trigger execution. This design allows Spark to support multi-stage pipelines without forcing every stage to materialize its output before the next one begins.
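The laziness can be illustrated with a toy class in plain Python (a hypothetical stand-in, not PySpark's API; a real RDD is partitioned across the cluster and records lineage per RDD rather than a flat plan):

```python
# Toy stand-in for an RDD: transformations only record a plan;
# an action executes the whole recorded pipeline.
class ToyRDD:
    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan                   # recorded transformations

    def map(self, f):                       # transformation: lazy
        return ToyRDD(self._data, self._plan + (("map", f),))

    def filter(self, pred):                 # transformation: lazy
        return ToyRDD(self._data, self._plan + (("filter", pred),))

    def collect(self):                      # action: triggers execution
        items = list(self._data)
        for kind, f in self._plan:
            if kind == "map":
                items = [f(x) for x in items]
            else:
                items = [x for x in items if f(x)]
        return items

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has executed yet; rdd.collect() runs the whole pipeline.
```

Because nothing runs until the action, the system can see the entire pipeline before executing it, which is what lets Spark plan multi-stage jobs without materializing every intermediate result.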

An RDD has several important properties:

Spark performance depends heavily on the dependencies between partitions. With narrow dependencies, each output partition depends on a small, fixed set of parent partitions, so computation can usually proceed without redistributing data. With wide dependencies, an output partition may depend on data spread across many parent partitions, which usually requires a shuffle. In Spark, a shuffle means that data is moved across workers and reorganized into new partitions so that related records end up together for the next stage; it is therefore both communication across the cluster and repartitioning of the data.
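The core of a shuffle is hash partitioning by key, which can be sketched as (hypothetical code, not Spark's implementation; integer keys are used so the partition assignment is deterministic):

```python
# Sketch of what a shuffle accomplishes: records are hash-partitioned by
# key, so every value for a given key lands in the same output partition.
def shuffle(records, num_partitions):
    # records: iterable of (key, value) pairs from the old partitions
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in records:
        p = hash(key) % num_partitions      # deterministic key -> partition
        partitions[p].setdefault(key, []).append(value)
    return partitions
```

In a real cluster each of these output partitions lives on a different worker, so building them requires moving most of the data across the network, which is why wide dependencies are expensive.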

Key concepts in Spark include:

Distributed Machine Learning

Distributed machine learning introduces a workload in which data is used repeatedly to refine a model over many rounds. Training therefore involves repeated passes over data, ongoing parameter updates, and coordination among workers throughout the computation.

Two common strategies are data parallelism and model parallelism. In data parallelism, the dataset is partitioned across workers while each worker holds a copy of the model. In model parallelism, the model itself is partitioned across workers because it is too large to fit comfortably on a single machine or device.
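One synchronous data-parallel training step can be sketched for a one-parameter least-squares model (a minimal illustration with hypothetical function names; real systems compute the shard gradients in parallel and average them with an all-reduce):

```python
# Data-parallel SGD for the 1-D model y = w * x with squared-error loss.
# Each worker holds a replica of w and one shard of the data.
def local_gradient(w, shard):
    # gradient of mean squared error over this worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, shards, lr=0.05):
    grads = [local_gradient(w, s) for s in shards]   # parallel in practice
    avg = sum(grads) / len(grads)                    # all-reduce: average
    return w - lr * avg                              # identical update on
                                                     # every replica

# Train toward w = 2 on data generated from y = 2x, split across 2 workers.
shards = [[(1, 2), (2, 4)], [(3, 6), (4, 8)]]
w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards)
```

The averaging step is the coordination point: it is repeated every round, which is why communication cost and synchronization dominate the design of distributed training systems.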

The main ideas to know are:

The essential point is that distributed training requires repeated coordination, not just one pass over a dataset.

Comparing the Frameworks

Each framework organizes distributed computation around a different core abstraction.

The main goal is to understand what each framework makes easy, what costs it exposes, and what kinds of workloads it handles best.

What You Do Not Need to Study

Focus on the core abstractions, execution models, and major tradeoffs of each framework. You do not need to study details that are outside that scope.

You should not need to study:

