pk.org: CS 417/Lecture Notes

Scalable Distributed Computation

Terms you should know

Paul Krzyzanowski – April 7, 2026

Core Distributed Computation

Distributed computation framework
A system that organizes computation across many machines while handling partitioning, coordination, communication, and recovery.
Execution model
The structure a framework imposes on computation, such as map and reduce, supersteps, or dataflow.
Partitioning
Dividing data or computation into pieces so work can be done in parallel.
Locality
Placing computation near the data it needs to reduce communication cost.
Straggler
A slow task or worker that delays completion of a larger job.
Fault tolerance
The ability of a system to continue or recover after failures.
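Partitioning is easiest to see in code. A minimal sketch, assuming hash-based assignment of records to workers (the worker count and records are illustrative):

```python
# Hash partitioning: records are assigned to workers by key, so all
# records sharing a key land on the same worker and later per-key work
# needs no cross-worker communication.

def partition(key, num_workers):
    """Map a key to a worker index; identical keys always map together."""
    return hash(key) % num_workers

records = [("apple", 3), ("banana", 1), ("apple", 2), ("cherry", 5)]
shards = {i: [] for i in range(3)}
for key, value in records:
    shards[partition(key, 3)].append((key, value))
```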

MapReduce

MapReduce
A batch-processing framework that organizes computation into map, shuffle-and-sort, and reduce phases.
Master-worker architecture
A structure in which a master coordinates the job and workers carry out assigned tasks.
Map task
A task that processes an input shard and emits intermediate key-value pairs.
Reduce task
A task that, for each key assigned to it, processes the key together with all of its associated values and produces final output.
Intermediate key-value pair
A key-value pair emitted by a map task for later grouping and reduction.
Shuffle and sort
The phase in which intermediate data is moved across the cluster, grouped by key, and sorted before reduction.
Speculative execution
Launching a backup copy of a slow task so that whichever copy finishes first supplies the result.
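The three phases above can be sketched in a single process. This is not the distributed implementation, just the map / shuffle-and-sort / reduce structure, using word count as the example; map_fn and reduce_fn stand in for the user-supplied functions:

```python
from itertools import groupby

def map_fn(line):
    # Map: emit an intermediate (word, 1) pair for each word in the record.
    for word in line.split():
        yield (word, 1)

def reduce_fn(key, values):
    # Reduce: combine all values for one key into a final result.
    return (key, sum(values))

def map_reduce(lines):
    # Map phase: run map_fn over every input record.
    intermediate = [pair for line in lines for pair in map_fn(line)]
    # Shuffle and sort: group the intermediate pairs by key.
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: one reduce_fn call per key with all its values.
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(intermediate, key=lambda kv: kv[0])]

counts = map_reduce(["to be or not to be"])
```

In the real framework the intermediate pairs travel across the network to reduce workers; here the sort-and-group step plays that role.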

BSP and Pregel

Bulk Synchronous Parallel (BSP)
A round-based model of parallel computation built around local computation, communication, and barrier synchronization.
Superstep
One round of computation in BSP, consisting of local work, message exchange, and synchronization.
Barrier synchronization
A point at which all workers must wait until every worker reaches the barrier.
Checkpointing
Saving state so computation can resume from a recent point after a failure.
Pregel
A graph-processing model based on BSP in which computation is centered on vertices and messages.
Vertex-centric computation
A style of computation in which each vertex processes messages, updates state, and sends messages.
Vote to halt
A declaration by a vertex that it currently has no more work to do.
In transit
A state in which a message has been sent but has not yet been processed by its destination.
Giraph
An Apache open-source system based on the Pregel model.
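A minimal single-process sketch of Pregel-style vertex-centric computation, assuming a toy problem (propagating the maximum value through a graph); the graph and values are illustrative. Each loop iteration is one superstep, ending in an implicit barrier before messages are delivered:

```python
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}   # adjacency lists
value = {"a": 3, "b": 6, "c": 2}                     # per-vertex state
inbox = {v: [] for v in graph}                       # messages in transit
active = set(graph)                                  # vertices not yet halted
superstep = 0

while active:                                        # one superstep per pass
    outbox = {v: [] for v in graph}
    for v in active:
        changed = False
        if superstep == 0:
            changed = True                           # announce initial value
        elif inbox[v] and max(inbox[v]) > value[v]:
            value[v] = max(inbox[v])                 # larger value arrived
            changed = True
        if changed:
            for n in graph[v]:
                outbox[n].append(value[v])           # message now in transit
        # a vertex that sends nothing votes to halt
    inbox = outbox      # barrier: deliver all messages, then start next round
    active = {v for v in graph if inbox[v]}          # messages reactivate
    superstep += 1
```

When no vertex has messages pending and all have voted to halt, the computation ends; here every vertex converges to the global maximum.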

Spark

Spark
A distributed data-processing framework designed for multi-stage and iterative computation.
Resilient Distributed Dataset (RDD)
Spark’s original core abstraction for a partitioned, immutable, fault-tolerant distributed collection.
Driver program
The program that defines a Spark computation and coordinates execution.
Cluster manager
The component that allocates cluster resources to Spark applications.
Worker
A machine that runs Spark executors.
Executor
A process on a worker that runs tasks and can cache data partitions.
Transformation
An operation that creates a new RDD lazily from an existing RDD.
Action
An operation that triggers execution and returns a result or writes output.
Lineage
The record of how an RDD was created so lost partitions can be recomputed.
Caching
Keeping an RDD in memory for reuse.
Persistence
Keeping an RDD in memory, on disk, or both for reuse.
Narrow dependency
A dependency in which each output partition depends on a small, fixed set of input partitions, so it can be computed without a shuffle.
Wide dependency
A dependency in which an output partition depends on data from many input partitions.
Shuffle
Moving data across workers and reorganizing it into new partitions for the next stage.
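Lazy transformations and lineage can be sketched without Spark itself. The class below is not the PySpark API, just an illustration: each object records its parent and deriving function, and nothing is computed until an action is called:

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable, with a recorded lineage."""

    def __init__(self, data=None, parent=None, fn=None):
        # Either source data, or a (parent, fn) lineage record.
        self.data, self.parent, self.fn = data, parent, fn

    def map(self, f):
        # Transformation: returns a new MiniRDD lazily; no work happens yet.
        return MiniRDD(parent=self, fn=lambda xs: [f(x) for x in xs])

    def filter(self, pred):
        return MiniRDD(parent=self, fn=lambda xs: [x for x in xs if pred(x)])

    def collect(self):
        # Action: walk the lineage back to the source and (re)compute.
        if self.parent is None:
            return list(self.data)
        return self.fn(self.parent.collect())

nums = MiniRDD(data=range(1, 6))
evens_squared = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = evens_squared.collect()       # computation happens only here
```

Because the lineage is kept, a lost partition could be rebuilt by re-running `collect` from the source data, which is the idea behind RDD fault tolerance.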

Distributed Machine Learning

Model
A program with adjustable numerical parameters that makes predictions from data.
Training
Repeatedly adjusting a model’s parameters so that its predictions improve.
Parameter update
A numerical change applied to a model during training.
Data parallelism
A training strategy in which the data is partitioned across workers and each worker has a copy of the model.
Model parallelism
A training strategy in which the model itself is partitioned across workers or devices.
Parameter server
A central server or set of servers that stores model parameters and receives updates from workers.
All-reduce
A cooperative aggregation method in which workers combine updates without a central coordinator.
Pipeline parallelism
A form of model parallelism in which different groups of layers are placed on different devices.
Tensor parallelism
A form of model parallelism in which the computation within a layer is split across devices.
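Data parallelism with a parameter-server-style update can be sketched with a one-parameter model y = w·x; the data, shard split, and learning rate below are illustrative assumptions, and the "workers" run sequentially in one process:

```python
def gradient(w, shard):
    # d/dw of the mean squared error (w*x - y)^2 over this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

# True relationship is y = 3x; two workers each hold part of the data.
shards = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]

w = 0.0                                 # replicated model parameter
for step in range(200):
    # Each worker computes a gradient on its own shard (in parallel,
    # conceptually) and sends it to the parameter server.
    grads = [gradient(w, shard) for shard in shards]
    # The server averages the updates and broadcasts the new parameter.
    w -= 0.05 * sum(grads) / len(grads)
```

An all-reduce system would compute the same average cooperatively among the workers instead of at a central server; the arithmetic is identical.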
