Core Distributed Computation
- Distributed computation framework
- A system that organizes computation across many machines while handling partitioning, coordination, communication, and recovery.
- Execution model
- The structure a framework imposes on computation, such as map and reduce, supersteps, or dataflow.
- Partitioning
- Dividing data or computation into pieces so work can be done in parallel.
- Locality
- Placing computation near the data it needs to reduce communication cost.
- Straggler
- A slow task or worker that delays completion of a larger job.
- Fault tolerance
- The ability of a system to continue or recover after failures.
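Partitioning is often done by hashing keys, which also gives locality for later grouping: records with the same key always land on the same worker. A minimal sketch (function names are illustrative, not from any particular framework):

```python
def hash_partition(records, num_workers):
    """Assign each (key, value) record to a worker by hashing its key.

    Records with the same key always land on the same partition,
    which is what later grouping steps rely on.
    """
    partitions = [[] for _ in range(num_workers)]
    for key, value in records:
        partitions[hash(key) % num_workers].append((key, value))
    return partitions

# Distribute word-count records across 3 hypothetical workers.
records = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
parts = hash_partition(records, 3)
```

Within one run, both copies of `("apple", 1)` end up in the same partition, so a downstream worker can sum them without further communication.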
MapReduce
- MapReduce
- A batch-processing framework that organizes computation into map, shuffle-and-sort, and reduce phases.
- Master-worker architecture
- A structure in which a master coordinates the job and workers carry out assigned tasks.
- Map task
- A task that processes an input shard and emits intermediate key-value pairs.
- Reduce task
- A task that processes one key together with all associated values and produces final output.
- Intermediate key-value pair
- A key-value pair emitted by a map task for later grouping and reduction.
- Shuffle and sort
- The phase in which intermediate data is moved across the cluster, grouped by key, and sorted before reduction.
- Speculative execution
- Launching a backup copy of a slow task so the first completed result can be used.
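The three phases above can be sketched in a single process using word count, the canonical MapReduce example (this toy collapses the cluster into local function calls; names are illustrative):

```python
from collections import defaultdict

def map_phase(shard):
    # Map task: process one input shard, emit intermediate (word, 1) pairs.
    return [(word, 1) for word in shard.split()]

def shuffle_and_sort(intermediate):
    # Group intermediate pairs by key and sort the keys,
    # mimicking the cross-cluster shuffle-and-sort phase.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reduce task: one key together with all its values -> final output.
    return (key, sum(values))

shards = ["the quick brown fox", "the lazy dog the end"]
intermediate = [pair for s in shards for pair in map_phase(s)]
output = dict(reduce_phase(k, vs) for k, vs in shuffle_and_sort(intermediate))
# output["the"] == 3
```

In a real deployment the master assigns each shard to a map task on a different worker, and the shuffle moves intermediate pairs across the network so that each reduce task sees every value for its keys.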
BSP and Pregel
- Bulk Synchronous Parallel (BSP)
- A round-based model of parallel computation built around local computation, communication, and barrier synchronization.
- Superstep
- One round of computation in BSP, consisting of local work, message exchange, and synchronization.
- Barrier synchronization
- A point at which all workers must wait until every worker reaches the barrier.
- Checkpointing
- Saving state so computation can resume from a recent point after a failure.
- Pregel
- A graph-processing model based on BSP in which computation is centered on vertices and messages.
- Vertex-centric computation
- A style of computation in which each vertex processes messages, updates state, and sends messages.
- Vote to halt
- A declaration by a vertex that it has no more work to do; the vertex becomes inactive until an incoming message reactivates it.
- In transit
- A state in which a message has been sent but has not yet been processed by its destination.
- Giraph
- An Apache open-source system based on the Pregel model.
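A single-process sketch of Pregel-style supersteps, propagating the maximum value through a graph (function and variable names are my own; a real system would distribute vertices across workers with a barrier between supersteps):

```python
def max_value_pregel(neighbors, values):
    """Each vertex repeatedly adopts the largest value it has heard of."""
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to neighbors.
    inbox = {v: [] for v in values}
    for v in values:
        for n in neighbors[v]:
            inbox[n].append(values[v])
    supersteps = 1
    # Messages still in transit keep the computation alive.
    while any(inbox[v] for v in values):
        outbox = {v: [] for v in values}
        for v in values:
            if not inbox[v]:
                continue  # vertex voted to halt and got no message: inactive
            new_val = max(inbox[v])
            if new_val > values[v]:
                values[v] = new_val
                for n in neighbors[v]:
                    outbox[n].append(new_val)
            # Otherwise the vertex votes to halt for this superstep.
        inbox = outbox  # barrier: next superstep sees this round's messages
        supersteps += 1
    return values, supersteps

neighbors = {1: [2], 2: [1, 3], 3: [2]}
values, supersteps = max_value_pregel(neighbors, {1: 3, 2: 6, 3: 2})
# every vertex ends with the global maximum, 6
```

The computation terminates when every vertex has voted to halt and no messages remain in transit, which is exactly Pregel's termination condition.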
Spark
- Spark
- A distributed data-processing framework designed for multi-stage and iterative computation.
- Resilient Distributed Dataset (RDD)
- Spark’s original core abstraction for a partitioned, immutable, fault-tolerant distributed collection.
- Driver program
- The program that defines a Spark computation and coordinates execution.
- Cluster manager
- The component that allocates cluster resources to Spark applications.
- Worker
- A machine that runs Spark executors.
- Executor
- A process on a worker that runs tasks and can cache data partitions.
- Transformation
- An operation that creates a new RDD lazily from an existing RDD.
- Action
- An operation that triggers execution and returns a result or writes output.
- Lineage
- The record of how an RDD was created so lost partitions can be recomputed.
- Caching
- Keeping an RDD in memory for reuse.
- Persistence
- Keeping an RDD in memory, on disk, or both for reuse.
- Narrow dependency
- A dependency in which each input partition contributes to at most one output partition, so the stage can run without a shuffle (e.g., map or filter).
- Wide dependency
- A dependency in which an output partition depends on data from many input partitions, requiring a shuffle (e.g., grouping by key).
- Shuffle
- Moving data across workers and reorganizing it into new partitions for the next stage.
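The transformation/action split and lineage can be illustrated with a toy, single-process stand-in; `ToyRDD` and its methods are illustrative and are not Spark's actual API:

```python
class ToyRDD:
    """A toy RDD: transformations only record lineage; actions compute."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data (only set on the root RDD)
        self.parent = parent  # lineage: the RDD this one was derived from
        self.fn = fn          # the recorded transformation

    # Transformations: lazily return a new RDD; nothing runs yet.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    # Action: triggers execution by replaying the lineage from the root.
    def collect(self):
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No work has happened yet; the action replays the recorded lineage:
result = rdd.collect()  # [0, 4, 16, 36, 64]
```

Because each RDD records only how it was derived, a lost partition can be rebuilt by re-running this chain on the surviving inputs, which is how lineage provides fault tolerance without replicating every intermediate dataset.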
Distributed Machine Learning
- Model
- A program with adjustable numerical parameters that makes predictions from data.
- Training
- Repeatedly adjusting a model’s parameters so it produces better results.
- Parameter update
- A numerical change applied to a model during training.
- Data parallelism
- A training strategy in which the data is partitioned across workers and each worker has a copy of the model.
- Model parallelism
- A training strategy in which the model itself is partitioned across workers or devices.
- Parameter server
- A central server or set of servers that stores model parameters and receives updates from workers.
- All-reduce
- A cooperative aggregation method in which workers combine updates without a central coordinator.
- Pipeline parallelism
- A form of model parallelism in which different groups of layers are placed on different devices.
- Tensor parallelism
- A form of model parallelism in which the computation within a layer is split across devices.
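Data parallelism with an averaged gradient can be simulated in one process. This sketch fits a single parameter w to y = 3x by gradient descent on squared error; the all-reduce is stood in for by a plain average, and all names are illustrative:

```python
def gradient(w, shard):
    # d/dw of the mean squared error (w*x - y)^2 over this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for a cooperative all-reduce: average the workers' gradients.
    return sum(grads) / len(grads)

data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]  # data parallelism: partition the data
w = 0.0                          # every worker holds a copy of the model
for step in range(200):
    grads = [gradient(w, s) for s in shards]  # local computation per worker
    g = all_reduce_mean(grads)                # aggregate the updates
    w -= 0.01 * g                             # identical update on every copy
# w converges toward the true slope, 3.0
```

A parameter-server design would replace `all_reduce_mean` with workers pushing gradients to a central server and pulling back fresh parameters; the per-worker loop is otherwise the same.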