Core Distributed Computation
- Distributed computation framework
- A system that organizes computation across many machines while handling partitioning, coordination, communication, and recovery.
- Execution model
- The structure a framework imposes on computation, such as map and reduce, supersteps, or dataflow.
- Partitioning
- Dividing data or computation into pieces so work can be done in parallel.
- Locality
- Placing computation near the data it needs to reduce communication cost.
- Straggler
- A slow task or worker that delays completion of a larger job.
- Fault tolerance
- The ability of a system to continue or recover after failures.
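Partitioning is often done by hashing keys, which also gives locality for later grouping: records with the same key always land on the same worker. A minimal sketch (function names are illustrative, not from any particular framework):

```python
def hash_partition(records, num_workers):
    """Assign each (key, value) record to a worker by hashing its key.

    Records with the same key always land on the same partition,
    which is what later grouping steps rely on.
    """
    partitions = [[] for _ in range(num_workers)]
    for key, value in records:
        partitions[hash(key) % num_workers].append((key, value))
    return partitions

# Distribute word-count records across 3 hypothetical workers.
records = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
parts = hash_partition(records, 3)
```

Within one run, both copies of `("apple", 1)` end up in the same partition, so a downstream worker can sum them without further communication.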
MapReduce
- MapReduce
- A batch-processing framework that organizes computation into map, shuffle-and-sort, and reduce phases.
- Master-worker architecture
- A structure in which a master coordinates the job and workers carry out assigned tasks.
- Map task
- A task that processes an input shard and emits intermediate key-value pairs.
- Reduce task
- A task that processes one key together with all associated values and produces final output.
- Intermediate key-value pair
- A key-value pair emitted by a map task for later grouping and reduction.
- Shuffle and sort
- The phase in which intermediate data is moved across the cluster, grouped by key, and sorted before reduction.
- Speculative execution
- Launching a backup copy of a slow task so the first completed result can be used.
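The three phases above can be sketched in a single process using word count, the canonical MapReduce example (this toy collapses the cluster into local function calls; names are illustrative):

```python
from collections import defaultdict

def map_phase(shard):
    # Map task: process one input shard, emit intermediate (word, 1) pairs.
    return [(word, 1) for word in shard.split()]

def shuffle_and_sort(intermediate):
    # Group intermediate pairs by key and sort the keys,
    # mimicking the cross-cluster shuffle-and-sort phase.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reduce task: one key together with all its values -> final output.
    return (key, sum(values))

shards = ["the quick brown fox", "the lazy dog the end"]
intermediate = [pair for s in shards for pair in map_phase(s)]
output = dict(reduce_phase(k, vs) for k, vs in shuffle_and_sort(intermediate))
# output["the"] == 3
```

In a real deployment the master assigns each shard to a map task on a different worker, and the shuffle moves intermediate pairs across the network so that each reduce task sees every value for its keys.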
BSP and Pregel
- Bulk Synchronous Parallel (BSP)
- A round-based model of parallel computation built around local computation, communication, and barrier synchronization.
- Superstep
- One round of computation in BSP, consisting of local work, message exchange, and synchronization.
- Barrier synchronization
- A point at which all workers must wait until every worker reaches the barrier.
- Checkpointing
- Saving state so computation can resume from a recent point after a failure.
- Pregel
- A graph-processing model based on BSP in which computation is centered on vertices and messages.
- Vertex-centric computation
- A style of computation in which each vertex processes messages, updates state, and sends messages.
- Vote to halt
- A declaration by a vertex that it has no more work to do; the vertex becomes inactive until an incoming message reactivates it.
- In transit
- A state in which a message has been sent but has not yet been processed by its destination.
- Giraph
- An Apache open-source system based on the Pregel model.
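A single-process sketch of Pregel-style supersteps, propagating the maximum value through a graph (function and variable names are my own; a real system would distribute vertices across workers with a barrier between supersteps):

```python
def max_value_pregel(neighbors, values):
    """Each vertex repeatedly adopts the largest value it has heard of."""
    values = dict(values)
    # Superstep 0: every vertex is active and sends its value to neighbors.
    inbox = {v: [] for v in values}
    for v in values:
        for n in neighbors[v]:
            inbox[n].append(values[v])
    supersteps = 1
    # Messages still in transit keep the computation alive.
    while any(inbox[v] for v in values):
        outbox = {v: [] for v in values}
        for v in values:
            if not inbox[v]:
                continue  # vertex voted to halt and got no message: inactive
            new_val = max(inbox[v])
            if new_val > values[v]:
                values[v] = new_val
                for n in neighbors[v]:
                    outbox[n].append(new_val)
            # Otherwise the vertex votes to halt for this superstep.
        inbox = outbox  # barrier: next superstep sees this round's messages
        supersteps += 1
    return values, supersteps

neighbors = {1: [2], 2: [1, 3], 3: [2]}
values, supersteps = max_value_pregel(neighbors, {1: 3, 2: 6, 3: 2})
# every vertex ends with the global maximum, 6
```

The computation terminates when every vertex has voted to halt and no messages remain in transit, which is exactly Pregel's termination condition.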
Spark
- Spark
- A distributed data-processing framework designed for multi-stage and iterative computation.
- Resilient Distributed Dataset (RDD)
- Spark’s original core abstraction for a partitioned, immutable, fault-tolerant distributed collection.
- Driver program
- The program that defines a Spark computation and coordinates execution.
- Cluster manager
- The component that allocates cluster resources to Spark applications.
- Worker
- A machine that runs Spark executors.
- Executor
- A process on a worker that runs tasks and can cache data partitions.
- Transformation
- An operation that creates a new RDD lazily from an existing RDD.
- Action
- An operation that triggers execution and returns a result or writes output.
- Lineage
- The record of how an RDD was created so lost partitions can be recomputed.
- Caching
- Keeping an RDD in memory for reuse.
- Persistence
- Keeping an RDD in memory, on disk, or both for reuse.
- Narrow dependency
- A dependency in which each input partition contributes to at most one output partition, so the stage can run without a shuffle (e.g., map or filter).
- Wide dependency
- A dependency in which an output partition depends on data from many input partitions, requiring a shuffle (e.g., grouping by key).
- Shuffle
- Moving data across workers and reorganizing it into new partitions for the next stage.
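The transformation/action split and lineage can be illustrated with a toy, single-process stand-in; `ToyRDD` and its methods are illustrative and are not Spark's actual API:

```python
class ToyRDD:
    """A toy RDD: transformations only record lineage; actions compute."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source  # base data (only set on the root RDD)
        self.parent = parent  # lineage: the RDD this one was derived from
        self.fn = fn          # the recorded transformation

    # Transformations: lazily return a new RDD; nothing runs yet.
    def map(self, f):
        return ToyRDD(parent=self, fn=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyRDD(parent=self, fn=lambda data: [x for x in data if pred(x)])

    # Action: triggers execution by replaying the lineage from the root.
    def collect(self):
        if self.parent is None:
            return list(self.source)
        return self.fn(self.parent.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# No work has happened yet; the action replays the recorded lineage:
result = rdd.collect()  # [0, 4, 16, 36, 64]
```

Because each RDD records only how it was derived, a lost partition can be rebuilt by re-running this chain on the surviving inputs, which is how lineage provides fault tolerance without replicating every intermediate dataset.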
Distributed Machine Learning
- Model
- A program with adjustable numerical parameters that makes predictions from data.
- Training
- Repeatedly adjusting a model’s parameters so it produces better results.
- Parameter update
- A numerical change applied to a model during training.
- Data parallelism
- A training strategy in which the data is partitioned across workers and each worker has a copy of the model.
- Model parallelism
- A training strategy in which the model itself is partitioned across workers or devices.
- Parameter server
- A central server or set of servers that stores model parameters and receives updates from workers.
- All-reduce
- A cooperative aggregation method in which workers combine updates without a central coordinator.
- Pipeline parallelism
- A form of model parallelism in which different groups of layers are placed on different devices.
- Tensor parallelism
- A form of model parallelism in which the computation within a layer is split across devices.
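Data parallelism with an averaged gradient can be simulated in one process. This sketch fits a single parameter w to y = 3x by gradient descent on squared error; the all-reduce is stood in for by a plain average, and all names are illustrative:

```python
def gradient(w, shard):
    # d/dw of the mean squared error (w*x - y)^2 over this worker's shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for a cooperative all-reduce: average the workers' gradients.
    return sum(grads) / len(grads)

data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]  # data parallelism: partition the data
w = 0.0                          # every worker holds a copy of the model
for step in range(200):
    grads = [gradient(w, s) for s in shards]  # local computation per worker
    g = all_reduce_mean(grads)                # aggregate the updates
    w -= 0.01 * g                             # identical update on every copy
# w converges toward the true slope, 3.0
```

A parameter-server design would replace `all_reduce_mean` with workers pushing gradients to a central server and pulling back fresh parameters; the per-worker loop is otherwise the same.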