From Storage to Computation
Distributed systems first had to solve the storage problem. Once a dataset became too large for one machine, the obvious answer was to partition it across many machines and replicate it for fault tolerance. That made large-scale storage practical, but it immediately created a second problem. Once the data is spread out, computation has to be spread out too.
That second problem is easy to underestimate until the scale becomes concrete. If a computation takes only 100 milliseconds per item, then one billion items take more than three years on one machine. The only realistic answer is parallelism. The work has to be divided into many pieces and executed concurrently across many systems.
At first glance, the solution seems straightforward. Split the data into chunks, send each chunk to a worker, run the same program on each worker, and combine the results. That works for embarrassingly parallel problems, where each chunk can be processed independently, and the final answer is just a straightforward combination of partial results. But many useful workloads are more complicated. They need to regroup records by key, exchange intermediate state among workers, revisit the same dataset repeatedly, or preserve derived data for later stages.
Once those requirements appear, the main cost is no longer just CPU time. The hard parts become data movement, synchronization, fault recovery, and load imbalance. A good distributed computation framework does not eliminate those costs. It organizes them.
The development of large-scale data processing frameworks can be viewed as a sequence of attempts to impose the right structure on those costs.
- MapReduce imposed a rigid batch model that made huge jobs practical.
- BSP and Pregel made repeated rounds of communication explicit, which was a better fit for iterative and graph computation.
- Spark generalized the model into a richer dataflow system with in-memory reuse.
- More recent systems, such as Ray, moved toward general distributed task orchestration, where the system is not limited to one batch or graph abstraction.
The important lesson is that these frameworks are not interchangeable descriptions of the same idea. Each one is built around a particular style of computation.
The Structure of Distributed Computation
Despite their differences, these systems all aim to address the same underlying issues.
- Partitioning. A large dataset has to be cut into pieces so different workers can process it concurrently. Those pieces may be file blocks, individual files, ranges of rows, partitions of key-value data, or partitions of an in-memory dataset. Whatever form they take, they become the unit of scheduling.
- Locality. If the input data already resides on one machine, it is usually cheaper to move the computation there than to pull the data across the network. This is most important when reading raw input.
- Communication. Many computations cannot be finished independently on each partition. Data has to be regrouped by key, updates have to be propagated among workers, or partial results have to be merged. Communication often dominates the runtime.
- Synchronization. Some systems run in loosely connected stages. Others advance in explicit rounds with barriers. Synchronization simplifies reasoning, but it also means that fast workers may spend time waiting.
- Fault tolerance. Failures are normal in a cluster. Machines crash, disks stall, and networks degrade. A useful framework must recover without forcing the entire job to restart from the beginning.
- Skew and stragglers. Parallelism is effective only when work is reasonably balanced. If one partition is much heavier than the others, or one worker runs unusually slowly, the entire job finishes at the pace of the slowest remaining task.
Those six ideas reappear throughout this topic: partitioning, locality, communication, synchronization, fault tolerance, and stragglers.
MapReduce
Why MapReduce Was Created
MapReduce was created at Google to support large batch computations over enormous datasets, especially workloads related to web search, indexing, log analysis, and the construction of derived data products. Those jobs often had the same broad shape: scan a massive input, extract useful intermediate information, regroup that information by some key, and compute aggregated results.
Before MapReduce, engineers had to write custom distributed programs to do this work. That meant managing worker processes, dividing the input, restarting failed tasks, moving intermediate results, and coordinating the final output. Much of the program was infrastructure rather than application logic.
MapReduce changed that by imposing a strict computational structure. The user wrote a map function and a reduce function. The framework handled the mechanics of parallelization, task distribution, data placement, load balancing, monitoring, and failure recovery.
The model was inspired by the map and reduce functions from functional programming, especially Lisp, but its importance came from turning those ideas into a practical cluster-scale runtime.
The key design choice was restriction. By limiting the structure of the computation, the runtime gained enough control to manage a large cluster automatically.
The MapReduce Model
MapReduce operates on key-value pairs.
The user supplies two functions:
- map(k1, v1) -> list(k2, v2)
- reduce(k2, list(v2)) -> list(v3)
The map function processes input records and emits intermediate key-value pairs.
The reduce function receives one intermediate key together with all values associated with that key and emits output.
That description hides the middle of the computation, which is where most of the work happens. Before reduction can occur, the framework must gather all intermediate pairs with the same key. That requires partitioning the intermediate data, moving it to reducers, sorting it by key, and grouping equal keys together.
A more accurate high-level picture of what goes on is:
- Read input partitions.
- Run map tasks.
- Partition intermediate pairs by key.
- Shuffle partitions to reducers.
- Sort and group identical keys.
- Run reduce tasks.
- Write final output.
The shuffle-and-sort step is the center of MapReduce. It is the point where the system reorganizes the data from its original storage layout into a layout defined by computation.
How Input Is Partitioned
The input to a MapReduce job is usually divided into input splits, also called shards. The exact form depends on the storage system and the input format.
Common cases include:
- A large file split into fixed-size blocks or splits.
- A text input where the split is the scheduling unit, but the map function is invoked once per line.
- A collection of many small files grouped so that one map task processes several files.
- Structured input partitioned by record or row range rather than by file offset.
Two separate ideas are important here.
The first is the scheduling unit. That is the split assigned to a map task.
The second is the logical record seen by the user code. A split may contain many lines, records, or key-value pairs. The input format determines how the split is parsed into records before the map function sees them.
This is why it is not enough to say that MapReduce processes “files.” The framework schedules work over splits, while the user-level computation processes logical records inside those splits.
Word Count
Word count is still the classic first example because it makes the basic pattern easy to see.
Input:
- Key: document ID or byte offset
- Value: document text or a line of text
Map (user code):
map(_, text):
    for each word w in text:
        emit(w, 1)
The map function extracts each word from the input text and emits the pair (word, 1). The word becomes the key, and the value 1 records one occurrence of that word.
Shuffle and sort:
- All pairs with the same word are brought together
- For example, (system, 1), (system, 1), (system, 1) become:
  system -> [1, 1, 1]
Reduce (user code):
reduce(word, counts):
    total = 0
    for c in counts:
        total += c
    emit(word, total)
During shuffle and sort, the framework gathers all values associated with the same word into one list. Each unique word therefore becomes a key whose list contains one 1 for each occurrence. The reduce function adds those values to produce the final count.
This example is useful because it isolates the core pattern: local extraction first, regrouping by key second, aggregation third.
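That three-part pattern can be sketched as a single-process simulation in Python. This is not a real framework, only an illustration of the structure: run the map function over every record, group intermediate pairs by key (the shuffle), then call the reduce function once per unique key. All names here (run_map_reduce and so on) are invented for the sketch.

```python
from collections import defaultdict

def map_fn(_, text):
    # Local extraction: emit (word, 1) for each word in the text.
    for word in text.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Aggregation: sum the occurrence counts gathered for one word.
    yield (word, sum(counts))

def run_map_reduce(inputs, map_fn, reduce_fn):
    # Map phase over every input record.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            intermediate[k2].append(v2)   # shuffle: regroup by key
    # Reduce phase: exactly one call per unique key.
    output = []
    for k2 in sorted(intermediate):
        output.extend(reduce_fn(k2, intermediate[k2]))
    return output

docs = [("doc1", "the system works"), ("doc2", "the system fails")]
print(run_map_reduce(docs, map_fn, reduce_fn))
# [('fails', 1), ('system', 2), ('the', 2), ('works', 1)]
```

In a real cluster the `intermediate` dictionary does not exist on one machine; it is exactly the part that the shuffle distributes across reducers.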
Building an Inverted Index
A more realistic example is the construction of an inverted index, which maps each term to the documents that contain it. This is much closer to the type of workload that originally motivated MapReduce.
Suppose the input consists of documents:
- (doc1, "distributed systems are fun")
- (doc2, "systems scale well")
Map:
map(docID, text):
    for each term t in text:
        emit(t, docID)
Every word (whatever our software decides is a “term”) is output as a key, with the document ID as its value. This is a way of saying “this term appears in this document.”
Intermediate output:
- (distributed, doc1)
- (systems, doc1)
- (systems, doc2)
- (scale, doc2)
Shuffle and sort:
- distributed -> [doc1]
- systems -> [doc1, doc2]
- scale -> [doc2]
The framework combines the data from all identical keys (the terms) into a single list, calling the reduce function once with each unique key (term) along with the list of all the values (docIDs) associated with it.
Reduce:
reduce(term, docIDs):
    postings = unique_sorted(docIDs)
    emit(term, postings)
Final output:
- (distributed, [doc1])
- (systems, [doc1, doc2])
- (scale, [doc2])
This example makes the shuffle more meaningful. The documents are initially partitioned by where they happen to be stored. The output index must instead be partitioned by term. The shuffle is the step that changes the data’s organization from storage-based to key-based ownership.
Average Salary by ZIP Code
A second follow-on example shows that MapReduce is not limited to document processing.
Suppose the input is a set of employee records containing salary and ZIP code, and the goal is to compute the average salary for each ZIP code.
Map:
map(_, employee_record):
    emit(employee_record.zip, employee_record.salary)
The map function extracts the ZIP code and salary from each record and emits the ZIP code as the key and the salary as the value.
Reduce:
reduce(zip, salaries):
    total_salary = 0
    count = 0
    for s in salaries:
        total_salary += s
        count += 1
    emit(zip, total_salary / count)
The structure is the same as before. The map stage extracts a grouping key and a local contribution. The reduce stage combines all contributions for that key.
In this case, reduce is called once for each unique ZIP code, and salaries is the list of all salaries associated with that ZIP code.
Execution Model
A typical MapReduce execution proceeds as follows:
- Split the input into M shards.
- Launch worker processes.
- Assign map tasks to workers.
- Each mapper reads its shard and emits intermediate key-value pairs.
- The mapper partitions those pairs into R reducer partitions, often using hash(key) mod R.
- Reducers fetch their partitions from all mappers.
- Each reducer sorts its data by key and groups identical keys.
- The reduce function runs once per key.
- Each reducer writes an output file.
The use of M map tasks and R reducer partitions is worth keeping explicit. It makes the structure of the job concrete. The system is coordinating data flow from many input shards to many reducer partitions.
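The hash-partitioning step can be sketched in a few lines of Python. Python's built-in hash() for strings is randomized per process, so the sketch uses a stable digest instead; partition_for is an illustrative helper, not a framework API.

```python
import hashlib

R = 3  # number of reducer partitions (illustrative)

def partition_for(key, num_reducers):
    # hash(key) mod R, using a stable digest rather than Python's
    # per-process-randomized built-in hash() for strings.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

# Each mapper buckets its intermediate pairs by destination reducer,
# so every occurrence of a given key lands in the same reducer partition.
pairs = [("system", 1), ("scale", 1), ("system", 1), ("fun", 1)]
buckets = {r: [] for r in range(R)}
for key, value in pairs:
    buckets[partition_for(key, R)].append((key, value))
```

The essential property is determinism: all mappers, on all machines, send pairs with the same key to the same reducer partition.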
Locality
MapReduce tries to schedule map tasks on machines that already hold the needed input data, or at least on nearby machines. This is data locality.
The idea is that moving code is cheap, but moving huge input datasets is not.
This optimization matters most for the map phase because maps read raw input directly from storage. Reducers cannot exploit locality in the same way because they must gather intermediate data from many mappers scattered across the cluster. Also, in many applications the map operation ends up discarding most of the data, emitting only the useful components as keys and values.
Locality, therefore, helps at the front of the job, but the regrouping in the middle still requires communication.
Shuffle, Sort, and Cost
The shuffle is usually the most expensive phase of a MapReduce job.
During mapping, data is processed where it already resides. Before the reduce step, the data must be reorganized using an intermediate key (the key output by the mapper). That means:
- Writing intermediate data locally
- Partitioning it by reducer: identifying which reducer system will process that data
- Transferring it across the network
- Merging it at the reducers
- Sorting it by key, so identical keys will be together
- Combining values from equal keys, so that reduce will be called just once per unique key
This is more than just data movement. It is a distributed repartitioning step.
That is why shuffle-heavy jobs are expensive. They combine network traffic, disk I/O, and synchronization delay. If some keys are much more frequent than others, reducers become imbalanced. One overloaded reducer can become the bottleneck that determines the completion time of the entire job.
Stragglers and Speculative Execution
In a large cluster, some tasks inevitably run slower than others. A worker may be contended, a disk may be slow, or a task may simply receive a heavier partition.
These slow tasks are called stragglers. They matter because a job completes only when its last required task completes.
MapReduce addresses this with speculative execution. If a task appears unusually slow, the framework may launch a duplicate copy of that task on another worker. The first result to finish is used.
This is a practical response to a statistical fact: at scale, something is always likely to be slow.
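The mechanism can be illustrated with ordinary Python concurrency: launch a duplicate attempt of the same task and keep whichever copy finishes first. The attempt function and delays below are invented for the illustration.

```python
import concurrent.futures
import time

def attempt(copy_id, delay):
    # One attempt of the same task; delay models a straggling worker.
    time.sleep(delay)
    return copy_id

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    # The original attempt looks slow, so a speculative duplicate is launched.
    futures = [pool.submit(attempt, "original", 0.5),
               pool.submit(attempt, "speculative", 0.05)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    winner = next(iter(done)).result()  # the first result to finish is used

print(winner)
```

A real framework must also discard the loser's output, which is why speculative execution pairs naturally with tasks that have no externally visible side effects.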
Fault Tolerance
MapReduce handles failure mainly through re-execution.
- If a mapper fails, its task can be rerun on another worker.
- If the mapper’s local intermediate output was lost, rerunning the map task regenerates it.
- If a reducer fails, the reduce task can also be rerun.
This works because map and reduce tasks are expected to be deterministic and free of externally visible side effects. If rerunning a task produces the same result, recovery is straightforward.
Why MapReduce Was Important
MapReduce made large batch jobs much easier to write.
The programmer no longer had to manage:
- Worker creation
- Input partitioning mechanics
- Scheduling
- Locality decisions
- Fault recovery
- Straggler handling
- Aggregation of results
The model was narrow, but in a productive way. It gave the runtime enough structure to do a large amount of distributed systems work automatically.
MapReduce quickly escaped its original Google setting. Its ideas became the basis of Apache Hadoop MapReduce, which made large-scale batch processing widely accessible in open-source systems. Managed cloud services such as Amazon EMR (originally Elastic MapReduce) and Google Cloud Dataproc later made Hadoop-style processing available without running the cluster infrastructure by hand.
Limitations of MapReduce
MapReduce is excellent for one-pass batch computation. It is much less natural for workloads that require repeated passes over the same data, many dependent stages, or fine-grained communication.
The most important weakness is iterative computation.
Algorithms such as PageRank, shortest paths, clustering, gradient descent, and many machine learning workloads require repeated rounds over mostly the same data. In plain MapReduce, each round is usually expressed as a new job. That means repeated input reading, repeated output writing, repeated scheduling overhead, and repeated shuffle cost.
MapReduce is also a poor fit for graph processing. Graph algorithms often involve propagating state along edges over many rounds. Expressing that as a sequence of independent batch jobs is possible, but awkward and expensive.
Finally, the fixed map-then-reduce structure is too rigid for many multi-stage pipelines. Some problems need a richer dataflow model than alternating map and reduce phases.
BSP and Pregel
Why Iterative Computation Needs a Different Model
Once a computation becomes iterative, the MapReduce model starts working against the problem rather than helping it. State has to be reconstructed and handed along through a chain of separate map and reduce jobs, which is awkward. This is especially true for graphs.
Graphs appear everywhere:
- Web pages linked to one another
- Users connected in a social network
- Routers connected by network links
- Roads connected by intersections
- Citation networks
- Dependency graphs
Graph algorithms usually do little local work per vertex, but they require repeated propagation of information across edges. That is a poor fit for a batch model based on regrouping key-value pairs from scratch in every round.
Bulk Synchronous Parallel
The Bulk Synchronous Parallel, or BSP, model was introduced by Leslie Valiant in 1990 as a general model of parallel computation, particularly for supercomputers. At the time, large parallel machines were difficult to program efficiently because of complex communication patterns and timing variability. BSP provided a structured model that divided computation into supersteps separated by global synchronization barriers, making performance easier to reason about. Although originally developed for high-performance computing, the model later proved useful for distributed systems, especially for iterative and graph-based workloads.
BSP organizes a computation into repeated supersteps.
Each superstep has three parts:
- Local computation
- Communication
- Barrier synchronization
During one superstep, a worker computes using its current local state and the messages that arrived from the previous superstep. While computing, it may send messages to other workers. Those messages do not become visible immediately. They become inputs to the next superstep. At the end of the superstep, all workers wait at a barrier before the next round begins.
This round-based structure simplifies reasoning. The system no longer has arbitrary message timing. Instead, time advances in explicit steps.
The cost is equally clear. Fast workers must wait for slow ones at the barrier.
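A minimal single-process sketch of a BSP runtime makes the round structure concrete. It assumes each worker is a function from (state, inbox) to (new state, outgoing messages); messages sent in one superstep become visible only in the next, and the barrier is the point where inboxes are swapped. All names are invented for the sketch.

```python
def run_bsp(workers, state, max_supersteps):
    # workers: dict of worker_id -> compute(state, inbox) -> (new_state, outgoing)
    # outgoing is a list of (destination_worker, message) pairs.
    inboxes = {w: [] for w in workers}
    for superstep in range(max_supersteps):
        next_inboxes = {w: [] for w in workers}
        for w, compute in workers.items():
            state[w], outgoing = compute(state[w], inboxes[w])
            for dest, msg in outgoing:
                next_inboxes[dest].append(msg)
        # Barrier: only after every worker finishes do the
        # messages sent this round become next round's inboxes.
        inboxes = next_inboxes
    return state

# Toy example: each worker adds up what its peer sent it and sends 1 back.
def make_worker(peer):
    def compute(value, inbox):
        value += sum(inbox)
        return value, [(peer, 1)]
    return compute

workers = {"A": make_worker("B"), "B": make_worker("A")}
final = run_bsp(workers, {"A": 0, "B": 0}, max_supersteps=3)
print(final)  # {'A': 2, 'B': 2}
```

In a real cluster the inner loop runs in parallel on different machines, and the barrier is a genuine synchronization point rather than the end of a Python loop.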
Why BSP Is Useful
BSP separates three concerns cleanly:
- Local work within a round
- Communication across workers
- Synchronization between rounds
That separation makes it easier to understand and implement iterative distributed programs. It also creates natural checkpoints for fault recovery. A system can save the state of a computation at the end of every N supersteps and restart from the last completed checkpoint after a failure.
BSP does not remove communication cost or load imbalance. It makes them easier to reason about by organizing them into explicit rounds.
Where BSP Fits
BSP is a good match for applications that proceed in repeated rounds, where each round consists of local work, message exchange, and a clearly defined synchronization point. That includes graph algorithms, iterative linear algebra, some numerical simulations, and certain machine learning workloads that alternate between local computation and global coordination. The model is especially attractive when the computation is easier to reason about in phases than as a fully asynchronous system.
An open-source system in this family was Apache Hama. Hama implemented a BSP-style execution model on top of the Hadoop ecosystem and targeted graph, matrix, and machine learning workloads.
BSP never became the dominant general framework for data processing, but its core ideas remained important. Supersteps, message passing, and barrier synchronization proved to be a particularly good fit for iterative graph algorithms, which led directly to systems such as Pregel.
Pregel: Vertex-Centric Graph Computation
Pregel was introduced at Google in 2010 as a system for large-scale graph processing. Its immediate motivation was that many important graph algorithms, including PageRank and shortest-path computation over very large graphs, were awkward and inefficient to express in MapReduce. Google needed a model that could keep the graph structure alive across iterations and let computation propagate through the graph directly instead of rebuilding state as fresh key-value data in every round.
Pregel’s ideas later influenced Apache Giraph, an open-source system based on the Pregel model. Giraph was used at Facebook to process large social graphs, demonstrating that the vertex-centric BSP style was a practical way to organize computation over very large graphs outside Google as well.
Pregel applies the BSP model specifically to graph processing.
The central idea is vertex-centric computation: think like a vertex.
Each vertex has:
- An identifier
- A modifiable state value
- Outgoing edges, often with associated values
Computation proceeds in supersteps. In each superstep, the same user-defined function runs on each active vertex. That function:
- Receives messages sent to the vertex in the previous superstep
- May update the vertex’s state
- May update outgoing edge state
- May send messages that will be delivered in the next superstep
- May vote to halt if the vertex has no more work to do
The computation terminates when all vertices are inactive, and no messages remain in transit. That means that all vertices voted to halt and no vertex has incoming messages.
This is a much more natural way to express graph algorithms than rebuilding graph state as key-value data in a sequence of MapReduce jobs.
Pregel Example: Single-Source Shortest Path
A standard Pregel example is finding the shortest path from one source vertex.
Initialization:
- Source vertex distance = 0
- All other vertices distance = infinity
Superstep logic (this runs for each vertex):
if vertex is source and superstep == 0:
    vertex.distance = 0
    for each outgoing edge (vertex -> neighbor, weight):
        send(neighbor, weight)
else:
    best = min(incoming_messages)
    if best < vertex.distance:
        vertex.distance = best
        for each outgoing edge (vertex -> neighbor, weight):
            send(neighbor, best + weight)
    vote_to_halt()
What this code does on each vertex is:
- Receive candidate distances.
- Keep the smallest one.
- If the distance improved, propagate that improvement to neighbors.
- Otherwise, stay inactive unless a future message arrives.
Eventually, no vertex discovers a shorter path, no useful messages remain, and the computation terminates.
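The same logic can be written as a runnable single-process simulation in Python. The graph, function name, and superstep cap below are invented for the sketch; a vertex with an empty inbox is treated as halted, which matches the "vote to halt, reactivate on message" behavior.

```python
INF = float("inf")

def pregel_sssp(edges, source, max_supersteps=20):
    # edges: dict of vertex -> list of (neighbor, weight) pairs
    dist = {v: INF for v in edges}
    inbox = {v: [] for v in edges}
    inbox[source] = [0]                    # activate the source in superstep 0
    for _ in range(max_supersteps):
        outbox = {v: [] for v in edges}
        any_active = False
        for v in edges:
            if not inbox[v]:
                continue                   # halted: no incoming messages
            best = min(inbox[v])           # keep the smallest candidate
            if best < dist[v]:
                dist[v] = best             # distance improved: propagate it
                for neighbor, weight in edges[v]:
                    outbox[neighbor].append(best + weight)
                any_active = True
        inbox = outbox                     # barrier: deliver messages
        if not any_active:                 # everyone voted to halt
            break
    return dist

graph = {"s": [("a", 1), ("b", 4)],
         "a": [("b", 2)],
         "b": []}
print(pregel_sssp(graph, "s"))  # {'s': 0, 'a': 1, 'b': 3}
```

Note how termination falls out naturally: once no vertex improves its distance, no messages are produced, and the next superstep has nothing to do.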
Pregel Example: PageRank
PageRank is an algorithm for ranking pages in a directed graph such as the web graph. The basic idea is that a page should receive a high rank if other highly ranked pages link to it. Each page repeatedly distributes part of its current rank across its outgoing links, and each page updates its own rank from the contributions it receives along its incoming links. Because these values are propagated along edges over many rounds until they stabilize, PageRank is a natural example of iterative graph computation.
In a Pregel-style system, each vertex represents a page and each directed edge represents a hyperlink. During each superstep, a vertex sends part of its current rank to its outgoing neighbors. It then computes a new rank from the contributions it received in the previous superstep.
compute(vertex, incoming_messages):
    if superstep == 0:
        vertex.rank = initial_value
    else:
        sum = 0
        for m in incoming_messages:
            sum += m
        vertex.rank = base + damping * sum
    if superstep < max_iterations:
        contribution = vertex.rank / out_degree(vertex)
        for each neighbor:
            send(neighbor, contribution)
    else:
        vote_to_halt()
The repeated exchange of partial rank values fits naturally into the round-based message passing model used by BSP and Pregel.
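A runnable single-process version of that loop, using a fixed iteration count and the usual (1 - d)/n + d * sum form of the update. The graph and helper names are invented here, and every page is assumed to have at least one outgoing link so the contribution division is always defined.

```python
def pregel_pagerank(out_links, iterations=20, damping=0.85):
    # out_links: dict of page -> list of pages it links to
    # (assumes every page has out-degree > 0)
    n = len(out_links)
    rank = {p: 1.0 / n for p in out_links}
    inbox = {p: [] for p in out_links}
    for superstep in range(iterations):
        outbox = {p: [] for p in out_links}
        for p in out_links:
            if superstep > 0:
                # New rank from contributions received last superstep.
                rank[p] = (1 - damping) / n + damping * sum(inbox[p])
            # Distribute the current rank evenly across outgoing links.
            contribution = rank[p] / len(out_links[p])
            for neighbor in out_links[p]:
                outbox[neighbor].append(contribution)
        inbox = outbox                    # barrier between supersteps
    return rank

links = {"a": ["b"], "b": ["a"]}
ranks = pregel_pagerank(links)
# A symmetric two-page cycle settles at equal ranks of 0.5 each.
```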
Synchronization, Checkpointing, and Cost
BSP and Pregel simplify reasoning, but they do not eliminate cost.
The barrier at the end of each superstep means:
- Fast workers wait for slow workers
- Communication must be completed before the next round begins
- Imbalance becomes highly visible
On the other hand, barriers provide natural checkpoint boundaries. A system can save the state of each partition every few supersteps and restart from the last checkpoint after a failure.
BSP and Pregel trade some efficiency for a model that is easier to understand, coordinate, and recover.
Why Pregel Fits Graphs Better Than MapReduce
Pregel fits graph algorithms better in both performance and structure.
MapReduce requires repeatedly reconstructing the graph state as key-value data. Pregel keeps the graph alive and lets vertices exchange messages directly across rounds.
That is a better match for algorithms whose natural unit of work is: what should this vertex do next, given what its neighbors told it?
Spark
Why Spark Was Needed
Spark was created at UC Berkeley’s AMPLab around 2009. The project grew out of frustration with MapReduce on workloads that needed repeated passes over the same data, especially iterative machine learning and interactive data analysis. The first Spark paper appeared in 2010, and the system quickly became influential because it attacked one of the most obvious bottlenecks in the Hadoop era: writing every intermediate stage back to storage before the next stage could begin.
In MapReduce, each phase typically writes its output to stable storage before the next phase begins. That design works well for one large batch computation, but it is inefficient for:
- Iterative algorithms
- Multi-stage analytics pipelines
- Workloads that reuse the same intermediate data several times
- Interactive or exploratory analysis
Spark keeps the basic idea of distributed processing over partitions, but it replaces the rigid map-then-reduce structure with a richer dataflow model. It also allows intermediate datasets to be cached in memory, which is crucial for repeated computation.
Spark Architecture
Spark has four main runtime components:
- Driver program: runs the application logic and coordinates the job
- Cluster manager: allocates resources across the cluster
- Workers: machines that execute tasks
- Executors: processes on workers that run tasks and store cached data
The driver constructs the computation. Executors perform the work over partitions of data. Each executor can also keep cached partitions in memory for reuse by later stages.
This architecture reflects Spark’s design as a distributed runtime that coordinates computation across many machines while keeping data partitioned among them.
Resilient Distributed Datasets
Spark’s designers wanted an abstraction for working with large datasets spread across a cluster without forcing every stage of computation to write its results back to storage. The result was the Resilient Distributed Dataset, or RDD, Spark’s original core abstraction.
An RDD is a logical collection of data partitioned across many machines. It is designed so that the system can process those partitions in parallel, recover lost partitions if a worker fails, and optionally keep frequently used data in memory for reuse.
An RDD has several important properties:
- Partitioned across the cluster
- Immutable
- Derived from stable input data or from other RDDs
- Fault-tolerant through lineage
- Optionally persisted in memory or on disk
- Resilient: the dataset can recover from failure. If a partition is lost because an executor fails, Spark can recompute that partition from the sequence of operations that created it.
- Distributed: the dataset is split across many machines.
- Dataset: interpreted broadly. An RDD may contain lines of text, tuples, records, parsed objects, key-value pairs, or many other types of elements.
How Spark Inputs Are Partitioned
Spark partitions input in ways that resemble MapReduce, but the abstraction is broader because Spark applies partitioning not only to source data but also to intermediate results.
An RDD may be created from:
- A file or directory of files
- HDFS or S3
- A relational database or NoSQL store such as HBase or Cassandra
- Another RDD derived from an earlier RDD
The partition is again the scheduling unit.
Some examples are:
- A large text file may be split into file partitions.
- A directory of files may yield many partitions, possibly combining small files.
- A database read may be partitioned by key range.
- A keyed RDD created by reduceByKey may use hash partitioning.
As in MapReduce, the partition is the unit assigned to workers. The logical records inside a partition may be lines, rows, tuples, objects, or key-value pairs.
Lazy Evaluation
One of Spark’s most important ideas is lazy evaluation.
Transformations do not execute immediately. Instead, Spark records them as a lineage graph describing how the result can be computed.
For example:
lines = textFile("logs.txt")
errors = lines.filter(startsWith("ERROR"))
fields = errors.map(split_on_tabs)
messages = fields.map(extract_message)
mysqlCount = messages.filter(contains("mysql")).count()
When lines, errors, fields, and messages are defined, Spark does not yet run the job. It records the transformations.
Only when the action count() is invoked does Spark build an execution plan and schedule tasks.
Lazy evaluation has several benefits:
- Spark can avoid unnecessary work.
- Spark can pipeline compatible operations together.
- Spark can optimize execution around the requested action.
- Spark can recompute lost partitions later because the lineage is known.
An easy way to think about it is that transformations build a recipe. An action tells Spark to actually run it.
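The recipe idea can be illustrated with a tiny lazy dataset class in Python. This is not Spark's API; it only shows that transformations record operations without running anything, and an action triggers the computation over each partition.

```python
class LazyDataset:
    # Minimal sketch of lazy transformations over partitioned data.
    def __init__(self, partitions, ops=()):
        self.partitions = partitions     # list of lists of records
        self.ops = ops                   # recorded transformations (the recipe)

    def map(self, f):
        # Transformation: record the operation, compute nothing yet.
        return LazyDataset(self.partitions, self.ops + (("map", f),))

    def filter(self, pred):
        return LazyDataset(self.partitions, self.ops + (("filter", pred),))

    def _compute(self, partition):
        # Replay the recorded recipe over one partition.
        for op, f in self.ops:
            partition = ([f(x) for x in partition] if op == "map"
                         else [x for x in partition if f(x)])
        return partition

    def count(self):
        # Action: only now do the recorded transformations actually run.
        return sum(len(self._compute(p)) for p in self.partitions)

lines = LazyDataset([["ERROR mysql down", "INFO ok"], ["ERROR php bug"]])
errors = lines.filter(lambda s: s.startswith("ERROR"))   # nothing runs yet
mysql = errors.filter(lambda s: "mysql" in s)            # still nothing
print(mysql.count())  # 1  <- the action triggers the whole pipeline
```

Because the recipe is kept alongside the data, the same structure also explains lineage-based recovery: any partition's contents can be rebuilt by replaying `ops` over its raw input.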
Caching RDDs in Memory
Caching is one of Spark’s defining features.
Suppose a program extracts error messages from a large log file and then asks several questions about those messages. Without caching, every action may trigger the recomputation of the same filtering and parsing steps.
With caching:
lines = textFile("logs.txt")
errors = lines.filter(startsWith("ERROR"))
messages = errors.map(parse_message)
messages.cache()
mysqlCount = messages.filter(contains("mysql")).count()
phpCount = messages.filter(contains("php")).count()
Here, messages.cache() tells Spark to persist the messages RDD after it is first computed. The second action can then reuse the cached partitions instead of re-reading and re-parsing the raw logs.
Caching is especially useful for:
- Iterative algorithms
- Interactive analysis
- Workloads with shared intermediate data
- Machine learning pipelines
Caching is not free. It uses memory, and if memory is insufficient Spark may spill data to disk depending on the persistence mode. But when a dataset is reused, caching can change the cost structure dramatically.
Fault Tolerance Through Lineage
Spark’s fault tolerance is based on lineage rather than eager replication of every intermediate result.
If an executor fails and one of its partitions is lost, Spark determines how that partition was created and recomputes only the missing partition.
For example, if:
messages = textFile("logs.txt").filter(...).map(...)
and partition 7 of messages is lost, Spark does not recompute the whole RDD. It recomputes partition 7 by replaying the necessary transformations on the corresponding input partition.
This works well because most transformations are deterministic.
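Recovery can be sketched in a few lines of Python: keep the raw input and the recorded transformations, and replay them for just the lost partition. The names and data below are illustrative.

```python
def make_partition(raw_partition, lineage):
    # Replay the recorded lineage (a list of per-element functions)
    # over one raw input partition.
    data = raw_partition
    for transform in lineage:
        data = [transform(x) for x in data]
    return data

raw = {7: ["ERROR a", "ERROR b"]}               # raw input for partition 7
lineage = [str.upper, lambda s: s.split()[1]]   # recorded transformations

computed = {7: make_partition(raw[7], lineage)}  # normal execution
computed.pop(7)                                  # executor fails: partition lost

# Recovery: replay the lineage for the missing partition only;
# no other partition is touched.
computed[7] = make_partition(raw[7], lineage)
print(computed[7])  # ['A', 'B']
```

The key cost trade-off: nothing extra is written during normal execution, and a failure costs one recomputation of only the affected partitions.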
Spark Transformations and Actions
Spark distinguishes between transformations, which create new RDDs lazily, and actions, which trigger execution and either return a result or write output.
Common Spark transformations
| Transformation | What it does | Notes |
|---|---|---|
| map(f) | Applies f to each element | One output per input element |
| flatMap(f) | Applies f and flattens the results | Useful when one input yields zero or more outputs |
| filter(pred) | Keeps only elements satisfying pred | Narrow transformation |
| sample(...) | Returns a random sample | Useful for approximate analysis or testing |
| union(other) | Concatenates two RDDs | Does not remove duplicates |
| intersection(other) | Keeps elements present in both RDDs | Usually requires shuffle |
| distinct() | Removes duplicate elements | Often shuffle-heavy |
| groupByKey() | Groups all values with the same key | Can be expensive because it materializes full value lists |
| reduceByKey(f) | Combines values with the same key using f | Usually preferred over groupByKey for aggregation |
| aggregateByKey(...) | Generalized per-key aggregation | Useful when local and global combination differ |
| sortByKey() | Sorts keyed data by key | Requires global ordering work |
| join(other) | Joins two keyed RDDs on matching keys | Often requires shuffle |
| cogroup(other) | Groups values from multiple keyed RDDs by key | Useful for more general joins |
| cartesian(other) | Produces all pairs from two datasets | Usually very expensive |
Common Spark actions
| Action | What it does | Notes |
|---|---|---|
| `count()` | Returns the number of elements | Triggers evaluation |
| `collect()` | Returns all elements to the driver | Dangerous for very large datasets |
| `first()` | Returns the first element | Useful for quick inspection |
| `take(n)` | Returns the first `n` elements | Safer than `collect()` for inspection |
| `reduce(f)` | Aggregates elements using `f` | `f` should be commutative and associative for distributed use |
| `countByKey()` | Counts the number of values per key | Returns results to the driver |
| `saveAsTextFile(...)` | Writes results to storage | Common terminal action |
| `saveAsSequenceFile(...)` | Writes results in Hadoop SequenceFile format | Useful in Hadoop ecosystems |
| `foreach(f)` | Applies `f` for side effects | Use carefully in distributed settings |
A useful rule of thumb is that operations involving regrouping by key or global ordering are the expensive ones, because they usually require shuffle.
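The rule of thumb can be made concrete with a pure-Python sketch, no Spark required. The functions and record counts below are invented for illustration; they model how many records must cross the network during the shuffle under each strategy:

```python
# Why reduceByKey usually beats groupByKey: combining values map-side
# shrinks what must be shuffled across the network.
from collections import defaultdict

def records_shuffled_groupByKey(partitions):
    # groupByKey ships every (key, value) record to the reducers
    return sum(len(part) for part in partitions)

def records_shuffled_reduceByKey(partitions, combine):
    # reduceByKey first combines per key within each partition, then ships
    # only one partial result per (partition, key) pair
    shipped = 0
    for part in partitions:
        local = defaultdict(int)
        for k, v in part:
            local[k] = combine(local[k], v)
        shipped += len(local)
    return shipped

parts = [[("a", 1)] * 1000 + [("b", 1)] * 500,
         [("a", 1)] * 800 + [("c", 1)] * 200]

print(records_shuffled_groupByKey(parts))                       # 2500 records
print(records_shuffled_reduceByKey(parts, lambda x, y: x + y))  # 4 records
```

With only four distinct (partition, key) pairs, map-side combining cuts the shuffled volume from 2500 records to 4, which is why `reduceByKey` is the default recommendation for aggregation.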
Narrow and Wide Dependencies
Spark performance becomes easier to understand when transformations are viewed through their dependencies.
A narrow dependency means each output partition depends on a small number of input partitions, often just one. Examples include:
- `map`
- `filter`
- Some forms of `flatMap`
These are often cheap because they can be pipelined without repartitioning the data.
A wide dependency means an output partition depends on data from many input partitions. Some examples are:
- `groupByKey`
- `reduceByKey`
- `join`
- `sortByKey`
These operations usually require a shuffle, which means the framework repartitions data across the cluster so that related records end up on the same worker, or so that records can be placed into a new global order. That process typically involves moving data over the network, reorganizing it by key or sort order, and then continuing the computation on the new partitions.
This is Spark’s version of an important MapReduce lesson: regrouping data across the cluster is often where much of the cost appears.
Shared Variables
Spark includes a few controlled forms of shared state.
Broadcast variables
A broadcast variable is read-only data sent once to executors and cached there. This is useful when many tasks need the same lookup table, dictionary, model parameters, or configuration object.
Without broadcasting, the same data may be sent repeatedly for many tasks.
Accumulators
An accumulator supports distributed aggregation into a counter or sum-like variable. Workers can add to it, and the driver can read the final value.
This is useful for counts, statistics, and debugging summaries, but it is not general shared mutable state.
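The semantics can be sketched with a toy class (not Spark's API): workers may only add to the accumulator, and only the driver reads the final total after the job finishes:

```python
# Toy accumulator: add-only from workers, read-only from the driver.
class Accumulator:
    def __init__(self):
        self.value = 0

    def add(self, n):        # addition is commutative, so arrival order does not matter
        self.value += n

bad_records = Accumulator()
partitions = [["ok", "bad", "ok"], ["bad", "bad"]]
for part in partitions:      # each loop iteration plays the role of one worker
    for record in part:
        if record == "bad":
            bad_records.add(1)

print(bad_records.value)     # driver-side read: 3 bad records in total
```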
A Spark Example: Error Log Analysis
Consider a large application log stored across a cluster. A common task is to isolate error records, extract the message portion of each record, and reuse that extracted dataset for several analyses.
For example, one analysis might count how many error messages mention MySQL, another might count how many mention PHP, and a third might identify the most frequent terms appearing in error messages. This workflow benefits from keeping an intermediate dataset available for reuse.
A possible pipeline is:
lines = sc.textFile("app.log")
errors = lines.filter(lambda line: line.startswith("ERROR"))
fields = errors.map(lambda line: line.split("\t"))
messages = fields.map(lambda fields: fields[1])   # message is one of the fields
messages.cache()
mysqlCount = messages.filter(lambda m: "mysql" in m).count()
phpCount = messages.filter(lambda m: "php" in m).count()
topTerms = (messages.flatMap(tokenize)
                    .map(lambda term: (term, 1))
                    .reduceByKey(lambda a, b: a + b)
                    .sortBy(lambda pair: pair[1], ascending=False)
                    .take(20))
The code carries out the analysis in a sequence of steps:
- `textFile("app.log")` creates an RDD whose elements are lines from the log file.
- `errors = lines.filter(...)` keeps only the lines that represent error records.
- `fields = errors.map(...)` parses each error record into fields, such as by splitting on tab characters.
- `messages = fields.map(...)` extracts just the message portion of each parsed record.
- `messages.cache()` tells Spark to keep the `messages` RDD available after it is first computed, so later actions can reuse it.
- `mysqlCount = messages.filter(...).count()` keeps only messages containing `"mysql"` and counts them. The call to `count()` is an action, so it triggers execution of the pipeline.
- `phpCount = messages.filter(...).count()` performs a second count, for messages containing `"php"`. Because `messages` has been cached, Spark can reuse it instead of repeating the earlier filtering and parsing steps.
- `messages.flatMap(tokenize)` breaks each message into individual terms and flattens the results into one stream of terms.
- `.map(...)` turns each term into a `(term, 1)` key-value pair so that terms can be counted.
- `.reduceByKey(...)` adds together all the counts for identical terms.
- `.sortBy(...)` orders the `(term, count)` pairs from highest count to lowest.
- `.take(20)` returns the first 20 results from that sorted output, which in this case are the 20 most frequent terms.
Some key aspects of Spark illustrated by this example are:
- Transformations build a lineage of computation without immediate execution.
- Execution begins only when an action such as `count()` or `take(20)` is invoked.
- Intermediate datasets can be cached to avoid recomputing earlier stages.
- Multiple analyses can reuse the same cached dataset efficiently.
- Operations such as `reduceByKey` and sorting support richer multi-stage pipelines than a fixed map-then-reduce structure.
Spark Beyond Core RDDs
Although Spark has grown into a broader ecosystem, RDDs remain its historical core, and the original RDD model is still the clearest way to explain its execution semantics.
Important extensions include:
- Spark SQL, which adds relational and schema-aware processing
- MLlib, which provides machine learning algorithms and utilities
- GraphX, which adds graph computation on top of Spark
- Spark Streaming, which originally processed live data as a sequence of micro-batches
These additions do not change the core ideas introduced here. Spark is still best understood as a distributed dataflow engine built around partitioned datasets, lazy evaluation, lineage, and optional persistence.
Distributed Machine Learning
What Machine Learning Changes
The frameworks discussed so far were designed primarily for data processing. They read large inputs, transform or aggregate them, and produce output datasets or summaries. Machine learning training has a different computational shape. The same training data is used repeatedly, and workers must coordinate across many rounds as model parameters are updated.
In this context, a model is a program with adjustable numerical parameters that makes predictions from data. Training means repeatedly adjusting those parameters so the model produces better results. Each round of training computes numerical updates from part of the data, and those updates must be combined across workers.
The difference can be summarized by comparing ordinary data processing with model training.
| Aspect | Data Processing | Machine Learning Training |
|---|---|---|
| Passes over the data | One or a few | Many |
| State across rounds | Usually little or none | Model parameters updated every round |
| Partitioned object | Input dataset | Data, model, or both |
| Typical result | Transformed dataset | Trained model |
This difference explains why general data-processing frameworks solve only part of the problem. They are useful for preparing training data, sampling, filtering, feature construction, and some smaller-scale iterative workloads. Large-scale model training introduces tighter communication, persistent state, and hardware requirements that call for more specialized runtimes.
General Challenges in Distributed Training
Training a model on one machine is already iterative. Distributing that process across many machines introduces additional engineering problems.
Some of the main challenges are:
- Repeated coordination. Workers do not compute once and finish. They repeatedly exchange updates throughout training.
- Communication cost. Every worker may produce large update vectors or intermediate numerical data that must be aggregated or exchanged efficiently.
- Model state. The current model must remain available and consistent enough for the next round of computation.
- Hardware constraints. Training often depends on GPUs or other accelerators, and the model or its intermediate data may exceed the memory of one device.
- Load balance. If some workers finish much later than others, the faster workers sit idle at synchronization points.
- Fault tolerance. Restarting a long training job from the beginning can be extremely expensive, so systems often need checkpointing or other recovery mechanisms.
These challenges lead to a central design question: when training is distributed, what should be partitioned?
Data Parallelism
In data parallelism, the training data is partitioned across workers while each worker holds a copy of the model. Each worker processes its own small subset of data, often called a mini-batch, computes numerical updates, often called gradients, and then participates in combining those updates with the others.
At a high level, the process looks like this:
1. Partition the training data across workers.
2. Give each worker a copy of the current model.
3. Let each worker compute updates on its local mini-batch.
4. Aggregate the updates across workers.
5. Update the model parameters.
6. Repeat for the next training step.
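The loop above can be sketched in a single process. The toy problem, shard count, and learning rate below are invented for illustration: four simulated workers fit the line y = 3x by computing local gradients and averaging them, which stands in for the cross-worker aggregation step:

```python
# Single-process sketch of data-parallel gradient descent on y = w * x.
import random

def local_gradient(w, shard):
    # mean-squared-error gradient for y = w * x on one worker's shard
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

random.seed(0)
data = [(x, 3.0 * x) for x in (random.uniform(-1, 1) for _ in range(400))]

shards = [data[i::4] for i in range(4)]       # 1. partition data across 4 workers
w = 0.0                                       # 2. all workers share one model copy
for _ in range(200):
    grads = [local_gradient(w, s) for s in shards]   # 3. local updates
    avg_grad = sum(grads) / len(grads)               # 4. aggregate across workers
    w -= 0.1 * avg_grad                              # 5. update the replicated model
                                                     # 6. repeat

print(round(w, 2))   # converges close to the true slope 3.0
```

Because the per-shard gradients are averaged, each step is mathematically equivalent to one step on the full dataset; the distributed-systems work is entirely in step 4.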
Data parallelism is the most common form of distributed training because it aligns naturally with the way distributed systems partition data. Its main difficulty is communication. If the model has millions or billions of parameters, each worker produces a very large set of updates, and those updates must be combined efficiently at every training step.
Two common ways to organize that aggregation are a parameter server and all-reduce.
Parameter Server
A parameter server keeps the model parameters on one or more central servers. Workers send their updates to the server and receive updated parameters back.
This approach is conceptually straightforward, but the server can become a bottleneck as the number of workers grows.
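A toy sketch of the interaction pattern, using an invented `ParameterServer` class: one central object holds the parameters while workers push gradients and pull the latest values:

```python
# Toy parameter server: central parameters, push/pull interface.
class ParameterServer:
    def __init__(self, dim):
        self.w = [0.0] * dim

    def push(self, grad, lr=0.1):
        # apply one worker's gradient step to the central parameters
        self.w = [wi - lr * g for wi, g in zip(self.w, grad)]

    def pull(self):
        return list(self.w)

server = ParameterServer(dim=2)
worker_grads = [[1.0, -2.0], [3.0, 0.0]]     # one gradient per worker this round
for g in worker_grads:                       # pushes may arrive in any order
    server.push(g)

print([round(v, 2) for v in server.pull()])  # [-0.4, 0.2]
```

Every worker's traffic flows through this one object, which is exactly the bottleneck the prose describes; real systems shard the parameters across several servers to spread that load.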
All-Reduce
In all-reduce, workers cooperate to aggregate updates without relying on one central coordinator. The result of the aggregation becomes available to all workers.
This approach scales better and is the dominant pattern in modern large-scale deep learning systems.
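Ring all-reduce, the most common variant, can be simulated in one process. The helper names below are invented; the sketch assumes n workers each holding a gradient vector whose length is divisible by n, and each worker only ever talks to its ring neighbor:

```python
# Ring all-reduce in two phases: reduce-scatter, then all-gather.
def ring_allreduce(vectors):
    n = len(vectors)
    chunk = len(vectors[0]) // n
    bufs = [list(v) for v in vectors]

    def send_round(chunk_of, combine):
        # snapshot outgoing chunks, then deliver each to the right-hand neighbor
        outgoing = []
        for i in range(n):
            c = chunk_of(i)
            outgoing.append((c, bufs[i][c * chunk:(c + 1) * chunk]))
        for i in range(n):
            c, data = outgoing[(i - 1) % n]       # receive from left neighbor
            combine(bufs[i], c, data)

    def add(buf, c, data):                        # reduce-scatter: accumulate
        for k, v in enumerate(data):
            buf[c * chunk + k] += v

    def overwrite(buf, c, data):                  # all-gather: copy through
        buf[c * chunk:(c + 1) * chunk] = data

    for step in range(n - 1):                     # phase 1: reduce-scatter
        send_round(lambda i, s=step: (i - s) % n, add)
    for step in range(n - 1):                     # phase 2: all-gather
        send_round(lambda i, s=step: (i + 1 - s) % n, overwrite)
    return bufs

vecs = [[1, 2, 3, 4, 5, 6],
        [10, 20, 30, 40, 50, 60],
        [100, 200, 300, 400, 500, 600]]
result = ring_allreduce(vecs)
print(result[0])   # every worker ends with [111, 222, 333, 444, 555, 666]
```

Each worker sends roughly 2(n-1)/n of one vector's worth of data regardless of how many workers participate, which is why the pattern scales so well.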
Model Parallelism
Data parallelism assumes that the full model fits on each worker or device. That assumption breaks down for very large models.
In model parallelism, the model itself is partitioned across devices. Different workers hold different parts of the model and cooperate to carry out the sequence of computations used to produce a prediction and compute updates. This reduces the memory burden on any one device, but it increases coordination and communication.
Two common forms of model parallelism are pipeline parallelism and tensor parallelism.
Pipeline Parallelism
In pipeline parallelism, the model is divided into consecutive groups of layers, and each device holds one group. Data flows through the model from device to device.
To keep devices busy, systems often use micro-batching, where several smaller batches are in flight through the pipeline at once.
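A toy occupancy model shows why micro-batching helps. The formula below is a simplification that assumes each of `stages` devices takes exactly one tick per micro-batch and that micro-batches flow through back to back:

```python
# Idealized pipeline utilization: useful stage-ticks over total stage-ticks,
# including the fill and drain phases at the ends of the pipeline.
def pipeline_utilization(stages, microbatches):
    total_ticks = stages + microbatches - 1          # fill + steady state + drain
    busy_stage_ticks = stages * microbatches         # ticks doing real work
    return busy_stage_ticks / (total_ticks * stages)

for m in (1, 4, 16, 64):
    # utilization climbs from 25% toward 100% as more micro-batches are in flight
    print(m, round(pipeline_utilization(4, m), 2))
```

With one batch, three of the four devices are idle at any moment; with many micro-batches in flight, the fill and drain phases become a small fraction of the total time.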
Tensor Parallelism
In tensor parallelism, the computation within a layer is itself split across devices. Large matrix multiplications are divided so that each device computes part of the result.
This approach allows very large layers to be distributed, but it usually requires more frequent communication than pipeline parallelism because intermediate values must be exchanged within each layer.
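A minimal sketch of the idea, using invented toy weights: one matrix-vector product is split by output rows across two "devices", and the partial results are combined afterward. Real tensor-parallel systems split much larger layers the same way, which is where the extra communication comes from:

```python
# Splitting one matrix-vector product across two devices by output rows.
def matvec(W, x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

W = [[1, 2], [3, 4], [5, 6], [7, 8]]    # 4x2 layer weights
x = [10, 100]                           # activations entering the layer

W_dev0, W_dev1 = W[:2], W[2:]           # each device holds half the rows
y = matvec(W_dev0, x) + matvec(W_dev1, x)   # concatenate the partial outputs

assert y == matvec(W, x)                # same result as the unsplit layer
print(y)
```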
Combining Approaches
Large training systems often combine these strategies rather than choosing only one.
A typical large-scale design may use:
- Data parallelism across groups of workers
- Pipeline parallelism across groups of layers
- Tensor parallelism within large layers
This combination reflects the fact that distributed training has to manage both data size and model size.
Where General Frameworks Fit
These distinctions also clarify where earlier frameworks fit.
MapReduce is a poor fit for modern model training because it is built around one-pass batch stages, while training is inherently iterative.
Spark is much better suited to data preparation and to some smaller-scale iterative workloads. Its support for multi-stage pipelines, caching, and lineage makes it useful for preprocessing and some machine learning tasks built on distributed data.
Large-scale deep learning training usually relies on more specialized runtimes built around efficient collective communication and accelerator-aware scheduling. In that setting, general distributed frameworks often play a supporting role in the training process rather than serving as the numerical core of training.
Optional Note: Ray
This section is not part of the core material for this class and is included for awareness, because Ray has emerged as a major distributed computing framework for AI and machine learning workloads.
Why Ray Emerged
Spark provides a strong model for large-scale data processing built around partitioned datasets and dataflow. That model works well when a computation can be expressed as a sequence of operations over large collections of data.
Some distributed applications follow a different structure. A program may launch many small pieces of work, start new work based on intermediate results, keep some components running for a long time, or combine batch processing with long-lived services. These patterns appear in machine learning systems, simulations, and other distributed applications.
Ray was designed for this setting. It provides a distributed runtime for programs whose structure is more dynamic than a fixed dataflow pipeline. Spark centers computation on datasets and stages, where each stage transforms partitions of data. Ray centers computation on tasks and long-lived workers, where the program explicitly creates work and coordinates components that may persist over time.
How Ray Works
Ray’s execution model is built from three core mechanisms:
- Tasks are functions that run remotely and may execute in parallel on different workers.
- Actors are long-lived workers that preserve state across method calls.
- Objects are values stored in a distributed object store and passed by reference among tasks and actors.
These mechanisms give two ways to organize computation.
A task runs independently and returns a result. It does not retain state after it finishes. A later task can use its result without needing to know where it was computed.
An actor remains alive and keeps state across calls. This is useful when some part of the system must remember information over time, such as a coordinator, a cache, or a service that maintains an evolving model.
Objects produced by tasks or actors are placed in a shared object store. Other tasks or actors can access them through references. This allows results to be reused across the system without manually transferring data between components.
Execution proceeds as follows:
1. The program creates tasks or actors.
2. The scheduler places them on available machines.
3. Tasks and actor methods produce objects.
4. Other tasks or actors use those objects through references.
5. The runtime tracks dependencies and runs work when inputs are ready.
This model allows a program to mix parallel work with long-lived components in a single system.
Remote Tasks
@ray.remote
def parse(file):
    # read and parse one log file
    return parsed_records

@ray.remote
def count_errors(records):
    return number_of_errors

r1 = parse.remote("log1")        # returns an object reference immediately
r2 = parse.remote("log2")        # the two parse tasks may run in parallel
e1 = count_errors.remote(r1)     # scheduled once r1's result is ready
e2 = count_errors.remote(r2)
Each task runs independently and returns a result that can be used by later tasks.
Actors
@ray.remote
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x

    def get(self):
        return self.total

c = Counter.remote()               # start the actor on some worker
c.add.remote(5)
c.add.remote(3)
result = ray.get(c.get.remote())   # 8
The actor keeps state across calls. This makes it useful for coordination, caching, or any component that must evolve over time.
Ray can also place tasks and actors based on available resources, such as CPUs or GPUs. This allows different parts of a program to run on machines that match their requirements.
Where Ray Fits Compared to Spark
Spark is strongest when computation can be expressed as a structured sequence of operations over datasets. In that model, data is partitioned, transformed in stages, and passed through a pipeline.
Ray is aimed at cases where the computation is more dynamic or involves long-lived state.
Some examples are:
- Work that launches new tasks during execution rather than following a fixed sequence
- Components that must remain alive and update their state over time
- Systems that combine batch processing with services that continue running
- Workloads that mix different kinds of computation within one program
These patterns do not map cleanly to a stage-based dataflow model. Ray supports them directly by allowing the program to create tasks, maintain state in actors, and share results through the object store as the computation evolves.
Ray does not replace Spark for large-scale data processing. It complements it by supporting a different style of distributed program.
What Ray Is Best Suited For
Ray is best suited for distributed applications that:
- Launch work dynamically as the program runs
- Maintain state across multiple steps
- Combine parallel tasks with long-lived components
Some examples are:
- Distributed machine learning training and coordination
- Large sets of parallel experiments
- Simulation systems
- Batch inference
- Model serving
The central idea is that the program behaves as a collection of interacting components rather than a fixed pipeline over data. Components create work, share results, and maintain state as the computation progresses.
Putting the Frameworks Together
These systems are easiest to compare by the shape of the workload they support.
| Framework | Best fit | Main idea | Main weakness |
|---|---|---|---|
| MapReduce | Large batch jobs | map -> shuffle/sort -> reduce | Rigid, poor for iteration |
| BSP | Round-based parallel algorithms | supersteps with barriers | Waiting at each barrier |
| Pregel/Giraph | Iterative graph algorithms | vertex-centric message passing | Specialized model, synchronization cost |
| Spark | Multi-stage analytics and iterative data processing | lazy dataflow over partitioned datasets with caching | Shuffles and memory pressure still matter |
| Ray (awareness only) | General distributed task orchestration | tasks, actors, distributed objects | Lower-level model, more responsibility for the programmer |
Summary
The move from distributed storage to distributed computation forced systems to answer a new set of questions: how to partition input, how to exploit locality, how to move intermediate data, how to recover from failures, and how to keep one slow worker from delaying everything else.
MapReduce answered those questions for large batch jobs by imposing a rigid but powerful structure. BSP made repeated rounds of communication explicit. Pregel adapted BSP to graph processing by making the vertex the unit of thought. Spark generalized distributed processing into a richer dataflow model with lazy evaluation, lineage-based recovery, and in-memory caching. For broader awareness, Ray extended the landscape toward flexible distributed execution for heterogeneous workloads built from tasks, actors, and distributed objects.
No single framework is the answer to every large-scale computation problem. Each one makes a particular computational shape manageable.
The deeper lesson is that distributed computation depends on organizing data movement, synchronization, and recovery so that parallelism remains useful at scale.