pk.org: CS 417/Lecture Notes

Scalable Distributed Computation

Frameworks for Parallel Data Processing Beyond Storage

Paul Krzyzanowski – April 3, 2026

From Storage to Computation

Distributed systems first had to solve the storage problem. Once a dataset became too large for one machine, the obvious answer was to partition it across many machines and replicate it for fault tolerance. That made large-scale storage practical, but it immediately created a second problem. Once the data is spread out, computation has to be spread out too.

That second problem is easy to underestimate until the scale becomes concrete. If a computation takes only 100 milliseconds per item, then one billion items take more than three years on one machine. The only realistic answer is parallelism. The work has to be divided into many pieces and executed concurrently across many systems.
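The arithmetic is worth checking. A short Python calculation, using the 100-millisecond and one-billion figures from the paragraph above (the 1,000-worker cluster size is an illustrative assumption, with perfect speedup):

```python
items = 1_000_000_000            # one billion items
per_item = 0.100                 # 100 milliseconds of work per item

total_seconds = items * per_item
years = total_seconds / (365 * 24 * 3600)
print(f"one machine: {years:.2f} years")       # a bit over 3 years

workers = 1_000                  # assumed cluster size, perfect speedup
hours = total_seconds / workers / 3600
print(f"{workers} workers: {hours:.1f} hours")
```

Even with the unrealistic assumption of perfect speedup, the difference between years and hours is what makes parallelism the only realistic answer.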

At first glance, the solution seems straightforward. Split the data into chunks, send each chunk to a worker, run the same program on each worker, and combine the results. That works for embarrassingly parallel problems, where each chunk can be processed independently, and the final answer is just a straightforward combination of partial results. But many useful workloads are more complicated. They need to regroup records by key, exchange intermediate state among workers, revisit the same dataset repeatedly, or preserve derived data for later stages.

Once those requirements appear, the main cost is no longer just CPU time. The hard parts become data movement, synchronization, fault recovery, and load imbalance. A good distributed computation framework does not eliminate those costs. It organizes them.

The development of large-scale data processing frameworks can be viewed as a sequence of attempts to impose the right structure on those costs.

The important lesson is that these frameworks are not interchangeable descriptions of the same idea. Each one is built around a particular style of computation.

The Structure of Distributed Computation

Despite their differences, these systems all aim to address the same underlying issues.

  1. Partitioning. A large dataset has to be cut into pieces so different workers can process it concurrently. Those pieces may be file blocks, individual files, ranges of rows, partitions of key-value data, or partitions of an in-memory dataset. Whatever form they take, they become the unit of scheduling.

  2. Locality. If the input data already resides on one machine, it is usually cheaper to move the computation there than to pull the data across the network. This is most important when reading raw input.

  3. Communication. Many computations cannot be finished independently on each partition. Data has to be regrouped by key, updates have to be propagated among workers, or partial results have to be merged. Communication often dominates the runtime.

  4. Synchronization. Some systems run in loosely connected stages. Others advance in explicit rounds with barriers. Synchronization simplifies reasoning, but it also means that fast workers may spend time waiting.

  5. Fault tolerance. Failures are normal in a cluster. Machines crash, disks stall, and networks degrade. A useful framework must recover without forcing the entire job to restart from the beginning.

  6. Skew and stragglers. Parallelism is effective only when work is reasonably balanced. If one partition is much heavier than the others, or one worker runs unusually slowly, the entire job finishes at the pace of the slowest remaining task.

Those six ideas reappear throughout this topic: partitioning, locality, communication, synchronization, fault tolerance, and stragglers.

MapReduce

Why MapReduce Was Created

MapReduce was created at Google to support large batch computations over enormous datasets, especially workloads related to web search, indexing, log analysis, and the construction of derived data products. Those jobs often had the same broad shape: scan a massive input, extract useful intermediate information, regroup that information by some key, and compute aggregated results.

Before MapReduce, engineers had to write custom distributed programs to do this work. That meant managing worker processes, dividing the input, restarting failed tasks, moving intermediate results, and coordinating the final output. Much of the program was infrastructure rather than application logic.

MapReduce changed that by imposing a strict computational structure. The user wrote a map function and a reduce function. The framework handled the mechanics of parallelization, task distribution, data placement, load balancing, monitoring, and failure recovery.

The model was inspired by the map and reduce functions from functional programming, especially Lisp, but its importance came from turning those ideas into a practical cluster-scale runtime.

The key design choice was restriction. By limiting the structure of the computation, the runtime gained enough control to manage a large cluster automatically.

The MapReduce Model

MapReduce operates on key-value pairs.

The user supplies two functions:

  1. map(k1, v1) -> list(k2, v2)

  2. reduce(k2, list(v2)) -> list(v3)

The map function processes input records and emits intermediate key-value pairs.

The reduce function receives one intermediate key together with all values associated with that key and emits output.

That description hides the middle of the computation, which is where most of the work happens. Before reduction can occur, the framework must gather all intermediate pairs with the same key. That requires partitioning the intermediate data, moving it to reducers, sorting it by key, and grouping equal keys together.

A more accurate high-level picture of what goes on is:

  1. Read input partitions.

  2. Run map tasks.

  3. Partition intermediate pairs by key.

  4. Shuffle partitions to reducers.

  5. Sort and group identical keys.

  6. Run reduce tasks.

  7. Write final output.

The shuffle-and-sort step is the center of MapReduce. It is the point where the system reorganizes the data from its original storage layout into a layout defined by computation.
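The seven steps above can be sketched in a few lines of single-process Python. This is a toy model, not a distributed implementation: `run_mapreduce` is our own name, and the byte-sum hash stands in for hash(key) mod R.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R=4):
    """Single-process sketch of the MapReduce data flow:
    map -> partition -> shuffle -> sort/group -> reduce."""
    # Map: each input record emits intermediate key-value pairs,
    # which are partitioned by key (stand-in for hash(key) mod R).
    partitions = [defaultdict(list) for _ in range(R)]
    for k, v in inputs:
        for k2, v2 in map_fn(k, v):
            r = sum(str(k2).encode()) % R
            partitions[r][k2].append(v2)

    # Shuffle is implicit here: reducer r would fetch partitions[r]
    # from every mapper. Then sort, group identical keys, and reduce.
    output = []
    for part in partitions:
        for k2 in sorted(part):
            output.extend(reduce_fn(k2, part[k2]))
    return output

pairs = run_mapreduce(
    [(None, "a b a")],
    map_fn=lambda _, text: [(w, 1) for w in text.split()],
    reduce_fn=lambda w, counts: [(w, sum(counts))],
)
```

All pairs with the same key land in the same partition, which is exactly what lets the reduce function see every value for a key at once.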

How Input Is Partitioned

The input to a MapReduce job is usually divided into input splits, also called shards. The exact form depends on the storage system and the input format.

Common cases include fixed-size blocks of a large file in a distributed file system, individual files (one split per small file), and ranges of rows in a structured dataset.

Two separate ideas are important here.

The first is the scheduling unit. That is the split assigned to a map task.

The second is the logical record seen by the user code. A split may contain many lines, records, or key-value pairs. The input format determines how the split is parsed into records before the map function sees them.

This is why it is not enough to say that MapReduce processes “files.” The framework schedules work over splits, while the user-level computation processes logical records inside those splits.

Word Count

Word count is still the classic first example because it makes the basic pattern easy to see.

Input: lines of text. The input key (often a byte offset or line position) is ignored, and the value is the text of the line.

Map (user code):

map(_, text):
    for each word w in text:
        emit(w, 1)

The map function extracts each word from the input text and emits the pair (word, 1). The word becomes the key, and the value 1 records one occurrence of that word.

Shuffle and sort: all intermediate pairs with the same word are brought together, so each unique word w reaches a reducer as (w, [1, 1, ..., 1]).

Reduce (user code):

reduce(word, counts):
    total = 0
    for c in counts:
        total += c
    emit(word, total)

During shuffle and sort, the framework gathers all values associated with the same word into one list. Each unique word therefore becomes a key whose list contains one 1 for each occurrence. The reduce function adds those values to produce the final count.

This example is useful because it isolates the core pattern: local extraction first, regrouping by key second, aggregation third.
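Under those three steps, word count can be run end to end in plain Python, with the shuffle simulated by a dictionary of lists (the helper names `map_wc` and `reduce_wc` and the two sample documents are ours):

```python
from collections import defaultdict

def map_wc(_, text):
    for word in text.split():
        yield (word, 1)                 # emit(w, 1)

def reduce_wc(word, counts):
    yield (word, sum(counts))           # emit(word, total)

docs = [(1, "to be or not to be"), (2, "to do is to be")]

# Shuffle and sort: gather every 1 emitted for the same word.
groups = defaultdict(list)
for k, v in docs:
    for word, one in map_wc(k, v):
        groups[word].append(one)

counts = dict(pair for word in sorted(groups)
              for pair in reduce_wc(word, groups[word]))
print(counts)   # "to" occurs four times, "be" three times
```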

Building an Inverted Index

A more realistic example is the construction of an inverted index, which maps each term to the documents that contain it. This is much closer to the type of workload that originally motivated MapReduce.

Suppose the input consists of documents, each identified by a docID and containing text.

Map:

map(docID, text):
    for each term t in text:
        emit(t, docID)

Every word (whatever our software decides is a “term”) is output as a key, with the document ID as its value. This is a way of saying “this term appears in this document.”

Intermediate output: a stream of (term, docID) pairs, one for each occurrence of a term in a document.

Shuffle and sort:

The framework combines the data from all identical keys (the terms) into a single list, calling the reduce function once with each unique key (term) along with the list of all the values (docIDs) associated with it.

Reduce:

reduce(term, docIDs):
    postings = unique_sorted(docIDs)
    emit(term, postings)

Final output: one (term, postings) pair per unique term, where postings is the sorted list of documents containing that term.

This example makes the shuffle more meaningful. The documents are initially partitioned by where they happen to be stored. The output index must instead be partitioned by term. The shuffle is the step that changes the data’s organization from storage-based to key-based ownership.
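A minimal single-process sketch of the index construction (the three sample documents are made up for illustration):

```python
from collections import defaultdict

docs = [("d1", "it is what it is"),
        ("d2", "what is it"),
        ("d3", "it is a banana")]

# Map: emit (term, docID) for every term occurrence.
intermediate = [(t, doc_id) for doc_id, text in docs for t in text.split()]

# Shuffle and sort: gather all docIDs associated with each term.
groups = defaultdict(list)
for term, doc_id in intermediate:
    groups[term].append(doc_id)

# Reduce: postings = unique_sorted(docIDs).
index = {term: sorted(set(ids)) for term, ids in groups.items()}
print(index["it"])    # every document contains "it"
```

Note that the intermediate data is keyed by term even though the input was keyed by docID; the regrouping in the middle is what switches the ownership.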

Average Salary by ZIP Code

A second follow-on example shows that MapReduce is not limited to document processing.

Suppose the input is a set of employee records containing salary and ZIP code, and the goal is to compute the average salary for each ZIP code.

Map:

map(_, employee_record):
    emit(employee_record.zip, employee_record.salary)

The map function extracts the ZIP code and salary from each record and emits the ZIP code as the key and the salary as the value.

Reduce:

reduce(zip, salaries):
    total_salary = 0
    count = 0
    for s in salaries:
        total_salary += s
        count += 1
    emit(zip, total_salary / count)

The structure is the same as before. The map stage extracts a grouping key and a local contribution. The reduce stage combines all contributions for that key.

In this case, reduce is called once for each unique ZIP code, and salaries is the list of all salaries associated with that ZIP code.
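A quick sketch in Python, using a handful of made-up records, shows the same group-then-aggregate shape:

```python
from collections import defaultdict

records = [{"zip": "08901", "salary": 60000},
           {"zip": "08901", "salary": 80000},
           {"zip": "10001", "salary": 90000}]

# Map: emit (zip, salary); the shuffle groups salaries by ZIP code.
by_zip = defaultdict(list)
for rec in records:
    by_zip[rec["zip"]].append(rec["salary"])

# Reduce: average the list of salaries for each unique ZIP code.
averages = {z: sum(s) / len(s) for z, s in by_zip.items()}
print(averages["08901"])   # 70000.0
```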

Execution Model

A typical MapReduce execution proceeds as follows:

  1. Split the input into M shards.

  2. Launch worker processes.

  3. Assign map tasks to workers.

  4. Each mapper reads its shard and emits intermediate key-value pairs.

  5. The mapper partitions those pairs into R reducer partitions, often using hash(key) mod R.

  6. Reducers fetch their partitions from all mappers.

  7. Each reducer sorts its data by key and groups identical keys.

  8. The reduce function runs once per key.

  9. Each reducer writes an output file.

The use of M map tasks and R reducer partitions is worth keeping explicit. It makes the structure of the job concrete. The system is coordinating data flow from many input shards to many reducer partitions.
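The hash(key) mod R rule has one subtlety: every mapper must compute the same hash for the same key. A sketch of a stable partitioner (Python's built-in hash() is randomized per process for strings, so hashlib is used instead; the function name is ours):

```python
import hashlib

R = 4  # number of reducer partitions

def partition(key, R=R):
    """Assign a key to one of R reducer partitions: hash(key) mod R.
    Uses a stable hash so every mapper agrees on the assignment."""
    h = int.from_bytes(hashlib.md5(str(key).encode()).digest()[:8], "big")
    return h % R

# All mappers send pairs for the same key to the same reducer partition:
assert partition("apple") == partition("apple")
assert 0 <= partition("banana") < R
```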

Locality

MapReduce tries to schedule map tasks on machines that already hold the needed input data, or at least on nearby machines. This is data locality.

The idea is that moving code is cheap, but moving huge input datasets is not.

This optimization matters most for the map phase because maps read raw input directly from storage. Reducers cannot exploit locality in the same way because they must gather intermediate data from many mappers scattered across the cluster. Also, in many applications the map operation ends up discarding most of the data, emitting only the useful components as keys and values.

Locality, therefore, helps at the front of the job, but the regrouping in the middle still requires communication.

Shuffle, Sort, and Cost

The shuffle is usually the most expensive phase of a MapReduce job.

During mapping, data is processed where it already resides. Before the reduce step, the data must be reorganized using an intermediate key (the key output by the mapper). That means:

This is more than just data movement. It is a distributed repartitioning step.

That is why shuffle-heavy jobs are expensive. They combine network traffic, disk I/O, and synchronization delay. If some keys are much more frequent than others, reducers become imbalanced. One overloaded reducer can become the bottleneck that determines the completion time of the entire job.

Stragglers and Speculative Execution

In a large cluster, some tasks inevitably run slower than others. A worker may be contended, a disk may be slow, or a task may simply receive a heavier partition.

These slow tasks are called stragglers. It’s important to be aware of them because a job completes only when its last required tasks complete.

MapReduce addresses this with speculative execution. If a task appears unusually slow, the framework may launch a duplicate copy of that task on another worker. The first result to finish is used.

This is a practical response to a statistical fact: at scale, something is always likely to be slow.

Fault Tolerance

MapReduce handles failure mainly through re-execution.

This works because map and reduce tasks are expected to be deterministic and free of externally visible side effects. If rerunning a task produces the same result, recovery is straightforward.

Why MapReduce Was Important

MapReduce made large batch jobs much easier to write.

The programmer no longer had to manage worker processes, input partitioning, restarting of failed tasks, movement of intermediate results, or coordination of the final output.

The model was narrow, but in a productive way. It gave the runtime enough structure to do a large amount of distributed systems work automatically.

MapReduce quickly escaped its original Google setting. Its ideas became the basis of Apache Hadoop MapReduce, which made large-scale batch processing widely accessible in open-source systems. Managed cloud services such as Amazon EMR, originally called Elastic MapReduce, and Google Cloud Dataproc then made Hadoop-style processing available without running the cluster infrastructure by hand.

Limitations of MapReduce

MapReduce is excellent for one-pass batch computation. It is much less natural for workloads that require repeated passes over the same data, many dependent stages, or fine-grained communication.

The most important weakness is iterative computation.

Algorithms such as PageRank, shortest paths, clustering, gradient descent, and many machine learning workloads require repeated rounds over mostly the same data. In plain MapReduce, each round is usually expressed as a new job. That means repeated input reading, repeated output writing, repeated scheduling overhead, and repeated shuffle cost.

MapReduce is also a poor fit for graph processing. Graph algorithms often involve propagating state along edges over many rounds. Expressing that as a sequence of independent batch jobs is possible but awkward and expensive.

Finally, the fixed map-then-reduce structure is too rigid for many multi-stage pipelines. Some problems need a richer dataflow model than alternating map and reduce phases.

BSP and Pregel

Why Iterative Computation Needs a Different Model

Once a computation becomes iterative, the MapReduce model starts working against the problem rather than helping it. State that should persist across rounds instead has to be reconstructed and passed along through repeated map and reduce jobs. This is especially true for graphs.

Graphs appear everywhere:

Graph algorithms usually do little local work per vertex, but they require repeated propagation of information across edges. That is a poor fit for a batch model based on regrouping key-value pairs from scratch in every round.

Bulk Synchronous Parallel

The Bulk Synchronous Parallel, or BSP, model was introduced by Leslie Valiant in 1990 as a general model of parallel computation, particularly for supercomputers. At the time, large parallel machines were difficult to program efficiently because of complex communication patterns and timing variability. BSP provided a structured model that divided computation into supersteps separated by global synchronization barriers, making performance easier to reason about. Although originally developed for high-performance computing, the model later proved useful for distributed systems, especially for iterative and graph-based workloads.

BSP organizes a computation into repeated supersteps.

Each superstep has three parts:

  1. Local computation

  2. Communication

  3. Barrier synchronization

During one superstep, a worker computes using its current local state and the messages that arrived from the previous superstep. While computing, it may send messages to other workers. Those messages do not become visible immediately. They become inputs to the next superstep. At the end of the superstep, all workers wait at a barrier before the next round begins.

This round-based structure simplifies reasoning. The system no longer has arbitrary message timing. Instead, time advances in explicit steps.

The cost is equally clear. Fast workers must wait for slow ones at the barrier.
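The superstep structure is easy to capture in code. The following is a toy, single-process sketch (the function and variable names are ours): messages sent during one superstep are delivered only after the barrier, as inputs to the next.

```python
def run_bsp(workers, state, compute, max_supersteps=100):
    """Sketch of a BSP loop. Each superstep: local computation using
    last round's messages, message exchange, then a barrier."""
    inbox = {w: [] for w in workers}
    for superstep in range(max_supersteps):
        outbox = {w: [] for w in workers}
        any_messages = False
        for w in workers:
            # Local computation; compute() may return messages to send.
            for dest, msg in compute(w, state, inbox[w], superstep):
                outbox[dest].append(msg)
                any_messages = True
        inbox = outbox            # barrier: messages visible next round
        if not any_messages:
            break                 # nothing in flight: computation is done
    return state

# Example: propagate the maximum value around a ring of 4 workers.
state = {0: 3, 1: 9, 2: 1, 3: 5}

def propagate_max(w, st, msgs, step):
    new = max([st[w]] + msgs)
    if new != st[w] or step == 0:
        st[w] = new
        return [((w + 1) % 4, new)]   # tell the next worker in the ring
    return []

run_bsp([0, 1, 2, 3], state, propagate_max)
```

The sequential loop over workers hides the parallelism, but the message-visibility rule is the real point: no worker ever sees a message sent in its own superstep.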

Why BSP Is Useful

BSP separates three concerns cleanly: local computation, communication, and barrier synchronization.

That separation makes it easier to understand and implement iterative distributed programs. It also creates natural checkpoints for fault recovery. A system can save the state of a computation at the end of every N supersteps and restart from the last completed checkpoint after a failure.

BSP does not remove communication cost or load imbalance. It makes them easier to reason about by organizing them into explicit rounds.

Where BSP Fits

BSP is a good match for applications that proceed in repeated rounds, where each round consists of local work, message exchange, and a clearly defined synchronization point. That includes graph algorithms, iterative linear algebra, some numerical simulations, and certain machine learning workloads that alternate between local computation and global coordination. The model is especially attractive when the computation is easier to reason about in phases than as a fully asynchronous system.

An open-source system in this family was Apache Hama. Hama implemented a BSP-style execution model on top of the Hadoop ecosystem and targeted graph, matrix, and machine learning workloads.

BSP never became the dominant general framework for data processing, but its core ideas remained important. Supersteps, message passing, and barrier synchronization proved to be a particularly good fit for iterative graph algorithms, which led directly to systems such as Pregel.

Pregel: Vertex-Centric Graph Computation

Pregel was introduced at Google in 2010 as a system for large-scale graph processing. Its immediate motivation was that many important graph algorithms, including PageRank and shortest-path computation over very large graphs, were awkward and inefficient to express in MapReduce. Google needed a model that could keep the graph structure alive across iterations and let computation propagate through the graph directly instead of rebuilding state as fresh key-value data in every round.

Pregel’s ideas later influenced Apache Giraph, an open-source system based on the Pregel model. Giraph was used at Facebook to process large social graphs, demonstrating that the vertex-centric BSP style was a practical way to organize computation over very large graphs outside Google as well.

Pregel applies the BSP model specifically to graph processing.

The central idea is vertex-centric computation: think like a vertex.

Each vertex has:

  1. An identifier

  2. A modifiable state value

  3. Outgoing edges, often with associated values

Computation proceeds in supersteps. In each superstep, the same user-defined function runs on each active vertex. That function reads the messages sent to the vertex during the previous superstep, may update the vertex’s value, may send messages to other vertices (typically along its outgoing edges), and may vote to halt.

The computation terminates when all vertices are inactive and no messages remain in transit: every vertex has voted to halt, and no message is waiting to reactivate one.

This is a much more natural way to express graph algorithms than rebuilding graph state as key-value data in a sequence of MapReduce jobs.

Pregel Example: Single-Source Shortest Path

A standard Pregel example is finding the shortest path from one source vertex.

Initialization:

Superstep logic (this runs for each vertex):

# every vertex.distance is initialized to infinity
if superstep == 0 and vertex is source:
    vertex.distance = 0
    for each outgoing edge (vertex -> neighbor, weight):
        send(neighbor, weight)
else if incoming_messages is not empty:
    best = min(incoming_messages)
    if best < vertex.distance:
        vertex.distance = best
        for each outgoing edge (vertex -> neighbor, weight):
            send(neighbor, best + weight)

vote_to_halt()

What this code does on each vertex is:

  1. Receive candidate distances.

  2. Keep the smallest one.

  3. If the distance improved, propagate that improvement to neighbors.

  4. Otherwise, stay inactive unless a future message arrives.

Eventually, no vertex discovers a shorter path, no useful messages remain, and the computation terminates.
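The superstep logic can be simulated in one Python function. This is a single-process sketch under our own naming (the graph is an adjacency dict), not Pregel's actual API:

```python
INF = float("inf")

def pregel_sssp(graph, source, max_supersteps=100):
    """Single-process simulation of Pregel single-source shortest paths.
    graph: {vertex: [(neighbor, weight), ...]}"""
    dist = {v: INF for v in graph}
    inbox = {v: [] for v in graph}

    for superstep in range(max_supersteps):
        outbox = {v: [] for v in graph}
        for v in graph:
            if superstep == 0 and v == source:
                best = 0
            elif inbox[v]:
                best = min(inbox[v])           # keep smallest candidate
            else:
                continue                       # inactive: no messages
            if best < dist[v]:
                dist[v] = best                 # improvement: propagate it
                for neighbor, weight in graph[v]:
                    outbox[neighbor].append(best + weight)
        inbox = outbox                         # barrier between supersteps
        if not any(inbox.values()):
            break                              # all halted, nothing in transit
    return dist

g = {"a": [("b", 1), ("c", 4)],
     "b": [("c", 2), ("d", 6)],
     "c": [("d", 3)],
     "d": []}
print(pregel_sssp(g, "a"))   # shortest a->d path costs 6 (a-b-c-d)
```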

Pregel Example: PageRank

PageRank is an algorithm for ranking pages in a directed graph such as the web graph. The basic idea is that a page should receive a high rank if other highly ranked pages link to it. Each page repeatedly distributes part of its current rank across its outgoing links, and each page updates its own rank from the contributions it receives along its incoming links. Because these values are propagated along edges over many rounds until they stabilize, PageRank is a natural example of iterative graph computation.

In a Pregel-style system, each vertex represents a page and each directed edge represents a hyperlink. During each superstep, a vertex sends part of its current rank to its outgoing neighbors. It then computes a new rank from the contributions it received in the previous superstep.

compute(vertex, incoming_messages):
    if superstep == 0:
        vertex.rank = initial_value          # typically 1 / num_vertices
    else:
        sum = 0
        for m in incoming_messages:
            sum += m
        vertex.rank = base + damping * sum   # base = (1 - damping) / num_vertices

    if superstep < max_iterations:
        contribution = vertex.rank / out_degree(vertex)
        for each neighbor:
            send(neighbor, contribution)
    else:
        vote_to_halt()

The repeated exchange of partial rank values fits naturally into the round-based message passing model used by BSP and Pregel.
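A single-process simulation of the same round structure (this sketch assumes every vertex has at least one outgoing edge, which sidesteps the dangling-node correction a real implementation needs; the function name and three-page graph are ours):

```python
def pregel_pagerank(graph, damping=0.85, iterations=30):
    """Single-process simulation of Pregel-style PageRank.
    graph: {vertex: [outgoing neighbors]} with no dangling vertices."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}       # initial_value = 1/N
    inbox = {v: [] for v in graph}

    for superstep in range(iterations):
        # Each vertex sends rank / out_degree along every outgoing edge.
        outbox = {v: [] for v in graph}
        for v, neighbors in graph.items():
            contribution = rank[v] / len(neighbors)
            for u in neighbors:
                outbox[u].append(contribution)
        inbox = outbox                       # barrier between supersteps
        # Recompute every rank from the contributions just received.
        for v in graph:
            rank[v] = (1 - damping) / n + damping * sum(inbox[v])
    return rank

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pregel_pagerank(g)
```

With this graph, page c should end up ranked highest: it receives links from both a and b, and the total rank stays close to 1.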

Synchronization, Checkpointing, and Cost

BSP and Pregel simplify reasoning, but they do not eliminate cost.

The barrier at the end of each superstep means:

On the other hand, barriers provide natural checkpoint boundaries. A system can save the state of each partition every few supersteps and restart from the last checkpoint after a failure.

BSP and Pregel trade some efficiency for a model that is easier to understand, coordinate, and recover.

Why Pregel Fits Graphs Better Than MapReduce

Pregel fits graph algorithms better in both performance and structure.

MapReduce requires repeatedly reconstructing the graph state as key-value data. Pregel keeps the graph alive and lets vertices exchange messages directly across rounds.

That is a better match for algorithms whose natural unit of work is: what should this vertex do next, given what its neighbors told it?

Spark

Why Spark Was Needed

Spark was created at UC Berkeley’s AMPLab around 2009. The project grew out of frustration with MapReduce on workloads that needed repeated passes over the same data, especially iterative machine learning and interactive data analysis. The first Spark paper appeared in 2010, and the system quickly became influential because it attacked one of the most obvious bottlenecks in the Hadoop era: writing every intermediate stage back to storage before the next stage could begin.

In MapReduce, each phase typically writes its output to stable storage before the next phase begins. That design works well for one large batch computation, but it is inefficient for iterative algorithms that make repeated passes over the same data, multi-stage pipelines with many dependent steps, and interactive, exploratory analysis.

Spark keeps the basic idea of distributed processing over partitions, but it replaces the rigid map-then-reduce structure with a richer dataflow model. It also allows intermediate datasets to be cached in memory, which is crucial for repeated computation.

Spark Architecture

Spark has four main runtime components: the driver, the cluster manager, the executors, and the tasks they run.

The driver constructs the computation. Executors perform the work over partitions of data. Each executor can also keep cached partitions in memory for reuse by later stages.

This architecture reflects Spark’s design as a distributed runtime that coordinates computation across many machines while keeping data partitioned among them.

Resilient Distributed Datasets

Spark’s designers wanted an abstraction for working with large datasets spread across a cluster without forcing every stage of computation to write its results back to storage. The result was the Resilient Distributed Dataset, or RDD, Spark’s original core abstraction.

An RDD is a logical collection of data partitioned across many machines. It is designed so that the system can process those partitions in parallel, recover lost partitions if a worker fails, and optionally keep frequently used data in memory for reuse.

An RDD has several important properties:

Resilient
The word resilient refers to the ability to recover. If a partition is lost because an executor fails, Spark can recompute that partition from the sequence of operations that created it.
Distributed
The word distributed means the dataset is split across many machines.
Dataset
The word dataset should be interpreted broadly. An RDD may contain lines of text, tuples, records, parsed objects, key-value pairs, or many other types of elements.

How Spark Inputs Are Partitioned

Spark partitions input in ways that resemble MapReduce, but the abstraction is broader because Spark applies partitioning not only to source data but also to intermediate results.

An RDD may be created from a file or set of files in distributed storage, from a collection parallelized from the driver program, or by transforming an existing RDD.

The partition is again the scheduling unit.

Some examples are a large file divided into blocks (one partition per block), a directory of many files (one partition per file), or an in-memory collection split into a chosen number of slices.

As in MapReduce, the partition is the unit assigned to workers. The logical records inside a partition may be lines, rows, tuples, objects, or key-value pairs.

Lazy Evaluation

One of Spark’s most important ideas is lazy evaluation.

Transformations do not execute immediately. Instead, Spark records them as a lineage graph describing how the result can be computed.

For example:

lines      = textFile("logs.txt")
errors     = lines.filter(startsWith("ERROR"))
fields     = errors.map(split_on_tabs)
messages   = fields.map(extract_message)
mysqlCount = messages.filter(contains("mysql")).count()

When lines, errors, fields, and messages are defined, Spark does not yet run the job. It records the transformations.

Only when the action count() is invoked does Spark build an execution plan and schedule tasks.

Lazy evaluation has several benefits: the scheduler sees the whole computation before any of it runs, adjacent steps can be pipelined within a stage, work whose results are never used can be skipped, and data movement can be planned more efficiently.

An easy way to think about it is that transformations build a recipe. An action tells Spark to actually run it.
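The recipe-versus-run distinction can be sketched with thunks in Python. `LazyDataset` is a toy class of ours, not Spark's API: each transformation just wraps a function that knows how to produce the data, and only an action calls it.

```python
class LazyDataset:
    """Toy sketch of lazy evaluation: transformations record lineage,
    actions trigger the actual work."""
    def __init__(self, compute):
        self._compute = compute          # thunk: how to build this dataset
        self._cached = None
        self._use_cache = False

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def map(self, f):                    # transformation: nothing runs yet
        return LazyDataset(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):              # transformation: extends the lineage
        return LazyDataset(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):                     # materialize once, then reuse
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()

    def count(self):                     # action: forces evaluation
        return len(self._materialize())

lines = LazyDataset.from_list(["ERROR disk", "INFO ok", "ERROR net"])
errors = lines.filter(lambda s: s.startswith("ERROR")).cache()
n = errors.count()                       # work happens here, result cached
```

Defining `errors` does nothing; only `count()` walks the recorded lineage, and `cache()` means later actions reuse the materialized list instead of recomputing it.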

Caching RDDs in Memory

Caching is one of Spark’s defining features.

Suppose a program extracts error messages from a large log file and then asks several questions about those messages. Without caching, every action may trigger the recomputation of the same filtering and parsing steps.

With caching:

lines    = textFile("logs.txt")
errors   = lines.filter(startsWith("ERROR"))
messages = errors.map(parse_message)
messages.cache()

mysqlCount = messages.filter(contains("mysql")).count()
phpCount   = messages.filter(contains("php")).count()

Here, messages.cache() tells Spark to persist the messages RDD after it is first computed. The second action can then reuse the cached partitions instead of re-reading and re-parsing the raw logs.

Caching is especially useful for iterative algorithms that pass over the same dataset repeatedly, for interactive analysis sessions, and for pipelines in which several actions reuse one intermediate dataset.

Caching is not free. It uses memory, and if memory is insufficient Spark may spill data to disk depending on the persistence mode. But when a dataset is reused, caching can change the cost structure dramatically.

Fault Tolerance Through Lineage

Spark’s fault tolerance is based on lineage rather than eager replication of every intermediate result.

If an executor fails and one of its partitions is lost, Spark determines how that partition was created and recomputes only the missing partition.

For example, if:

messages = textFile("logs.txt").filter(...).map(...)

and partition 7 of messages is lost, Spark does not recompute the whole RDD. It recomputes partition 7 by replaying the necessary transformations on the corresponding input partition.

This works well because most transformations are deterministic.

Spark Transformations and Actions

Spark distinguishes between transformations, which create new RDDs lazily, and actions, which trigger execution and either return a result or write output.

Common Spark transformations

Transformation | What it does | Notes
map(f) | Applies f to each element | One output per input element
flatMap(f) | Applies f and flattens the results | Useful when one input yields zero or more outputs
filter(pred) | Keeps only elements satisfying pred | Narrow transformation
sample(...) | Returns a random sample | Useful for approximate analysis or testing
union(other) | Concatenates two RDDs | Does not remove duplicates
intersection(other) | Keeps elements present in both RDDs | Usually requires shuffle
distinct() | Removes duplicate elements | Often shuffle-heavy
groupByKey() | Groups all values with the same key | Can be expensive because it materializes full value lists
reduceByKey(f) | Combines values with the same key using f | Usually preferred over groupByKey for aggregation
aggregateByKey(...) | Generalized per-key aggregation | Useful when local and global combination differ
sortByKey() | Sorts keyed data by key | Requires global ordering work
join(other) | Joins two keyed RDDs on matching keys | Often requires shuffle
cogroup(other) | Groups values from multiple keyed RDDs by key | Useful for more general joins
cartesian(other) | Produces all pairs from two datasets | Usually very expensive

Common Spark actions

Action | What it does | Notes
count() | Returns the number of elements | Triggers evaluation
collect() | Returns all elements to the driver | Dangerous for very large datasets
first() | Returns the first element | Useful for quick inspection
take(n) | Returns the first n elements | Safer than collect() for inspection
reduce(f) | Aggregates elements using f | f should be commutative and associative for distributed use
countByKey() | Counts the number of values per key | Returns results to the driver
saveAsTextFile(...) | Writes results to storage | Common terminal action
saveAsSequenceFile(...) | Writes results in Hadoop SequenceFile format | Useful in Hadoop ecosystems
foreach(f) | Applies f for side effects | Use carefully in distributed settings

A useful rule of thumb is that operations involving regrouping by key or global ordering are the expensive ones, because they usually require shuffle.
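The difference between groupByKey and reduceByKey makes the rule of thumb concrete. A toy Python count of how many records would cross the shuffle either way (the two partition contents are made up):

```python
from collections import Counter

# Two map-side partitions of (word, 1) pairs, as in word count.
partitions = [[("a", 1), ("b", 1), ("a", 1)] * 100,
              [("a", 1), ("c", 1)] * 100]

# groupByKey: every intermediate pair crosses the shuffle unchanged.
shuffled_group = sum(len(p) for p in partitions)          # 500 records

# reduceByKey: each partition combines its own values first, so only
# one (key, partial_sum) record per key leaves each partition.
def local_combine(pairs):
    sums = Counter()
    for k, v in pairs:
        sums[k] += v
    return list(sums.items())

shuffled_reduce = sum(len(local_combine(p)) for p in partitions)  # 4 records

# The final result is identical either way.
totals = Counter()
for p in partitions:
    for k, v in local_combine(p):
        totals[k] += v
```

Shrinking 500 shuffled records to 4 is why reduceByKey is usually preferred over groupByKey for aggregation.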

Narrow and Wide Dependencies

Spark performance becomes easier to understand when transformations are viewed through their dependencies.

A narrow dependency means each output partition depends on a small number of input partitions, often just one. Examples include map, flatMap, and filter.

These are often cheap because they can be pipelined without repartitioning the data.

A wide dependency means an output partition depends on data from many input partitions. Some examples are groupByKey, reduceByKey, sortByKey, distinct, and join.

These operations usually require a shuffle, which means the framework repartitions data across the cluster so that related records end up on the same worker, or so that records can be placed into a new global order. That process typically involves moving data over the network, reorganizing it by key or sort order, and then continuing the computation on the new partitions.

This is Spark’s version of an important MapReduce lesson: regrouping data across the cluster is often where much of the cost appears.

Shared Variables

Spark includes a few controlled forms of shared state.

Broadcast variables

A broadcast variable is read-only data sent once to executors and cached there. This is useful when many tasks need the same lookup table, dictionary, model parameters, or configuration object.

Without broadcasting, the same data may be sent repeatedly for many tasks.

Accumulators

An accumulator supports distributed aggregation into a counter or sum-like variable. Workers can add to it, and the driver can read the final value.

This is useful for counts, statistics, and debugging summaries, but it is not general shared mutable state.

A Spark Example: Error Log Analysis

Consider a large application log stored across a cluster. A common task is to isolate error records, extract the message portion of each record, and reuse that extracted dataset for several analyses.

For example, one analysis might count how many error messages mention MySQL, another might count how many mention PHP, and a third might identify the most frequent terms appearing in error messages. This workflow benefits from keeping an intermediate dataset available for reuse.

A possible pipeline is:

lines    = textFile("app.log")
errors   = lines.filter(line starts with "ERROR")
fields   = errors.map(split line on tab)
messages = fields.map(extract message field)
messages.cache()

mysqlCount = messages.filter(message contains "mysql").count()
phpCount   = messages.filter(message contains "php").count()
topTerms   = messages.flatMap(tokenize)
                     .map(term -> (term, 1))
                     .reduceByKey(add)
                     .sortBy(count descending)
                     .take(20)

The code carries out the analysis in a sequence of steps:

  1. lines = textFile("app.log") creates an RDD whose elements are lines from the log file.

  2. errors = lines.filter(...) keeps only the lines that represent error records.

  3. fields = errors.map(...) parses each error record into fields, such as by splitting on tab characters.

  4. messages = fields.map(...) extracts just the message portion of each parsed record.

  5. messages.cache() tells Spark to keep the messages RDD available after it is first computed, so later actions can reuse it.

  6. mysqlCount = messages.filter(...).count() keeps only messages containing "mysql" and counts how many such messages there are. The call to count() is an action, so it triggers execution of the pipeline.

  7. phpCount = messages.filter(...).count() performs a second count for messages containing "php". Because messages has been cached, Spark can reuse it instead of repeating the earlier filtering and parsing steps.

  8. topTerms = messages.flatMap(tokenize) breaks each message into individual terms and flattens the results into one stream of terms.

  9. .map(term -> (term, 1)) turns each term into a key-value pair so that terms can be counted.

  10. .reduceByKey(add) adds together all the counts for identical terms.

  11. .sortBy(count descending) orders the (term, count) pairs from highest count to lowest.

  12. .take(20) returns the first 20 results from that sorted output, which in this case are the 20 most frequent terms.
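The same pipeline can be written as a plain-Python analogue to make the dataflow concrete. This version has none of Spark's laziness, partitioning, or caching, and the position of the message field is an assumption for illustration.

```python
from collections import Counter

def analyze(lines, n=20):
    errors   = [ln for ln in lines if ln.startswith("ERROR")]
    fields   = [ln.split("\t") for ln in errors]
    messages = [f[-1] for f in fields]        # assume the message is the last field
    mysql_count = sum("mysql" in m for m in messages)
    php_count   = sum("php" in m for m in messages)
    counts = Counter(t for m in messages for t in m.split())
    return mysql_count, php_count, counts.most_common(n)   # (term, count) pairs
```

In Spark, the `messages` list would be computed once, cached, and shared by all three analyses across the cluster; here the sharing is just a local variable.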

Some key aspects of Spark illustrated by this example are:

  • Transformations such as filter and map are lazy; nothing runs until an action such as count() or take() is called.
  • cache() lets several later actions reuse the messages dataset instead of re-reading and re-parsing the log each time.
  • Lineage records how messages was derived, so a lost partition can be recomputed from the original file rather than restored from a replica.
  • Only reduceByKey and sortBy require a shuffle; the filtering and mapping steps are narrow and can be pipelined.

Spark Beyond Core RDDs

Although RDDs are the historical core and Spark has grown into a broader ecosystem, the original RDD model remains the clearest way to explain its execution semantics.

Important extensions include:

  • Spark SQL and the DataFrame/Dataset APIs, which add relational queries and query optimization over structured data
  • Structured Streaming, for continuous processing of unbounded data
  • MLlib, a library of machine learning algorithms
  • GraphX, for graph processing

These additions do not change the core ideas introduced here. Spark is still best understood as a distributed dataflow engine built around partitioned datasets, lazy evaluation, lineage, and optional persistence.

Distributed Machine Learning

What Machine Learning Changes

The frameworks discussed so far were designed primarily for data processing. They read large inputs, transform or aggregate them, and produce output datasets or summaries. Machine learning training has a different computational shape. The same training data is used repeatedly, and workers must coordinate across many rounds as model parameters are updated.

In this context, a model is a program with adjustable numerical parameters that makes predictions from data. Training means repeatedly adjusting those parameters so the model produces better results. Each round of training computes numerical updates from part of the data, and those updates must be combined across workers.

The difference can be summarized by comparing ordinary data processing with model training.

Aspect Data Processing Machine Learning Training
Passes over the data One or a few Many
State across rounds Usually little or none Model parameters updated every round
Partitioned object Input dataset Data, model, or both
Typical result Transformed dataset Trained model

This difference explains why general data-processing frameworks solve only part of the problem. They are useful for preparing training data, sampling, filtering, feature construction, and some smaller-scale iterative workloads. Large-scale model training introduces tighter communication, persistent state, and hardware requirements that call for more specialized runtimes.

General Challenges in Distributed Training

Training a model on one machine is already iterative. Distributing that process across many machines introduces additional engineering problems.

Some of the main challenges are:

  1. Repeated coordination. Workers do not compute once and finish. They repeatedly exchange updates throughout training.

  2. Communication cost. Every worker may produce large update vectors or intermediate numerical data that must be aggregated or exchanged efficiently.

  3. Model state. The current model must remain available and consistent enough for the next round of computation.

  4. Hardware constraints. Training often depends on GPUs or other accelerators, and the model or its intermediate data may exceed the memory of one device.

  5. Load balance. If some workers finish much later than others, the faster workers sit idle at synchronization points.

  6. Fault tolerance. Restarting a long training job from the beginning can be extremely expensive, so systems often need checkpointing or other recovery mechanisms.

These challenges lead to a central design question: when training is distributed, what should be partitioned?

Data Parallelism

In data parallelism, the training data is partitioned across workers while each worker holds a copy of the model. Each worker processes its own small subset of data, often called a mini-batch, computes numerical updates, often called gradients, and then participates in combining those updates with the others.

At a high level, the process looks like this:

  1. Partition the training data across workers.

  2. Give each worker a copy of the current model.

  3. Let each worker compute updates on its local mini-batch.

  4. Aggregate the updates across workers.

  5. Update the model parameters.

  6. Repeat for the next training step.

Data parallelism is the most common form of distributed training because it aligns naturally with the way distributed systems partition data. Its main difficulty is communication. If the model has millions or billions of parameters, each worker produces a very large set of updates, and those updates must be combined efficiently at every training step.
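One synchronous data-parallel step can be sketched in plain Python. The model is reduced to a list of scalar parameters and the gradient computation is supplied by the caller; both are simplifications for illustration.

```python
def data_parallel_step(params, shards, grad_fn, lr):
    """One synchronous step: each worker computes a gradient on its own
    data shard, the gradients are averaged, and every model copy applies
    the same update (so all copies stay identical)."""
    grads = [grad_fn(params, shard) for shard in shards]   # local compute per worker
    avg = [sum(g[i] for g in grads) / len(grads)           # aggregate across workers
           for i in range(len(params))]
    return [w - lr * g for w, g in zip(params, avg)]       # update the parameters
```

The aggregation line is exactly where a parameter server or all-reduce would sit in a real system; everything else stays local to a worker.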

Two common ways to organize that aggregation are a parameter server and all-reduce.

Parameter Server

A parameter server keeps the model parameters on one or more central servers. Workers send their updates to the server and receive updated parameters back.

This approach is conceptually straightforward, but the server can become a bottleneck as the number of workers grows.
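The interface can be sketched in a few lines of plain Python. The class and method names below are illustrative, not any particular system's API; real parameter servers shard the parameters across many server processes.

```python
class ParameterServer:
    """Minimal single-server sketch: holds the model parameters, applies
    pushed gradients, and serves the current parameters back to workers."""
    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr

    def push(self, grads):     # worker -> server: send a gradient update
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]

    def pull(self):            # server -> worker: fetch the latest parameters
        return list(self.params)
```

Every worker's push and pull lands on this one object, which is precisely why the server becomes a bottleneck as workers multiply.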

All-Reduce

In all-reduce, workers cooperate to aggregate updates without relying on one central coordinator. The result of the aggregation becomes available to all workers.

This approach scales better and is the dominant pattern in modern large-scale deep learning systems.
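The pattern can be simulated in plain Python for scalars. Real ring all-reduce implementations move chunks of large gradient vectors around a ring; this sketch keeps only the communication structure, under the assumption of one value per worker.

```python
def ring_allreduce(worker_vals):
    """Each worker adds what arrives from its left neighbor and forwards
    that value to its right neighbor. After n - 1 steps, every worker
    holds the global sum, with no central coordinator."""
    n = len(worker_vals)
    acc  = list(worker_vals)    # each worker's running total
    send = list(worker_vals)    # the value each worker forwards next
    for _ in range(n - 1):
        recv = [send[(i - 1) % n] for i in range(n)]  # receive from left neighbor
        acc  = [a + r for a, r in zip(acc, recv)]
        send = recv                                   # pass it along next round
    return acc
```

Note that each worker sends and receives only one value per step, so the per-worker traffic stays constant as the cluster grows; that is the property that makes all-reduce scale.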

Model Parallelism

Data parallelism assumes that the full model fits on each worker or device. That assumption breaks down for very large models.

In model parallelism, the model itself is partitioned across devices. Different workers hold different parts of the model and cooperate to carry out the sequence of computations used to produce a prediction and compute updates. This reduces the memory burden on any one device, but it increases coordination and communication.

Two common forms of model parallelism are pipeline parallelism and tensor parallelism.

Pipeline Parallelism

In pipeline parallelism, the model is divided into consecutive groups of layers, and each device holds one group. Data flows through the model from device to device.

To keep devices busy, systems often use micro-batching, where several smaller batches are in flight through the pipeline at once.
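The scheduling benefit of micro-batching can be seen by counting time steps in an idealized forward-only pipeline, assuming every stage takes one time unit (a simplification; real stages are uneven and also run backward passes).

```python
def pipeline_time(num_stages, num_microbatches):
    """In an ideal pipeline, stage s starts micro-batch m at time s + m,
    so the last micro-batch leaves the last stage at time
    num_stages + num_microbatches - 1. Without pipelining, the same work
    takes num_stages * num_microbatches time units."""
    finish = [s + m + 1 for m in range(num_microbatches)
                        for s in range(num_stages)]
    return max(finish)
```

With 4 stages and 8 micro-batches the pipeline finishes in 11 time units instead of 32; the fixed cost of filling and draining the pipeline shrinks, relatively, as the number of micro-batches grows.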

Tensor Parallelism

In tensor parallelism, the computation within a layer is itself split across devices. Large matrix multiplications are divided so that each device computes part of the result.

This approach allows very large layers to be distributed, but it usually requires more frequent communication than pipeline parallelism because intermediate values must be exchanged within each layer.
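One simple form splits a matrix-vector product by rows: each device holds a slice of the weight matrix's rows, computes its slice of the output, and the slices are concatenated. (A column split would instead require summing partial results across devices, which is more communication.) This plain-Python sketch uses lists rather than real tensors.

```python
def matvec(W, x):
    """Reference dense matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def row_parallel_matvec(W, x, num_devices):
    """Each 'device' multiplies its slice of W's rows by the full input;
    the output is just the concatenation of the per-device slices."""
    chunk = (len(W) + num_devices - 1) // num_devices
    parts = [matvec(W[i:i + chunk], x)             # one slice per device
             for i in range(0, len(W), chunk)]
    return [y for part in parts for y in part]     # gather the slices in order
```

Even in this mild case, every device needs the full input vector x, which already hints at why tensor parallelism communicates more often than pipeline parallelism.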

Combining Approaches

Large training systems often combine these strategies rather than choosing only one.

A typical large-scale design may use:

  • data parallelism across groups of devices, so each group trains on a different slice of the data
  • pipeline parallelism across groups of layers within each model replica
  • tensor parallelism within individual layers that are too large for one device

This combination reflects the fact that distributed training has to manage both data size and model size.

Where General Frameworks Fit

These distinctions also clarify where earlier frameworks fit.

MapReduce is a poor fit for modern model training because it is built around one-pass batch stages, while training is inherently iterative.

Spark is much better suited to data preparation and to some smaller-scale iterative workloads. Its support for multi-stage pipelines, caching, and lineage makes it useful for preprocessing and some machine learning tasks built on distributed data.

Large-scale deep learning training usually relies on more specialized runtimes built around efficient collective communication and accelerator-aware scheduling. In that setting, general distributed frameworks often play a supporting role in the training process rather than serving as the numerical core of training.

Optional Note: Ray

The next section is not part of the core material for this class and is included only for awareness: Ray has emerged as a major distributed computing framework for AI and machine learning workloads.

Why Ray Emerged

Spark provides a strong model for large-scale data processing built around partitioned datasets and dataflow. That model works well when a computation can be expressed as a sequence of operations over large collections of data.

Some distributed applications follow a different structure. A program may launch many small pieces of work, start new work based on intermediate results, keep some components running for a long time, or combine batch processing with long-lived services. These patterns appear in machine learning systems, simulations, and other distributed applications.

Ray was designed for this setting. It provides a distributed runtime for programs whose structure is more dynamic than a fixed dataflow pipeline. Spark centers computation on datasets and stages, where each stage transforms partitions of data. Ray centers computation on tasks and long-lived workers, where the program explicitly creates work and coordinates components that may persist over time.

How Ray Works

Ray’s execution model is built from three core mechanisms:

  • Tasks are functions that run remotely and may execute in parallel on different workers.

  • Actors are long-lived workers that preserve state across method calls.

  • Objects are values stored in a distributed object store and passed by reference among tasks and actors.

These mechanisms give two ways to organize computation.

A task runs independently and returns a result. It does not retain state after it finishes. A later task can use its result without needing to know where it was computed.

An actor remains alive and keeps state across calls. This is useful when some part of the system must remember information over time, such as a coordinator, a cache, or a service that maintains an evolving model.

Objects produced by tasks or actors are placed in a shared object store. Other tasks or actors can access them through references. This allows results to be reused across the system without manually transferring data between components.

Execution proceeds as follows:

  1. The program creates tasks or actors.

  2. The scheduler places them on available machines.

  3. Tasks and actor methods produce objects.

  4. Other tasks or actors use those objects through references.

  5. The runtime tracks dependencies and runs work when inputs are ready.

This model allows a program to mix parallel work with long-lived components in a single system.

Remote Tasks

import ray
ray.init()

@ray.remote
def parse(file):
    # parsing details are elided; here we just split the log into lines
    with open(file) as f:
        return f.read().splitlines()

@ray.remote
def count_errors(records):
    return sum(1 for r in records if "ERROR" in r)

r1 = parse.remote("log1")        # returns immediately with an object reference
r2 = parse.remote("log2")        # the two parses can run in parallel
e1 = count_errors.remote(r1)     # Ray supplies r1's value once it is ready
e2 = count_errors.remote(r2)
counts = ray.get([e1, e2])       # block until both results are available

Each task runs independently and returns a result that can be used by later tasks.

Actors

@ray.remote
class Counter:
    def __init__(self):
        self.total = 0

    def add(self, x):
        self.total += x

    def get(self):
        return self.total

c = Counter.remote()               # starts a long-lived actor process
c.add.remote(5)                    # method calls execute on the actor, in order
c.add.remote(3)
result = ray.get(c.get.remote())   # state persists across calls: result is 8

The actor keeps state across calls. This makes it useful for coordination, caching, or any component that must evolve over time.

Ray can also place tasks and actors based on available resources, such as CPUs or GPUs. This allows different parts of a program to run on machines that match their requirements.

Where Ray Fits Compared to Spark

Spark is strongest when computation can be expressed as a structured sequence of operations over datasets. In that model, data is partitioned, transformed in stages, and passed through a pipeline.

Ray is aimed at cases where the computation is more dynamic or involves long-lived state.

Some examples are:

  • Work that launches new tasks during execution rather than following a fixed sequence

  • Components that must remain alive and update their state over time

  • Systems that combine batch processing with services that continue running

  • Workloads that mix different kinds of computation within one program

These patterns do not map cleanly to a stage-based dataflow model. Ray supports them directly by allowing the program to create tasks, maintain state in actors, and share results through the object store as the computation evolves.

Ray does not replace Spark for large-scale data processing. It complements it by supporting a different style of distributed program.

What Ray Is Best Suited For

Ray is best suited for distributed applications that:

  • Launch work dynamically as the program runs

  • Maintain state across multiple steps

  • Combine parallel tasks with long-lived components

Some examples are:

  • Distributed machine learning training and coordination

  • Large sets of parallel experiments

  • Simulation systems

  • Batch inference

  • Model serving

The central idea is that the program behaves as a collection of interacting components rather than a fixed pipeline over data. Components create work, share results, and maintain state as the computation progresses.

Putting the Frameworks Together

These systems are easiest to compare by the shape of the workload they support.

Framework Best fit Main idea Main weakness
MapReduce Large batch jobs map -> shuffle/sort -> reduce Rigid, poor for iteration
BSP Round-based parallel algorithms supersteps with barriers Waiting at each barrier
Pregel/Giraph Iterative graph algorithms vertex-centric message passing Specialized model, synchronization cost
Spark Multi-stage analytics and iterative data processing lazy dataflow over partitioned datasets with caching Shuffles and memory pressure still matter
Ray (awareness only) General distributed task orchestration tasks, actors, distributed objects Lower-level model, more responsibility for the programmer

Summary

The move from distributed storage to distributed computation forced systems to answer a new set of questions: how to partition input, how to exploit locality, how to move intermediate data, how to recover from failures, and how to keep one slow worker from delaying everything else.

MapReduce answered those questions for large batch jobs by imposing a rigid but powerful structure. BSP made repeated rounds of communication explicit. Pregel adapted BSP to graph processing by making the vertex the unit of thought. Spark generalized distributed processing into a richer dataflow model with lazy evaluation, lineage-based recovery, and in-memory caching. For broader awareness, Ray extended the landscape toward flexible distributed execution for heterogeneous workloads built from tasks, actors, and distributed objects.

No single framework is the answer to every large-scale computation problem. Each one makes a particular computational shape manageable.

The deeper lesson is that distributed computation depends on organizing data movement, synchronization, and recovery so that parallelism remains useful at scale.

