Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don’t take the one hour time window in the title literally.

Last update: Sun Apr 19 23:52:08 2026

Week 9: Distributed Databases

NoSQL and Data Model Categories

NoSQL systems are databases designed to scale horizontally across many machines. They often relax some of the assumptions of traditional relational databases, such as fixed schemas or full transactional support across all data, in order to improve scalability, availability, or flexibility. The term does not describe one single design. It covers several different data models.

The main categories are the following:

Key-value stores store data as a mapping from a key to a value. The system treats the value as opaque. This makes the model simple and easy to distribute, but limits queries beyond direct key lookup. Amazon Dynamo and distributed hash tables (DHTs) are the main examples to remember here.
Column-family stores, also called wide-column stores, organize data into rows and columns, but allow different rows to have different columns. They are designed for large-scale storage and high write throughput. Bigtable, HBase, and Cassandra are the main examples.
Document stores store structured documents, usually in a JSON-like form. They keep more structure than key-value stores and often support richer queries over document fields.
Graph databases store data as nodes and edges and are optimized for traversing relationships among entities.

In this material, the main focus is on column-family stores and on Spanner, which brings strong transactional guarantees back into a distributed setting.

Bigtable

Bigtable is a distributed column-family store that organizes data as a sparse, sorted map indexed by row key, column, and time. It was designed for large-scale structured data and for applications that need high throughput across many machines.

Data Model

Bigtable’s data model is easiest to understand by breaking it into its three indexing dimensions.

Row keys are arbitrary byte strings stored in sorted lexicographic order. That sorted order is important because it makes range scans efficient. Rows are also the unit of atomicity, so a read or write to a single row is atomic.
Column families group related columns and must be declared when the table is created. Within a family, column qualifiers can be created dynamically by any client. Column names therefore have the form family:qualifier.
Timestamps allow multiple versions of a value to be stored in a single cell. Version retention is configurable, so the system can keep several recent versions or discard older ones automatically.

Bigtable is also sparse. Empty cells are never stored. A row can have no columns at all in a given family, and only the cells that actually contain data occupy storage.

Partitioning and Structure

Bigtable scales by partitioning the table by row-key range.

A tablet is a contiguous range of rows. Each tablet is served by one tablet server, and tables are split into multiple tablets as they grow. Since rows are stored in sorted row-key order, each tablet is a slice of that sorted key space.

A master server tracks which tablet server owns which tablet and handles reassignment when failures occur. The important point is that data does not flow through the master. Clients use metadata to locate the right tablet server and then talk to that server directly.

Storage and Writes

Bigtable stores recent writes in memory and older data on disk.

The write path works as follows:

A write is recorded in a log for recovery.
The write is placed in a memtable, which is an in-memory sorted buffer.
When the memtable fills, it is flushed to disk as an SSTable.

An SSTable is an immutable, sorted file of key-value pairs stored in GFS. Since SSTables are immutable, the system periodically compacts them to merge files and discard obsolete data. Reads may therefore need to combine information from the memtable and multiple SSTables.

Key Concepts in Bigtable

The main ideas to remember are:

Row keys determine both identity and sorted storage order.
Tablets are key-range partitions.
Tablet servers serve tablets, while the master manages assignment and recovery.
Memtables and SSTables form the storage engine.

Bigtable scales very well, but it does not provide general transactional support across rows and does not support multiple tables.

Cassandra

Cassandra is a distributed column-family store designed for high availability, decentralized control, and horizontal scalability. It combines ideas from Bigtable, especially the wide-column model and SSTable-style storage, with ideas from Dynamo, especially decentralized partitioning and replication.

Architecture and the Hash Ring

Cassandra uses a peer-to-peer architecture. There is no master node, and all nodes are equal. Any node can accept a client request and coordinate the work needed to satisfy it.

Cassandra distributes data using a distributed hash table organized as a hash ring. Each node owns one or more tokens, which correspond to positions on that ring. A node is responsible for the range of hash values between its token and the previous token on the ring.

When Cassandra stores a row, it hashes the row’s partition key. The row is then routed to the node whose token is the first one that follows that hash value on the ring.

This design has an important scaling property. When a node is added or removed, only the data in the affected adjacent ring arc has to move. The rest of the data stays where it is. That is one of the main reasons DHT-style partitioning is attractive.

Data Model

Cassandra separates distribution across the cluster from ordering within one partition.

The key ideas are these:

The partition key determines where the data lives. Rows with the same partition key are placed in the same partition and stored on the same set of replica nodes.
The clustering columns determine how rows are ordered within that partition.
Columns store the actual data values within each row.

That separation is central to understanding Cassandra.

An example is a database of McDonald’s restaurants. If the partition key is restaurant_id, each restaurant will likely be distributed independently across the cluster. That is good for load balancing, but not very useful if queries often want all restaurants in one country.

If the partition key is country_code, then all restaurants in the same country live in the same partition and therefore on the same set of machines. If the clustering columns are state and county, then the rows within that country partition are sorted first by state and then by county. This makes it efficient to iterate through all restaurants in one country, one state, or one county.

So the division of roles is:

The partition key controls which rows live together on the same set of machines.
The clustering columns control how those rows are ordered within that partition.

Replication and Consistency

Cassandra replicates data across multiple nodes for fault tolerance. Each partition is stored on several nodes according to a chosen replication factor. This allows the system to continue operating even when some nodes fail.

Cassandra provides tunable consistency. The application can choose how many replicas must respond to a read or write. Waiting for more replicas gives stronger consistency, while waiting for fewer improves availability and often reduces latency.

The basic trade-off is:

More replica responses improve consistency.
Fewer replica responses improve availability and latency.

Storage Model

Cassandra’s storage engine resembles Bigtable’s.

Writes are first recorded durably and placed in memory. Later, they are flushed to disk as immutable files. Reads may have to combine recent in-memory data with older on-disk files. Compaction merges those files over time.

The important connection is that Cassandra resembles Dynamo in how it distributes and replicates data across the cluster, but resembles Bigtable in how each node stores and retrieves data locally.

What to Remember About Cassandra

The main ideas to remember are:

Cassandra uses a peer-to-peer architecture with no master.
Data is distributed by hashing the partition key onto a hash ring.
Each node owns a range of token values on that ring.
Adding or removing a node moves only data in the adjacent ring arc.
The partition key controls placement across machines.
The clustering columns control sorted order within a partition.
Cassandra emphasizes availability and scalability, with tunable consistency chosen by the application.

Spanner (NewSQL)

Spanner is a globally distributed relational database that combines horizontal scalability with strong transactional guarantees. It keeps the key-range partitioning ideas of systems like Bigtable, but adds distributed transactions, multiversion data, and globally meaningful timestamps.

A compact mental model is this: Spanner combines Bigtable-style partitioning, distributed transaction mechanisms, and carefully managed time.

What Spanner Combines

Spanner is best understood as a combination of several mechanisms, each solving a different problem.

The main components are:

Key-range partitioning into splits
Paxos replication within each split
Two-phase commit (2PC) for transactions that span multiple splits
Strict two-phase locking (2PL) for read-write transactions
Wound-wait for deadlock prevention
Multiversion concurrency control (MVCC) for versioned data
TrueTime for bounded clock uncertainty and globally meaningful timestamps

Partitioning gives scale. Paxos gives fault tolerance and a consistent order of updates within each split. Two-phase commit coordinates work across splits. Strict 2PL and wound-wait control concurrency. MVCC supports snapshot reads. TrueTime and commit wait make timestamp order line up with real time.

Physical and Logical Structure

Spanner has both a physical and a logical structure.

At the physical level, data is stored on spanservers organized into zones. A zone is a large administrative and failure domain, roughly at the datacenter level. Replicating data across zones allows the system to survive larger failures than a single machine crash.

At the logical level, a Spanner deployment is called a universe. A universe contains databases, databases contain tables, and tables are divided into splits.

A split is a contiguous range of keys. This is directly analogous to a Bigtable tablet. Rows are stored in sorted key order, and a split is one slice of that sorted key space.

Replication and Paxos Groups

Each split is replicated across multiple servers. The set of servers maintaining replicas of one split is called a Paxos group. These replicas use Paxos to agree on the order of updates.

The core structural idea is worth stating clearly:

Partitioning gives scale.
Paxos replication gives fault tolerance and a consistent order of updates within each split.

As long as a majority of replicas in a Paxos group are available, that split can continue to make progress.

Transactions Across Splits

If a transaction touches data in only one split, that split’s Paxos group can handle the transaction locally.

If a transaction touches data in more than one split, Spanner uses two-phase commit. This is the same protocol discussed earlier: a coordinator drives a prepare phase and then a commit phase across all participating groups.

This leads to an important distinction:

Paxos handles replication and ordering within one split.
2PC coordinates one transaction across multiple splits.

Read-Write Transactions, Locks, and Wound-Wait

For read-write transactions, Spanner uses strict two-phase locking.

A read acquires a shared lock, and a write acquires an exclusive lock. Locks are held until commit. Holding locks until commit prevents other transactions from seeing partial results and gives serializable behavior.

Shared and exclusive locks affect concurrency in different ways. Multiple readers can hold shared locks at the same time, but an exclusive lock prevents other reads and writes on that data item.

Locking creates the possibility of deadlock, so Spanner uses wound-wait.

In wound-wait:

If an older transaction needs a lock held by a younger transaction, the younger transaction is aborted and retried.
If a younger transaction wants a lock held by an older transaction, it waits.

That policy prevents deadlock while giving older transactions priority.

Snapshot Reads and MVCC

Spanner stores multiple committed versions of data, each tagged with a timestamp. That is the role of MVCC.

A snapshot read returns the state of the database at a specific time \(t\). More precisely, for each item, the system returns the most recent committed version whose timestamp is less than or equal to \(t\).

That gives a consistent cut through time.

Snapshot reads are especially useful for large searches, scans, and reports. Because they read committed older versions, they usually do not need to lock the current data. Writers can continue updating the newest versions while the read-only operation sees a stable snapshot.

This is one of the main practical advantages of MVCC. Large read-only workloads can proceed without blocking current writes.

Snapshot reads also motivate one of Spanner’s most important requirements. Since a snapshot may span many splits and datacenters, commit timestamps must be meaningful across the whole system. Otherwise, “the state of the database at time \(t\)” would not define one coherent global state.

External Consistency

Spanner provides external consistency, also called strict serializability.

The guarantee is this: if transaction \(T_1\) commits before transaction \(T_2\) begins in real time, then \(T_1\)’s commit timestamp must be less than \(T_2\)’s commit timestamp.

This is the transaction-level version of linearizability. Linearizability is usually stated for single operations on shared objects: if one operation finishes before another begins, the first must appear before the second. External consistency applies that same real-time ordering idea to transactions.

This guarantee is stronger than ordinary serializability. Serializability only requires that the result be equivalent to some serial order. That order does not have to match wall-clock time. External consistency requires both a serial order and agreement with real time.

An example is a bank transfer. Suppose one transaction transfers money from checking to savings and returns “committed” to the client. If another transaction starts afterward and reads the balances, external consistency requires that the second transaction see the transfer. A merely serializable system could still choose some serial order, but that order might not match real time. External consistency rules that out.

TrueTime

Spanner cannot enforce external consistency using naive local clock readings, because clocks in distributed systems are never perfectly synchronized.

Spanner’s solution is TrueTime, which represents time as an interval rather than a single exact value:

\[TT.now() = [earliest, latest]\]

The real current time is guaranteed to lie somewhere inside that interval.

The two bounds provided by the TrueTime API are:

TT.now().earliest is definitely not later than the real current time.
TT.now().latest is definitely not earlier than the real current time.

The width of the interval reflects clock uncertainty. Google keeps that uncertainty small by synchronizing clocks using GPS receivers at each data center, with atomic clocks as backup references.

The key idea is that Spanner does not need perfect synchronization. It only needs the uncertainty to be bounded.

Commit Wait

TrueTime by itself is not enough. Spanner also uses commit wait.

When a read-write transaction is ready to commit, Spanner follows this pattern:

It chooses a commit timestamp \(t = TT.now().latest\).
It waits until \(TT.now().earliest > t\).
It then makes the transaction visible.

That waiting step ensures that the chosen timestamp is definitely in the past relative to real time before the commit becomes visible.

The logic is straightforward. If transaction \(T_1\) becomes visible only after time \(t\) is definitely in the past, then any later transaction \(T_2\) that begins afterward must receive a later timestamp. That is how Spanner enforces external consistency.

Commit Wait and Replication

Commit wait sounds expensive, but Spanner overlaps it with replication.

The sequence is:

Acquire locks and perform the transaction’s work.
Choose a commit timestamp.
In parallel:
Run Paxos so the commit is replicated durably.
Perform commit wait so real time advances past the chosen timestamp.
Once both are complete, commit and release locks.

This overlap hides much of the waiting cost. In Google’s environment, thanks to accurate time sources at each datacenter, the uncertainty window is typically only a few milliseconds, and the average commit wait time is on the order of 4 ms.

What to Remember About Spanner

The main ideas to remember are:

Spanner is a distributed relational database with strong transactional guarantees.
Splits are key-range partitions, like Bigtable tablets.
A Paxos group manages replicas of one split.
2PC coordinates transactions across multiple splits.
Strict 2PL controls read-write concurrency.
Wound-wait prevents deadlock.
MVCC enables snapshot reads.
Snapshot reads require globally meaningful timestamps.
External consistency is stronger than ordinary serializability because it also respects real time.
TrueTime and commit wait are the mechanisms that make that possible.

Key Points

Across these systems, the main design questions are about partitioning, replication, concurrency control, and time.

The high-level distinction among the systems is:

Bigtable uses sorted row keys and key-range partitioning into tablets.
Cassandra uses hash partitioning with a DHT-style ring.
Spanner uses sorted key ranges like Bigtable, but adds full distributed transactions and globally meaningful time.

You should also be comfortable with these ideas:

Key-range partitioning keeps nearby keys together and supports efficient range scans.
Hash partitioning spreads load evenly but does not preserve global key order.
Paxos provides replication and ordered updates within one partition.
2PC coordinates transactions across multiple partitions.
Strict 2PL and wound-wait manage read-write concurrency.
MVCC enables snapshot reads by keeping multiple committed versions.
TrueTime and commit wait allow Spanner to align transaction order with real time.

You Do Not Need to Study

Some details in the notes are useful for understanding the systems, but are too detailed to matter for the exam.

The following details are not required for upcoming exams:

Implementation details of storage formats like SSTable encoding
Specific APIs or query syntax for any system
Paxos message flows or optimizations
Procedures for leader election within Paxos
Detailed timing information beyond understanding the purpose of commit wait
Implementation details of GPS or atomic clocks beyond their role in bounding uncertainty
Any cloud product names or deployment details

Week 10: Distributed Computation

Distributed computation frameworks solve the same core problem in different ways: how to divide a large computation across many machines, move data where it needs to go, recover from failures, and keep the overall job efficient. The main goal of this topic is to understand the execution model each framework introduces and the kinds of workloads that model supports well.

MapReduce

MapReduce is a batch-processing framework built around a master-worker architecture. The master coordinates the job, assigns map and reduce tasks to workers, tracks their progress, and reassigns work when failures occur. The programmer supplies a map function that emits intermediate key-value pairs and a reduce function that processes one key together with all values associated with that key.

The execution model has a fixed structure. Input is divided into map shards, map workers process those shards in parallel, intermediate key-value pairs are partitioned by key, reducers fetch their assigned partitions, and the framework then performs shuffle and sort. In this phase, data is moved across the cluster so that all records with the same key reach the same reducer, and the reducer’s input is sorted and grouped by key before reduction begins. Reduce workers then process each key and produce the final output.

Key concepts in MapReduce include:

Master-worker coordination: The master assigns work, tracks completion, and restarts failed tasks.
Map tasks and reduce tasks: Map tasks generate intermediate key-value pairs; reduce tasks combine values for each key.
Partitioning of intermediate keys: The framework assigns keys to reducers, often by hashing the key.
Shuffle and sort: Intermediate data is moved, reorganized, sorted, and grouped so that one reducer receives all values for each key.
Locality: Map tasks are preferably placed near the input data they read.
Stragglers and speculative execution: Slow tasks may be duplicated, and the first completed result is used.

Failure handling follows the structure of the job. If a map worker fails, its map task is rerun, and any lost intermediate output is regenerated by re-executing that map task. If a reduce worker fails, the reduce task is rerun after the reducer again fetches its required intermediate partitions. Stragglers are handled through speculative execution, where a slow task may be launched on another worker and the first result to finish is used.

MapReduce is best suited for large batch jobs. It is a poor fit for iterative workloads because each stage typically writes its output to storage before the next stage begins.

BSP

Bulk Synchronous Parallel, or BSP, organizes a distributed computation into supersteps. Each superstep has three parts: local computation, communication, and barrier synchronization. Messages sent during one superstep become available in the next.

The key idea is that BSP gives computation a round-based structure. That makes communication easier to reason about and creates natural points for synchronization and checkpointing. The cost is that fast workers must wait for slow workers at each barrier.

Key concepts in BSP include:

Supersteps: Repeated rounds of local computation, communication, and synchronization.
Message passing between rounds: Messages sent in one round are processed in the next.
Barrier synchronization: All workers wait at the end of a superstep before the next begins.
Checkpointing at synchronization points: The barrier provides a natural place to save state for recovery.

BSP itself is a general model of round-based computation, not a specific graph-processing system with a built-in rule such as vote to halt. In a BSP-style program, the stopping condition is defined by the algorithm or the framework built on top of BSP. A computation may stop after a fixed number of rounds or when no worker has any further useful work to do.

BSP is a better fit than MapReduce for iterative algorithms because it keeps repeated rounds of computation explicit.

Pregel and Giraph

Pregel applies the BSP model to graph processing. Its computation is vertex-centric: each vertex receives messages, updates its state, sends messages to other vertices, and may vote to halt. The graph remains present across iterations rather than being reconstructed as key-value data at each round.

In Pregel, vote to halt means that a vertex declares that it currently has no more work to do. The vertex becomes inactive and is skipped in later supersteps unless a new message arrives for it. If a message does arrive, the vertex becomes active again.

This structure is a natural fit for graph algorithms such as shortest paths and PageRank, where information repeatedly propagates along edges. A vertex becomes active when it has work to do, may become inactive when it does not, and may become active again if it later receives a message. The computation terminates only when every vertex has voted to halt, all vertices are inactive, and no messages remain in transit anywhere in the system.

Key concepts in Pregel and Giraph include:

Vertex-centric computation: The unit of computation is the vertex.
Supersteps: Computation proceeds in synchronized rounds.
Message passing: Vertices communicate by sending messages.
Vote to halt: A vertex declares that it currently has no more work to do.
Active and inactive vertices: A vertex can become inactive and later reactivate if it receives a message.
Global termination: The job ends only when every vertex has voted to halt and no messages remain in transit.

Giraph is an Apache open-source system based on the Pregel model.

Spark

Spark was designed for computations that require more flexibility than MapReduce, especially multi-stage and iterative workloads. Its original core abstraction is the Resilient Distributed Dataset, or RDD, which represents a partitioned collection of data distributed across a cluster.

Spark includes several major architectural components:

Driver program: The program that defines the computation and coordinates execution.
Cluster manager: The component that allocates resources across the cluster.
Workers: Machines that run computation.
Executors: Processes on worker machines that run tasks and can cache data partitions.

Spark organizes computation as a dataflow graph. Transformations create new RDDs lazily, while actions trigger execution. This design allows Spark to support multi-stage pipelines without forcing every stage to materialize its output before the next one begins.

An RDD has several important properties:

Partitioned: The data is divided across the cluster so partitions can be processed in parallel.
Immutable: Once created, an RDD is not modified in place; new RDDs are derived from old ones.
Derived from stable input or earlier RDDs: An RDD comes from source data or from operations on other RDDs.
Fault-tolerant through lineage: Spark records how an RDD was created so lost partitions can be recomputed.
Optionally cached or persisted: An RDD may be kept in memory or on disk for reuse.

Spark performance depends heavily on dependencies between partitions. Narrow dependencies usually allow computation to proceed without redistributing data. Wide dependencies usually require a shuffle. In Spark, a shuffle means that data is moved across workers and reorganized into new partitions so that related records end up together for the next stage. It is therefore both communication across the cluster and repartitioning of the data.

Key concepts in Spark include:

RDDs
Driver program, workers, and executors
Partitioning
Immutability
Lazy evaluation
Transformations and actions
Lineage
Caching and persistence
Narrow and wide dependencies
Shuffle

Comparing the Frameworks

Each framework organizes distributed computation around a different core abstraction.

MapReduce centers computation on map tasks, shuffle and sort, and reduce tasks.
BSP centers computation on supersteps and barrier synchronization.
Pregel centers computation on vertices and messages.
Spark centers computation on partitioned datasets and dataflow.

The main goal is to understand what each framework makes easy, what costs it exposes, and what kinds of workloads it handles best.

What You Do Not Need to Study

Focus on the core abstractions, execution models, and major tradeoffs of each framework. You do not need to study details that are outside that scope.

You should not need to study:

API details or language-specific syntax
The exact names of libraries, commands, or method signatures
Dates, company histories, or product timelines
Which company introduced a framework, except where the historical origin helps explain the model itself
Detailed pseudocode beyond understanding what the example is meant to illustrate
Fine-grained implementation details that were discussed in the notes but are not summarized here
The Ray framework (which was an optional section in the lecture notes)
I will not ask questions about distributed machine learning.

Week 11: Data In Motion

Part 1: Message Queues and Event Streaming

The Publish-Subscribe Model

A message broker decouples producers from consumers: neither side needs to know anything about the other, they do not need to be running at the same time, and they do not need to operate at the same speed. This last point matters: a producer can generate messages faster than consumers can process them, and the broker absorbs the difference. Producers write messages to the broker; consumers read from it. Messages are organized by topic (a named category or stream).

Delivery Semantics

Three guarantees apply to messaging systems, the same ones that apply to RPC:

At-most-once: sent once, not retried. Fast, but messages can be lost.
At-least-once: retried until acknowledged. No loss, but duplicates are possible. Consumers must be idempotent or deduplicate.
Exactly-once effect: what most systems actually provide. The observable result is as if each message were processed once, achieved by combining at-least-once delivery with idempotent or transactional processing. This requires cooperation from the source, the broker, and the sink. Always ask: under what assumptions, and at what cost?

RabbitMQ

RabbitMQ is a message broker where routing logic lives entirely in the broker. Producers publish messages to an exchange, which routes them to one or more queues based on configured rules. Once a consumer acknowledges a message, the broker removes it. This model works well for task queues and routing-heavy workflows, but messages cannot be replayed and RabbitMQ is not designed for Kafka-style scale-out.

Apache Kafka

Kafka treats the log as the central architectural primitive: an append-only, totally ordered sequence of records that persists for a configurable retention period. Unlike a traditional queue, messages are not deleted after consumption. The following core concepts are essential:

Topic: A named log, divided into partitions.
Partition: An ordered log that grows only by appending; records are never modified once written. Each record is identified by an offset.
Ordering guarantee: Total order is per-partition only. To preserve order for related events, route them to the same partition using a consistent key.
Offset: A sequential integer that identifies a record’s position in a partition; each consumer group independently tracks the offset it has read up to.
Consumer group: A set of consumers that collectively consume a topic; each partition is assigned to one consumer at a time. Within a group this gives a queuing model; across independent groups it gives a pub-sub model.
Leader/follower replication: Each partition has one leader (handles reads and writes) and zero or more follower replicas. If the leader fails, a follower is elected.
Log compaction: An alternative to time- or size-based retention. Kafka retains only the most recent record per key, making the log a durable store of current state.

Producers control durability via the acks setting: acks=0 means fire and forget; acks=1 means the leader acknowledges on write; acks=all means all in-sync replicas must acknowledge before the producer receives confirmation.

Kafka is fast despite writing to disk because it relies entirely on sequential I/O, which is orders of magnitude faster than random I/O, and it exploits the OS page cache aggressively.

Stream Processing

Backpressure is the problem that arises when producers generate data faster than consumers can handle it. Systems address it in three main ways: buffering (absorbing bursts in a queue, which is what Kafka does by design), dropping (discarding messages when the buffer is full, only acceptable for loss-tolerant data), and slowing the producer (explicit flow control, which is backpressure in its strict sense).

Event time is when an event occurred; processing time is when the system received it. Using event time gives correct results for time-based aggregations, while processing time is easier to implement but inaccurate when data arrives late or out of order.

A window defines how events are grouped for aggregation over time. Stream processors support three main window types:

Tumbling: fixed-size, non-overlapping. Each event belongs to exactly one window.
Sliding: fixed-size, overlapping by a configurable step. Events near boundaries appear in multiple windows.
Session: groups events by gaps in activity; window size varies with user behavior.

A watermark is the system’s estimate of how far event time has progressed. Events with timestamps earlier than the watermark are considered unlikely to still arrive, and the system uses the watermark to decide when to close a window and emit results. The stream processor derives it by taking the maximum event timestamp seen so far and subtracting a configured lag. The lag is a tradeoff: too small and late-arriving events are dropped; too large and result latency and memory use increase.

Spark Structured Streaming

Spark Structured Streaming uses a micro-batch model: events are collected into small batches and processed using the standard Spark API. The stream is treated as an unbounded table that grows over time. Event-time windows and watermarks are supported.

Spark provides three output modes for writing results: append writes only newly completed rows; complete rewrites the full result table on each trigger; update writes only rows that changed since the last trigger.

Exactly-once semantics require checkpointing plus an idempotent or transactional sink. Checkpointing prevents skipping events on recovery (at-least-once). When the source supports offset-based replay and the sink supports idempotent writes, they can together provide exactly-once effect, but only under those specific conditions.

Apache Flink

Apache Flink is designed around continuous record-at-a-time streaming rather than Spark’s micro-batch model, giving lower latency at the cost of higher operational complexity.

Part 2: Content Delivery Networks

The Flash Crowd Problem

A flash crowd occurs when a sudden surge in demand overwhelms a single origin server. CDNs solve this by distributing cached copies of content globally so requests are served by nearby servers rather than the origin.

Pre-CDN Approaches and Their Limits

Before CDNs, operators tried several techniques to handle load, each with significant limitations:

Browser caching: private to one client; does not help other users.
Caching proxies: only help users who share that proxy; do not address global distribution.
Load balancing: increases capacity but not geographic performance.
Mirroring: serves content from multiple geographic copies; keeping copies synchronized is difficult.

CDN Architecture

A CDN has three tiers: edge servers (close to users, often inside ISPs), parent servers (regional aggregators), and the origin (the content provider’s infrastructure). The tiered lookup reduces origin load because popular content is served entirely from caches.

CDNs come in two operational models. A push CDN requires the content provider to pre-position content on storage nodes before demand arrives, which is suitable for large files such as software packages or video assets. A pull CDN has edge servers fetch from the origin on the first cache miss and then cache the result, which is simpler to operate and works well for general web assets.

CDN Providers

There are several high volume CDN providers. Three popular ones are:

Akamai was the first commercial CDN and remains one of the largest CDN providers . It uses a DNS-based routing architecture.
Cloudflare built its CDN around anycast routing. - Amazon CloudFront is the default choice for AWS-hosted applications. Most organizations rent CDN capacity rather than build it, since the capital cost of deploying servers inside ISPs worldwide is prohibitive.

Request Routing

CDNs use two main approaches to direct users to the nearest edge server. With DNS-based routing, the content provider creates a CNAME record pointing to the CDN, and the CDN’s dynamic DNS servers return different IP addresses based on user location, server load, and server health.

With anycast routing, all CDN nodes share the same IP address, and the connection is directed to the nearest advertising node based on network routing state rather than a DNS lookup. Many CDNs combine both approaches.

Caching: Content Types

CDNs cache different types of content with different strategies. The main cases are:

Static content: cached according to rules set by the content provider via the Cache-Control response header, which can specify how long a response may be cached, whether it may be stored at all, whether it must be revalidated before use, and whether shared caches like CDN edge servers are permitted to cache it. URL versioning invalidates caches on update.
Dynamic content: assembled at the edge using Edge Side Includes (ESI), which breaks a page into fragments with independent cache lifetimes.
Streaming video: delivered as short HTTP segments using HLS (Apple devices) or MPEG-DASH (Netflix, YouTube, and most other platforms on Android and desktop). Adaptive bitrate (ABR) encoding provides multiple quality levels; the player selects based on current network conditions. From a CDN perspective both protocols are identical: regular HTTP file delivery.

CDN Overlay Network

When an edge server must contact the origin, the CDN routes traffic through its own overlay network rather than the public internet. Nodes continuously measure latency and packet loss to their peers and select paths based on measured performance rather than BGP routing policy.

Security Benefits

A CDN shields the origin’s real IP address from the public internet, so attack traffic hits the CDN’s distributed infrastructure rather than the origin. TLS termination at the edge reduces handshake latency and offloads cryptographic work from the origin.

BitTorrent: Peer-to-Peer Content Delivery

BitTorrent inverts the CDN model: every downloader becomes an uploader, so as more peers join, supply grows automatically. The protocol works through the following mechanisms:

.torrent file: Contains file metadata and cryptographic hashes for each piece.
Tracker: A central server that maintains the list of peers in the swarm; often supplemented or replaced by DHT in modern implementations.
Pieces: Fixed-size chunks of the file, each independently verified by hash.
Rarest-first: Peers preferentially download the pieces that the fewest other peers currently have, ensuring rare pieces spread quickly through the swarm.
DHT: A decentralized alternative to the tracker; peer-list information is distributed across the swarm using a protocol conceptually similar to Chord.

BitTorrent is not well-suited for streaming because pieces are downloaded out of order, it requires upload bandwidth from clients, and it depends on community participation.

Edge Computing

Edge computing runs application logic on CDN edge nodes rather than at the origin, reducing round-trip latency for dynamic operations. Cloudflare Workers runs JavaScript in V8 isolates, which are lightweight sandboxes that provide memory isolation between concurrent workers. Workers can handle authentication, routing, personalization, and similar tasks at the edge, sometimes returning a response directly and sometimes modifying the request before forwarding it to the origin.

The key constraint is state. Reaching a central database from an edge node can add enough latency to erase the benefit, so edge platforms provide local data stores for low-latency state. Complex transactional logic stays at the origin. Edge compute is a complement to the origin, not a replacement.

Key Comparisons

Kafka vs. RabbitMQ: Kafka is a durable, replayable log where consumers track their own position; RabbitMQ routes messages to queues via exchanges and deletes them after acknowledgment. Kafka scales better and supports replay; RabbitMQ offers more flexible routing but does not distribute workload the way Kafka does.

CDN vs. BitTorrent: CDNs are centrally operated, commercially provisioned, and predictably performant; BitTorrent is decentralized, requires no infrastructure investment, and scales with the number of participants.

DNS routing vs. anycast: DNS routing selects a server at resolution time based on observed conditions; anycast routes based on network routing state rather than DNS lookup. Many CDNs use both.

What You Don’t Need to Study

Focus on architectural concepts and how the systems compare. You do not need to memorize:

RabbitMQ exchange types (direct, fanout, topic, headers) or binding configuration details
Flink internals, checkpoint mechanics, or operator pipeline details
Kafka Streams, ksqlDB, specific API calls, or configuration syntax
Any specific version numbers or performance benchmarks
Cache-Control header specific directive names and values (know that the header controls caching behavior, not the individual directive names)
The Staples DNS example or the specific IP addresses in it
The specific signals used by Akamai’s mapping system (user location, BGP tables, traceroute, etc.)
The exact TTL value on CDN DNS responses
Cloudflare product details beyond the basic idea of executing logic in isolates at the edge

Week 12: Security in Distributed Systems

Focus on the concepts and design principles, not the cryptographic details. You should understand why these mechanisms exist in distributed systems and what goes wrong when they are misapplied, not whether you can recall algorithm parameters, protocol steps, or product names.

What you don’t need to study

Algorithm internals, key sizes, or mode variants. You do need to know which category each named algorithm belongs to: AES and ChaCha20 are symmetric ciphers; RSA, ECDH, and ECDSA are public-key algorithms; SHA-256 is a hash function and HMAC is a MAC construction.
OAuth authorization code flow in detail: redirects, authorization codes, or back-channel requests. Know that OAuth delegates access and produces an access token; you do not need to know how the protocol achieves it.
The OIDC protocol steps. Know that OIDC adds user identity on top of OAuth and produces an ID token.
The JWT format: the three-part structure, base64url encoding, or standard claim names (sub, iss, exp, iat, aud). Know that JWTs are a compact, signed token format used to carry claims (such as user identiy information or service authorizations).
Anything about how SPIRE operates or how any SPIFFE implementation works internally. Know what SPIFFE is and what problem it solves.
The SPIFFE URI format.
Service mesh proxy internals or configuration, or the differences between Istio and Linkerd.
What north-south vs. east-west traffic refers to.
Specific products: which company makes which secret manager, which identity provider is most widely deployed, or how TLS versions differ at the protocol level.
Mechanisms used to revoke certificates.
The evolution of the OAuth protocol to its current version.

Security goals

Know these five goals and be able to explain what each one means.

Confidentiality means keeping data secret from parties who are not authorized to see it. Encryption is the primary mechanism.

Integrity means detecting unauthorized modification. A message has integrity if the receiver can confirm it has not been altered in transit.

Authentication means establishing who is on the other end of a connection or who created a message. A service needs to know whether it is really talking to the orders service or to an attacker pretending to be it.

Authorization means deciding what an authenticated party is allowed to do. A principal is any entity that can be identified and granted access: a user, a service, a device, or a background process.

Non-repudiation means being able to prove that a specific party created or approved some data. Digital signatures and audit logs are the typical mechanisms.

These are distinct properties. Encrypting a message does not prevent it from being modified in transit if there is no integrity check. A message can arrive unmodified but fully visible to an attacker. Authentication can succeed while authorization fails. Getting one right does not mean you have the others.

Why distributed systems change the security problem

In a monolithic application, many trust decisions are implicit. The OS enforces boundaries. There is one identity model and one administrator. Distributed systems remove those guarantees.

Several properties define where the problem gets harder:

There is no single trust boundary. Data may cross services, clouds, and organizations.
Identity is not just about users. Services, containers, batch jobs, and background processes all need identities.
Authorization is not a one-time check at the edge. Every service in a call chain needs to make its own decision.
Secrets are distributed across many machines, creating problems of provisioning, rotation, revocation, and blast radius (how far the damage spreads when a credential leaks).
Partial compromise is the realistic threat. One leaked credential can enable lateral movement through the rest of the system.

The cryptographic building blocks

Understand the conceptual role of each mechanism, not the math.

Symmetric encryption uses the same key on both sides. It is fast and suited for bulk data, but it requires both parties to already share a key. It cannot establish a secure channel between strangers on its own.

Asymmetric cryptography uses a key pair. The public key can be shared openly. The private key stays secret. It makes key establishment practical without pre-shared secrets: two parties who have never shared a secret can establish a shared key. It also enables digital signatures.

Hashes, MACs, and digital signatures serve different purposes. A hash function maps data of any length to a fixed-size digest. The same input always produces the same digest, and it is computationally infeasible to reverse the process or find two inputs with the same digest. A plain hash provides no protection against an active attacker because anyone can modify a message and recompute the hash. It is useful for detecting accidental corruption, but it proves nothing about who sent the message.

MACs and digital signatures both extend the hash concept to provide authentication as well as integrity.

A MAC (message authentication code) computes a hash of the message combined with a shared secret key. A receiver with the same key can verify the hash, confirming both that the message was not modified and that it came from someone who knows the key.

A digital signature also operates on a hash of the message, but instead of a shared secret, the sender computes the signature using their private key. Any party with the corresponding public key can verify the signature, confirming the message’s origin and integrity without a pre-shared secret. Signatures also support non-repudiation.

The following table summarizes the key differences:

Mechanism	Protects against active attacker	Provides origin authentication	Requires pre-shared secret
Hash	No	No	No
MAC	Yes	Yes, to parties with the shared secret	Yes
Digital signature	Yes	Yes, verifiable via public key	No

Replay attacks are easy to miss. Integrity checks tell you a message was not modified. They tell you nothing about whether it is a copy of an older message. A valid signed request captured and retransmitted later will still pass an integrity check. Distributed protocols add freshness mechanisms: nonces, timestamps, sequence numbers, or expiration times.

Transport Layer Security (TLS) combines asymmetric key exchange, symmetric bulk encryption, and integrity checks into a secure channel. It provides confidentiality, integrity, and server authentication. With mutual configuration, it also authenticates the client. Know what TLS provides, not the handshake state machine.

Certificates bind a public key to an identity. A certificate authority (CA) signs the binding. Trust is organized as a chain from a root CA through intermediate CAs to the certificates issued to specific services, workloads, or users. In distributed systems, the identity in a certificate may be a service name or workload identifier rather than a domain name. Certificate lifecycle (expiration, rotation, revocation) is a security design constraint, not just an operational concern.

Mutual TLS (mTLS)

Mutual TLS (mTLS) authenticates both sides of a connection. Standard TLS authenticates only the server. With mTLS, each side presents a certificate and verifies the other’s. The result is a cryptographically verified identity for both caller and receiver, not just an IP address.

In microservice systems, mTLS turns “inside the cluster” from a passive trust assumption into a verifiable property of each connection. Once both sides have cryptographically verified identities, authorization decisions can be based on who is calling rather than where the call came from.

Authentication and authorization

Know this distinction cold. Authentication asks who you are. Authorization asks what you are allowed to do.

A system can authenticate correctly but authorize incorrectly. The most common failure pattern is broken object-level authorization (BOLA): a service checks the token but does not verify that the caller is allowed to access the specific resource. An attacker with any valid token can read any object by changing an ID in the request.

In distributed systems, authorization must happen at multiple layers independently. The API gateway can check whether a client is allowed to call the API at all. A backend service must check whether the caller is allowed to invoke this specific operation. A data service must check whether this caller can access this specific record. None of these can be fully delegated to the layer above.

OAuth, OIDC, and JWTs

These three are not interchangeable.

OAuth is an authorization framework. It produces an access token that a service validates to determine what access to grant. It answers “what access is being delegated?” not “who is this user?”

OpenID Connect (OIDC) is an identity layer on top of OAuth. It produces an ID token that describes who the authenticated user is. It answers the authentication question that OAuth does not.

A JWT (JSON Web Token) is a compact token format, not a protocol. A JWT carries claims: key-value assertions about the subject, such as user identity, expiration time, issuer, and granted scopes. OAuth access tokens and OIDC ID tokens are often encoded as JWTs, but a JWT by itself has no security meaning without knowing the protocol context.

Signed JWTs can be validated locally without contacting the issuer, which reduces latency but makes revocation hard. The tradeoff is between local validation and centralized revocation control: local validation is fast and requires no runtime dependency on the issuer, but enforcing revocation globally requires coordination with the issuer. The most widely used response is short-lived access tokens paired with a refresh token, a longer-lived credential the client uses to obtain a new access token at renewal; revocation takes effect when the client next tries to renew.

Workload identity

Users are not the only principals. Services, containers, and batch jobs also need identities. Workload identity lets services prove which workload they are cryptographically rather than by network location. SPIFFE (Secure Production Identity Framework for Everyone) is a widely adopted standard for workload identity credentials. The key concept is that workload identities should be short-lived and automatically rotated, just like access tokens.

Cloud IAM

Cloud identity and access management (IAM) policies bind identities to permissions on cloud resources. Least privilege means granting each service only the permissions it actually needs and no more.

Over-privileged service accounts and roles are a major escalation path: a microservice with broad permissions becomes a much larger incident when compromised than one with least-privilege access. Short-lived credentials issued via workload identity federation are preferred over long-lived service account keys embedded in container images.

Zero Trust

Zero Trust is an architectural principle: network location does not imply trust. A request from inside the cluster still needs to be authenticated and authorized. The slogan “never trust, always verify” is a shorthand for the principle that trust must be earned by presenting verified credentials, not assumed based on where a connection originates.

Micro-segmentation

Micro-segmentation divides a system into fine-grained trust domains and explicitly controls which services can communicate with which others. It limits the blast radius of a compromise: if the orders service cannot connect directly to the database, compromising the orders service does not automatically give an attacker database access.

Service meshes and API gateways

Know the difference in scope.

An API gateway handles external traffic entering or leaving the system (called north-south traffic). It centralizes TLS termination (often decrypting incoming HTTPS connections at the boundary so internal services communicate over separate connections), token validation, rate limiting, and coarse-grained authorization at the edge.

A service mesh handles internal service-to-service calls (called east-west traffic). It provides mTLS, workload identity, and authorization enforcement consistently across all internal calls without requiring each service to reimplement those mechanisms. A sidecar proxy running alongside each service intercepts its network traffic to enforce these policies transparently.

A gateway protects the front door. A service mesh protects what happens inside.

Secret management

Secrets (API keys, database passwords, TLS private keys) must be distributed, rotated, revoked, and audited. Baking secrets into container images or source repositories is one of the most common and damaging mistakes. A well-designed system issues short-lived credentials to authenticated workloads at runtime rather than using long-lived static secrets.

Base64 encoding is not encryption. A base64-encoded secret is fully readable to anyone who can see the string.

Key and certificate rotation should be routine and automated. A system that requires manual edits or downtime to rotate a certificate is brittle by design.

Common design mistakes

Each of these reflects a trust assumption that was wrong from the start, not just a coding mistake.

Assuming the internal network is trusted.
Authenticating users but not services.
Using long-lived shared secrets everywhere instead of short-lived credentials.
Relying on the API gateway for all security and ignoring authorization inside the system.
Checking authentication but not authorization on individual resources and operations.
Over-privileging service accounts and cloud roles.
Storing secrets in source code, container images, or environment variable dumps.
Treating JWTs as an architecture rather than a token format.
Failing to plan for credential revocation, rotation, and resistance against replay attacks.

Understanding why each one is wrong is understanding the security design principles covered in these notes.

Exam 3 Study Guide

Week 9: Distributed Databases

NoSQL and Data Model Categories

Bigtable

Data Model

Partitioning and Structure

Storage and Writes

Key Concepts in Bigtable

Cassandra

Architecture and the Hash Ring

Data Model

Replication and Consistency

Storage Model

What to Remember About Cassandra

Spanner (NewSQL)

What Spanner Combines

Physical and Logical Structure

Replication and Paxos Groups

Transactions Across Splits

Read-Write Transactions, Locks, and Wound-Wait

Snapshot Reads and MVCC

External Consistency

TrueTime

Commit Wait

Commit Wait and Replication

What to Remember About Spanner

Key Points

You Do Not Need to Study

Week 10: Distributed Computation

MapReduce

BSP

Pregel and Giraph

Spark

Comparing the Frameworks

What You Do Not Need to Study

Week 11: Data In Motion

Part 1: Message Queues and Event Streaming

The Publish-Subscribe Model

Delivery Semantics

RabbitMQ

Apache Kafka

Stream Processing

Spark Structured Streaming

Apache Flink

Part 2: Content Delivery Networks

The Flash Crowd Problem

Pre-CDN Approaches and Their Limits

CDN Architecture

CDN Providers

Request Routing

Caching: Content Types

CDN Overlay Network

Security Benefits

BitTorrent: Peer-to-Peer Content Delivery

Edge Computing

Key Comparisons

What You Don’t Need to Study

Week 12: Security in Distributed Systems

What you don’t need to study

Security goals

Why distributed systems change the security problem

The cryptographic building blocks

Mutual TLS (mTLS)

Authentication and authorization

OAuth, OIDC, and JWTs

Workload identity

Cloud IAM

Zero Trust

Micro-segmentation

Service meshes and API gateways

Secret management

Common design mistakes