pk.org: CS 417/Lecture Notes

Data In Motion

Study Guide

Paul Krzyzanowski – 2026-04-14

Part 1: Message Queues and Event Streaming

The Publish-Subscribe Model

A message broker decouples producers from consumers: neither side needs to know anything about the other, they do not need to be running at the same time, and they do not need to operate at the same speed. This last point matters: a producer can generate messages faster than consumers can process them, and the broker absorbs the difference. Producers write messages to the broker; consumers read from it. Messages are organized by topic (a named category or stream).
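The decoupling can be sketched with a toy in-memory broker (the names here are invented for illustration; a real broker adds persistence, acknowledgments, and network transport):

```python
from collections import defaultdict, deque

class Broker:
    """Toy in-memory broker: producers and consumers share only a topic name."""
    def __init__(self):
        self.topics = defaultdict(deque)    # topic -> buffered messages

    def publish(self, topic, message):
        self.topics[topic].append(message)  # producer never sees consumers

    def consume(self, topic):
        q = self.topics[topic]
        return q.popleft() if q else None   # consumer reads at its own pace

broker = Broker()
broker.publish("orders", {"id": 1})
broker.publish("orders", {"id": 2})  # broker absorbs the burst
print(broker.consume("orders"))      # {'id': 1}
```

Note that the producer returns immediately after publish(): the broker's buffer is what lets the two sides run at different speeds.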

Delivery Semantics

Three delivery guarantees apply to messaging systems, the same ones that apply to RPC: at-most-once (a message may be lost but is never delivered twice), at-least-once (a message is retried until acknowledged, so duplicates are possible), and exactly-once (every message takes effect once, which in practice requires at-least-once delivery combined with duplicate detection or idempotent processing).

RabbitMQ

RabbitMQ is a message broker where routing logic lives entirely in the broker. Producers publish messages to an exchange, which routes them to one or more queues based on configured rules. Once a consumer acknowledges a message, the broker removes it. This model works well for task queues and routing-heavy workflows, but messages cannot be replayed and RabbitMQ is not designed for Kafka-style scale-out.

Apache Kafka

Kafka treats the log as the central architectural primitive: an append-only, totally ordered sequence of records that persists for a configurable retention period. Unlike a traditional queue, messages are not deleted after consumption. The following core concepts are essential:

Topic
A named log, divided into partitions.
Partition
An ordered log that grows only by appending; records are never modified once written. Each record is identified by an offset.
Ordering guarantee
Total order is per-partition only. To preserve order for related events, route them to the same partition using a consistent key.
Offset
A sequential integer that identifies a record’s position in a partition; each consumer group independently tracks the offset it has read up to.
Consumer group
A set of consumers that collectively consume a topic; each partition is assigned to one consumer at a time. Within a group this gives a queuing model; across independent groups it gives a pub-sub model.
Leader/follower replication
Each partition has one leader (handles reads and writes) and zero or more follower replicas. If the leader fails, a follower is elected.
Log compaction
An alternative to time- or size-based retention. Kafka retains only the most recent record per key, making the log a durable store of current state.
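Two of these ideas, key-based partition routing and log compaction, can be sketched in a few lines (the hash function and data layout are simplified illustrations, not Kafka's actual implementation, which uses murmur2 for its default partitioner):

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route related events to the same partition by hashing a consistent key."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events with the same key land in one partition, preserving their order.
assert partition_for("user-42") == partition_for("user-42")

def compact(log):
    """Log compaction: keep only the most recent record per key."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)      # later offsets overwrite earlier ones
    return sorted(latest.values())         # surviving records, in offset order

log = [("k1", "a"), ("k2", "b"), ("k1", "c")]
print(compact(log))  # [(1, 'b'), (2, 'c')]
```

After compaction, replaying the log from the start reconstructs the current value for every key, which is what makes it usable as a durable store of current state.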

Producers control durability via the acks setting: acks=0 means fire and forget; acks=1 means the leader acknowledges on write; acks=all means all in-sync replicas must acknowledge before the producer receives confirmation.
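As a sketch under simplified assumptions (a real broker tracks the in-sync replica set dynamically), the three acks levels amount to a small decision rule:

```python
def producer_ack_ready(acks, leader_written, isr_acked, isr_size):
    """Return True when the producer may treat the send as successful.
    acks: "0", "1", or "all"; isr_acked counts in-sync replicas
    (including the leader) that have the record."""
    if acks == "0":
        return True                   # fire and forget: never wait
    if acks == "1":
        return leader_written         # leader has written it to its log
    if acks == "all":
        return isr_acked == isr_size  # every in-sync replica has it
    raise ValueError(acks)

print(producer_ack_ready("all", True, 2, 3))  # False: one ISR member lags
```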

Kafka is fast despite writing to disk because it relies entirely on sequential I/O, which is orders of magnitude faster than random I/O, and it exploits the OS page cache aggressively.

Stream Processing

Backpressure is the problem that arises when producers generate data faster than consumers can handle it. Systems address it in three main ways: buffering (absorbing bursts in a queue, which is what Kafka does by design), dropping (discarding messages when the buffer is full, only acceptable for loss-tolerant data), and slowing the producer (explicit flow control, which is backpressure in its strict sense).
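The three responses map directly onto a bounded queue; this sketch uses Python's standard queue module to show the first two and describes the third:

```python
import queue

buf = queue.Queue(maxsize=2)  # small buffer to force the issue

# Buffering: the queue absorbs a short burst.
buf.put("e1")
buf.put("e2")

# Dropping: when the buffer is full, a non-blocking put fails, and we can
# choose to discard the message (acceptable only for loss-tolerant data).
try:
    buf.put_nowait("e3")
    dropped = None
except queue.Full:
    dropped = "e3"

# Slowing the producer: a blocking buf.put("e3") would instead make the
# producer wait until a consumer calls get() -- backpressure in the
# strict sense.
print(buf.qsize(), dropped)  # 2 e3
```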

Event time is when an event occurred; processing time is when the system received it. Using event time gives correct results for time-based aggregations, while processing time is easier to implement but inaccurate when data arrives late or out of order.

A window defines how events are grouped for aggregation over time. Stream processors support three main window types: tumbling windows (fixed-size, non-overlapping intervals), sliding windows (fixed-size intervals that overlap, advancing by a smaller step), and session windows (groups of events separated by less than a gap timeout, so the boundaries depend on the data rather than the clock).
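As an illustrative sketch, assigning an event to tumbling or sliding windows reduces to simple arithmetic on integer event timestamps (session windows are omitted here because their boundaries depend on the data):

```python
def tumbling_window(event_ts, size):
    """The fixed, non-overlapping window [start, end) containing the event."""
    start = (event_ts // size) * size
    return (start, start + size)

def sliding_windows(event_ts, size, step):
    """All overlapping windows of width `size`, whose starts are multiples
    of `step`, that contain the event."""
    lo = event_ts - size + 1        # smallest possible window start
    first = -(-lo // step) * step   # round up to a multiple of step
    return [(s, s + size) for s in range(first, event_ts + 1, step)]

print(tumbling_window(7, 5))      # (5, 10)
print(sliding_windows(7, 10, 5))  # [(0, 10), (5, 15)]
```

An event falls in exactly one tumbling window but in size/step sliding windows, which is why sliding aggregations cost proportionally more state.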

A watermark is the system’s estimate of how far event time has progressed. Events with timestamps earlier than the watermark are considered unlikely to still arrive, and the system uses the watermark to decide when to close a window and emit results. The stream processor derives it by taking the maximum event timestamp seen so far and subtracting a configured lag. The lag is a tradeoff: too small and late-arriving events are dropped; too large and result latency and memory use increase.
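A minimal sketch of this logic, assuming integer event timestamps:

```python
def watermark(max_event_ts_seen, allowed_lag):
    """Estimate of event-time progress: events with timestamps below this
    are presumed to have already arrived (or will be treated as late)."""
    return max_event_ts_seen - allowed_lag

def window_closed(window_end, max_event_ts_seen, allowed_lag):
    """A window [start, end) closes once the watermark passes its end."""
    return watermark(max_event_ts_seen, allowed_lag) >= window_end

# With a lag of 5, the window [0, 10) closes only after an event
# timestamped 15 or later has been seen.
print(window_closed(10, 14, 5))  # False
print(window_closed(10, 15, 5))  # True
```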

Spark Structured Streaming

Spark Structured Streaming uses a micro-batch model: events are collected into small batches and processed using the standard Spark API. The stream is treated as an unbounded table that grows over time. Event-time windows and watermarks are supported.

Spark provides three output modes for writing results: append writes only newly completed rows; complete rewrites the full result table on each trigger; update writes only rows that changed since the last trigger.

Exactly-once semantics require checkpointing plus an idempotent or transactional sink. Checkpointing alone guarantees at-least-once: no events are skipped on recovery, but events after the last checkpoint may be reprocessed. When the source supports offset-based replay and the sink makes reprocessed writes harmless, the combination yields an exactly-once effect, but only under those specific conditions.
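The sink's role can be sketched by keying writes on (partition, offset), so a post-crash replay overwrites rather than duplicates (a simplification of what real idempotent sinks do):

```python
class IdempotentSink:
    """Sink keyed by (partition, offset): replaying after a crash cannot
    duplicate results, turning at-least-once into an exactly-once effect."""
    def __init__(self):
        self.rows = {}

    def write(self, partition, offset, value):
        self.rows[(partition, offset)] = value  # overwrite == no-op on replay

sink = IdempotentSink()
for offset, value in enumerate(["a", "b", "c"]):
    sink.write(0, offset, value)

# Crash before the checkpoint advances: the source replays offsets 1 and 2.
for offset, value in [(1, "b"), (2, "c")]:
    sink.write(0, offset, value)

print(len(sink.rows))  # 3, not 5: the replay duplicated nothing
```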

Apache Flink

Apache Flink is designed around continuous record-at-a-time streaming rather than Spark’s micro-batch model, giving lower latency at the cost of higher operational complexity.

Part 2: Content Delivery Networks

The Flash Crowd Problem

A flash crowd occurs when a sudden surge in demand overwhelms a single origin server. CDNs solve this by distributing cached copies of content globally so requests are served by nearby servers rather than the origin.

Pre-CDN Approaches and Their Limits

Before CDNs, operators tried several techniques to handle load, each with significant limitations:

CDN Architecture

A CDN has three tiers: edge servers (close to users, often inside ISPs), parent servers (regional aggregators), and the origin (the content provider’s infrastructure). The tiered lookup reduces origin load because popular content is served entirely from caches.

CDNs come in two operational models. A push CDN requires the content provider to pre-position content on storage nodes before demand arrives, which is suitable for large files such as software packages or video assets. A pull CDN has edge servers fetch from the origin on the first cache miss and then cache the result, which is simpler to operate and works well for general web assets.
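The pull model is essentially fetch-on-miss caching; a minimal sketch, ignoring TTLs and Cache-Control headers:

```python
origin_fetch_count = 0

def origin_fetch(path):
    """Stand-in for a request to the origin server."""
    global origin_fetch_count
    origin_fetch_count += 1
    return f"content of {path}"

edge_cache = {}

def edge_get(path):
    """Pull CDN edge: fetch from the origin on the first miss,
    then serve every later request from the local cache."""
    if path not in edge_cache:
        edge_cache[path] = origin_fetch(path)
    return edge_cache[path]

edge_get("/logo.png")
edge_get("/logo.png")
print(origin_fetch_count)  # 1: only the first request reached the origin
```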

CDN Providers

There are several high-volume CDN providers. Three popular ones are:

Request Routing

CDNs use two main approaches to direct users to the nearest edge server. With DNS-based routing, the content provider creates a CNAME record pointing to the CDN, and the CDN’s dynamic DNS servers return different IP addresses based on user location, server load, and server health.

With anycast routing, all CDN nodes share the same IP address, and the connection is directed to the nearest advertising node based on network routing state rather than a DNS lookup. Many CDNs combine both approaches.
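The DNS-side decision can be sketched as choosing among healthy edge servers by region and then load (the field names here are invented for illustration; real CDNs weigh many more signals):

```python
def pick_edge(client_region, servers):
    """Sketch of a CDN's dynamic DNS answer: among healthy edge servers,
    prefer the client's region, then pick the least-loaded one."""
    healthy = [s for s in servers if s["healthy"]]
    local = [s for s in healthy if s["region"] == client_region] or healthy
    return min(local, key=lambda s: s["load"])["ip"]

servers = [
    {"ip": "10.0.0.1", "region": "us-east", "load": 0.9, "healthy": True},
    {"ip": "10.0.0.2", "region": "us-east", "load": 0.2, "healthy": True},
    {"ip": "10.0.0.3", "region": "eu-west", "load": 0.1, "healthy": True},
]
print(pick_edge("us-east", servers))  # 10.0.0.2
```

Anycast needs no such logic at the application layer: every node advertises the same IP prefix, and BGP route selection does the choosing.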

Caching: Content Types

CDNs cache different types of content with different strategies. The main cases are:

CDN Overlay Network

When an edge server must contact the origin, the CDN routes traffic through its own overlay network rather than the public internet. Nodes continuously measure latency and packet loss to their peers and select paths based on measured performance rather than BGP routing policy.

Security Benefits

A CDN shields the origin’s real IP address from the public internet, so attack traffic hits the CDN’s distributed infrastructure rather than the origin. TLS termination at the edge reduces handshake latency and offloads cryptographic work from the origin.

BitTorrent: Peer-to-Peer Content Delivery

BitTorrent inverts the CDN model: every downloader becomes an uploader, so as more peers join, supply grows automatically. The protocol works through the following mechanisms:

.torrent file
Contains file metadata and cryptographic hashes for each piece.
Tracker
A central server that maintains the list of peers in the swarm; often supplemented or replaced by DHT in modern implementations.
Pieces
Fixed-size chunks of the file, each independently verified by hash.
Rarest-first
Peers preferentially download the pieces that the fewest other peers currently have, ensuring rare pieces spread quickly through the swarm.
DHT
A decentralized alternative to the tracker; peer-list information is distributed across the swarm using a Kademlia-based protocol, conceptually similar to Chord.

BitTorrent is not well-suited for streaming because pieces are downloaded out of order, it requires upload bandwidth from clients, and it depends on community participation.
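Two of these mechanisms, per-piece hash verification and rarest-first selection, can be sketched directly (a simplification: real clients also handle choking, endgame mode, and sub-piece requests):

```python
import hashlib
from collections import Counter

def verify_piece(data, expected_sha1):
    """Each downloaded piece is checked against its hash from the .torrent file."""
    return hashlib.sha1(data).hexdigest() == expected_sha1

def rarest_first(needed, peer_bitfields):
    """Pick the needed piece held by the fewest peers in the swarm."""
    counts = Counter()
    for bitfield in peer_bitfields:
        counts.update(i for i in bitfield if i in needed)
    return min(needed, key=lambda piece: counts[piece])

# Piece 2 is held by only one peer, so it is fetched first: once more
# peers have it, the swarm no longer depends on a single holder.
peers = [{0, 1}, {0, 1}, {0, 2}]
print(rarest_first({0, 1, 2}, peers))  # 2
```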

Edge Computing

Edge computing runs application logic on CDN edge nodes rather than at the origin, reducing round-trip latency for dynamic operations. Cloudflare Workers runs JavaScript in V8 isolates, which are lightweight sandboxes that provide memory isolation between concurrent workers. Workers can handle authentication, routing, personalization, and similar tasks at the edge, sometimes returning a response directly and sometimes modifying the request before forwarding it to the origin.

The key constraint is state. Reaching a central database from an edge node can add enough latency to erase the benefit, so edge platforms provide local data stores for low-latency state. Complex transactional logic stays at the origin. Edge compute is a complement to the origin, not a replacement.

Key Comparisons

Kafka vs. RabbitMQ: Kafka is a durable, replayable log where consumers track their own position; RabbitMQ routes messages to queues via exchanges and deletes them after acknowledgment. Kafka scales out via partitioned logs and supports replay; RabbitMQ offers more flexible routing but cannot replay messages and is not designed for partition-based scale-out.

CDN vs. BitTorrent: CDNs are centrally operated, commercially provisioned, and predictably performant; BitTorrent is decentralized, requires no infrastructure investment, and scales with the number of participants.

DNS routing vs. anycast: DNS routing selects a server at resolution time based on observed conditions; anycast routes based on network routing state rather than DNS lookup. Many CDNs use both.

What You Don’t Need to Study

Focus on architectural concepts and how the systems compare. You do not need to memorize:

