pk.org: CS 417/Lecture Notes

Data In Motion

Study Guide

Paul Krzyzanowski – 2026-04-14

Part 1: Message Queues and Event Streaming

The Publish-Subscribe Model

A message broker decouples producers from consumers: neither side needs to know anything about the other, they do not need to be running at the same time, and they do not need to operate at the same speed. This last point matters: a producer can generate messages faster than consumers can process them, and the broker absorbs the difference. Producers write messages to the broker; consumers read from it. Messages are organized by topic (a named category or stream).
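The decoupling can be sketched with a toy in-memory broker (the names here are invented for illustration; a real broker adds persistence, acknowledgments, and network transport):

```python
from collections import defaultdict, deque

class Broker:
    """Toy in-memory broker: producers and consumers share only a topic name."""
    def __init__(self):
        self.topics = defaultdict(deque)    # topic -> buffered messages

    def publish(self, topic, message):
        self.topics[topic].append(message)  # producer never sees consumers

    def consume(self, topic):
        q = self.topics[topic]
        return q.popleft() if q else None   # consumer reads at its own pace

broker = Broker()
broker.publish("orders", {"id": 1})
broker.publish("orders", {"id": 2})  # broker absorbs the burst
print(broker.consume("orders"))      # {'id': 1}
```

Note that the producer returns immediately after publish(): the broker's buffer is what lets the two sides run at different speeds.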

Delivery Semantics

Three delivery guarantees apply to messaging systems, the same ones that apply to RPC: at-most-once (a message may be lost but is never delivered twice), at-least-once (a message is retried until acknowledged, so duplicates are possible), and exactly-once (every message takes effect once, which in practice requires at-least-once delivery combined with duplicate detection or idempotent processing).

RabbitMQ

RabbitMQ is a message broker where routing logic lives entirely in the broker. Producers publish messages to an exchange, which routes them to one or more queues based on configured rules. Once a consumer acknowledges a message, the broker removes it. This model works well for task queues and routing-heavy workflows, but messages cannot be replayed and RabbitMQ is not designed for Kafka-style scale-out.

Apache Kafka

Kafka treats the log as the central architectural primitive: an append-only, totally ordered sequence of records that persists for a configurable retention period. Unlike a traditional queue, messages are not deleted after consumption. The following core concepts are essential:

Topic
A named log, divided into partitions.
Partition
An ordered log that grows only by appending; records are never modified once written. Each record is identified by an offset.
Ordering guarantee
Total order is per-partition only. To preserve order for related events, route them to the same partition using a consistent key.
Offset
A sequential integer that identifies a record’s position in a partition; each consumer group independently tracks the offset it has read up to.
Consumer group
A set of consumers that collectively consume a topic; each partition is assigned to one consumer at a time. Within a group this gives a queuing model; across independent groups it gives a pub-sub model.
Leader/follower replication
Each partition has one leader (handles reads and writes) and zero or more follower replicas. If the leader fails, a follower is elected.
Log compaction
An alternative to time- or size-based retention. Kafka retains only the most recent record per key, making the log a durable store of current state.
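Two of these ideas, key-based partition routing and log compaction, can be sketched in a few lines (the hash function and data layout are simplified illustrations, not Kafka's actual implementation, which uses murmur2 for its default partitioner):

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative partition count

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Route related events to the same partition by hashing a consistent key."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# All events with the same key land in one partition, preserving their order.
assert partition_for("user-42") == partition_for("user-42")

def compact(log):
    """Log compaction: keep only the most recent record per key."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)      # later offsets overwrite earlier ones
    return sorted(latest.values())         # surviving records, in offset order

log = [("k1", "a"), ("k2", "b"), ("k1", "c")]
print(compact(log))  # [(1, 'b'), (2, 'c')]
```

After compaction, replaying the log from the start reconstructs the current value for every key, which is what makes it usable as a durable store of current state.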

Producers control durability via the acks setting: acks=0 means fire and forget; acks=1 means the leader acknowledges on write; acks=all means all in-sync replicas must acknowledge before the producer receives confirmation.
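As a sketch under simplified assumptions (a real broker tracks the in-sync replica set dynamically), the three acks levels amount to a small decision rule:

```python
def producer_ack_ready(acks, leader_written, isr_acked, isr_size):
    """Return True when the producer may treat the send as successful.
    acks: "0", "1", or "all"; isr_acked counts in-sync replicas
    (including the leader) that have the record."""
    if acks == "0":
        return True                   # fire and forget: never wait
    if acks == "1":
        return leader_written         # leader has written it to its log
    if acks == "all":
        return isr_acked == isr_size  # every in-sync replica has it
    raise ValueError(acks)

print(producer_ack_ready("all", True, 2, 3))  # False: one ISR member lags
```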

Kafka is fast despite writing to disk because it relies entirely on sequential I/O, which is orders of magnitude faster than random I/O, and it exploits the OS page cache aggressively.

Stream Processing

Backpressure is the problem that arises when producers generate data faster than consumers can handle it. Systems address it in three main ways: buffering (absorbing bursts in a queue, which is what Kafka does by design), dropping (discarding messages when the buffer is full, only acceptable for loss-tolerant data), and slowing the producer (explicit flow control, which is backpressure in its strict sense).
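The three responses map directly onto a bounded queue; this sketch uses Python's standard queue module to show the first two and describes the third:

```python
import queue

buf = queue.Queue(maxsize=2)  # small buffer to force the issue

# Buffering: the queue absorbs a short burst.
buf.put("e1")
buf.put("e2")

# Dropping: when the buffer is full, a non-blocking put fails, and we can
# choose to discard the message (acceptable only for loss-tolerant data).
try:
    buf.put_nowait("e3")
    dropped = None
except queue.Full:
    dropped = "e3"

# Slowing the producer: a blocking buf.put("e3") would instead make the
# producer wait until a consumer calls get() -- backpressure in the
# strict sense.
print(buf.qsize(), dropped)  # 2 e3
```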

Event time is when an event occurred; processing time is when the system received it. Using event time gives correct results for time-based aggregations, while processing time is easier to implement but inaccurate when data arrives late or out of order.

A window defines how events are grouped for aggregation over time. Stream processors support three main window types: tumbling windows (fixed-size, non-overlapping intervals), sliding windows (fixed-size intervals that overlap, advancing by a smaller step), and session windows (groups of events separated by less than a gap timeout, so the boundaries depend on the data rather than the clock).
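As an illustrative sketch, assigning an event to tumbling or sliding windows reduces to simple arithmetic on integer event timestamps (session windows are omitted here because their boundaries depend on the data):

```python
def tumbling_window(event_ts, size):
    """The fixed, non-overlapping window [start, end) containing the event."""
    start = (event_ts // size) * size
    return (start, start + size)

def sliding_windows(event_ts, size, step):
    """All overlapping windows of width `size`, whose starts are multiples
    of `step`, that contain the event."""
    lo = event_ts - size + 1        # smallest possible window start
    first = -(-lo // step) * step   # round up to a multiple of step
    return [(s, s + size) for s in range(first, event_ts + 1, step)]

print(tumbling_window(7, 5))      # (5, 10)
print(sliding_windows(7, 10, 5))  # [(0, 10), (5, 15)]
```

An event falls in exactly one tumbling window but in size/step sliding windows, which is why sliding aggregations cost proportionally more state.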

A watermark is the system’s estimate of how far event time has progressed. Events with timestamps earlier than the watermark are considered unlikely to still arrive, and the system uses the watermark to decide when to close a window and emit results. The stream processor derives it by taking the maximum event timestamp seen so far and subtracting a configured lag. The lag is a tradeoff: too small and late-arriving events are dropped; too large and result latency and memory use increase.
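A minimal sketch of this logic, assuming integer event timestamps:

```python
def watermark(max_event_ts_seen, allowed_lag):
    """Estimate of event-time progress: events with timestamps below this
    are presumed to have already arrived (or will be treated as late)."""
    return max_event_ts_seen - allowed_lag

def window_closed(window_end, max_event_ts_seen, allowed_lag):
    """A window [start, end) closes once the watermark passes its end."""
    return watermark(max_event_ts_seen, allowed_lag) >= window_end

# With a lag of 5, the window [0, 10) closes only after an event
# timestamped 15 or later has been seen.
print(window_closed(10, 14, 5))  # False
print(window_closed(10, 15, 5))  # True
```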

Spark Structured Streaming

Spark Structured Streaming uses a micro-batch model: events are collected into small batches and processed using the standard Spark API. The stream is treated as an unbounded table that grows over time. Event-time windows and watermarks are supported.

Spark provides three output modes for writing results: append writes only newly completed rows; complete rewrites the full result table on each trigger; update writes only rows that changed since the last trigger.

Exactly-once semantics require checkpointing plus an idempotent or transactional sink. Checkpointing alone guarantees at-least-once: no events are skipped on recovery, but events after the last checkpoint may be reprocessed. When the source supports offset-based replay and the sink makes reprocessed writes harmless, the combination yields an exactly-once effect, but only under those specific conditions.
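The sink's role can be sketched by keying writes on (partition, offset), so a post-crash replay overwrites rather than duplicates (a simplification of what real idempotent sinks do):

```python
class IdempotentSink:
    """Sink keyed by (partition, offset): replaying after a crash cannot
    duplicate results, turning at-least-once into an exactly-once effect."""
    def __init__(self):
        self.rows = {}

    def write(self, partition, offset, value):
        self.rows[(partition, offset)] = value  # overwrite == no-op on replay

sink = IdempotentSink()
for offset, value in enumerate(["a", "b", "c"]):
    sink.write(0, offset, value)

# Crash before the checkpoint advances: the source replays offsets 1 and 2.
for offset, value in [(1, "b"), (2, "c")]:
    sink.write(0, offset, value)

print(len(sink.rows))  # 3, not 5: the replay duplicated nothing
```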

Apache Flink

Apache Flink is designed around continuous record-at-a-time streaming rather than Spark’s micro-batch model, giving lower latency at the cost of higher operational complexity.

Part 2: Content Delivery Networks

The Flash Crowd Problem

A flash crowd occurs when a sudden surge in demand overwhelms a single origin server. CDNs solve this by distributing cached copies of content globally so requests are served by nearby servers rather than the origin.

Pre-CDN Approaches and Their Limits

Before CDNs, operators tried several techniques to handle load, each with significant limitations:

CDN Architecture

A CDN has three tiers: edge servers (close to users, often inside ISPs), parent servers (regional aggregators), and the origin (the content provider’s infrastructure). The tiered lookup reduces origin load because popular content is served entirely from caches.

CDNs come in two operational models. A push CDN requires the content provider to pre-position content on storage nodes before demand arrives, which is suitable for large files such as software packages or video assets. A pull CDN has edge servers fetch from the origin on the first cache miss and then cache the result, which is simpler to operate and works well for general web assets.
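The pull model is essentially fetch-on-miss caching; a minimal sketch, ignoring TTLs and Cache-Control headers:

```python
origin_fetch_count = 0

def origin_fetch(path):
    """Stand-in for a request to the origin server."""
    global origin_fetch_count
    origin_fetch_count += 1
    return f"content of {path}"

edge_cache = {}

def edge_get(path):
    """Pull CDN edge: fetch from the origin on the first miss,
    then serve every later request from the local cache."""
    if path not in edge_cache:
        edge_cache[path] = origin_fetch(path)
    return edge_cache[path]

edge_get("/logo.png")
edge_get("/logo.png")
print(origin_fetch_count)  # 1: only the first request reached the origin
```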

CDN Providers

There are several high-volume CDN providers. Three popular ones are:

Request Routing

CDNs use two main approaches to direct users to the nearest edge server. With DNS-based routing, the content provider creates a CNAME record pointing to the CDN, and the CDN’s dynamic DNS servers return different IP addresses based on user location, server load, and server health.

With anycast routing, all CDN nodes share the same IP address, and the connection is directed to the nearest advertising node based on network routing state rather than a DNS lookup. Many CDNs combine both approaches.
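The DNS-side decision can be sketched as choosing among healthy edge servers by region and then load (the field names here are invented for illustration; real CDNs weigh many more signals):

```python
def pick_edge(client_region, servers):
    """Sketch of a CDN's dynamic DNS answer: among healthy edge servers,
    prefer the client's region, then pick the least-loaded one."""
    healthy = [s for s in servers if s["healthy"]]
    local = [s for s in healthy if s["region"] == client_region] or healthy
    return min(local, key=lambda s: s["load"])["ip"]

servers = [
    {"ip": "10.0.0.1", "region": "us-east", "load": 0.9, "healthy": True},
    {"ip": "10.0.0.2", "region": "us-east", "load": 0.2, "healthy": True},
    {"ip": "10.0.0.3", "region": "eu-west", "load": 0.1, "healthy": True},
]
print(pick_edge("us-east", servers))  # 10.0.0.2
```

Anycast needs no such logic at the application layer: every node advertises the same IP prefix, and BGP route selection does the choosing.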

Caching: Content Types

CDNs cache different types of content with different strategies. The main cases are:

CDN Overlay Network

When an edge server must contact the origin, the CDN routes traffic through its own overlay network rather than the public internet. Nodes continuously measure latency and packet loss to their peers and select paths based on measured performance rather than BGP routing policy.

Security Benefits

A CDN shields the origin’s real IP address from the public internet, so attack traffic hits the CDN’s distributed infrastructure rather than the origin. TLS termination at the edge reduces handshake latency and offloads cryptographic work from the origin.

BitTorrent: Peer-to-Peer Content Delivery

BitTorrent inverts the CDN model: every downloader becomes an uploader, so as more peers join, supply grows automatically. The protocol works through the following mechanisms:

.torrent file
Contains file metadata and cryptographic hashes for each piece.
Tracker
A central server that maintains the list of peers in the swarm; often supplemented or replaced by DHT in modern implementations.
Pieces
Fixed-size chunks of the file, each independently verified by hash.
Rarest-first
Peers preferentially download the pieces that the fewest other peers currently have, ensuring rare pieces spread quickly through the swarm.
DHT
A decentralized alternative to the tracker; peer-list information is distributed across the swarm using a Kademlia-based protocol, conceptually similar to Chord.

BitTorrent is not well-suited for streaming because pieces are downloaded out of order, it requires upload bandwidth from clients, and it depends on community participation.
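Two of these mechanisms, per-piece hash verification and rarest-first selection, can be sketched directly (a simplification: real clients also handle choking, endgame mode, and sub-piece requests):

```python
import hashlib
from collections import Counter

def verify_piece(data, expected_sha1):
    """Each downloaded piece is checked against its hash from the .torrent file."""
    return hashlib.sha1(data).hexdigest() == expected_sha1

def rarest_first(needed, peer_bitfields):
    """Pick the needed piece held by the fewest peers in the swarm."""
    counts = Counter()
    for bitfield in peer_bitfields:
        counts.update(i for i in bitfield if i in needed)
    return min(needed, key=lambda piece: counts[piece])

# Piece 2 is held by only one peer, so it is fetched first: once more
# peers have it, the swarm no longer depends on a single holder.
peers = [{0, 1}, {0, 1}, {0, 2}]
print(rarest_first({0, 1, 2}, peers))  # 2
```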

Edge Computing

Edge computing runs application logic on CDN edge nodes rather than at the origin, reducing round-trip latency for dynamic operations. Cloudflare Workers runs JavaScript in V8 isolates, which are lightweight sandboxes that provide memory isolation between concurrent workers. Workers can handle authentication, routing, personalization, and similar tasks at the edge, sometimes returning a response directly and sometimes modifying the request before forwarding it to the origin.

The key constraint is state. Reaching a central database from an edge node can add enough latency to erase the benefit, so edge platforms provide local data stores for low-latency state. Complex transactional logic stays at the origin. Edge compute is a complement to the origin, not a replacement.

Key Comparisons

Kafka vs. RabbitMQ: Kafka is a durable, replayable log where consumers track their own position; RabbitMQ routes messages to queues via exchanges and deletes them after acknowledgment. Kafka scales out via partitioned logs and supports replay; RabbitMQ offers more flexible routing but cannot replay messages and is not designed for partition-based scale-out.

CDN vs. BitTorrent: CDNs are centrally operated, commercially provisioned, and predictably performant; BitTorrent is decentralized, requires no infrastructure investment, and scales with the number of participants.

DNS routing vs. anycast: DNS routing selects a server at resolution time based on observed conditions; anycast routes based on network routing state rather than DNS lookup. Many CDNs use both.

What You Don’t Need to Study

Focus on architectural concepts and how the systems compare. You do not need to memorize:

