Distributed Systems Foundations and Communication

Core distributed systems concepts

Distributed system: A collection of independent computers connected by a network that coordinate to accomplish a common goal.
Autonomous computer: A computer that has its own processor, memory, operating system, and clock, and operates independently of others.
Single system image: An abstraction in which a distributed system appears to users as a single coherent system, hiding the complexity of distribution.
Message passing: Explicit communication between processes using network messages rather than shared memory.
Partial failure: A failure mode in which some components fail while others continue operating.
All-or-nothing failure: A failure mode in which the entire system stops functioning when a failure occurs.
Single point of failure: A component whose failure causes the entire system to fail.
Horizontal scaling: Increasing system capacity by adding more machines and distributing work across them.
Vertical scaling: Increasing system capacity by adding resources to a single machine.

Laws and principles

Moore’s Law: The historical observation that transistor counts, and thus computing capacity, roughly doubled every 18 to 24 months.
Amdahl’s Law: A principle stating that the speedup from parallelism is limited by the portion of a task that must remain sequential.
Metcalfe’s Law: The idea that the value of a network grows roughly with the square of the number of its participants.
End-to-end principle: A network design principle that places functionality such as reliability and security at the communicating endpoints rather than in the network.
Fate sharing: The principle that communication state should reside at the endpoints that use it, so that the state is lost only if the endpoint itself fails.
Best-effort delivery: A network service model in which the network attempts to deliver each packet but does not guarantee that packets arrive, arrive in order, or arrive within a fixed time.
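
Amdahl's Law is easier to reason about with numbers. The following is a minimal Python sketch of the usual formula, speedup = 1 / (s + p/n), where p is the fraction of the task that can run in parallel, s = 1 - p is the sequential fraction, and n is the number of workers. The function name and parameters are illustrative and do not come from the course material.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup predicted by Amdahl's Law for a task spread over `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # The sequential part takes the same time no matter how many workers exist.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # With 90% of the work parallelizable, the speedup approaches but never
    # reaches 1 / 0.1 = 10, however many workers are added.
    for n in (1, 10, 100, 1000):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

The sequential fraction sets a hard ceiling of 1/s on the speedup, which is why horizontal scaling alone cannot make a mostly sequential task fast.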

Transparency

Transparency: The design goal of hiding the fact that resources are distributed across multiple computers, allowing users and applications to interact with the system as if it were a single machine.

Failure models

Fail-stop failure: A failure in which a component halts execution and produces no further output, and the failure can be detected.
Fail-silent failure: A failure in which a component produces no output, but other components cannot reliably distinguish failure from delay.
Fail-restart failure: A failure in which a component crashes and later restarts, possibly with lost or stale state.
Stale state: Outdated information held by a component that has missed updates while it was unavailable.
Network partition: A failure that divides a system into disconnected groups that cannot communicate.
Partition: Short for network partition.
Byzantine failure: A failure in which a component continues running but does not follow the system specification, producing incorrect or inconsistent behavior.

Fault tolerance and availability

Fault tolerance: The ability of a system to continue operating correctly despite component failures.
Redundancy: The use of multiple components to tolerate failures and improve availability.
Availability: The fraction of time a system is usable from a client’s perspective.
Nines: A way of expressing availability as a count of nines (e.g., “five nines” means 99.999% availability).
Reliability: A measure of how long a system or component operates correctly before it fails, often expressed as time to failure.
Series system: A system structure in which failure of any component causes system failure.
Parallel system: A system structure in which the system continues operating as long as some components remain functional.
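
The availability terms above can be made concrete with a little arithmetic. The sketch below assumes that components fail independently, which real systems often violate, and that each component's availability is given as a fraction between 0 and 1. All function names are illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def series_availability(component_availabilities):
    """Series system: every component must be up, so availabilities multiply."""
    return math.prod(component_availabilities)


def parallel_availability(component_availabilities):
    """Parallel system: the system is down only if every component is down."""
    return 1.0 - math.prod(1.0 - a for a in component_availabilities)


def downtime_minutes_per_year(nines: int) -> float:
    """Expected yearly downtime for an availability written as a count of nines."""
    unavailability = 10.0 ** -nines  # e.g. five nines -> 0.00001
    return unavailability * MINUTES_PER_YEAR


if __name__ == "__main__":
    two_components = [0.99, 0.99]
    print(series_availability(two_components))    # lower than either component
    print(parallel_availability(two_components))  # higher than either component
    print(downtime_minutes_per_year(5))           # a little over five minutes
```

Under the independence assumption, chaining components in series lowers availability, while redundancy in parallel raises it.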

Network timing models

Synchronous network: A network model with a known upper bound on message delivery time.
Partially synchronous network: A network model where an upper bound on message delivery exists but is not known in advance.
Asynchronous network: A network model where messages can take arbitrarily long to arrive, with no upper bound on delivery time.
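
The timing model determines what a timeout can tell you. The following is a minimal sketch of a heartbeat-based failure detector; the class and method names are hypothetical, and timestamps are passed in explicitly so the example stays deterministic. In a synchronous network, a timeout longer than the known delivery bound identifies real failures. In an asynchronous network, the same timeout can only mark a node as suspected, because a slow message looks the same as a missing one.

```python
class TimeoutFailureDetector:
    """Suspects any node whose last heartbeat is older than the timeout."""

    def __init__(self, timeout_seconds: float):
        self.timeout_seconds = timeout_seconds
        self.last_heard = {}  # node name -> time of most recent heartbeat

    def record_heartbeat(self, node: str, now: float) -> None:
        self.last_heard[node] = now

    def suspected(self, now: float) -> set:
        # "Suspected" rather than "failed": without a delivery bound, a late
        # heartbeat cannot be told apart from one that will never arrive.
        return {
            node
            for node, heard_at in self.last_heard.items()
            if now - heard_at > self.timeout_seconds
        }


if __name__ == "__main__":
    detector = TimeoutFailureDetector(timeout_seconds=2.0)
    detector.record_heartbeat("a", now=0.0)
    detector.record_heartbeat("b", now=0.0)
    detector.record_heartbeat("a", now=3.0)
    print(detector.suspected(now=4.0))  # only "b" has been silent too long
    detector.record_heartbeat("b", now=4.5)  # "b" was slow, not dead
    print(detector.suspected(now=5.0))
```

The late heartbeat from "b" shows why fail-silent failures are hard to handle: the detector's earlier suspicion was wrong, and it had no means of knowing so at the time.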

Service architectures

Client-server model: An architecture where clients send requests to servers, which process them and return responses.
Multi-tier architecture: An architecture that separates functionality into layers, each handling a specific concern and communicating with adjacent layers.
Microservices architecture: An architecture that decomposes an application into small, autonomous services with well-defined interfaces.
Peer-to-peer (P2P): An architecture where all participants communicate directly with each other without relying on a central server.
Hybrid P2P model: A peer-to-peer architecture that uses a central server for coordination while peers handle data transfer.
Worker pool: A collection of computing resources that can be assigned to tasks on demand; also called a processor pool or compute cluster.

Cloud computing

Cloud computing: The delivery of computing resources as a network service, so that customers do not need to own physical infrastructure.
Infrastructure as a Service (IaaS): Cloud services providing virtual machines, storage, and networking, with the customer controlling the operating system.
Platform as a Service (PaaS): Cloud services providing a managed environment for running applications, with the provider managing the operating system and runtime.
Software as a Service (SaaS): Cloud services providing complete applications over the network, with the customer managing no infrastructure.

Networking fundamentals

Packet switching: A networking approach in which data is divided into packets that are routed independently through the network.
Layered architecture: A design approach that separates networking functionality into layers with well-defined responsibilities.
OSI model: A conceptual seven-layer model used to describe and reason about network protocol design.
Data link layer: The layer responsible for communication on a single physical network.
Network layer: The layer responsible for routing packets between machines across networks.
Transport layer: The layer responsible for process-to-process communication.

Internet and IP networking

Internet Protocol (IP): A network-layer protocol that provides connectionless, best-effort delivery of packets between machines.
Datagram: An independent packet of data sent over a network without guarantees of delivery or ordering.
Port: A transport-layer identifier used to deliver data to the correct process on a machine.

Transport protocols and sockets

Transmission Control Protocol (TCP): A transport protocol that provides reliable, ordered, congestion-controlled byte-stream communication.
User Datagram Protocol (UDP): A transport protocol that provides connectionless, best-effort datagram delivery with minimal overhead.
Head-of-line blocking: A delay that occurs when later data must wait for earlier data to be delivered in order.
Socket: An operating system abstraction that provides an interface for network communication.
Connection-oriented communication: Communication that involves explicit connection setup and teardown.
Connectionless communication: Communication in which messages are sent independently without establishing a connection.
QUIC: A transport protocol built on UDP, typically implemented in user space, that provides reliable, multiplexed communication.
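
The contrast between connectionless and connection-oriented communication shows up directly in the socket interface. The sketch below uses Python's standard socket module over the loopback address. The function names are illustrative, and both helpers are meant for small payloads only, since the sender and receiver run in a single thread.

```python
import socket


def udp_exchange(payload: bytes) -> bytes:
    """Connectionless: one datagram, no setup or teardown."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0 asks the OS for a free port
        receiver.settimeout(2.0)         # best-effort delivery: never wait forever
        sender.sendto(payload, receiver.getsockname())
        data, _sender_address = receiver.recvfrom(65535)
        return data


def tcp_exchange(payload: bytes) -> bytes:
    """Connection-oriented: explicit setup, a byte stream, explicit teardown."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        listener.settimeout(2.0)
        # Connection setup happens here, before any data is sent.
        with socket.create_connection(listener.getsockname(), timeout=2.0) as client:
            connection, _client_address = listener.accept()
            with connection:
                connection.settimeout(2.0)
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)  # tell the peer the stream has ended
                chunks = []
                while True:
                    # TCP delivers a byte stream, so one send may arrive in pieces.
                    chunk = connection.recv(4096)
                    if not chunk:
                        break
                    chunks.append(chunk)
                return b"".join(chunks)


if __name__ == "__main__":
    print(udp_exchange(b"ping"))
    print(tcp_exchange(b"hello, stream"))
```

The port numbers returned by getsockname() are what let the operating system deliver each message to the right socket. The read loop in tcp_exchange exists because TCP preserves byte order but not message boundaries.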

Data placement

Replication: The creation of multiple authoritative copies of data to improve availability and fault tolerance.
Caching: The storage of temporary copies of data to reduce latency and load, potentially serving stale data.
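
The risk of serving stale data from a cache can be shown in a few lines. This is a minimal sketch with a hypothetical ReadThroughCache class sitting in front of a plain dictionary that stands in for the authoritative store. It is not a description of any particular caching system.

```python
class ReadThroughCache:
    """Keeps temporary copies of values read from a backing store."""

    def __init__(self, backing_store: dict):
        self.backing_store = backing_store
        self.entries = {}

    def get(self, key):
        # On a miss, fetch from the backing store and remember the copy.
        if key not in self.entries:
            self.entries[key] = self.backing_store[key]
        return self.entries[key]

    def invalidate(self, key) -> None:
        # Drop the cached copy so the next read goes back to the store.
        self.entries.pop(key, None)


if __name__ == "__main__":
    store = {"x": 1}
    cache = ReadThroughCache(store)
    print(cache.get("x"))  # first read is a miss and fills the cache
    store["x"] = 2         # the authoritative copy changes
    print(cache.get("x"))  # the cache still returns the old value: stale data
    cache.invalidate("x")
    print(cache.get("x"))  # after invalidation the new value is visible
```

A replica, by contrast, is an authoritative copy that must be kept up to date, whereas a cached copy can be discarded at any time without losing data.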
