Exam 3 study guide
The one-hour study guide for exam 3
April 26, 2009
Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don't take the one hour time window in the title literally.
Distributed transactions
A key facet of a transaction is that it has to be atomic — all results have to be made permanent (commit) and appear as an indivisible action. If a transaction cannot complete, it must abort, reverting the state of the system to that before it ran. If several transactions run concurrently, the end result must be the same as if they ran in some (any) serial order.
A write-ahead log (or transaction log) is used for rollback: reverting to the previous state when aborting a transaction. It is also crucial for maintaining state in a stable place in case the system should die; it allows the system to recover from where it was in the transaction.
The two-phase commit protocol uses atomic multicasts to reach a consensus among the group on whether to commit or abort. It uses a coordinator to send a request ("can you commit?") to every member of the group (reliably, retransmitting as often as needed until all replies are received). Phase 1 is complete when every member of the group responds. If the coordinator gets even a single abort response, it will have to tell all group members to abort the entire transaction. Otherwise, it can tell everybody to commit it. In phase 2, the coordinator sends the commit or abort order and waits for a response from everyone. The write-ahead log is crucial here to attain atomic multicasts (not for rollback!). For example, if a machine sent the coordinator a commit response for phase 1 and then died, it must be able to reboot and reconstruct the transaction state from the log; it cannot change its mind.
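As a rough illustration, here is a minimal Python sketch of the coordinator's two phases. The Participant class and its prepare/commit/abort methods are hypothetical stand-ins for real group members; an actual implementation would add network messages, timeouts, retransmission, and a write-ahead log so a rebooted participant can recover its vote.

```python
# Sketch of two-phase commit from the coordinator's point of view.
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit

    def prepare(self, txid):
        # Phase 1: a real participant logs its vote durably before replying.
        return self.can_commit

    def commit(self, txid):
        print(f"{self.name}: commit {txid}")

    def abort(self, txid):
        print(f"{self.name}: abort {txid}")

def two_phase_commit(participants, txid):
    votes = [p.prepare(txid) for p in participants]   # phase 1: collect votes
    decision = "commit" if all(votes) else "abort"    # a single "no" aborts everyone
    for p in participants:                            # phase 2: send the decision
        if decision == "commit":
            p.commit(txid)
        else:
            p.abort(txid)
    return decision

print(two_phase_commit([Participant("A"), Participant("B")], "tx42"))
```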
Concurrency control
The goal of concurrency control is to allow multiple transactions to run concurrently but to ensure that data access is scheduled such that the net effect is the same as if the transactions had all run in some serial order. That is, we cannot have the net result be one where a transaction read an interim state of data from another transaction.
Exclusive locks, via a lock manager, are an easy way to accomplish this. However, to ensure serializability, it is important that a transaction does not acquire any new locks after it has released a lock. This is known as two-phase locking. The first phase is the growing phase, in which locks are acquired. The second phase is the shrinking phase, in which locks are released. The problem with two-phase locking is that, if a transaction that has released some locks aborts, there is a chance that other transactions have used the data held by the released lock. In that case, those transactions (and all transactions that depend on them) have to abort as well. Strict two-phase locking avoids this problem by requiring all locks to be held until the end. The shrinking phase, in effect, is an atomic operation that occurs at the very end of the transaction.
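The following toy Python sketch (the LockManager and Transaction classes are made up for illustration) shows the essence of strict two-phase locking: locks are acquired on demand during the growing phase and all released together at commit.

```python
# Toy strict two-phase locking: acquire as you go, release everything at the end.
class LockManager:
    def __init__(self):
        self.owners = {}                       # item -> transaction holding its lock

    def acquire(self, tx, item):
        if self.owners.get(item, tx) is not tx:
            raise RuntimeError(f"{item} is locked by another transaction")
        self.owners[item] = tx

    def release_all(self, tx):
        for item in [i for i, o in self.owners.items() if o is tx]:
            del self.owners[item]

class Transaction:
    def __init__(self, lm):
        self.lm = lm

    def read(self, item):
        self.lm.acquire(self, item)            # growing phase: take locks as needed

    def write(self, item):
        self.lm.acquire(self, item)

    def commit(self):
        self.lm.release_all(self)              # shrinking phase happens all at once

lm = LockManager()
t = Transaction(lm)
t.read("x"); t.write("y")
t.commit()
```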
Secure communication
To communicate securely using a symmetric cipher, both parties need to have a shared secret key. Alice will encode a message to Bob using the key and Bob will use the key to decode the message. If Alice wants to communicate with Charles, she and Charles will also need a secret key. The fact that every pair of entities will need a secret key leads to a phenomenon known as key explosion. Overall, in a system with n users, there will be O(n²) keys.
The biggest problem with symmetric cryptography is dealing with key distribution: how can Alice and Bob establish a key so they can communicate securely? The Diffie-Hellman exponential key exchange algorithm allows one to do this. Each party will generate a private key and a public key (these are not encryption keys; they're just numbers – Diffie-Hellman does not implement public key cryptography – it is unfortunate that the word was used to describe these numbers). It uses the one-way function a^b mod c in a way that allows Alice to compute a common key using her private key and Bob's public key. Bob can compute the same common key by using his private key and Alice's public key.
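A small numeric sketch of the exchange, using toy parameters rather than the large primes a real deployment would use:

```python
# Toy Diffie-Hellman exchange: both sides end up with the same shared secret.
import secrets

p = 0xFFFFFFFB            # toy prime modulus (public); real systems use 2048+ bits
g = 5                     # public base

a = secrets.randbelow(p)  # Alice's private value
b = secrets.randbelow(p)  # Bob's private value

A = pow(g, a, p)          # Alice's public value, sent to Bob
B = pow(g, b, p)          # Bob's public value, sent to Alice

alice_key = pow(B, a, p)  # Alice combines Bob's public value with her private value
bob_key   = pow(A, b, p)  # Bob combines Alice's public value with his private value
assert alice_key == bob_key   # both sides now share the same secret
```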
Using public key cryptography, such as RSA, if Alice encrypts a message with Bob's public key, Bob will be the only one who can decrypt it since doing so will require Bob's private key. Likewise, Bob can encrypt messages with Alice's public key, knowing that only Alice will be able to decrypt them with her private key.
Session keys
A session key is a random key that is generated for encryption during one communication session. It is useful because if the key is ever compromised, no lasting information is obtained: future communication sessions will use different keys. A hybrid cryptosystem uses public key cryptography to send a session key securely. The originator generates a random session key and encrypts it with the recipient's public key. The recipient decrypts the message with the corresponding private key to extract the session key. After that, symmetric cryptography is used for communication, with messages encrypted with the session key. This has the advantages of higher performance (public key cryptography is a lot slower than symmetric cryptography), ease of communicating with multiple parties (just encrypt the session key with the public keys of each of the recipients), and the ability to encrypt the bulk of the data with session keys instead of the hardly-ever-changing public keys.
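Assuming the third-party Python cryptography package is available, a hybrid cryptosystem might be sketched roughly like this: RSA wraps a random session key, and a symmetric cipher (Fernet) carries the bulk of the data.

```python
# Sketch of a hybrid cryptosystem: RSA protects the session key,
# the symmetric session key protects the data.
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes
from cryptography.fernet import Fernet

recipient_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
recipient_public = recipient_private.public_key()

session_key = Fernet.generate_key()                      # random per-session key
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = recipient_public.encrypt(session_key, oaep)    # sent along with the data

ciphertext = Fernet(session_key).encrypt(b"the bulk of the message")

# Recipient side: unwrap the session key, then decrypt the data symmetrically.
key = recipient_private.decrypt(wrapped_key, oaep)
plaintext = Fernet(key).decrypt(ciphertext)
print(plaintext)
```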
Digital signatures
Digital signatures employing symmetric cryptography must turn to a trusted third party. Messages are encrypted with the owner's key and sent to the third party who decrypts the contents and re-encrypts them for the recipient. The trusted third party avoids the problem of key explosion where every pair of communicating parties must have a secret key.
With public key cryptography, a digital signature is simply the act of encrypting a hash of a message with the creator's private key. Anyone who has the public key can decrypt the hash and thus validate it against the message. Other parties cannot recreate the signature since they do not have the private key even though they can create the hash.
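A toy illustration of the idea, using textbook RSA with tiny made-up numbers (real signatures use large keys and padding schemes such as PSS):

```python
# "Sign = encrypt the hash with the private key" with a toy RSA key pair.
import hashlib

n, e, d = 3233, 17, 2753          # toy modulus and exponents (p=61, q=53)

def sign(message: bytes) -> int:
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(h, d, n)            # only the private exponent can produce this

def verify(message: bytes, signature: int) -> bool:
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == h   # anyone with the public key can check it

sig = sign(b"pay Bob $10")
print(verify(b"pay Bob $10", sig), verify(b"pay Bob $1000", sig))   # True False
```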
Authentication
The three factors of authentication are: something you have (such as a key or a card), something you know (such as a password or PIN), and something you are (biometrics). Combining these into a multi-factor authentication scheme can increase security against the chance that any one of the factors is compromised.
Password Authentication Protocol (PAP)
The classic authentication method is the use of reusable passwords. This is known as the password authentication protocol, or PAP. The system asks you to identify yourself (login name) and then enter a password. If the password matches that which is associated with the login name on the system then you’re authenticated.
One problem with the protocol is that if someone gets hold of the password file on the system, then they have all the passwords. The common way to thwart this is to store hashes of passwords instead of the passwords themselves. This is taking advantage of one-way functions. To authenticate a user, check if hash(password) = stored_hashed_password. If someone got hold of the password file, they’re still stuck since they won’t be able to reconstruct the original password from the hash. They’ll have to resort to an exhaustive search or a dictionary attack and search for a password that hashes to the value in the file.
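A minimal sketch of this check (real systems also add a per-user salt and a deliberately slow hash such as bcrypt or scrypt to make guessing attacks more expensive):

```python
# Store a hash of the password, never the password itself.
import hashlib

def store(password: str) -> str:
    return hashlib.sha256(password.encode()).hexdigest()

def authenticate(password: str, stored_hash: str) -> bool:
    return hashlib.sha256(password.encode()).hexdigest() == stored_hash

stored = store("correct horse battery staple")
print(authenticate("correct horse battery staple", stored))   # True
print(authenticate("password123", stored))                     # False
```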
The other problem with reusable passwords is that, if a network is insecure, an eavesdropper may sniff the password from the network. A potential intruder may also simply observe the user typing a password. To thwart this, we turn to one-time passwords. If someone sees you type a password (or gets it from the network stream), it won't matter because that password will be useless for future logins.
S/Key Authentication
S/Key authentication allows the use of one-time passwords by generating a list via one-way functions. The list is created such that password n is generated as f(password[n-1]). The list of passwords is used backwards. Given a password p, it is infeasible for an observer to compute the next valid password because doing so would require computing f⁻¹(p), the inverse of the one-way function f.
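A rough sketch of the idea, using SHA-256 as the one-way function and a made-up seed:

```python
# S/Key-style one-time passwords: build a hash chain, use it in reverse.
import hashlib

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

def make_chain(secret: bytes, n: int):
    chain = [secret]
    for _ in range(n):
        chain.append(h(chain[-1]))
    return chain

chain = make_chain(b"seed", 100)
server_value = chain[100]            # the server stores only the last element

def verify(candidate: bytes) -> bool:
    global server_value
    if h(candidate) == server_value: # hashing the new password yields the stored one
        server_value = candidate     # remember it for the next login
        return True
    return False

# The user reveals chain[99], then chain[98], ...; each value works only once.
print(verify(chain[99]), verify(chain[98]), verify(chain[98]))   # True True False
```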
SecurID®
RSA's SecurID is a two-factor authentication system that generates one-time passwords for response to a user login prompt. It relies on a user password (a personal identification number, PIN) and a token (an authenticator card or key fob). The token generates a new number every 30 seconds. The number is a function of a seed that is unique for each card and the time of day. To authenticate to a server, you send a combination of your PIN and the number from the token in lieu of a password. A legitimate remote system will have your PIN as well as the token seed and will be able to compute the same value to validate your password. An intruder would not know your PIN or the token’s seed and never sees them on the network.
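The sketch below captures the spirit of such a token (it is not RSA's actual algorithm): the displayed code is a hash of a per-card seed and the current 30-second time window, which the server can recompute for itself.

```python
# Time-based one-time code in the spirit of SecurID (illustrative only).
import hashlib, time

def token_code(seed: bytes) -> str:
    window = int(time.time() // 30)                # changes every 30 seconds
    digest = hashlib.sha256(seed + window.to_bytes(8, "big")).hexdigest()
    return str(int(digest, 16))[-6:]               # display a 6-digit code

seed = b"per-card-secret"                          # known to the card and the server
pin = "4711"                                       # known to the user and the server
print("login response:", pin + token_code(seed))   # PIN combined with the token code
```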
Challenge-Handshake Authentication Protocol (CHAP)
The Challenge-Handshake Authentication Protocol (CHAP) is an authentication protocol that allows a server to authenticate a user without sending a password over the network. Both the client and server share a secret (such as a password). The server sends a challenge (a string of random bits) to the client. The client creates a hash of the challenge and the password. Since the server has the challenge and the user's password as well, it can recreate the same hash to validate the user. A potential intruder would not be able to get this data from the network. Note that this technique requires passwords to be accessible at the server and the security rests on the password file remaining secure.
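A minimal sketch of the exchange, hashing the challenge together with the shared secret:

```python
# CHAP-style challenge-response: the password never crosses the network.
import hashlib, os

shared_secret = b"correct horse"        # known to both client and server

# Server side: issue a random challenge.
challenge = os.urandom(16)

# Client side: respond with a hash of the challenge and the secret.
response = hashlib.sha256(challenge + shared_secret).hexdigest()

# Server side: recompute the hash and compare.
expected = hashlib.sha256(challenge + shared_secret).hexdigest()
print("authenticated" if response == expected else "rejected")
```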
SKID2/3
SKID2 and SKID3 authentication introduce us to the use of a nonce, a random message (e.g., a number or string) that is used to challenge the other party: “prove to me that you can encrypt this.” If Bob wants to validate that Alice knows the shared secret key, he creates a nonce and asks Alice to encrypt it using the shared key. If Bob can decrypt the result with the key and get back the nonce that he sent her, then he is convinced that Alice really does know the secret key. SKID3 performs mutual authentication, where Alice validates Bob as well.
Public key authentication
The use of a nonce is also central to public key authentication (e.g., the kind used in SSL, the secure sockets layer). If I send you a nonce and you encrypt it with your private key and give me the results, I can decrypt that message using your public key. If the decryption matches the original nonce, this will convince me that only you could have encrypted the message.
Wide-Mouth Frog authentication
The wide-mouth frog technique uses symmetric cryptography and a trusted third party (who has everyone’s keys) for passing a random session key. If Alice wants to communicate with Bob, she picks a random session key, encrypts it with her own secret key, and sends it to the trusted third party, Trent. If Trent can decrypt it with Alice’s key then he knows the message must have come from Alice. He re-encrypts the session key with Bob’s secret key and sends it to Bob. Bob can then decrypt it with his key, knowing that the message was created by the trusted party since nobody else has his secret key.
Kerberos authentication
Kerberos is a trusted third party authentication and key exchange scheme using symmetric cryptography. When you want to access a service, you first need to ask Kerberos. If access is granted, you get two messages. One is encrypted with your secret key and contains the session key for your communication with the service. The other message is encrypted with the service’s secret key. You cannot read this message. It is known as a ticket or sealed envelope. It contains the same session key that you received but is encrypted for the service. When the service decrypts it, it knows that the message must have been generated by an entity that had its secret key – Kerberos.
Digital certificates
While public keys simplify authentication (just decrypt this with my public key and you know that I was the only one who could have encrypted it), we still need a way to bind an identity to a public key. X.509 digital certificates provide a solution for this. A certificate is a data structure that contains user information and the user’s public key. This data structure also contains a signature (a hash of the certificate data encrypted with a private key) of the certification authority. The certification authority (CA) is responsible for setting policies to validate the identity of the person who presents the public key for encapsulation in a certificate.
Secure Sockets Layer
Secure Sockets Layer (SSL, also known as TLS, Transport Layer Security) is a layer of software designed to provide authentication and secure communication over the abstraction of a sockets interface. It makes it easy to add a secure transport to insecure socket-based protocols (e.g., HTTP and FTP). SSL uses a hybrid cryptosystem and relies on public keys for authentication. If both the sender and receiver have X.509 digital certificates, SSL can validate them and use nonce-based public key authentication to validate that each party has the corresponding private key. In some cases, it may validate the server only. If the client does not have a certificate, SSL will then use the server's public key simply to allow a symmetric session key to be passed securely from client to server. After that, communication takes place using a symmetric algorithm and the client-generated session key.
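For example, Python’s ssl module performs this handshake for you; the sketch below opens a TLS connection and sends a request over it (example.com stands in for any TLS-enabled server).

```python
# Minimal TLS client: the library does certificate validation and key exchange,
# and the wrapped socket then carries symmetrically encrypted data.
import socket, ssl

context = ssl.create_default_context()            # loads trusted CA certificates
with socket.create_connection(("example.com", 443)) as sock:
    with context.wrap_socket(sock, server_hostname="example.com") as tls:
        print(tls.version())                      # e.g., TLSv1.3
        tls.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")
        print(tls.recv(200))
```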
Biometric authentication
Biometric authentication differs dramatically from other techniques in that it relies on statistical pattern recognition and will never yield a 100% true or false answer. Comparisons are based on thresholds. The ROC (receiver operator characteristic) plot defines the trade-off between false accepts and false rejects for a particular biometric system. The authentication process comprises four steps: (1) sensing the user’s biometric feature; (2) extracting the features of interest, selecting features that are distinctive and repeatable; (3) matching the pattern against stored signals, searching for small distances between the matches; and (4) deciding if the match is close enough.
CAPTCHA
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is not a technique to authenticate users but rather a technique to identify whether a system is interacting with a human being or with automated software. The idea behind it is that humans can recognize highly distorted characters far better than character recognition software can.
Steganography
Cryptography’s goal is to hide the contents of a message. Steganography’s goal is to hide the very existence of the message. Classic techniques included the use of invisible ink, writing a message on one’s head and allowing the hair to cover it, microdots, and carefully-clipped newspaper articles that together communicate the message.
A null cipher is one where the actual message is hidden among irrelevant data. For example, the message may comprise the first letter of each word (or each sentence, or every second letter, etc.). Chaffing and winnowing entails the transmission of a bunch of messages, of which only certain ones are legitimate. Each message is signed with a key known only to trusted parties. Intruders can see the messages but can’t validate the signatures to distinguish the valid messages from the bogus ones.
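A toy null cipher in Python, where the hidden message is the first letter of each word of the cover text:

```python
# Null cipher: the real message hides in plain sight inside innocuous text.
cover = "Henry eats lunch past noon on weekdays"
hidden = "".join(word[0] for word in cover.split())
print(hidden.lower())   # "helpnow"
```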
Image watermarking relates to hiding a message in an image. A straightforward method is to use low-order bits of an image, where the user is unlikely to notice slight changes in color. A large number of laser printers embed a serial number and date simply by printing very faint color splotches. A frequency-domain watermark is one where an image is transformed into the frequency domain via a discrete cosine transform (the real part of a fast Fourier transform). The steganographic data is then added to certain high-frequency regions. High frequency regions correspond to “noisy” areas in an image and humans are unlikely to notice slight variations there.
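A sketch of the low-order-bit approach on raw 8-bit pixel values (the pixel list here is made up):

```python
# Low-order-bit watermarking: each message bit replaces the least significant
# bit of one pixel, a change too small for a viewer to notice.
def embed(pixels, message_bits):
    out = list(pixels)
    for i, bit in enumerate(message_bits):
        out[i] = (out[i] & 0xFE) | bit        # clear the low bit, then set it
    return out

def extract(pixels, nbits):
    return [p & 1 for p in pixels[:nbits]]

pixels = [200, 201, 198, 199, 200, 202, 197, 203]
bits = [1, 0, 1, 1, 0, 0, 1, 0]
print(extract(embed(pixels, bits), len(bits)) == bits)   # True
```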
While the terms “steganography” and “watermarking” are often used interchangeably, the primary goal of watermarking is to create an indelible imprint on a message such that an intruder cannot remove or replace the message. The primary goal of steganography is to allow primarily one-to-one communication while hiding the existence of a message.
Smart Cards
The term smart card means different things to different people. It’s generally considered to be a small device (card or key fob) that can communicate with some reader either through physical electric contacts or wirelessly over a short distance. The most primitive smart cards are simply storage devices, containing a few kilobytes of nonvolatile memory. The smarter ones often provide various security capabilities, such as on-board encryption, hashing, and signing data. They may store your keys and present a public key to the reader yet never divulge your private key. Instead, they will use the private key internally to generate encrypted hashes or encrypt a nonce based on data coming into the card.
An example of a smart card is a passport. Most of the world’s passports now contain electronics and support contactless communication with a reader. While they don’t do anything smart (yet), they contain data about the owner as well as a digitized image of the owner’s face and a digital certificate. This data cannot be accessed until a session key is negotiated using the passport number, date of birth, and the passport expiration date. A reader has to obtain this data from an optical scan of the passport so that it can generate a key to communicate with the chip on the passport.
Fault Tolerance
There are three classes of faults: transient, intermittent, and permanent. Faults are either fail-silent (where the component does not produce results) or Byzantine (where the component produces faulty results). A popular approach to fault tolerance is redundancy, which may be information redundancy (e.g., error correcting codes), time redundancy (e.g., retransmission), or physical redundancy (redundant components – standby systems or triple-modular redundancy for active voting). A technique to deal with components that may exhibit Byzantine faults is Triple Modular Redundancy (TMR). This requires a three-way replication of the component with election logic to weed out a component that is producing faulty results (the two good components will overrule the faulty one).
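The voting step of TMR can be sketched as a simple majority function over the three replica outputs:

```python
# Majority voting over replicated components: one faulty result is outvoted.
from collections import Counter

def tmr(replica_outputs):
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: more than one replica failed")
    return value

print(tmr([42, 42, 17]))   # the faulty replica's 17 is outvoted -> 42
```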
The two-army problem demonstrates that reliable communication can never be achieved with faulty communication lines. The Byzantine generals problem illustrates the difficulty of achieving reliable communication in the presence of faulty processors exhibiting Byzantine faults.
Security
Most security attacks on systems are not cryptographic – they do not find weaknesses in cryptographic algorithms and try to break ciphers. Most rely on bugs in software or the trust of individuals.
For protocols with no encryption that use a public (or sniffable) network, one can sniff the network to extract logins and passwords (there’s a lot of software that makes this easy; snort, for example, is one of the oldest and most popular). If one cannot find a password for a user, one can try guessing it. If one can’t do that, then one can try all combinations. An exhaustive search through every possible password may be time-prohibitive. A dictionary attack is one where you go through a dictionary of words and names and test them as potential passwords, applying common tricks such as prefixing and suffixing digits and substituting numbers that look like letters. Performing this attack becomes a lot easier if you’re lucky enough to get the hashed password that you are searching for. Then you can perform the search on your own machine without going through the login service on the target system.
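A toy sketch of a dictionary attack against a stolen hash, with a made-up word list and the common digit-suffix and letter-substitution tricks:

```python
# Dictionary attack: hash each candidate and compare against the stolen hash.
import hashlib

def variations(word):
    yield word
    for d in range(10):
        yield word + str(d)                               # append a digit
    yield word.replace("o", "0").replace("e", "3")        # letter lookalikes

def crack(target_hash, dictionary):
    for word in dictionary:
        for candidate in variations(word):
            if hashlib.sha256(candidate.encode()).hexdigest() == target_hash:
                return candidate
    return None

stolen = hashlib.sha256(b"tiger7").hexdigest()
print(crack(stolen, ["kitten", "tiger", "dragon"]))   # tiger7
```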
A social engineering attack is one where you con a person into giving you the necessary credentials. You might do this by impersonating a system administrator and simply asking. A Trojan horse is a program that masquerades as a legitimate program and tricks you into believing that you are interacting with the trusted program. A common example is a program that masquerades as a login program and obtains an unsuspecting user’s login and password. Phishing is email that purports to come from a trusted party (such as your bank) and attempts to convince the user to click on a link that will take them to what they believe is that party's web site. In reality, it is an intruder's site that is designed to look like the legitimate one. The goal here is also to collect data such as your login and password or perhaps your bank account, social security, or credit card numbers.
A buffer overflow bug is one where software expects to read a small amount of data into a fixed-size buffer but never checks to see whether the incoming data is bigger than the buffer. What ends up happening is that the software continues to append data to the buffer but is now writing into memory beyond that which is allocated to the buffer. If the buffer was declared within a function, then it was allocated memory on the stack. At some point, the overflow data may end up clobbering the return address for that function. A carefully crafted data stream can ensure that the return address gets modified with the address of other code in this data stream, which will get executed as soon as the function attempts to return.
A SYN flooding attack is a form of a denial of service attack where an intruder attempts to render a machine unable to accept any TCP/IP connections. Every TCP/IP session consumes a certain number of system resources, which are allocated when the first connection request is made (via a SYN packet). The intruder creates requests that come from unreachable hosts. If enough of these are sent to a host, the computer will reach a point where the operating system will not accept any more connections. There is a window of time before the machine will decide to give up on these pending connections. BSD systems typically allot 7.5 seconds for this period. Microsoft Windows Server 2003 advised using a value of two minutes.
A rootkit is code that hides the presence of users or additional software from the user. Sometimes, this is done by replacing commands that would present this data (e.g., ps, ls, who, … on Unix/Linux systems). Other times, this may be done within the operating system itself by intercepting system calls.
The four A’s of security are:
- Authentication: the process of binding an identity to the user. Note the distinction between authentication and identification. Identification is simply the process of asking you to identify yourself (for example, ask for a login name). Authentication is the process of proving that the identification is correct.
- Authorization: given an identity, making a decision on what access the user is permitted. Authorization is responsible for access control.
- Accounting: logging system activity so that any breaches can be identified (intrusion detection) or a post facto analysis can be performed.
- Auditing: inspecting the software and system configuration for security flaws.
Signed Software
The idea of signing software is to ensure that it has not been tampered with (for example, that a virus has not been attached to it). The most basic form of signing software is to do the same as we would for any other digital data: generate a hash of the entire package, encrypt it with the publisher’s private key, and attach it to the software as a signature. To validate the signature, you would compute the hash and compare it with the decrypted signature. You would decrypt the signature using the software publisher’s public key, which would be present in a digital certificate obtained from that publisher.
Computing a hash for software before we run it would involve scanning through the entire code base. Demand-paged virtual memory systems load pages of the program as needed, so this would greatly increase the startup time of the software. Instead, signed software will often contain a signature for every memory page. When a particular page is loaded, the operating system would check the signature for that page.
Sandboxes
A sandbox is an environment designed for running programs while restricting their access to certain resources, such as disk space, file access, amount of memory, and network connections. It was created to allow users to run untrusted code without the risk of harming their system. The quintessential example of a sandbox is the Java Virtual Machine, initially designed for running Java applets, code that would be loaded dynamically upon fetching a web page. The Java sandbox has three parts to it: (1) the byte-code verifier tries to ensure that the code looks like valid Java byte code with no attempts to circumvent access restrictions, convert data illegally, or forge pointers; (2) the class loader enforces restrictions on whether a program is allowed to load additional applets and that an applet is not allowed to access classes belonging to other applets; (3) the security manager is invoked to provide run-time verification of whether a program has rights to invoke certain methods, such as file I/O or network access.
Firewalls
A firewall protects the junction between an untrusted network (external) and a trusted network (internal). Two approaches to this are packet filtering and proxies. A packet filter, or screening router, determines not only the route of a packet but whether the packet should be dropped based on the IP header (as well as TCP/UDP headers) and interface the packet came on. With stateless inspection, a packet is examined on its own. Stateful inspection allows the router to keep track of TCP connections and understand the relationship between packets (for example, a port that needs to be enabled for the FTP data channel once an FTP connection has been established or that return packets should be allowed in response to outbound requests).
Proxy services (also known as application proxies) live on bastion hosts. These are stripped-down machines that give potential intruders as few tools as possible (few user accounts, no compilers, minimal commands). An application proxy is software that presents the same protocol to the outside network as the application for which it is a proxy (e.g., a mail server proxy will listen on port 25 and understand SMTP, the simple mail transfer protocol). The job of the proxy is to validate the application protocol. Valid requests are forwarded to the real application that is running on another server and is not accessible from the outside network. The bastion hosts on which proxies live are dual-homed hosts. This means the computer has two network cards. This is important since packets will not flow directly from the outside network to the inside network. It is up to the proxy software to forward requests.
A typical firewalled environment is a screened subnet architecture, containing a subnet between the internal and external networks called the DMZ (demilitarized zone). The DMZ contains all the bastion hosts that may be offering services to the external network (usually the Internet). Screening routers on both sides of the DMZ ensure that no packet from the outside network is permitted into the inside network.
All machines within an organization will be either in the DMZ or in the internal network. The exterior router will allow IP packets only to the machines/ports in the DMZ that are offering valid services. It would also reject any packets that are masqueraded to appear to come from the internal network. The interior router would allow only packets to come from designated machines in the DMZ that may need to access services in the internal network. Any packets not targeting the appropriate services in the internal network will be rejected. Both routers will generally allow traffic to flow from the internal network to the Internet, although an organization may block certain services (ports) or force users to use a proxy (for web access, for example).
VPNs
Virtual private networks (VPNs) allow disconnected local area networks to be connected together securely, saving money by using the shared public network (Internet) instead of leased lines. This is achieved by tunneling – the encapsulation of a packet within another packet. In this case, a packet that is destined for a remote subnet (which may have local source and destination IP addresses that are not routable over the public Internet) will be treated as payload and placed in IP packets that go over the public Internet (with source and destination addresses being the VPN endpoint at both sides, usually the router). When the router receives this encapsulated packet, it extracts the data – which is an IP packet – and routes it on the local network. This tunneling behavior gives us the virtual network part of the VPN. To achieve security, an administrator setting up a VPN will usually be concerned that the data contents are not readable and the data has not been tampered with. To ensure this, packets can be encrypted and signed. Signing a packet ensures that the data has not been tampered with. Encrypting ensures that external parties would not be able to make sense of the packets. VPNs usually provide several options for key management: shared private keys, Diffie-Hellman key exchange, or RSA public keys.
Clusters
Clustering is the aggregation of multiple independent computers to provide increased reliability and/or performance. There are four distinct approaches to clustering: high-performance computing, batch processing, load balancing, and high availability.
High-Performance Computing (HPC)
High-performance clusters are generally custom efforts but there are a number of components that are common across many implementations. They are designed for traditional supercomputing applications that focus on a large amount of computation on large data sets. These applications are designed to be partitioned into multiple communicating processes. The Message Passing Interface (MPI) is a popular programming interface for sending and receiving messages that handles point-to-point and group communication and provides support for barrier-based synchronization. It is often used together with the Parallel Virtual Machine (PVM), a layer of software that provides an interface for creating tasks, managing global task IDs, and managing groups of tasks on arbitrary collections of processors. Beowulf and Rocks Cluster are examples of popular Linux-based HPC clusters. Microsoft offers HPC Server 2008. There are many other HPC systems as well. The common thread among them all is that they provide a front-end server for scheduling jobs and monitoring processes and offer an MPI library for programming.
Batch Processing – Single-Queue Work Distribution
Batch processing is often used for applications such as renderfarms for computer animation, where a central coordinator (dispatcher) sends job requests to a collection of machines. When a system completes a job (e.g., “render frame #4,178”), the dispatcher will send it the next job (e.g., “render frame #12,724”). The dispatcher will have the ability to list jobs, delete jobs, dispatch jobs, and get notified when a job is complete.
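A minimal sketch of single-queue work distribution, with threads standing in for render nodes and made-up job names:

```python
# Single-queue dispatcher: each worker takes the next job when it becomes free.
import queue, threading

jobs = queue.Queue()
for frame in range(1, 11):
    jobs.put(f"render frame #{frame}")

def worker(name):
    while True:
        try:
            job = jobs.get_nowait()          # ask the dispatcher for the next job
        except queue.Empty:
            return                           # no more work
        print(f"{name}: {job}")
        jobs.task_done()

threads = [threading.Thread(target=worker, args=(f"node{i}",)) for i in range(3)]
for t in threads: t.start()
for t in threads: t.join()
```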
Load Balancing
Web-services load balancing is a simple but widely used technique for distributing the load of many network requests among a collection of machines. The simplest form is to have all requests go to a single machine that then returns an HTTP REDIRECT response. This is part of the HTTP protocol and leads the client to re-issue the request to the machine specified in the redirect.
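A sketch of the REDIRECT approach using Python’s http.server; the back-end hostnames are made up, and each request is bounced to the next server in round-robin order.

```python
# Redirect-based load balancing: the front end answers every request with a
# 302 redirect pointing the client at one of the back-end servers.
import itertools
from http.server import BaseHTTPRequestHandler, HTTPServer

backends = itertools.cycle(["http://web1.example.com", "http://web2.example.com"])

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(302)                       # HTTP redirect status
        self.send_header("Location", next(backends) + self.path)
        self.end_headers()

HTTPServer(("", 8080), Redirector).serve_forever()
```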
A more sophisticated software-only approach was implemented by IBM’s Interactive Network Dispatcher. As with the REDIRECT approach, all requests go to one machine. Now, however, the machine forwards the entire stream of IP packets to one of several load-balanced back-end systems. All of these machines share the same IP address and the coordinator has special kernel extensions that allow it to forward TCP and UDP packets to different machines by changing the MAC address of the packets.
Finally, one can turn to hardware and use a load-balancing router to map incoming requests to multiple back-end machines.
High Availability
High-availability clusters strive to provide a high level of system uptime by taking into account the fact that machines may fail. In such a case, applications running on those machines will be moved to other machines that are still up.
Low-level software to support high-availability clustering includes facilities for IP address takeover (allowing a machine to listen on multiple IP addresses) and to access shared disks (using a distributed lock manager). Mid-layer software includes distributed elections to pick a coordinator, propagation of status information, and figuring out which systems and applications are alive. Higher-layer software includes the ability to restart applications, let a user assign applications to machines, and let a user see what’s going on.
The key to detecting machine failure is the heartbeat network: exchanging messages between machines to ensure that they are alive and capable of responding. Since a main network may go down, one or more secondary networks (such as direct cable connections between two machines) are often used as dedicated heartbeat networks.
Failed applications may be restarted on a standby machine (active/passive configuration) or may be moved to a machine that’s already running other services (active/active configuration). A cold failover is an application restart – the application is started from the beginning. A warm failover is one where the application is checkpointed periodically and is restarted from the last checkpoint. In a hot failover, a copy of the application is always running in lockstep synchrony with the main application on another machine. This is difficult and not always practical, especially where communications are involved. Cascading failover refers to the ability of an application to fail over even after it already has failed over in the past (a double restart). Multi-directional failover refers to the ability to restart applications from a failed system on multiple available systems instead of a specific targeted backup machine.
Clustered systems may share access to the same disk (e.g., via a Y-SCSI connector or via a fibre channel switch or iSCSI configuration). This is a shared-disk configuration. Having two machines access the same disk is not a good idea, of course, since that can put the filesystem in an incoherent state. Therefore, systems that do this must ensure that they have mutually exclusive access to the disk (and flush their caches when they give up that access). They do this via a distributed lock manager (DLM). A simpler solution is a shared-nothing architecture where there is no shared storage and a system forwards any requests for disk I/O to the system that owns the disk. In the case of failure, however, access to such storage may be lost or may have to be gained by using a shared link that was simply not used while both systems were up.
Some clusters may be tightly coupled via a system area network (SAN). This is a switched network to allow low-latency I/O between machines without incurring any processor intervention. It also avoids the overhead of TCP/IP by providing a reliable communication channel. Remote DMA (RDMA) allows data to be copied directly to the memory of another processor. SANs are often used for HPC clusters, with SAN/RDMA communication incorporated into the Message Passing Interface (MPI) library.
Storage in a cluster is often separated from the actual machines via a storage area network (SAN). This is a network that is dedicated to disk I/O traffic between machines and the disks, which reside in separate storage arrays. The communications between machines and disks is typically SCSI over fibre channel or the iSCSI over ethernet. Because of this dissociation, the switch between the machines and the storage can be configured to control access and allow specific hosts to read and write from/to specific disks.
To make storage highly available, RAID (redundant array of independent disks) is often employed. RAID 1 is mirroring. Anything that is written to one disk gets written to a secondary disk. If one fails then you still have the other. RAID 5 stripes the data across several disks and also adds error correcting codes (e.g., parity in the case of allowing one to correct for a single failure) so that data can be reconstructed from the remaining disks if one fails. RAID 0 is a technique used to improve performance by striping data across multiple disks. It does not improve fault tolerance.
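The parity idea behind RAID 5 can be sketched with a simple XOR over data blocks; losing any one block leaves enough information to rebuild it.

```python
# XOR parity: the parity block is the XOR of the data blocks, so any single
# lost block can be reconstructed by XORing the survivors.
def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d1, d2, d3 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks([d1, d2, d3])

# The disk holding d2 dies: rebuild it from the remaining data and the parity.
rebuilt = xor_blocks([d1, d3, parity])
print(rebuilt == d2)   # True
```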
Virtualization
As a general concept, virtualization is a form of abstraction for interfacing with physical devices. With virtual memory, a process has the impression that it owns the entire memory address space. Different processes can all access the same “virtual” memory location and the memory management unit (MMU) on the processor maps each access to the unique physical memory locations that are assigned to the process.
With storage virtualization, a computer gets a logical view of disks connected to a machine. In reality, those “disks” may be networked to a computer via a fibre channel switch or ethernet interface and may be parts of physical disks or collections of disks that appear to the computer as one disk.
CPU virtualization allows programs to execute on a machine that does not really exist. The instructions are interpreted by a program that simulates the architecture of the pseudo machine. Early pseudo-machines included O-code for BCPL and P-code for Pascal. The most popular pseudo-machine today is the Java virtual machine.
Machine virtualization allows a physical computer to act like several real machines with each machine running its own operating system (on a virtual machine) and applications that interact with that operating system. The key to machine virtualization is to not allow each guest operating system to have direct access to certain privileged instructions in the processor. These instructions would allow an operating system to directly access I/O ports, MMU settings, the task register, the halt instruction and other parts of the processor that could interfere with the processor’s behavior and with other operating systems. Instead, such instructions as well as system interrupts are intercepted by the Virtual Machine Monitor (VMM), also known as a hypervisor. The hypervisor arbitrates access to physical resources and presents a set of virtual device interfaces to each guest operating system (including the memory management unit, I/O ports, disks, and network interfaces).
Two configurations of virtual machines are hosted virtual machines and native virtual machines. With a hosted virtual machine, the computer has a primary operating system installed that has access to the raw machine (all devices, memory, and file system). One or more guest operating systems can be run on virtual machines. The VMM serves as a proxy, converting requests from the virtual machine into operations that get executed on the host operating system. A native virtual machine is one where there is no "primary" operating system that owns the system hardware. The hypervisor is in charge of access to the devices and provides each operating system drivers for an abstract view of all the devices.
The latest processors from Intel and AMD support the concept of a virtual machine layer and the ability to intercept privileged instructions. Prior to that, one of two approaches was used to implement virtualization. Binary translation pre-scans the instruction stream of any code that has to run in privileged mode and replaces all privileged instructions with interrupts to the VMM. Paravirtualization requires modifying the operating system to replace privileged instructions with calls to a VMM API. This, of course, requires access to the source code of the operating system.