Exam 1 study guide

The one-hour study guide for exam 1

Paul Krzyzanowski

Latest update: Fri Oct 2 10:10:51 EDT 2020

Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don't take the one hour time window in the title literally.

Introduction

Computer security is about keeping computers, their programs, and the data they manage “safe.” Specifically, this means safeguarding three areas: confidentiality, integrity, and availability. These three are known as the CIA Triad (no relation to the Central Intelligence Agency).

Confidentiality
Confidentiality means that we do not make a system’s data and its resources (the devices it connects to and its ability to run programs) available to everyone. Only authorized people and processes should have access. Privacy specifies limits on what information can be shared with others while confidentiality provides a means to block access to such information. Privacy is a reason for confidentiality. Someone being able to access a protected file containing your medical records without proper access rights is a violation of confidentiality.
Integrity

Integrity refers to the trustworthiness of a system. This means that everything is as you expect it to be: users are not imposters and processes are running correctly.

  • Data integrity means that the data in a system has not been corrupted.

  • Origin integrity means that the person or system sending a message or creating a file truly is that person and not an imposter.

  • Recipient integrity means that the person or system receiving a message truly is that person and not an imposter.

  • System integrity means that the entire computing system is working properly; that it has not been damaged or subverted. Processes are running the way they are supposed to.

Maintaining integrity means not just defending against intruders that want to modify a program or masquerade as others. It also means protecting the system against accidental damage, such as from user or programmer errors.

Availability
Availability means that the system is available for use and performs properly. A denial of service (DoS) attack may not steal data or damage any files but may cause a system to become unresponsive.

Security is difficult. Software is incredibly complex. Large systems may comprise tens or hundreds of millions of lines of code. Systems as a whole are also complex. We may have a mix of cloud and local resources, third-party libraries, and multiple administrators. If security were easy, we would not have massive security breaches year after year. Microsoft wouldn’t have monthly security updates. There are no magic solutions … but there is a lot that can be done to mitigate the risk of attacks and their resultant damage.

We saw that computer security addressed three areas of concern. The design of security systems also has three goals.

Prevention
Prevention means preventing attackers from violating established security policies. It means that we can implement mechanisms into our hardware, operating systems, and application software that users cannot override – either maliciously or accidentally. Examples of prevention include enforcing access control rules for files and authenticating users with passwords.
Detection
Detection detects and reports security attacks. It is particularly important when prevention mechanisms fail. It is useful because it can identify weaknesses with certain prevention mechanisms. Even if prevention mechanisms are successful, detection mechanisms are useful to let you know that attempted attacks are taking place. An example of detection is notifying an administrator that a new user has been added to the system. Another example is being notified that there have been several consecutive unsuccessful attempts to log in.
Recovery
If a system is compromised, we need to stop the attack and repair any damage to ensure that the system can continue to run correctly and the integrity of data is preserved. Recovery includes forensics, the study of identifying what happened and what was damaged so we can fix it. An example of recovery is restoration from backups.

Security engineering is the task of implementing the necessary mechanisms and defining policies across all the components of the system. Like other engineering disciplines, designing secure systems involves making compromises. A highly secure system will be disconnected from any communication network, sit in an electromagnetically shielded room that is only accessible to trusted users, and run software that has been thoroughly audited. That environment is not acceptable for most of our computing needs. We want to download apps, carry our computers with us, and interact with the world. Even in the ultra-secure example, we still need to be concerned with how we monitor access to the room, who wrote the underlying operating system and compilers, and whether authorized users can be coerced to subvert the system. Systems have to be designed with some idea of who are likely potential attackers and what the threats are. Risk analysis is used to understand the difficulty of an attack on a system, who will be affected, and what the worst thing that can happen is. A threat model is a data flow model (e.g., diagram) that identifies each place where information moves into or out of the software or between subsystems of the program. It allows you to identify areas where the most effort should be placed to secure a system.

Secure systems have two parts to them: mechanisms and policies. A policy is a description of what is or is not allowed. For example, “users must have a password to log into the system” is a policy. Mechanisms are used to implement and enforce policies. An example of a mechanism is the software that requests user IDs and passwords, authenticates the user, and allows entry to the system only if the correct password is used.

A vulnerability is a weakness in the security system. It could be a poorly defined policy, a bribed individual, or a flaw in the underlying mechanism that enforces security. An attack is the exploitation of a vulnerability in a system. An attack vector refers to the specific technique that an attacker uses to exploit a vulnerability. Example attack vectors include phishing, keylogging, and trying common passwords to log onto a system. An attack surface is the sum of possible attack vectors in a system: all the places where an attacker might try to get into the system.

A threat is the potential adversary who may attack the system. Threats may lead to attacks.

Threats fall into four broad categories:

Disclosure: Unauthorized access to data, which covers exposure, interception, interference, and intrusion. This includes stealing data, improperly making data available to others, or snooping on the flow of data.

Deception: Accepting false data as true. This includes masquerading, which is posing as an authorized entity; substitution or insertion, which covers the injection of false data or the modification of existing data; and repudiation, where someone falsely denies receiving or originating data.

Disruption: Some change that interrupts or prevents the correct operation of the system. This can include maliciously changing the logic of a program, a human error that disables a system, an electrical outage, or a failure in the system due to a bug. It can also refer to any obstruction that hinders the functioning of the system.

Usurpation: Unauthorized control of some part of a system. This includes theft of service as well as any misuse of the system such as tampering or actions that result in the violation of system privileges.

The Internet increases opportunities for attackers. The core protocols of the Internet were designed with decentralization, openness, and interoperability in mind rather than security. Anyone can join the Internet and send messages … and untrustworthy entities can provide routing services. It allows bad actors to hide and to attack from a distance. It also allows attackers to amass asymmetric power: harnessing more resources to attack than the victim has for defense. Even small groups of attackers are capable of mounting Distributed Denial of Service (DDoS) attacks that can overwhelm large companies or government agencies.

Adversaries can range from lone hackers to industrial spies, terrorists, and intelligence agencies. We can consider two dimensions: skill and focus. Regarding focus, attacks are either opportunistic or targeted. Opportunistic attacks are those where the attacker is not out to get you specifically but casts a wide net, trying many systems in the hope of finding a few that have a particular vulnerability that can be exploited. Targeted attacks are those where the attacker targets you specifically. The term script kiddies is used to refer to attackers who lack the skills to craft their own exploits but download malware toolkits to try to find vulnerabilities (e.g., systems with poor or default passwords, hackable cameras). Advanced persistent threats (APT) are highly-skilled, well-funded, and determined (hence, persistent) attackers. They can craft their own exploits, pay millions of dollars for others, and may carry out complex, multi-stage attacks.

We refer to the trusted computing base (TCB) as the collection of hardware and software of a computing system that is critical to ensuring the system’s security. Typically, this is the operating system and system software. If the TCB is compromised, you no longer have assurance that any part of the system is secure. For example, the operating system may be modified to ignore the enforcement of file access permissions. If that happens, you no longer have assurance that any application is accessing files properly.

Access control

See lecture notes

Program Hijacking

Program hijacking refers to techniques that can be used to take control of a program and have it do something other than what it was intended to do. One class of techniques uses code injection, in which an adversary manages to add code to the program and change the program’s execution flow to run that code.

The best-known set of attacks are based on buffer overflow. Buffer overflow is the condition where a programmer allocates a chunk of memory (for example, an array of characters) but neglects to check the size of that buffer when moving data into it. Data will spill over into adjacent memory and overwrite whatever is in that memory.

Languages such as C, C++, and assembler are susceptible to buffer overflows since the language does not have a means of testing array bounds. Hence, the compiler cannot generate code to validate that data is only going into the allocated buffer. For example, when you copy a string using strcpy(char *dest, char *src), you pass the function only source and destination pointers. The strcpy function has no idea how big either of the buffers are.
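
For illustration, here is a minimal sketch of the unsafe pattern (the function name and caller are hypothetical):

#include <string.h>

void greet_user(const char *name)
{
    char buf[128];            /* fixed-size buffer on the stack */

    strcpy(buf, name);        /* no bounds check: a name of 128 bytes or more
                                 spills past the end of buf */
    /* ... use buf ... */
}

If name comes from user input, anything beyond the 127 characters that fit (plus the null terminator) overwrites whatever sits next to buf in memory.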

Stack-based overflows

When a process runs, the operating system’s program loader allocates a region for the executable code and static data (called the text and data segments), a region for the stack, and a region for the heap (used for dynamic memory allocation, such as by malloc).

Just before a program calls a function, it pushes the function’s parameters onto the stack. When the call is made, the return address gets pushed on the stack. On entry to the function that was called, the function pushes the current frame pointer (a register in the CPU) on the stack, which forms a linked list to the previous frame pointer and provides an easy way to revert the stack to where it was before making the function call. The frame pointer register is then set to the current top of the stack. The function then adjusts the stack pointer to make room to hold local variables, which live on the stack. This region for the function’s local data is called the stack frame. Ensuring that the stack pointer is always pointing to the top of the stack enables the function to handle interrupts or call other functions without overwriting anything useful on the stack. The compiler generates code to reference parameters and local variables as offsets from the current frame pointer register.

Before a function returns, the compiler generates code to:

  • Adjust the stack back to point to where it was before the stack expanded to make room for local variables. This is done by copying the frame pointer to the stack pointer.

  • Restore the previous frame pointer by popping it off the stack (so that local variables for the previous function could be referenced properly).

  • Return from the function. Once the previous frame pointer has been popped off the stack, the stack pointer points to a location on the stack that holds the return address.

Simple stack overflows

Local variables are allocated on the stack and the stack grows downward in memory. Hence, the top of the stack is in lower memory than the start, or bottom, of the stack. If a buffer (e.g., char buf[128]) is defined as a local variable, it will reside on the stack. As the buffer gets filled up, its contents will be written to higher and higher memory addresses. If the buffer overflows, data will be written further down the stack (in higher memory), overwriting the contents of any other variables that were allocated for that function and eventually overwriting the saved frame pointer and the saved return address.

When this happens and the function tries to return, the return address that is read from the stack will contain garbage data, usually a memory address that is not mapped into the program’s memory. As such, the program will crash when the function returns and tries to execute code at that invalid address. This is an availability attack. If we can exploit the fact that a program does not check the bounds of a buffer and overflows the buffer, we can cause a program to crash.

Subverting control flow through a stack overflow

Buffer overflow can be used in a more malicious manner. The buffer itself can be filled with bytes of valid machine code. If the attacker knows the exact size of the buffer, she can write just the right number of bytes to write a new return address into the very same region of memory on the stack that held the return address to the parent function. This new return address points to the start of the buffer that contains the injected code. When the function returns, it will “return” to the new code in the buffer and execute the code at that location.

Off-by-one stack overflows

As we saw, buffer overflow occurs because of programming bugs: the programmer neglected to make sure that the data written to a buffer does not overflow. This often occurs because the programmer used old, unsafe functions that do not allow the programmer to specify limits. Common functions include:

- strcpy(char *dest, char *src)

- strcat(char *dest, char *src)

- sprintf(char *buf, char *format, ...)

Each of these functions has a safe counterpart that accepts a count parameter so that the function will never copy more than count bytes:

- strncpy(char *dest, char *src, size_t count)

- strncat(char *dest, char *src, size_t count)

- snprintf(char *buf, size_t count, char *format, ...)

You’d think this would put an end to buffer overflow problems. However, programmers may miscount or they may choose to write their own functions that do not check array bounds correctly. A common error is an off-by-one error. For example, a programmer may declare a buffer as:

char buf[128];

and then copy into it with:

for (i=0; i <= 128; i++)
    buf[i] = stuff[i];

The programmer inadvertently used a <= comparison instead of <.

With off-by-one bounds checking, there is no way that malicious input can overwrite the return address on the stack: the copy operation stops before reaching it. However, if the buffer is the first variable that is allocated on the stack, an off-by-one error can overwrite one byte of the saved frame pointer.

The potential for damage depends very much on what the value of that saved frame pointer was and how the compiler generates code for managing the stack. In the worst case, it could be set to a value that is up to 255 bytes lower in memory. If the frame pointer is modified, the function will still return normally. However, upon returning, the compiler-generated code pops the frame pointer from the stack to restore the saved value of the calling function’s frame pointer, which was corrupted by the buffer overflow. Now the program has a modified frame pointer.

Recall that references to a function’s variables and parameters are expressed as offsets from the current frame pointer. Any references to local variables may now be references to data in the buffer. Moreover, should that function return, it will update its stack pointer to this buffer area and return to an address that the attacker defined.

Heap overflows

Not all data is allocated on the stack: only local variables. Global and static variables are placed in a region of memory right above the executable program. Dynamically allocated memory (e.g., via new or malloc) comes from an area of memory called the heap. In either case, since this memory is not the stack, it does not contain return addresses so there is no ability for a buffer overflow attack to overwrite return addresses.

We aren’t totally safe, however. A buffer overflow will cause data to spill over into higher memory addresses above the buffer that may contain other variables. If the attacker knows the order in which variables are allocated, they could be overwritten. While these overwrites will not change a return address, they can change things such as filenames, lookup tables, or linked lists. Some programs make extensive use of function pointers, which may be stored in global variables or in dynamically-allocated structures such as linked lists on a heap. If a buffer overflow can overwrite a function pointer then it can change the execution of the program: when that function is called, control will be transferred to a location of the attacker’s choosing.
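
As a rough sketch of the pattern just described (the structure, names, and memory layout are hypothetical), an unchecked copy into a heap-allocated buffer can clobber an adjacent function pointer:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct handler {
    char name[16];              /* heap-allocated buffer */
    void (*callback)(void);     /* function pointer stored right after it */
};

void greet(void) { printf("hello\n"); }

int main(int argc, char **argv)
{
    struct handler *h = malloc(sizeof(*h));

    if (h == NULL || argc < 2)
        return 1;
    h->callback = greet;
    strcpy(h->name, argv[1]);   /* unchecked copy: a long argument spills into callback */
    h->callback();              /* control transfers to whatever now fills the pointer */
    free(h);
    return 0;
}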

If we aren’t sure of the exact address at which execution will start, we can fill a buffer with a bunch of NOP (no operation) instructions prior to the injected code. If the processor jumps anywhere in that region of memory, it will happily execute these NOP instructions until it eventually reaches the injected code. This is called a NOP slide, or a landing zone.

Format string attacks with printf

The family of printf functions are commonly used in C and C++ to create formatted output. They accept a format string that defines what will be printed, with % characters representing formatting directives for parameters. For example,

printf("value = %05d\n", v);

Will print a string such as

value = 01234

if the value of v is 1234.

Reading arbitrary memory

Occasionally, programs will use a format string that could be modified. For instance, the format string may be a local variable that is a pointer to a string. This local variable may be overwritten by a buffer overflow attack to point to a different string. It is also common, although improper, for a programmer to use printf(s) to print a fixed string s. If s is a string that is generated by the attacker, it may contain unexpected formatting directives.

Note that printf takes a variable number of arguments and matches each % directive in the format string with a parameter. If there are not enough parameters passed to printf, the function does not know that: it assumes they are on the stack and will happily read whatever value is on the stack where it thinks the parameter should be. This gives an attacker the ability to read arbitrarily deep into the stack. For example, with a format string such as:

printf("%08x\n%08x\n%08x\n%08x\n");

printf will expect four parameters, all of which are missing. It will instead read the next four values that are on the top of the stack and print each of those integers as an 8-character-long hexadecimal value prefixed with leading zeros (“%08x\n”).

Writing arbitrary memory

The printf function also contains a somewhat obscure formatting directive: %n. Unlike other % directives that expect to read a parameter and format it, %n instead writes to the address corresponding to that parameter. It writes the number of characters that it has output thus far. For example,

printf("paul%n says hi", &printbytes);

will store the number 4 (strlen("paul")) into the variable printbytes. An attacker who can change the format specifier may be able to write to arbitrary memory. Each % directive to print a variable will cause printf to look for the next variable in the next slot in the stack. Hence, format directives such as %x, %lx, %llx will cause printf to skip over the length of an int, long, or long long and get the next variable from the following location on the stack. Thus, just like reading the stack, we can skip through any number of bytes on the stack until we get to the address where we want to modify a value. At that point, we insert a %n directive in the format string, which will modify that address on the stack with the number of bytes that were output. We can precisely control the value that will be written by specifying how many bytes are output as part of the format string. For example, a format of %.55000x tells printf to output a value to take up 55,000 characters. By using formats like that for output values, we can change the count that will be written with %n. Remember, we don’t care what printf actually prints; we just want to force the byte count to be a value we care about, such as the address of a function we want to call.

Defense against hijacking attacks

Better programming

Hijacking attacks are the result of sloppy programming: a lack of bounds checking that results in overflows. They can be eliminated if the programmer never uses unsafe functions (e.g., use strncpy instead of strcpy) and is careful about off-by-one errors.
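
As a quick sketch of the bounded alternatives (the buffer sizes and function names are arbitrary), note that strncpy does not always null-terminate, while snprintf does:

#include <stdio.h>
#include <string.h>

void copy_name(char *dest, size_t destsize, const char *src)
{
    strncpy(dest, src, destsize - 1);     /* copy at most destsize-1 bytes */
    dest[destsize - 1] = '\0';            /* strncpy may leave dest unterminated */
}

void copy_name2(char *dest, size_t destsize, const char *src)
{
    snprintf(dest, destsize, "%s", src);  /* always null-terminates, truncating if needed */
}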

A programmer can use a technique called fuzzing to locate buffer overflow problems. Wherever the program accepts a string from the user, the tester enters extremely long strings with well-defined patterns (e.g., “$$$$$$…”). If the app crashes because a buffer overflow destroyed a return address on the stack, the programmer can then load the core dump into a debugger, identify where the program crashed, and search for a substring of the entered pattern (“$$$$$”) to identify which buffer was affected.

Buffer overflows can be avoided by using languages with stronger type checking and array bounds checking. Languages such as Java, C#, and Python check array bounds. C and C++ do not. However, it is sometimes difficult to avoid using C or C++.

Tight specification of requirements, coding to those requirements, and constructing tests based on those requirements help avoid buffer overflow bugs. If input lengths are specified, they are more likely to be coded and checked. Documentation should be explicit, such as “user names longer than 32 bytes must be rejected.”

Data Execution Prevention (DEP)

Buffer overflows affect data areas: either the stack, heap, or static data areas. There is usually no reason that those regions of memory should contain executable code. Hence, it makes sense for the operating system to set the processor’s memory management unit (MMU) to turn off execute permission for memory pages in those regions.

This was not possible with early Intel or AMD processors: their MMU did not support enabling or disabling execute permissions. All memory could contain executable code. That changed in 2004, when Intel and AMD finally added an NX (no-execute) bit to their MMU’s page tables. On Intel architectures, this was called the Execute Disable Bit (XD). Operating system support followed. Windows, Linux, and macOS all currently support DEP.

DEP cannot always be used. Some environments, such as certain LISP interpreters, genuinely need to execute code on their stack, and some environments need executable code in their heap section (to support dynamic loading, patching, or just-in-time compilation). DEP also does not guard against data modification attacks, such as heap-based overflows or some printf attacks.

DEP attacks

Attackers came up with some clever solutions to defeat DEP. The first of these is called return-to-libc. Buffer overflows still allow us to corrupt the stack; we just cannot execute code on the stack. However, there is already a lot of code sitting in the program and the libraries it uses. Instead of adding code into the buffer, the attacker merely overflows a buffer to create a new return address and parameter list on the stack. When the function returns, it switches control to the new return address. This return address will be an address in the standard C library (libc), which contains functions such as printf, system, and front ends to system calls. All that an attacker often needs to do is to push parameters that point to a string in the buffer that contains a command to execute and then “return” to the libc system function, which executes its argument as a shell command.

A more sophisticated variant of return-to-libc is Return Oriented Programming (ROP). Return oriented programming is similar to return-to-libc but realizes that execution can branch to any arbitrary point in any function in any loaded library. The function will execute a series of instructions and eventually return. The attacker will overflow the stack with data that now tells this function where to “return”. Its return can jump to yet another arbitrary point in another library. When that returns, it can – once again – be directed to an address chosen by the intruder that has been placed further down the stack, along with frame pointers, local variables, and parameters.

There are lots and lots of return instructions among all the libraries normally used by programs. Each of these tail ends of a function is called a gadget. It has been demonstrated that using carefully chosen gadgets allows an attacker to push a string of return addresses that will enable the execution of arbitrary algorithms. To make life easier for the attacker, tools have been created that search through libraries and identify useful gadgets. A ROP compiler then allows the attacker to program operations using these gadgets.

Address Space Layout Randomization

Stack overflow attacks require knowing and injecting an address that will be used as a target when a function returns. ROP also requires knowing addresses of all the entry points of gadgets. Address Space Layout Randomization (ASLR) is a technique that was developed to have the operating system’s program loader pick random starting points for the executable program, static data, heap, stack, and shared libraries. Since code and data reside in different locations each time the program runs, the attacker is not able to program buffer overflows with useful known addresses. For ASLR to work, the program and all libraries must be compiled to use position independent code (PIC), which uses relative offsets instead of absolute memory addresses.

Stack canaries

A stack canary is a compiler technique to ensure that a function will not be allowed to return if a buffer overflow took place that may have clobbered the return address.

At the start of a function, the compiler adds code to generate a random integer (the canary) and push it onto the stack before allocating space for the function’s local variables (the entire region of the stack used by a local function is called a frame). The canary sits between the return address and these variables. If there is a buffer overflow in a local variable that tries to change the return address, that overflow will have to clobber the value of the canary.

The compiler generates code to have the function check that the canary has a valid value before returning. If the value of the canary is not the original value then a buffer overflow occurred and it’s very likely that the return value has been altered.

However, you may still have a buffer overflow that does not change the value of the canary or the return address. Consider a function that has two local arrays (buffers). They’re both allocated on the stack within the same stack frame. If array A is in lower memory than array B then an overflow in A can affect the contents of B. Depending on the code, that can alter the way the function works. The same thing can happen with scalar variables (non-arrays). For instance, suppose the function allocates space for an integer followed by an array. An overflow in the array can change the value of the integer that’s in higher memory. The canary won’t detect this. Even if the overflow happened to clobber the return value as well, the check is made only when the function is about to return. Meanwhile, it’s possible that the overflow that caused other variables to change also altered the behavior of the function.

Stack canaries cannot fix this problem in general. However, the compiler (which creates the code to generate them and check them) can take steps to ensure that a buffer overflow cannot overwrite non-array variables, such as integers and floats. By allocating arrays first (in higher memory) and then scalar variables, the compiler can make sure that a buffer overflow in an array will not change the value of scalar variables. One array overflowing to another is still a risk, however, but it is most often the scalar variables that contain values that define the control flow of a function.

Command Injection

We looked at buffer overflow and printf format string attacks that enable the modification of memory contents to change the flow of control in the program and, in the case of buffer overflows, inject executable binary code (machine instructions). Other injection attacks enable you to modify inputs used by command processors, such as interpreted languages or databases. We will now look at these attacks.

SQL Injection

It is common practice to take user input and make it part of a database query. This is particularly popular with web services, which are often front ends for databases. For example, we might ask the user for a login name and password and then create a SQL query:

sprintf(buf,
    "SELECT * from logininfo WHERE username = '%s' AND password = '%s';",
    uname, passwd);

Suppose that the user entered this for a password:

' OR 1=1 --

We end up creating this query string[1]:

SELECT * from logininfo WHERE username = 'paul' AND password = '' OR 1=1 -- ';

The “--” after “1=1” is a SQL comment, telling the parser to ignore everything else on the line. In SQL, AND has higher precedence than OR, so the condition is evaluated as (username = 'paul' AND password = '') OR 1=1. Since 1=1 is always true, the WHERE clause is satisfied no matter what was entered. In essence, the user’s “password” turned the query into one that ignores the user’s password and unconditionally validates the user.

Statements such as this can be even more destructive as the user can use semicolons to add multiple statements and perform operations such as dropping (deleting) tables or changing values in the database.

This attack can take place because the programmer blindly allowed user input to become part of the SQL command without validating that the user data does not change the quoting or tokenization of the query. A programmer can avoid the problem by carefully checking the input. Unfortunately, this can be difficult. SQL contains too many words and symbols that may be legitimate in other contexts (such as passwords) and escaping special characters, such as prepending backslashes or escaping single quotes with two quotes can be error prone as these escapes differ for different database vendors. The safest defense is to use parameterized queries, where user input never becomes part of the query but is brought in as parameters to it. For example, we can write the previous query as:

uname = getResourceString("username");
passwd = getResourceString("password");
query = "SELECT * FROM users WHERE username = @0 AND password = @1";
db.Execute(query, uname, passwd);
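
The example above uses a generic database API. As a concrete sketch, the same idea with SQLite (an assumed backend; the table comes from the example, and comparing plaintext passwords is only for illustration) might look like:

#include <sqlite3.h>

int check_login(sqlite3 *db, const char *uname, const char *passwd)
{
    sqlite3_stmt *stmt;
    const char *sql =
        "SELECT * FROM logininfo WHERE username = ? AND password = ?;";
    int found = 0;

    if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return 0;
    sqlite3_bind_text(stmt, 1, uname, -1, SQLITE_TRANSIENT);   /* user input is bound, */
    sqlite3_bind_text(stmt, 2, passwd, -1, SQLITE_TRANSIENT);  /* never spliced into SQL */
    if (sqlite3_step(stmt) == SQLITE_ROW)
        found = 1;                       /* a matching row exists */
    sqlite3_finalize(stmt);
    return found;
}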

A related safe alternative is to use stored procedures. They have the same property that the query statement is not generated from user input and parameters are clearly identified.

While SQL injection is the most common code injection attack, databases are not the only target. Creating executable statements built with user input is common in interpreted languages, such as Shell, Perl, PHP, and Python. Before making user input part of any invocable command, the programmer must be fully aware of parsing rules for that command interpreter.

Shell attacks

The various POSIX[2] shells (sh, csh, ksh, bash, tcsh, zsh) are commonly used as scripting tools for software installation, start-up scripts, and tying together workflow that involves processing data through multiple commands. A few aspects of how many of the shells work and the underlying program execution environment can create attack vectors.

system() and popen() functions

Both system and popen functions are part of the Standard C Library and are common functions that C programmers use to execute shell commands. The system function runs a shell command while the popen function also runs the shell command but allows the programmer to capture its output and/or send it input via the returned FILE pointer.

Here we again have the danger of turning improperly-validated data into a command. For example, a program might use a function such as this to send an email alert:

char command[BUFSIZE];
snprintf(command, BUFSIZE, "/usr/bin/mail -s \"system alert\" %s", user);
FILE *fp = popen(command, "w");

In this example, the programmer uses snprintf to create the complete command with the desired user name in a buffer. This creates the possibility of an injection attack if the user name is not carefully validated. If the attacker had the option to set the user name, she could enter a string such as:

nobody; rm -fr /home/*

which will result in popen running the following command:

sh -c "/usr/bin/mail -s \"system alert\" nobody; rm -fr /home/*"

which is a sequence of commands, the latter of which deletes all user directories.
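
One mitigation is to validate the user-supplied name against a whitelist of allowed characters before it is ever handed to the shell. The helper below is a hypothetical sketch, not taken from any particular program:

#include <ctype.h>
#include <string.h>

/* Accept only short names made of letters, digits, '-' and '_', so that
   shell metacharacters such as ';', '|', and '$' are rejected outright. */
int valid_username(const char *s)
{
    if (s == NULL || *s == '\0' || strlen(s) > 32)
        return 0;
    for (; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s) && *s != '-' && *s != '_')
            return 0;
    return 1;
}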

Other environment variables

The shell PATH environment variable controls how the shell searches for commands. For instance, suppose

PATH=/home/paul/bin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/games

and the user runs the ls command. The shell will search through the PATH sequentially to find an executable file named ls:

/home/paul/bin/ls
/usr/local/bin/ls
/usr/sbin/ls
/usr/bin/ls
/sbin/ls
/bin/ls
/usr/local/games/ls

If an attacker can either change a user’s PATH environment variable or if one of the paths is publicly writable and appears before the “safe” system directories, then he can add a booby-trapped command in one of those directories. For example, if the user runs the ls command, the shell may pick up a booby-trapped version in the /usr/local/bin directory. Even if a user has trusted locations, such as /bin and /usr/bin foremost in the PATH, an intruder may place a misspelled version of a common command into another directory in the path. The safest remedy is to make sure there are no untrusted directories in PATH.

Some shells allow a user to set an ENV or BASH_ENV variable that contains the name of a file that will be executed as a script whenever a non-interactive shell is started (when a shell script is run, for example). If an attacker can change this variable then arbitrary commands may be added to the start of every shell script.

Shared library environment variables

In the distant past, programs were fully (statically) linked, meaning that all the code needed to run the program, aside from interactions with the operating system, was part of the executable program. Since so many programs use common libraries, such as the Standard C Library, these libraries are no longer compiled into each executable but are instead dynamically loaded when needed.

Similar to PATH, LD_LIBRARY_PATH is an environment variable used by the operating system’s program loader that contains a colon-separated list of directories where libraries should be searched. If an attacker can change a user’s LD_LIBRARY_PATH, common library functions can be overwritten with custom versions. The LD_PRELOAD environment variable allows one to explicitly specify shared libraries that contain functions that override standard library functions.

LD_LIBRARY_PATH and LD_PRELOAD will not give an attacker root access but they can be used to change the behavior of a program or to log library interactions. For example, by overwriting standard functions, one may change how a program generates encryption keys, uses random numbers, sets delays in games, reads input, and writes output.

As an example, let’s suppose we have a trial program that checks the current time against a hard-coded expiration time:

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char **argv)
{
    unsigned long expiration = 1483228800;
    time_t now;

    /* check software expiration */
    now = time(NULL);
    if (now > (time_t)expiration) {
        fprintf(stderr, "This software expired on %s", ctime((time_t *)&expiration));
        fprintf(stderr, "This time is now %s", ctime(&now));
    }
    else
        fprintf(stderr, "You're good to go: %lu days left in your trial.\n",
            (expiration-now)/(60*60*24));
    return 0;
}

When run, we may get output such as:

$ ./testdate
This software expired on Sat Dec 31 19:00:00 2016
This time is now Sun Feb 18 15:50:44 2018

Let us write a replacement time function that always returns a fixed value that is less than the one we test for. We’ll put it in a file called time.c:

unsigned long time() {
    return (unsigned long) 1483000000;
}

We compile it into a shared library:

gcc -shared -fPIC time.c -o newtime.so

Now we set LD_PRELOAD and run the program:

$ export LD_PRELOAD=$PWD/newtime.so
$ ./testdate
You're good to go: 2 days left in your trial.

Note that our program now behaves differently and we never had to recompile it or feed it different data!

Input sanitization

The important lesson in writing code that uses any user input in forming commands is that of input sanitization. Input must be carefully validated to make sure it conforms to the requirements of the application that uses it and does not try to execute additional commands, escape to a shell, set malicious environment variables, or specify out-of-bounds directories or devices.

File descriptors

POSIX systems have a convention that programs expect to receive three open file descriptors when they start up:

  • file descriptor 0: standard input

  • file descriptor 1: standard output

  • file descriptor 2: standard error

Functions such as printf, scanf, puts, getc and others expect these file descriptors to be available for input and output. When a program opens a new file, the operating system searches through the file descriptor table and allocates the lowest available unused file descriptor. Typically this will be file descriptor 3. However, if any of the three standard file descriptors are closed, the operating system will use one of those as an available, unused file descriptor.

The vulnerability lies in the fact that we may have a program running with elevated privileges (e.g., setuid root) that modifies a file that is not accessible to regular users. If that program also happens to write to the user via, say, printf, there is an opportunity to corrupt that file. The attacker simply needs to close the standard output (file descriptor 1) and run the program. When it opens its secret file, it will be given file descriptor 1 and will be able to do its read and write operations on the file. However, whenever the program will print a message to the user, the output will not be seen by the user as it will be directed to what printf assumes is the standard output: file descriptor 1. Printf output will be written onto the secret file, thereby corrupting it.

The shell command (bash, sh, or ksh) for closing the standard output file is an obscure-looking >&-. For example:

./testfile >&-
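
The same descriptor reuse can be demonstrated from inside a program. In this sketch (the file name is made up), standard output is closed before a file is opened, so the file receives descriptor 1 and printf output lands in it:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    close(1);                                       /* close standard output */
    int fd = open("secret.txt", O_WRONLY | O_CREAT | O_TRUNC, 0600);

    /* descriptor 1 now refers to the file, so anything written to stdout goes there */
    printf("this message ends up in secret.txt, not on the terminal\n");
    fflush(stdout);
    close(fd);
    return 0;
}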

Comprehension Errors

The overwhelming majority of security problems are caused by bugs or misconfigurations. Both often stem from comprehension errors. These are mistakes created when someone – usually the programmer or administrator – does not understand the details and every nuance of what they are doing. Some examples include:

  • Not knowing all possible special characters that need escaping in SQL commands.

  • Not realizing that the standard input, output, or error file descriptors may be closed.

  • Not understanding how access control lists work or how to configure mandatory access control mechanisms such as type enforcement correctly.

If we consider the Windows CreateProcess function, we see it is defined as:

BOOL WINAPI CreateProcess(
  _In_opt_    LPCTSTR               lpApplicationName,
  _Inout_opt_ LPTSTR                lpCommandLine,
  _In_opt_    LPSECURITY_ATTRIBUTES lpProcessAttributes,
  _In_opt_    LPSECURITY_ATTRIBUTES lpThreadAttributes,
  _In_        BOOL                  bInheritHandles,
  _In_        DWORD                 dwCreationFlags,
  _In_opt_    LPVOID                lpEnvironment,
  _In_opt_    LPCTSTR               lpCurrentDirectory,
  _In_        LPSTARTUPINFO         lpStartupInfo,
  _Out_       LPPROCESS_INFORMATION lpProcessInformation);

We have to wonder whether a programmer who does not use this frequently will take the time to understand the ramifications of correctly setting process and thread security attributes, the current directory, environment, inheritance handles, and so on. There’s a good chance that the programmer will just look up an example on places such as github.com or stackoverflow.com and copy something that seems to work, unaware that there may be obscure side effects that compromise security.

As we will see in the following sections, comprehension errors also apply to the proper understanding of things as basic as various ways to express characters.

Directory parsing

Some applications, notably web servers, accept hierarchical filenames from a user but need to ensure that they restrict access only to files within a specific point in the directory tree. For example, a web server may need to ensure that no page requests go outside of /home/httpd/html.

An attacker may try to gain access by using paths that include .. (dot-dot), which is a link to the parent directory. For example, an attacker may try to download a password file by requesting

http://poopybrain.com/../../../etc/passwd

The hope is that the programmer did not implement parsing correctly and might try simply suffixing the user-requested path to a base directory:

"/home/httpd/html/" + "../../../etc/passwd"

to form

/home/httpd/html/../../../etc/passwd

which will retrieve the password file, /etc/passwd.

A programmer may anticipate this and check for dot-dot but has to realize that dot-dot directories can be anywhere in the path. This is also a valid pathname but one that should be rejected for trying to escape to the parent:

http://poopybrain.com/419/notes/../../416/../../../../etc/passwd

Moreover, the programmer cannot just search for .. because that can be a valid part of a filename. All three of these should be accepted:

http://poopybrain.com/419/notes/some..other..stuff/
http://poopybrain.com/419/notes/whatever../
http://poopybrain.com/419/notes/..more.stuff/

Also, extra slashes are perfectly fine in a filename, so this is acceptable:

http://poopybrain.com/419////notes///////..more.stuff/

The programmer should also track where the request is in the hierarchy. If dot-dot doesn’t escape above the base directory, it should most likely be accepted:

http://poopybrain.com/419/notes/../exams/

These are not insurmountable problems but they illustrate that a quick-and-dirty attempt at filename processing may be riddled with bugs.
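
One way to sidestep hand-written parsing (a sketch of a common approach, not necessarily what any given web server does) is to let the operating system resolve the path with realpath and then confirm the result still lies under the base directory:

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* base is assumed to be an absolute directory path ending in '/',
   such as "/home/httpd/html/"; requested is the client-supplied path. */
int path_is_safe(const char *base, const char *requested)
{
    char full[PATH_MAX], resolved[PATH_MAX];

    snprintf(full, sizeof(full), "%s%s", base, requested);
    if (realpath(full, resolved) == NULL)     /* resolves "..", "//", and symlinks */
        return 0;                             /* nonexistent or unresolvable path */
    return strncmp(resolved, base, strlen(base)) == 0;
}

Because realpath requires the file to exist and follows symbolic links, a real server would need additional checks, but the approach avoids re-implementing pathname parsing by hand.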

Unicode parsing

If we continue on the example of parsing pathnames in a web server, let us consider a bug in early releases of Microsoft’s IIS (Internet Information Services, their web server). IIS had proper pathname checking to ensure that attempts to get to a parent are blocked:

http://www.poopybrain.com/scripts/../../winnt/system32/cmd.exe

Once the pathname was validated, it was passed to a decode function that decoded any embedded Unicode characters and then processed the request.

The problem with this technique was that non-international characters (traditional ASCII) could also be written as Unicode characters. A “/” could also be written in a URL as its hexadecimal value, %2f (decimal 47). It could also be represented as the two-byte Unicode sequence %c0%af.

The reason for this stems from the way Unicode was designed to support compatibility with one-byte ASCII characters. This encoding is called UTF-8. If the first bit of a character is a 0, then we have a one-byte ASCII character (in the range 0..127). However, if the first bit is a 1, we have a multi-byte character. The number of leading 1s determines the number of bytes that the character takes up. If a character starts with 110, we have a two-byte Unicode character.

With a two-byte character, the UTF-8 standard defines a bit pattern of

110a bcde   10fg hijk

The values a-k above represent 11 bits that give us a value in the range 0..2047. The “/” character, 0x2f, is 47 in decimal and 0010 1111 in binary. The value represents offset 47 into the character table (called codepoint in Unicode parlance). Hence we can represent the “/” as 0x2f or as the two byte Unicode sequence:

1100 0000   1010 1111

which is the hexadecimal sequence %c0%af. Technically, this is disallowed: the standard states that codepoints less than 128 must be represented as one byte, but the two-byte sequence is accepted by most Unicode parsers. A valid-looking three-byte sequence can be constructed as well.
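
The bit manipulation behind this overlong two-byte encoding can be checked with a few lines of C (a sketch of the arithmetic only):

#include <stdio.h>

int main(void)
{
    unsigned int cp = 0x2f;                 /* codepoint for '/' */
    unsigned char b1 = 0xC0 | (cp >> 6);    /* 110xxxxx: upper 5 of the 11 bits */
    unsigned char b2 = 0x80 | (cp & 0x3F);  /* 10xxxxxx: lower 6 bits */

    printf("%%%02x%%%02x\n", b1, b2);       /* prints %c0%af */
    return 0;
}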

Microsoft’s bug was that the pathname validation did not treat %c0%af as equivalent to a /, since that sequence should never be used to represent the character. However, the Unicode parser was happy to translate it, and attackers were able to use this to access any file on a server running IIS. This bug also gave attackers the ability to invoke cmd.exe, the command interpreter, and execute any commands on the server.

After Microsoft fixed the multi-byte Unicode bug, another problem came up. The parsing of escaped characters was recursive, so if the resultant string looked like a Unicode hexadecimal sequence, it would be re-parsed.

As an example of this, let’s consider the backslash (\), which Microsoft treats as equivalent to a slash (/) in URLs since their native pathname separator is a backslash[3].

The backslash can be written in a URL in hexadecimal format as %5c. The “%” character can be expressed as %25. The “5” character can be expressed as %35. The “c” character can be expressed as %63. Hence, if the URL parser sees the string %%35c, it would expand the %35 to the character “5”, which would result in %5c, which would then be converted to a \. If the parser sees %25%35%63, it would expand each of the %nn components to get the string %5c, which would then be converted to a \. As a final example, if the parser comes across %255c, it will expand %25 to % to get the string %5c, which would then be converted to a \.

It is not trivial to know what a name ultimately refers to, but it is clear that all conversions have to be done before the validity of the pathname is checked. Checking the validity of a pathname within an application is error-prone. The operating system itself parses a pathname a component at a time, traversing the directory tree and checking access rights as it goes along. The application is trying to recreate a similar action without actually traversing the file system, just by parsing the name and mapping it to a subtree of the file system namespace.

TOCTTOU attacks

TOCTTOU stands for Time of Check to Time of Use. If we have code of the form:

if I am allowed to do something
    then do it

we may be exposing ourselves to a race condition. There is a window of time between the test and the action. If an attacker can change the condition after the check then the action may take place even if the check should have failed.

One example of this is the print spooling program, lpr. It runs as a setuid program with root privileges so that it can copy a file from a user’s directory into a privileged spool directory that serves as a queue of files for printing. Because it runs as root, it can open any file, regardless of permissions. To keep the user honest, it will check access permissions on the file that the user wants to print and then, only if the user has legitimate read access to the file, it will copy it over to the spool directory for printing. An attacker can create a link to a readable file and then run lpr in the background. At the same time, he can change the link to point to a file for which he does not have read access. If the timing is just perfect, the lpr program will check access rights before the file is re-linked but will then copy the file for which the user has no read access.

Another example of the TOCTTOU race condition is the set of temporary filename creation functions (tmpnam, tempnam, mktemp, GetTempFileName, etc.). These functions create a unique filename when they are called, but there is no guarantee that an attacker doesn’t create a file with the same name before that filename is used. If the attacker creates and opens a file with the same name, she will have access to that file for as long as it is open, even if the user’s program changes access permissions for the file later on.

The best defense for the temporary file race condition is to use the mkstemp function, which creates a file based on a template name and opens it as well, avoiding the race condition between checking the uniqueness of the name and opening the file.
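
A minimal sketch of the safe pattern (the name template is illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    char name[] = "/tmp/reportXXXXXX";   /* the trailing Xs are replaced in place */
    int fd = mkstemp(name);              /* creates and opens the file atomically */

    if (fd < 0) {
        perror("mkstemp");
        return 1;
    }
    dprintf(fd, "temporary data\n");     /* use the descriptor, not the name */
    close(fd);
    unlink(name);                        /* remove the file when done */
    return 0;
}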


  1. Note that sprintf is vulnerable to buffer overflow. We should use snprintf, which allows one to specify the maximum size of the buffer.  ↩

  2. Unix, Linux, macOS, FreeBSD, NetBSD, OpenBSD, Android, etc.  ↩

  3. the official Unicode names for the slash and backslash characters are solidus and reverse solidus, respectively.  ↩

App confinement

Two lessons we learned from experience are that applications can be compromised and that applications may not always be trusted. Server applications, in particular, such as web servers and mail servers have been compromised over and over again. This is particularly harmful as they often run with elevated privileges and on systems on which normal users do not have accounts. The second category of risk is that we may not always trust an application. We trust our web server to work properly but we cannot necessarily trust that the game we downloaded from some unknown developer will not try to upload our files, destroy our data, or try to change our system configuration. In fact, unless we have the ability to scrutinize the codebase of a service, we will not know for sure if it tries to modify any system settings or writes files to unexpected places.

With this resignation to security in mind, we need to turn our attention to limiting the resources available to an application and making sure that a misbehaving application cannot harm the rest of the system. These are the goals of confinement.

Our initial thoughts to achieving confinement may involve proper use of access controls. For example, we can run server applications as low-privilege users and make sure that we have set proper read/write/execute permissions on files, read/write/search permissions on directories, or even set up role-based policies.

However, access controls usually do not give us the ability to set permissions for “don’t allow access to anything else.” For example, we may want our web server to have access to all files in /home/httpd but nothing outside of that directory. Access controls do not let us express that rule. Instead, we are responsible for changing the protections of every file on the system and making sure it cannot be accessed by “other”. We also have to hope that no users change those permissions. In essence, we must disallow the ability for anyone to make files publicly accessible because we never want our web server to access them. We may be able to use mandatory access control mechanisms if they are available but, depending on the system, we may not be able to restrict access properly either. More likely, we will be at risk of comprehension errors and be likely to make a configuration error, leaving parts of the system vulnerable. To summarize, even if we can get access controls to help, we will not have high assurance that they do.

Access controls also only focus on protecting access to files and devices. A system has other resources, such as CPU time, memory, disk space, and network. We may want to control how much of all of these an application is allowed to use. POSIX systems provide a setrlimit system call that allows one to set limits on certain resources for the current process and its children. These controls include the ability to set file size limits, CPU time limits, various memory size limits, and maximum number of open files.
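
A minimal sketch of the setrlimit interface (the specific limits below are arbitrary examples):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    rl.rlim_cur = 10;                  /* soft limit: 10 seconds of CPU time */
    rl.rlim_max = 30;                  /* hard limit: the soft limit cannot exceed this */
    if (setrlimit(RLIMIT_CPU, &rl) != 0)
        perror("setrlimit(RLIMIT_CPU)");

    rl.rlim_cur = rl.rlim_max = 64;    /* at most 64 open file descriptors */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        perror("setrlimit(RLIMIT_NOFILE)");

    /* the limits are inherited by any child processes created from here on */
    return 0;
}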

We also may want to control the network identity of an application. All applications share the same IP address on a system, and this may allow a compromised application to exploit address-based access controls. For example, you may be able to connect to, or even log into, systems that believe you are a trusted computer. An exploited application may also end up confusing network intrusion detection systems.

Just limiting access through resource limits and file permissions is also insufficient for services that run as root. If an attacker can compromise an app and get root access to execute arbitrary functions, she can change resource limits (just call setrlimit with different values), change any file permissions, and even change the IP address and domain name of the system.

In order to truly confine an application, we would like to create a set of mechanisms that enforce access controls to all of a system’s resources, are easy to use so that we have high assurance in knowing that the proper restrictions are in place, and work with a large class of applications. We can’t quite get all of this yet but we can come close.

chroot

The oldest app confinement mechanism is Unix’s chroot system call and command, originally introduced in 1979 in the seventh edition[1]. The chroot system call changes the root directory of the calling process to the directory specified as a parameter.

chroot("/home/httpd/html");

This sets the root of the file system to /home/httpd/html for the process and any processes it creates. The process cannot see any files outside that subset of the directory tree. This isolation is often called a chroot jail.
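
In a program, the call is typically combined with chdir and with dropping root privileges, since (as discussed below) a root process can escape the jail. A minimal sketch, with a hypothetical unprivileged user ID:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    if (chroot("/home/httpd/html") != 0 || chdir("/") != 0) {
        perror("chroot");              /* chroot requires root privileges */
        exit(1);
    }
    if (setuid(1001) != 0) {           /* 1001: hypothetical unprivileged uid */
        perror("setuid");
        exit(1);
    }
    /* the process now sees /home/httpd/html as the root of its file system */
    return 0;
}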

Jailkits

If you run chroot, you will likely get an error along the lines of:

# chroot newroot
chroot: failed to run command ‘/bin/bash’: No such file or directory

This is because /bin/bash is not within the root (in this case, the newroot directory). You’ll then create a bin subdirectory and try running chroot again and get the same error:

# mkdir newroot/bin
# ln /bin/bash newroot/bin/bash
# chroot newroot
chroot: failed to run command ‘/bin/bash’: No such file or directory

You’ll find that this is also insufficient: you’ll need to bring in the shared libraries that /bin/bash needs by mounting /lib, /lib64, and /usr/lib within that root just to enable the shell to run. Otherwise, it cannot load the libraries it needs since it cannot see above its root (i.e., outside its jail). To simplify this process, a jailkit provides a set of utilities that make it easier to create the desired environment within the jail and populate it with basic accounts, commands, and directories.

Problems with chroot

Chroot only limits access to the file system namespace. It does not restrict access to resources and does not protect the machine’s network identity. Applications that are compromised to give the attacker root access make the entire system vulnerable since the attacker has access to all system calls.

Chroot is available only to administrators. If this were not the case, any user would be able to get root access within the chroot jail. You would:

  1. Create a chroot jail.
  2. Populate it with the shell program and necessary support libraries.
  3. Link in the su command (set user, which allows you to authenticate to become any user).
  4. Create password files within the jail with a known password for root.
  5. Use the chroot command to enter the jail.
  6. Run su root to become the root user. The command will prompt you for a password and validate it against the password file. Since all processes run within the jail, the password file is the one you set up.

You’re still in the jail but you have root access.

Escaping from chroot

If someone manages to compromise an application running inside a chroot jail and become root, they are still in the jail but have access to all system calls. For example, they can send signals to kill all other processes or shut down the system. This would be an attack on availability.

Attaining root access also provides a few ways of escaping the jail. On POSIX systems, all non-networked devices are accessible as files within the filesystem. Even memory is accessible via a file (/dev/mem). An intruder in a jail can create a memory device (character device, major number = 1, minor number = 1):

mknod mem c 1 1

With the memory device, the attacker can patch system memory to change the root directory of the jail. More simply, an attacker can create a block device with the same device numbers as that of the main file system. For example, the root file system on my Linux system is /dev/sda1 with a major number of 8 and a minor number of 1. An attacker can recreate that in the jail:

mknod rootdisk b 8 1

and then mount it as a file system within the jail:

mount -t ext4 rootdisk myroot

Now the attacker, still in the jail, has full access to the entire file system, which is as good as being out of the jail. He can add user accounts, change passwords, delete log files, run any commands, and even reboot the system to get a clean login.

FreeBSD Jails

Chroot was good at confining the filesystem namespace of an application but provided no security if the application gained root access, and it did nothing to restrict access to other resources.

FreeBSD Jails are an enhancement to the idea of chroot. Jails provide a restricted filesystem namespace, just like chroot does, but also place restrictions on what processes are allowed to do within the jail, including selectively removing privileges from the root user in the jail. For example, processes within a jail may be configured to:

  • Bind only to sockets with a specified IP address and specific ports
  • Communicate only with other processes within the jail and none outside
  • Not be able to load kernel modules, even if root
  • Have restricted access to system calls, including the ability to:
    • Create raw network sockets
    • Create devices
    • Modify the network configuration
    • Mount or unmount filesystems

FreeBSD Jails are a huge improvement over chroot since known escapes, such as creating devices, mounting filesystems, and even rebooting the system, are disallowed. Depending on the application, however, policies may be too coarse. The changed root provides all-or-nothing access to a part of the file system. This makes Jails unsuitable for applications such as a web browser, which may be untrusted but may need access to files outside of the jail. Think about web-based applications such as email, where a user may want to upload or download attachments. Jails also do not prevent malicious apps from accessing the network and trying to attack other machines … or from trying to crash the host operating system. Moreover, FreeBSD Jails are a BSD-only solution: with an estimated 0.95…1.7% share of server deployments, they are a great solution on an operating system that is not widely used.

Linux namespaces, cgroups, and capabilities

Linux’s answer to FreeBSD Jails was a combination of three elements: control groups, namespaces, and capabilities.

Control groups (cgroups)

Linux control groups, also called cgroups, allow you to allocate resources such as CPU time, system memory, disk bandwidth, and network bandwidth among user-defined groups of processes, and to monitor their resource usage. This allows, for example, an administrator to allocate a larger share of the processor to a critical server application.

An administrator creates one or more cgroups and assigns resource limits to each of them. Then any application can be assigned to a control group and will not be able to use more than the resource limits configured for that group. Applications are unaware of these limits. Control groups are organized in a hierarchy similar to processes: child cgroups inherit some attributes from their parents.
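
Control groups are managed through a filesystem interface. As a rough sketch, assuming the cgroup v2 hierarchy is mounted at /sys/fs/cgroup, that the cpu and memory controllers are enabled there, and that we run with sufficient privileges, the following C program creates a group named demo, limits it to 256 MB of memory and half of one CPU, and moves the calling process into it:

#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write one value into a cgroup control file. */
static void cg_write(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL) { perror(path); return; }
    fprintf(f, "%s\n", value);
    fclose(f);
}

int main(void)
{
    /* In cgroup v2, creating a group is just creating a directory. */
    if (mkdir("/sys/fs/cgroup/demo", 0755) == -1)
        perror("mkdir");

    /* 256 MB of memory; 50 ms of CPU for every 100 ms period. */
    cg_write("/sys/fs/cgroup/demo/memory.max", "268435456");
    cg_write("/sys/fs/cgroup/demo/cpu.max", "50000 100000");

    /* Move this process into the group. Its children inherit the
       membership, and none of them can exceed the limits above. */
    char pid[32];
    snprintf(pid, sizeof(pid), "%d", getpid());
    cg_write("/sys/fs/cgroup/demo/cgroup.procs", pid);

    return 0;
}

The group name demo and the specific limits are arbitrary; the point is that the process itself never has to cooperate – the limits are imposed from outside.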

Linux namespaces

Chroot only restricted the filesystem namespace. The filesystem namespace is the best-known namespace in the system but not the only one. Linux namespaces provide control over how processes are isolated in the following namespaces:

  • IPC (CLONE_NEWIPC): System V IPC and POSIX message queues. Objects created in an IPC namespace are visible only to other processes in that namespace.
  • Network (CLONE_NEWNET): Network devices, stacks, and ports. Isolates IP protocol stacks, IP routing tables, firewalls, and socket port numbers.
  • Mount (CLONE_NEWNS): Mount points. A set of processes can have its own distinct set of mount points and its own view of the file system.
  • PID (CLONE_NEWPID): Process IDs. Processes in different PID namespaces can have their own process IDs; a child cannot see processes in the parent or in other namespaces.
  • User (CLONE_NEWUSER): User and group IDs. Each namespace has its own user and group IDs; you can be root in a namespace yet have restricted privileges outside of it.
  • UTS (CLONE_NEWUTS): Host name and domain name. Setting the hostname or domain name does not affect the rest of the system.
  • Cgroup (CLONE_NEWCGROUP): Control group. Gives the process a new control group root.

A process can dissociate any or all of these namespaces from its parent via the unshare system call. For example, by unsharing the PID namespace, a process no longer sees other processes; it will see only itself and any child processes it creates.

The Linux clone system call is similar to fork in that it creates a new process. However, it allows you to pass flags that specify which parts of the execution context will be shared with the parent. For example, a cloned process may choose to share memory and open file descriptors, which makes it behave like a thread. It can also choose to share – or not – any of the elements of the namespace.
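
As a small sketch of namespaces in action, the following C program unshares the user and UTS namespaces (combining them lets an unprivileged user perform the unshare on kernels that allow unprivileged user namespaces) and then changes the hostname; the change is visible only inside the new namespace:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/utsname.h>

int main(void)
{
    /* Create new user and UTS namespaces for this process. The new user
       namespace grants the process a full capability set inside it, which
       is what allows the sethostname call below. */
    if (unshare(CLONE_NEWUSER | CLONE_NEWUTS) == -1) {
        perror("unshare");
        return 1;
    }

    /* This affects only the new UTS namespace, not the rest of the system. */
    if (sethostname("sandboxed", 9) == -1) {
        perror("sethostname");
        return 1;
    }

    struct utsname u;
    uname(&u);
    printf("hostname inside the namespace: %s\n", u.nodename);
    return 0;
}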

Capabilities

A problem that FreeBSD Jails tackled was that of restricting the power of root inside a Jail. You could be a root user but still disallowed from executing certain system calls. POSIX (Linux) capabilities[2] tackle this issue as well.

Traditionally, Unix systems distinguished privileged versus unprivileged processes. Privileged processes were those that ran with a user ID of 0, called the root user. When running as root, the operating system would allow access to all system calls and all access permission checks were bypassed. You could do anything.

Linux capabilities identify groups of operations, called capabilities, that can be controlled independently on a per-thread basis. The list is somewhat long, 38 groups of controls, and includes capabilities such as:

  • CAP_CHOWN: make arbitrary changes to file UIDs and GIDs
  • CAP_DAC_OVERRIDE: bypass read/write/execute checks
  • CAP_KILL: bypass permission checks for sending signals
  • CAP_NET_ADMIN: network management operations
  • CAP_NET_RAW: allow RAW sockets
  • CAP_SETUID: arbitrary manipulation of process UIDs
  • CAP_SYS_CHROOT: enable chroot

The kernel keeps track of four capability sets for each thread. A capability set is a list of zero or more capabilities. The sets are:

  • Permitted: If a capability is not in this set, the thread or its children can never acquire that capability. This limits the power of what a process and its children can do.

  • Inheritable: These capabilities will be inherited when a thread calls execve to execute a program (execve replaces the program running in the same process; it does not create a new one).

  • Effective: This is the current set of capabilities that the thread is using. The kernel uses these to perform permission checks.

  • Ambient: This is similar to Inheritable and contains a set of capabilities that are preserved across an execve of a program that is not privileged. Running a setuid or setgid program will clear the ambient set. These exist to allow a partial use of root features in a controlled manner. They are useful for user-level device drivers or software that needs a specific privilege (e.g., for certain networking operations).

A child process created via fork (the standard way of creating processes) will inherit copies of its parent’s capability sets following the rules of which capabilities have been marked as inheritable.

A set of capabilities can be assigned to an executable file by the administrator. They are stored as a file’s extended attributes (along with access control lists, checksums, and arbitrary user-defined name-value pairs). When the program runs, the executing process may further restrict the set of capabilities under which it operates if it chooses to do so (for example, after performing an operation that required the capability and knowing that it will no longer need to do so).
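
For example, using the libcap library (compile with -lcap), a process that has finished the one operation that needed CAP_NET_RAW can drop that capability from its permitted and effective sets so that code injected later cannot use it. This is only a sketch of one way to do it:

#include <stdio.h>
#include <sys/capability.h>

int main(void)
{
    cap_value_t caps[] = { CAP_NET_RAW };

    /* Read the thread's current capability sets. */
    cap_t state = cap_get_proc();
    if (state == NULL) { perror("cap_get_proc"); return 1; }

    /* Clear CAP_NET_RAW in the effective and permitted sets.
       Dropping it from the permitted set makes the loss permanent. */
    cap_set_flag(state, CAP_EFFECTIVE, 1, caps, CAP_CLEAR);
    cap_set_flag(state, CAP_PERMITTED, 1, caps, CAP_CLEAR);

    if (cap_set_proc(state) == -1) {   /* apply the modified sets */
        perror("cap_set_proc");
        cap_free(state);
        return 1;
    }
    cap_free(state);

    printf("CAP_NET_RAW dropped for the rest of this process's life\n");
    return 0;
}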

From a security point of view, the key concept of capabilities is that they allow us to provide limited elevation of privileges to a process. A process does not need to run as root (user ID 0) but can still be granted very specific privileges. For example, we can grant the ping command the ability to access raw sockets so it can send an ICMP ping message on the network but not have any other administrative powers. The application does not need to run as root and even if an attacker manages to inject code, the opportunities for attack will be restricted.

The Linux combination of cgroups, namespaces, and capabilities provides a powerful set of mechanisms to

  1. Set limits on the system resources (processor, disk, network) that a group of processes will use.

  2. Constrain the namespace, making parts of the filesystem or the existence of other processes or users invisible.

  3. Give restricted privileges to specific applications so they do not need to run as root.

This enables us to create stronger jails and have a fine degree of control as to what processes are or are not allowed to do in that jail.

While bugs have been found in these mechanisms, the more serious problem is that of comprehension. The system has become far, far more complex than it was in the days of chroot. A user has to learn quite a lot to use these mechanisms properly. Failure to understand their behavior fully can create vulnerabilities. For example, namespaces do not prohibit a process from making privileged system calls; they simply limit what a process can see. A process may be unable to send a kill signal to another process only because it does not share the same process ID namespace.

Together with capabilities, namespaces allow a restricted environment that also places limits on the ability to perform certain operations even if a process is granted root privileges. This is what makes it safe for ordinary users to create namespaces: you can create a namespace and even run a process as the root user (UID 0) within it, but that process will have no capabilities beyond those granted to the user; the kernel maps its user ID of 0 to a non-privileged user outside the namespace.

Containers

Software rarely lives as an isolated application. Some software requires multiple applications and most software relies on the installation of other libraries, utilities, and packages. Keeping track of these dependencies can be difficult. Worse yet, updating one shared component can sometimes cause another application to break. What was needed was a way to isolate the installation, execution, and management of multiple software packages that run on the same system.

Various attempts were undertaken to address these problems.

  1. The most basic was to fix problems when they occurred. This required carefully following instructions for installation, update, and configuration of software and extensive testing of all services on the system when anything changed. Should something break, the service would be unavailable until the problems were fixed.

  2. A drastic, but thorough, approach to isolation was to simply run each service on its own computer. That avoids conflicts in library versions and other dependencies. However, it is an expensive, cumbersome solution and is overkill in most environments.

  3. Finally, administrators could deploy virtual machines. This is a technology that allows one to run multiple operating systems on one computer and gives the illusion of services running on distinct systems. However, this is a heavyweight solution. Every service needs its own installation of the operating system and all supporting software for the service as well as standard services (networking, device management, shell, etc.). It is not efficient in terms of CPU, disk, or memory resources – or even administration effort.

Containers are a mechanism that was originally created not for security but to make it easy to package, distribute, relocate, and deploy collections of software. The focus of containers is not to enable end users to install and run their favorite apps but rather to let administrators deploy a variety of services on a system. A container encapsulates all the necessary software for a service, all of its dependencies, and its configuration into one package that can be easily passed around, installed, and removed.

In many ways, a container feels like a virtual machine. Containers provide a service with a private process namespace, its own network interface, and its own set of libraries to avoid problems with incompatible versions used by other software. Containers also allow an administrator to give the service restricted powers even if it runs with root (administrator) privileges. Unlike a virtual machine, however, multiple containers on one system all share the same operating system and kernel modules.

Containers are not a new mechanism. They are implemented using Linux’s control groups, namespaces, and capabilities to provide resource control, isolation, and privilege control, respectively. They also make use of a copy-on-write file system. This makes it easy to create new containers where the file system tracks the changes made by that container over a clean base version of the file system. Containers can also take advantage of AppArmor, which is a Linux kernel module that provides a basic form of mandatory access control based on the pathnames of files. It allows an administrator to restrict the ability of a program to access specific files even within its file system namespace.

The best-known and first truly popular container framework is Docker. A Docker Image is a file format that packages applications, their supporting libraries, and other needed files. This image can be stored and deployed in many environments. Docker made it easy to deploy containers using git-like commands (docker push, docker commit) and also to perform incremental updates. By using a copy-on-write file system, Docker images can be kept immutable (read-only) while any changes made by the container during its execution are stored separately.

As people found Docker to be useful, the next design goal was to make it easier to manage containers across a network of many computers. This is called container orchestration. There are many solutions for this, including Apache Mesos, Kubernetes, Nomad, and Docker Swarm. The best known of these is Kubernetes, which was designed by Google. It coordinates the storage of containers, handles failures of hardware and containers, and performs dynamic scaling: deploying the container on more machines to handle increased load. Kubernetes is coordination software, not a container system; it uses the Docker framework to run the actual containers.

Even though containers were designed to simplify software deployment rather than provide security to services, they do offer several benefits in the area of security:

  • They make use of namespaces, cgroups, and capabilities with restricted capabilities configured by default. This provides isolation among containers.

  • Containers provide a strong separation of policy (defined by the container configuration) from the enforcement mechanism (handled by the operating system).

  • They improve availability by providing the ability to have a watchdog timer monitor running applications and restart them if necessary. With orchestration systems such as Kubernetes, containers can be re-deployed on another system if a computer fails.

  • The environment created by a container is reproducible. The same container can be deployed on multiple systems and tested in different environments. This provides consistency and aids in testing and ensuring that the production deployment matches the one used for development and test. Moreover, it is easy to inspect exactly how a container is configured. This avoids problems encountered by manual installation of components where an administrator may forget to configure something or may install different versions of a required library.

  • While containers add nothing new to security, they help avoid comprehension errors. Even default configurations will provide improved security over the defaults in the operating system and configuring containers is easier than learning and defining the rules for capabilities, control groups, and namespaces. Administrators are more likely to get this right or import containers that are already configured with reasonable restrictions.

Containers are not a security panacea. Because all containers run under the same operating system, any kernel exploits can affect the security of all containers. Similarly, any denial of service attacks, whether affecting the network or monopolizing the processor, will impact all containers on the system. If implemented and configured properly, capabilities, namespaces, and control groups should ensure that privilege escalation cannot take place. However, bugs in the implementation or configuration may create a vulnerability. Finally, one has to be concerned with the integrity of the container itself. Who configured it, who validated the software inside of it, and is there a chance that it may have been modified by an adversary either at the server or in transit?

Virtual Machines

As a general concept, virtualization is the addition of a layer of abstraction to physical devices. With virtual memory, for example, a process has the impression that it owns the entire memory address space. Different processes can all access the same virtual memory location and the memory management unit (MMU) on the processor maps each access to the unique physical memory locations that are assigned to the process.

Process virtual machines present a virtual CPU that allows programs to execute on a processor that does not physically exist. The instructions are interpreted by a program that simulates the architecture of the pseudo machine. Early pseudo-machines included o-code for BCPL and P-code for Pascal. The most popular pseudo-machine today is the Java Virtual Machine (JVM). This simulated hardware does not even pretend to access the underlying system at a hardware level. Process virtual machines will often allow “special” calls to invoke system functions or provide a simulation of some generic hardware platform.

Operating system virtualization is provided by containers, where a group of processes is presented with the illusion of running on a separate operating system but in reality shares the operating system with other groups of processes – they are just not visible to the processes in the container.

System virtual machines allow a physical computer to act like several real machines with each machine running its own operating system (on a virtual machine) and applications that interact with that operating system. The key to this machine virtualization is to not allow each operating system to have direct access to certain privileged instructions in the processor. These instructions would allow an operating system to directly access I/O ports, MMU settings, the task register, the halt instruction and other parts of the processor that could interfere with the processor’s behavior and with the other operating systems on the system. Instead, a trap and emulate approach is used. Privileged instructions, as well as system interrupts, are caught by the Virtual Machine Monitor (VMM), also known as a hypervisor. The hypervisor arbitrates access to physical resources and presents a set of virtual device interfaces to each guest operating system (including the memory management unit, I/O ports, disks, and network interfaces). The hypervisor also handles preemption. Just as an operating system may suspend a process to allow another process to run, the hypervisor will suspend an operating system to give other operating systems a chance to run.

The two configurations of virtual machines are hosted virtual machines and native virtual machines. With a hosted virtual machine (also called a type 2 hypervisor), the computer has a primary operating system installed that has access to the raw machine (all devices, memory, and file system). This host operating system does not run in a virtual environment. One or more guest operating systems can then be run on virtual machines. The VMM serves as a proxy, converting requests from the virtual machine into operations that get sent to and executed on the host operating system. A native virtual machine (also called a type 1 hypervisor) is one where there is no “primary” operating system that owns the system hardware. The hypervisor is in charge of access to the devices and provides each operating system drivers for an abstract view of all the devices.

Security implications

Unlike app confinement mechanisms such as jails, containers, or sandboxes, virtual machines enable isolation all the way through the operating system. A compromised application, even with escalated privileges, can wreak havoc only within the virtual machine. Even compromises of the operating system kernel are limited to that virtual machine. However, a compromised virtual machine is not much different from having a compromised physical machine sitting inside your organization: not desirable and capable of attacking other systems in your environment.

Multiple virtual machines are usually deployed on one physical system. In cases such as cloud services (e.g., those provided by Amazon), a single physical system may host virtual machines from different organizations or running applications with different security requirements. If a malicious application on a highly secure system can detect that it is co-resident on a computer that is hosting another operating system, and that operating system provides fewer restrictions, the malware may be able to create a covert channel to communicate between the highly secure system with classified data and the more open system. A covert channel is a general term to describe the ability for processes to communicate via some hidden mechanism when they are forbidden by policy to do so. In this case, the channel can be created via a side channel attack. A side channel is the ability to get or transmit information using some aspect of a system’s behavior, such as changes in power consumption, radio emissions, acoustics, or performance. For example, processes on both systems, even though they are not allowed to send network messages, may create a means of communicating by altering and monitoring system load. The malware on the classified VM can run a CPU-intensive task at specific times. Listener software on the unclassified VM can do CPU-intensive tasks at a constant rate and periodically measure their completion times. These completion times may vary based on whether the classified system is doing CPU-intensive work. The variation in completion times creates a means of sending 1s and 0s and hence transmitting a message.
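
To make the timing channel concrete, here is a rough sketch of what the receiver (the listener on the unclassified VM) might look like; the workload size and the 1.2× threshold are arbitrary illustrative values, and a real channel would need synchronization and error correction:

#include <stdio.h>
#include <time.h>

/* Run a fixed amount of work and return how long it took. When a
   co-resident sender is busy, this takes measurably longer. */
static double timed_work(void)
{
    struct timespec start, end;
    volatile unsigned long x = 0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (unsigned long i = 0; i < 50UL * 1000 * 1000; i++)
        x += i;
    clock_gettime(CLOCK_MONOTONIC, &end);

    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    double baseline = timed_work();      /* calibrate with the sender idle */

    for (int bit = 0; bit < 8; bit++) {  /* receive one byte, bit by bit */
        double t = timed_work();
        /* Slower than the baseline means the sender was running a
           CPU-intensive task during this interval: interpret as a 1. */
        putchar(t > 1.2 * baseline ? '1' : '0');
        fflush(stdout);
    }
    putchar('\n');
    return 0;
}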


  1. Note that Wikipedia and many other sites refer to this as “Version 7 Unix”. Unix has been under continuous evolution at Bell Labs from 1969 through approximately 1989. As such, it did not have versions. Instead, an updated set of manuals was published periodically. Installations of Unix have been referred to by the editions of their manuals.  ↩

  2. Linux capabilities are not to be confused with the concept of capability lists, which are a form of access control that Linux does not use.  ↩

Application Sandboxing

The goal of an application sandbox is to provide a controlled and restricted environment for code execution. This can be useful for applications that may come from untrustworthy sources, such as games from unknown developers or software downloaded from dubious sites. The program can run with minimal risk of causing widespread damage to the system. Sandboxes are also used by security researchers to observe how software behaves: what the program is trying to do and whether it is attempting to access any resources in a manner that is suspicious for the application. This can help identify the presence of malware within a program. The sandbox defines and enforces what an individual application is allowed to do while executing within its sandbox.

We previously looked at isolation via jails and containers, which use mechanisms that include namespaces, control groups, and capabilities. These constitute a widely-used form of sandboxing. However, these techniques focus on isolating an application (or group of processes) from other processes, restricting access to parts of the file system, and/or providing a separate network stack with a new IP address.

While this is great for running services without the overhead of deploying virtual machines, it does not sufficiently address the basic needs of running normal applications. We want to protect users from their applications: give users the ability to run apps but restrict what those apps can do on a per-app basis.

For example, you may want to make sure that a program accesses only files under your home directory with a suffix of “.txt”, and only for reading, without restricting the entire file system namespace as chroot would do, which would require creating a separate directory structure for shared libraries and other standard components the application may need. As another example, you might want an application to have access only to TCP networking. With a mechanism such as namespaces, we cannot exercise control over the names of files that an application can open or their access modes. Namespaces also do not allow us to control how the application interacts with the network. Capabilities allow us to restrict what a process running with root privileges can do but offer no ability to restrict more fundamental operations, such as denying a process the ability to read a file even if that file has read access enabled. The missing ingredient is rule-based policies to define precisely what system calls an application can invoke – down to the parameters of the system calls of interest.

Instead of building a jail (a container), we will add an extra layer of access control. An application will have the same view of the operating system as any other application but will be restricted in what it can do.

Sandboxing is currently supported on a wide variety of platforms at either the kernel or application level. We will examine four types of application sandboxes:

  1. User-level validation
  2. OS support
  3. Browser-based application sandboxing
  4. The Java sandbox

Note that there are many other sandbox implementations. This is just a representative sampling.

Application sandboxing via system call interposition & user-level validation

Applications interact with their environment via system calls to the operating system. Any interaction that an application needs to do aside from computation, whether legitimate or because it has been compromised, must be done through system calls: accessing files or devices, changing permissions, accessing the network, talking with other processes, etc.

An application sandbox will allow us to create policies that define which system calls are permissible to the application and in what way they can be used.

If the operating system does not provide us with the required support and we do not have the ability to recompile an application to force it to use alternate system call libraries, we can rely on system call interposition to construct a sandbox. System call interposition is the process of intercepting an app’s system calls and performing additional operations. The technique is also called hooking. In the case of a sandbox, it will intercept a system call, inspect its parameters, and decide whether to allow the system call to take place or return an error.
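
On Linux, one common way to do this purely at the user level is the ptrace mechanism (Janus, described next, instead used a kernel module that hooks the system call table). The sketch below, which assumes an x86-64 system, traces a child running /bin/ls and is stopped at each system call, which is where a monitor could inspect the call and its arguments and decide whether to permit it; here it only logs openat calls:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
        execl("/bin/ls", "ls", "/", (char *)NULL);
        perror("execl");
        exit(1);
    }

    int status;
    waitpid(pid, &status, 0);                    /* child stops at exec */
    while (WIFSTOPPED(status)) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);

        /* orig_rax holds the system call number on x86-64. ptrace stops
           at both the entry and the exit of each call, so each openat is
           reported twice. A real monitor would decide here whether to
           allow the call or force an error return. */
        if (regs.orig_rax == SYS_openat)
            fprintf(stderr, "child invoked openat()\n");

        ptrace(PTRACE_SYSCALL, pid, NULL, NULL); /* run to the next stop */
        waitpid(pid, &status, 0);
    }
    return 0;
}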

Example: Janus

One example of doing validation at the user level is the Janus sandboxing system, developed at UC Berkeley, originally for SunOS but later ported to Linux. Janus uses a lightweight, loadable kernel module called mod_janus. The module initializes itself by setting up hooks to redirect system call requests to itself. A hook is simply a mechanism that redirects an API request somewhere else and allows it to return for normal processing. For example, a function can be hooked to simply log the fact that it has been called. The Janus kernel module copies the system call table and redirects the vector of calls to mod_janus.

A user-configured policy file defines the allowable files and network operations for each sandboxed application. Users run applications through a Janus launcher/monitor program, which places the application in the sandbox. The monitor parses the policy file and spawns a child process for the user-specified program. The child process executes the actual application. The parent Janus process serves as the monitor, running a policy engine that receives system call notifications and decides whether to allow or disallow the system call.

Whenever a sandboxed application makes a system call, the call is redirected by the hook in the kernel to the Janus kernel module. The module blocks the thread (it is still waiting for the return from the system call) and signals the user-level Janus process that a system call has been requested. The user-level Janus process’ policy engine then requests all the necessary information about the call (calling process, type of system call, parameters). The policy engine makes a policy decision to determine whether, based on the policy, the process should be permitted to make the system call. If so, the system call is directed back to the operating system. If not, an error code is returned to the application.

Challenges of user-level validation

The biggest challenge with implementing Janus is that the user-level monitor must mirror the state of the operating system. If the child process forks a new process, the Janus monitor must also fork. It needs to keep track of not just network operations but the proper sequencing of the steps in the protocol to ensure that no improper actions are attempted on the network. This is a sequence of socket, bind, connect, read/write, and shutdown system calls. If one fails, chances are that the others should not be allowed to take place. However, the Janus monitor does not have the knowledge of whether a particular system call succeeded or not; approved calls are simply forwarded from the module to the kernel for processing. Failure to handle this correctly may enable attack vectors such as trying to send data on an unconnected socket.

The same applies to file operations. If a file failed to open, read and write operations should not be allowed to take place. Keeping track of state also gets tricky if file descriptors are duplicated (e.g., via the dup2 system call); it is not clear whether any requested file descriptor is a valid one or not.

Pathname parsing of file names has to be handled entirely by the monitor. We earlier examined the complexities of processing "../" sequences in pathnames. Janus has to do this in order to validate any policies on permissible file names or directories. It also has to keep track of relative filenames since the application may change the current directory at any time via the chdir system call. This means Janus needs to intercept chdir requests and process new pathnames within the proper context. Moreover, the application may change its entire namespace if the process calls chroot.

File descriptors can cause additional problems. A process can pass an open file descriptor to another process via UNIX domain sockets, which can then use that file descriptor (via a sendfd and recv_fd set of calls). Janus would be hard-pressed to know that this happened since that would require understanding the intent of the underlying sendmsg system calls and cmsg directives.

In addition to these difficulties, user-level validation suffers from possible TOCTTOU (time-of-check-to-time-of-use) race conditions. The environment present when Janus validates a request may change by the time the request is processed.

Application sandboxing with integrated OS support

The better alternative to having a user-level process decide on whether to permit system calls is to incorporate policy validation in the kernel. Some operating systems provide kernel support for sandboxing. These include the Android Application Sandbox, the iOS App Sandbox, the macOS sandbox, and AppArmor on Linux. Microsoft introduced the Windows Sandbox in December 2018, but this functions far more like a container than a traditional application sandbox, giving the process an isolated execution environment.

Seccomp-BPF

Seccomp-BPF, which stands for SECure COMPuting with Berkeley Packet Filters, is a sandboxing framework that is available on Linux systems. It allows the user to attach a system call filter to a process and all of the descendants of that process. Users can enumerate allowable system calls and also allow or disallow access to specific files or network protocols. Seccomp has been a core part of Android security since the release of Android O in August 2017.

Seccomp uses the Berkeley Packet Filter (BPF) interpreter, a framework that was initially created for network socket filtering. With socket filtering, a user can create a filter to allow or disallow certain types of data to come through the socket. Because BPF was designed to operate on packets, seccomp sends “packets” that represent system calls to the BPF interpreter. The filter allows the user to define rules that are applied to these system calls. These rules enable the inspection of each system call and its arguments and specify a resulting action. Actions include allowing the call to run or not. If the call is not permitted, rules can specify whether an error is returned to the process, a SIGSYS signal is sent, or the process gets killed.

Seccomp is not designed to serve as a complete sandbox solution but is a tool for building sandboxes. For further process isolation, it can be used with other components, such as namespaces, capabilities, and control groups. The biggest downside of seccomp is its use of BPF. BPF is a full interpreter – a processor virtual machine – that supports reading and writing registers, scratch memory operations, arithmetic, and conditional branches. Policies are compiled into BPF instructions before they are loaded into the kernel. It provides a low-level interface and the rules are not simple condition-action definitions. System calls are referenced by number, so it is important to check the system architecture in the filter since Linux system call numbers vary across architectures. Once the user gets past this, the challenge is to apply the principle of least privilege effectively: restrict unnecessary operations but ensure that the program still functions correctly, which includes things like logging errors and other extraneous activities.
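
Here is a minimal sketch of installing a filter directly through prctl, the raw interface described above (the libseccomp library offers a friendlier one). It allows only read, write, and exit_group on x86-64 and makes every other system call fail with EPERM:

#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void)
{
    struct sock_filter filter[] = {
        /* Check the architecture; kill the process if it is not x86-64,
           since system call numbers differ across architectures. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

        /* Load the system call number and compare it against the allow list. */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_read,       3, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_write,      2, 0),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_exit_group, 1, 0),

        /* Anything else fails with EPERM; the final rule allows the call. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
        .filter = filter,
    };

    /* Required so an unprivileged process is allowed to install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) == -1) { perror("no_new_privs"); return 1; }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) == -1) { perror("seccomp"); return 1; }

    write(1, "write() is still allowed\n", 25);   /* permitted by the filter */
    syscall(SYS_exit_group, 0);                   /* exit via an allowed call */
}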

The Apple Sandbox

Conceptually, Apple’s sandbox is similar to seccomp in that it is a kernel-level sandbox, although it does not use the Berkeley Packet Filter. The sandbox comprises:

  • User-level library functions for initializing and configuring the sandbox for a process
  • A server process for handling logging from the kernel
  • A kernel extension that uses the TrustedBSD API to enforce sandbox policies
  • A kernel extension that provides support for regular expression pattern matching to enforce the defined policies

An application initializes the sandbox by calling sandbox_init. This function reads a human-friendly policy definition file and converts it into a binary format that is then passed to the kernel. Now the sandbox is initialized. Any function calls that are hooked by the TrustedBSD layer will be passed to the sandbox kernel extension for enforcement. Note that, unlike Janus, all enforcement takes place in the kernel. Enforcement means consulting the list of sandbox rules for the process that made the system call (the policy that was sent to the kernel by sandbox_init). In some cases, the rules may involve regular expression pattern matching, such as rules that define filename patterns.
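
A minimal sketch of that user-level interface on macOS (it has since been deprecated in favor of the App Sandbox entitlements described below), using the predefined kSBXProfilePureComputation profile, which denies essentially all file and network access:

#include <sandbox.h>
#include <stdio.h>

int main(void)
{
    char *err = NULL;

    /* Hand a predefined profile to the kernel. From this point on, the
       TrustedBSD hooks enforce the policy for this process. */
    if (sandbox_init(kSBXProfilePureComputation, SANDBOX_NAMED, &err) != 0) {
        fprintf(stderr, "sandbox_init failed: %s\n", err);
        sandbox_free_error(err);
        return 1;
    }

    /* Expected to fail now: the profile forbids reading files. */
    if (fopen("/etc/passwd", "r") == NULL)
        perror("fopen /etc/passwd");

    return 0;
}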

The Apple sandbox helps avoid comprehension errors by providing predefined sandbox profiles (entitlements). Certain resources are restricted by default and a sandboxed app must explicitly ask the user for permission. This includes accessing:

  • the system hardware (camera, microphone, USB)
  • network connections, data from other apps (calendar, contacts)
  • location data, and user files (photos, movies, music, user-specified files)
  • iCloud services

For mobile devices, there are also entitlements for push notifications and Apple Pay/Wallet access.

Once permission is granted, the sandbox policy can be modified for that application. Some basic categories of entitlements include:

  • Restrict file system access: stay within an app container, a group container, any file in the system, or temporary/global places
  • Deny file writing
  • Deny networking
  • Deny process execution

Browser-based Sandboxing: Chromium Native Client (NaCl)

Since the early days of the web, browsers have supported a plug-in architecture, where modules (containing native code) could be loaded into the browser to extend its capabilities. When a page specifies a plug-in via an <object> or <embed> element, the requested content is downloaded and the plug-in that is associated with that object type is invoked on that content. Examples of common plug-ins include Adobe Flash, Adobe Reader (for rendering PDF files), and Java, but there are hundreds of others. The challenge with this framework is how to keep the software in a plug-in from doing bad things.

An example of sandboxing designed to address the problem of running code in a plug-in is the Chromium Native Client, called NaCl. Chromium is the open source project behind the Google Chrome browser and Chrome OS. NaCl is a browser plug-in designed to allow safe execution of untrusted native code within a browser, unlike JavaScript, which is run through an interpreter. It is built with compute-intensive applications in mind, as well as interactive applications that use the resources of a client, such as games.

NaCl is a user-level sandbox and works by restricting the type of code it can sandbox. It is designed for the safe execution of platform-independent, untrusted native code inside a browser. The motivation was that some browser-based applications will be so compute-intensive that writing them in JavaScript will not be sufficient. These native applications may be interactive and may use various client resources but will need to do so in a controlled and monitored manner.

NaCl supports two categories of code: trusted and untrusted. Trusted code can run without a sandbox. Untrusted code must run inside a sandbox. This code has to be compiled using the NaCl SDK or any compiler that adheres to NaCl’s data alignment rules and instruction restrictions (not all machine instructions can be used). Since applications cannot access resources directly, the code is also linked with special NaCl libraries that provide access to system services, including the file system and network. NaCl includes a GNU-based toolchain that contains custom versions of gcc, binutils, gdb, and common libraries. This toolchain supports 32-bit ARM, 32-bit Intel x86 (IA–32), x86–64, and 32-bit MIPS architectures.

NaCl executes with two sandboxes in place:

  1. The inner sandbox uses Intel’s IA–32 architecture’s segmentation capabilities to isolate memory regions among apps, so that even if multiple apps run in the same process space, their memory is still isolated. Before executing an application, the NaCl loader applies static analysis to the code to ensure that there is no attempt to use privileged instructions or create self-modifying code. It also attempts to detect security defects in the code.

  2. The outer sandbox uses system call interposition to restrict the capabilities of apps at the system call level. Note that this is done completely at the user level via libraries rather than system call hooking.

Process virtual machine sandboxes: Java

A different type of sandbox is the Java Virtual Machine. The Java language was originally designed as a language for web applets: compiled Java programs that would be downloaded and run dynamically upon fetching a web page. As such, confining how those applications run and what they can do was extremely important. Because the author of the application would not know what operating system or hardware architecture a client had, Java would compile to a hypothetical architecture called the Java Virtual Machine (JVM). An interpreter on the client would simulate the JVM and process the instructions in the application. The Java sandbox has three parts to it:

The bytecode verifier verifies Java bytecodes before they are executed. It tries to ensure that the code looks like valid Java byte code with no attempts to circumvent access restrictions, convert data illegally, bypass array bounds, or forge pointers.

The class loader enforces restrictions on whether a program is allowed to load additional classes and ensures that key parts of the runtime environment are not overwritten (e.g., the standard class libraries). The class loader ensures that malicious code does not interfere with trusted code and that trusted class libraries remain accessible and unmodified. It implements ASLR (Address Space Layout Randomization) by randomly laying out the runtime data areas (stacks, bytecodes, heap).

The security manager enforces the protection domain. It defines what actions are safe and which are not; it creates the boundaries of the sandbox and is consulted before any access to a resource is permitted. It is called at the time an application makes a call to specific methods, so it can provide run-time verification of whether a program has been given the rights to invoke a method, such as file I/O or network access. Any actions not allowed by the security policy result in a SecurityException being thrown. The security manager is the component that allows the user to restrict an application from accessing files or accessing the network, for example. A user can create a security policy file that enumerates what an application can or cannot do.

Java security is deceptively complex. After over twenty years of bugs one hopes that the truly dangerous ones have been fixed. Even though the Java language itself is pretty secure and provides dynamic memory management and array bounds checking, buffer overflows have been found in the underlying C support library, which has been buggy in general. Varying implementations of the JVM environment on different platforms make it unclear how secure any specific client will be. Moreover, Java supports the use of native methods, libraries that you can write in compiled languages such as C that interact with the operating system directly. These bypass the Java sandbox.

References

Injection

SQL Injection, The Open Web Application Security Project, April 10, 2016.

SQL Injection, Acunetix.

Simson Garfinkel & Gene Spafford, Section 11.5, Protecting Yourself, Practical UNIX & Internet Security, Second Edition, April 1996. Discusses shell attacks.

Directory traversal attack, Wikipedia.

Why does Directory traversal attack %C0%AF work?, Information Security Stack Exchange, September 9, 2016

Tom Rodriquez, What are unicode vulnerabilities on Internet Information Server (IIS)?, SANS.

The Unicode Consortium.

IDN homograph attack, Wikipedia.

Time of check to time of use, Wikipedia.

Michael Cobb, How to mitigate the risk of a TOCTTOU attack, TechTarget, August 2011.

Ernst & Young LLP Security & Technology Solutions, Using Attack Surface Area And Relative Attack Surface Quotient To Identify Attackability. Customer Information Paper.

Michael Howard, Back to the Future: Attack Surface Analysis and Reduction, Microsoft Secure Blog, February 14, 2011.

Olivier Sessink, Jailkit, November 18, 2015.

Confinement

Evan Sarmiento, Chapter 4. The Jail Subsystem, FreeBSD Architecture Handbook, The FreeBSD Documentation Project. 2001, Last modified: 2016–10–29.

Matteo Riondato, Chapter 14. System Administration: Jails, FreeBSD Architecture Handbook, The FreeBSD Documentation Project. 2001, Last modified: 2016–10–29.

Chapter 1. Introduction to Control Groups, Red Hat Enterprise Linux 6.8 Resource Management Guide.

José Manuel Ortega, Everything you need to know about Containers Security, OSDEM 2018 presentation.

Johan De Gelas, Hardware Virtualization: the Nuts and Bolts, AnandTech article, March 17 2008.