BigTable
A NoSQL massively parallel table
November 2011
Introduction
Traditional relational databases present a view that is composed of multiple tables, each with rows and named columns. Queries, mostly performed in SQL (Structured Query Language), allow one to extract specific columns from a row where certain conditions are met (e.g., a column has a specific value). Moreover, one can perform queries across multiple tables (this is the "relational" part of a relational database). For example, a table of students may include a student's name, ID number, and contact information. A table of grades may include a student's ID number, course number, and grade. We can construct a query that extracts grades by name by searching for the ID number in the student table and then matching that ID number in the grade table.
Moreover, with traditional databases, we expect ACID guarantees: that transactions will be atomic, consistent, isolated, and durable. As we saw when we studied distributed transactions, it is impossible to guarantee consistency while providing high availability and network partition tolerance. This makes ACID databases unattractive for highly distributed environments and led to the emergence of alternate data stores that are targeted at high availability and high performance. Here, we will look at the structure and capabilities of BigTable.
BigTable
BigTable is a distributed storage system that is structured as a large table: one that may be petabytes in size and distributed among tens of thousands of machines. It is designed for storing items such as billions of URLs, with many versions per page; over 100 TB of satellite image data; hundreds of millions of users; and performing thousands of queries a second. BigTable was developed at Google and has been in use since 2005 in dozens of Google services. An open source version, HBase, was created by the Apache project on top of the Hadoop core. Apache Cassandra, first developed at Facebook to power its inbox search, is similar to BigTable but has a tunable consistency model and no master (central server).
BigTable is designed with semi-structured data storage in mind. It is a large map that is indexed by a row key, column key, and a timestamp. Each value within the map is an array of bytes that is interpreted by the application. Every read or write of data to a row is atomic, regardless of how many different columns are read or written within that row.
It is easy enough to picture a simple table. Let's look at a few characteristics of BigTable:
- map
- A map is an associative array: a data structure that allows one to quickly look up the value corresponding to a key. BigTable is a collection of (key, value) pairs where the key identifies a row and the value is the set of columns.
- persistent
- The data is stored persistently on disk.
- distributed
- BigTable's data is distributed among many independent machines. At Google, BigTable is built on top of GFS (Google File System). The Apache open source version of BigTable, HBase, is built on top of HDFS (Hadoop Distributed File System) or Amazon S3. The table is broken up among rows, with groups of adjacent rows managed by a server. A row itself is never distributed.
- sparse
- The table is sparse, meaning that different rows in a table may use different columns, with many of the columns empty for a particular row.
- sorted
- Most associative arrays are not sorted. A key is hashed to a position in a table. BigTable sorts its data by keys. This helps keep related data close together, usually on the same machine, assuming that one structures keys in such a way that sorting brings the data together. For example, if domain names are used as keys in a BigTable, it makes sense to store them in reverse order to ensure that related domains are close together. For example:
    edu.rutgers.cs
    edu.rutgers.nb
    edu.rutgers.www
- multidimensional
- A table is indexed by rows. Each row contains one or more named column families. Column families are defined when the table is first created. Within a column family, one may have one or more named columns. All data within a column family is usually of the same type. The implementation of BigTable usually compresses all the columns within a column family together. Columns within a column family can be created on the fly. Rows, column families and columns provide a three-level naming hierarchy in identifying data. For example:
    "edu.rutgers.cs" : {              // row
        "users" : {                   // column family
            "watrous": "Donald",      // column
            "hedrick": "Charles",     // column
            "pxk" : "Paul"            // column
        }
        "sysinfo" : {                 // another column family
            "" : "SunOS 5.8"          // column (null name)
        }
    }
  To get data from BigTable, you need to provide a fully-qualified name in the form column-family:column. For example, users:pxk or sysinfo:. The latter shows a null column name.
- time-based
- Time is another dimension in BigTable data. Every column family may keep multiple versions of column family data. If an application does not specify a timestamp, it will retrieve the latest version of the column family. Alternatively, it can specify a timestamp and get the latest version that is earlier than or equal to that timestamp.
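The characteristics above can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: the class name `TinyTable` and its methods are hypothetical and are not part of BigTable's client API, and the sample values are made up. It shows a sparse map indexed by row key, `family:column` name, and timestamp, where a read returns the latest version at or before the requested timestamp.

```python
from bisect import bisect_right
from collections import defaultdict


class TinyTable:
    """Toy model of a sparse, sorted, multi-versioned map (illustrative only)."""

    def __init__(self):
        # row key -> "family:column" -> list of (timestamp, value), oldest first.
        # Rows only hold the columns that were actually written (sparse).
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, timestamp):
        versions = self._rows[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0])

    def get(self, row, column, timestamp=None):
        """Return the latest value at or before `timestamp` (latest overall if None)."""
        versions = self._rows.get(row, {}).get(column, [])
        if not versions:
            return None
        if timestamp is None:
            return versions[-1][1]
        stamps = [ts for ts, _ in versions]
        i = bisect_right(stamps, timestamp)
        return versions[i - 1][1] if i else None

    def row_keys(self):
        """Row keys in sorted order, as a scan would visit them."""
        return sorted(self._rows)


table = TinyTable()
table.put("edu.rutgers.cs", "users:pxk", b"Paul", timestamp=1)
table.put("edu.rutgers.cs", "users:pxk", b"Paul K.", timestamp=5)
table.put("edu.rutgers.cs", "sysinfo:", b"SunOS 5.8", timestamp=2)
```

Values are byte strings because the table attaches no type to a column; interpreting the bytes is left to the application.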
Columns and column families
Let's look at a sample slice of a table that stores web pages (this example is from Google's paper on BigTable). The row key is the page URL, for example, "com.cnn.www". Various attributes of the page are stored in column families. A contents column family contains page contents (there are no columns within this column family). A language column family contains the language identifier for the page. Finally, an anchor column family contains the text of various anchors from other web pages. The column name is the URL of the page making the reference.
These three column families underscore a few points. A column may be a single short value, as seen in the language column family. This is our classic database view of columns. In BigTable, however, there is no type associated with the column. It is just a bunch of bytes. The data in a column family may also be large, as in the contents column family. The anchor column family illustrates the extra hierarchy created by having columns within a column family. It also illustrates the fact that columns can be created dynamically (one for each external anchor), unlike column families. Finally, it illustrates the sparse aspect of BigTable. In this example, the list of columns within the anchor column family will likely vary tremendously for each URL. In all, we may have a huge number (e.g., hundreds of thousands or millions) of columns but the column family for each row will have only a tiny fraction of them populated. While the number of column families will typically be small in a table (at most hundreds), the number of columns is unlimited.
Rows and partitioning
A table is logically split among rows into multiple subtables called tablets. A tablet is a set of consecutive rows of a table and is the unit of distribution and load balancing within BigTable. Because the table is always sorted by row, reads of short ranges of rows are efficient: one typically communicates with a small number of machines. Hence, a key to ensuring a high degree of locality is to select row keys properly (as in the earlier example of using domain names in reverse order).
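To illustrate why key choice matters for locality, here is a small sketch. The helper names `reverse_domain` and `prefix_scan` and the host list are assumptions made for illustration, not BigTable functions. It reverses host names into row keys and then reads a short range of a sorted key list, which is the access pattern that stays within a few tablets.

```python
from bisect import bisect_left


def reverse_domain(host):
    """Turn 'cs.rutgers.edu' into the row key 'edu.rutgers.cs'."""
    return ".".join(reversed(host.split(".")))


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous run of keys in a sorted list that start with `prefix`."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


hosts = ["www.rutgers.edu", "www.cnn.com", "cs.rutgers.edu", "nb.rutgers.edu"]
keys = sorted(reverse_domain(h) for h in hosts)
rutgers = prefix_scan(keys, "edu.rutgers.")
# rutgers == ['edu.rutgers.cs', 'edu.rutgers.nb', 'edu.rutgers.www']
```

Because the rutgers.edu keys are adjacent after sorting, the scan touches one contiguous range rather than keys scattered across the table.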
Timestamps
Each column family cell can contain multiple versions of content. For example, in the earlier example, we may have several timestamped versions of page contents associated with a URL. Each version is identified by a 64-bit timestamp that either represents real time or is a value assigned by the client. Reading column data retrieves the most recent version if no timestamp is specified or the latest version that is earlier than or equal to a specified timestamp.
A table is configured with per-column-family settings for garbage collection of old versions. A column family can be defined to keep only the latest n versions or to keep only the versions written since some time t.
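The two garbage-collection policies can be sketched as simple filters over a cell's version list. The function names are hypothetical, and treating "since time t" as "timestamp greater than or equal to t" is an assumption of this sketch.

```python
def keep_latest_n(versions, n):
    """Keep the n newest (timestamp, value) pairs, newest first."""
    return sorted(versions, key=lambda tv: tv[0], reverse=True)[:n]


def keep_since(versions, t):
    """Keep only versions with timestamp >= t (assumed boundary), newest first."""
    recent = [tv for tv in versions if tv[0] >= t]
    return sorted(recent, key=lambda tv: tv[0], reverse=True)


versions = [(10, b"v1"), (30, b"v3"), (20, b"v2")]
latest_two = keep_latest_n(versions, 2)   # versions with timestamps 30 and 20
since_25 = keep_since(versions, 25)       # only the version with timestamp 30
```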
Implementation
BigTable comprises a client library (linked with the user's code), a master server that coordinates activity, and many tablet servers. Tablet servers can be added or removed dynamically.
The master assigns tablets to tablet servers and balances tablet server load. It is also responsible for garbage collection of files in GFS and managing schema changes (table and column family creation).
Each tablet server manages a set of tablets (typically 10-1,000 tablets per server). It handles read/write requests to the tablets it manages and splits tablets when a tablet gets too large. Client data does not move through the master; clients communicate directly with tablet servers for reads/writes. The internal file format for storing data is Google's SSTable, which is a persistent, ordered, immutable map from keys to values.
BigTable uses the Google File System (GFS) for storing both data files and logs. A cluster management system contains software for scheduling jobs, monitoring health, and dealing with failures.
Chubby
Chubby is a highly available and persistent distributed lock service that manages leases for resources and stores configuration information. The service runs with five active replicas, one of which is elected as the master to serve requests. A majority must be running for the service to work. Paxos is used to keep the replicas consistent. Chubby provides a namespace of files and directories. Each file or directory can be used as a lock.
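The majority requirement is simple arithmetic. As a minimal sketch (the function names are made up for illustration), a five-replica service needs three replicas running and so tolerates two failures:

```python
def majority(total_replicas):
    """Smallest number of replicas that forms a majority."""
    return total_replicas // 2 + 1


def service_available(total_replicas, running):
    """True if enough replicas are running to form a majority."""
    return running >= majority(total_replicas)


needed = majority(5)                      # 3
still_up = service_available(5, 3)        # True: two replicas may fail
down = service_available(5, 2)            # False: no majority
```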
In BigTable, Chubby is used to:
- ensure there is only one active master
- store the bootstrap location of BigTable data
- discover tablet servers
- store BigTable schema information
- store access control lists
Startup and growth
A table starts off with just one tablet. As the table grows, it is split into multiple tablets. By default, a tablet is split when it reaches around 100 to 200 MB.
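A rough sketch of size-based splitting follows. The function name and the choice to cut near the midpoint by size are assumptions of this sketch; the text above gives only the approximate size threshold. Splitting happens on a row boundary, so a single row is never divided.

```python
def split_tablet(rows, max_bytes):
    """Split a tablet into two if it exceeds max_bytes.

    rows: list of (row_key, size_in_bytes), sorted by row_key.
    Returns a list containing one or two tablets (lists of rows).
    """
    total = sum(size for _, size in rows)
    if total <= max_bytes or len(rows) < 2:
        return [rows]
    running = 0
    for i, (_, size) in enumerate(rows):
        running += size
        if running >= total / 2:
            break
    # Cut after row i, but keep at least one row on each side.
    cut = min(i + 1, len(rows) - 1)
    return [rows[:cut], rows[cut:]]


MB = 1024 * 1024
rows = [("a", 60 * MB), ("b", 60 * MB), ("c", 60 * MB), ("d", 60 * MB)]
tablets = split_tablet(rows, max_bytes=200 * MB)  # two tablets: rows a-b and rows c-d
```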
Locating rows within a BigTable is managed in a three-level hierarchy. A file in Chubby stores the location of the root (top-level) tablet. The root tablet stores the location of all tablets of a special Metadata table. Each Metadata tablet contains the location of a set of user data tablets. The Metadata table is keyed by an encoding of each tablet's table ID and end row. For efficiency, the client library caches tablet locations.
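The last level of this lookup can be sketched as a search over sorted Metadata entries, assuming each entry is keyed by (table ID, end row) and maps to a tablet server. The entries, server names, and the sentinel end row below are invented for illustration; the real encoding is not described here.

```python
from bisect import bisect_left

# Hypothetical Metadata rows, sorted by (table ID, end row).
METADATA = [
    (("webtable", "com.cnn.www"), "tabletserver-1"),
    (("webtable", "edu.rutgers.nb"), "tabletserver-2"),
    (("webtable", "\uffff"), "tabletserver-3"),  # sentinel end row for the last tablet
]


def locate(table_id, row_key, metadata=METADATA):
    """Return the server for the first tablet whose end row is >= row_key."""
    keys = [key for key, _ in metadata]
    i = bisect_left(keys, (table_id, row_key))
    if i == len(keys) or keys[i][0] != table_id:
        raise KeyError(f"no tablet found for {table_id!r}, {row_key!r}")
    return metadata[i][1]


server = locate("webtable", "edu.rutgers.cs")  # 'tabletserver-2'
```

Keying by end row means a single ordered search finds the tablet that covers a row, without storing each tablet's start row separately.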
A tablet is assigned to one tablet server at a time. Chubby keeps track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a Chubby servers directory. The master monitors this directory to discover new tablet servers. When the master starts, it:
- grabs a unique master lock in Chubby (to prevent multiple masters from starting)
- scans the servers directory in Chubby to find live tablet servers
- communicates with each tablet server to discover what tablets are assigned to each server
- scans the Metadata table to learn the full set of tablets
- builds a set of unassigned tablets, which are eligible for tablet assignment
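The final steps of this startup sequence amount to a set difference: every tablet listed in the Metadata table, minus the tablets that live servers report holding, is a tablet that still needs a server. A minimal sketch, with a hypothetical function name and made-up tablet and server names:

```python
def find_unassigned_tablets(metadata_tablets, live_servers, tablets_on_server):
    """Return tablets listed in Metadata that no live server reports holding.

    metadata_tablets: iterable of tablet IDs from the Metadata scan.
    live_servers: iterable of server names found in the Chubby servers directory.
    tablets_on_server: dict mapping server name -> list of tablet IDs it reported.
    """
    assigned = set()
    for server in live_servers:
        assigned |= set(tablets_on_server.get(server, ()))
    return sorted(set(metadata_tablets) - assigned)


unassigned = find_unassigned_tablets(
    metadata_tablets=["t1", "t2", "t3", "t4"],
    live_servers=["ts-a", "ts-b"],
    tablets_on_server={"ts-a": ["t1"], "ts-b": ["t3"], "ts-dead": ["t2"]},
)
# unassigned == ['t2', 't4']; t2 was held by a server that is no longer live
```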
Replication
A BigTable can be configured for replication to multiple BigTable clusters in different data centers to ensure availability. Data propagation is asynchronous and results in an eventually consistent model.
References
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, Google, Inc., OSDI 2006
- The definitive paper on BigTable
- Robin Harris, Google's Bigtable Distributed Storage System, StorageMojo.com
- Understanding HBase and BigTable, Jimbojw.com