File System Design Case Studies: File Systems for Flash Memory
April 24, 2015
Introduction
In the previous sections, we covered conventional file system designs. Here, we consider the special needs of flash memory.
File systems for flash memory
Flash memory is non-volatile memory, meaning that its contents are preserved even when the chip is not powered. Its contents can also be modified. There are two technologies for flash memory: NOR flash and NAND flash.
NOR flash is fully addressable: each byte can be accessed individually and it can interface to a system’s memory bus. It is generally used to hold system firmware such as a PC’s BIOS or EFI firmware or the entire software of an embedded system. It is considerably slower than regular dynamic memory (RAM) and computers often copy the contents of NOR flash into RAM when they boot to allow faster code execution. It also has extremely long write times and requires rewrites to be done on an entire block at a time.
The second type of flash memory, NAND flash, is much cheaper and denser than NOR flash. Unlike NOR, which can be read randomly, NAND flash requires content to be read a chunk (called a page) at a time. Modifications are erase-write cycles and must also be done a page at a time. The size of a page depends on the vendor; pages are typically 512, 2K, or 4K bytes long. Because of this property, NAND flash is treated as a block device and mimics a magnetic disk. Data can only be read or written a page at a time (actually, a block at a time, where the controller defines a block as one or more pages). We will only focus on NAND flash in this discussion.
Flash memory suffers from limited erase-write cycles. It eventually wears out. Typically, today’s NAND flash can handle from 100,000 to 1,000,000 erase-write cycles before a block wears out. At that point, software must map it as a damaged block and “retire” it from the pool of allocatable blocks. A key goal with flash memory is to employ wear leveling: use blocks in such a way that writes are distributed among all (or most) of the blocks in the flash memory.
As an aside, there are two types of NAND flash memory available. Single-level cell (SLC) flash stores one bit of data in each cell (either programmed or erased; 1 or 0). Multi-level cell (MLC) flash is a design that stores more than one bit per cell. MLC–2 flash has four states per cell, MLC–3 flash has eight states per cell, and so on. The value a cell stores depends on the level of charge applied to it. This memory is less expensive than SLC flash and is popular in most consumer-grade solid state storage devices. Since each cell can be in one of several states, MLC may have a higher bit error rate than SLC flash, which only holds two states. The more bits a cell stores, the fewer write cycles it will have. A 2-bit MLC cell is typically good for about 3,000 to 10,000 write operations. A 3-bit cell can sustain only 300 to 3,000 write cycles. In contrast, SLC flash can support around 100,000 write cycles. However, a 2016 Google study showed that SLC-based solid state drives (SSDs) are not necessarily more reliable than the lower-cost MLC-based ones.
There are two forms of wear leveling. Dynamic wear leveling monitors high-use and low-use areas of flash memory. At a certain threshold, it will swap high-use blocks with low-use blocks. In doing this kind of swapping, the controller needs to maintain a mapping of logical block addresses to actual flash memory block addresses. Whenever a block of data is written to flash memory, instead of overwriting the requested block (and wearing it out more), the data is written to a new block and the block map is updated. However, blocks that do not get modified will never get reused. Their content will remain as is. With static wear leveling, we do the same actions but also periodically move static data (blocks that have not been modified) to blocks that have been more heavily used to ensure that all blocks get a chance to be rewritten and hence wear out evenly.
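To make the bookkeeping concrete, here is a minimal sketch in C of the kind of logical-to-physical block map described above. The array sizes, the names, and the simple "write to the least-worn free block" policy are illustrative assumptions for this sketch, not the algorithm of any particular controller.

```c
#include <stdint.h>
#include <string.h>

#define NBLOCKS    1024u        /* number of physical blocks (assumed)      */
#define BLOCK_SIZE 2048u        /* bytes per block (assumed)                */
#define UNMAPPED   0xFFFFFFFFu

static uint32_t l2p[NBLOCKS];          /* logical block -> physical block   */
static uint32_t erase_count[NBLOCKS];  /* wear counter per physical block   */
static uint8_t  is_free[NBLOCKS];      /* 1 if the physical block is erased */

/* Stand-ins for the chip's low-level operations. */
static uint8_t flash[NBLOCKS][BLOCK_SIZE];
static void nand_program(uint32_t p, const uint8_t *d) { memcpy(flash[p], d, BLOCK_SIZE); }
static void nand_erase(uint32_t p)                     { memset(flash[p], 0xFF, BLOCK_SIZE); }

void wl_init(void)
{
    for (uint32_t p = 0; p < NBLOCKS; p++) {
        l2p[p] = UNMAPPED;
        is_free[p] = 1;
    }
}

/* Pick the free physical block with the fewest erase cycles. */
static uint32_t least_worn_free_block(void)
{
    uint32_t best = UNMAPPED;
    for (uint32_t p = 0; p < NBLOCKS; p++)
        if (is_free[p] && (best == UNMAPPED || erase_count[p] < erase_count[best]))
            best = p;
    return best;
}

/* A logical write never overwrites in place: the data goes to a fresh,
 * lightly worn block, the map is updated, and the old block is recycled. */
int write_logical_block(uint32_t logical, const uint8_t *data)
{
    uint32_t newp = least_worn_free_block();
    if (newp == UNMAPPED)
        return -1;                  /* out of free blocks; need to reclaim some */

    nand_program(newp, data);
    is_free[newp] = 0;

    uint32_t oldp = l2p[logical];
    l2p[logical] = newp;

    if (oldp != UNMAPPED) {         /* the superseded block can be erased and reused */
        nand_erase(oldp);
        erase_count[oldp]++;
        is_free[oldp] = 1;
    }
    return 0;
}
```

Static wear leveling would add a periodic pass on top of this: move data that has not changed in a long time out of lightly worn blocks and into heavily worn ones, so that the lightly worn blocks return to the pool of blocks being rewritten.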
Conventional file systems are not well-suited to flash memory for a few reasons:
- They tend to modify the same blocks over and over. Structures like the superblock get updated constantly and blocks that are freed from deleted files tend to get reallocated quickly. This is at odds with the goal of wear leveling, which is to avoid reusing blocks.
- File systems are optimized to reduce disk seeking. Starting with FFS, we created block (cylinder) groups to hold file data close to the inodes that define the files that contain that data. We strive to allocate contiguous groups of blocks whenever possible. None of this helps performance on NAND flash. There is no concept of seek time or a current disk head position. Reading one block is as fast as reading any other block.
- Disk scheduling algorithms (elevator algorithms) are also pointless, as they try to resequence block requests to minimize the overhead of seek delays.
Some on-chip NAND controllers implement wear leveling and bad block mapping. Most external controllers, such as those on Secure Digital (SD, SDHC, SDXC) cards, implement some form of wear leveling, usually both static and dynamic. These controllers create and manage a block lookup table, which maps requested block numbers (logical block numbers) to their actual on-chip blocks (physical block numbers). They also asynchronously erase unused blocks (e.g., those whose contents were moved) to make future writes to them quicker. With this controller logic, a conventional file system does not have to be concerned with the limited erase cycles of the memory.
To save costs, many cell phones, tablet computers, and other embedded devices do not use controllers that perform wear leveling and need other solutions. One solution is to use a layer of software between the flash hardware and the block device called a flash translation layer (FTL). The FTL performs the wear leveling task in much the same way that controllers that support wear leveling do. This is the approach taken by iOS, for example. Another solution is to use a file system that works well with the characteristics of flash memory. This is the approach taken by Android.
Log-structured file systems
Two common file systems that are designed with wear-leveling in mind are YAFFS (Yet Another Flash File System) and JFFS (Journaling Flash File System). Both of these belong to a category of file systems known as log-structured file systems, which were pioneered by Mendel Rosenblum and John Ousterhout at the University of California at Berkeley in the early 1990s. Don’t be confused by JFFS’s use of the word journaling in its name; it is not the same as the file system journaling we discussed earlier, which was used to ensure the integrity of the file system structure.
YAFFS and JFFS are currently both in version 2 of their designs and are often referred to as YAFFS2 and JFFS2. We will dispense with the numeric suffix. YAFFS is typically favored for larger flash devices, those over 64 MB, and was the default file system for early versions of the Android operating system. JFFS is typically favored for smaller devices and is used in many embedded systems. Both of these file systems are roughly similar in principle. Both address only dynamic wear leveling. We will only look at YAFFS in this discussion.
YAFFS
The basic unit of allocation under YAFFS is called a chunk. Several (anywhere from 32 to 128 or more) chunks make up one block. A block is the unit of erasure for YAFFS.
YAFFS maintains a log structure. This means that all updates to the file system are written sequentially. Instead of overwriting different data structures (inodes, data blocks, etc.) at various locations, the file system is simply a linear list of logs — operations that take place on the file system. There is no underlying storage structure that resembles a hash of directory entries or an inode table.
Each log entry is one chunk in size and is either a data chunk or an object header. A data chunk represents a file’s data. An object header describes a directory, a file, a link, or other object. Sequence numbers are used to organize this log chronologically. Each chunk (log entry) includes the following fields (a sketch of how these tags might be represented follows the list):
- Object ID: identifies the object to which the chunk belongs. Every file, for example, will have a unique object ID and any chunk with a matching object ID will belong to that file. An object header for the file will contain information about the file: permissions, file length, and other attributes that would typically be found in an inode.
- Chunk ID: identifies where the chunk belongs in the object. A file may be made up of zero or more data chunks. The chunk ID identifies the sequencing of these chunks so that the file can be reconstructed. If multiple chunks with the same chunk ID are present, the newest one wins over any older ones.
- Byte count: the number of bytes of valid data in the chunk.
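As a rough sketch, the tags stored with each chunk might be represented by a structure like the one below. The field names and sizes are assumptions for this sketch, not YAFFS’s actual on-flash format; the sequence number mentioned above is included because it is what orders the log chronologically.

```c
#include <stdint.h>

/* Illustrative per-chunk tags for a YAFFS-style log entry.  The names and
 * sizes are assumptions for this sketch, not the real on-flash layout. */
struct chunk_tags {
    uint32_t object_id;   /* object (file, directory, link, ...) this chunk belongs to */
    uint32_t chunk_id;    /* position within the object; 0 marks an object header,
                             as in the tables that follow                              */
    uint32_t byte_count;  /* bytes of valid data held in this chunk                    */
    uint32_t seq_number;  /* sequence number: higher means written later, so the
                             newest copy of a given (object_id, chunk_id) wins         */
};
```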
When we first create a file, we might see something like this:
Block | Chunk | ObjectID | ChunkID | Status | Content |
---|---|---|---|---|---|
0 | 0 | 500 | 0 | Live | Object header for file (length=0) |
0 | 1 | | | | |
0 | 2 | | | | |
0 | 3 | | | | |
At this time, we have an empty file with no data. Now our process will write some data to it. All write operations will be written out to the log as chunks of data with object IDs that match that of the object header.
Block | Chunk | ObjectID | ChunkID | Status | Content |
---|---|---|---|---|---|
0 | 0 | 500 | 0 | Live | Object header for file (length=0) |
0 | 1 | 500 | 1 | Live | First chunk of data |
0 | 2 | 500 | 2 | Live | Second chunk of data |
0 | 3 | 500 | 3 | Live | Third chunk of data |
Note that the object header still refers to a zero-length file. When our process closes the file, the YAFFS driver will write the new object header, obsoleting the old one. In this example, we’ve run out of chunks in our first block and continue writing chunks to the second block.
Block | Chunk | ObjectID | ChunkID | Status | Content |
---|---|---|---|---|---|
0 | 0 | 500 | 0 | Deleted | Object header for file (length=0) |
0 | 1 | 500 | 1 | Live | First chunk of data |
0 | 2 | 500 | 2 | Live | Second chunk of data |
0 | 3 | 500 | 3 | Live | Third chunk of data |
1 | 0 | 500 | 0 | Live | Object header for file (length=n) |
1 | 1 | | | | |
1 | 2 | | | | |
1 | 3 | | | | |
If we now reopen the file, modify the first chunk of data in the file, and close the file again, the new contents of the first chunk of data will be written to the log (obsoleting the previous data for object 500/chunk 1) followed by a new object header that will contain a new modification time and any other file attributes that may have changed. This obsoletes the previous two object headers for Object 500.
Block | Chunk | ObjectID | ChunkID | Status | Content |
---|---|---|---|---|---|
0 | 0 | 500 | 0 | Deleted | Object header for file (length=0) |
0 | 1 | 500 | 1 | Deleted | First chunk of data |
0 | 2 | 500 | 2 | Live | Second chunk of data |
0 | 3 | 500 | 3 | Live | Third chunk of data |
1 | 0 | 500 | 0 | Deleted | Object header for file (length=n) |
1 | 1 | 500 | 1 | Live | New first chunk of data |
1 | 2 | 500 | 0 | Live | Object header for file (length=n) |
1 | 3 | | | | |
Garbage collection
As modifications to existing files take place, the affected data and header chunks become obsolete because updated entries are written to the log. These chunks are garbage; they’re useless. If all the chunks in a block are obsoleted (deleted), then that entire block can be erased and reused. Beyond that, garbage collection will be performed periodically to create free blocks by copying live chunks out of blocks that have just a few live chunks and mostly deleted ones. This is known in YAFFS as passive garbage collection. If the file system is desperate for free space, then aggressive garbage collection takes place, which will consolidate chunks even from blocks that have few deleted chunks.
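Here is a minimal sketch, in C, of the block-selection side of this. The thresholds, the live-chunk bitmap, and the helper functions are made up for illustration; YAFFS’s real heuristics are more involved.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CHUNKS_PER_BLOCK 64          /* assumed; real devices range from 32 to 128+ */

struct block_info {
    uint64_t live_map;               /* bit c set => chunk c is still referenced */
    bool     in_use;                 /* block currently holds data               */
};

/* Stand-ins for the real operations: appending a rescued chunk to the head
 * of the log and erasing a whole block. */
static void copy_chunk_to_log_head(int block, int chunk)
{
    printf("copy live chunk %d of block %d to the log head\n", chunk, block);
}
static void erase_block(int block)
{
    printf("erase block %d\n", block);
}

static int live_chunks(const struct block_info *b)
{
    int n = 0;
    for (int c = 0; c < CHUNKS_PER_BLOCK; c++)
        n += (b->live_map >> c) & 1;
    return n;
}

/* Collect one block that is "mostly garbage".  Passive collection only picks
 * nearly-empty blocks; aggressive collection (when free space is scarce)
 * accepts blocks that still hold many live chunks. */
bool collect_one_block(struct block_info *blocks, int nblocks, bool aggressive)
{
    int threshold = aggressive ? CHUNKS_PER_BLOCK / 2
                               : CHUNKS_PER_BLOCK / 8;

    for (int b = 0; b < nblocks; b++) {
        if (!blocks[b].in_use || live_chunks(&blocks[b]) > threshold)
            continue;

        for (int c = 0; c < CHUNKS_PER_BLOCK; c++)    /* rescue what is still live */
            if ((blocks[b].live_map >> c) & 1)
                copy_chunk_to_log_head(b, c);

        erase_block(b);                               /* whole block becomes reusable */
        blocks[b].in_use   = false;
        blocks[b].live_map = 0;
        return true;
    }
    return false;                                     /* nothing worth collecting */
}
```

The passive/aggressive distinction captures a trade-off: collecting a nearly empty block costs only a few copies, while collecting a half-full block is expensive and is only worth doing when free space is running out.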
In-memory reconstruction
One thing that should be clear by now is that the log-based structure of YAFFS is not an efficient one for finding either files or their data. One has to search linearly through the logs, finding the items of interest and discarding any chunks that have been obsoleted by more recent updates.
To avoid doing this over and over again, YAFFS scans the event log and constructs the file system state in memory (RAM). It creates an in-memory object state for each object and a file tree (directory) structure to locate files and directories. File data still resides in flash memory, but by having all file and directory information in RAM, we can quickly find out which blocks to read to get that data without having to scan the log.
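In rough terms, the scan amounts to replaying every chunk’s tags and keeping, for each object and chunk position, only the newest copy. The sketch below uses hypothetical structure names and fixed-size arrays standing in for YAFFS’s real in-memory structures, and it omits building the directory tree from the surviving object headers.

```c
#include <stdint.h>

#define MAX_OBJECTS        128   /* sketch limits; a real driver grows its tables */
#define MAX_CHUNKS_PER_OBJ 256
#define NOT_FOUND          (-1)

struct scanned_tags {            /* what the scan reads back for one chunk   */
    uint32_t object_id;
    uint32_t chunk_id;           /* 0 = object header, >0 = data chunk index */
    uint32_t seq;                /* higher = written later                   */
    int32_t  flash_location;     /* where on flash this chunk lives          */
};

struct object_state {            /* in-RAM state rebuilt for one object      */
    uint32_t object_id;
    int32_t  header_location;    /* newest object header seen so far         */
    uint32_t header_seq;
    int32_t  data_location[MAX_CHUNKS_PER_OBJ];  /* newest copy of each data chunk */
    uint32_t data_seq[MAX_CHUNKS_PER_OBJ];
};

static struct object_state objects[MAX_OBJECTS];
static int n_objects;

static struct object_state *find_or_add_object(uint32_t object_id)
{
    for (int i = 0; i < n_objects; i++)
        if (objects[i].object_id == object_id)
            return &objects[i];

    if (n_objects >= MAX_OBJECTS)
        return 0;                                 /* table full in this sketch */

    struct object_state *o = &objects[n_objects++];
    o->object_id = object_id;
    o->header_location = NOT_FOUND;
    for (int c = 0; c < MAX_CHUNKS_PER_OBJ; c++)
        o->data_location[c] = NOT_FOUND;
    return o;
}

/* Called once per chunk while scanning the log: "newest wins". */
void replay_chunk(const struct scanned_tags *t)
{
    struct object_state *o = find_or_add_object(t->object_id);
    if (o == 0)
        return;

    if (t->chunk_id == 0) {                        /* object header */
        if (o->header_location == NOT_FOUND || t->seq > o->header_seq) {
            o->header_location = t->flash_location;
            o->header_seq = t->seq;
        }
    } else if (t->chunk_id < MAX_CHUNKS_PER_OBJ) { /* data chunk */
        uint32_t c = t->chunk_id;
        if (o->data_location[c] == NOT_FOUND || t->seq > o->data_seq[c]) {
            o->data_location[c] = t->flash_location;
            o->data_seq[c] = t->seq;
        }
    }
}
```

After the whole log has been replayed, the entries that survive in these tables correspond to the chunks marked Live in the example tables above, and the newest object headers supply the metadata needed to build the directory structure.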
YAFFS saves these reconstructed data structures in flash memory when the file system is unmounted so that the scan need not take place when it is mounted the next time. This is known as checkpointing. With checkpointing, the logs will have to be scanned only if the file system has not been unmounted cleanly. In that case, activity since the last checkpoint will need to be added to the hierarchy that was recovered from the last checkpoint.
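As a sketch of the mount-time decision just described, the logic might look like the following. The function names are hypothetical stand-ins, not the YAFFS API, and the bodies are placeholders.

```c
#include <stdbool.h>
#include <stdio.h>

static bool load_checkpoint(void)             /* restore the saved in-RAM state   */
{
    printf("reading checkpoint from flash\n");
    return true;                              /* pretend a valid checkpoint exists */
}

static void scan_whole_log(void)              /* slow path: replay every chunk    */
{
    printf("scanning the entire log\n");
}

static void replay_log_since_checkpoint(void) /* add activity newer than the snapshot */
{
    printf("replaying chunks written after the checkpoint\n");
}

void mount_flash_fs(bool cleanly_unmounted)
{
    if (!load_checkpoint()) {                 /* no usable checkpoint at all      */
        scan_whole_log();
        return;
    }
    if (!cleanly_unmounted)                   /* checkpoint is only a snapshot:   */
        replay_log_since_checkpoint();        /* add what was logged after it     */
    /* after a clean unmount the checkpoint alone describes the current state */
}
```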
References
The File System, Linux Documentation Project
Tuning the Linux file system Ext3, Oliver Diedrich, The H, Heise Media UK Ltd., October 27, 2008.
The Linux Virtual File System, The Linux Kernel, Andries Brouwer
How NTFS Works, Microsoft TechNet
The NTFS File System, Microsoft TechNet
What Is NTFS?, Microsoft TechNet
The Design and Implementation of a Log-Structured File System, Mendel Rosenblum and John K. Ousterhout, University of California at Berkeley, ACM Transactions on Computer Systems, Volume 10, No. 1, February 1992, Pages 26–52.
YAFFS, Project page.
How YAFFS Works, Charles Manning, 2007–2010.
JFFS Home Page, Axis Communications.
JFFS: The Journalling Flash File System, David Woodhouse, Red Hat, Inc.
XFS: the big storage file system for Linux, Christoph Hellwig, ;login:, Vol. 34, No. 5, October 2009.
Low-level iOS forensics, Sogeti ESEC Lab, June 28, 2012.
Multi-level cell, TechTarget.
This document is updated from its original version of March 23, 2012.