BlobFS

BlobFS is a content-addressable filesystem optimized for write-once, read-often files, such as binaries and libraries. On Fuchsia, BlobFS is the storage system used for all software packages.

When mounted, BlobFS presents a single logical directory containing all files (a.k.a., blobs):

blob/
 ├── 00aeb9b5652a4adbf630d04a6ca22668f9c8469746f3f175687b3c0ff6699a49
 ├── 01289d3e1d2cdbc7d1b4977210877c5bbdffdbad463d992badc149152962a205
 ├── 018951bcf92091fd5d294cbd1f3a48d6ca59be7759587f28077b2eb754b437c0
 └── 01bad8536a7aee498ffd323f53e06232b8a81edd507ac2a95bd0e819c4983138

Files in BlobFS are:

Immutable: Once created, a blob cannot be modified (except removal).
Content-Addressable: Blob names are deterministically derived from their contents.
Verified: Cryptographic checksums are used to ensure integrity of blob data.

These properties of blobs make BlobfS a key component of Fuchsia's security posture, ensuring that software packages' contents can be verified before they are executed.

Design and implementation of BlobFS

On-disk format

BlobFS stores each blob in a linked list of non-adjacent extents (a contiguous range of data blocks). Each blob has an associated Inode, which describes where the block's data starts on disk and other metadata about the blob.

BlobFS divides a disk (or a partition thereof) into five chunks:

The Superblock storing filesystem-wide metadata,
The Block Map, a bitmap used to keep track of free and allocated data blocks,
The Node Map, a flat array of Inodes (reference to where a blob's data starts on disk) or ExtentContainers (reference to several extents containing some of a blob's data).
The Journal, a log of filesystem operations that ensures filesystem integrity, even if the device reboots or loses power during an operation, and
The Data Blocks, where blob contents and their verification metadata are stored in a series of extents.

BlobFS disk layout

Figure 1: BlobFS disk layout

Superblock

The superblock is the first block in a BlobFS-formatted partition. It describes the location and size of the other chunks of the filesystem, as well as other filesystem-level metadata.

When a BlobFS-formatted filesystem is mounted, this block is mapped into memory and parsed to determine where the rest of the filesystem lives. The block is modified whenever a new blob is created, and (for FVM-managed BlobFS instances) whenever the size of the BlobFS filesystem shrinks or grows.

BlobFS superblock

Figure 2: BlobFS superblock

When BlobFS is managed by FVM, the superblock contains some additional metadata describing the FVM slices that contain the BlobFS filesystem. These fields (yellow in the above diagram) are ignored for non-FVM, fixed-size BlobFS images.

Block map

The block map is a simple bit-map that marks each data block as allocated or not. This map is used during block allocation to find contiguous ranges of blocks, known as extents, to store blob contents in.

Example block map

Figure 3: An example block-map with several free extents of varying size.

When a BlobFS image is mounted, the block map is mapped into memory where it can be read by the block allocator. The block map is written back to disk whenever a block is allocated (during blob creation) or deallocated (during blob deletion).

Node map

The node map is an array of all nodes on the filesystem, which can come in two variations:

Inodes, which describe a single blob on the filesystem, or
ExtentContainers, which point to an extent containing part of a blob's data.

Nodes of both types are stored together in a single flat array. Each node has a common header that describes what type the node is, and whether the node is allocated. Both node types are the same size, so there is no internal fragmentation of the array.

Inodes

Each blob in the filesystem has a corresponding Inode, which describes where the blob's data starts and some other metadata about the blob.

Layout of a BlobFS Inode

Figure 4: Layout of a BlobFS Inode.

For small blobs, the Inode may be the only node necessary to describe where the blob is on disk. In this case extent_count is one, next_node must not be used, and inline_extent describes the blob's single extent.

Larger blobs will likely occupy multiple extents, especially on a fragmented BlobFS image. In this case, the first extent of the blob is stored in inline_extent, and all subsequent extents are stored in a linked list of ExtentContainers starting at next_node.

Format of an Extent

Figure 5: Format of an Extent (occupying 64 bits). This format is used both in Inodes and ExtentContainers.

Note that this representation of extents implies that an extent can have at most 2**16 blocks in it (the maximum value of Extent Size).

ExtentContainers

An ExtentContainer holds references to several (up to 6) extents, which store some of the contents of a blob.

The extents in an ExtentContainer are logically contiguous (i.e. the logical addressable chunk of the blob stored in extents[0] is before extents[1]) and are filled in order. If next_node is set, then the ExtentContainer must be full.

Layout of a BlobFS ExtentContainer

Figure 6: Layout of a BlobFS ExtentContainer.

Properties of the node linked-list

A blob's extents are held in a linked-list of a single Inode (which holds the first extent) and zero or more ExtentContainers (each of which holds up to 6 extents).

This linked list has the following properties. Violating any of these properties results in blobfs treating the blob as corrupted.

Extents are logically contiguous:
- If Node A precedes Node B in the list, then all extents in Node A have lower logical offsets into the blob's contents.
- Within a given ExtentContainer, for extents 𝑥 and 𝑦, if 𝑥 < 𝑦, then extent 𝑥 has a lower logical offset into the blob's contents than extent 𝑦.
Nodes are packed before a new node is linked. That is, if a Node has a non-null next_node, then it must be full of extents (*extent for Inodes and 6 extents for ExtentContainers).
The total number of extents in the linked-list must equal to the Inode's extent_count.
The sum of the size of all extents in the linked-list must equal to the Inode's block_count.
The end of the list is determined based on the extent_count in the Inode being satisfied. next_node in the final node should not be used.

Example Node layouts

This section contains some examples of different ways a blob's Nodes may be formatted.

Example: Single-extent blob
Example: Multiple-extent blob

Example: Single-extent blob

Figure 7: Node layout for a blob stored in a single extent

Example: Multiple-extent blob

Figure 8: Node layout for a blob stored in several extents. Note that a blob's extents may be scattered throughout the disk.

Blob fragmentation

A newly created BlobFS image has all of its data blocks free. Extents of arbitrary size can easily be found, and blobs tend to be stored in a single large extent (or a few large extents).

Over time, as blobs are allocated and deallocated, the block map will become fragmented into many smaller extents. Newly created blobs will have to be stored in multiple smaller extents.

A fragmented block map

Figure 9: A fragmented block map. While there are plenty of free blocks, there are few large extents available.

Fragmentation is undesirable for several reasons:

Slower Reads: Reading a fragmented blob requires chasing pointers in the Node Map. This affects both sequential reads and random-access reads
Slower Creation and Deletion: Creating a blob requires finding free extents for it; this takes longer if many small extents must be found. Similarly, deleting a fragmented blob requires chasing down and freeing many extents.
Metadata Overhead: Storing fragmented blobs requires more nodes. There are a finite number of nodes in the Node Map, which can be exhausted, preventing blobs from being created.

Currently BlobFS does not perform defragmentation.

Journal

TODO

Data blocks

Finally, the actual contents of the blobs must be stored somewhere. The remaining storage blocks in the BlobFS image are designated for this purpose.

Each blob is allocated enough extents to contain all of its data, as well as a number of data blocks reserved for storing verification metadata of the blob. This metadata is always stored in the first blocks of the blob. Metadata is padded so that the actual data always starts at a block-aligned address.

This verification metadata is called a Merkle Tree, a data structure that uses cryptographic hashes to guarantee the integrity of the blob's contents.

Merkle tree

A blob's Merkle Tree is constructed as follows (for more details, see Fuchsia Merkle Roots):

Each leaf node is a sha256 hash of a single block's worth of data.
Each non-leaf node is a sha256 hash combining its children's hashes.
The tree terminates at the level where there is a single sha256 hash.

The hash value at the top-most node is known as the Merkle Root of the blob. This value is used as the name of the blob.

A simplified example Merkle Tree

Figure 10: A simplified example Merkle Tree. Note that in practice more information is included in each hash value (such as the block offset and length), and each non-leaf node is significantly wider (in particular, each non-leaf node can contain up to 8192 / 32 == 256 children).

Implementation of BlobFS

Like other Fuchsia filesystems, BlobFS is implemented as a userspace process that serves clients through a FIDL interface.