Random access compression in BlobFS

The BlobFS filesystem in Fuchsia transparently compresses files in order to save disk space. BlobFS supports a number of different compression strategies, with zstd being the default.

One downside of file compression is that it can prevent random access into files. For most compression algorithms, the entire contents must be read from disk and decompressed to access a single byte.

For BlobFS, this is a particular challenge when files are demand paged. Demand paging allows a file to be partially loaded into memory, which saves system memory. Full-file compression prevents a file from being partially loaded.

The chunked-compression format and library in Fuchsia breaks compressed files up into frames that can be independently decompressed. This allows BlobFS to implement demand paging of compressed files, since the file can be partially loaded and decompressed.

This document describes the design of the chunked-compression format and explains its use in Fuchsia.

Design goals and non-goals

The following are the goals that motivate the design of the chunked-compression format and library:

Random access decompression. It must be possible to independently decompress frames of data without needing to decompress the entire file.
Flexible decompression API. The library was designed to give clients control over the seek table, so clients have fine grained control over which frames they decompress. This supports use cases like demand paging, where the client (BlobFS) has more information about access patterns and can control read-ahead and prefetch precisely by decompressing specific frames.

This is in contrast to a more managed API that hides the details of which frames contain which decompressed bytes.
Configurable frame sizes. It must be possible to adjust the sizes of decompressed frames of data, to suit different use cases and different requirements.
Flexible compression layout. The format supports more exotic frame layouts, such as having non-uniform frame sizes or having aligned frame starts, in order to accommodate future use cases that require more flexibility. For example, non-uniform frame sizes could be used to improve data locality by grouping together data that is read together into smaller or larger frames.
Comparable compression ratio to zstd. Since zstd was the current default compression algorithm for BlobFS at the time of this document's writing, zstd's compression ratio is the baseline which the chunked-compression library is benchmarked against. The overhead due to framing and additional metadata must be minimal.
Configurable compression aggressiveness. It must be possible to trade off slower compression speed in favor of better compression ratios.
Cross-platform library. The library used to compress and decompress chunked compression archives must be usable both on Fuchsia and on the compilation host (e.g. Linux), to enable the use of the library in the build toolchain. This is necessary to compress files at build-time, for example when generating a base BlobFS image.

The following are non-goals:

Format-level compatibility with zstd. The chunked compression archive is not intended to be format-compatible with zstd, and regular zstd tooling will not work with chunked compression archives.

Chunked compression

Archive format

A chunked archive consists of a header followed by zero or more zstd frames.

The header describes the format of the archive and contains a seek table that maps the compressed frames to decompressed space.

      0     1     2     3     4     5     6     7
   +-----+-----+-----+-----+-----+-----+-----+-----+
 0 |                 Magic Number                  |
   +-----+-----+-----+-----+-----+-----+-----+-----+
 8 |  Version  |  Reserved |       Num Frames      |  // Reserved must be zero.
   +-----+-----+-----+-----+-----+-----+-----+-----+
16 |    Header CRC32       |        Reserved       |  // Reserved must be zero.
   +-----+-----+-----+-----+-----+-----+-----+-----+
24 |                    Reserved                   |  // Reserved must be zero.
   +-----+-----+-----+-----+-----+-----+-----+-----+
32 |                                               |
40 |                   Seek Table                  |
48 |                     Entry                     |
56 |                                               |
   +-----+-----+-----+-----+-----+-----+-----+-----+
.. |                                               |
.. |                   Seek Table                  |
.. |                     Entry                     |
.. |                                               |
   +-----+-----+-----+-----+-----+-----+-----+-----+

The "Header CRC32" is computed based on the entire header, including the seek table.

Seek table

Each seek table entry describes a contiguous range of data in the compressed space, and where in the decompressed data it expands to.

   +-----+-----+-----+-----+-----+-----+-----+-----+
 0 |               Decompressed Offset             |
   +-----+-----+-----+-----+-----+-----+-----+-----+
 8 |                Decompressed Size              |
   +-----+-----+-----+-----+-----+-----+-----+-----+
16 |                Compressed Offset              |
   +-----+-----+-----+-----+-----+-----+-----+-----+
24 |                 Compressed Size               |
   +-----+-----+-----+-----+-----+-----+-----+-----+

Seek table entries are contiguous in decompressed space, but may be discontiguous in compressed space. This is to support adding alignment and padding to output files to improve storage access efficiency.

A seek table can hold up to 1023 entries (which results in a 32KiB header) and must contain at least 1 entry (which results in a 64 byte header). Typically, compressed frames immediately follow the seek table (but the format supports the compressed frames starting at any offset past the end of the seek table).

Seek table invariants

I0: The first seek table entry must have decompressed offset 0.
I1: The first seek table entry must have compressed offset greater than or equal to the size of the header.
I2: Each entry's decompressed offset must be equal to the end of the previous frame (i.e. to the previous frame's decompressed offset+length).
I3: Each entry's compressed offset must be greater than or equal to the end of the previous frame (i.e. to the previous frame's compressed offset+length).
I4: Each entry must have a non-zero decompressed and compressed length.
I5: No compressed frame may exceed the end of the file.

Compressed frames

Each compressed frame in the file is a regular zstd compressed frame. A given frame will map to some contiguous chunk of bytes in the decompressed file.

Any ranges of bytes in the file not covered by the seek table are ignored.

It is not a requirement that each frame of data is the same decompressed size, but the current implementation of compression splits the input data into equal-sized frames. The size of the frames is configurable during compression.

Random-access decompression

The seek table is used to look up the frames that contain a given decompressed range. When a given range in the decompressed file is requested, one or more compressed frames must be loaded and decompressed.

For example, consider the following file with three frames:

Decompressed Space            Compressed Space
+-----------------+           +--------------+
|                 | <-------- |              |
|                 |           |     +--------|
|                 |           +-----+        |
+-----------------+   +------ |              |
|                 | <-+       +--------------+
|                 |       +-- |              |
|                 |       |   +--------------=
+-----------------+       |
|                 | <-----+
|       +---------|
+-------+

Accessing a byte range within a single frame only requires decompressing its corresponding compressed frame:

Decompressed Space            Compressed Space
+-----------------+           +--------------+
|                 | <-------- |xxxxxxxxxxxxxx|
|  xxxxxxxx       |           |xxxxx+--------|
|                 |           +-----+        |
+-----------------+   +------ |              |
|                 | <-+       +--------------+
|                 |       +-- |              |
|                 |       |   +--------------=
+-----------------+       |
|                 | <-----+
|       +---------|
+-------+

A range that spans several decompressed frames will require decompressing multiple compressed frames:

Decompressed Space            Compressed Space
+-----------------+           +--------------+
|                 | <-------- |xxxxxxxxxxxxxx|
|                 |           |xxxxx+--------|
|              xxx|           +-----+xxxxxxxx|
+-----------------+   +------ |xxxxxxxxxxxxxx|
|xxxx             | <-+       +--------------+
|                 |       +-- |              |
|                 |       |   +--------------=
+-----------------+       |
|                 | <-----+
|       +---------|
+-------+

Use in BlobFS

In BlobFS, files are compressed using the chunked compression library to facilitate random access and demand paging.

Currently, random access is only used when demand paging is enabled. With demand paging disabled, BlobFS will load and decompress the entire file on first access, buffering the file in memory while there are handles to the file.

With demand paging enabled, BlobFS lazily loads in portions of files as they are accessed. BlobFS registers itself as the pager for the VMO using the zx_pager_create syscall. When a non-present page is accessed in the VMO, a page fault occurs, which BlobFS handles.

BlobFS looks up the decompressed frames containing the target page(s), and decompresses each frame. After decompressing each frame, the data is verified, and committed to the pager-backed VMO through the zx_pager_supply_pages syscall.