RFC-0136: Fxfs

RFC-0136: Fxfs
Status	Accepted
Areas	Storage
Description	High level overview of a new Fuchsia filesystem.
Gerrit change	588001
Authors	csuter@google.com jfsulliv@google.com
Reviewers	abarth@google.com brettw@google.com palmer@google.com
Date submitted (year-month-day)	2021-09-30
Date reviewed (year-month-day)	2021-10-13

Edit this RFC

Edit RFC metadata

Summary

Fxfs is a new Fuchsia filesystem that has been under development within the storage team. This proposal outlines the motivation and higher level design decisions for Fxfs.

Motivation

Fuchsia's existing mutable filesystem, Minfs, has some limitations that mean it will be difficult to add the features we want to add going forward and achieve our performance goals. Fxfs is being developed as a candidate for replacing Minfs to achieve these goals.

There are some features that will almost certainly be required, such as:

Full read/write mmap support
Support for extended attributes
Strict quota enforcement
Filesystem notifications (specifically inotify support)
File locking (specifically flock support)

There are other features that we might eventually want:

Built-in encryption — per file encryption
Snapshot support
File clones
Multiple volume support
Compression support

We also have a goal of attaining filesystem performance which is comparable to other operating systems. Of all of the considerations, this is likely to be the most difficult to achieve and where most effort will be spent.

There are basic filesystem goals that all filesystems need to consider:

Low write amplification is desirable to minimize flash wear.
Efficient space usage — e.g. minimize wastage for small files.
Ease of migrations.

Much of the work required to achieve these goals will be covered by work elsewhere in the storage stack and in the kernel and will not be covered here; this RFC focuses on just the filesystem implementation, Fxfs.

While Fxfs is well suited to run on flash storage devices, there is nothing in the design that precludes its use on spinning storage devices, and we expect that Fxfs would run well on spinning storage devices as well.

Stakeholders

Facilitator:

Reviewers: abarth@google.com, brettw@google.com, palmer@google.com

Consulted:

Fuchsia's security team was consulted for questions related to encryption.

Socialization:

A detailed design review of Fxfs was held within the Local Storage team.

Design

Log Structured Merge Trees

An overview of log structured merge (LSM) trees can be found on Wikipedia. Fxfs uses LSM trees to provide a persistent key-value data structure that is used within Fxfs.

LSM trees offer some appealing properties:

The layers written to the device are immutable which means:
- No locking is required when accessing these layers.
- Restricts certain classes of bugs — if the layers are not being modified, there are fewer chances of corrupting them.
- Potentially easier to debug issues — if layers are not changing, there are less moving parts.
- It is easier to support compression, and more aggressive compression can be applied in the background to save space.
Mutations are written out sequentially in batches, rather than piecemeal, which is a more favourable write pattern for flash storage, reducing write amplification and avoiding unnecessary erase cycles. This write pattern also happens to be friendly for spinning storage devices.
Snapshot support falls out more easily than otherwise.
Permits deferral of work required for writes to idle times (many devices have significant periods of idle time). What this means in practice is that compactions can be done when the device is otherwise not in use, at times which are most convenient.
Compactions can be done in the background yielding time back for more important things.
Rewriting metadata is a well-exercised integral filesystem operation which can make format migrations easier. A completely different layer format can be written out during compaction.

There are some downsides:

More difficult space management — all operations require metadata space, even deletions, so careful reservation of space is required. Note that this issue arises with other filesystems that support snapshots so it is not unique to LSM trees.
Slower reads in cases where there are a large number of layers. There are mitigations we can employ for this, such as bloom filters, and compactions will reduce this cost by merging layers.

It is worth noting that even with the choice of LSM trees, there is still the option of using a mutable persistent layer format if we find that a hybrid approach makes sense.

The API to Fxfs's LSM tree implementation require a merge function which dictates how records are to be merged.

Fxfs's in-memory layer is represented by an efficient, concurrency-friendly data structure (currently a skip-list).

Object Stores

Fxfs is comprised of a hierarchy of object stores. Object stores provide a file-like API keyed by object ID, a 64-bit unsigned integer. Simple namespace features (i.e. directory support or similar) are supported, but not mandatory; it is possible to use an object store and just refer to objects using object IDs. Object stores keep their metadata in a persistent key-value data structure (LSM trees). The metadata comprises typical file information such as the size of object and extents which map to device offsets.

Fxfs's LSM trees use object stores to store their persistent layers which presents a cyclic dependency: object stores use LSM trees to store their metadata and then LSM trees use object stores to store their persistent layers. To resolve this, object stores are arranged in a hierarchy:

Root Parent Store

The root parent object store is backed by an in-memory only LSM tree, thus has no dependencies on object stores.

The root parent store only contains the layer files for the root store and the journal file (discussed later). Note that the LSM tree only contains metadata (e.g. extent information) and it is only that information that is permanently resident in memory.
Root Store

The root store uses the root parent store to store its persistent layer files.

The root store contains all other objects to support the filesystem, such as objects used by the allocator and the super-blocks.
Child Stores

Child stores use the root store to store their persistent layer files. They store user data. There can be many child stores, but they can only use the root store as their parent, which limits the hierarchy to three levels. Note that we could support deeper hierarchies of object stores if needed.

Objects

Objects are identified by a 64-bit unsigned integer that is unique within the store that they are stored in, so to uniquely identify an object within the filesystem, you need to specify the store.
Zero is reserved and used to indicate an invalid object ID.
Objects support multiple attributes, indexed by a 64-bit attribute ID which is unique within the object. Attributes have a file-like API — it is possible to read and write arbitrary amounts at arbitrary offsets within an attribute. Initially, Fxfs will only expose access to a single attribute for objects (i.e. the contents of the file). Later, additional attributes might be used to support extended attributes. Attributes can be sparse.
There are two modes available for writing to objects: a copy-on-write mode (the same as Minfs) and an overwrite mode. With the copy-on-write mode, any writes to blocks that are already allocated will involve writing to newly allocated blocks, updating metadata to point at the new blocks and then freeing the old blocks. The overwrite mode will overwrite existing blocks and thus will not require updating the same metadata. Most objects will be written using the copy-on-write mode and this will be the only mode initially exposed to external users.

Journal

Fxfs's journal serves two purposes:

It provides support for transactions: the ability to atomically apply mutations to multiple file system objects.
To provide fast persistence of changes in the event of power loss. Without this, no changes would persist until the memory layers are flushed, but for performance and write amplification reasons, it is desirable to delay this as much as possible (depending on memory constraints).

All mutations to in-memory data structures must go through the journal. Unlike Minfs' journal, the journal is a logical journal rather than a physical journal. This substantially reduces the amount of space taken by the journal — an insertion record can be represented by a few bytes, whereas changes to data structures within Minfs might involve several blocks.

Note that Fxfs' journal (like Minfs') does not include data. However, Fxfs' journal does include checksums on data written in COW mode, and Fxfs verifies these checksums during journal replay for data written since the last known device flush, which means that Fxfs can detect a torn COW write and will revert the file contents to its last completely written version. Minfs does not have this data-checksumming feature.

The journal exists as a single ever-growing file. Mutations are streamed to the file. As trees are compacted, extents at the beginning of the journal file can be freed which means the journal will have a sparse prefix. All objects have 64-bit sizes and the write rates we need to support will mean wrapping will not be a concern in our lifetime. To counter the fact that devices are free to reorder writes, a checksum will be placed at the end of every 4 KiB chunk. At replay time, replay will end at the first chunk in the journal that has a checksum mismatch. The checksum of one chunk is used as the seed for the next chunk which should guard against inadvertently accepting blocks intended for a different offset or that are remnants from a previous instance of Fxfs.

When replaying the journal, there is the challenge of knowing the extents for the journal file that aren't covered by the super-block. To solve this issue, the journal will use the overwrite mode of operation and will preallocate extents. During replay, any extents that are in the journal stream are used to read later offsets in the journal file.

Super-blocks

Fxfs's super-blocks are more than just blocks but they serve similar purposes to conventional super-blocks so the same name is used. The super-blocks exist as objects in the root object store. Unlike other objects, their first extent is placed at fixed locations on the device (all other objects have no restrictions on location). The first part of the super-block contains a serialized structure that contains similar information to that found on other filesystems (e.g. block sizes, object numbers for important objects, etc.) After that, the super-block contains all the records contained in the root parent object store (at the time the super-block was written). The super-block is written in the same way that the journal is written: the end of each 4 KiB chunk has a checksum.

There are two super-blocks. During mount, the super-block with the latest sequence number is chosen.

Mounting the filesystem involves:

Read the super-block with the latest sequence number.
Apply all the mutations in the journal to in-memory layers.
Read the layer information and initialise the LSM tree persistent layers.

Allocator

The allocator uses an LSM tree to store reference counted extents (this can be effectively seen as persistently storing <device-range> => <reference-count>. Initially Fxfs will only support reference counts of one, but in future, it is possible that Fxfs will support file clones or extent sharing which will lead to reference counts of more than one.

Directories

Object stores support directories by storing records in the form: <object-id><child-name> => <child-object-id><child-type>. This information is recorded in the same tree that is used for other object metadata.

Compactions

Initially compactions will be driven by the need to release space in the journal. A reasonable policy will be chosen initially and will evolve as required; whatever we do is not tied to the format so this is easily changeable.

Filename Encoding

The specific encoding which Fxfs uses for filenames is left unspecified for now, but we note that the encoding choice will be minimally compatible with UTF-8, which is the encoding used for FIDL strings and Rust strings. Fxfs is case-sensitive and performs no normalisation. This may need to change if compatibility issues arise.

Filesystem Limits

File size is limited to 2^63 bytes (the maximum value of a 64-bit signed integer). Internally, Fxfs can represent file sizes up to 2^64 bytes, but the existing filesystem API limits this (e.g. the off_t type which is signed).
Directory size has no specific limit, but is in practice constrained by the disk size and inode count.
Each volume can have roughly 2^64 inodes. In practice this will more likely be limited by disk size.
Fxfs can support roughly 2^64 volumes. In practice this will more likely be limited by disk size and other resources (e.g. memory when loading the volumes).
Fxfs supports timestamps at nanosecond granularity, with 2^64 bit seconds and 2^32 bit nanoseconds, which can represent UNIX epoch timestamps far into the future.

Fsck

Fxfs has a basic implementation of fsck, which verifies allocations and object reference-counts. More work will be done in the near future to expand the coverage of fsck.

Minfs

Some limitations of Minfs guide our design choices for Fxfs:

Multi-thread support. Minfs is single-threaded. Retroactively adding multi-threaded support is much harder than implementing it at the start. Fxfs is multi-threaded.
Minfs uses a physical journal which is simple, but incurs significant write amplification. Fxfs uses a logical journal.
Minfs has hard limits on directory sizes and inode count. Fxfs' limits are mainly based on the size of the disk.
Evolving Minfs's format is difficult because all changes have to be backwards compatible and migrations must be carefully considered. Fxfs still requires migrations but because compactions trigger full rewrites of metadata, some format changes are simpler.

Implementation

Prototyping of Fxfs has been underway for some time, it already exists in-tree (//src/storage/fxfs) and options exist for using it instead of Minfs. Minfs will remain the default mutable filesystem for the time being. The decision to change the default is outside the scope of this RFC.

Language

Fxfs uses Rust. We believe Rust offers some compelling benefits:

Memory safety — fewer crashes and fewer security issues, which is particularly helpful when multiple threads come into play.
Good async support — filesystems benefit from high parallelism and Rust's async support makes this considerably easier than in other languages.

Performance

One of the motivating reasons for pursuing Fxfs is to improve performance of our mutable filesystems. The benchmarks we are using to evaluate performance are being developed independently. The precise nature of these benchmarks is not within the scope of this RFC.

Backwards Compatibility

Fxfs will initially exist as an alternative to Minfs with limited support (not suitable for production) for migrating from Minfs.

We may consider supporting more robust migrations in future.

Security considerations

Initially Fxfs can be used with Zxcrypt for volume-level encryption (just as Minfs is), so there should be little change in this area. Fxfs is developed in Rust which should, in theory, reduce some security risk.

Fxfs built-in encryption will require a thorough consideration of security aspects, which is outside the scope of this RFC.

Privacy considerations

N/A

Testing

Fxfs uses the same comprehensive filesystem test suite that Minfs uses. Where appropriate, additional unit and integration tests specific to Fxfs will exist.

Documentation

Fxfs should be interchangeable with Minfs, so no additional external documentation is necessary; our APIs will not be changing. This RFC itself is the primary artifact documenting the high level design of Fxfs.

Drawbacks, alternatives, and unknowns

Developing a new filesystem is a significant undertaking. Whilst we have managed to implement a prototype that is close to Minfs in features and performance, there is some way to go before Fxfs is production ready.

Before embarking on this project we considered other strategies such as porting an existing open-source filesystem, or using the design from an existing filesystem. Some considerations were:

Fuchsia has a very different filesystem interface compared to other operating systems, so porting is non-trivial.
Some existing open source filesystems come with licensing hurdles.
The code bases are and would remain inconsistent with the majority of Fuchsia's code base (C++ is not common and testing practices are different).
Choosing an existing design would place some restrictions on format evolution; format changes to support a new feature requirement would involve potentially forking the format and/or incur overheads with pushing changes upstream.
Much of the performance work we need to do is not actually within filesystem implementations, so developing a new filesystem is arguably not on the critical path for achieving our performance goals anyway.
Existing filesystems come with an evolution history. Reimplementing an existing design would typically involve picking the set of features that we want to support, which leads to limited compatibility.
Making incremental changes to Minfs to get to where we need to be would likely take longer given that we would most likely need to support migrations.

With that said, we have chosen to support external contributions to the porting of F2fs, with the goal of getting to the point where Fuchsia works well with multiple different filesystems.

Prior art and references

LSM trees are not new and are widely used in various applications although examples of their use in smaller scale filesystems is limited.

Other aspects of Fxfs design such as a logical journal format and multi-volume support can be found in other filesystems.