RFC-0236: VMO Snapshot-Modified Clone | |
---|---|
Status | Accepted |
Areas | |
Description | The objective is to introduce a new VMO child type that allows a snapshot to be taken that captures any modified pages in a pager-backed VMO. |
Issues | |
Gerrit change | |
Authors | |
Reviewers | |
Date submitted (year-month-day) | 2023-07-20 |
Date reviewed (year-month-day) | 2023-12-12 |
Summary
The objective is to introduce a new VMO child type that allows a snapshot to be taken of any modified pages in the child of a pager-backed VMO.
Motivation
The kernel today supports two types of VMO clones (also called child VMOs): true snapshots, in which neither VMO sees the other's changes after the clone operation completes, and snapshot-at-least-on-write clones, in which the child can see changes made to the parent VMO after the operation completes, except on pages that the child itself has written to.
The cloning operation can be repeated on the VMO clones themselves, creating a hierarchy of child VMOs. For ease of implementation the Zircon kernel does not allow mixed hierarchies; you can either create a hierarchy of true snapshots or a hierarchy of snapshot-at-least-on-write VMOs.
With the introduction of Starnix, Fuchsia must support fork() efficiently. Fork requires cloning the entire address space of the parent process into the child process, which includes anonymous memory but also pager-backed data and code VMOs.
A problem arises because the kernel only supports true snapshots for anonymous memory, while for pager-backed VMOs it only supports snapshot-at-least-on-write clones. Therefore the fork() contract cannot be met with cloning, and Starnix is forced to implement eager copies and/or other CPU- and memory-intensive workarounds.
Stakeholders
csuter@google.com, jamesr@google.com
Facilitator:
davemoore@google.com
Reviewers:
rashaeqbal@google.com, jamesr@google.com
Consulted:
csuter@google.com, adanis@google.com, cpu@google.com, mvanotti@google.com, lindkvist@google.com
Socialization:
A design document was socialized with the Zircon team and some members of the Starnix team, and a work-in-progress CL was shared with stakeholders for benchmarking.
Requirements
- Efficient support for cloning address spaces that have both pager-backed VMOs and anonymous VMOs.
- There should be no wasted memory after a VMO becomes inaccessible to a process (its last handle is closed), for either parent VMOs or child VMOs. In other words, memory must not accumulate in these two cases:
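    // Case 1: each new child replaces its parent, which is then dropped.
    vmo = create_vmo();
    loop {
        child = create_snapshot_modified(vmo)
        child.write(...)
        vmo = child // old vmo is dropped
    }

    // Case 2: each child is dropped while the parent lives on.
    vmo = create_vmo();
    loop {
        vmo.write(...)
        child = create_snapshot_modified(vmo)
        // child is dropped
    }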
Design
`ZX_VMO_CHILD_SNAPSHOT_MODIFIED` is a new type of VMO clone that will allow the creation of a snapshot-modified child type.
This flag creates a child that retains a snapshot of any pages that have been modified by a child of a pager-backed VMO. Semantically, it is as though eager copy is performed on any pages in the parent not backed by a pager. Pages in the parent that are backed by a pager will behave with at least copy-on-write semantics into the clone. This differs from the original snapshot semantics which behave as though an eager copy is created on all pages in the VMO.
When used for the first time on a pager-backed VMO, the semantics behave as though `SNAPSHOT_AT_LEAST_ON_WRITE` were used. A handle to a clone is created which is initially identical to the parent, but modifications can be made on the clone which will cause it to diverge.
When used against any anonymous VMO, a `SNAPSHOT_MODIFIED` clone will be upgraded to have snapshot semantics, similar to the existing clone type upgrade semantics that are used by `SNAPSHOT_AT_LEAST_ON_WRITE`.
This flag is not valid for cloning VMOs with pinned regions, slices, or VMOs descended from `zx_vmo_create_physical()` or `zx_vmo_create_contiguous()`.
Cases
Snapshot-modified on pager-backed VMO
Creating a single `SNAPSHOT_MODIFIED` clone of a pager-backed VMO will behave as though a single `SNAPSHOT_AT_LEAST_ON_WRITE` clone was created. At the time the clone is performed, the new VMO will be identical to the parent. Until another `SNAPSHOT_MODIFIED` is performed on the clone, it can still be modified. Any unmodified pages in the clone will have at least copy-on-write semantics.
Snapshot-modified after at-least-on-write or snapshot-modified of a pager-backed VMO
As both `SNAPSHOT_MODIFIED` and `SNAPSHOT_AT_LEAST_ON_WRITE` behave the same way on a pager-backed VMO, both cases of calling `SNAPSHOT_MODIFIED` on a clone of the pager-backed VMO will result in the same semantics. Any pages that are no longer pager-backed will have a snapshot, and pager-backed pages will have at least copy-on-write semantics.
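The semantics described in the two cases above can be illustrated with a toy model. This is a sketch in Python, not Zircon code: the classes `PagerBackedVmo` and `Clone` and their methods are hypothetical names invented for illustration, pages are plain dictionary entries, and the "eager copy" is the semantic description from this RFC rather than how the kernel implements it. The model also assumes that unmodified pages always observe later writes to the root, which "at least copy-on-write" permits but does not require.

```python
class PagerBackedVmo:
    """Toy root VMO; its pages stand in for pager-supplied content."""

    def __init__(self, pages):
        self.pages = dict(pages)

    def read(self, index):
        return self.pages[index]

    def write(self, index, value):
        self.pages[index] = value

    def snapshot_modified(self):
        # First use on a pager-backed VMO: behaves like an
        # at-least-on-write clone with no modified pages yet.
        return Clone(self, {})


class Clone:
    """Toy child: privately modified pages layered over the root."""

    def __init__(self, root, modified):
        self.root = root
        # Semantically an eager copy of the pages that are no longer
        # pager-backed (the pages this lineage has modified).
        self.modified = dict(modified)

    def read(self, index):
        if index in self.modified:
            return self.modified[index]
        # Unmodified pages still read through to the pager-backed root.
        return self.root.read(index)

    def write(self, index, value):
        self.modified[index] = value

    def snapshot_modified(self):
        return Clone(self.root, self.modified)


root = PagerBackedVmo({0: "A0", 1: "B0"})
first = root.snapshot_modified()    # identical to the root at this point
first.write(0, "A1")                # page 0 is no longer pager-backed
second = first.snapshot_modified()  # snapshots page 0, shares page 1
first.write(0, "A2")                # not visible in `second`
root.write(1, "B1")                 # visible in both clones in this model

print(first.read(0), second.read(0), second.read(1))
```

In this model `second` keeps the value page 0 had when it was cloned, while page 1, which neither clone has written to, continues to track the root.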
Snapshot-modified after snapshot
In this case the semantics will be upgraded to snapshot, similar to snapshot-at-least-on-write.
Unsupported Cases
The following cases are currently unsupported, and if a `SNAPSHOT_MODIFIED` clone is attempted, `ZX_ERR_NOT_SUPPORTED` will be returned.
Snapshot-modified at the end of a snapshot-at-least-on-write chain
Snapshot-modified could potentially be extended for use at the end of a snapshot-at-least-on-write chain. This could have confusing results, however, because un-forked pages in the cloned VMO can see modifications made anywhere along the unidirectional VMO chain, with the closest relative being the one that is read. This is inconsistent with the original promise that a snapshot is taken of any modified page.
Snapshot-modified in the middle of a snapshot-at-least-on-write chain
Snapshot-modified cannot be used on a VMO that has children (i.e. in the middle of a snapshot-at-least-on-write chain), as it can create inconsistent hierarchies.
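The supported and unsupported cases can be summarized as a small decision function. The following is a sketch only: `Origin`, `Target`, and `resolve_snapshot_modified` are hypothetical names, the fields are a simplification of what the kernel would actually inspect, and the outcome for invalid targets is a plain label because this RFC does not state which error code those cases return.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    ANONYMOUS = "anonymous VMO, including snapshot clones"
    PAGER_ROOT = "pager-backed VMO"
    AT_LEAST_ON_WRITE_CLONE = "created with SNAPSHOT_AT_LEAST_ON_WRITE"
    SNAPSHOT_MODIFIED_CLONE = "created with SNAPSHOT_MODIFIED"


@dataclass
class Target:
    """Hypothetical description of the VMO being cloned."""
    origin: Origin
    parent_is_pager_backed: bool = False  # used for at-least-on-write clones
    has_non_slice_children: bool = False
    is_slice: bool = False
    has_pinned_pages: bool = False
    physical_or_contiguous: bool = False  # or descended from such a VMO


def resolve_snapshot_modified(target: Target) -> str:
    """Return the effective semantics of a SNAPSHOT_MODIFIED request."""
    if target.is_slice or target.has_pinned_pages or target.physical_or_contiguous:
        # The flag is not valid here; the error code is not modeled.
        return "invalid target"
    if target.origin is Origin.ANONYMOUS:
        return "snapshot"  # upgraded to full snapshot semantics
    if target.origin is Origin.PAGER_ROOT:
        return "at-least-on-write"
    if target.origin is Origin.AT_LEAST_ON_WRITE_CLONE:
        # Middle of a chain (has children) or end of a chain (parent is
        # itself a clone rather than the pager-backed VMO).
        if target.has_non_slice_children or not target.parent_is_pager_backed:
            return "ZX_ERR_NOT_SUPPORTED"
    return "snapshot-modified"


direct_child = Target(Origin.AT_LEAST_ON_WRITE_CLONE, parent_is_pager_backed=True)
chain_end = Target(Origin.AT_LEAST_ON_WRITE_CLONE, parent_is_pager_backed=False)
print(resolve_snapshot_modified(direct_child))
print(resolve_snapshot_modified(chain_end))
```

Keeping the checks in this order means that validity is decided before any upgrade, so a slice of an anonymous VMO is rejected rather than upgraded to a snapshot.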
Nomenclature
The existing naming convention for clone type flags in `zx_vmo_create_child()` aims to name flags in a way that describes the provided semantics. The current name for this flag is `SNAPSHOT_MODIFIED`, as it summarizes the behavior in which the modified pages in the clone are snapshotted. A similar option is `SNAPSHOT_MODIFICATIONS`. Other names considered were `SNAPSHOT_MIXED`, which does describe the semantics but is less clear, and `SNAPSHOT_PAGER_MODIFICATIONS`, which was another candidate, but it isn't ideal to couple the VMO with the pager.
Implementation
Snapshot-modified affects a number of files in Zircon, but can be broken into CLs that add support for the new snapshot type in kernel internals before the option flag is added to the syscall.
Some in-kernel testing will be added to validate correct behavior of the new structure during the first stage, and more complex core tests will be included with the introduction of the option flag.
Performance
In most cases the newly added code will only be called on the creation of a new snapshot-modified clone, so the performance of existing code is not expected to change. The one exception is that, when creating a `SNAPSHOT_AT_LEAST_ON_WRITE` child, the naive approach includes an additional acquisition of the VMO lock; if this introduces a performance penalty, it will be trivial to refactor the clone selection code to avoid it.
Security considerations
Snapshot-modified is unlikely to introduce any vulnerabilities as it is built using existing Zircon primitives and no new functionality is introduced.
Testing
Kernel unit tests and core tests will be included with the relevant CLs.
Documentation
A detailed design document aimed at Zircon developers will be released. It will outline the new data structures, supported and unsupported cases, changes to the code, challenges and alternatives.
A new flag will be added to `zx_vmo_create_child()` with the description:
`ZX_VMO_CHILD_SNAPSHOT_MODIFIED` - Create a child that behaves as though an eager copy was performed on any pages in the parent not backed by a pager, i.e. pages that have been modified by a child of a pager-backed VMO. Pager-backed pages will have at least copy-on-write semantics. This flag may not be used for VMOs created with zx_vmo_create_physical(), zx_vmo_create_contiguous(), VMOs containing pinned pages, or descendants of such VMOs. This flag is also not supported for any VMOs created with `SNAPSHOT_AT_LEAST_ON_WRITE` that have non-slice children or are not the child of a pager-backed VMO.
Drawbacks, alternatives, and unknowns
Creating snapshot semantics for a pager-backed VMO is non-trivial using existing Zircon VMO primitives. The user pager operates by servicing page requests to a single VMO, and at present its children form a single, unidirectional chain with copy-on-write semantics.
The snapshot flag that can be used on an anonymous VMO creates a hidden VmCowPages that is a common ancestor of the target VMO and its new snapshot. As nothing points to the hidden VmCowPages that could modify its pages, it is immutable. This hidden VmCowPages retains the pages from the target VMO, and modifications in the children have copy-on-write semantics. Therefore, a search for pages in this hierarchy involves a walk up the tree that can end at the hidden root.
It would be difficult to use the existing snapshot data structure for use by the pager, as the root VMO would always become hidden, with the original VMO becoming the left child. If the pager remains pointing to the original VMO (which now has a hidden parent), pager operations would have to be propagated up to the hidden root, to supply it pages at the request of its child. This creates inconsistencies as pages are added to a node that is not being operated on.
The simplest solution was to create a mixed hierarchy, where the root is visible and has a single hidden child that acts as the hidden root of a snapshot tree.
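The shape of this mixed hierarchy can be sketched with a toy tree. This is an illustrative model in Python, not the kernel's data structure: `Node` loosely stands in for a VmCowPages, `snapshot_modified` is a hypothetical helper, a node with no parent is assumed to be the pager-backed root, and details such as locking, page ownership, and merging of hidden nodes when a child is dropped are left out.

```python
class Node:
    """Toy stand-in for a VmCowPages: private pages plus a parent link."""

    def __init__(self, name, parent=None, hidden=False):
        self.name = name
        self.parent = parent
        self.hidden = hidden
        self.pages = {}

    def read(self, index):
        # Walk up the tree until some ancestor holds the page.
        node = self
        while node is not None:
            if index in node.pages:
                return node.pages[index]
            node = node.parent
        return None  # not present anywhere in this toy model

    def write(self, index, value):
        if self.hidden:
            raise ValueError("hidden nodes have no handle and are never written")
        self.pages[index] = value  # copy-on-write reduces to a private page


def snapshot_modified(target):
    if target.parent is None:
        # First clone of the visible, pager-backed root: a plain child.
        return Node("clone-of-" + target.name, parent=target)
    # Later clones: the target's modified pages move into a new hidden
    # node, which becomes the common ancestor of the target and the clone.
    hidden = Node("hidden", parent=target.parent, hidden=True)
    hidden.pages, target.pages = target.pages, {}
    target.parent = hidden
    return Node("clone-of-" + target.name, parent=hidden)


root = Node("root")
root.pages = {0: "A0", 1: "B0"}  # stands in for pager-supplied pages
first = snapshot_modified(root)
first.write(0, "A1")
second = snapshot_modified(first)  # root -> hidden -> {first, second}
first.write(0, "A2")
root.write(1, "B1")

print(second.read(0), first.read(0), second.read(1))
```

Because the pager keeps operating on the visible root, pages it supplies never have to be pushed into a hidden node; the hidden node only ever holds pages that a clone had already modified.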
There is more than one way to describe the provided semantics to users. The description in the RFC outlines the provided behavior with respect to the pager, but an alternative way of framing it is on the modification & cloning of pages only. An example of this would be:
"This flag will create a child that retains a snapshot of any modified pages from the parent. If the root vmo writes to an unmodified page after the snapshot has occurred, the snapshot-modified child will see the changes. This differs from the original snapshot semantics which behave as though an eager copy was created."
This description is correct, but it requires additional clarification of the caveat that a pager is required for this behavior, as the flag will upgrade to snapshot semantics when used on an anonymous VMO.
Could SNAPSHOT_MODIFIED replace SNAPSHOT_AT_LEAST_ON_WRITE?
It would be non-trivial to phase out snapshot-at-least-on-write and replace it with snapshot-modified as the semantics of the two clone types differ, which could cause unexpected behavior. Although both clone types offer 'at least copy-on-write' semantics, snapshot-modified can have a mix of snapshot and at-least-on-write pages within the same VMO. Additionally, the change would require performance testing. When a VMO is pager-backed, snapshot-at-least-on-write allocates less memory per clone created as there are no hidden, common ancestors created. Thus, migrating all uses of snapshot-at-least-on-write could introduce performance regressions in some use cases.
It could be worth investigating replacing `SNAPSHOT_AT_LEAST_ON_WRITE`, however, as it would simplify the API for `zx_vmo_create_child()`, since most fdio helpers promise semantics that are compatible with `SNAPSHOT_MODIFIED`.