RFC-0013: Cloning a VMO mapping | |
---|---|
Status | Accepted |
Areas |
|
Description | A system call that creates a CoW clone of a VMO from a mapping, without needing a handle to it. |
Issues | |
Gerrit change | |
Authors | |
Reviewers | |
Date submitted (year-month-day) | 2020-10-26 |
Date reviewed (year-month-day) | 2020-12-09 |
Summary
It is currently only possible to create a CoW clone of a VMO if you have a handle to it. This is not sufficient for zygote use cases, which requires CoW cloning an entire address space, including mappings for which all VMO handles have been closed. This RFC proposes a new system call to address this gap.
Motivation and problem statement
Chromium on Linux spawns renderers by forking them from a zygote process, resulting in significant memory and CPU savings. We would like to realize these savings on Fuchsia.
To do this, we need some way to implement an "address space clone" operation, which when given:
- A process root VMAR
- Some handles that refer to VMOs mapped in the address space
will return:
- A new process with an address space populated by clones of the VMOs mapped in the input process
- For each handle referring to a VMO in the input process, a new handle referring to the corresponding clone of that VMO in the new process
By adding this system call, we can comfortably implement this in userspace.
This works by getting the address space layout with ZX_INFO_PROCESS_MAPS
and using zx_vmar_create_vmo_child
to create a clone of each mapping,
except for any VMOs that there is a handle to,
which are handled using zx_vmo_create_child
.
Design
zx_vmar_create_vmo_child
zx_status_t zx_vmar_create_vmo_child(zx_handle_t handle,
uint32_t options,
zx_vaddr_t addr,
size_t size,
zx_handle_t* out)
Creates a CoW clone of the pages mapped in the range addr
to addr
+size
in the VMAR referenced by handle
.
The range must be a subrange of a single VMO mapping,
i.e. it may not span two mappings or include any unmapped pages.
The result is a new VMO that is a child of the VMO mapped in the range.
Allowed options are
ZX_VMO_CHILD_SNAPSHOT
, ZX_VMO_CHILD_SNAPSHOT_AT_LEAST_ON_WRITE
, and ZX_VMO_CHILD_NO_WRITE
.
They are interpreted the same way as in zx_vmo_create_child
.
(ZX_VMO_CHILD_SLICE
is not currently allowed due to security concerns:
special care would have to be taken to ensure that
the ability to modify the parent VMO doesn't break any security boundaries.)
Returns ZX_ERR_INVALID_ARGS
if:
- the range
addr
toaddr
+size
isn't a subrange of a single VMO mapping addr
orsize
is not page-aligned
Returns ZX_ERR_ACCESS_DENIED
if:
- the mapping was done using a new flag
ZX_VM_CANNOT_CREATE_VMO_CHILD
, or the vmar options include this flag handle
doesn't haveZX_RIGHT_INSPECT
Returns ZX_ERR_BAD_HANDLE
if handle
isn't a valid handle,
ZX_ERR_WRONG_TYPE
if handle
isn't a VMAR,
and ZX_ERR_BAD_STATE
if the VMAR is destroyed.
Can also return any error returned by zx_vmo_create_child
if the documented reason applies.
The returned VMO handle will have ZX_DEFAULT_VMO_RIGHTS
, with the following changes:
ZX_RIGHT_WRITE
will be removed ifZX_VMO_CHILD_NO_WRITE
was specified.ZX_RIGHT_EXECUTE
will be added if the mapping was executable andZX_VMO_CHILD_NO_WRITE
was not specified. This means it is be possible to create executable clones of executable mappings, but they must be read-only.
ZX_VM_CANNOT_CREATE_VMO_CHILD
ZX_VM_CANNOT_CREATE_VMO_CHILD
is a new zx_vm_options_t
flag
which can be specified for a VMO mapping or a VMAR.
This allows code that maps VMOs or creates VMARs
to disallow zx_vmar_create_vmo_child
on them.
Implementation
Adding a syscall is not a complex change; it can be done in one CL. Implementation of the clone algorithm is beyond the scope of this RFC.
Performance
It's not worth benchmarking this syscall on its own, since it's only used in the context of a zygote implementation. Instead we should measure the overall performance of the zygote implementation.
Security considerations
Creating VMO children using VMAR references instead of handles is not currently possible, but this RFC would make it possible. It's a bit like granting a new capability out of thin air.
In this case it doesn't seem that risky
because the end result is the same as making a new VMO and copy the data over
(but with less memory usage), which was possible before.
If anything needs to disallow this,
this RFC proposes a zx_vm_options_t
flag (ZX_VM_CANNOT_CREATE_VMO_CHILD
)
that can be used.
Testing
The syscall will be unit tested. The tests for a future implementation of address space cloning will also serve as integration tests for the syscall.
Documentation
This RFC is a starting point for the documentation of zx_vmar_create_vmo_child
.
Drawbacks, alternatives, and unknowns
What other strategies might solve the same problem?
We could add a system call to do the entire address space clone operation. This would be a very complicated system call, in both interface and implementation. It's better to push the complexity into userspace.
We could add a system call that allows minting a VMO handle from a mapping.
This allows what is essentially a userspace implementation of zx_vmar_create_vmo_child
,
by first creating a handle from the mapping and then calling zx_vmo_create_child
.
But the Zircon API generally only allows minting handles to newly created objects,
in order to make the interface easier to reason about.
This syscall would have to create handles out of thin air.
There are various tricks that might be used to avoid closing any VMO handles,
such as modifying all code that creates and maps VMOs to avoid closing them,
or intercepting zx_handle_close
.
However, this is not sufficient to implement address space cloning,
since the loader service maps some VMOs into the process before it starts running,
and this doesn't help you get handles to those.
We could modify the loader service to pass the necessary handles to the process,
but that would be more work than adding this syscall.
Prior art and references
https://chromium.googlesource.com/chromium/src/+/master/docs/linux/zygote.md http://neugierig.org/software/chromium/notes/2011/08/zygote.html