RFC-0010: zx_channel_iovec_t support for zx_channel_write and zx_channel_call | |
---|---|
Status | Accepted |
Areas | |
Description | This RFC introduces a new mode to zx_channel_write and zx_channel_call which copies input data from multiple memory regions rather than from a single contiguous buffer. |
Issues | |
Gerrit change | |
Authors | |
Reviewers | |
Date submitted (year-month-day) | 2020-09-25 |
Date reviewed (year-month-day) | 2020-10-06 |
Summary
This RFC introduces a new mode to zx_channel_write, zx_channel_write_etc,
zx_channel_call and zx_channel_call_etc that copies input data from
multiple memory regions rather than from a single contiguous buffer. This
improves performance for some clients by allowing message data to be
copied directly from multiple userspace objects without an intermediate
allocation, copy and layout step. This is accomplished by updating the existing
syscalls to take an array of zx_channel_iovec_t memory region descriptors
when an option is specified.
Motivation
The main motivation for this proposal is performance.
For non-linearized domain objects, FIDL bindings currently need to (1) allocate
a buffer and (2) copy objects into the buffer in a standard layout. After these
steps, the buffer is copied again into the kernel. zx_channel_iovec_t allows
the objects to be copied directly into the kernel. In addition, FIDL message
data no longer needs to be laid out in a standard order; only the
zx_channel_iovec_t array must reflect the needed order.
Design
zx_channel_write currently has the following signature:
zx_status_t zx_channel_write(zx_handle_t handle,
                             uint32_t options,
                             const void* bytes,
                             uint32_t num_bytes,
                             const zx_handle_t* handles,
                             uint32_t num_handles);
The input data is a contiguous byte array pointed to by bytes.
In zx_channel_write_etc, zx_channel_call and zx_channel_call_etc, there
are analogous arrays. The fact that these arrays must be contiguous leads to
overhead. In particular, for FIDL messages with out-of-line components, the
FIDL encoder must allocate a buffer and relocate data into it, which can be
expensive.
zx_channel_iovec_t provides an alternative path. zx_channel_write,
zx_channel_write_etc, zx_channel_call and zx_channel_call_etc
instead receive a list of object locations and sizes, and the
copying happens within the kernel, avoiding additional duplication and
allocation.
zx_channel_iovec_t is defined in C++ as follows:
typedef struct zx_channel_iovec {
  void* buffer;       // User-space bytes.
  uint32_t capacity;  // Number of bytes.
  uint32_t reserved;  // Reserved.
} zx_channel_iovec_t;
Each zx_channel_iovec_t points to the next capacity bytes to be copied from
buffer to the kernel message buffer. reserved must be set to zero.
The buffer field may be NULL only if capacity is 0. buffer pointers
may be repeated in multiple zx_channel_iovec_t entries.
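As a minimal sketch of the rules above, the following C fragment builds an array of entries on the caller side. The struct is copied from this RFC so the fragment compiles without Zircon headers; make_iovec and iovec_total_bytes are hypothetical helpers introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Local copy of the struct from this RFC.
typedef struct zx_channel_iovec {
  void* buffer;       // User-space bytes.
  uint32_t capacity;  // Number of bytes.
  uint32_t reserved;  // Reserved; must be zero.
} zx_channel_iovec_t;

// Hypothetical helper: fills one entry and zeroes the reserved field.
static zx_channel_iovec_t make_iovec(void* buffer, uint32_t capacity) {
  // buffer may be NULL only when capacity is 0.
  assert(buffer != NULL || capacity == 0);
  zx_channel_iovec_t iov = {buffer, capacity, 0};
  return iov;
}

// Hypothetical helper: total number of message bytes the array describes.
static uint64_t iovec_total_bytes(const zx_channel_iovec_t* iovs, uint32_t count) {
  uint64_t total = 0;
  for (uint32_t i = 0; i < count; i++) {
    total += iovs[i].capacity;
  }
  return total;
}
```

Because the same buffer pointer may appear in more than one entry, a caller could reference a shared region several times and each occurrence would count toward the total.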
The signatures of zx_channel_write, zx_channel_write_etc, zx_channel_call
and zx_channel_call_etc are unchanged. However, when the user specifies the
ZX_CHANNEL_WRITE_USE_IOVEC option to these syscalls, the void* bytes
argument will be interpreted as a zx_channel_iovec_t*. Similarly, the
num_bytes argument will be interpreted as the number of zx_channel_iovec_t
entries in the array.
Note that the type of the handle array (zx_handle_t or
zx_handle_disposition_t) is irrelevant, as only the bytes array is
changed.
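The reinterpretation of bytes and num_bytes can be sketched as follows. This is a stand-in written for illustration, not kernel code: sketch_message_size is a hypothetical function, and SKETCH_WRITE_USE_IOVEC is a placeholder bit whose value is an assumption rather than the value of ZX_CHANNEL_WRITE_USE_IOVEC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Local copy of the struct from this RFC.
typedef struct zx_channel_iovec {
  void* buffer;
  uint32_t capacity;
  uint32_t reserved;
} zx_channel_iovec_t;

// Placeholder option bit for this sketch (assumed value).
#define SKETCH_WRITE_USE_IOVEC 2u

// Hypothetical helper: number of message bytes described by (bytes, num_bytes)
// under the two interpretations of the arguments.
static uint64_t sketch_message_size(uint32_t options, const void* bytes,
                                    uint32_t num_bytes) {
  assert(bytes != NULL || num_bytes == 0);
  if ((options & SKETCH_WRITE_USE_IOVEC) == 0) {
    // Default mode: bytes is a contiguous buffer of num_bytes bytes.
    return num_bytes;
  }
  // Iovec mode: bytes is an array of num_bytes descriptors.
  const zx_channel_iovec_t* iovs = (const zx_channel_iovec_t*)bytes;
  uint64_t total = 0;
  for (uint32_t i = 0; i < num_bytes; i++) {
    total += iovs[i].capacity;
  }
  return total;
}
```

The same pair of arguments therefore means "16 bytes" in one mode and "2 descriptors" in the other, which is why the option must be set consistently with what the caller passes.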
The message described by the zx_channel_iovec_t array will either be sent
with all parts of the message included, or the message will not be sent at
all. Handles provided to the syscall are no longer available to the caller
on both success and failure.
Error conditions
These are the error conditions of zx_channel_write, zx_channel_write_etc,
zx_channel_call and zx_channel_call_etc, with updates due to the
introduction of iovecs.
ZX_ERR_OUT_OF_RANGE: num_bytes or num_handles are larger than
ZX_CHANNEL_MAX_MSG_BYTES or ZX_CHANNEL_MAX_MSG_HANDLES respectively.
If the ZX_CHANNEL_WRITE_USE_IOVEC option is specified,
ZX_ERR_OUT_OF_RANGE will be produced if num_bytes is larger than
ZX_CHANNEL_MAX_MSG_IOVEC or the sum of the iovec capacities exceeds
ZX_CHANNEL_MAX_MSG_BYTES.
ZX_ERR_INVALID_ARGS: bytes is an invalid pointer, handles
is an invalid pointer, or options contains an invalid option bit.
If the ZX_CHANNEL_WRITE_USE_IOVEC option is specified,
ZX_ERR_INVALID_ARGS will be produced if the buffer
field contains an invalid pointer.
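The iovec-specific checks can be sketched as a single validation pass. This is an illustration under assumptions: check_iovecs, the SKETCH_ constants and the status enum are hypothetical stand-ins (the enum values are not Zircon's), the order of the checks is a choice made here, and "invalid pointer" is reduced to a NULL test because a real implementation would detect bad addresses while copying from user memory. The RFC does not say which error a nonzero reserved field produces, so that case is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Local copy of the struct from this RFC.
typedef struct zx_channel_iovec {
  void* buffer;
  uint32_t capacity;
  uint32_t reserved;
} zx_channel_iovec_t;

// Limits stated in this RFC.
#define SKETCH_MAX_MSG_BYTES 65536u
#define SKETCH_MAX_MSG_IOVEC 8192u
static_assert(SKETCH_MAX_MSG_IOVEC == SKETCH_MAX_MSG_BYTES / 8u,
              "iovec limit is derived from the byte limit");

// Stand-in status codes; the numeric values are placeholders.
typedef enum {
  SKETCH_OK = 0,
  SKETCH_ERR_INVALID_ARGS = 1,
  SKETCH_ERR_OUT_OF_RANGE = 2,
} sketch_status_t;

// Hypothetical validation of an iovec array against the conditions above.
static sketch_status_t check_iovecs(const zx_channel_iovec_t* iovs,
                                    uint32_t num_iovecs) {
  if (num_iovecs > SKETCH_MAX_MSG_IOVEC) {
    return SKETCH_ERR_OUT_OF_RANGE;
  }
  if (iovs == NULL && num_iovecs > 0) {
    return SKETCH_ERR_INVALID_ARGS;
  }
  uint64_t total = 0;
  for (uint32_t i = 0; i < num_iovecs; i++) {
    // buffer may be NULL only when capacity is 0.
    if (iovs[i].buffer == NULL && iovs[i].capacity != 0) {
      return SKETCH_ERR_INVALID_ARGS;
    }
    total += iovs[i].capacity;
    if (total > SKETCH_MAX_MSG_BYTES) {
      return SKETCH_ERR_OUT_OF_RANGE;
    }
  }
  return SKETCH_OK;
}
```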
Alignment
There are no alignment restrictions on the bytes specified in a
zx_channel_iovec_t. Each zx_channel_iovec_t will be copied without padding.
Limits
The existing limits on the number of bytes (65536) and handles (64) per
message are unchanged. Note that these limits apply to messages and not
zx_channel_iovec_t entries.
The number of zx_channel_iovec_t will be limited to 8192 per syscall. This
number comes from the number of 8-byte aligned inline + out of line objects
that can fit in a 65536 byte message, with each inline + out of line object
potentially using a zx_channel_iovec_t entry.
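The arithmetic behind the 8192 figure is small enough to state in code. The names below are hypothetical; only the two input numbers (65536 bytes, 8-byte alignment) come from the text.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_MSG_BYTES 65536u  // Per-message byte limit from this RFC.
#define SKETCH_OBJECT_ALIGNMENT 8u   // Alignment of inline and out-of-line objects.

// Upper bound on the number of non-empty aligned objects in one message,
// and therefore on the number of iovec entries needed to describe them.
static uint32_t max_objects_in_message(uint32_t max_bytes, uint32_t alignment) {
  assert(alignment != 0);
  return max_bytes / alignment;
}
```

With these inputs, max_objects_in_message(65536, 8) evaluates to 8192.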
Implementation
Syscall
- Introduce the zx_channel_iovec_t type, as defined in the design section.
- Add the ZX_CHANNEL_WRITE_USE_IOVEC option.
- No changes to the visible syscall interface; the zx_channel_iovec_t array is passed in to the existing bytes parameter.
Kernel
After receiving the ZX_CHANNEL_WRITE_USE_IOVEC option, the kernel will:
- Copy the data pointed to by the zx_channel_iovec_t objects to the message buffer. While the copy operations will typically also be performed in order of the zx_channel_iovec_t inputs, it is not mandatory. However, the final message must be laid out in the order of the zx_channel_iovec_t entries.
- Write the message to the channel.
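The copy step can be sketched as a gather loop in plain C. This is a userspace illustration, not the kernel implementation: gather_iovecs is a hypothetical name, memcpy stands in for the kernel's user-memory copy, and the regions are copied in input order even though the kernel is free to choose another order as long as the resulting layout is the same.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Local copy of the struct from this RFC.
typedef struct zx_channel_iovec {
  void* buffer;
  uint32_t capacity;
  uint32_t reserved;
} zx_channel_iovec_t;

// Copies each region back to back into msg, with no padding between regions.
// Returns the number of bytes written, or -1 if the regions do not fit.
static int64_t gather_iovecs(const zx_channel_iovec_t* iovs, uint32_t num_iovecs,
                             uint8_t* msg, size_t msg_capacity) {
  assert(msg != NULL);
  size_t offset = 0;
  for (uint32_t i = 0; i < num_iovecs; i++) {
    uint32_t n = iovs[i].capacity;
    if (n > msg_capacity - offset) {
      return -1;  // The caller discards msg; nothing is sent.
    }
    if (n > 0) {
      memcpy(msg + offset, iovs[i].buffer, n);
    }
    offset += n;
  }
  return (int64_t)offset;
}
```

Returning an error without sending anything mirrors the all-or-nothing behavior described in the design section.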
FIDL
This is a proposal for a system call change for which the implementation takes place within the kernel and the specifics of FIDL binding changes are out of scope. That said, for the sake of evaluating this proposal it is important to understand the effect on FIDL encoding.
FIDL bindings can optionally take advantage of zx_channel_iovec_t support by
adding support for encoding FIDL objects into an array of zx_channel_iovec_t.
A key difference between this encode path and existing encode paths is that
zx_channel_iovec_t entries allow the kernel to copy objects in place. The main
complication with this is pointers. FIDL-encoded messages need to be
sent to the kernel with pointers replaced by PRESENT or ABSENT marker
values. However, in many cases the objects still need to have their
original pointer values after the system call so that destructors can be
called.
This means that bindings taking advantage of zx_channel_iovec_t will
sometimes need to do extra bookkeeping work to make sure the objects are
cleaned up correctly.
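One possible shape for that bookkeeping is a save, patch, restore sequence around the syscall. The sketch below is an assumption about how a binding might do it, not a description of any existing binding: sketch_vector_t, sketch_patch_t and both functions are hypothetical, and the marker values (all ones for PRESENT, zero for ABSENT) are assumed here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Assumed marker values for this sketch.
#define SKETCH_ALLOC_PRESENT UINT64_MAX
#define SKETCH_ALLOC_ABSENT ((uint64_t)0)

// Hypothetical domain object: data holds a pointer value while the object is
// live and a marker while the object is being sent in place.
typedef struct {
  uint64_t count;
  uint64_t data;
} sketch_vector_t;

// Hypothetical bookkeeping record for one patched pointer field.
typedef struct {
  uint64_t* slot;  // Where the pointer lives.
  uint64_t saved;  // Original pointer value.
} sketch_patch_t;

// Replaces the pointer with a marker and remembers the original value.
static sketch_patch_t patch_pointer(uint64_t* slot) {
  assert(slot != NULL);
  sketch_patch_t patch = {slot, *slot};
  *slot = (*slot != 0) ? SKETCH_ALLOC_PRESENT : SKETCH_ALLOC_ABSENT;
  return patch;
}

// Restores the original pointer so that cleanup code sees a valid address.
static void restore_pointer(sketch_patch_t patch) {
  assert(patch.slot != NULL);
  *patch.slot = patch.saved;
}
```

A binding would call patch_pointer on every pointer field before the write and restore_pointer on each record afterwards, on both the success and failure paths.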
Migration
Since this feature is implemented as an option that is default-disabled, it shouldn't have an immediate effect on existing users. Call-sites can migrate to use the option as needed.
Practically speaking, the intention is to migrate FIDL bindings that can
benefit from zx_channel_iovec_t to use it. This is expected to have minimal
effects on FIDL users.
Performance
A prototype was implemented and benchmarked.
This prototype implemented the zx_channel_write option on the kernel side
and limited FIDL support (inline objects and vectors only).
The message header and each inline and out-of-line object
had their own zx_channel_iovec_t entry.
An array of 64 entries was used to store the zx_channel_iovec_t in both the
kernel and FIDL encode.
These measurements are from a machine with an Intel Core i5-7300U CPU @ 2.60GHz.
Byte vector event benchmarks (zx_channel_write, zx_channel_wait_async and zx_channel_read) showed a significant improvement:
- 4096 byte vector: 9398 ns -> 4495 ns
- 256 byte vector: 8788 ns -> 3794 ns
FIDL encode also showed performance improvements.
Encode time of the byte vector examples:
- 4096 byte vector: 345 ns -> 88 ns
- 256 byte vector: 251 ns -> 86 ns
Inline objects also show small encode improvement:
- Struct with 256 uint8 fields: 67 ns -> 49 ns
Security considerations
Given that this is a significant change to an existing system call, a security review is needed before the implementation lands.
Privacy considerations
There should be no impact to privacy.
Testing
Unit and integration tests will be added for each layer that is changed.
No device or system-wide end-to-end tests are intended to be added, though existing test coverage will help ensure no unexpected bug has been introduced after a migration has taken place.
Documentation
The system call documentation needs to be updated to indicate support for this feature.
No architecture-wide documentation changes are needed.
Drawbacks, alternatives, and unknowns
The main drawback of this proposal is added complexity: the kernel needs to
support the option, and FIDL bindings that use the
ZX_CHANNEL_WRITE_USE_IOVEC option need to ensure that
objects are properly cleaned up after they have been mutated for in-place copy.
Limits
There is an argument for a lower limit on the number of zx_channel_iovec_t,
potentially closer to 16 than 8192. This would allow the
zx_channel_iovec_t array to be copied onto the kernel's stack. However, this
would prevent the implementation strategy of assigning one zx_channel_iovec_t
entry per out-of-line FIDL object.
In practice, it might be more performant to linearize in userspace when there
are a large number of zx_channel_iovec_t, or at least avoid shifting work
to the kernel. However, the 8192 limit is suggested for simplicity until
it is known if further refinement is needed.
An implementation-level consequence of the higher limit is that the
zx_channel_iovec_t array cannot entirely fit on the kernel stack. A stack
buffer can be used for the common case, but it will need to be copied into
a larger (and slower) buffer when there are sufficiently many entries.
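The stack-then-fallback strategy can be sketched in userspace C as a small-buffer pattern. Everything here is illustrative: the type and function names are hypothetical, malloc stands in for whatever allocator the kernel would use, and the threshold of 16 is only an example value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Local copy of the struct from this RFC.
typedef struct zx_channel_iovec {
  void* buffer;
  uint32_t capacity;
  uint32_t reserved;
} zx_channel_iovec_t;

#define SKETCH_STACK_IOVECS 16u  // Illustrative threshold.

typedef struct {
  zx_channel_iovec_t inline_storage[SKETCH_STACK_IOVECS];
  zx_channel_iovec_t* entries;  // Points at inline_storage or a heap block.
  uint32_t count;
} sketch_iovec_copy_t;

// Copies the descriptor array, using inline storage when it is small enough.
// Returns 0 on success and -1 if the fallback allocation fails.
static int iovec_copy_init(sketch_iovec_copy_t* copy,
                           const zx_channel_iovec_t* src, uint32_t count) {
  assert(copy != NULL);
  copy->count = count;
  if (count <= SKETCH_STACK_IOVECS) {
    copy->entries = copy->inline_storage;
  } else {
    copy->entries =
        (zx_channel_iovec_t*)malloc((size_t)count * sizeof(zx_channel_iovec_t));
    if (copy->entries == NULL) {
      return -1;
    }
  }
  if (count > 0) {
    memcpy(copy->entries, src, (size_t)count * sizeof(zx_channel_iovec_t));
  }
  return 0;
}

// Releases the heap block if one was used.
static void iovec_copy_destroy(sketch_iovec_copy_t* copy) {
  if (copy->entries != copy->inline_storage) {
    free(copy->entries);
  }
  copy->entries = NULL;
}
```

The common case pays only for a copy into storage that already exists, while large arrays take the slower allocation path.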
Vectorized handles
It would be possible to have an equivalent of zx_channel_iovec_t for handles,
or include them alongside bytes in the existing zx_channel_iovec_t. However,
the benefits are more limited for handles because the handle array tends to be
small. For simplicity, handles remain in a dedicated array.
Support for multiple messages in single write
A previous version of this RFC included a proposal for support for multiple
messages in a single zx_channel_write call.
Three proposals were considered:
- Flat representation: repurpose the reserved field on the zx_channel_iovec_t as two uint16_t fields: message_seq (which message the zx_channel_iovec_t is part of) and handle_count (the number of handles consumed by the bytes in buffer). The sequence numbers are constrained to be monotonic and have no gaps. This constraint enables a more performant kernel implementation, but can be weakened in the future if needed. This approach aligns with this RFC and multi-message support can be added to the existing structure.
- Array-of-array representation: there is an outer array of messages, each with pointers into an inner array of iovecs per message. This is similar to the structure used in the Linux syscall sendmmsg and might be more familiar to users. While the performance of the array-of-array representation wasn't measured, there is evidence that there could be a 5-25% overhead due to indirection (see CL).
- Header-prefixed representation: the buffer begins with a header and is followed by the iovec array. The header consists of 16 message descriptors, each of which contains only a uint16_t message_size field. This field determines the number of zx_channel_iovec_t entries associated with the message. This representation provides a hierarchical structure but eliminates the need for additional redirection and copying.
In design discussions, the flat representation was favored due to its performance properties and simplicity. While a full proposal for multi-message support is out of scope of this RFC, please note that this RFC is compatible with the flat representation.
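To make the flat representation concrete, the sketch below shows one way the reserved field could be split and how the "monotonic, no gaps" constraint could be checked. It is an illustration only, since multi-message support is out of scope here: sketch_multi_iovec_t and count_messages are hypothetical, and starting the sequence at 0 is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical layout: the 32-bit reserved field becomes two 16-bit fields.
typedef struct {
  void* buffer;
  uint32_t capacity;
  uint16_t message_seq;   // Which message this entry belongs to.
  uint16_t handle_count;  // Handles consumed by the bytes in buffer.
} sketch_multi_iovec_t;

static_assert(2 * sizeof(uint16_t) == sizeof(uint32_t),
              "the two new fields occupy the space of the reserved field");

// Returns the number of messages described by the array, or -1 if the
// sequence numbers decrease, skip a value, or do not start at 0.
static int32_t count_messages(const sketch_multi_iovec_t* iovs, uint32_t count) {
  if (count == 0) {
    return 0;
  }
  assert(iovs != NULL);
  if (iovs[0].message_seq != 0) {
    return -1;
  }
  for (uint32_t i = 1; i < count; i++) {
    uint32_t prev = iovs[i - 1].message_seq;
    uint32_t cur = iovs[i].message_seq;
    if (cur != prev && cur != prev + 1u) {
      return -1;
    }
  }
  return (int32_t)iovs[count - 1].message_seq + 1;
}
```

Because entries of one message are always adjacent, a single forward pass is enough to find message boundaries, which is consistent with the performance argument made for this representation.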
Dedicated syscall for iovec
Instead of adding a new option to existing syscalls, dedicated
zx_channel_write_iovec, zx_channel_write_etc_iovec, zx_channel_call_iovec
and zx_channel_call_etc_iovec syscalls could be created. However, an option is
preferred to avoid an explosion in the number of syscalls and the added
cognitive load on users.
zx_channel_iovec_t support in zx_channel_read
This RFC proposes support for zx_channel_iovec_t for channel writes, but not
channel reads. The reason for this is that there is a clear motivation for
iovecs on the write side (avoiding a FIDL linearization step), but there isn't
a clear and immediate benefit on the read side.
The Rust bindings could potentially benefit from read-side iovec support by partitioning the buffer into multiple smaller buffers each with its own ownership. This would facilitate a variant of the bindings similar to LLCPP that essentially casts buffers into output objects. However, there is no short-term plan to change the Rust bindings to work this way and there doesn't appear to be much cost to deferring adding support for read-path iovec until it is needed.
Prior art and references
Fuchsia has existing zx_stream_readv and zx_stream_writev system calls that
use vectorized I/O. Linux also provides similar readv and writev system
calls that respectively read from and write to a file descriptor.