|RFC-0209: Memory priority profiles|
Mark VMARs with a priority using profiles.
|Date submitted (year-month-day)||2022-08-16|
|Date reviewed (year-month-day)||2023-02-15|
Use the profile object to instruct the kernel to minimize page faults and other operations that may increase the latency of memory accesses by allowing profiles to be applied to VMARs.
Zircon is an over commit system and employs different reclamation systems to attempt to support these applications. These reclamation methods include eviction, page table reclamation, zero page scanning, etc and can be at odds with deadline tasks, as they can add unacceptable amounts of latency to memory accesses. Potential future reclamation methods, such as page compression, will also be problematic.
Audio has been a repeated example where any reclamation by the kernel, which leads to subsequent page faults in the audio threads, can cause audio threads to miss their deadlines.
The solution up to this point has been special case workarounds in the kernel, to disable all kinds of reclamation, just for audio related components. However, this approach is very fragile and does not scale to other users with similar requirements.
This RFC aims to provide a general mechanism suitable both for audio, to replace the existing temporary workarounds, as well as any other component.
firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
This problem and proposal has been discussed in docs with the Zircon team and at the Kernel Evolution Working Group (KEWG).
To provide control of reclamation to the user the profile object will be extended in two ways:
- An additional flag
zx_profile_createwill be added, along with a
- It will be allowable to apply a profile, via
zx_object_set_profile, to VMARs.
Initially only two values of the
memory_priority field will be supported:
ZX_PRIORITY_HIGH, with high acting to disable all
reclamation permanently. Room is left to support additional levels in the future
to allow for trade offs between minimizing latency and preventing OOMs.
When a profile is applied to a VMAR, assuming that profile has a valid
memory_priority, it is immediately applied to that VMAR and all its
subregions, overriding any previously applied profiles.
It is a requirement that the kernel will respect the
setting and the kernel MUST disable any dynamic reclamation in the marked VMARs.
ZX_PRIORITY_DEFAULT setting has no particular meaning, and does not have
to be respected. In particular the kernel is allowed, primarily for ease of
implementation, to extend
ZX_PRIORITY_HIGH requests to wider ranges that
Although over application is always allowed, if an address space goes from
ZX_PRIORITY_HIGH VMARs to have none, then the kernel SHOULD
return to the same state equivalent to if no profiles had ever been applied.
ZX_INFO_KMEM_STATS_EXTENDED topic of
zx_object_get_info will be
extended to report an additional field:
// The amount of memory in VMOs that would otherwise be tracked for // reclamation, but has had reclamation disabled. uint64_t vmo_reclaim_disabled_bytes;
User-space components do not typically use
zx_profile_create etc directly,
rather they invoke a
ProfileProvider service. Here the
method of the
ProfileProvider will be relaxed to accept arbitrary handles,
instead of just threads.
This section describes how the objects in the kernel would be changed to accommodate the information from the profile. The goal being to ensure that any part of the system that might need to know the memory priority in order to make a decision has efficient access to it. Where possible this design favors doing work at the point of profile application under the assumption that profile applications will be infrequent compared to other operations.
Reduction to boolean
The kernel objects involved in reclamation are:
VmAspace- Controls page table mappings and page table reclamation goes through here.
VmAddressRegion- Not currently involved in reclamation, but the creation of all page table mappings happens through this object.
VmObject- Any eviction, or future page reclamation strategies, go through the
Aside from the
VmAddressRegion, which is the object the profile is applied to,
each of these objects can query, for any of their sub ranges, what VMARs exist
in that range and the applied memory priority.
Having multiple VMARs, with different priorities, apply to a VMO region can be resolved by taking the highest applied priority.
To avoid repeated long running searches across all VMARs, objects need an
efficient way to know if any memory priority apply to them. For simplicity of
tracking, the initial proposed implementation will upgrade any priority range to
the full object. That is, if any part of a
VmObject is referenced by a
profile, the entire object will be considered to have that profile.
These two implementation simplifications make efficient memory profile querying just require propagating the union of booleans through object links.
Propagation of this reclamation disable boolean is based around edge transitions and counting.
When a VMAR has a profile set there are three outcomes to consider:
- Reclamation of this VMAR stays the same.
- Reclamation transitions from disabled to enabled.
- Reclamation transitions from enabled to disabled.
In the first case no direct propagation happens, but all subregions must still be traversed and have the profile applied to them. This unconditional traversal needs to occur as a subregion have had a different profile applied that needs to be overridden.
In either of the transitions both of the potential reference objects,
VmObject, need to be updated.
Instead of a single boolean the
VmObject objects will have a
counter of how many objects that refer to them have reclamation disabled.
Determining whether reclamation is disabled for these objects is therefore a
matter of comparing the count with zero.
VmObject may have additional
VmObject parents that it needs to propagate
the flag to. Since reclamation is controlled by the counter being, or not being
zero, when the counter transitions to or from zero the propagation occurs.
This reference counting propagation strategy ensures that profile changes are as efficient as possible, with almost no overhead compared to just tracking a boolean.
In addition to propagating the reclamation flag,
VmObject needs to perform
actions as part of a transition to update its pages.
When reclamation is disabled, all pages that would otherwise be reclaimable will be moved to a separate page queue. Being in this queue will both prevent reclamation, and provide a way to count the pages.
Similarly, when reclamation is enabled, pages need to be moved back to their default queues so that they become reclamation candidates again. It is an implementation detail what the age of these pages are considered to be when placed back into their default queues. On platforms without a hardware accessed flag it will not be possible to have any age information and so an age will have to be invented.
The implementation will be landed in a series of steps from the kernel implementation through successive layers of APIs.
The proposed design of changes to the kernel objects can be fully implemented to add support for being able to set memory priorities without changing any behavior. This would be done over multiple CLs and tested with in kernel unittests.
Kernel API changes
ZX_INFO_KMEM_STATS_EXTENDED query requires a fairly privileged system
resource and only has a handful of usages, all in tree. Therefore this query can
be modified in a single CL without a multi stage struct evolution.
Update the profile API and documentation and link the profile syscalls to the
previously implemented kernel support. The
zx_profile_create syscall and its
associated configuration struct,
zx_profile_info_t, are privileged system
calls and so can similarly be modified in single CLs.
Extend the implementation of the
ProfileProvider to support a way to specify
the memory priority in
ProfileProvider FIDL API to take arbitrary handles instead of just
threads. As this is a relaxation of the API it does not break any backwards
The relevant profiles for media related components will be changed to include a
ZX_PRIORITY_HIGH, and any media components will then
apply these profiles to their root VMARs in exactly the same way as they apply
them to their threads.
Once the profiles are confirmed to be working the existing hard coded kernel workarounds can be removed.
The proposed kernel design is very close to the temporary workaround used in the kernel, with the exception that the current temporary method cannot be un-applied, and is permanent to all involved VMARs and VMOs. Therefore switching from this method to profiles should be a complete no-op in terms of functionality and kernel behavior, including CPU and memory usage.
Usage of profiles, and hence the ability to set memory priorities, is gated by
needing the root job handle. Therefore any security considerations around
denial-of-service attacks are equivalent to existing denial-of-service
possibilities of the
The majority of the testing can be focused into unittests on the respective kernel and profile provider implementations, with some integration tests to validate the full usage path.
Documentation around the profile objects, related syscalls and FIDL protocols will be updated.
Property or equivalent on VMARs
Instead of using a profile object, directly setting a property, or equivalent,
on the VMAR object could be used to indicate its priority. This avoids
extensions to the scheduler objects and simplifies user space usage and does not
need involvement of the
With this approach there is not an inherent way to restrict the ability to set priorities. Although any component can already allocate arbitrary memory and perform denial-of-service, this is not necessarily ideal, and disabling reclamation is potentially tempting as a way to boost performance, potentially enabling a tragedy of the commons scenario.
Some form of bespoke access control could be designed to solve this problem, but now the benefits over leveraging profiles has been removed.
Inference via taint
An alternative to any direct marking by user space is to assume any deadline thread needs deadline memory accesses and to mark any memory it accesses as high priority. This requires no changes to user space, but either requires over marking, with every address space with at least one deadline thread having all of its mappings marked, or mappings / VMOs are marked once they have been used touched by a deadline thread.
Over marking is bad as not all deadline threads have the same memory latency requirements, and not to all of their address spaces. Lazily tainting means there is no way to pre-fault and so deadline threads may, in the worst case, always miss their deadlines once to fault items in.
With a lazy marking scheme it is also unclear how items get unmarked.
Applying memory_priority to thread
Instead of needing to apply profiles to VMARs, the
memory_priority field could
be interpreted when a profile is applied to a thread, and have the priority
applied to its root VMAR.
Although this simplifies both the
ProfileProvider protocol, as well as the
syscall interface, it prevents the option for users and the kernel to be more
efficient in the future. An efficient component could organize to have critical
latency sensitive data in one subregion of its address space, and non-critical
data in another region, and just apply a profile to the critical region. This
lets the non-critical data remain a consideration for reclamation, benefiting
Single priority field
priority field for thread priority could be reused instead of
introduction the additional
memory_priority field. This produces a theoretical
saving of having a simpler configuration structure, but now requires different
profile objects to be created if a different memory and scheduler priority are
Expanding ALWAYS_NEED hints
There is an existing API for controlling reclamation by using
DONT_NEED hints on a VMO or VMAR. Currently these are only given meaning to
pager backed VMOs, but could be extended to have meaning to anonymous VMOs.
Just extending the semantics to cover anonymous VMOs leaves some gaps:
ALWAYS_NEEDdoes not guarantee a lack of reclamation, as its a hint.
- Pages may still need access faults as age is still tracked.
- Does not disable page table reclamation.
- Only be applied to existing mappings.
Each of these limitations could be addressed by using an internal implementation similar to what was described in the main proposal, however the API itself has two fundamental problems.
The main proposal has a clear way to know when reclamation has been re-enabled
by having a profile applied to all VMARs (or just the root VMAR). With hinting
there is no way to remove
DONT_NEED is stronger than undoing
One motivation for the hinting API to be hints, and not promises, is due to a lack of access control with it being usable on any VMAR or VMO. Job policy or similar could be used to control what the hints do, but such a mechanism would also need to be designed.