RFC-0269: Measuring Memory Stalls

Status: Accepted
Areas: Kernel
Description: Detecting and reporting memory stalls to userspace
Date submitted (year-month-day): 2024-10-15
Date reviewed (year-month-day): 2025-04-10

Problem Statement

Zircon does not currently expose information about transient stalls caused by memory pressure, which Starnix needs in order to implement the PSI interface (/proc/pressure/memory).

This RFC addresses the issue of synthesizing compatible measurements and notifying observers when a given threshold is crossed.

Summary

This document proposes a mechanism to detect stalls caused by memory pressure and notify userspace. The mechanism will be based on measuring time spent in pressure-induced activities that could have otherwise been spent doing useful work.

Motivation

Linux-based operating systems (both Android and systemd-based distributions) rely on a designated userspace daemon that is notified of, and acts on, delays ("stalls") induced by memory pressure. The action taken by such daemons usually amounts to killing entire processes or services in response.

In the context of Starnix, whose goal is to faithfully implement the Linux API (RFC-0082), we need to offer the same stall detection API, which on Linux is called Pressure Stall Information (PSI).

Linux's PSI quantifies pressure-induced memory stalls by timing the delays caused by the additional actions that are necessary to free up memory, and those stats can be accessed through the /proc/pressure/memory virtual file. This RFC proposes the approach of measuring pressure-induced delays similarly to Linux.

This document's focus is only on Starnix and how to implement Linux's PSI interface in order to reach feature parity. Whether other Fuchsia components outside of Starnix might also be able to use it is out of scope and left for a future phase.

Stakeholders

Facilitator: lindkvist@google.com

Reviewers:

  • adanis@google.com
  • eieio@google.com
  • maniscalco@google.com
  • rashaeqbal@google.com

Consulted:

  • davemoore@google.com
  • lindkvist@google.com
  • wez@google.com

Socialization:

This RFC has been discussed with the Zircon team over email, chat and meetings.

Requirements

Given the focus on Starnix, the requirements are driven by the interface to be offered to Linux programs.

On Linux, reading the /proc/pressure/memory file returns the following information:

  • Two monotonically increasing timers, labeled some and full, counted in microseconds since boot: according to the official documentation, the some time grows when at least one thread is memory-stalled, and the full time grows when all otherwise-runnable threads are memory-stalled. Note that the some condition can occur even if some other thread is able to make progress, whereas the full condition implies that no thread at all is making progress. As a consequence, the some time is always greater than or equal to the full time.
  • The averages of the above two timers' growth rates over the last 10, 60 and 300 seconds, computed as the ratio of each timer's increment to the elapsed monotonic time and therefore ranging between 0.00 and 1.00.

In addition to reading those stats, Linux also makes it possible to be notified when a certain stall level is reached: a program opens that file, writes which timer to observe (either some or full), a threshold and an observation window (both in microseconds), and then polls that file descriptor. poll() will then report a POLLPRI event when the selected timer has grown by more than the given threshold over the configured observation window.
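As an illustration, the following minimal sketch registers such a trigger and waits for a notification, following the PSI interface documented by Linux; the 150 ms threshold over a 1-second window is an arbitrary example.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
  // Open the global memory pressure file.
  int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
  if (fd < 0) {
    perror("open");
    return 1;
  }

  // Ask to be notified when the "some" timer grows by more than 150 ms
  // within any 1-second window (both values are in microseconds).
  const char trigger[] = "some 150000 1000000";
  if (write(fd, trigger, strlen(trigger) + 1) < 0) {
    perror("write");
    return 1;
  }

  // Wait for the kernel to report that the threshold has been crossed.
  struct pollfd pfd;
  pfd.fd = fd;
  pfd.events = POLLPRI;
  pfd.revents = 0;
  if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLPRI)) {
    printf("memory stall threshold crossed\n");
  }

  close(fd);
  return 0;
}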

It's worth noting that, on Linux, the same measuring and notification mechanism can also be enabled at cgroup-level by means of the memory.pressure file, which offers the same interface. While the design proposed in this RFC is only concerned with implementing the global /proc/pressure/memory file at this time, the internal measurement mechanisms described in this RFC can be reused to implement a similar feature, too, should the need arise (see Drawbacks, alternatives, and unknowns for more details).

To minimize divergence from Linux in the magnitude of the measured memory stalls, we intend to initially adapt the same formulas to Zircon and leave further adjustments or deviations for a later tuning phase. The proposed adaptation of those formulas for Zircon is detailed in the following section.

Design

The goal of this RFC is to precisely define the notion of a stall in Zircon, how stall measurements are generated and aggregated, how to expose this data to userspace and how to let userspace be notified when the system stalls more than a given level.

In short, Zircon will measure memory stalls, aggregate them and expose two system-wide values (as a new zx_object_get_info topic) and a notification interface (through a new syscall), both gated by a new resource type. These two new values (called some and full stall time) will be expressed in nanoseconds, and they will grow monotonically and continuously, at a fraction (between zero and one) of the rate of the monotonic clock that depends on the current stall level.

As an example, let's suppose that, after being up for 300000 nanoseconds without experiencing any stall, the system detects a 25% stall for 100000 nanoseconds and then a 50% stall for 200000 extra nanoseconds. The corresponding reported value will be 0% * 300000 + 25% * 100000 + 50% * 200000 = 125000 nanoseconds.

Identifying memory stalls

In principle, any kind of delay induced by memory pressure that prevents a thread from otherwise doing useful work is to be regarded as a stall for that thread.

In practice, given the current codebase, only time spent waiting for memory to be freed by the dedicated kernel thread (in other words, time spent while AnonymousPageRequester::WaitOnRequest is on the call stack) will be regarded as a memory stall. This happens when a new memory allocation is attempted while the number of free pages is below the delay threshold, which the kernel computes close to the Out Of Memory level.

The above operations involve tracking the stalling thread through many different scheduler states. Furthermore, the stall tracking mechanism aims at being generic enough to make it easy to mark other code sections as memory stalls in the future. In order to avoid making the current scheduler more complex by adding new ad hoc states, this RFC proposes to delimit sections of code to be regarded as a stall with a guard object (ScopedMemoryStall) and classify threads for the purpose of tracking stalls as either IGNORED, PROGRESSING or STALLING, according to both their current scheduler state and whether they are executing in a guarded region or not.

| Scheduler state                       | If inside guard | If outside guard |
|---------------------------------------|-----------------|------------------|
| INITIAL                               | n/a             | IGNORED          |
| READY                                 | STALLING        | IGNORED          |
| RUNNING                               | STALLING        | PROGRESSING      |
| BLOCKED and THREAD_BLOCKED_READ_LOCK  | STALLING        | IGNORED          |
| SLEEPING                              | STALLING        | IGNORED          |
| SUSPENDED                             | STALLING        | IGNORED          |
| DEATH                                 | n/a             | IGNORED          |

In order to eliminate noise in measurements resulting from internal memory book-keeping activities in the kernel, non-usermode threads will always be regarded as IGNORED.

It's worth noting that, in the above model, threads are considered as STALLING if they are either running inside a guard, or were running inside a guard before blocking or being preempted.
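To illustrate the guard-based approach, here is a minimal, self-contained sketch; ScopedMemoryStall is the guard proposed by this RFC, but the thread-local flag and the WaitForFreePages() function below are simplified, hypothetical stand-ins for the actual scheduler-level bookkeeping in Zircon.

#include <cstdio>

// Hypothetical per-thread flag standing in for the real scheduler-level
// bookkeeping; in Zircon this would feed the per-CPU StallAccumulator.
thread_local bool current_thread_in_stall = false;

// Simplified sketch of the RAII guard proposed by this RFC: any code that
// runs (or blocks) while a ScopedMemoryStall is alive on the stack is
// classified as STALLING rather than PROGRESSING.
class ScopedMemoryStall {
 public:
  ScopedMemoryStall() { current_thread_in_stall = true; }
  ~ScopedMemoryStall() { current_thread_in_stall = false; }

  // The guard is tied to a single stack frame.
  ScopedMemoryStall(const ScopedMemoryStall&) = delete;
  ScopedMemoryStall& operator=(const ScopedMemoryStall&) = delete;
};

// Example use, mirroring the AnonymousPageRequester::WaitOnRequest case:
// time spent waiting for free pages inside this scope counts as a stall.
void WaitForFreePages() {
  ScopedMemoryStall stall_guard;
  // ... block until the dedicated kernel thread frees enough memory ...
  std::printf("in stall section: %d\n", current_thread_in_stall);
}

int main() {
  WaitForFreePages();
  std::printf("after stall section: %d\n", current_thread_in_stall);
  return 0;
}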

Growth rate of the some and full timers

The stall timers will not be merely the sum of each thread's STALLING time. Rather, they will result from a continuous, per-CPU evaluation of the number of STALLING and PROGRESSING threads, aggregated periodically at system level. This matches the model used by Linux's PSI implementation. Internally, two levels of measurement will be added to Zircon.

The first level, local to each CPU and tracked by per-CPU StallAccumulator objects, will measure time spent in each of the following three non-mutually-exclusive conditions (using the monotonic clock as the time base):

  • full: At least one thread tied to this CPU is in STALLING state and none is PROGRESSING.
  • some: At least one thread tied to this CPU is in STALLING state.
  • active: At least one PROGRESSING or STALLING thread is tied to this CPU.

The second level of measurements, implemented by the StallAggregator singleton, will update the stats at system level by periodically querying all StallAccumulators. In particular, it will run a background thread that periodically wakes up, averages the some and full stall time increments observed by each StallAccumulator, weighted by the respective active time, and increments the global some and full values accordingly.
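The following simplified, self-contained sketch illustrates one such aggregation pass under the weighting scheme described above; the struct layouts and the exact formula are illustrative assumptions, not the actual Zircon implementation.

#include <cstdint>
#include <cstdio>
#include <vector>

// Per-CPU increments observed since the last aggregation pass
// (all values in nanoseconds). Illustrative layout only.
struct StallAccumulatorDelta {
  uint64_t some_ns;    // Time with at least one STALLING thread on this CPU.
  uint64_t full_ns;    // Time with STALLING threads and none PROGRESSING.
  uint64_t active_ns;  // Time with any PROGRESSING or STALLING thread.
};

// Global, monotonically growing totals exposed via ZX_INFO_MEMORY_STALL.
struct StallTotals {
  uint64_t some_ns = 0;
  uint64_t full_ns = 0;
};

// One aggregation pass: average the per-CPU increments, weighting each CPU
// by how long it was active, and fold the result into the global totals.
void Aggregate(const std::vector<StallAccumulatorDelta>& per_cpu,
               StallTotals* totals) {
  uint64_t weighted_some = 0, weighted_full = 0, total_active = 0;
  for (const auto& cpu : per_cpu) {
    weighted_some += cpu.some_ns * cpu.active_ns;
    weighted_full += cpu.full_ns * cpu.active_ns;
    total_active += cpu.active_ns;
  }
  if (total_active == 0) {
    return;  // Fully idle interval: the timers do not grow.
  }
  totals->some_ns += weighted_some / total_active;
  totals->full_ns += weighted_full / total_active;
}

int main() {
  // Two CPUs: one stalled 25% ("some") of an active 1 ms window, one idle.
  std::vector<StallAccumulatorDelta> deltas = {
      {250'000, 100'000, 1'000'000},
      {0, 0, 0},
  };
  StallTotals totals;
  Aggregate(deltas, &totals);
  std::printf("some=%lu ns, full=%lu ns\n",
              static_cast<unsigned long>(totals.some_ns),
              static_cast<unsigned long>(totals.full_ns));
  return 0;
}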

In addition, the StallAggregator will notify registered watchers when the resulting growth rate exceeds the threshold that they set (see The new zx_system_watch_memory_stall syscall for details).

The Stall resource

A new type of kernel resource, called ZX_RSRC_SYSTEM_STALL_BASE, will be introduced.

Since this RFC is only concerned with measuring the global stall level (the equivalent of /proc/pressure/memory), it is enough to use a single resource to control access to global stall measurements.

A handle to this resource will be created at boot and served by component manager through the fuchsia.kernel.StallResource protocol (as a built-in capability), which will simply return duplicated handles to it. Controlling who this capability is routed to will control who has access to stall measurements.

The new ZX_INFO_MEMORY_STALL topic

A new ZX_INFO_MEMORY_STALL topic (on the stall resource) will be added, exposing the two following fields:

typedef struct zx_info_memory_stall_t {
    // Total monotonic time spent with at least one memory-stalled thread.
    zx_duration_mono_t stalled_time_some;

    // Total monotonic time spent with all threads memory-stalled.
    zx_duration_mono_t stalled_time_full;
} zx_info_memory_stall_t;

This topic will be queried by Starnix to synthesize the /proc/pressure/memory file. Starnix will also sample it at regular intervals to internally compute the rolling averages over the last 10, 60 and 300 seconds.
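A minimal sketch of querying the proposed topic, assuming that a stall_resource handle has already been obtained through fuchsia.kernel.StallResource; the topic constant and struct are the ones introduced by this RFC, not part of the current Zircon API.

#include <stddef.h>
#include <zircon/syscalls.h>
#include <zircon/syscalls/object.h>

// Reads the global stall timers using the proposed ZX_INFO_MEMORY_STALL
// topic. `stall_resource` is assumed to have been obtained through the
// fuchsia.kernel.StallResource protocol.
zx_status_t read_memory_stall(zx_handle_t stall_resource,
                              zx_info_memory_stall_t* out_info) {
  return zx_object_get_info(stall_resource, ZX_INFO_MEMORY_STALL, out_info,
                            sizeof(*out_info), /*actual=*/NULL,
                            /*avail=*/NULL);
}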

The new zx_system_watch_memory_stall syscall

Like the existing zx_system_get_event syscall, the new zx_system_watch_memory_stall syscall will return an event handle, on which the kernel will assert/deassert ZX_EVENT_SIGNALED according to the current growth rate of the selected stall timer.

The full syscall prototype will be:

zx_status_t zx_system_watch_memory_stall(zx_handle_t stall_resource,
                                         zx_system_memory_stall_type_t kind,
                                         zx_duration_mono_t threshold,
                                         zx_duration_mono_t window,
                                         zx_handle_t* event);

Arguments:

  • stall_resource: A handle to the stall resource.
  • kind: Either ZX_SYSTEM_MEMORY_STALL_SOME or ZX_SYSTEM_MEMORY_STALL_FULL, to select the timer to be observed.
  • threshold: Minimum stall time that will trigger the signal (nanoseconds).
  • window: Duration of the observation window (nanoseconds).
  • event: Filled by the kernel in return, a handle to an event that will be asserted (ZX_EVENT_SIGNALED) if, in the course of the last observation window, the stall timer selected by kind has increased by at least threshold.

The returned event will stay asserted for as long as the trigger condition applies, and it will be deasserted by the kernel when it no longer holds.

Possible errors:

  • ZX_ERR_BAD_HANDLE: stall_resource is not a valid handle.
  • ZX_ERR_WRONG_TYPE: stall_resource's kind is not ZX_RSRC_KIND_SYSTEM.
  • ZX_ERR_OUT_OF_RANGE: stall_resource is not the stall resource.
  • ZX_ERR_INVALID_ARGS: invalid kind, threshold or window (see note below).

The caller's job policy must allow ZX_POL_NEW_EVENT.

Note that, even though the API defines threshold and window in nanoseconds, Zircon is allowed not to use the full precision of the requested values. Furthermore, it must hold that 0 < threshold <= window and, in order to limit the amount of necessary book-keeping memory, window cannot be longer than 10 seconds (Linux imposes the same constraint).
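Putting the proposed pieces together, the following hedged sketch registers a watcher for the some timer and blocks until the signal is asserted; the syscall and its constants are the new API defined above, and the 150 ms / 1 s values are arbitrary examples.

#include <zircon/syscalls.h>
#include <zircon/time.h>

// Asks to be signaled when the "some" stall timer grows by more than 150 ms
// within any 1-second window, then blocks until that happens.
zx_status_t wait_for_memory_stall(zx_handle_t stall_resource) {
  zx_handle_t event;
  zx_status_t status = zx_system_watch_memory_stall(
      stall_resource, ZX_SYSTEM_MEMORY_STALL_SOME,
      /*threshold=*/ZX_MSEC(150), /*window=*/ZX_SEC(1), &event);
  if (status != ZX_OK) {
    return status;
  }

  // ZX_EVENT_SIGNALED stays asserted for as long as the condition holds.
  zx_signals_t observed;
  status = zx_object_wait_one(event, ZX_EVENT_SIGNALED, ZX_TIME_INFINITE,
                              &observed);
  zx_handle_close(event);
  return status;
}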

Implementation

The Zircon changes will be split into several CLs, roughly in this order:

  • Add infrastructure to detect stalls at per-CPU level (StallAccumulator).
  • Make time spent blocked in AnonymousPageRequester::WaitOnRequest count as a memory stall.
  • Periodically aggregate all per-CPU values into global metrics (StallAggregator) in a low-priority kernel thread.
  • Serve the new stall resource as a component manager built-in capability (fuchsia.kernel.StallResource).
  • Implement the new ZX_INFO_MEMORY_STALL topic.
  • Add infrastructure to detect when a stall threshold is crossed.
  • Expose it through the zx_system_watch_memory_stall syscall.

Once the Zircon changes are made, Starnix will be modified to implement /proc/pressure/memory on top of them as follows:

  • The total some and full values will be read directly from ZX_INFO_MEMORY_STALL.
  • The averages will be computed within Starnix by polling ZX_INFO_MEMORY_STALL approximately every second, keeping the last 300 samples in a circular queue and computing the rolling averages from it (see the sketch after this list).
  • PSI notifications will be implemented on top of zx_system_watch_memory_stall, with the addition of a rate-limiter (implemented within Starnix itself) to match Linux's PSI behavior of delivering at most one notification per window.
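Starnix itself is written in Rust; purely to illustrate the rolling-average arithmetic, here is a hedged, self-contained sketch of the circular-buffer approach (the 301-entry buffer keeps 300 seconds of history plus the current sample; all names are hypothetical).

#include <array>
#include <cstdint>
#include <cstdio>

// One sample of a stall timer is recorded per second; each average is the
// timer's growth over the window divided by the window length.
class RollingStallAverages {
 public:
  void AddSample(uint64_t stall_total_ns) {
    samples_[next_ % samples_.size()] = stall_total_ns;
    ++next_;
  }

  // Average growth rate (0.0 to 1.0) over the last `window_seconds` seconds.
  double Average(size_t window_seconds) const {
    if (next_ < window_seconds + 1) {
      return 0.0;  // Not enough history yet.
    }
    uint64_t newest = samples_[(next_ - 1) % samples_.size()];
    uint64_t oldest = samples_[(next_ - 1 - window_seconds) % samples_.size()];
    return static_cast<double>(newest - oldest) /
           (static_cast<double>(window_seconds) * 1e9);
  }

 private:
  std::array<uint64_t, 301> samples_{};  // 300-second history plus current.
  size_t next_ = 0;
};

int main() {
  RollingStallAverages averages;
  // Simulate one-second samples with a constant 5% stall rate (50 ms/s).
  for (int i = 0; i <= 20; ++i) {
    averages.AddSample(static_cast<uint64_t>(i) * 50'000'000);
  }
  std::printf("avg10=%.2f\n", averages.Average(10));  // Expected: 0.05
  return 0;
}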

Performance

As described in the previous sections, the proposed implementation will maintain stall measurements in per-CPU data structures and periodically aggregate them from a dedicated kernel thread. Performance testing has not found any appreciable regression due to stall time book-keeping.

A possible performance risk is having too many subscribed zx_system_watch_memory_stall observers for the Zircon kernel to maintain and notify when stall thresholds are crossed. If needed, this can be mitigated by proxying access through a dedicated user-space component that aggregates and serves multiple clients through a single Zircon subscription. Similarly to Zircon (see The new zx_system_watch_memory_stall syscall above), such a proxy would be allowed to drop precision from the requested values in order to maximize aggregation opportunities.

Ergonomics

The proposed Zircon API makes implementing PSI in Starnix almost entirely straightforward, with the exception of Linux's rate-limiting of notifications. We could, in theory, implement rate-limiting in the Zircon kernel too and end up with a strobe signaling scheme similar to the one described in RFC-0237. However, in order to keep zx_system_watch_memory_stall's behavior easy to describe, and for consistency with the existing zx_system_get_event signals, we chose to expose a simple level-triggered signal and delegate the complexity of rate-limiting notifications to user space.

Backwards Compatibility

This RFC only introduces new interfaces (a new resource type, the ZX_INFO_MEMORY_STALL topic and the zx_system_watch_memory_stall syscall) that do not bring any breaking API/ABI change.

Security considerations

Giving user space access to new performance measurements opens the possibility for creating unwanted side channels. However, in this case, the risk is mitigated by the fact that the capability to access the new measurements will only be routed to either trusted Fuchsia components, or to trusted Linux processes within Starnix (according to the Starnix container's security model).

The risk of unbounded kernel memory allocations is eliminated by giving the kernel the freedom to drop precision from the requested threshold and window values (enabling internal quantization of stats), and by limiting the range of permissible values.

Privacy considerations

This change should have no impact on privacy.

Testing

Two kinds of Zircon tests will be added:

  • Userspace tests (core-tests) that will watch memory stall events, generate memory pressure, query the info topic and verify that the events are actually triggered.
  • In-kernel unit tests that will simulate the effects of ScopedMemoryStall guards and verify that the measured timings match the expectations.

Documentation

Documentation will be added next to the relevant syscall surface (the new ZX_INFO_MEMORY_STALL topic and the new zx_system_watch_memory_stall syscall) as well as the fuchsia.kernel.StallResource FIDL protocol.

The list of conditions that are considered to be a stall will be regarded as an implementation detail and will not be included in the documentation, to leave the possibility of further expansion and tuning of such conditions without breaking the API.

Drawbacks, alternatives, and unknowns

Syscall surface

The design in this RFC is closely related to implementing global PSI in Starnix, and the syscall API is modeled after its requirements.

The option to expose the new features through a new "stall gauge" kernel object type (instead of a new resource type) was considered, as a way to ease the future implementation of cgroups' memory.pressure hierarchical stats. However, given that at this time there is no need for such an extension, it was decided to keep the API changes simple and avoid overplanning.

Instead of making the zx_system_watch_memory_stall syscall return an event instance, it was also considered to return a new specialized "stall monitor" object type with dedicated signals (e.g. ZX_STALL_MONITOR_ABOVE_THRESHOLD). While more explicit in its intent and in its correspondence to the internal kernel implementation, this approach would have required a new kernel object type for no other reason than this. It was rejected because the existing event object type's API was deemed expressive enough.

Lastly, many new API names refer to "stalls" instead of "memory stalls" to leave room for future expansion to other types of stall measurements too (such as CPU and I/O stalls).

Stall measurement and aggregation

Within the proposed measurement scheme, the option to transitively consider threads waiting on resources held by other stalling threads as stalling themselves ("stall inheritance") was discussed but rejected, at least for an initial implementation, due to the extra complexity and the lack of baseline data to evaluate its benefits.

Time spent decompressing ZRAM was initially planned to be regarded as a memory stall too, as a sign of thrashing (caused by re-accessing an anonymous page that had been compressed in the past). However, due to factors such as 1) Zircon proactively compressing inactive pages in the background even when there is plenty of free memory and 2) decompression only happening when pages are needed again, this kind of event provided, at best, a delayed indication and, at worst, a spurious notification. Furthermore, it was experimentally found that the stall time contributions originating from ZRAM decompression are currently several orders of magnitude lower than those deriving from AnonymousPageRequester::WaitOnRequest. For the above reasons, it was decided not to consider ZRAM decompressions as memory stalls for now and to leave them out of the initial implementation.

A different aggregation scheme was also considered, in which the kernel exposes raw stall time counters for each thread through shared memory in a lockless data structure. It would then be up to user space to periodically perform the aggregation with product-specific logic, decoupling the aggregation policy from the kernel API. The drawbacks of this approach would be worse precision (due to its sampling nature), a more complex design relying on shared memory (where it is easy to get subtle details wrong), and the need to wake user space much more frequently.

A simpler way of measuring stalls was considered too: just measuring time spent with the evictor running. However, this approach would lend itself neither to distinguishing between some and full stall times nor to identifying the process that was affected by the stall.

Relation to existing memory pressure events

Zircon already exposes information on the current memory pressure level through five mutually-exclusive events (Normal, Warning, Critical, Imminent Out Of Memory and Out Of Memory), which are asserted by the kernel and whose handles can be obtained via the zx_system_get_event syscall. While the exact conditions that trigger them are not stated explicitly in Zircon's API contract, in practice the trigger condition implemented in Zircon consists of comparing the current amount of free memory against the five corresponding ranges.

As an alternative to this RFC's proposal of timing the duration of stalls, the option to synthesize fake delays from the existing pressure events has been evaluated too. However, it has been rejected on the basis that the expected dynamics of Linux's PSI signal are very different: while the transitions between Zircon's existing pressure events are slow (and, in some cases, subject to artificial delays to avoid too frequent changes), Linux's PSI timers are expected to update in near real-time. The growth rate of the PSI timers, in particular, needs to change fast enough to immediately generate an impulse to trigger registered watchers when the respective threshold is exceeded, and then immediately stop triggering them as soon as it falls back below.

In addition, with Linux's PSI timers, the growth rate of an idle system should be zero. With the current Zircon events, instead, there is no guarantee that the actions that userspace is asked to take in response will eventually free enough memory to go back to the Normal state.

Conclusion

Preliminary tests on an Android image showed that measuring stalls as described in this RFC generated notification events comparable to those generated by corresponding workloads on Linux. However, since Zircon and Linux are two different kernels with very different VMM subsystems, this is not guaranteed for all edge cases, and refinements might be needed in the future, e.g. by expanding or modifying the definition of what Zircon considers to be a stall condition.

Prior art and references