RFC-0152: Improved OOM handling behavior

Status	Accepted
Areas	Drivers Kernel Power
Description	Improves out of memory handling by allowing vastly more userspace code to exit and creating a signal path for userspace to inform the kernel when userspace cleanup is complete.
Issues	66786
Gerrit change	622788
Authors	jmatt@google.com
Reviewers	dgilhooley@google.com frousseau@google.com maniscalco@google.com palmer@google.com pankhurst@google.com ppi@google.com pshickel@google.com rashaeqbal@google.com shayba@google.com surajmalhotra@google.com
Date submitted (year-month-day)	2021-12-20
Date reviewed (year-month-day)	2022-02-09

Summary

The goal of this RFC is to increase the reliability of capturing debugging data and minimize generation of unhelpful data during an out of memory ("OOM") event. This is accomplished by shutting down much more of userspace in an orderly fashion instead of the more basic mechanism used currently.

Motivation

Collecting data when the system runs out of memory helps identify the root cause of the event. Currently we collect some data, like memory reports, but we don't have guarantees about logs that improve our understanding of what was happening on the system. Collecting all relevant data is challenging today because few parts of the system know when the system runs out of memory. Additionally, analyzing available data following an OOM can be difficult. The way OOMs are handled often causes a cascade of secondary effects which adds noise to the data. This has repeatedly led to wasted time from confusion, lengthy discussions, and false conclusions.

Stakeholders

Facilitator: hjfreyer@google.com

Reviewers: dgilhooley@google.com, frousseau@google.com, maniscalco@google.com, palmer@google.com, pankhurst@google.com, ppi@google.com, pshickel@google.com, rashaeqbal@google.com, shayba@google.com, surajmalhotra@google.com

Consulted: adanis@google.com, alexlegg@google.com, geb@google.com, johngro@google.com, wez@google.com

Socialization: This idea was initially discussed over the course of various related bugs. After that, key stakeholders were consulted about the proposed solution.

Background

The process of handling an out of memory (OOM) condition on Fuchsia works as follows:

As free memory continues to decline, the memory watchdog detects that the system's free memory has reached the ZX_SYSTEM_EVENT_OUT_OF_MEMORY threshold. The kernel is now committed to rebooting the system.
The memory watchdog generates the ZX_SYSTEM_EVENT_OUT_OF_MEMORY signal for userspace and claims the halt token. The halt token's purpose is to prevent multiple things in the kernel from trying to perform reboots concurrently.
The memory watchdog sleeps for 8 seconds before rebooting. There is no way for userspace to indicate to the kernel that it is ready for reboot.
driver_manager observes the ZX_SYSTEM_EVENT_OUT_OF_MEMORY signal. When driver_manager starts, it subscribes to the signal with the zx_system_get_event syscall, which requires a handle to the root job.
driver_manager tells fshost to shut down and then tells all the drivers to stop. The system does this to try to minimize lost data and get hardware into a consistent state before reboot.
Outside driver_manager and fshost, userspace continues to run until the memory watchdog's timer expires. In this period it is typical to see a variety of crashes as programs try to access executable pages that were paged out and cannot be paged back in because the filesystems are gone. These crashes create a lot of noise, which may be recorded if something is listening to the serial log.

The current OOM handling doesn't use it, but there already exists a way to gracefully tear down userspace. This mechanism stops components, including filesystems and drivers, in reverse dependency order and gives all components a chance to clean up.
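To make "reverse dependency order" concrete, here is a minimal sketch that models the ordering only. It is not component_manager code: the component names and the `shutdown_order` helper are hypothetical, and the real mechanism also handles per-component timeouts and cleanup.

```python
from graphlib import TopologicalSorter


def shutdown_order(depends_on):
    """Return a stop order in which each component stops before its dependencies.

    depends_on maps a component name to the set of components it depends on.
    """
    # static_order() yields dependencies first, which is a valid start order.
    start_order = list(TopologicalSorter(depends_on).static_order())
    # Reversing it yields dependents first, which is a valid stop order.
    return list(reversed(start_order))


# Hypothetical topology, for illustration only.
topology = {
    "app": {"filesystem"},
    "filesystem": {"storage_driver"},
    "storage_driver": set(),
}

order = shutdown_order(topology)
print(order)  # ['app', 'filesystem', 'storage_driver']
```

In this model the filesystem and the driver beneath it stay up until everything that depends on them has stopped. That ordering is what lets the rest of userspace clean up without faulting on pages it can no longer read.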

memory_monitor is not involved in OOM handling, but is interested in low memory events. RFC-0091 created the ZX_SYSTEM_EVENT_IMMINENT_OUT_OF_MEMORY event which is generated by the kernel when system free memory reaches a level a bit higher than the ZX_SYSTEM_EVENT_OUT_OF_MEMORY threshold. memory_monitor observes this signal and tries to persist its memory profile data to storage. This is best effort because memory_monitor has no way to signal it handled the event and the system might hit the lower threshold at any time, possibly shutting down filesystems before memory_monitor can flush its data.

Design

The design is conceptually simple: it composes existing pieces of the system in a new way. At a high level, when the kernel detects that the system is out of memory, it records some state in memory, signals userspace about the OOM, and waits with a timeout for userspace to call back into the kernel. Userspace receives the signal, multiple userspace components play their already-established roles in an orderly shutdown, and finally userspace calls back into the kernel, which allows the kernel to store data in NVRAM and finish the reboot.

In greater detail the sequence is:

The kernel's memory watchdog detects the system is very low on memory and it
- Claims the halt token
- Signals userspace with ZX_SYSTEM_EVENT_OUT_OF_MEMORY
- Sets a timer
The userspace signal is received by pwrbtn-monitor and it talks to power_manager via the fuchsia.hardware.power.statecontrol/Admin.Reboot call.
power_manager then tells driver_manager, via the fuchsia.device.manager/SystemStateTransition.SetTerminationSystemState call, that before it exits it should move the system to the REBOOT_KERNEL_INITIATED power state.
power_manager tells component_manager to tear down the component topology via the fuchsia.sys2/SystemController.Shutdown call.
component_manager tears down the component topology in reverse dependency order, eventually telling driver_manager to exit.
driver_manager sees that it should execute the transition to the REBOOT_KERNEL_INITIATED state before exiting.
driver_manager calls zx_system_powerctl, passing ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT for the cmd value before it exits.
The kernel receives the syscall and signals the halt token.
The rest of the OOM handling runs so that appropriate information is written to NVRAM to be read after reboot.
The kernel reboots the system.
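The handshake at the heart of this sequence can be sketched as a toy model. This is not Zircon code: a `threading.Event` stands in for the halt token's event, one thread stands in for the entire userspace teardown, and `run_oom_sequence` and the delay values are hypothetical.

```python
import threading
import time


def run_oom_sequence(userspace_teardown_s, kernel_timeout_s):
    """Model the kernel/userspace handshake and report how the wait ended."""
    halt_event = threading.Event()  # stands in for the halt token's event

    def userspace():
        # Stands in for the whole userspace teardown, ending with the
        # acknowledgement that driver_manager sends via zx_system_powerctl.
        time.sleep(userspace_teardown_s)
        halt_event.set()

    # "Signal userspace", then wait for the acknowledgement with a bound.
    threading.Thread(target=userspace, daemon=True).start()
    acked = halt_event.wait(timeout=kernel_timeout_s)

    # In both cases the kernel would go on to write its crash data and reboot.
    return "acked" if acked else "timed_out"


print(run_oom_sequence(0.01, 1.0))  # acked
print(run_oom_sequence(1.0, 0.05))  # timed_out
```

In the model, as in the design, the timeout only bounds the wait. A prompt acknowledgement ends it early.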

Implementation

Controlling Kernel OOM Timeout

This RFC proposes adding a kernel boot-option to control the OOM timeout. Currently the OOM timeout is hard-coded at 8 seconds. Implementation of this RFC increases the amount of code executed to handle the OOM and therefore a larger timeout is warranted.
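As a rough illustration of such an option, the sketch below reads a timeout from a space-separated list of key=value arguments. The option name `example.oom.timeout-ms` and the parsing rules are assumptions made for this example; the RFC does not specify the option's name or format. Only the 8 second default comes from the text.

```python
DEFAULT_OOM_TIMEOUT_MS = 8000  # the currently hard-coded 8 seconds


def oom_timeout_ms(cmdline, option="example.oom.timeout-ms"):
    """Return the OOM timeout in milliseconds from a key=value command line.

    `option` is a hypothetical name used only for this sketch.
    """
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        # Accept only a well-formed, non-negative integer value.
        if key == option and sep and value.isdigit():
            return int(value)
    # Missing or malformed option: keep the existing behavior.
    return DEFAULT_OOM_TIMEOUT_MS


print(oom_timeout_ms("foo=1 example.oom.timeout-ms=20000"))  # 20000
print(oom_timeout_ms("foo=1"))                               # 8000
```

Falling back to the existing default keeps behavior unchanged on systems that do not set the option.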

Changes to the Halt Token

This RFC proposes changing the halt token to include a kernel event object instead of being simply an atomic boolean. The halt token would continue being an object which is irrevocably claimed. The halt token's event object will be used inside the kernel to coordinate reboots. The halt token would allow the event object to be signaled without the need to take the token.
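A minimal model of the proposed halt token might look like the following. It is a sketch in Python rather than kernel code, and the class and method names are illustrative only. It captures three properties from the paragraph above: the claim is irrevocable, the token carries an event, and the event can be signaled by a caller that does not hold the token.

```python
import threading


class HaltToken:
    """Toy model: an irrevocable claim plus an event that anyone may signal."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = False
        self._event = threading.Event()

    def take(self):
        """Claim the token. Only the first caller succeeds; there is no release."""
        with self._lock:
            if self._claimed:
                return False
            self._claimed = True
            return True

    def is_claimed(self):
        with self._lock:
            return self._claimed

    def signal(self):
        """Signal the event. The caller does not need to hold the token."""
        self._event.set()

    def wait(self, timeout_s):
        """Block until signaled or until timeout_s elapses; True if signaled."""
        return self._event.wait(timeout_s)


token = HaltToken()
print(token.take())      # True: first claim wins
print(token.take())      # False: already claimed, and never released
token.signal()           # allowed without holding the token
print(token.wait(0.01))  # True
```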

OOM Handling Flow

Currently, when the kernel detects an OOM it generates the ZX_SYSTEM_EVENT_OUT_OF_MEMORY signal for userspace, takes the halt token, and starts an 8 second timer. This RFC proposes that whenever the kernel wants to reboot the system but also give userspace a chance to do something, the kernel should first take the halt token, then notify userspace, and then wait for a bounded time. The wait bound should equal the maximum amount of time user mode is allowed to run in response to the event. This contrasts with the current implementation, whose timeout value is both the maximum and the minimum wait period. Once the halt token's event is signaled or the timeout is reached, the kernel will finish the reboot operation. In the case of an OOM, the kernel code that decides to reboot is in the memory watchdog, which finishes the OOM handling by creating the OOM crash log, storing it in NVRAM, and rebooting.

As mentioned previously this RFC proposes adding the ability to set the OOM timeout via a kernel boot-option.

Currently, the userspace ZX_SYSTEM_EVENT_OUT_OF_MEMORY handler is in driver_manager. This RFC suggests moving the handler to pwrbtn-monitor. pwrbtn-monitor is an existing component which is present on all builds and used to control power state on some hardware. Effectively we can think of the OOM as a software-generated power button press. As a result of pwrbtn-monitor's increased responsibility we propose renaming it to system-event-monitor.

When pwrbtn-monitor receives the signal it will call fuchsia.hardware.power.statecontrol/Admin.Reboot. We will add a new RebootReason for OOM which pwrbtn-monitor will pass to this call. The call initiates the existing user mode graceful shutdown path which deconstructs the component topology in reverse dependency order and concludes with driver_manager changing the hardware power state. This RFC proposes that when servicing an OOM driver_manager should always use a reboot path that uses the zx_system_powerctl syscall and passes the new value ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT as the value for the cmd argument. The existing path in driver_manager happens to use zx_system_powerctl. On x86 this happens when driver_manager delegates completion of the reboot to the board driver and the board driver makes the syscall. On arm64 driver_manager makes the syscall directly. The change in this RFC is formally requiring zx_system_powerctl in the reboot path.

When Zircon receives a zx_system_powerctl call with a cmd value of ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT, the handler code tries to signal the halt token. If the halt token is not claimed, signaling fails and the syscall returns an error. For other cmd values the handler code remains the same: it tries to take the halt token and, if it fails to do so, sleeps forever. In the case of a kernel-initiated reboot because of OOM, signaling the token allows the memory watchdog to complete its work and restart the system. If userspace does not call zx_system_powerctl before the memory watchdog's timeout expires, the watchdog will continue its shutdown procedure and reboot the system. This is unchanged from the current implementation.
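The handler's decision logic can be summarized in a small model. This is a sketch, not the Zircon implementation: the `Cmd` enum, the `Token` class, and the string results are stand-ins invented for the example. `Cmd.REBOOT` represents any pre-existing cmd value, and the model ignores concurrency.

```python
from enum import Enum, auto


class Cmd(Enum):
    REBOOT = auto()  # stands in for any pre-existing cmd value
    ACK_KERNEL_INITIATED_REBOOT = auto()  # models the new cmd value


class Token:
    """Minimal halt token state for this model (not thread-safe)."""

    def __init__(self):
        self.claimed = False
        self.signaled = False


def handle_powerctl(token, cmd):
    """Return what the modeled handler does for the given cmd."""
    if cmd is Cmd.ACK_KERNEL_INITIATED_REBOOT:
        if not token.claimed:
            return "error"  # no kernel-initiated reboot to acknowledge
        token.signaled = True  # lets the memory watchdog finish its work
        return "ok"
    # Existing cmd values: behavior is unchanged by the RFC.
    if token.claimed:
        return "sleep_forever"  # someone else already owns the reboot
    token.claimed = True
    return "proceed"


oom = Token()
oom.claimed = True  # the memory watchdog claimed the token on OOM
print(handle_powerctl(oom, Cmd.ACK_KERNEL_INITIATED_REBOOT))      # ok
print(handle_powerctl(Token(), Cmd.ACK_KERNEL_INITIATED_REBOOT))  # error
```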

Performance

We expect that OOM handling will take longer in some cases than it currently does. Currently during an OOM driver_manager stops the filesystems, stops the drivers, and then does nothing else. 8 seconds after the kernel detected the OOM, it reboots the system.

After implementation of this RFC much of userspace has a chance to react to the system's impending reboot. Specifically power_manager notifies listeners via the RebootWatcher protocol that a reboot is about to happen. power_manager has a 5-second timeout for clients to respond. After power_manager notifies reboot watchers it tells component_manager to tear down the component topology. The component topology is torn down in reverse dependency order, meaning not all components stop simultaneously. Components have a timeout period to stop. After implementation of this RFC more code has a chance to execute during an OOM and various timeouts may be hit. These factors make it possible for a restart to take longer than 8 seconds, although on most systems today this process is much shorter than 8 seconds.
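A back-of-the-envelope calculation shows how these timeouts can add up. Only the 5-second RebootWatcher timeout comes from the text. The per-component stop timeouts, and the assumption of a single dependency chain whose components stop one after another, are made up for this sketch.

```python
def worst_case_teardown_s(watcher_timeout_s, stop_timeouts_s):
    """Upper bound when every timeout along one dependency chain is hit."""
    return watcher_timeout_s + sum(stop_timeouts_s)


def kernel_wait_s(userspace_teardown_s, kernel_timeout_s):
    """The kernel proceeds on the acknowledgement or its timeout, whichever is first."""
    return min(userspace_teardown_s, kernel_timeout_s)


# 5 s is the RebootWatcher timeout; the three 1.5 s stop timeouts are
# hypothetical values for a hypothetical three-component chain.
worst = worst_case_teardown_s(5.0, [1.5, 1.5, 1.5])
print(worst)                      # 9.5
print(kernel_wait_s(worst, 8.0))  # 8.0: the kernel timeout cuts teardown short
print(kernel_wait_s(2.0, 8.0))    # 2.0: a prompt teardown reboots sooner
```

With these assumed numbers the worst case exceeds the current 8 second timeout, which is why a configurable, larger bound is proposed. The common case still finishes well before the bound.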

This RFC does not attempt to solve the problem of something allocating more memory after the kernel signals ZX_SYSTEM_EVENT_OUT_OF_MEMORY.

Backwards Compatibility

There should be no backward compatibility concerns because the changes can be made as soft transitions.

Security considerations

This RFC proposes moving usage of the zx_system_get_event syscall from driver_manager to pwrbtn-monitor. This syscall requires a handle to the root job, which is a highly sensitive handle. pwrbtn-monitor is a small, focused component which already has access to control system power state via the fuchsia.hardware.power.statecontrol/Admin capability. Adding access to the root job increases this component's privilege.

This RFC also proposes increasing the reboot timeout. This is only a concern if we think an OOM is an attack vector and a longer reboot timeout gives an attacker more time to execute an exploit.

Testing

Tests will be needed to validate the system reboots as expected both when userspace shuts down before the kernel timeout and when it doesn't. If these tests don't exist they will be added.

We may also want performance tests to profile how long userspace takes to tear down. These profiling tests could be used to inform the kernel timeout value.

Documentation

Various pieces of API documentation should be updated, but no new conceptual updates are required since this RFC is more a re-wiring of signals than something which fundamentally changes system behavior.

Drawbacks, alternatives, and unknowns

Alternative: Userspace handler location

There are a number of options for where we could put the userspace handler of the ZX_SYSTEM_EVENT_OUT_OF_MEMORY signal. It is desirable for the handler to be in the ZBI and present on all products so that the handling experience is available early and consistently. The main alternative candidates are power_manager, shutdown-shim, and component_manager. The primary reason to choose pwrbtn-monitor is that this responsibility fits its overall job of rebooting the system in response to an event; the OOM is simply a software-generated event instead of a hardware one.

Alternative: Report `NO_CRASH` for all userspace-initiated reboots

Currently Zircon writes data to persistent memory during an OOM and this RFC proposes continuing that practice. As an alternative we could write the same data to persistent RAM every time userspace calls zx_system_powerctl to reboot the system, regardless of whether the kernel detected an OOM and signaled userspace about the OOM. If we did this, then when the graceful shutdown of userspace following an OOM succeeded, the Feedback component would see a NO_CRASH reboot reason from Zircon. If the kernel timer expired and Zircon rebooted the system following the OOM, Feedback would see an OOM reboot reason from Zircon.

The downside of this approach is that problems in userspace handling of the OOM could result in the system knowing it rebooted, but not knowing it was caused by an OOM. In this case Feedback would see that there was a NO_CRASH reboot reason from Zircon, but find no persisted crash information on disk. In this case Feedback would still file a report.
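The difference between the RFC's proposal and this alternative can be written as a small decision function. This is an interpretation of the text, not Zircon or Feedback code: the function name and parameters are hypothetical, and only the NO_CRASH and OOM reason names come from the paragraphs above.

```python
def zircon_reboot_reason(acked_before_timeout, report_no_crash_on_userspace_reboot):
    """Reboot reason Feedback would read from Zircon after a kernel-detected OOM."""
    if report_no_crash_on_userspace_reboot and acked_before_timeout:
        # Alternative only: a reboot completed by userspace is recorded
        # like any other userspace-initiated reboot.
        return "NO_CRASH"
    # RFC proposal (either outcome), or the alternative when the kernel
    # timer expires first.
    return "OOM"


print(zircon_reboot_reason(True, report_no_crash_on_userspace_reboot=True))   # NO_CRASH
print(zircon_reboot_reason(True, report_no_crash_on_userspace_reboot=False))  # OOM
```

The first line of output is the case the downside refers to: after a successful graceful shutdown, the only record that an OOM occurred would be whatever userspace managed to persist.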

Alternative: Allow compatible requests to `zx_system_powerctl` during an OOM reboot

This RFC proposes that once a kernel-initiated OOM reboot starts, only two things can complete the reboot: userspace calls zx_system_powerctl with the cmd value ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT, or the kernel's reboot timer expires. As an alternative we could allow any compatible call to zx_system_powerctl to complete the reboot. A compatible call is one that also reboots the system, regardless of the cmd value passed. This would resolve race situations where userspace independently decided to reboot the system before the kernel signaled the OOM condition. The kernel could also write different reboot reasons based on the cmd value it actually received in the zx_system_powerctl call. This would enable auditing of whether the system typically follows the expected reboot path.

Alternative: Signal userspace handling completion via a kernel object

This RFC proposes completing the kernel-initiated OOM reboot by calling zx_system_powerctl with a specific cmd value. Instead, the kernel-to-userspace signaling mechanism could be changed such that userspace receives a channel or event object. Userspace could then send a message or assert/deassert a signal to indicate that the reboot can proceed. This alternative has the advantage that the component which completes the reboot would not need access to the root resource, which is required to call zx_system_powerctl. This alternative requires more work because it is a more substantial change to how kernel/userspace signaling happens today.

Drawback: Certain races continue to be possible

Today it is possible that the kernel detects an OOM condition, claims the halt token, and then userspace calls zx_system_powerctl because userspace previously decided to reboot. In this situation the call to zx_system_powerctl will fail. driver_manager then exits. component_manager continues tearing down the component topology, eventually arriving at power_manager, which it kills. Killing power_manager crashes the root job because power_manager is set as critical to the root job. Normally Zircon would reboot the system when the root job dies, but does not in this case since MemoryWatchdog holds the halt token. Instead the system eventually hits MemoryWatchdog's timeout and it reboots the system.

This RFC allows similar races. Userspace could be in the process of tearing down in preparation for a reboot when an OOM condition is hit. It could be that pwrbtn-monitor has already exited, which means nothing in userspace is around to observe the OOM signal from the kernel. Alternatively, pwrbtn-monitor might be running, but its attempt to reboot the system will fail because power_manager only allows one in-flight request to shut down or reboot the system. The earlier teardown request will eventually get to driver_manager. driver_manager's call to zx_system_powerctl will fail as previously described and the root job will crash, but the system doesn't reboot until MemoryWatchdog's timeout expires. How bad is this race? It is fairly benign because userspace is going to clean itself up. By the time the root job crashes, userspace has already cleaned up as much of itself as it ever expects to. The biggest downside is that the reboot will be less prompt.

Another possible race occurs if a reboot-on-terminate component exits at just the wrong time. Components can configure themselves so that if they exit, component_manager will reboot the system. component_manager reboots the system by calling fuchsia.hardware.power.statecontrol/Admin.Reboot. component_manager does not try to reboot the system if a reboot-on-terminate component exits after component_manager is told to tear down the component topology. If such a reboot-on-terminate component exits after a component calls Admin.Reboot, but before power_manager tells component_manager to tear down the topology, the system will crash because component_manager panics if its call to power_manager fails. This RFC does not propose a fix to this race. The race may result in unpredictable behavior because we don't know how much of the system is running when component_manager panics.

Unknown: Impact on chance of completely running out of memory

The net impact of this RFC on the risk of the system completely running out of memory during OOM handling is uncertain. The proposed changes could reduce the chance of the system completely running out of memory because most userspace components do not listen for an exit signal and are killed as soon as their clients have exited. This should start freeing memory quickly. We expect that userspace components which do observe the exit signal should exit promptly, also freeing memory. The proposed changes could increase the chance of running out of memory because some systems could take longer to shut down than the existing 8 second timeout, giving some components more time to allocate memory. This RFC also delays shutting down filesystems, which today exit quickly. Filesystem caches typically take the form of discardable memory managed by the kernel. It is unclear whether filesystems running longer will have a significant negative impact on memory.