RFC-0152: Improved OOM handling behavior
Improves out of memory handling by allowing vastly more userspace code to exit and creating a signal path for userspace to inform the kernel when userspace cleanup is complete.
Date submitted (year-month-day): 2021-12-20
Date reviewed (year-month-day): 2022-02-09
The goal of this RFC is to increase the reliability of capturing debugging data and minimize generation of unhelpful data during an out of memory ("OOM") event. This is accomplished by shutting down much more of userspace in an orderly fashion instead of the more basic mechanism used currently.
Collecting data when the system runs out of memory helps identify the root cause of the event. Currently we collect some data, like memory reports, but we don't have guarantees about logs that improve our understanding of what was happening on the system. Collecting all relevant data is challenging today because few parts of the system know when the system runs out of memory. Additionally, analyzing available data following an OOM can be difficult. The way OOMs are handled often causes a cascade of secondary effects which adds noise to the data. This has repeatedly led to wasted time from confusion, lengthy discussions, and false conclusions.
Reviewers: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com
Consulted: firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org
Socialization: This idea was initially discussed over the course of various related bugs. After that, key stakeholders were consulted about the proposed solution.
The process of handling an out of memory (OOM) condition on Fuchsia works as follows:
- The memory watchdog detects the system's free memory is at the ZX_SYSTEM_EVENT_OUT_OF_MEMORY threshold if free memory continues to decline. The kernel is now committed to rebooting the system.
- The memory watchdog generates the ZX_SYSTEM_EVENT_OUT_OF_MEMORY signal for userspace and claims the halt token. The halt token's purpose is to prevent multiple things in the kernel from trying to perform reboots concurrently.
- Memory watchdog sleeps for 8 seconds before rebooting. There is no way for userspace to indicate to the kernel it is ready for reboot.
When driver_manager starts, it subscribes to the signal with the zx_system_get_event syscall, which requires a handle to the root job. When the signal arrives, driver_manager tells fshost to shut down and then tells all the drivers to stop. The system does this to try to minimize lost data and get hardware into a consistent state before reboot.
After the drivers and fshost stop, userspace continues to run until the memory watchdog's timer expires. In this period it is typical to see a variety of crashes as programs try to access executable pages that were paged out and cannot be paged in because filesystems are gone. These crashes create a lot of noise which may be recorded if something is listening to the serial log.
The current OOM handling doesn't use it, but there already exists a way to gracefully tear down userspace. This mechanism stops components, including filesystems and drivers, in reverse dependency order and gives all components a chance to clean up.
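As an illustration of what "reverse dependency order" means, here is a minimal Python sketch that derives a stop order from a dependency map. The component names and their dependencies are hypothetical and do not reflect the real Fuchsia topology, and this is not a description of how component_manager is implemented.

```python
from graphlib import TopologicalSorter


def shutdown_order(dependencies):
    """Return a stop order in which every component stops before its dependencies.

    `dependencies` maps each component to the set of components it depends on.
    """
    # static_order() lists dependencies before their dependents (a start order).
    start_order = list(TopologicalSorter(dependencies).static_order())
    # Stopping in the reverse of that order lets dependents clean up first.
    return list(reversed(start_order))


# Hypothetical three-component chain: app -> filesystem -> driver.
deps = {"app": {"filesystem"}, "filesystem": {"driver"}, "driver": set()}
order = shutdown_order(deps)
print(order)
```

In this sketch the application stops first and the driver last, so each component can still reach the services it depends on while it cleans up.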
memory_monitor is not involved in OOM handling, but is
interested in low memory events. RFC-0091 created the
ZX_SYSTEM_EVENT_IMMINENT_OUT_OF_MEMORY event, which is generated by the kernel
when system free memory reaches a level a bit higher than the
ZX_SYSTEM_EVENT_OUT_OF_MEMORY threshold. memory_monitor observes this signal
and tries to persist its memory profile data to storage. This is best effort
because memory_monitor has no way to signal it handled the event and the
system might hit the lower threshold at any time, possibly shutting down
before memory_monitor can flush its data.
The design is conceptually quite simple and uses a lot of existing pieces of the system, but composed in a new way. At a high level the strategy is as follows. When the kernel detects an out of memory condition it records some state in memory, signals userspace about the OOM, and waits with a timeout for userspace to call back into the kernel. Userspace receives the signal, multiple userspace components play their already-established roles in orderly shutdown, and finally userspace calls back into the kernel, which allows the kernel to store data in NVRAM and finish the reboot.
In greater detail the sequence is:
- The kernel's memory watchdog detects the system is very low on memory and it
  - Claims the halt token
  - Signals userspace with ZX_SYSTEM_EVENT_OUT_OF_MEMORY
  - Sets a timer
- The userspace signal is received by pwrbtn-monitor and it talks to power_manager to request a reboot.
- power_manager tells driver_manager, via the fuchsia.device.manager/SystemStateTransition.SetTerminationSystemState call, that before it exits it should move the system to the REBOOT_KERNEL_INITIATED state.
- power_manager tells component_manager to tear down the component topology.
- component_manager tears down the component topology in reverse dependency order, eventually telling driver_manager to exit.
- driver_manager sees that it should execute the transition to the REBOOT_KERNEL_INITIATED state before exiting.
- driver_manager calls zx_system_powerctl with the ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT cmd value before it exits.
- The kernel receives the syscall and signals the halt token.
- The rest of the OOM handling runs so that appropriate information is written to NVRAM to be read after reboot.
- The kernel reboots the system.
Controlling Kernel OOM Timeout
This RFC proposes adding a kernel boot-option to control the OOM timeout. Currently the OOM timeout is hard-coded at 8 seconds. Implementation of this RFC increases the amount of code executed to handle the OOM and therefore a larger timeout is warranted.
Changes to the Halt Token
This RFC proposes changing the halt token to include a kernel event object instead of being simply an atomic boolean. The halt token would continue being an object which is irrevocably claimed. The halt token's event object will be used inside the kernel to coordinate reboots. The halt token would allow the event object to be signaled without the need to take the token.
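To make the proposed halt token semantics concrete, the following is a minimal Python model, not kernel code. The class and method names are hypothetical. It assumes three properties described in this RFC: the token can be claimed once and never released, its event can be signaled without taking the token, and signaling fails when the token has not been claimed.

```python
import threading


class HaltToken:
    """Hypothetical model of the proposed halt token: a claim-once flag plus an event."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = False
        self._event = threading.Event()

    def take(self):
        """Claim the token. Only the first caller succeeds; the claim is never released."""
        with self._lock:
            if self._claimed:
                return False
            self._claimed = True
            return True

    def signal(self):
        """Signal the event without taking the token. Fails if the token is unclaimed."""
        with self._lock:
            if not self._claimed:
                return False
        self._event.set()
        return True

    def wait(self, timeout_seconds):
        """Wait for the event. Returns True if signaled, False if the timeout passed."""
        return self._event.wait(timeout_seconds)


token = HaltToken()
print(token.take(), token.take(), token.signal())
```

The design point this sketch tries to capture is that the party holding the token (for example the memory watchdog) and the party signaling it (the syscall handler) can be different.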
OOM Handling Flow
Currently, when the kernel detects an OOM it generates the
ZX_SYSTEM_EVENT_OUT_OF_MEMORY signal for userspace, takes the halt token, and
starts an 8 second timer. This RFC proposes that whenever the kernel wants to
reboot the system, but also give userspace a chance to do something, the kernel
should first take the halt token, then notify userspace, and wait some bounded
time. The wait bound should be equal to the maximum amount of time user mode is
allowed to run in response to the event. This is in contrast to the current
implementation, whose timeout value is both the maximum and minimum wait
period. Once the halt token's event is signaled or the timeout is reached, the
kernel will finish the reboot operation. In the case of an OOM the kernel code
that decides to reboot is in the memory watchdog which finishes the OOM handling
by creating the OOM crash log, storing it in NVRAM, and rebooting.
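The difference between a fixed sleep and a bounded wait can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, a threading.Event stands in for the halt token's event, and the short durations are chosen so the example runs quickly rather than to reflect real timeouts.

```python
import threading
import time


def wait_for_userspace(ack_event, timeout_seconds):
    """Wait until userspace acknowledges or the timeout passes.

    Returns (acked, elapsed_seconds). The timeout is only an upper bound:
    an early acknowledgement ends the wait immediately.
    """
    start = time.monotonic()
    acked = ack_event.wait(timeout_seconds)
    return acked, time.monotonic() - start


# Case 1: userspace finishes its cleanup quickly and acknowledges.
ack = threading.Event()
threading.Timer(0.05, ack.set).start()
acked, elapsed = wait_for_userspace(ack, timeout_seconds=2.0)

# Case 2: userspace never acknowledges, so the wait ends at the timeout.
timed_out_acked, timed_out_elapsed = wait_for_userspace(threading.Event(), 0.05)
```

In the first case the wait returns long before the 2 second bound. In the second case it returns only once the bound is reached, which matches the unchanged fallback behavior.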
As mentioned previously this RFC proposes adding the ability to set the OOM timeout via a kernel boot-option.
Currently, the userspace ZX_SYSTEM_EVENT_OUT_OF_MEMORY handler is in
driver_manager. This RFC suggests moving the handler to pwrbtn-monitor.
pwrbtn-monitor is an existing component which is present on all builds and
used to control power state on some hardware. Effectively we can think of the
OOM as a software-generated power button press. As a result of
pwrbtn-monitor's increased responsibility we propose renaming it.
When pwrbtn-monitor receives the signal it will call
fuchsia.hardware.power.statecontrol/Admin.Reboot. We will add a new
RebootReason for OOM which pwrbtn-monitor will pass to this call. The call
initiates the existing user mode graceful shutdown path which deconstructs the
component topology in reverse dependency order and concludes with
driver_manager changing the hardware power state. This RFC proposes that when
servicing an OOM driver_manager should always use a reboot path that uses the
zx_system_powerctl syscall and passes the new value
ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT as the value for the cmd
argument. The existing path in driver_manager happens to use
zx_system_powerctl. On x86 this happens when driver_manager delegates
completion of the reboot to the board driver and the board driver makes the
syscall. On arm64 driver_manager makes the syscall directly. The change in this
RFC is formally requiring the use of zx_system_powerctl in the reboot path.
When Zircon receives a zx_system_powerctl call with a cmd value of
ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT the handler code tries to
signal the halt token. If the halt token is not claimed, signaling fails and the
syscall returns an error. For other cmd values the handler code remains the
same: it tries to take the halt token and, if it fails to do so, sleeps
forever. In the case of a kernel-initiated reboot because of OOM, signaling the
token will allow the memory watchdog to complete its work and restart the
system. If userspace does not call zx_system_powerctl before the memory
watchdog's timeout expires, the watchdog will continue its shutdown procedure
and reboot the system. This is unchanged from the current implementation.
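The handler behavior described above can be summarized as a small decision function. The Python below is a simplified model under stated assumptions, not the Zircon implementation: the state class, the function name, and the returned labels are hypothetical, and "sleeps_forever" is returned as a label instead of actually blocking.

```python
# Only the ACK name comes from the RFC text; the other cmd is a placeholder.
ACK_KERNEL_INITIATED_REBOOT = "ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT"
OTHER_CMD = "hypothetical_other_cmd"


class HaltTokenState:
    """Hypothetical model: whether the token is claimed and whether its event is signaled."""

    def __init__(self):
        self.claimed = False
        self.signaled = False


def handle_powerctl(token, cmd):
    """Return a label describing what the handler would do for this cmd."""
    if cmd == ACK_KERNEL_INITIATED_REBOOT:
        if not token.claimed:
            # No kernel-initiated reboot is in progress, so there is nothing to acknowledge.
            return "error"
        # Signaling lets the token holder finish before its timeout expires.
        token.signaled = True
        return "acknowledged"
    # Any other cmd keeps the existing behavior: take the token or never return.
    if token.claimed:
        return "sleeps_forever"
    token.claimed = True
    return "proceeds"


token = HaltTokenState()
token.claimed = True  # the memory watchdog has claimed the token for an OOM reboot
print(handle_powerctl(token, ACK_KERNEL_INITIATED_REBOOT))
```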
We expect that OOM handling will take longer in some cases than it currently
does. Currently during an OOM
driver_manager stops the filesystems, stops the
drivers, and then does nothing else. 8 seconds after the kernel detected the
OOM, it reboots the system.
After implementation of this RFC much of userspace has a chance to react to the
system's impending reboot. Specifically, power_manager notifies listeners via the
RebootWatcher protocol that a reboot is about to happen. power_manager
has a 5-second timeout for clients to respond. After power_manager notifies
reboot watchers it tells component_manager to tear down the component
topology. The component topology is torn down in reverse dependency order,
meaning not all components stop simultaneously. Components have a timeout period
to stop. After implementation of this RFC more code has a chance to execute
during an OOM and various timeouts may be hit. These factors make it possible
for a restart to take longer than 8 seconds, although on most systems today this
process is much shorter than 8 seconds.
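A rough way to reason about the timeout is to add up the waits that could occur in sequence. The Python sketch below does this under loud assumptions: the per-component stop timeouts are hypothetical values, and it assumes the watcher timeout and every listed component timeout are hit one after another, which is a pessimistic model rather than measured behavior. Only the 5 second watcher timeout and the 8 second kernel timeout come from this RFC.

```python
REBOOT_WATCHER_TIMEOUT_S = 5.0  # from the text: power_manager's timeout for watchers
CURRENT_KERNEL_TIMEOUT_S = 8.0  # from the text: today's hard-coded OOM timeout


def worst_case_shutdown_seconds(component_stop_timeouts_s):
    """Upper bound if the watcher wait and each listed component wait run out in sequence."""
    return REBOOT_WATCHER_TIMEOUT_S + sum(component_stop_timeouts_s)


# Hypothetical stop timeouts for three components that stop one after another.
hypothetical_timeouts = [1.0, 1.0, 2.0]
worst = worst_case_shutdown_seconds(hypothetical_timeouts)
exceeds_current_timeout = worst > CURRENT_KERNEL_TIMEOUT_S
```

With these made-up inputs the bound is 9 seconds, which exceeds the current 8 second timeout. That is the kind of case that motivates making the kernel timeout configurable.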
This RFC does not attempt to solve the problem of something allocating more
memory after the kernel signals ZX_SYSTEM_EVENT_OUT_OF_MEMORY.
There should be no backward compatibility concerns; the changes can be made as soft transitions.
This RFC proposes moving usage of the zx_system_get_event syscall from
driver_manager to pwrbtn-monitor. This syscall requires a handle to the root
job, which is a highly sensitive handle. pwrbtn-monitor is a small, focused
component which already has access to control system power state via the
fuchsia.hardware.power.statecontrol/Admin capability. Adding access to the
root job increases this component's privilege.
This RFC also proposes increasing the reboot timeout. This is only a concern if we think an OOM is an attack vector and a longer reboot timeout gives an attacker more time to execute an exploit.
Tests will be needed to validate the system reboots as expected both when userspace shuts down before the kernel timeout and when it doesn't. If these tests don't exist they will be added.
We may also want performance tests to profile how long userspace takes to tear down. These profiling tests could be used to inform the kernel timeout value.
Various pieces of API documentation should be updated, but no new conceptual updates are required since this RFC is more a re-wiring of signals than something which fundamentally changes system behavior.
Drawbacks, alternatives, and unknowns
Alternative: Userspace handler location
There are a number of options for where we could put the userspace handler of
ZX_SYSTEM_EVENT_OUT_OF_MEMORY. It is desirable for the handler to be in
the ZBI and present on all products so that the handling experience is available
early and consistently. The main alternate candidates are power_manager,
shutdown-shim, and component_manager. The primary reason to choose
pwrbtn-monitor is that this responsibility fits with its overall job of
rebooting the system in response to an event, the OOM just being a software
generated event instead of a hardware one.
NO_CRASH for all userspace-initiated reboots
Currently Zircon writes data to persistent memory during an OOM and this RFC
proposes continuing that practice. As an alternative we could write the same
data to persistent RAM every time userspace calls zx_system_powerctl
to reboot the system, regardless of whether the kernel detected an OOM and
signaled userspace about the OOM. If we did this then if the graceful shutdown
of userspace following an OOM was successful the Feedback component would see a
NO_CRASH reboot reason from Zircon. If the kernel timer expired and Zircon
rebooted the system following the OOM then Feedback would see an OOM reboot
reason from Zircon.
The downside of this approach is that problems in userspace handling of the OOM could result in the system knowing it rebooted, but not knowing it was caused by an OOM. In this case Feedback would see that there was a NO_CRASH reboot reason from Zircon, but find no persisted crash information on disk. In this case Feedback would still file a report.
Alternative: Allow compatible requests to zx_system_powerctl during an OOM reboot
This RFC proposes that once a kernel-initiated OOM reboot starts, only two
things complete the reboot: userspace calls zx_system_powerctl with the
ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT cmd value or the kernel's reboot
timer expires. As an alternative we could allow any compatible call to
zx_system_powerctl to complete the reboot. A compatible call is one that also
reboots the system, regardless of the cmd value passed. This would resolve
race situations where userspace independently decided to reboot the system
before the kernel signaled the OOM condition. Possibly the kernel would write
different reboot reasons based on the cmd value it actually received in the
zx_system_powerctl call. This would enable auditing of whether the system is
typically following the expected reboot path.
Alternative: Signal userspace handling completion via a kernel object
This RFC proposes completing the kernel-initiated OOM reboot by calling
zx_system_powerctl with a specific cmd value. Instead the kernel-to-userspace
signaling mechanism could be changed such that userspace receives
a channel or event object. Userspace could then send a message or
assert/deassert a signal to indicate reboot can proceed. This alternative has
the advantage that the component which completes the reboot would not need
access to the root resource. Access to the root resource is required to call
zx_system_powerctl. This alternative requires more work because it is a more
substantial change to how kernel/userspace signaling happens today.
Drawback: Certain races continue to be possible
Today it is possible that the kernel detects an OOM condition, claims the halt
token, and then userspace calls zx_system_powerctl because userspace
previously decided to reboot. In this situation the call to
zx_system_powerctl fails. driver_manager then exits. component_manager
continues tearing down the component topology, eventually arriving at
power_manager, which it stops. Stopping power_manager crashes the root job
because power_manager is set as critical to the root job. Normally Zircon would
reboot the system when the root job dies, but does not in this case since
MemoryWatchdog holds the halt token. Instead the system eventually hits
MemoryWatchdog's timeout and it reboots the system.
This RFC allows similar races. Userspace could be in the process of tearing down
in preparation for a reboot when an OOM condition is hit. It could be that
pwrbtn-monitor already exited, which means nothing in userspace is around to
observe the OOM signal from the kernel. Alternately, pwrbtn-monitor might be
running, but its attempt to reboot the system will fail because power_manager
only allows one in-flight request to shut down or reboot the system. The earlier
teardown request will eventually get to driver_manager, whose call to
zx_system_powerctl will fail as previously described and the root
job will crash, but the system doesn't reboot until MemoryWatchdog's timeout
expires. How bad is this race? It is pretty benign because userspace is going to
clean itself up. By the time the root job crashes userspace has already cleaned
up as much of itself as it ever expects to. The biggest downside is that the
reboot will be less prompt.
Another possible race is if a reboot-on-terminate component exits
at just the wrong time. Components can configure themselves so that if they exit
component_manager will reboot the system. component_manager reboots the
system by calling power_manager.
component_manager does not try to reboot the system if a reboot-on-terminate
component exits after component_manager is told to tear down the component
topology. If such a reboot-on-terminate component exits after a component calls
Admin.Reboot, but before component_manager is told to
tear down the topology, the system will crash because component_manager crashes
if its call to power_manager fails. This RFC does not propose a fix to this
race. The race may result in unpredictable behavior because we don't know how
much of the system is running when component_manager crashes.
Unknown: Impact on chance of completely running out of memory
The net impact of this RFC on the risk of the system completely running out of memory during OOM handling is uncertain. The proposed changes could reduce that risk because most userspace components do not listen for an exit signal and are killed as soon as their clients have exited, which should start freeing memory quickly. We expect that userspace components which do observe the exit signal will exit promptly, also freeing memory. The proposed changes could also increase the risk because some systems could take longer to shut down than the existing 8-second timeout, giving some components more time to allocate memory. This RFC also delays shutting down filesystems, which today exit quickly. Filesystem caches typically take the form of discardable memory managed by the kernel. It is unclear whether filesystems running longer will have a significant negative impact on memory.