RFC-0152: Improved OOM handling behavior | |
---|---|
Status | Accepted |
Areas | |
Description | Improves out of memory handling by allowing vastly more userspace code to exit and creating a signal path for userspace to inform the kernel when userspace cleanup is complete. |
Issues | |
Gerrit change | |
Authors | |
Reviewers | |
Date submitted (year-month-day) | 2021-12-20 |
Date reviewed (year-month-day) | 2022-02-09 |
Summary
The goal of this RFC is to increase the reliability of capturing debugging data and minimize generation of unhelpful data during an out of memory ("OOM") event. This is accomplished by shutting down much more of userspace in an orderly fashion instead of the more basic mechanism used currently.
Motivation
Collecting data when the system runs out of memory helps identify the root cause of the event. Currently we collect some data, like memory reports, but we don't have guarantees about logs that improve our understanding of what was happening on the system. Collecting all relevant data is challenging today because few parts of the system know when the system runs out of memory. Additionally, analyzing available data following an OOM can be difficult. The way OOMs are handled often causes a cascade of secondary effects which adds noise to the data. This has repeatedly led to wasted time from confusion, lengthy discussions, and false conclusions.
Stakeholders
Facilitator: hjfreyer@google.com
Reviewers: dgilhooley@google.com, frousseau@google.com, maniscalco@google.com, palmer@google.com, pankhurst@google.com, ppi@google.com, pshickel@google.com, rashaeqbal@google.com, shayba@google.com, surajmalhotra@google.com
Consulted: adanis@google.com, alexlegg@google.com, geb@google.com, johngro@google.com, wez@google.com
Socialization: This idea was initially discussed over the course of various related bugs. After that, key stakeholders were consulted about the proposed solution.
Background
The process of handling an out of memory (OOM) condition on Fuchsia works as follows:
- As free memory continues to decline, the memory watchdog detects that the system's free memory has reached the `ZX_SYSTEM_EVENT_OUT_OF_MEMORY` threshold. The kernel is now committed to rebooting the system.
- The memory watchdog generates the `ZX_SYSTEM_EVENT_OUT_OF_MEMORY` signal for userspace and claims the halt token. The halt token's purpose is to prevent multiple things in the kernel from trying to perform reboots concurrently.
- The memory watchdog sleeps for 8 seconds before rebooting. There is no way for userspace to indicate to the kernel it is ready for reboot.
- `driver_manager` observes the `ZX_SYSTEM_EVENT_OUT_OF_MEMORY` signal. When `driver_manager` starts it subscribes to the signal with the `zx_system_get_event` syscall, which requires a handle to the root job.
- `driver_manager` tells `fshost` to shut down and then tells all the drivers to stop. The system does this to try to minimize lost data and get hardware into a consistent state before reboot.
- Outside `driver_manager` and `fshost`, userspace continues to run until the memory watchdog's timer expires. In this period it is typical to see a variety of crashes as programs try to access executable pages that were paged out and cannot be paged in because filesystems are gone. These crashes create a lot of noise which may be recorded if something is listening to the serial log.
The current OOM handling doesn't use it, but there already exists a way to gracefully tear down userspace. This mechanism stops components, including filesystems and drivers, in reverse dependency order and gives all components a chance to clean up.
`memory_monitor` is not involved in OOM handling, but is interested in low memory events. RFC-0091 created the `ZX_SYSTEM_EVENT_IMMINENT_OUT_OF_MEMORY` event which is generated by the kernel when system free memory reaches a level a bit higher than the `ZX_SYSTEM_EVENT_OUT_OF_MEMORY` threshold. `memory_monitor` observes this signal and tries to persist its memory profile data to storage. This is best effort because `memory_monitor` has no way to signal it handled the event and the system might hit the lower threshold at any time, possibly shutting down filesystems before `memory_monitor` can flush its data.
Design
The design is conceptually simple and composes existing pieces of the system in a new way. At a high level, when the kernel detects an out of memory condition it records some state in memory, signals userspace about the OOM, and waits with a timeout for userspace to call back into the kernel. Userspace receives the signal and multiple userspace components play their already-established roles in an orderly shutdown. Finally, userspace calls back into the kernel, which allows the kernel to store data in NVRAM and finish the reboot.
In greater detail the sequence is:
- The kernel's memory watchdog detects the system is very low on memory and it
  - Claims the halt token
  - Signals userspace with `ZX_SYSTEM_EVENT_OUT_OF_MEMORY`
  - Sets a timer
- The userspace signal is received by `pwrbtn-monitor` and it talks to `power_manager` via the `fuchsia.hardware.power.statecontrol/Admin.Reboot` call.
- `power_manager` then tells `driver_manager`, via the `fuchsia.device.manager/SystemStateTransition.SetTerminationSystemState` call, that before it exits it should move the system to the `REBOOT_KERNEL_INITIATED` power state.
- `power_manager` tells `component_manager` to tear down the component topology by calling `fuchsia.sys2/SystemController.Shutdown`.
- `component_manager` tears down the component topology in reverse dependency order, eventually telling `driver_manager` to exit.
- `driver_manager` sees that it should execute the transition to the `REBOOT_KERNEL_INITIATED` state before exiting.
- `driver_manager` calls `zx_system_powerctl`, passing `ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT` for the `cmd` value before it exits.
- The kernel receives the syscall and signals the halt token.
- The rest of the OOM handling runs so that appropriate information is written to NVRAM to be read after reboot.
- The kernel reboots the system.
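To make "reverse dependency order" in the sequence above concrete, here is a minimal Python model. It is only an illustration and not how `component_manager` is implemented: the `shutdown_order` helper and the three-node topology are hypothetical, and real topologies are far larger.

```python
from graphlib import TopologicalSorter


def shutdown_order(depends_on):
    """Return an order in which to stop components.

    depends_on maps a component name to the names it depends on. Every
    component appears in the result before the things it depends on, so
    nothing loses a dependency while it is still running.
    """
    # static_order() yields dependencies before their dependents, which is
    # a start-up order. Reversing it yields a teardown order.
    startup = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(startup))


# Hypothetical, heavily simplified topology for illustration only.
topology = {
    "app": ["fshost"],
    "fshost": ["driver_manager"],
    "driver_manager": [],
}
order = shutdown_order(topology)
```

In this toy topology `driver_manager` is stopped last, which matches the step where `component_manager` eventually tells `driver_manager` to exit.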
Implementation
Controlling Kernel OOM Timeout
This RFC proposes adding a kernel boot-option to control the OOM timeout. Currently the OOM timeout is hard-coded at 8 seconds. Implementation of this RFC increases the amount of code executed to handle the OOM and therefore a larger timeout is warranted.
Changes to the Halt Token
This RFC proposes changing the halt token to include a kernel event object instead of being simply an atomic boolean. The halt token would continue being an object which is irrevocably claimed. The halt token's event object will be used inside the kernel to coordinate reboots. The halt token would allow the event object to be signaled without the need to take the token.
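The following Python sketch models the proposed halt token semantics. It is a toy model under stated assumptions, not kernel code: the class and method names are invented for illustration, and the rule that signaling fails while the token is unclaimed is taken from the OOM handling flow described later in this RFC.

```python
import threading


class HaltToken:
    """Toy model: an irrevocable claim paired with an event object."""

    def __init__(self):
        self._lock = threading.Lock()
        self._claimed = False
        self._event = threading.Event()

    def take(self):
        """Claim the token. Only the first caller succeeds; there is no release."""
        with self._lock:
            if self._claimed:
                return False
            self._claimed = True
            return True

    def signal(self):
        """Signal the event without taking the token.

        Fails (returns False) if nobody has claimed the token yet.
        """
        with self._lock:
            if not self._claimed:
                return False
            self._event.set()
            return True

    def wait(self, timeout_s):
        """Block until signaled or until timeout_s elapses. True if signaled."""
        return self._event.wait(timeout_s)
```

The point of the split is that the claimant (for example the memory watchdog) can wait on the event while a different party signals it, without the token changing hands.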
OOM Handling Flow
Currently, when the kernel detects an OOM it generates the `ZX_SYSTEM_EVENT_OUT_OF_MEMORY` signal for userspace, takes the halt token, and starts an 8 second timer. This RFC proposes that whenever the kernel wants to reboot the system, but also give userspace a chance to do something, the kernel should first take the halt token, then notify userspace, and wait some bounded time. The wait bound should be equal to the maximum amount of time user mode is allowed to run in response to the event. This is in contrast to the current implementation, whose timeout value is both the maximum and minimum wait period. Once the halt token's event is signaled or the timeout is reached, the kernel will finish the reboot operation. In the case of an OOM, the kernel code that decides to reboot is in the memory watchdog, which finishes the OOM handling by creating the OOM crash log, storing it in NVRAM, and rebooting.
As mentioned previously this RFC proposes adding the ability to set the OOM timeout via a kernel boot-option.
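The difference between a fixed sleep and a bounded wait can be sketched as follows. This is an illustrative model only: a `threading.Event` stands in for the halt token's event, `watchdog_wait` is a hypothetical name, and the durations are shortened so the example runs quickly.

```python
import threading


def watchdog_wait(halt_event, timeout_s):
    """Wait for userspace to acknowledge, but never longer than timeout_s.

    Returns "acked" if the event was signaled before the deadline and
    "timed_out" otherwise. In both cases the caller goes on to finish
    the reboot.
    """
    return "acked" if halt_event.wait(timeout_s) else "timed_out"


# Userspace acknowledges quickly, so the wait ends well before the bound.
acked_event = threading.Event()
threading.Timer(0.05, acked_event.set).start()
early = watchdog_wait(acked_event, timeout_s=2.0)

# Userspace never acknowledges, so the wait ends when the bound is reached.
silent_event = threading.Event()
late = watchdog_wait(silent_event, timeout_s=0.1)
```

With a fixed sleep both cases would take the full timeout. With the bounded wait, the timeout only matters when userspace fails to call back.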
Currently, the userspace `ZX_SYSTEM_EVENT_OUT_OF_MEMORY` handler is in `driver_manager`. This RFC suggests moving the handler to `pwrbtn-monitor`. `pwrbtn-monitor` is an existing component which is present on all builds and used to control power state on some hardware. Effectively we can think of the OOM as a software-generated power button press. As a result of `pwrbtn-monitor`'s increased responsibility we propose renaming it to `system-event-monitor`.
When `pwrbtn-monitor` receives the signal it will call `fuchsia.hardware.power.statecontrol/Admin.Reboot`. We will add a new `RebootReason` for OOM which `pwrbtn-monitor` will pass to this call. The call initiates the existing user mode graceful shutdown path which deconstructs the component topology in reverse dependency order and concludes with `driver_manager` changing the hardware power state. This RFC proposes that when servicing an OOM `driver_manager` should always use a reboot path that uses the `zx_system_powerctl` syscall and passes the new value `ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT` as the value for the `cmd` argument. The existing path in `driver_manager` happens to use `zx_system_powerctl`. On x86 this happens when `driver_manager` delegates completion of the reboot to the board driver and the board driver makes the syscall. On arm64 `driver_manager` makes the syscall directly. The change in this RFC is formally requiring `zx_system_powerctl` in the reboot path.
When Zircon receives the `zx_system_powerctl` call with a `cmd` value of `ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT`, the handler code tries to signal the halt token. If the halt token is not claimed, signaling fails and the syscall returns an error. For other `cmd` values the handler code remains the same: it tries to take the halt token and, if it fails to do so, sleeps forever. In the case of a kernel-initiated reboot because of OOM, signaling the token will allow the memory watchdog to complete its work and restart the system. If userspace does not call `zx_system_powerctl` before the memory watchdog's timeout expires, the watchdog will continue its shutdown procedure and reboot the system. This is unchanged from the current implementation.
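The handler behavior described above reduces to a small decision table. The Python sketch below restates it. The string constants are stand-ins for the real `cmd` values, and the function name and returned labels are hypothetical, chosen only to name each outcome.

```python
# Stand-ins for cmd values; the real values are kernel constants.
ACK_KERNEL_INITIATED_REBOOT = "ack_kernel_initiated_reboot"  # the new cmd
REBOOT = "reboot"  # represents any pre-existing cmd value


def powerctl_action(cmd, token_claimed):
    """Return what the handler does for a given cmd and halt token state."""
    if cmd == ACK_KERNEL_INITIATED_REBOOT:
        # Signaling only succeeds while a kernel-initiated reboot is in
        # flight, that is, while the halt token is already claimed.
        return "signal_halt_token" if token_claimed else "return_error"
    # Other cmd values keep today's behavior: take the token if possible,
    # otherwise sleep forever because another reboot is already in progress.
    return "sleep_forever" if token_claimed else "take_token_and_proceed"
```

For example, `powerctl_action(ACK_KERNEL_INITIATED_REBOOT, token_claimed=True)` is the expected path during an OOM: the memory watchdog holds the token, and the acknowledgement releases it from its bounded wait.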
Performance
We expect that OOM handling will take longer in some cases than it currently does. Currently during an OOM `driver_manager` stops the filesystems, stops the drivers, and then does nothing else. Eight seconds after detecting the OOM, the kernel reboots the system.
After implementation of this RFC much of userspace has a chance to react to the system's impending reboot. Specifically `power_manager` notifies listeners via the `RebootWatcher` protocol that a reboot is about to happen. `power_manager` has a 5-second timeout for clients to respond. After `power_manager` notifies reboot watchers it tells `component_manager` to tear down the component topology. The component topology is torn down in reverse dependency order, meaning not all components stop simultaneously. Components have a timeout period to stop. After implementation of this RFC more code has a chance to execute during an OOM and various timeouts may be hit. These factors make it possible for a restart to take longer than 8 seconds, although on most systems today this process is much shorter than 8 seconds.
This RFC does not attempt to solve the problem of something allocating more memory after the kernel signals `ZX_SYSTEM_EVENT_OUT_OF_MEMORY`.
Backwards Compatibility
There should be no backward compatibility concerns; the changes can be made as soft transitions.
Security considerations
This RFC proposes moving usage of the `zx_system_get_event` syscall from `driver_manager` to `pwrbtn-monitor`. This syscall requires a handle to the root job, which is a highly sensitive handle. `pwrbtn-monitor` is a small, focused component which already has access to control system power state via the `fuchsia.hardware.power.statecontrol/Admin` capability. Adding access to the root job increases this component's privilege.
This RFC also proposes increasing the reboot timeout. This is only a concern if we think an OOM is an attack vector and a longer reboot timeout gives an attacker more time to execute an exploit.
Testing
Tests will be needed to validate the system reboots as expected both when userspace shuts down before the kernel timeout and when it doesn't. If these tests don't exist they will be added.
We may also want performance tests to profile how long userspace takes to tear down. These profiling tests could be used to inform the kernel timeout value.
Documentation
Various pieces of API documentation should be updated, but no new conceptual updates are required since this RFC is more a re-wiring of signals than something which fundamentally changes system behavior.
Drawbacks, alternatives, and unknowns
Alternative: Userspace handler location
There are a number of options for where we could put the userspace handler of the `ZX_SYSTEM_EVENT_OUT_OF_MEMORY` signal. It is desirable for the handler to be in the ZBI and present on all products so that the handling experience is available early and consistently. The main alternate candidates are `power_manager`, `shutdown-shim`, and `component_manager`. The primary reason to choose `pwrbtn-monitor` is that this responsibility fits with its overall job of rebooting the system in response to an event, the OOM just being a software-generated event instead of a hardware one.
Alternative: Report `NO_CRASH` for all userspace-initiated reboots
Currently Zircon writes data to persistent memory during an OOM and this RFC proposes continuing that practice. As an alternative we could write the same data to persistent RAM every time userspace calls `zx_system_powerctl` to reboot the system, regardless of whether the kernel detected an OOM and signaled userspace about the OOM. If we did this, then if the graceful shutdown of userspace following an OOM was successful, the Feedback component would see a `NO_CRASH` reboot reason from Zircon. If the kernel timer expired and Zircon rebooted the system following the OOM, then Feedback would see an OOM reboot reason from Zircon.
The downside of this approach is that problems in userspace handling of the OOM could result in the system knowing it rebooted, but not knowing it was caused by an OOM. In this case Feedback would see a NO_CRASH reboot reason from Zircon but find no persisted crash information on disk. Feedback would still file a report.
Alternative: Allow compatible requests to `zx_system_powerctl` during an OOM reboot
This RFC proposes that once a kernel-initiated OOM reboot starts, only two things complete the reboot: userspace calls `zx_system_powerctl` with the `cmd` value of `ZX_SYSTEM_POWERCTL_ACK_KERNEL_INITIATED_REBOOT` or the kernel's reboot timer expires. As an alternative we could allow any compatible call to `zx_system_powerctl` to complete the reboot. A compatible call is one that also reboots the system, regardless of the `cmd` value passed. This would resolve race situations where userspace independently decided to reboot the system before the kernel signaled the OOM condition. Possibly the kernel would write different reboot reasons based on the `cmd` value it actually received in the `zx_system_powerctl` call. This would enable auditing of whether the system is typically following the expected reboot path.
Alternative: Signal userspace handling completion via a kernel object
This RFC proposes completing the kernel-initiated OOM reboot by calling `zx_system_powerctl` with a specific `cmd` value. Instead the kernel-to-userspace signaling mechanism could be changed such that userspace receives a channel or event object. Userspace could then send a message or assert/deassert a signal to indicate reboot can proceed. This alternative has the advantage that the component which completes the reboot would not need access to the root resource. Access to the root resource is required to call `zx_system_powerctl`. This alternative requires more work because it is a more substantial change to how kernel/userspace signaling happens today.
Drawback: Certain races continue to be possible
Today it is possible that the kernel detects an OOM condition, claims the halt token, and then userspace calls `zx_system_powerctl` because userspace previously decided to reboot. In this situation the call to `zx_system_powerctl` will fail. `driver_manager` then exits. `component_manager` continues tearing down the component topology, eventually arriving at `power_manager`, which it kills. Killing `power_manager` crashes the root job because `power_manager` is set as critical to the root job. Normally Zircon would reboot the system when the root job dies, but does not in this case since `MemoryWatchdog` holds the halt token. Instead the system eventually hits `MemoryWatchdog`'s timeout and it reboots the system.
This RFC allows similar races. Userspace could be in the process of tearing down in preparation for a reboot when an OOM condition is hit. It could be that `pwrbtn-monitor` already exited, which means nothing in userspace is around to observe the OOM signal from the kernel. Alternately, `pwrbtn-monitor` might be running, but its attempt to reboot the system will fail because `power_manager` only allows one in-flight request to shut down or reboot the system. The earlier teardown request will eventually get to `driver_manager`. `driver_manager`'s request to `zx_system_powerctl` will fail as previously described and the root job will crash, but the system doesn't reboot until `MemoryWatchdog`'s timeout expires. How bad is this race? It is pretty benign because userspace is going to clean itself up. By the time the root job crashes userspace has already cleaned up as much of itself as it ever expects to. The biggest downside is that the reboot will be less prompt.
Another possible race is if a reboot-on-terminate component exits at just the wrong time. Components can configure themselves so that if they exit `component_manager` will reboot the system. `component_manager` reboots the system by calling `fuchsia.hardware.power.statecontrol/Admin.Reboot`. `component_manager` does not try to reboot the system if a reboot-on-terminate component exits after `component_manager` is told to tear down the component topology. If such a reboot-on-terminate component exits after a component calls `Admin.Reboot`, but before `power_manager` tells `component_manager` to tear down the topology, the system will crash because `component_manager` panics if its call to `power_manager` fails. This RFC does not propose a fix to this race. The race may result in unpredictable behavior because we don't know how much of the system is running when `component_manager` panics.
Unknown: Impact on chance of completely running out of memory
The net impact of this RFC on the risk of the system completely running out of memory during OOM handling is uncertain. The proposed changes could reduce the chance of the system completely running out of memory because most userspace components do not listen for an exit signal and are killed as soon as their clients have exited. This should start freeing memory quickly. We expect that userspace components which do observe the exit signal should exit promptly, also freeing memory. The proposed changes could increase the chance of running out of memory because some systems could take longer to shut down than the existing 8 second timeout, giving some components more time to allocate memory. This RFC also delays shutting down filesystems, which today exit quickly. Filesystem caches typically take the form of discardable memory managed by the kernel. It is unclear whether filesystems running longer will have a significant negative impact on memory.