RFC-0254: Changing Attribution for Copy-on-Write Pages | |
---|---|
Status | Accepted |
Areas | |
Description | Change the APIs and semantics Zircon exposes for attributing copy-on-write pages. |
Issues | |
Gerrit change | |
Authors | |
Reviewers | |
Date submitted (year-month-day) | 2024-08-02 |
Date reviewed (year-month-day) | 2024-07-29 |
Problem Statement
The current behavior that Zircon exposes for attributing memory to VMOs is proving problematic for the Starnix kernel in two ways:
First, Zircon's current behavior for memory attribution is only compatible with the use of a widely-shared lock in Zircon's VMO implementation. This shared lock is the cause of some performance problems when running many Starnix processes.
Starnix backs each MAP_ANONYMOUS
Linux mapping with a VMO, and clones all such
VMOs in a Linux process when it makes a fork()
call. With many Linux processes
the result is many VMOs contending on a small set of shared locks. When we
prototyped a memory-saving optimization where Starnix used one large VMO for all
mappings in a process, the shared lock degenerated to a "global" lock that all
Starnix VMOs contended on.
Second, Starnix can't emulate the exact behavior of Linux memory attribution
with the APIs Zircon provides. They don't distinguish between private and shared
copy-on-write pages in a VMO and hence also don't expose the share counts for
any such shared pages. Starnix needs this information to accurately compute
USS and PSS values that are exposed in various /proc
filesystem entries.
We only propose changes to Zircon's memory attribution APIs and behavior that will solve these problems while maintaining the existing feature set of tools that rely on them. We don't seek here to solve the more general problem of attributing memory between user space entities like components and runners. We also don't intend these changes to be the "last word" for attribution - they are a means to an end to solve the specific performance bottlenecks and feature gaps Starnix faces today.
Summary
Improve Zircon's memory attribution APIs and behavior to account for shared copy-on-write pages.
Stakeholders
Facilitator:
neelsa@google.com
Reviewers:
- adanis@google.com
- jamesr@google.com
- etiennej@google.com
- wez@google.com
Consulted:
- maniscalco@google.com
- davemoore@google.com
Socialization:
This RFC was socialized in conversations with stakeholders from the Zircon, Starnix, and Memory teams. The stakeholders also reviewed a design document outlining the changes proposed in this RFC.
Requirements
- The kernel's memory attribution implementation must not prevent optimizing lock contention in Zircon by making the VMO hierarchy lock more fine-grained.
- When tools like memory_monitor sum attribution for all VMOs in the system, the total must agree with the system's actual physical memory usage. This is the "sums to one" property.
- Starnix must be able to emulate Linux memory attribution APIs on top of the APIs provided by Zircon. This includes providing USS, RSS, and PSS measurements and other measurements in relevant proc filesystem entries.
Design
The changes we propose affect how Zircon attributes pages shared between VMO clones via copy-on-write. These are COW-private and COW-shared pages.
We don't propose changes to how Zircon attributes pages shared between processes via memory mappings or duplicate VMO handles. These are process-private and process-shared pages.
User space is responsible for attributing process-shared pages to one or more processes when appropriate. Zircon doesn't generally consider process sharing in its decision-making, with one exception. See "Backwards Compatibility" for more details.
The types of sharing are independent - a given page can be any combination of COW-private, COW-shared, process-private, and process-shared.
Zircon's Existing Behavior
Zircon's existing attribution behavior provides a single per-VMO measurement: attributed memory from the VMO's COW-private and COW-shared pages, with each page uniquely attributed.
- It uniquely attributes COW-shared pages to the oldest living VMO which can reference them. COW-shared pages are sometimes shared between sibling VMOs so the oldest living VMO referencing a page need not be a parent of the others.
- Summing it over all VMOs system-wide measures the total system memory usage.
- It does not measure a VMO's USS, because it might include some COW-shared pages.
- It also does not measure a VMO's RSS, because it might not include all COW-shared pages.
- It does not measure a VMO's PSS because that measurement scales each page's contribution by the number of times it is shared.
Zircon's New Behavior
Zircon's new attribution behavior provides three per-VMO measurements:
- USS: Attributed memory from the VMO's COW-private pages. This measurement attributes the pages only to that VMO. It measures the memory which would be freed if the VMO were destroyed.
- RSS: Attributed memory from the VMO's COW-private and COW-shared pages. This measurement attributes the COW-shared pages to each VMO that shares them, so it counts COW-shared pages multiple times. It measures the total memory the VMO references.
- PSS: Attributed memory from the VMO's COW-private and COW-shared pages. This measurement evenly divides attribution for each COW-shared page between all VMOs which share that page. It can be a fractional number of bytes for a given VMO, because each VMO gets an equal fraction of each of its shared pages' bytes. It measures the VMO's "proportional" usage of memory. Summing it over all VMOs system-wide measures the total system memory usage.
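As a hypothetical illustration (the scenario and numbers are ours, not from the kernel documentation): suppose a parent VMO P has 4 committed pages and a SNAPSHOT child C has written 1 of them, so 3 pages remain COW-shared and each VMO also holds 1 COW-private page, for 5 physical pages in total.

VMO | USS | RSS | PSS
---|---|---|---
P | 1 page | 4 pages | 2.5 pages
C | 1 page | 4 pages | 2.5 pages

Summing PSS over both VMOs gives 5 pages, matching the physical memory actually committed, while summing RSS double-counts the 3 shared pages. Under the existing behavior, by contrast, P would be attributed all 4 of the pages it can reference and C only its 1 private copy.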
We chose to change the attribution APIs by exposing all of USS, RSS, and PSS because it was the only option among the alternatives we explored that satisfied all requirements.
Impacts to User Space
This proposal continues to preserve the "sums to one" property which existing
user space tools rely on. Attribution queries may give different results under
the new behavior, but summing the queries for all VMOs in the system continues
to provide an accurate count of memory usage. These changes won't cause programs
like memory_monitor
and ps
to begin overcounting or undercounting memory,
and in fact will make their per-VMO and per-task counts more accurate.
The new APIs expose fractional byte values for PSS, and user space must opt in to observing these fractional bytes. The _fractional_scaled fields contain the partial bytes; user space will slightly undercount PSS unless it uses them.
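For example, a tool summing PSS across many VMOs could carry the fractional bytes along and only count a whole byte once they add up to one. A minimal sketch, assuming the proposed field semantics above (the names pss_accumulator_t and pss_add are ours):

```c
#include <stdint.h>

// Fixed-point representation of one full byte, per the proposed definition:
// the fractional fields have 63 bits of precision, where 0x8000000000000000
// represents a full byte.
#define PSS_FULL_BYTE (UINT64_C(1) << 63)

typedef struct pss_accumulator {
  uint64_t bytes;     // Whole bytes of PSS accumulated so far.
  uint64_t fraction;  // Fractional bytes, in units of 1/2^63 of a byte.
} pss_accumulator_t;

// Adds one VMO's scaled (PSS) measurement to the accumulator, carrying a full
// byte whenever the fractional parts add up to one.
static void pss_add(pss_accumulator_t* acc, uint64_t scaled_bytes,
                    uint64_t fractional_scaled_bytes) {
  acc->bytes += scaled_bytes;
  acc->fraction += fractional_scaled_bytes;
  if (acc->fraction >= PSS_FULL_BYTE) {
    acc->bytes += 1;
    acc->fraction -= PSS_FULL_BYTE;
  }
}
```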
Starnix underestimates the RSS values for Linux mappings today, because the current attribution behavior only counts COW-shared pages once, whereas RSS should count shared pages multiple times. Starnix can't emulate this behavior without help from Zircon, so we change the Zircon attribution APIs to provide correct RSS values. We could do this in a separate round of API changes, but to avoid churn we chose to batch the changes together.
memory_monitor
considers a VMO as shared with processes which reference any of its children,
even for processes that don't reference the VMO directly.
It equally divides the parent VMO's attributed memory between these processes.
This attempts to account for COW-shared pages and allows memory_monitor
to
"sum to one", but can lead to some incorrect results. For example, it can assign
a portion of a parent VMO's COW-private pages to processes which don't reference
that VMO and thus can't reference the pages. It also incorrectly scales parent
COW-shared pages when they are shared with some children but not others. Our
changes make this behavior redundant and we will remove it. The tool will still
scale the memory of VMOs when multiple processes reference them directly. See
"Backwards Compatibility" for more details.
Syscall API Changes
Zircon's attribution APIs consist of zx_object_get_info
topics which we will
change:
ZX_INFO_VMO

- We will preserve ABI and API backwards compatibility by:
  - Renaming ZX_INFO_VMO -> ZX_INFO_VMO_V3 while retaining its value.
  - Renaming zx_info_vmo_t -> zx_info_vmo_v3_t.
- We will add a new ZX_INFO_VMO and zx_info_vmo_t, and change the meaning of two existing fields in all versions of zx_info_vmo_t:
```c
typedef struct zx_info_vmo {
// Existing and unchanged `zx_info_vmo_t` fields omitted for brevity.
// These fields already exist but change meaning.
//
// These fields include both private and shared pages accessible by a VMO.
// This is the RSS for the VMO.
//
// Prior versions of these fields assigned any copy-on-write pages shared by
// multiple VMO clones to only one of the VMOs, making the fields inadequate
// for computing RSS. If 2 VMO clones shared a single page then queries on
// those VMOs would count either `zx_system_get_page_size()` or 0 bytes in
// these fields depending on the VMO queried.
//
// These fields now include all copy-on-write pages accessible by a VMO. In
// the above example, queries on both VMOs count `zx_system_get_page_size()`
// bytes in these fields. Queries on related VMO clones will count any shared
// copy-on-write pages multiple times.
//
// In all other respects these fields are the same as before.
uint64_t committed_bytes;
uint64_t populated_bytes;
// These are new fields.
//
// These fields include only private pages accessible by a VMO and not by any
// related clones. This is the USS for the VMO.
//
// These fields are defined iff `committed_bytes` and `populated_bytes` are,
// and they use the same definitions of committed vs. populated.
uint64_t committed_private_bytes;
uint64_t populated_private_bytes;
// These are new fields.
//
// These fields include both private and shared copy-on-write pages that a VMO
// can access, with each shared page's contribution scaled by how many VMOs
// can access that page. This is the PSS for the VMO.
//
// The PSS values may contain fractional bytes, which are included in the
// "fractional_" fields. These fields are fixed-point counters with 63-bits
// of precision, where 0x800... represents a full byte. Users may accumulate
// these fractional bytes and count a full byte when the sum is 0x800... or
// greater.
//
// These fields are defined iff `committed_bytes` and `populated_bytes` are,
// and they use the same definitions of committed vs. populated.
uint64_t committed_scaled_bytes;
uint64_t populated_scaled_bytes;
uint64_t committed_fractional_scaled_bytes;
uint64_t populated_fractional_scaled_bytes;
} zx_info_vmo_t;
```
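For illustration, once the new topic ships a caller holding a VMO handle could read all three measurements with one query. A sketch, assuming the proposed topic and field names above have landed:

```c
#include <inttypes.h>
#include <stdio.h>
#include <zircon/syscalls.h>
#include <zircon/syscalls/object.h>

// Prints the three proposed measurements for one VMO. This assumes the new
// ZX_INFO_VMO topic and the fields proposed in this RFC.
void print_vmo_attribution(zx_handle_t vmo) {
  zx_info_vmo_t info;
  zx_status_t status =
      zx_object_get_info(vmo, ZX_INFO_VMO, &info, sizeof(info), NULL, NULL);
  if (status != ZX_OK) {
    return;
  }
  printf("USS: %" PRIu64 " bytes\n", info.committed_private_bytes);
  printf("RSS: %" PRIu64 " bytes\n", info.committed_bytes);
  // PSS may carry a fractional part; see the fractional fields above.
  printf("PSS: %" PRIu64 " bytes + %" PRIu64 "/2^63\n",
         info.committed_scaled_bytes, info.committed_fractional_scaled_bytes);
}
```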
ZX_INFO_PROCESS_VMOS

- We will preserve ABI and API backwards compatibility by:
  - Renaming ZX_INFO_PROCESS_VMOS -> ZX_INFO_PROCESS_VMOS_V3 while retaining its value.
- We will add a new ZX_INFO_PROCESS_VMOS.
  - This topic re-uses the zx_info_vmo_t family of structs, so the changes to them all apply here.
ZX_INFO_PROCESS_MAPS

- We will preserve ABI backwards compatibility by:
  - Renaming ZX_INFO_PROCESS_MAPS -> ZX_INFO_PROCESS_MAPS_V2 while retaining its value.
  - Renaming zx_info_maps_t -> zx_info_maps_v2_t.
  - Renaming zx_info_maps_mapping_t -> zx_info_maps_mapping_v2_t.
- We will break API compatibility in one case:
  - Renaming committed_pages and populated_pages is a breaking change. See "Backwards Compatibility" below.
- We will add a new ZX_INFO_PROCESS_MAPS, zx_info_maps_t, and zx_info_maps_mapping_t, and change the meaning of two fields in existing versions of zx_info_maps_mapping_t. These fields aren't present in the new version of this struct (see below):
```c
typedef struct zx_info_maps {
// No changes to existing fields, but a new version is required so that it can
// reference the new `zx_info_maps_mapping_t`.
} zx_info_maps_t;
typedef struct zx_info_maps_mapping {
// Existing and unchanged `zx_info_maps_mapping_t` fields omitted for brevity.
// These fields are present in older versions of `zx_info_maps_mapping_t` but
// not this new version. In the older versions they change meaning.
//
// See `committed_bytes` and `populated_bytes` in the `zx_info_vmo_t` struct.
// These fields change meaning in the same way.
// uint64_t committed_pages;
// uint64_t populated_pages;
// These are new fields which replace `committed_pages` and `populated_pages`
// in this new version of `zx_info_maps_mapping_t`.
//
// These fields are defined in the same way as the ones in `zx_info_vmo_t`.
uint64_t committed_bytes;
uint64_t populated_bytes;
// These are new fields.
//
// These fields are defined in the same way as the ones in `zx_info_vmo_t`.
uint64_t committed_private_bytes;
uint64_t populated_private_bytes;
uint64_t committed_scaled_bytes;
uint64_t populated_scaled_bytes;
uint64_t committed_fractional_scaled_bytes;
uint64_t populated_fractional_scaled_bytes;
} zx_info_maps_mapping_t;
```
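As an example of how a user space tool might consume the new mapping fields, here is a sketch that sums a process's USS by walking its mappings. It assumes the proposed fields above; error handling and the race between the two size queries are ignored for brevity:

```c
#include <stdint.h>
#include <stdlib.h>
#include <zircon/syscalls.h>
#include <zircon/syscalls/object.h>

// Sums the USS of a process by walking its mappings with the proposed
// ZX_INFO_PROCESS_MAPS topic.
uint64_t process_uss_bytes(zx_handle_t process) {
  size_t actual = 0, avail = 0;
  // First call with a NULL buffer to learn how many entries are available.
  zx_object_get_info(process, ZX_INFO_PROCESS_MAPS, NULL, 0, &actual, &avail);

  zx_info_maps_t* maps = calloc(avail, sizeof(zx_info_maps_t));
  zx_object_get_info(process, ZX_INFO_PROCESS_MAPS, maps,
                     avail * sizeof(zx_info_maps_t), &actual, &avail);

  uint64_t uss = 0;
  for (size_t i = 0; i < actual; i++) {
    if (maps[i].type == ZX_INFO_MAPS_TYPE_MAPPING) {
      // `committed_private_bytes` is the proposed USS field for a mapping.
      uss += maps[i].u.mapping.committed_private_bytes;
    }
  }
  free(maps);
  return uss;
}
```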
ZX_INFO_TASK_STATS

- We will preserve ABI and API backwards compatibility by:
  - Renaming ZX_INFO_TASK_STATS -> ZX_INFO_TASK_STATS_V1 while retaining its value.
  - Renaming zx_info_task_stats_t -> zx_info_task_stats_v1_t.
- We will add a new ZX_INFO_TASK_STATS and zx_info_task_stats_t, and change the meaning of three existing fields in all versions of the zx_info_task_stats_t struct:
```c
typedef struct zx_info_task_stats {
// These fields already exist but change meaning.
//
// These fields include either private or shared pages accessible by mappings
// in a task.
// `mem_private_bytes` is the USS for the task.
// `mem_private_bytes + mem_shared_bytes` is the RSS for the task.
// `mem_private_bytes + mem_scaled_shared_bytes` is the PSS for the task.
//
// Prior versions of these fields only considered pages to be shared when they
// were mapped into multiple address spaces. They could incorrectly attribute
// shared copy-on-write pages as "private".
//
// They now consider pages to be shared if they are shared via either multiple
// address space mappings or copy-on-write.
//
// `mem_private_bytes` contains only pages which are truly private - only one
// VMO can access the pages and that VMO is mapped into one address space.
//
// `mem_shared_bytes` and `mem_scaled_shared_bytes` contain all shared pages
// regardless of how they are shared.
//
// `mem_scaled_shared_bytes` scales the shared pages it encounters in two
// steps: first each page is scaled by how many VMOs share that page via
// copy-on-write, then each page is scaled by how many address spaces map the
// VMO in the mapping currently being considered.
//
// For example, consider a single page shared between 2 VMOs P and C.
//
// If P is mapped into task p1 and C is mapped into tasks p2 and p3:
// `mem_private_bytes` will be 0 for all 3 tasks.
// `mem_shared_bytes` will be `zx_system_get_page_size()` for all 3 tasks.
// `mem_scaled_shared_bytes` will be `zx_system_get_page_size() / 2` for p1
// and `zx_system_get_page_size() / 4` for both p2 and p3.
//
// If P is mapped into task p1 and C is mapped into tasks p1 and p2:
// `mem_private_bytes` will be 0 for both tasks.
// `mem_shared_bytes` will be `2 * zx_system_get_page_size()` for p1 and
// `zx_system_get_page_size()` for p2.
// `mem_scaled_shared_bytes` will be `3 * zx_system_get_page_size() / 4` for
// p1 and `zx_system_get_page_size() / 4` for p2.
uint64_t mem_private_bytes;
uint64_t mem_shared_bytes;
uint64_t mem_scaled_shared_bytes;
// This is a new field.
//
// This field is defined in the same way as the "_fractional" fields in
// `zx_info_vmo_t`.
uint64_t mem_fractional_scaled_shared_bytes;
} zx_info_task_stats_t;
```
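A sketch of how a caller could derive a task's USS, RSS, and PSS from these fields, assuming the proposed layout above (the fractional field is ignored here, which slightly undercounts PSS as noted earlier):

```c
#include <zircon/syscalls.h>
#include <zircon/syscalls/object.h>

// Derives the three per-task measurements from the proposed
// zx_info_task_stats_t fields.
void task_attribution(zx_handle_t process, uint64_t* uss, uint64_t* rss,
                      uint64_t* pss) {
  zx_info_task_stats_t stats;
  if (zx_object_get_info(process, ZX_INFO_TASK_STATS, &stats, sizeof(stats),
                         NULL, NULL) != ZX_OK) {
    *uss = *rss = *pss = 0;
    return;
  }
  *uss = stats.mem_private_bytes;
  *rss = stats.mem_private_bytes + stats.mem_shared_bytes;
  *pss = stats.mem_private_bytes + stats.mem_scaled_shared_bytes;
}
```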
Implementation
We will implement the structural kernel API changes (e.g. adding and renaming
fields) as a single initial CL to reduce churn. We will set the value of all
new fields to 0 except for the _fractional_scaled
fields which we will set
to a sentinel value of UINT64_MAX
.
Next we will change memory_monitor in two ways: it will attribute parent VMO
memory to processes differently, and it will use the PSS _scaled_bytes fields
instead of the RSS _bytes fields. The sentinel values in the _fractional_scaled
fields will gate both of these backwards-incompatible behaviors for now. This
gating, together with the 0 values in the other new fields, means that no user
space behavior changes yet.
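A minimal sketch of that gating check as it might appear in a consumer like memory_monitor (our illustration of the "feature flag" idea, not the tool's actual code):

```c
#include <stdbool.h>
#include <stdint.h>
#include <zircon/syscalls/object.h>

// The kernel will initially report UINT64_MAX in the new fractional fields,
// and real attribution results never contain that value, so it doubles as a
// feature flag for the new attribution behavior.
static bool new_attribution_available(const zx_info_vmo_t* info) {
  return info->committed_fractional_scaled_bytes != UINT64_MAX;
}
```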
Then we will expose the new attribution behavior from the kernel. This change
will alter the meaning of existing fields such as committed_bytes and allow the
new fields to take on non-zero values. The _fractional_scaled fields can no
longer hold the sentinel value at that point, so user space will automatically
begin using the new behaviors and new fields to take advantage of the change.
Finally we will remove the checks on the _fractional_scaled
fields from
memory_monitor
.
The old versions of the APIs we change are considered deprecated. We can remove these at a later date when we are confident no binaries reference them anymore.
Performance
The new attribution implementation will improve performance for attribution queries. It processes VMO clones fewer times and avoids expensive checks that were used to disambiguate shared COW pages between multiple VMOs.
Later, implementing the fine-grained hierarchy lock will improve performance for
page faulting and several VMO syscalls including zx_vmo_read
, zx_vmo_write
,
zx_vmo_op_range
, and zx_vmo_transfer_data
. Today these operations contend on
a hierarchy lock shared by all related VMOs, which serializes all of these
operations on related VMO clones. The serialization is much worse for VMOs that
are cloned many times. Many Starnix and bootfs VMOs fall into this category.
Zircon will perform more work when creating and destroying child VMOs. Creating
SNAPSHOT
and SNAPSHOT_MODIFIED
children becomes O(#parent_pages)
instead
of O(1)
. Destroying any type of child becomes O(#parent_pages)
in all cases
where before it could be O(#child_pages)
in some cases. Zircon currently
defers this work to more frequent operations such as page faults. Moving it to
the infrequent zx_vmo_create_child() and VMO destruction operations will
make those other, more frequent operations faster in exchange. We don't expect any
user-visible performance regressions as a result.
We will use existing microbenchmarks and macrobenchmarks to verify the expected performance deltas.
Backwards Compatibility
ABI and API Compatibility
We will create additional versioned topics for all kernel API changes and preserve support for all existing topics. This will preserve ABI compatibility in all cases and API compatibility in all cases except one.
Renaming committed_pages
and populated_pages
in zx_info_maps_mapping_t
is an API-breaking change. These fields are only used in Fuchsia-internal
tools in a handful of places, so the CL which renames these fields will also
change the names at all call sites.
Other changes won't require updating existing callsites unless they want to take advantage of newly-added fields.
Behavioral Compatibility
Starnix, ps
, and a few libraries external to Fuchsia compute per-VMO or
per-task RSS values. The new attribution behavior will make these computations
more accurate, because they will count COW-shared pages multiple times. This
will change the RSS values in a backwards-incompatible way, but only by making
them more accurate so we don't expect any negative effects.
memory_monitor
's method of attributing memory for parent VMOs has correctness
issues which the new attribution behavior can fix. We must change this tool's
implementation in a backwards-incompatible way to implement the fix. To avoid
churn we will set the new _fractional_scaled
fields to a sentinel value of
UINT64_MAX
until we implement the new attribution behavior. We will gate
memory_monitor
's implementation on this value. Attribution never generates
this sentinel value normally, so it makes a great "feature flag".
We preserve the meaning of "shared" as "process-shared" in ZX_INFO_TASK_STATS
despite it functioning differently from the other attribution APIs. This allows
existing programs which depend on the ZX_INFO_TASK_STATS
behavior to continue
functioning without invasive changes. Future work might replace these uses of
ZX_INFO_TASK_STATS
with different queries, perhaps to memory_monitor
, at
which time we could deprecate and remove ZX_INFO_TASK_STATS
.
Testing
The core test suite already has many tests which validate memory attribution.
We will add a few new tests to validate edge cases around the new attribution APIs and behavior.
Documentation
We will update syscall documentation for zx_object_get_info
to reflect these
changes.
We will add a new page under docs/concepts/memory
which describes Zircon's
memory attribution APIs and provides examples.
Drawbacks, alternatives, and unknowns
We explored some alternative solutions in the process of developing this proposal. Most were easy to implement but had undesirable performance or maintainability characteristics.
Emulate Current Attribution Behavior
In this option we would change the kernel to emulate the existing attribution behavior on top of a new implementation.
It has no backwards-compatibility concerns and doesn't require any API changes.
However it blocks the use of fine-grained hierarchy locking because it needs the existing shared hierarchy lock to function. The emulation would cause attribution queries to deadlock under a finer-grained locking strategy. It would also make attribution queries perform worse, and their performance is already a concern today. Finally, it doesn't provide the information Starnix needs to compute USS, RSS, or PSS measurements.
Expose Only USS and RSS
In this option we would change the kernel to attribute shared copy-on-write pages to each VMO sharing them, and expose attribution for private and shared pages separately.
It is compatible with a fine-grained hierarchy lock.
However it doesn't provide user space with enough information to preserve the "sums to one" property or for Starnix to compute PSS measurements. The kernel attributes COW-shared pages multiple times with this option. Each COW-shared page can be shared a different number of times depending on write patterns, so user space would require per-page information to deduplicate COW-shared pages or evenly divide them amongst clones.
Expose Detailed per-VMO-tree Information
In this option we would change the kernel to attribute shared copy-on-write pages to each VMO sharing them, and expose per-tree and per-VMO information:
- A tree identifier which links a VMO to its tree
- A stable per-VMO identifier for tie-breaking, like a timestamp
- Private pages per VMO
- Total shared pages per tree
- Number of per-tree shared pages visible to each VMO
It provides user space with the information needed to satisfy the "sums to one" property.
However it blocks the lock contention optimizations we wish to make because it needs the existing shared hierarchy lock to function. It isn't feasible to compute the "total shared pages per tree" without such a shared lock. This option also exposes kernel implementation details to user space that would make future maintenance of VM code more difficult. It also doesn't provide the information Starnix needs to compute PSS measurements.
Prior art and references
Other operating systems provide the RSS, USS, and PSS measurements used in this RFC, although sometimes under different names. Windows, MacOS, and Linux all track RSS and USS. Linux-based operating systems track PSS.
Linux exposes these measurements via the proc filesystem. Attribution tools like ps and top exist to present them in a user-friendly way.