RFC-0247: Enabling LTO in Fuchsia | |
---|---|
Status | Accepted |
Areas | |
Description | Enable a compiler feature called Link Time Optimization (LTO) in Fuchsia only for the target binaries that are generated by the Clang toolchain except for the kernel in non-debug (aka release) builds. |
Issues | |
Gerrit change | |
Authors | |
Reviewers | |
Date submitted (year-month-day) | 2024-03-29 |
Date reviewed (year-month-day) | 2024-05-02 |
Summary
We propose enabling Link Time Optimization (LTO), a compiler feature, in Fuchsia's non-debug (aka release) builds for target binaries that are generated by the Clang toolchain, excluding the kernel. This RFC does not propose enabling LTO for the kernel or for Rust.
Motivation
LTO achieves better runtime performance and reduces code size at the cost of increased build time. LTO is a stepping stone towards a more performant and secure system when combined with Profile Guided Optimization (PGO) and Control Flow Integrity (CFI).
Combining LTO with PGO typically achieves better runtime performance because LTO can make guided decisions based on the collected profiles. In addition, LTO makes it possible to use a security mitigation technique called CFI. Enabling LTO and CFI on Fuchsia helps to achieve feature parity with other OS vendors, such as Android and ChromeOS.
Stakeholders
Facilitator:
hjfreyer@google.com
Reviewers:
- aaronwood@google.com
- olivernewman@google.com
- phosek@google.com
Consulted:
- awolter@google.com
- davidroth@google.com
- fmeawad@google.com
- mseaborn@google.com
Socialization:
We socialized a version of this RFC with the EngProd, Performance, Release, Software Assembly, Toolchain and Zircon Kernel teams.
Design
We landed a CL that prepares for enabling LTO by default in non-debug (aka release) builds, so a follow-up CL only needs to turn it on. Developers can use a GN argument to disable LTO completely to mitigate the build time impact.
Implementation
We implemented the proposal in this RFC and landed the necessary changes.
Performance
LTO provides size and performance benefits:
Size benefit: We ran the size checker builders in CQ to measure the size benefit of LTO and found that enabling LTO only in the Clang-generated binaries reduces the total code size by 0.17 MiB in smart displays and by 0.24 MiB in core_size_limits.arm64.
Performance benefit: We ran the performance builders in CQ to measure the performance benefit of LTO. LTO improved 23 out of 405 tests in smart displays, where we see a range of gains and some of them are above 10%. Moreover, LTO improved 1,959 out of 5,226 tests in platform builders. We also see a large range of performance gains in the platform builders, and some of them are above 30%.
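To put the test counts above in proportion, the following minimal sketch converts them into the share of tests that improved. The counts are taken from the paragraph above. The helper name `improved_share` and the dictionary layout are illustrative assumptions and are not part of any Fuchsia tooling.

```python
# Test counts quoted in the Performance section: (improved, total).
results = {
    "smart displays": (23, 405),
    "platform builders": (1959, 5226),
}


def improved_share(improved: int, total: int) -> float:
    """Return the percentage of tests that improved, rounded to one decimal."""
    return round(100.0 * improved / total, 1)


# Hypothetical summary structure, used only for this illustration.
shares = {name: improved_share(i, t) for name, (i, t) in results.items()}

for name, share in shares.items():
    print(f"{name}: {share}% of tests improved")
```

Under these numbers, roughly 5.7% of the smart display tests and roughly 37.5% of the platform builder tests improved.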
Backwards Compatibility
There is no concern about backwards compatibility.
Security considerations
Security review has been completed via https://fxbug.dev/317396428.
Privacy considerations
Privacy review has been completed via https://fxbug.dev/314790650.
Testing
Once LTO is enabled by default on ToT, we plan to rely on the automated tests to flag any functional or performance issues.
Documentation
LTO needs to be added to the Fuchsia release notes.
Drawbacks, alternatives, and unknowns
Drawback: Increased link time
Compilers have limited optimization opportunities while compiling individual modules, and LTO expands the scope of optimizations to the whole program by performing optimizations during link time. This increases the link time to achieve better runtime performance through whole-program analysis and cross-module optimization. The downside is that LTO moves more of the build cost to link time and linking becomes more costly and benefits less from parallelism and caching.
Drawback: Increased build time
Enabling LTO increases the build time in the CI/CQ release builds and in developer builds. We measured the build time impact on clean builds in CQ by applying this CL with and without RBE/Goma. The table below reports the time of the ninja step in MM:SS format.
CQ Builder | RBE | No LTO | LTO | Time Change |
---|---|---|---|---|
core_size_limits.x64-release | No RBE/Goma | 39:00 | 41:00 | +2:00 |
core_size_limits.x64-release | RBE | 6:00 | 8:18 | +2:18 |
core.x64-release | No RBE/Goma | 66:00 | 66:00 | +0:00 |
core.x64-release | RBE | 20:15 | 26:00 | +5:45 |
minimal.x64-release | No RBE/Goma | 72:00 | 78:00 | +6:00 |
minimal.x64-release | RBE | 17:48 | 25:00 | +7:12 |
workbench_eng.x64-release | No RBE/Goma | 41:00 | 43:00 | +2:00 |
workbench_eng.x64-release | RBE | 6:30 | 9:18 | +2:48 |
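The table above reports absolute differences. As a rough aid for comparing builders, the sketch below converts the MM:SS values into a relative overhead. The rows are copied from the table. The helper names `to_seconds` and `relative_overhead` are hypothetical and exist only for this illustration.

```python
def to_seconds(mmss: str) -> int:
    """Convert an MM:SS string such as '8:18' into seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def relative_overhead(no_lto: str, lto: str) -> float:
    """Return the LTO build time increase as a percentage of the baseline."""
    base = to_seconds(no_lto)
    return round(100.0 * (to_seconds(lto) - base) / base, 1)


# (builder, RBE mode, No LTO, LTO), copied from the CQ table above.
cq_rows = [
    ("core_size_limits.x64-release", "No RBE/Goma", "39:00", "41:00"),
    ("core_size_limits.x64-release", "RBE", "6:00", "8:18"),
    ("core.x64-release", "No RBE/Goma", "66:00", "66:00"),
    ("core.x64-release", "RBE", "20:15", "26:00"),
    ("minimal.x64-release", "No RBE/Goma", "72:00", "78:00"),
    ("minimal.x64-release", "RBE", "17:48", "25:00"),
    ("workbench_eng.x64-release", "No RBE/Goma", "41:00", "43:00"),
    ("workbench_eng.x64-release", "RBE", "6:30", "9:18"),
]

overheads = {
    (builder, mode): relative_overhead(no_lto, lto)
    for builder, mode, no_lto, lto in cq_rows
}

for (builder, mode), pct in overheads.items():
    print(f"{builder} ({mode}): +{pct}%")
```

Read this way, the relative overhead is noticeably larger on the RBE builds (for example, about 38% for core_size_limits.x64-release) than on the non-RBE builds (about 5% for the same builder), largely because the RBE baseline is much shorter.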
In the table below, we show the developer build time impact on a full clean build of the core.x64-release configuration only, measured by running `NINJA_STATUS="[%es] " fx build`.
RBE | No LTO | LTO | Time Change |
---|---|---|---|
No RBE/Goma | 26:48 | 27:15 | +0:27 |
RBE cold cache | 13:58 | 17:52 | +3:54 |
RBE warm cache | 13:55 | 15:53 | +1:58 |
We measured the incremental build time overhead of a single binary called `driver_manager`. We performed the following steps to measure the LTO impact on the full build pipeline.
- Perform a clean build by running `fx build driver_manager`.
- Delete this line in this source file.
- Perform an incremental build and measure the elapsed time via `NINJA_STATUS="[%es] " fx build driver_manager`.
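The measurement steps above rely on ninja printing its elapsed time through the `NINJA_STATUS` environment variable. As a complementary sketch, the wall-clock time of any build command can also be captured from a small wrapper. The functions `time_command` and `format_mmss` below are hypothetical helpers, not part of `fx` or ninja, and the demo uses a no-op stand-in command so that it runs outside a Fuchsia checkout.

```python
import os
import subprocess
import sys
import time


def time_command(cmd, extra_env=None):
    """Run cmd and return (exit_code, elapsed_seconds) measured by wall clock."""
    env = dict(os.environ)
    env.update(extra_env or {})
    start = time.perf_counter()
    completed = subprocess.run(cmd, env=env, check=False)
    return completed.returncode, time.perf_counter() - start


def format_mmss(seconds):
    """Format a duration in seconds as M:SS, matching the tables in this RFC."""
    whole = int(round(seconds))
    return f"{whole // 60}:{whole % 60:02d}"


# Stand-in command (an assumption for portability). In a Fuchsia checkout,
# this would be replaced with ["fx", "build", "driver_manager"].
demo_cmd = [sys.executable, "-c", "pass"]

# NINJA_STATUS is passed through so ninja would also print its own elapsed time.
exit_code, elapsed = time_command(demo_cmd, {"NINJA_STATUS": "[%es] "})
print(f"exit code {exit_code}, elapsed {format_mmss(elapsed)}")
```

Running the wrapper once with LTO enabled and once with it disabled, after the same source edit, would give two durations that can be compared in the same way as the table entries.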
We display the incremental build time overhead of `driver_manager` in the table below:
RBE | No LTO | LTO | Time Change |
---|---|---|---|
No RBE/Goma | 0:43 | 1:13 | +0:30 |
RBE | 1:08 | 1:22 | +0:14 |
Unknown: Surfacing Latent Issues
LTO might expose latent issues in our source code because it performs optimizations across modules, which might invalidate certain assumptions that are already in place or surface hidden bugs.
Unknown: Impact on Debuggability
Concerns were also raised about the impact of LTO on code generation and debugging. Although LTO should not affect debuggability in general, and we have not yet identified concrete cases where it does, the impact remains unknown. When such cases are reported, we can investigate whether compiler debugging information and debugging tools interact poorly with LTO in those particular cases.
Future Work: Enable LTO in kernel
We plan to study enabling LTO in the kernel to evaluate its benefit, and to experiment with combining LTO with other optimizations, such as PGO, before enabling LTO in the kernel. We decided to evaluate the benefit of enabling LTO in user space separately from the kernel, and to keep the kernel out of the scope of this RFC.
Future Work: Enable LTO in Rust
We also plan to enable LTO in Rust. Our preliminary results show that enabling LTO in Rust gives us further size and performance improvements. However, we decided to take a staged approach, enabling LTO in different stages to minimize the build time impact and rollout risk.
Future Work: Enable FatLTO and Unified LTO
We plan to investigate enabling FatLTO and Unified LTO in Fuchsia. Different targets may have different build time and performance requirements. For example, it might be beneficial to apply LTO to some targets, but not others, such as tests. FatLTO and Unified LTO might potentially help with the build time overhead and complexity. However, they are not ready for a full rollout and require some debugging and testing effort.
Prior art and references
Android has been shipping with LTO and CFI enabled in their kernel and other components since Android 9. Chrome has deployed LTO, PGO and CFI.
We already use LTO extensively in Fuchsia. Specifically, most host tools used in Fuchsia development, such as ffx, are built with LTO by default, and LTO has been essential to improving the performance of these tools.
FEC decision
The Fuchsia Engineering Council (FEC) has voted to accept this RFC, with some caveats.
First, the FEC agrees that LTO represents a step in the right direction for the binaries that we ship to our partners (and then to end-users), for all the reasons raised by gulfem@google.com and phosek@google.com in the RFC text and comment threads.
Second, the FEC acknowledges that the impact of this change is uncertain. Enabling LTO for release builds will impact stakeholders in both positive and negative ways: end users will benefit from improved performance; platform developers will find optimized code more difficult to debug; developers and product owners will benefit from smaller binary sizes, especially on devices with tight storage constraints; platform developers will experience longer build times; infrastructure will spend more CPU cycles on compilation, but fewer on test execution; etc. No one can predict the full impact, and even if someone could, reasonable stakeholders could disagree about whether the impact is positive overall. Given this uncertainty, we believe the Toolchain team has done due diligence.
Once the change lands, we encourage any stakeholders who believe they have been excessively negatively impacted to bring their concerns to the Toolchain team, and if their concerns are not resolved, to escalate to the FEC. We can always reevaluate this decision as new evidence emerges.
Last, we'd like to acknowledge one negatively-impacted group of stakeholders in particular: C++ platform developers outside the kernel who build in release mode as part of their normal development workflow. This change will most likely make incremental build times meaningfully slower. While this is unfortunate, we believe enabling LTO is still the right choice.
However, this does point to a larger concern, which we share: as we roll out smarter optimizers over more of our codebase, the set of impacted developers will grow, and the extent of the slowdowns will increase. Developers are unlikely to know exactly why their builds are getting slower, and may not even consciously notice the slowdown, but their productivity will be damaged nonetheless.
To avoid this scenario, we're calling for a moratorium on additional toolchain-related changes that increase build times until we can apply them to the artifacts we ship without impacting a significant number of developers on the team - taking into account that a significant number of developers on the team can't use debug builds. For example, we could unblock further changes by making our current two major compilation modes (debug and release) into three (perhaps mirroring Bazel's "debug, opt, and fastbuild" compilation modes). Expensive optimizations would run only in one of these modes, and the other two would cover the vast majority of developer use cases. As a counterexample, documenting a specific set of low-level compiler flags that individual developers can choose to adopt would not be sufficient, as those individual sets of flags would be difficult to discover, lack builder coverage, and have poor cache hit rates.