RFC-0071: OTA backstop | |
---|---|
Status | Accepted |
Areas | |
Description | Prevent devices from OTAing backwards across a version boundary. |
Issues | |
Gerrit change | |
Authors | |
Reviewers | |
Date submitted (year-month-day) | 2021-02-03 |
Date reviewed (year-month-day) | 2021-02-24 |
Summary
This document proposes a plan to prevent devices from installing over-the-air (OTA) updates backwards across a version boundary.
Motivation
When the storage stack makes a breaking change to a filesystem format, the major version number of the format is incremented, which prevents drivers running on older system versions from attempting to mount and use images in the new format.
Having an equivalent version number in the system update stack would prevent users from attempting to OTA backwards to a driver version that does not support the filesystem image their device contains. In other words, it would allow us to fail the "backwards OTA" operation before the device is bricked.
This would add value because:
- it would be highly useful for any application that persists state, for instance an application maintaining a SQLite database whose schema could change over time.
- specifically, it would be highly useful for the storage team, since they have in the past had to invest a lot of time into triaging issues that were ultimately caused by reverse-OTAing across a version boundary.
- this would reinforce that Fuchsia does not support backwards OTAs; they are strictly best effort.
It's important to note that this proposal doesn't change which OTA sequences are supported and which are not. It just makes this support explicit. The main purpose of the OTA backstop is to prevent developer devices from being put into invalid states. For production devices, the no-backwards-OTAs invariant should be primarily enforced by release management.
Without this proposal, attempting to backwards-OTA across an incompatible boundary will cause problems when developers attempt to boot the device (e.g. the filesystem format might not be supported by the driver). With this proposal, developers will find out about this before they do the OTA (and the error will be much clearer), which is a better developer experience.
Background
Terminology
An OTA is a mechanism for upgrading the underlying operating system. Fuchsia devices can receive and install OTA updates to system and application software.
A Stepping Stone build is a build that cannot be skipped over in OTAs. For example, consider three sequential releases A, B, and C. Traditionally, we'd need to support OTAs from A->B, B->C, and A->C. If we declare B as a stepping stone release, this removes the A->C edge, so the only way for A to upgrade to C is to OTA A->B then B->C. In practice, this is useful for risky migrations and for reducing the number of forward OTAs we need to test.
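The effect of a stepping stone on the set of OTA hops can be illustrated with a small sketch. The function name `upgrade_path` and its arguments are hypothetical and exist only for illustration; nothing like it is proposed by this RFC.

```python
def upgrade_path(releases, stepping_stones, start, target):
    """Return the OTA hops needed to go from `start` to `target`.

    releases: release names in order, oldest first.
    stepping_stones: set of releases that cannot be skipped over.
    """
    i, j = releases.index(start), releases.index(target)
    if j < i:
        raise ValueError("backwards OTAs are out of scope for this sketch")
    if i == j:
        return []
    # Every stepping stone strictly between start and target must be visited.
    stops = [r for r in releases[i + 1 : j] if r in stepping_stones]
    waypoints = [start, *stops, target]
    return list(zip(waypoints, waypoints[1:]))


releases = ["A", "B", "C"]
print(upgrade_path(releases, set(), "A", "C"))  # [('A', 'C')]
print(upgrade_path(releases, {"B"}, "A", "C"))  # [('A', 'B'), ('B', 'C')]
```

With no stepping stones, A can OTA directly to C; once B is declared a stepping stone, the direct A->C hop disappears and only A->B and B->C remain to be tested.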
How the OTA backstop relates to stepping stones
The OTA backstop and stepping-stone releases are both primitives for performing safe migrations (for example, storage format migrations). The exact playbook for how the OTA backstop and stepping-stone releases should be used is out of scope for this RFC. Instead, here we provide an example of how these primitives may be used to support a safe migration.
Consider a storage format migration. The steps we might take are:
1. Add support for the new format, but don't enable/migrate it yet. Bump the OTA backstop.
2. Wait some time.
3. Enable the new format with one of the above migration strategies.
For cases where we do actually migrate devices, we have two further steps we can take to enable cleanup:
4. Cut a stepping-stone release that includes (3).
5. Remove the migration code and support for the old format.
The stepping stone release allows us to assume that devices will have gone through a build that has the migration code, and thus we can remove read support for the old format going forward.
Bumping the OTA backstop in (1) ensures that devices don't downgrade to a version that doesn't have support for the new format.
Policy for bumping the backstop
The backstop should be bumped one-off as needed. The vast majority of changes should not require backstop bumps. If this RFC is approved, an official playbook doc should be published to describe specific steps for bumping the backstop. In the meantime, here we propose a high level overview of this policy.
When proposing a CL to bump the backstop, authors should:
- Provide a link to an issue on bugs.fuchsia.dev which describes why the bump is necessary and how developers can proceed if they absolutely need to downgrade their device across the backstop (the answer is probably to flash or pave the device).
- Obtain //src/sys/pkg/OWNERS approval.
Design
Let's introduce an epoch.json file to be present both in the update package and on the system. It should be a JSON file with two string keys:
- "version", which should have a single string value for the epoch.json schema version. In practice, this will not be checked when performing updates; this key only exists to make it obvious when epoch.json schema changes are made in production.
- "epoch", which should have a single integer value for the OTA backstop. If the epoch of the update package < epoch of system, we should fail OTAs in the prepare phase with UNSUPPORTED_DOWNGRADE.
For example, epoch.json may look like:
{
  "version": "1",
  "epoch": 5
}
In order to safely bump the epoch, let's also introduce an epoch_history file that gets compiled into epoch.json via the build system. The epoch_history file could be in the form:
0=Initial epoch (https://fxbug.dev/42144857)
1=Storage format migration (https://fxbug.dev/XXXXX)
...
N=Most recent change (https://fxbug.dev/YYYYY)
The epoch_history file should be manually extended with a new entry each time a backwards incompatible change is introduced.
While the intermediary epoch_history file adds another layer of complexity, this approach is advantageous because:
- It provides a log of all version bump changes (forced documentation!)
- It produces a merge conflict if two people try to bump the epoch for different reasons.
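The compilation of epoch_history into epoch.json could be sketched as follows. This is a minimal illustration, not the actual build script: the function name `epoch_history_to_json` is hypothetical, the `<epoch>=<description>` line format is taken from the example above, and the requirement that entries be numbered 0, 1, 2, ... without gaps is an assumption added here as a sanity check.

```python
import json
import re

# Assumed line format, taken from the example above: "<epoch>=<description>".
_LINE = re.compile(r"^(\d+)=(.+)$")


def epoch_history_to_json(history_text, schema_version="1"):
    """Compile epoch_history text into the contents of epoch.json."""
    epochs = []
    for raw in history_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        match = _LINE.match(line)
        if match is None:
            raise ValueError(f"malformed epoch_history line: {line!r}")
        epochs.append(int(match.group(1)))
    if not epochs:
        raise ValueError("epoch_history is empty")
    # Assumption: entries are sequential, so a skipped or duplicated epoch
    # is reported at build time instead of reaching a device.
    if epochs != list(range(len(epochs))):
        raise ValueError(f"epochs must be 0..N without gaps, got {epochs}")
    # The current epoch is the last entry in the history.
    return json.dumps({"version": schema_version, "epoch": epochs[-1]}, indent=2)


history = (
    "0=Initial epoch (https://fxbug.dev/42144857)\n"
    "1=Storage format migration (https://fxbug.dev/XXXXX)\n"
)
print(epoch_history_to_json(history))
```

Deriving the epoch from the last history entry means nobody edits epoch.json by hand, so the number in the update package cannot drift from the documented reasons for each bump.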
Implementation
The changes will occur entirely in the platform (specifically, the system update stack).
In order to land the change, we need to:
- Add epoch_history to //src/sys/pkg/bin/system-updater.
  - Also, make a script that converts epoch_history to epoch.json.
  - Have the build system use this script to add epoch.json to the system-updater's out directory.
- Modify the BUILD so that epoch.json also gets put into the update package.
- The system-updater should examine epoch.json at the end of the Prepare phase.
  - If there is no epoch.json in the update package or there is a problem with deserializing it, assume epoch is 0. We deliberately ignore errors so that we can still OTA if the epoch.json schema changes.
  - If there is no epoch.json in system-updater's out directory or if there is a problem with deserializing it, fail because this is unexpected. Consider using the include_str macro to read from the out directory.
  - If epoch in update package < epoch in system-updater, fail prepare with reason UNSUPPORTED_DOWNGRADE. We'll need to create a new PrepareFailureReason for UNSUPPORTED_DOWNGRADE.
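The three rules for examining epoch.json can be summarized in a short sketch. The real check would live in the system-updater itself; Python is used here only to illustrate the decision logic. The names `PrepareError`, `parse_epoch`, and `check_epoch` are hypothetical, a missing file is modeled as `None`, and the `"INTERNAL"` reason for a bad system-side epoch is a placeholder, since this RFC does not name one.

```python
import json


class PrepareError(Exception):
    """Hypothetical stand-in for a prepare-phase failure with a reason."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


def parse_epoch(raw):
    """Extract the integer epoch from epoch.json contents (None if missing)."""
    epoch = json.loads(raw)["epoch"]
    if isinstance(epoch, bool) or not isinstance(epoch, int):
        raise ValueError("epoch must be an integer")
    return epoch


_PARSE_ERRORS = (TypeError, KeyError, ValueError)


def check_epoch(system_raw, update_raw):
    # A missing or unreadable system-side epoch.json is unexpected: fail.
    try:
        system_epoch = parse_epoch(system_raw)
    except _PARSE_ERRORS as err:
        raise PrepareError("INTERNAL") from err  # placeholder reason
    # A missing or unreadable epoch.json in the update package counts as
    # epoch 0, so OTAs keep working if the schema changes.
    try:
        update_epoch = parse_epoch(update_raw)
    except _PARSE_ERRORS:
        update_epoch = 0
    if update_epoch < system_epoch:
        raise PrepareError("UNSUPPORTED_DOWNGRADE")
    return update_epoch


# Same epoch on both sides: the OTA may proceed.
check_epoch('{"version": "1", "epoch": 5}', '{"version": "1", "epoch": 5}')
```

Note the asymmetry: errors on the update-package side degrade to epoch 0 (which blocks the OTA only if the system epoch is above 0), while errors on the system side always fail the prepare phase.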
Security
This is not a security feature. However, it may interact with security features to improve developer workflows. For example, consider a rollback protection feature that refuses to boot an image below version N. If we increment the epoch when we land image version N, this will prevent developers from downgrading to an unbootable version because those downgrades will fail at the OTA backstop.
Beyond that, we choose to embed epoch.json in the system-updater binary (rather than in config-data) to make OTAs resilient to config-data corruption.
Privacy and performance considerations
N/A
Testing
We can use the existing system update testing framework in //src/sys/pkg, which is a mix of unit and integration tests.
Additionally, the OTA e2e tests will ensure that the backstop is both non-decreasing and in a valid format. For example:
- if build N lowers the OTA backstop, we'll fail in CI to OTA from build N-1 to N.
- if build N produces an invalid epoch.json in the system-updater, we'll fail in CI to OTA from build N to N'.
Documentation
We'll need to create a document to describe the policy for updating epoch_history.
Also, we'll need to modify:
- Update package documentation.
- OTA documentation (not yet posted on fuchsia.dev).
Drawbacks, alternatives, and unknowns
What are the costs of implementing this proposal?
The main cost of implementing this proposal is increased platform complexity, since we are adding yet another version identifier to the platform.
What other strategies might solve the same problem?
Another strategy is to officially support all backwards OTAs. This is impractical because we can't write code resilient to future changes if we don't know what those changes are.
Another strategy is to explicitly prohibit all backwards OTAs (even ones that would otherwise be possible). For example, we could automatically bump the backstop on every new build. We decided not to do this because in practice, some developers do rely on these backwards OTAs and we'd like to not break these developers.
Another approach might be to directly integrate with Fuchsia platform versioning (see RFC-0002). However, this raises several open questions. For example, should all backwards OTAs across an API level be prevented, or should we pick specific levels? Who would we break? Since there is precedent on Fuchsia for using different version identifiers for different parts of the system (e.g. file systems have their own version identifiers), a dedicated version identifier for the OTA backstop seems to be the simpler option.
Prior art and references
Android has more info on OTAs.
Acknowledgements
James Sullivan contributed to the motivation and stepping stone sections. Zach Kirschenbaum wrote the original design doc, which was reviewed by Dan Johnson.