Build graph convergence

Build graph convergence is that property that a single build invocation will perform all the actions necessary and in the right order so that every action's outputs are newer than its inputs.

Fuchsia uses the Ninja build system, which is timestamp-driven. Ninja expresses the build as a graph of input/output files and actions that take inputs and produce outputs.

When you run a build, e.g. with fx build, Ninja will traverse the build graph and perform any actions whose outputs are not present or whose inputs have changed since they last run, all in topological order (dependencies before dependents).

However, build graph actions are not verified to fulfill the promise that outputs are newer than inputs, which can lead to convergence issues.

Common root causes

There are infinitely many ways to create Ninja convergence issues. That said, prior experience taught us that there are common root causes for these problems.

An output isn't generated

If a build action is declared to produce an output but doesn't actually produce that output (in some circumstances, or ever) then this will cause convergence issues. For instance, an action might declare that it generates a stamp file on success but fail to generate or touch this stamp file, or save it to the wrong location.

An output is stale (not newer than all inputs)

Ninja knows that an output is fresh if it's newer than all inputs. If one or more inputs have changed since the output was saved, then Ninja will repeat the step(s) necessary to generate the output.

However if the action that generates the output doesn't update the output when inputs have changed, this creates the appearance of a perpetual state of staleness.

A common mistake that causes this is when actions review their inputs, decide they have nothing to do/change with the contents of their outputs, but fail to update the modification timestamp on their outputs (i.e. "touch" or "stamp" their outputs).

Modifying inputs

It is possible for an action to modify its inputs. Typically inputs to an action should be opened with read access only, however it's not out of the question to write to them. That said, if your action needs to modify an input, it should do so before writing any outputs. Or if you must modify inputs after writing outputs, be sure to update the timestamp on your outputs before exiting the action. Otherwise you will have updated one or more of your inputs to be newer than one or more of your outputs, and thus confused Ninja into thinking that your outputs are stale.

Modifying inputs in actions can also introduce race conditions that make reproducing the problem non-deterministic. If multiple actions depend on the same input and one of them modifies the input, then the build will fail to converge if one of those actions results in an input timestamp that is newer than any of the actions' outputs. In dependency-ordered execution, the relative ordering of independent actions cannot be guaranteed.

Avoid modifying inputs.

Symbolic links and hard links

The Ninja build system follows symbolic links to determine timestamps. This can have surprising consequences when soft symlinks participate in Ninja rules as input dependencies or outputs. The timestamps of symlinks themselves (as opposed to their destinations) are not considered for staleness and freshness. See ninja#1186 for an explanation in terms of stat() and lstat() and a demonstration. Hard links (ln without -s), have the problem where multiple references point to the same filesystem object, and hence, have the same timestamp.

Even a simple link action can cause issues. Consider a simple action with the input src, the output $target_out_dir/dst, and the invoked action being ln src $target_out_dir/dst. At face value this action converges correctly. But the behavior of action() may be overridden elsewhere in the build system, such as to wrap actions with other actions. As a result your inner action may not converge to no-op when src has an older timestamp than the wrapper action's script, which will then in turn be considered older than dst (its output, which carries the timestamp of the input). copy() doesn't suffer from the same problem because it's never wrapped.

Avoid symlinks and hard-links in action inputs and outputs.

For making copies, prefer the built-in copy() target.

Timestamp granularity

Modern filesystems store timestamps on files (such as the time of last modification) in nanosecond resolution. Some older runtimes, such as Python 2.7, persist file timestamps in lower resolution, for instance milliseconds. It is therefore possible for an action to read an input and write an output with a timestamp that it considers to be "now" but is actually older than the timestamp of the input, if for instance the input and output were both written at the same millisecond and the output's timestamp is truncated after the millisecond digits.

At the time of this writing we have mechanisms in place to ensure that all Python actions in the build run with Python 3.x, in part to avoid this problem.

Build convergence diagnostics

We have the following tools to diagnose build convergence issues:

Ninja no-op check in the Commit Queue
Filesystem access action tracing

Ninja no-op check

Fuchsia's Commit Queue (CQ) verifies that changes not only build successfully, but also keep the build system in a state that it converges to no-op in a single build invocation.

Example of a build convergence error from CQ:

fuchsia confirm no-op
ninja build does not converge to a no-op

The same build is run in CQ before changes can be merged into the source tree, to ensure that changes don't break the build. After completing a build successfully, CQ will invoke Ninja again and expect Ninja to report "no work to do". This serves as a soundness check, since a correct build graph is expected to "converge" to no-op.

If this soundness check fails then CQ will report a failure on a step named fuchsia confirm no-op.

Reproducing Ninja convergence issues

With a source tree synced to your change, simply try the following:

fx build

This command should print:

ninja: no work to do.

If this is not the case, and actual build actions are being performed, run the same command again. If the second invocation still didn't produce "no work", then you've reproduced the issue. If you've arrived at "no work" still, try the following:

# Clean your build cache
rm -rf out
# Set up the build specification again
fx set ...
# Build
fx build
# Build again, expecting no-op
fx build

Troubleshooting Ninja convergence issues

In the CQ results page, under the failed step confirm no-op, you will see several links:

execution details
ninja -d explain -n -v
dirty paths

The link to ninja -d explain -n -v shows information that you should be able to reproduce locally with the following command:

fx ninja -C $(fx get-build-dir) -d explain -n -v

This link to "dirty paths" shows the most relevant subset of the same information. You will see a text file that will most likely begin as follows:

ninja explain: output <...> doesn't exist
...

Every line in this file is like a domino brick. You should begin troubleshooting the problem by looking at the first domino brick that started the chain reaction of extra work being done. For instance in the example above a particular output file doesn't exist, which causes Ninja to re-run the build action that's supposed to produce this output, and then subsequently rerun dependent actions.

Filesystem access tracing

There are also builders that trace actions' file system accesses. Diagnostics for stale or missing outputs look like:

Not all outputs of //your:label were written or touched, which can cause subsequent
build invocations to re-execute actions due to a missing file or old timestamp.

Required writes:
...

Missing outputs:
...

Stale outputs:
...

Diagnostics for unallowed writes to inputs look like:

Unexpected file accesses building //your/target:label, following the order they are accessed:
(FileAccessType.WRITE /path/to/input-that-should-not-be-touched.txt)

Compared to the Ninja no-op check, this check is done on every individual action and diagnoses one of the many causes of convergence issues immediately as the action happens, rather than later through a full fx build command. This approach can catch some issues that would otherwise be hard to reproduce due to race conditions.

Troubleshooting traced action failures

To locally enable action tracing, do one of the following:

Run fx set ... --args=build_should_trace_actions=true
Run fx args, add build_should_trace_actions=true in the editor, save-and-exit

and then fx build //your/failing:target.

Examine the files in the message in the context of the action's script or command, and see if they fall in one of the categories of common issues.