Defining a Stable Driver Runtime

  • Project lead: surajmalhotra@google.com
  • Status: Approved
  • Area(s): Devices

Problem statement

Banjo is an interface definition language (IDL) for expressing the interfaces between drivers. It is a derivative of FIDL, with a syntax forked from it in 2018. While the syntax is similar, unlike FIDL, banjo was designed for synchronous in-process communication, and the resulting codegen amounts to little more than a bare struct of function pointers, associated with a context pointer.
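
To make the shape of that codegen concrete, here is an illustrative sketch in the same style (the protocol name and methods are hypothetical, not actual banjo output):

```cpp
// A banjo-style protocol boils down to a table of synchronous entry points
// plus an opaque context pointer threaded through every call.
#include <cstdint>

struct example_protocol_ops_t {
  void (*do_thing)(void* ctx, uint32_t arg);
  int32_t (*get_value)(void* ctx, uint32_t* out_value);
};

struct example_protocol_t {
  const example_protocol_ops_t* ops;  // function-pointer table
  void* ctx;                          // implementation state, passed to every call
};

// A client invokes the implementation directly, in-process and synchronously:
//   proto.ops->do_thing(proto.ctx, 42);
```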

A non-exhaustive list of banjo's problems includes:

  • The generated code for banjo lacks a strategy for interface and type evolution. This is a critical requirement for interface stability.
  • Since early 2019, banjo has been largely in maintenance mode, and has fallen behind FIDL in terms of ergonomics and features. Writing banjo syntax has become confusing because the Fuchsia project has relied on FIDL documentation to paper over the gaps in banjo's features and ergonomics.
  • Banjo is optimized for low overhead, placing a great deal of burden on driver authors to figure out how to move state onto the heap or handle an operation asynchronously. Doing so involves a great deal of boilerplate and manual serialization logic (see the sketch after this list).
  • There are no strict requirements on how driver authors may invoke banjo protocol methods, nor any guarantees about the context in which their own protocol methods will be invoked, leading to unnecessary spawning of threads in order to achieve safety (i.e., to avoid deadlocks).
  • Banjo types are incompatible with FIDL types, often leading to significant boilerplate when shifting communication out of process.
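
The asynchronous boilerplate mentioned above looks roughly like the following. This is a hedged sketch mirroring banjo's callback-plus-cookie style; all of the names are hypothetical:

```cpp
#include <zircon/errors.h>
#include <zircon/types.h>

#include <cstdint>
#include <memory>

// Banjo-style async completion: a raw callback plus an opaque cookie.
using example_read_callback = void (*)(void* cookie, zx_status_t status, uint32_t value);

// State the driver author must manually box up so it survives until the
// operation completes on some other thread or interrupt.
struct PendingRead {
  example_read_callback callback;
  void* cookie;
};

// Completion path: unbox the state and invoke the caller's callback.
void OnHardwareReadDone(std::unique_ptr<PendingRead> pending, uint32_t value) {
  pending->callback(pending->cookie, ZX_OK, value);
}

// Stub standing in for real hardware; here it completes immediately.
void StartHardwareRead(void* /*ctx*/, std::unique_ptr<PendingRead> pending) {
  OnHardwareReadDone(std::move(pending), /*value=*/42);
}

// The banjo entry point must return promptly, so the request state is moved
// onto the heap by hand before kicking off the operation.
void ExampleRead(void* ctx, example_read_callback callback, void* cookie) {
  auto pending = std::make_unique<PendingRead>(PendingRead{callback, cookie});
  StartHardwareRead(ctx, std::move(pending));
}
```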

Solution statement

We aim to solve these problems by evolving banjo into something better. The three key features of the new transport will be:

  1. A forced layer of indirection between drivers, allowing a runtime to mediate driver-to-driver communication within the same process.
  2. Migration away from C structs towards types built with evolution in mind.
  3. Enforcement of a well-defined threading model.

We are expecting to find a solution with the following characteristics:

  • Shift all communication between drivers to be message oriented, utilizing the FIDL wire format.
  • Allow drivers to make synchronous calls into other drivers.
    • With the caveat that this is only allowed on threads owned by the calling driver.
  • Share threads between drivers.
    • With the caveat that all communication on shared threads must be asynchronous.
  • Allow drivers to never deal with re-entrancy or synchronization unless they opt in, letting them avoid locks altogether (see the sketch after this list).
  • Allow for zero copy and zero serialization / deserialization between drivers.
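
The following sketch illustrates how those threading rules might surface in a driver-facing API. Every name here is hypothetical; it is meant only to show the contract, not a committed interface:

```cpp
#include <functional>
#include <queue>
#include <utility>

// Who owns the thread a dispatcher runs on determines what is legal on it.
enum class ThreadOwnership {
  kDriverOwned,    // the driver created this thread: blocking sync calls allowed
  kRuntimeShared,  // shared between drivers: all calls must stay asynchronous
};

class DriverDispatcher {
 public:
  explicit DriverDispatcher(ThreadOwnership ownership) : ownership_(ownership) {}

  // Synchronous calls into another driver are only legal on driver-owned
  // threads, so a shared thread can never block on (or deadlock with) a peer.
  bool AllowsSyncCalls() const { return ownership_ == ThreadOwnership::kDriverOwned; }

  // Posting asynchronous work is always legal. Unless a driver opts in to
  // re-entrancy, the runtime never runs its handlers concurrently, so the
  // driver can avoid locks altogether.
  void PostTask(std::function<void()> task) { pending_.push(std::move(task)); }

 private:
  ThreadOwnership ownership_;
  std::queue<std::function<void()>> pending_;  // drained by the owning thread
};
```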

We reserve the right to change our minds depending on the benchmark results of early prototypes. We first need to prove out our assumption that the mechanisms provided by the kernel are insufficient for our needs; if we cannot outperform them, we will need to try alternative designs.
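
As one concrete example of such a benchmark (a sketch, not a committed suite), we could time same-process zircon channel hops to establish the kernel baseline the new transport must beat:

```cpp
#include <lib/zx/channel.h>
#include <lib/zx/clock.h>
#include <lib/zx/time.h>

#include <cinttypes>
#include <cstdio>

int main() {
  zx::channel a, b;
  if (zx::channel::create(0, &a, &b) != ZX_OK) {
    return 1;
  }

  constexpr int64_t kIterations = 100000;
  char msg[64] = {};
  uint32_t actual_bytes = 0;

  zx::time start = zx::clock::get_monotonic();
  for (int64_t i = 0; i < kIterations; i++) {
    // One one-way hop: write on one endpoint, read it back on the other.
    a.write(0, msg, sizeof(msg), nullptr, 0);
    b.read(0, msg, nullptr, sizeof(msg), 0, &actual_bytes, nullptr);
  }
  zx::duration elapsed = zx::clock::get_monotonic() - start;

  printf("avg ns per hop: %" PRId64 "\n", elapsed.to_nsecs() / kIterations);
  return 0;
}
```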

We will try to track progress towards a new banjo with the following milestones:

  1. Update banjo syntax to match FIDL syntax, use fidlc as the frontend, and implement a custom backend which generates output equivalent to what banjoc generates today.
    1. This allows us to avoid maintenance burden and future syntax drift.
  2. Architect a threading model for drivers that we want to design around.
  3. Decide on metrics/benchmarks to judge any forthcoming designs.
  4. Run experiments to see if we can meet the required benchmarks with the new transport.
  5. Implement the new FIDL backend and driver runtime.
    1. We will likely start by creating a variant of the LLCPP FIDL bindings which targets the new transport.
  6. Repeat the following steps for each driver stack:
    1. Migrate drivers which are co-resident in the same driver host over to the new threading model, utilizing the existing banjo transport.
    2. Migrate drivers which are co-resident in the same driver host over to the new in-process FIDL transport.

Dependencies

We will likely need to work with the FIDL team to abstract the LLCPP bindings away from zircon channels and ports, allowing us to repurpose the bindings mostly as-is on a new transport with minimal user-visible differences. We don't anticipate any changes to the frontend IDL being necessary, but changes to the FIDL IR may be.
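
The seam we expect to need might look something like the following. This is purely a hypothetical sketch; none of these names exist in the FIDL libraries today:

```cpp
#include <zircon/types.h>

#include <cstdint>

// If LLCPP wrote and read through an abstract transport rather than
// zx::channel directly, the same generated bindings could run over the new
// in-process transport, which could hand encoded FIDL wire-format bytes to
// the peer driver without copying or re-serializing.
class MessageTransport {
 public:
  virtual ~MessageTransport() = default;

  // Send one FIDL-wire-format message.
  virtual zx_status_t Write(const uint8_t* bytes, uint32_t num_bytes) = 0;

  // Receive one message into the caller's buffer.
  virtual zx_status_t Read(uint8_t* bytes, uint32_t capacity, uint32_t* actual) = 0;
};
```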

Additionally, migrating 300+ drivers will take a lot of effort and time, and will require various teams throughout the organization to be involved to ensure nothing breaks.

Risks and mitigations

A major change like this has long-term implications for the performance characteristics of our system, as it induces additional overhead. Luckily, we have built evolutionary support directly into our framework's architecture to enable us to move to another technology if the solution we build is unable to meet future needs: we can implement new component runners and have drivers target the new runner, which may provide a different driver runtime. Switching every driver over to a new driver runner will likely be impractical, however, so we would end up needing to maintain both in parallel, which has costs of its own. As such, we really want to get this approach mostly right to avoid needing to take that course.

Switching drivers to a new threading model is also a large cost to pay, and may introduce new bugs along the way. Many drivers lack tests, and for drivers that do have tests, the unit tests may lose their validity after the switch and may have to be rewritten alongside the transition. We have written a great deal of our driver tests as integration tests, which should remain valid after the migration without any changes. We will continue to invest in more integration tests and e2e tests prior to migration to prevent the introduction of new bugs.

Estimating the migration timeline is another large risk. It is hard to estimate the cost accurately without having built a replacement and trialed the migration on at least one driver. We will need to remain cognizant of the cost as we implement our design, and automate as much of the migration as possible.