Making Linux syscalls in Fuchsia

This page describes how Starnix handles Linux system calls (syscalls) in Fuchsia.

Address space of a Linux program in a Zircon process

Before a Linux program can make any syscalls, it must be loaded into a Zircon process. starnix_kernel is the component responsible for both loading and executing Linux programs in Fuchsia.

starnix_kernel first creates a Zircon process with an address space that is divided into halves. The top half is a shared address space and the bottom half is a private address space. starnix_kernel then sets up the address space of the Zircon process in the following way:

The Linux program is loaded into the process's private address space (bottom half).
starnix_kernel itself is loaded into the process's shared address space (top half).

The Linux portion of the address space is constructed differently, compared to that of a normal Fuchsia process, to match the layout expected of a Linux program.

Address space for a Zircon process

Figure 1. The address space of a Zircon process loaded with a Linux program and starnix_kernel.

Shared Starnix instance

Each Linux program is loaded into a separate Zircon process and each Linux thread runs in a dedicated Zircon thread. However, all the Zircon processes in a Starnix container share the same starnix_kernel instance. This shared starnix_kernel instance manages all the state of the Linux programs in the container, such as file descriptor tables, a global virtual filesystem, and a table of running threads.

For more information on the state shared between the Zircon processes that run the starnix_kernel, see the restricted mode RFC and ShareableProcessState struct.

Shared Starnix instance

Figure 2. The starnix_kernel instance is shared across all Linux processes in the same container.

When in normal execution mode, any thread in these processes can access the shared address space. However, in restricted execution mode, the thread can only access the private address space of its own process. In other words, when the Linux program is running in a process, Zircon restricts its thread's access to only the Linux (private) portion of the address space in the process.

Restricted mode

Figure 3. Restricted mode prevents Linux threads from accessing the starnix_kernel portion of the address space.

Running a Linux program in restricted mode

Once the Linux program is loaded to a process, starnix_kernel directs Zircon to start the process. The process starts executing in the starnix_kernel portion of the address space where it enters the syscall dispatch loop.

starnix_kernel inspects the task state to determine where to enter the Linux program. The restricted_enter() method shown below is used to enter a specific location in the Linux program (for more information, see the zx_restricted_enter syscall in Zircon).

// Enter the restricted portion of the address space at the location pointed to by `rip`.
restricted_enter(task.registers.rip);

However, the complete call is more complicated than what's shown above. The method takes the full register state of restricted mode as input through a separate VMO called a "sidecar." (For more information on this VMO "sidecar" which stores the zx_restricted_state object containing the architectural state used by restricted mode, see the zx_restricted_bind_state syscall in Zircon.)

After the restricted_enter() method is called, Zircon will note that the thread is entering restricted mode and set the thread to access only the private address space before jumping to the Linux code.

The flow of a Linux syscall

The process continues to run until the Linux program makes a syscall. The assembly instruction that executes a syscall varies by architecture, but they all result in the transfer of control to Zircon. When Zircon inspects the thread and sees that the thread is in restricted mode, instead of handling the syscall itself, Zircon immediately jumps back to Starnix.

To discover which syscall the Linux program made, Starnix reads the restricted mode register state that Zircon saved. Then Starnix invokes the handler for this particular syscall. The result of this syscall is saved to the restricted mode register state as well.

Once the syscall is returned, Starnix again instructs Zircon to put the thread into restricted mode. This loop continues until the thread exits.

Flow of a Linux syscall

Figure 4. The flow of a Linux syscall in a Zircon process.

Below is the pseudocode representation of the flow in Figure 4:

loop {
  // Enter restricted mode at the instruction pointer address.
  restricted_enter(task.registers.rip);

  // Update the register state of the task to match what Zircon saved before exiting
  // restricted mode.
  task.registers = thread_state_general_regs_t::from(&restricted_state);

  // Figure out which syscall was made based on the appropriate register value.
  let syscall = Syscall::from_number(task.registers.rax);

  // Execute the syscall for the current task.
  execute_syscall(task, syscall);

  // After handling the syscall, check for signals, exit status, etc. before
  // continuing the loop.
  if let Some(exit_status) = process_completed_syscall(task)? {
      return Ok(exit_status);
  }
}