This page describes how Starnix handles Linux system calls (syscalls) in Fuchsia.
Address space of a Linux program in a Zircon process
Before a Linux program can make any syscalls, it must be loaded into a Zircon
process. starnix_kernel
is the component responsible for both
loading and executing Linux programs in Fuchsia.
starnix_kernel
first creates a Zircon process with an address space that is
divided into halves. The top half is a shared address space and the bottom half is
a private address space. starnix_kernel
then sets up the address space of the
Zircon process in the following way:
- The Linux program is loaded into the process's private address space (bottom half).
starnix_kernel
itself is loaded into the process's shared address space (top half).
The Linux portion of the address space is constructed differently, compared to that of a normal Fuchsia process, to match the layout expected of a Linux program.
Figure 1. The address space of a Zircon process loaded with a Linux program
and starnix_kernel
.
Shared Starnix instance
Each Linux program is loaded into a separate Zircon process and each Linux
thread runs in a dedicated Zircon thread. However, all the Zircon processes in a
Starnix container share the same starnix_kernel
instance.
This shared starnix_kernel
instance manages all the state of the Linux programs
in the container, such as file descriptor tables, a global virtual filesystem,
and a table of running threads.
For more information on the state shared between the Zircon processes
that run the starnix_kernel
, see the restricted mode RFC and
ShareableProcessState
struct.
Figure 2. The starnix_kernel
instance is shared across all Linux processes
in the same container.
When in normal execution mode, any thread in these processes can access the shared address space. However, in restricted execution mode, the thread can only access the private address space of its own process. In other words, when the Linux program is running in a process, Zircon restricts its thread's access to only the Linux (private) portion of the address space in the process.
Figure 3. Restricted mode prevents Linux threads from accessing the
starnix_kernel
portion of the address space.
Running a Linux program in restricted mode
Once the Linux program is loaded to a process, starnix_kernel
directs Zircon
to start the process. The process starts executing in the starnix_kernel
portion of the address space where it enters the syscall dispatch loop.
starnix_kernel
inspects the task state to determine where to enter the Linux
program. The restricted_enter()
method shown below is used to enter a specific
location in the Linux program (for more information, see the
zx_restricted_enter
syscall in Zircon).
// Enter the restricted portion of the address space at the location pointed to by `rip`.
restricted_enter(task.registers.rip);
However, the complete call is more complicated than what's shown above. The
method takes the full register state of restricted mode as input through a separate
VMO called a "sidecar." (For more information on this VMO "sidecar" which stores the
zx_restricted_state
object containing the architectural state used by restricted
mode, see the zx_restricted_bind_state
syscall in Zircon.)
After the restricted_enter()
method is called, Zircon will note that the thread
is entering restricted mode and set the thread to access only the private
address space before jumping to the Linux code.
The flow of a Linux syscall
The process continues to run until the Linux program makes a syscall. The assembly instruction that executes a syscall varies by architecture, but they all result in the transfer of control to Zircon. When Zircon inspects the thread and sees that the thread is in restricted mode, instead of handling the syscall itself, Zircon immediately jumps back to Starnix.
To discover which syscall the Linux program made, Starnix reads the restricted mode register state that Zircon saved. Then Starnix invokes the handler for this particular syscall. The result of this syscall is saved to the restricted mode register state as well.
Once the syscall is returned, Starnix again instructs Zircon to put the thread into restricted mode. This loop continues until the thread exits.
Figure 4. The flow of a Linux syscall in a Zircon process.
Below is the pseudocode representation of the flow in Figure 4:
loop {
// Enter restricted mode at the instruction pointer address.
restricted_enter(task.registers.rip);
// Update the register state of the task to match what Zircon saved before exiting
// restricted mode.
task.registers = thread_state_general_regs_t::from(&restricted_state);
// Figure out which syscall was made based on the appropriate register value.
let syscall = Syscall::from_number(task.registers.rax);
// Execute the syscall for the current task.
execute_syscall(task, syscall);
// After handling the syscall, check for signals, exit status, etc. before
// continuing the loop.
if let Some(exit_status) = process_completed_syscall(task)? {
return Ok(exit_status);
}
}