Hardware Interfacing

This document is part of the Driver Development Kit tutorial documentation.

Overview

In past chapters, we saw how the protocol stack was organized within a devhost, and some of the work that goes into binding the individual driver protocols into a device driver.

In this section, we'll look at practical considerations of dealing with hardware such as determining configuration, binding to interrupts, allocating memory, and performing DMA operations.

Here, we'll look at the concepts involved, and show snippets of code as required. Complete working code is shown in subsequent chapters (e.g., Ethernet Devices).

For the most part, we'll focus on the PCI bus, and we'll cover the following functions:

  • Access related:
    • pci_map_bar()
  • Interrupt related:
    • pci_map_interrupt()
    • pci_query_irq_mode()
    • pci_set_irq_mode()
  • DMA related:
    • pci_enable_bus_master()
    • pci_get_bti()

Configuration

Hardware peripherals are attached to the CPU via a bus, such as the PCI bus.

During bootup, the BIOS (or equivalent platform startup software) discovers all of the peripherals attached to the PCI bus. Each peripheral is assigned resources (notably interrupt vectors, and address ranges for configuration registers).

The impact of this is that the actual resources assigned to each peripheral may be different across reboots. When the operating system software starts up, it enumerates the bus and starts drivers for all supported devices. The drivers then call PCI functions in order to obtain configuration information about their device(s) so that they can map registers and bind to interrupts.

Base address register

The Base Address Register (BAR) is a configuration register that exists on each PCI device. It's where the BIOS stores information about the device, such as the assigned interrupt vector and addresses of control registers. Other, device specific information, is stored there as well.

Call pci_map_bar() to cause the BAR register to be mapped into the devhost's address space:

zx_status_t pci_map_bar(const pci_protocol_t* pci, uint32_t bar_id,
                        uint32_t cache_policy, void** vaddr, size_t* size,
                        zx_handle_t* out_handle);

The first parameter, pci, is a pointer to the PCI protocol. Typically, you obtain this in your bind() function via device_get_protocol().

The second parameter, bar_id, is the BAR register number, starting with 0.

The third parameter, cache_policy, determines the caching policy for access, and can take on the following values:

cache_policy value               Meaning
ZX_CACHE_POLICY_CACHED           use hardware caching
ZX_CACHE_POLICY_UNCACHED         disable caching
ZX_CACHE_POLICY_UNCACHED_DEVICE  disable caching, and treat as device memory
ZX_CACHE_POLICY_WRITE_COMBINING  uncached with write combining

Note that ZX_CACHE_POLICY_UNCACHED_DEVICE is architecture dependent and may in fact be equivalent to ZX_CACHE_POLICY_UNCACHED on some architectures.

The next three arguments are return values. The vaddr and size arguments return a pointer to the mapped register region and its length, while out_handle returns the handle to the VMO that was created for the mapping.

Reading and writing memory

Once the pci_map_bar() function returns with a valid result, you can access the BAR via simple pointer operations, for example:

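volatile uint32_t* base;
size_t size;
zx_handle_t handle;
...
zx_status_t rc;
// pci_map_bar() takes a void**, so cast the address of our typed pointer
rc = pci_map_bar(dev->pci, 0, ZX_CACHE_POLICY_UNCACHED_DEVICE, (void**)&base, &size, &handle);
if (rc == ZX_OK) {
    base[REGISTER_X] = 0x1234;  // configure register X for deep sleep mode
}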

It's important to declare base as volatile — this tells the compiler not to make any assumptions about the contents of the data that base points to. For example:

int timeout = 1000;
while (timeout-- > 0 && !(base[REGISTER_READY] & READY_BIT)) ;

is a typical (bounded) polling loop, intended for short polling sequences. Without the volatile keyword in the declaration, the compiler would have no reason to believe that the value at base[REGISTER_READY] would ever change, so it would be free to read the register only once and reuse that value on every iteration.
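
As a minimal sketch of the pattern above, the register accesses can be wrapped in small helpers so that every access goes through a volatile pointer. The register indices, the STATUS_READY bit, and the helper names are hypothetical (they're not part of the DDK); outside of a driver, a plain array can stand in for the mapped BAR.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical register layout, expressed as 32-bit word indices into the BAR.
enum { REG_CONTROL = 0, REG_STATUS = 1 };
#define STATUS_READY (1u << 0)  // hypothetical "device ready" bit

// Read one 32-bit register; volatile forces a real load on every call.
static inline uint32_t reg_read32(volatile uint32_t* base, unsigned index) {
    return base[index];
}

// Write one 32-bit register; volatile forces a real store on every call.
static inline void reg_write32(volatile uint32_t* base, unsigned index, uint32_t value) {
    base[index] = value;
}

// Poll REG_STATUS until STATUS_READY is set, giving up after max_polls reads.
// Returns true if the device became ready, false on timeout.
static bool wait_ready(volatile uint32_t* base, int max_polls) {
    while (max_polls-- > 0) {
        if (reg_read32(base, REG_STATUS) & STATUS_READY) {
            return true;
        }
    }
    return false;
}
```

Returning a bool from the poll makes the timeout case explicit, so the caller can't mistake "gave up" for "ready".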

Interrupts

An interrupt is an asynchronous event, generated by a device when it needs servicing. For example, an interrupt is generated when data is available on a serial port, or an ethernet packet has arrived. Interrupts allow a driver to know about an event as soon as it occurs, but without the driver spending time polling (actively waiting) for it.

The general architecture of a driver that uses interrupts is that a background Interrupt Handling Thread (IHT) is created during the driver startup / binding operation. This thread waits for an interrupt to happen, and, when it does, performs some kind of servicing action.

As an example, consider a serial port driver. It may receive interrupts due to any of the following events happening:

  • one or more characters have arrived,
  • room is now available to transmit one or more characters,
  • a control line (like DTR, for example) has changed state.

The interrupt wakes up the IHT. The IHT determines the cause of the event, usually by reading some status registers. Then, it runs an appropriate service function to handle the event. Once done, the IHT goes back to sleep, waiting for the next interrupt.

For example, if a character arrives, the IHT wakes up, reads a status register that indicates "data is available," and then calls a function that drains all available characters from the serial port FIFO into the driver's buffer.
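
The "determine the cause, then service it" step might look like the following sketch. The status bits and the uart_stats_t counters are hypothetical stand-ins; a real driver would read the status value from a device register and call its own service functions instead of bumping counters.

```c
#include <stdint.h>

// Hypothetical status register bits for an imaginary serial port.
#define UART_STATUS_RX_READY (1u << 0)  // one or more characters have arrived
#define UART_STATUS_TX_ROOM  (1u << 1)  // room is available to transmit
#define UART_STATUS_CTRL_CHG (1u << 2)  // a control line has changed state

// Counters standing in for the real service functions.
typedef struct {
    unsigned rx_serviced;
    unsigned tx_serviced;
    unsigned ctrl_serviced;
} uart_stats_t;

// Called by the IHT after each wakeup, with the value it read from the
// status register. Returns the number of causes that were handled.
static int uart_service(uart_stats_t* stats, uint32_t status) {
    int handled = 0;
    if (status & UART_STATUS_RX_READY) {
        stats->rx_serviced++;  // a real driver would drain the receive FIFO here
        handled++;
    }
    if (status & UART_STATUS_TX_ROOM) {
        stats->tx_serviced++;  // a real driver would refill the transmit FIFO here
        handled++;
    }
    if (status & UART_STATUS_CTRL_CHG) {
        stats->ctrl_serviced++;  // a real driver would record the new line state here
        handled++;
    }
    return handled;
}
```

Note that each bit is tested independently, because more than one cause can be pending when the IHT wakes up.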

No kernel-level code required

You may be familiar with other operating systems which use Interrupt Service Routines (ISR). These are kernel-level handlers that run in privileged mode and interface with the interrupt controller hardware.

In Fuchsia, the kernel deals with the privileged part of the interrupt handling, and provides thread-level functions for driver use.

The difference is that the IHT runs at thread level, whereas the ISR runs at kernel level in a very restricted (and sometimes fragile) environment. A principal advantage is that if the IHT crashes, it takes out only the driver, whereas a failing ISR can take out the entire operating system.

Attaching to an interrupt

Currently, the only bus that provides interrupts is the PCI bus. It supports two kinds: legacy and Message Signaled Interrupts (MSI).

Therefore, in order to use interrupts on PCI:

  1. determine which kind your device supports (legacy or MSI),
  2. set the interrupt mode to match,
  3. get a handle to your device's interrupt vector (usually one, but may be multiple),
  4. start the IHT background thread,
  5. arrange for the IHT to wait for interrupts (on the handle(s) from step 3).

Steps 1 and 2 are usually done closely together, for example:

// Query whether we have MSI or Legacy interrupts.
uint32_t irq_cnt = 0;
if ((pci_query_irq_mode(&edev->pci, ZX_PCIE_IRQ_MODE_MSI, &irq_cnt) == ZX_OK) &&
    (pci_set_irq_mode(&edev->pci, ZX_PCIE_IRQ_MODE_MSI, 1) == ZX_OK)) {
    // using MSI interrupts
} else if ((pci_query_irq_mode(&edev->pci, ZX_PCIE_IRQ_MODE_LEGACY, &irq_cnt) == ZX_OK) &&
           (pci_set_irq_mode(&edev->pci, ZX_PCIE_IRQ_MODE_LEGACY, 1) == ZX_OK)) {
    // using legacy interrupts
} else {
    // an error
}

The pci_query_irq_mode() function takes three arguments:

zx_status_t pci_query_irq_mode(const pci_protocol_t* pci,
                               zx_pci_irq_mode_t mode,
                               uint32_t* out_max_irqs);

The first argument, pci, is a pointer to the PCI protocol stack bound to your device just like we saw above, in the BAR documentation.

The second argument, mode, is the kind of interrupt that you are interested in; it's one of the two constants shown in the example.

The third argument, out_max_irqs, is a pointer to an integer that receives the number of interrupts of the specified type that your device supports.

Having determined the kind of interrupt supported, you then call pci_set_irq_mode() to indicate that this is indeed the kind of interrupt that you wish to use.

Finally, you call pci_map_interrupt() to create a handle to the selected interrupt. Note that pci_map_interrupt() has the following prototype:

zx_status_t pci_map_interrupt(const pci_protocol_t* pci,
                              int which_irq,
                              zx_handle_t* out_handle);

The first argument is the same as in the previous calls. The second argument, which_irq, indicates the device-relative interrupt number you'd like, and the third argument is a pointer to the created interrupt handle.

You now have an interrupt handle.

Note that the vast majority of devices have just one interrupt, so simply passing 0 for which_irq is normal. If your device does have more than one interrupt, the common practice is to run the pci_map_interrupt() function in a for loop and bind handles to each interrupt.

Waiting for the interrupt

In your IHT, you call zx_interrupt_wait() to wait for the interrupt. The following prototype applies:

zx_status_t zx_interrupt_wait(zx_handle_t handle,
                              zx_time_t* out_timestamp);

The first argument is the handle you obtained via the call to pci_map_interrupt(), and the second parameter can be NULL (typical), or it can be a pointer to a time stamp that indicates when the interrupt was triggered (in nanoseconds, relative to the clock source ZX_CLOCK_MONOTONIC).

Therefore, a typical IHT would have the following shape:

static int irq_thread(void* arg) {
    my_device_t* dev = arg;
    for (;;) {
        zx_status_t rc;
        rc = zx_interrupt_wait(dev->irq_handle, NULL);
        // do stuff
    }
}

The convention is that the argument passed to the IHT is your device context block. The context block has a member (here irq_handle) that is the handle you obtained via pci_map_interrupt().

Edge vs level interrupt mode

The interrupt hardware can operate in one of two modes: "edge" or "level".

In edge mode, the interrupt is armed on the active-going edge (when the hardware signal goes from inactive to active), and works as a one-shot. That is, the signal must go back to inactive before it can be recognized again.

In level mode, the interrupt is active when the hardware signal is in the active state.

Typically, edge mode is used when the interrupt is dedicated, and level mode is used when the interrupt is shared by multiple devices (because you want the interrupt to remain active until all devices have de-asserted their request line).

The Zircon kernel automatically masks and unmasks the interrupt as appropriate. For level-triggered hardware interrupts, zx_interrupt_wait() masks the interrupt before returning, and unmasks it when called the next time. For edge-triggered interrupts, the interrupt remains unmasked.

The IHT should not perform any long-running tasks. For drivers that perform lengthy tasks, use a worker thread.

Shutting down a driver that uses interrupts

In order to cleanly shut down a driver that uses interrupts, you can use zx_interrupt_destroy() to abort the zx_interrupt_wait() call.

The idea is that when the foreground thread determines that the driver should be shut down, it simply destroys the interrupt handle, causing the IHT to shut down:

static void main_thread() {
    ...
    if (shutdown_requested) {
        // destroy the handle, this will cause zx_interrupt_wait() to pop
        zx_interrupt_destroy(dev->irq_handle);

        // wait for the IHT to finish
        thrd_join(dev->iht, NULL);
    }
    ...
}

static int irq_thread(void* arg) {
    ...
    for(;;) {
        zx_status_t rc;
        rc = zx_interrupt_wait(dev->irq_handle, NULL);
        if (rc == ZX_ERR_CANCELED) {
            // we are being shut down, do any cleanups required
            ...
            return 0;
        }
        ...
    }
}

The main thread, when requested to shut down, destroys the interrupt handle. This causes the IHT's zx_interrupt_wait() call to wake up with an error code. The IHT looks at the error code (in this case, ZX_ERR_CANCELED) and makes the decision to end. Meanwhile, the main thread is waiting to join the IHT via the call to thrd_join(). Once the IHT exits, thrd_join() returns, and the main thread can finish its processing.

The advanced reader is invited to look at the other interrupt-related functions that are available.

DMA

Direct Memory Access (DMA) is a feature that allows hardware to access memory without CPU intervention. At the highest level, the hardware is given the source and destination of the memory region to transfer (along with its size) and told to copy the data. Some hardware peripherals even support the ability to do multiple "scatter / gather" style operations, where several copy operations can be performed, one after the other, without additional CPU intervention.

DMA considerations

In order to fully appreciate the issues involved, it's important to keep the following in mind:

  • each process operates in a virtual address space,
  • an MMU can map a contiguous virtual address range onto multiple, discontiguous physical address ranges (and vice-versa),
  • each process has a limited window into physical address space,
  • some peripherals support their own virtual addresses via an Input / Output Memory Management Unit (IOMMU).

Let's discuss each point in turn.

Virtual, physical, and device-physical addresses

The addresses that the process has access to are virtual; that is, they are an illusion created by the CPU's Memory Management Unit (MMU). A virtual address is mapped by the MMU into a physical address. The mapping granularity is based on a parameter called "page size," which is at least 4k bytes, though larger sizes are available on modern processors.

Figure: Relationship between virtual and physical addresses

In the diagram above, we show a specific process (process 12) with a number of virtual addresses (in blue). The MMU is responsible for mapping the blue virtual addresses into CPU physical bus addresses (red). Each process has its own mapping; so even though process 12 has a virtual address 300, some other process may also have a virtual address 300. That other process's virtual address 300 (if it exists) would be mapped to a different physical address than the one in process 12.

Note that we've used small decimal numbers as "addresses" to keep the discussion simple. In reality, each square shown above represents a page of memory (4k or more), and is identified by a 32 or 64 bit value (depending on the platform).

The key points shown in the diagram are:

  1. virtual addresses can be allocated in groups (three are shown, 300-303, 420-421, and 770-771),
  2. virtually contiguous (e.g., 300-303) is not necessarily physically contiguous,
  3. some virtual addresses are not mapped (for example, there is no virtual address 304),
  4. not all physical addresses are available to each process (for example, process 12 doesn't have access to physical address 120).

Depending on the hardware available on the platform, a device's address space may or may not follow a similar translation. Without an IOMMU, the addresses that the peripheral uses are the same as the physical addresses used by the CPU:

Figure: A device that doesn't use an IOMMU

In the diagram above, portions of the device's address space (for example, a frame buffer, or control registers), appear directly in the CPU's physical address range. That is to say, the device occupies physical addresses 122 through 125 inclusive.

In order for the process to access the device's memory, it would need to create an MMU mapping from some virtual addresses to the physical addresses 122 through 125. We'll see how to do that, below.

But with an IOMMU, the addresses seen by a peripheral may be different than the CPU's physical addresses:

Figure: A device that uses an IOMMU

Here, the device has its own "device-physical" addresses that it knows about, that is, addresses 0 through 3 inclusive. It's up to the IOMMU to map the device-physical addresses 0 through 3 into CPU physical addresses 109, 110, 101, and 119, respectively.

In this scenario, in order for the process to use the device's memory, it needs to arrange two mappings:

  • one set from the virtual address space (e.g., 300 through 303) to the CPU physical address space (109, 110, 101, and 119, respectively), via the MMU, and
  • one set from the CPU physical address space (addresses 109, 110, 101, and 119) to the device-physical addresses (0 through 3) via the IOMMU.
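
To make the two mappings concrete, here's a toy model that uses the same small page numbers as the discussion above. It's purely illustrative: real translations are done by the MMU and IOMMU hardware, and the table layout and function names are made up for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define UNMAPPED UINT32_MAX  // marker for "no translation exists"

// One translation entry: a source page and the CPU physical page behind it.
typedef struct {
    uint32_t from;
    uint32_t to;
} page_map_t;

// MMU view for the process: virtual pages 300 through 303.
static const page_map_t mmu_table[] = {
    {300, 109}, {301, 110}, {302, 101}, {303, 119},
};

// IOMMU view for the device: device-physical pages 0 through 3.
static const page_map_t iommu_table[] = {
    {0, 109}, {1, 110}, {2, 101}, {3, 119},
};

// Linear lookup; returns UNMAPPED if the page has no entry in the table.
static uint32_t translate(const page_map_t* table, size_t count, uint32_t page) {
    for (size_t i = 0; i < count; i++) {
        if (table[i].from == page) {
            return table[i].to;
        }
    }
    return UNMAPPED;
}

// Virtual page (as seen by the process) to CPU physical page.
static uint32_t virt_to_phys(uint32_t vpage) {
    return translate(mmu_table, sizeof(mmu_table) / sizeof(mmu_table[0]), vpage);
}

// Device-physical page (as seen by the peripheral) to CPU physical page.
static uint32_t dev_to_phys(uint32_t dpage) {
    return translate(iommu_table, sizeof(iommu_table) / sizeof(iommu_table[0]), dpage);
}
```

In this model, virtual page 302 and device-physical page 2 both resolve to CPU physical page 101, which is exactly what lets the driver and the peripheral share the same memory.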

While this may seem complicated, Zircon provides an abstraction that removes the complexity.

Also, as we'll see below, the reason for having an IOMMU, and the benefits provided, are similar to those obtained by having an MMU.

Contiguity of memory

When you allocate a large chunk of memory (e.g. via calloc()), your process will, of course, see a large, contiguous virtual address range. The MMU creates the illusion of contiguous memory at the virtual addressing level, even though the MMU may choose to back that memory area with physically discontiguous memory at the physical address level.

Furthermore, as processes allocate and deallocate memory, the mapping of physical memory to virtual address space tends to become more complex, encouraging more "swiss cheese" holes to appear (that is, more discontiguities in the mapping).

Therefore, it's important to keep in mind that contiguous virtual addresses are not necessarily contiguous physical addresses, and indeed that contiguous physical memory becomes more precious over time.

Access controls

Another benefit of the MMU is that processes are limited in their view of physical memory (for security and reliability reasons). The impact on drivers, though, is that a process has to specifically request a mapping from virtual address space to physical address space, and have the requisite privilege in order to do so.

IOMMU

Contiguous physical memory is generally preferred. It's more efficient to do one transfer (with one source address and one destination address) than it is to set up and manage multiple individual transfers (which may require CPU intervention between each transfer in order to set up the next one).

The IOMMU, if available, alleviates this problem by doing the same thing for the peripherals that the CPU's MMU does for the process — it gives the peripheral the illusion that it's dealing with a contiguous address space by mapping multiple discontiguous chunks into a virtually contiguous space. By limiting the mapping region, the IOMMU also provides security (in the same way as the MMU does), by preventing the peripheral from accessing memory that's not "in scope" for the current operation.

Tying it all together

So, it may appear that you need to worry about virtual, physical, and device-physical address spaces when you are writing your driver. But that's not the case.

DMA and your driver

Zircon provides a set of functions that allow you to cleanly deal with all of the above. The following work together:

  • a Bus Transaction Initiator (BTI), and
  • a Virtual Memory Object (VMO).

The BTI kernel object provides an abstraction of the model, and an API to deal with physical (or device-physical) addresses associated with VMOs.

In your driver's initialization, call pci_get_bti() to obtain a BTI handle:

zx_status_t pci_get_bti(const pci_protocol_t* pci,
                        uint32_t index,
                        zx_handle_t* bti_handle);

The pci_get_bti() function takes a pci protocol pointer (just like all the other pci_...() functions discussed above) and an index (reserved for future use, use 0). It returns a BTI handle through the bti_handle pointer argument.

Next, you need a VMO. Simplistically, you can think of the VMO as a pointer to a chunk of memory, but it's more than that — it's a kernel object that represents a set of virtual pages (that may or may not have physical pages committed to them), which can be mapped into the virtual address space of the driver process. (It's even more than that, but that's a discussion for a different chapter.)

Ultimately, these pages serve as the source or destination of the DMA transfer.

There are two functions, zx_vmo_create() and zx_vmo_create_contiguous() that allocate memory and bind it to a VMO:

zx_status_t zx_vmo_create(uint64_t size,
                          uint32_t options,
                          zx_handle_t* out);

zx_status_t zx_vmo_create_contiguous(zx_handle_t bti,
                                     size_t size,
                                     uint32_t alignment_log2,
                                     zx_handle_t* out);

As you can see, they both take a size parameter indicating the number of bytes required, and they both return a VMO (via out). They both allocate virtually contiguous pages, for a given size.

Note that this differs from the standard C library memory allocation functions, (e.g., malloc()), which allocate virtually contiguous memory, but without regard to page boundaries. Two small malloc() calls in a row might allocate two memory regions from the same page, for instance, whereas the VMO creation functions will always allocate memory starting with a new page.

The zx_vmo_create_contiguous() function does what zx_vmo_create() does, and additionally ensures that the pages are suitably organized for use with the specified BTI (which is why it needs the BTI handle). It also features an alignment_log2 parameter that can be used to specify a minimum alignment requirement. As the name suggests, the value is the base-2 logarithm of the alignment in bytes, so the alignment itself is always a power of 2 (with the value 0 indicating page alignment).
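
Because the parameter name indicates a base-2 logarithm rather than a byte count, a small conversion helper can be handy when a datasheet states the alignment in bytes. The helper below is a hypothetical convenience function, not part of Zircon.

```c
#include <stdint.h>

// Converts an alignment in bytes to its base-2 logarithm.
// Returns -1 if the alignment is zero or is not a power of 2.
static int alignment_to_log2(uint64_t alignment) {
    if (alignment == 0 || (alignment & (alignment - 1)) != 0) {
        return -1;
    }
    int shift = 0;
    while (alignment > 1) {
        alignment >>= 1;
        shift++;
    }
    return shift;
}
```

For example, a device that needs its buffer aligned to 64 KiB would use alignment_to_log2(65536), which yields 16.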

At this point, you have two "views" of the allocated memory:

  • one contiguous virtual address space that represents memory from the point of view of the driver, and
  • a set of (possibly contiguous, possibly committed) physical pages for use by the peripheral.

Before using these pages, you need to ensure that they are present in memory (that is, "committed" — the physical pages are accessible to your process), and that the peripheral has access to them (via the IOMMU if present). You will also need the addresses of the pages (from the point of view of the device) so that you can program the DMA controller on your device to access them.

The zx_bti_pin() function is used to do all that:

#include <zircon/syscalls.h>

zx_status_t zx_bti_pin(zx_handle_t bti, uint32_t options,
                       zx_handle_t vmo, uint64_t offset, uint64_t size,
                       zx_paddr_t* addrs, size_t addrs_count,
                       zx_handle_t* pmt);

There are 8 parameters to this function:

Parameter    Purpose
bti          the BTI for this peripheral
options      options (see below)
vmo          the VMO for this memory region
offset       offset from the start of the VMO
size         number of bytes to pin, starting at offset
addrs        list of return addresses
addrs_count  number of elements in addrs
pmt          returned PMT (see below)

The addrs parameter is a pointer to an array of zx_paddr_t that you supply. The peripheral address of each page is returned in this array. The array is addrs_count elements long, and must match the count of elements expected from zx_bti_pin().

The values written into addrs are suitable for programming the peripheral's DMA controller — that is, they take into account any translations that may be performed by an IOMMU, if present.
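
What "programming the DMA controller" means depends entirely on the device. As one hedged illustration, suppose a device that walks an array of descriptors, each holding an address, a length, and a "last" flag. The descriptor layout, DMA_DESC_LAST, and fill_descriptors() are all hypothetical, and paddr_t is a local stand-in for zx_paddr_t so that the sketch compiles outside of Fuchsia.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t paddr_t;  // local stand-in for zx_paddr_t

// Hypothetical scatter / gather descriptor, as some imaginary device expects it.
typedef struct {
    uint64_t addr;    // address from the point of view of the device
    uint32_t length;  // number of bytes at that address
    uint32_t flags;   // DMA_DESC_LAST on the final descriptor
} dma_desc_t;

#define DMA_DESC_LAST (1u << 0)

// Builds one descriptor per entry in addrs, each covering chunk_size bytes.
// Returns the number of descriptors written, or 0 if they don't fit.
static size_t fill_descriptors(dma_desc_t* descs, size_t desc_count,
                               const paddr_t* addrs, size_t addrs_count,
                               uint32_t chunk_size) {
    if (addrs_count == 0 || addrs_count > desc_count) {
        return 0;
    }
    for (size_t i = 0; i < addrs_count; i++) {
        descs[i].addr = addrs[i];
        descs[i].length = chunk_size;
        descs[i].flags = (i == addrs_count - 1) ? DMA_DESC_LAST : 0;
    }
    return addrs_count;
}
```

The driver never needs to know whether the values in addrs are physical or device-physical; it simply passes them through to the hardware.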

On a technical note, the other effect of zx_bti_pin() is that the kernel will ensure those pages are not decommitted (i.e., moved or reused) while pinned.

The options argument is actually a bitmap of options:

Option             Purpose
ZX_BTI_PERM_READ   pages can be read by the peripheral (written by the driver)
ZX_BTI_PERM_WRITE  pages can be written by the peripheral (read by the driver)
ZX_BTI_COMPRESS    (see "Minimum contiguity property," below)

For example, refer to the diagrams above showing "Device #3". If an IOMMU is present, addrs would contain 0, 1, 2, and 3 (that is, the device-physical addresses). If no IOMMU is present, addrs would contain 109, 110, 101, and 119 (that is, the physical addresses).

Permissions

Keep in mind that the permissions are from the perspective of the peripheral, and not the driver. For example, in a block device write operation, the device reads from memory pages and therefore the driver specifies ZX_BTI_PERM_READ, and vice versa in the block device read.
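
Since it's easy to get the direction backwards, one option is to centralize the choice in a helper. This is only a sketch: xfer_dir_t and bti_perms_for() are hypothetical names, and the #define values are stand-ins so the code compiles outside of Fuchsia (a real driver takes the constants from the Zircon headers).

```c
#include <stdint.h>

// Stand-in values for building outside of Fuchsia; a real driver
// gets these constants from the Zircon headers instead.
#ifndef ZX_BTI_PERM_READ
#define ZX_BTI_PERM_READ  (1u << 0)
#define ZX_BTI_PERM_WRITE (1u << 1)
#endif

// Direction of the transfer, from the driver's point of view.
typedef enum {
    XFER_TO_DEVICE,    // e.g., block write: the peripheral reads memory
    XFER_FROM_DEVICE,  // e.g., block read: the peripheral writes memory
} xfer_dir_t;

// Returns the permission to request when pinning, which is expressed
// from the peripheral's point of view.
static uint32_t bti_perms_for(xfer_dir_t dir) {
    return (dir == XFER_TO_DEVICE) ? ZX_BTI_PERM_READ : ZX_BTI_PERM_WRITE;
}
```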

Minimum contiguity property

By default, each address returned through addrs is one page long. Larger chunks may be requested by setting the ZX_BTI_COMPRESS option in the options argument. In that case, the length of each entry returned corresponds to the "minimum contiguity" property. While you can't set this property, you can read it via zx_object_get_info(). Effectively, the minimum contiguity property is a guarantee that zx_bti_pin() will always be able to return addresses that are contiguous for at least that many bytes.

For example, if the property had the value 1MB, then a call to zx_bti_pin() with a requested size of 2MB would return at most two physically-contiguous runs. If the requested size was 2.5MB, it would return at most three physically-contiguous runs, and so on.
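
The arithmetic in this example is a ceiling division, which can be sketched as a helper for sizing the addrs array. The name addrs_needed() is hypothetical, and the result is an upper bound derived from the description above; check the zx_bti_pin() reference for the exact count that the kernel expects.

```c
#include <stddef.h>
#include <stdint.h>

// Upper bound on the number of address entries needed to cover size bytes
// when each entry covers chunk bytes (the page size by default, or the
// minimum contiguity when ZX_BTI_COMPRESS is used).
// Returns 0 if chunk is 0.
static size_t addrs_needed(uint64_t size, uint64_t chunk) {
    if (chunk == 0) {
        return 0;
    }
    return (size_t)((size + chunk - 1) / chunk);  // round up
}
```

With a minimum contiguity of 1MB, addrs_needed() gives 2 for a 2MB request and 3 for a 2.5MB request, matching the numbers above.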

Pinned Memory Token (PMT)

zx_bti_pin() returns a Pinned Memory Token (PMT) upon success in the pmt argument. The driver must call zx_pmt_unpin() when the device is done with the memory transaction to unpin and revoke access to the memory pages by the device.

Advanced topics

Cache Coherency

On fully DMA-coherent architectures, hardware ensures the data in the CPU cache is the same as the data in main memory without software intervention. Not all architectures are DMA-coherent. On these systems, the driver must ensure the CPU cache is made coherent by invoking appropriate cache operations on the memory range before performing DMA operations, so that no stale data will be accessed.

To invoke cache operations on the memory represented by VMOs, use the zx_vmo_op_range() syscall. Prior to a peripheral-read (driver-write) operation, clean the cache using ZX_VMO_OP_CACHE_CLEAN to write out dirty data to main memory. Prior to a peripheral-write (driver-read), mark the cache lines as invalid using ZX_VMO_OP_CACHE_INVALIDATE to ensure data is fetched from main memory on the next access.
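
The rule above can be captured in a small helper that picks the operation to pass to zx_vmo_op_range(). This is a sketch only: dma_dir_t and cache_op_before_dma() are hypothetical names, and the #define values are placeholders (not necessarily the real ones) so that the code compiles outside of Fuchsia.

```c
#include <stdint.h>

// Placeholder values for building outside of Fuchsia; a real driver
// gets these constants from the Zircon headers instead.
#ifndef ZX_VMO_OP_CACHE_CLEAN
#define ZX_VMO_OP_CACHE_INVALIDATE 7u
#define ZX_VMO_OP_CACHE_CLEAN      8u
#endif

// Which side touches the memory during the DMA operation.
typedef enum {
    DMA_DEVICE_READS,   // peripheral-read (driver-write)
    DMA_DEVICE_WRITES,  // peripheral-write (driver-read)
} dma_dir_t;

// Returns the cache operation to perform on the VMO range before the
// DMA operation starts.
static uint32_t cache_op_before_dma(dma_dir_t dir) {
    if (dir == DMA_DEVICE_READS) {
        // Push dirty cache lines to main memory so the device sees them.
        return ZX_VMO_OP_CACHE_CLEAN;
    }
    // Drop cached copies so the driver later fetches what the device wrote.
    return ZX_VMO_OP_CACHE_INVALIDATE;
}
```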