Saturday, October 8, 2016

Using PCI pass through with bhyve

I have recently been using PCI pass through with bhyve while working on some device drivers for FreeBSD.  bhyve has included PCI pass through from day one, but I made a few changes recently to improve the experience for my particular workflow.

A Workflow

PCI pass through in bhyve requires two steps. First, the PCI device must be attached to the ppt(4) device driver included as part of bhyve. This driver claims the PCI device so that other drivers on the host cannot attach to it. It also interfaces with bhyve to ensure the device is only attached to a single VM and that it is correctly configured in the I/O MMU. Second, the PCI device must be added as a virtual PCI device of a bhyve instance.

Before the ppt(4) driver can be attached to a specific PCI device, the driver must be loaded. It is included as part of bhyve's vmm.ko:

kldload vmm.ko
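
If you prefer to have the module loaded automatically at boot instead, the usual approach is to add it to /boot/loader.conf (this is optional and not part of the workflow above):

vmm_load="YES"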

Once the driver is loaded, the "set driver" command from devctl(8) can be used to attach the ppt(4) driver to the desired PCI device.  In this example, the PCI device at bus 3, slot 0, function 8 is our target device:

devctl set driver pci0:3:0:8 ppt
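
To confirm that the switch took effect, pciconf(8) can list the device along with the driver now attached to it; after the command above it should show a ppt instance (the unit number will vary) on that selector:

pciconf -l pci0:3:0:8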

Finally, the bhyve instance must be informed about this device when starting.  The  /usr/share/examples/bhyve/vmrun.sh script has included a "-p" option to support this since r279925.  The command below uses vmrun.sh to start a VM running FreeBSD HEAD with the device passed through:

sh /usr/share/examples/bhyve/vmrun.sh -c 4 -d /dev/zvol/bhyve/head -p 3/0/8 head
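
Under the hood, vmrun.sh's "-p" option turns into a bhyve(8) "passthru" PCI slot of roughly this form (the slot number is arbitrary):

-s <slot>,passthru,3/0/8

so the same device can be attached when invoking bhyve directly. Note that guest memory must also be wired (bhyve's "-S" option) when a pass through device is configured.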

Once I am finished with testing, I can revert the PCI device back to its normal driver (if one exists) by using devctl's "clear driver" command:

devctl clear driver -f ppt0

Relevant Commits

The workflow described above requires several recent changes.  Most of these have been merged to stable/10 and stable/11 but are not present in 10.3 or 11.0.  However, some of them are optional or can be worked around.

I/O MMU Fixes

One of the requirements of using PCI pass through is the use of an I/O MMU to adjust DMA requests made by the PCI device so that the addresses in those requests are treated as guest physical addresses (GPAs) rather than host physical addresses (HPAs).  This means that the addresses in each DMA request must be mapped to the "real" host physical address using a set of page tables similar to the "nested page tables" used to map GPAs to HPAs in the CPU.

bhyve's PCI pass through support was originally designed to work with a static set of PCI devices assigned at boot time via a loader tunable.  As a result, it only enabled the I/O MMU if PCI pass through devices were configured when the kernel module was initialized.  It also didn't check to see if the I/O MMU was initialized when a PCI pass through device was added to a new VM.  Instead, if PCI pass through devices were only configured dynamically at runtime and added to a VM, their DMA requests were not translated and always operated on HPAs.  Generally this meant that devices simply didn't work since they were DMA'ing to/from the wrong memory (and quite possibly to memory not belonging to the VM).  This was resolved in r304858, though it can be worked around on older versions by setting the tunable hw.vmm.force_iommu=1 to enable the I/O MMU even if no PCI pass through devices are configured when the module is loaded.  One other edge case fixed by this commit is that bhyve will now fail to start a VM with a PCI pass through device if the I/O MMU fails to initialize (for example, because it is not present).
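
For reference, that workaround is a loader tunable, so on older versions it would be set before vmm.ko is loaded, for example in /boot/loader.conf:

hw.vmm.force_iommu="1"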

A related issue was that bhyve assumed that PCI devices marked for pass through would always remain dedicated to pass through and would never be used on the host system.  The result was that PCI devices attached to the ppt(4) device driver were only permitted by the I/O MMU configuration to perform DMA while they were active in a VM.  In particular, if a PCI device was detached from the ppt(4) driver on the host so that another driver could attach to it, the device would not be able to function on the host.  This was fixed in r305485 by permitting ppt(4) devices to use DMA on the host while they are not active in a VM.

Finally, FreeBSD 11 introduced other changes in the PCI system that permitted the dynamic arrival and departure of PCI devices in a running system through support for SR-IOV (creating/destroying VFs) and native PCI-express HotPlug.  (Technically, older versions of FreeBSD also supported dynamic arrival and departure of PCI devices via CardBus, but most systems with CardBus are too old to support the I/O MMU included with VT-d.)  The I/O MMU support in bhyve assumed a static PCI device tree and performed a single scan of that tree during I/O MMU initialization to populate the I/O MMU's device tables; the tables were not updated if PCI devices were later added or removed.  Revision r305497 adds event handler hooks that fire when PCI devices are added or removed, and bhyve uses these to keep the I/O MMU device tables up to date.

All of these fixes were merged to stable/11 in r306471.  The first two were merged to stable/10 in r306472.

To assist with debugging these issues and testing the fixes, I also hacked together a quick utility to dump the device tables used by the I/O MMU.  This tool is present in FreeBSD's source tree at tools/tools/dmardump. (r303887, merged in r306467)
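
If you want to try it yourself, the utilities under tools/tools are typically built in place with make; a rough sketch, assuming a source tree at /usr/src and that the tool needs to run as root to read the I/O MMU's tables:

cd /usr/src/tools/tools/dmardump
make
./dmardump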

PCI Device Resets

Commit r305502 added support for resetting PCI-express devices passed through to a VM via a Function Level Reset (FLR) on VM startup and teardown.  This ensures the device is idle and does not issue any DMA requests while being moved between the host and VM memory domains, and that it is quiesced to a reset state both when the VM starts and after the VM is torn down.  Support for this was merged to stable/10 and stable/11 in r306520.
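
Not every device implements FLR.  One way to check is pciconf(8)'s capability listing, which includes "FLR" in the PCI-Express capability line for devices that advertise it:

pciconf -lc pci0:3:0:8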

devctl

The devctl utility first appeared in FreeBSD 10.3 and included the "set driver" command from the start.  The "clear driver" command was added more recently in r305034 (merged in r306533).  However, "clear driver" is only needed to revert a device from the pass through driver back to its regular driver on the host.  On older systems, "set driver" with the name of the regular driver can be used as a workaround.
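
For example, on a system without "clear driver", a device normally claimed by (say) the ix(4) driver could be handed back with something like the following; the driver name here is only an illustration and depends on the device:

devctl set driver -f pci0:3:0:8 ix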

Future Work

The I/O MMU implementation behind the ACPI_DMAR option, added to FreeBSD more recently, is more advanced than the earlier support used by bhyve.  At some point bhyve should be updated to use ACPI_DMAR to manage the I/O MMU rather than its own driver.