Saturday, October 8, 2016

Using PCI pass through with bhyve

I have recently been using PCI pass through with bhyve while working on some device drivers for FreeBSD.  bhyve has included PCI pass through from day one, but I recently made a few changes to improve the experience for my particular workflow.

A Workflow

PCI pass through in bhyve requires a couple of steps. First, a PCI device must be attached to the ppt(4) device driver included as part of bhyve. This device driver claims a PCI device so that other drivers on the host cannot attach to it. It also interfaces with bhyve to ensure a device is only attached to a single VM and that it is correctly configured in the I/O MMU. Second, the PCI device must be added as a virtual PCI device of a bhyve instance.

Before the ppt(4) driver can be attached to a specific PCI device, the driver must be loaded. It is included as part of bhyve's vmm.ko:

kldload vmm.ko

Once the driver is loaded, the "set driver" command from devctl(8) can be used to attach the ppt(4) driver to the desired PCI device.  In this example, the PCI device at bus 3, slot 0, function 8 is our target device:

devctl set driver pci0:3:0:8 ppt

Finally, the bhyve instance must be informed about this device when starting.  The  /usr/share/examples/bhyve/vmrun.sh script has included a "-p" option to support this since r279925.  The command below uses vmrun.sh to start a VM running FreeBSD HEAD with the device passed through:

sh /usr/share/examples/bhyve/vmrun.sh -c 4 -d /dev/zvol/bhyve/head -p 3/0/8 head

Once I am finished with testing, I can revert the PCI device back to its normal driver (if one exists) by using devctl's "clear driver" command:

devctl clear driver -f ppt0

Relevant Commits

The workflow described above depends on several recent changes.  Most of these have been merged to stable/10 and stable/11, but they are not present in 10.3 or 11.0.  However, some of them are optional or can be worked around.

I/O MMU Fixes

One of the requirements for PCI pass through is an I/O MMU, which adjusts DMA requests made by the PCI device so that the addresses in those requests are treated as guest physical addresses (GPAs) rather than host physical addresses (HPAs).  This means that the address in each DMA request must be mapped to the "real" host physical address using a set of page tables similar to the "nested page tables" used to map GPAs to HPAs in the CPU.

bhyve's PCI pass through support was originally designed to work with a static set of PCI devices assigned at boot time via a loader tunable.  As a result, it only enabled the I/O MMU if PCI pass through devices were configured when the kernel module was initialized.  It also didn't check to see if the I/O MMU was initialized when a PCI pass through device was added to a new VM.  Instead, if PCI pass through devices were only configured dynamically at runtime and added to a VM, their DMA requests were not translated and always operated on HPAs.  Generally this meant that devices simply didn't work since they were DMA'ing to/from the wrong memory (and quite possibly to memory not belonging to the VM).  This was resolved in r304858, though it can be worked around on older versions by setting the tunable hw.vmm.force_iommu=1 to enable the I/O MMU even if no PCI pass through devices are configured when the module is loaded.  One other edge case fixed by this commit is that bhyve will now fail to start a VM with a PCI pass through device if the I/O MMU fails to initialize (for example, because it doesn't exist).
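
On an older system without r304858, the workaround is just a matter of setting the tunable before vmm.ko is loaded.  A minimal sketch, assuming the module is loaded by hand as above:

kenv hw.vmm.force_iommu=1
kldload vmm.ko

or, to make it persistent, in /boot/loader.conf:

hw.vmm.force_iommu=1
vmm_load="YES"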

A related issue was that bhyve assumed that PCI devices marked for pass through would always remain dedicated to pass through and would never be used on the host system.  The result was that PCI devices attached to the ppt(4) device driver were only permitted by the I/O MMU configuration to perform DMA while they were active in a VM.  In particular, if a PCI device were detached from ppt(4) on the host so that another driver could attach to it, the device would not be able to function on the host.  This was fixed in r305485 by allowing ppt(4) devices to use DMA on the host while they are not active in a VM.

Finally, FreeBSD 11 introduced other changes in the PCI system that permitted the dynamic arrival and departure of PCI devices in a running system through support for SR-IOV (creating/destroying VFs) and native PCI-express HotPlug.  (Technically, older versions of FreeBSD also supported dynamic arrival and departure of PCI devices via CardBus, but most systems with CardBus are too old to support the I/O MMU included with VT-d.)  The I/O MMU support in bhyve assumed a static PCI device tree and performed a single scan of that tree during I/O MMU initialization to populate the I/O MMU's device tables.  The I/O MMU was not updated if PCI devices were later added or removed.  Revision r305497 adds event handler hooks that fire when PCI devices are added or removed, and bhyve uses these hooks to keep the I/O MMU device tables up to date.

All of these fixes were merged to stable/11 in r306471.  The first two were merged to stable/10 in r306472.

To assist with debugging these issues and testing the fixes, I also hacked together a quick utility to dump the device tables used by the I/O MMU.  This tool is present in FreeBSD's source tree at tools/tools/dmardump. (r303887, merged in r306467)
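
The tool is not installed by default; assuming a source tree in /usr/src, building and running it looks roughly like this (run as root, and note that the binary may land in the object tree rather than the current directory depending on your build setup):

cd /usr/src/tools/tools/dmardump
make
./dmardump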

PCI Device Resets

Commit r305502 added support for resetting PCI-express devices passed through to a VM on VM startup and teardown via a Function Level Reset (FLR).  This ensures the device is idle and does not issue any DMA requests while being moved between the host and the VM memory domains.  It also ensures the device is quiesced to a reset state during VM startup and after VM teardown.  Support for this was merged to stable/10 and stable/11 in r306520.
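
As an aside, pciconf(8) can show whether a device advertises FLR at all; newer versions of pciconf note FLR support when decoding the PCI-Express capability, though the exact output varies by FreeBSD version.  For the example device above:

pciconf -lc pci0:3:0:8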

devctl

The devctl utility, including the "set driver" command, first appeared in 10.3.  The "clear driver" command was added more recently in r305034 (merged in r306533).  However, "clear driver" is only needed to revert a PCI pass through device back to a regular driver on the host.  The "set driver" command can be used with the name of the regular driver as a workaround on older systems.
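
For example, if the device from the workflow above is normally claimed by a driver named "ix" (a stand-in here for whatever driver actually owns the hardware), the workaround on an older system would look like:

devctl set driver -f pci0:3:0:8 ix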

Future Work

The I/O MMU implementation under the ACPI_DMAR option added more recently to FreeBSD is more advanced than the earlier support used by bhyve.  At some point bhyve should be updated to use ACPI_DMAR to manage the I/O MMU rather than its own driver.

Monday, August 8, 2016

Using ZVOLs for bhyve VMs

One of the follow-up changes after converting my desktop from UFS to ZFS was to convert my bhyve VMs from raw "disk.img" files in my home directory to being backed by ZFS volumes (ZVOLs).  The conversion process was fairly simple.

First, I created a new namespace to hold them:

# zfs create zroot/bhyve

Next, for each disk image I created a new volume and dd'd the raw disk image over to it.  For example:

# zfs create -V 16G zroot/bhyve/head
# dd if=bhyve/head/disk.img of=/dev/zvol/zroot/bhyve/head bs=1m

I then booted the virtual machine passing "-d /dev/zvol/zroot/bhyve/head" to vmrun.sh.  Once that was successful I removed the old disk.img files.
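
For example, the head VM now boots with something like:

# sh /usr/share/examples/bhyve/vmrun.sh -c 4 -d /dev/zvol/zroot/bhyve/head head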

By default, FreeBSD exports ZFS volumes as GEOM providers.  This means that volumes can be tasted on the host.  For example:

# gpart show zvol/zroot/bhyve/head
=>      34  33554365  zvol/zroot/bhyve/head  GPT  (16G)
        34       128                      1  freebsd-boot  (64K)
       162  32714523                      2  freebsd-ufs  (16G)
  32714685    839714                      3  freebsd-swap  (410M)

One nice thing about this is that I can run fsck(8) directly against /dev/zvol/zroot/bhyve/headp2 without having to use mdconfig(8).  I could also choose to mount it on the host.
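
For example, a quick read-only check of the UFS partition (with the VM shut down, of course) looks something like:

# fsck -t ufs -n /dev/zvol/zroot/bhyve/headp2
# mount -o ro /dev/zvol/zroot/bhyve/headp2 /mnt
# umount /mnt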

Secondly, ZFS volumes show up in gstat(8) output.  With all the volumes the output can be quite long, but "gstat -p" can give you nice output breaking down I/O by VM:

dT: 1.001s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| cd0
    6    408    271   9934   42.9    138  17534    7.7  100.6| ada0
   12    443    272   5987   43.9    172  21844    5.7  100.0| ada1
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/head
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/head-i386
    0     18      2     64  393.8     16    321  119.4   90.7| zvol/zroot/bhyve/fbsd11-i386
    1     65     24   3068   53.0     41   3320   18.7  109.9| zvol/zroot/bhyve/fbsd11
    6     59     22    786   50.8     37    578   61.2  100.1| zvol/zroot/bhyve/fbsd10
    2     61      0      0    0.0     61   6609   77.0  101.8| zvol/zroot/bhyve/fbsd9
    1     18      8     94   60.4     10     33   10.6   51.3| zvol/zroot/bhyve/fbsd10-i386
    1     14      6     50   92.3      8    108  109.3   60.7| zvol/zroot/bhyve/fbsd9-i386
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/fbsd8-i386
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/fbsd8

Monday, August 1, 2016

Adventures in Disk Replacement

The Problem

A few years ago I built a new FreeBSD desktop at home.  For simplicity of booting, etc. I used the built-in RAID1 mirroring provided by the on-board SATA controller.  This worked fine.

Recently one of my drives began reporting SMART errors (I am running smartd from sysutils/smartmontools in daemon mode and it sends emails to root@ for certain types of errors).  Originally the drive logged two errors:

Device: /dev/ada0, 8 Currently unreadable (pending) sectors
Device: /dev/ada0, 8 Offline uncorrectable sectors

It logged these two (seemingly related?) errors once a month for the past three months.  This month it logged an additional error at which point I decided to swap out the drive:

Device: /dev/ada0, ATA error count increased from 0 to 5
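
For reference, the smartd setup behind these mails is minimal.  A sketch (not my exact configuration) is just an rc.conf knob plus a catch-all line in smartd.conf.  In /etc/rc.conf:

smartd_enable="YES"

and in /usr/local/etc/smartd.conf:

DEVICESCAN -a -m root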

The simple solution would be to just swap out the dying drive for the replacement, reboot, and let the rebuild chug away.  However, I decided to make a few changes which made things not quite so simple.

First, my existing system was laid out with UFS with separate partitions for /, /usr, and /var.  It was at least using GPT instead of MBR.  However, I wanted to switch from UFS to ZFS.  I'm not exactly ecstatic about how ZFS' ARC interfaces with FreeBSD's virtual memory subsystem (a bit of a square peg in a round hole).  However, for my desktop the additional data integrity of ZFS' integrated checksums is very compelling.  In addition, switching to ZFS enables more flexibility in the future for growing the pool as well as things like boot environments, ZFS integration with poudriere, zvols for my bhyve VMs, etc.

Second, since I was going to be doing a complicated data migration, I figured I might as well redo my partitioning layout to support EFI booting.  In this case I wanted the flexibility to boot via legacy mode (CSM) if need be while keeping the option of switching to EFI.  This isn't that complicated (the install images for FreeBSD 11 are laid out for this), but FreeBSD's installer doesn't support this type of layout out of the box.

Step 1: Partitioning

I initially tried to see if I could do some of the initial setup using partedit from FreeBSD's installer.  However, I quickly ran into a few issues.  First, my desktop was still running 10-STABLE, whose partedit didn't support ZFS on GPT for booting.  (Even partedit in 11 doesn't seem to handle this by my reading of the source.)  Second, partedit in HEAD doesn't support creating a dual-mode (EFI and BIOS) disk.  Thus, I resorted to doing this all by hand.

First, I added a GPT table which is pretty simple (and covered in manual page examples):

# gpart create -s gpt ada2

To make the disk support dual-mode booting it needs both an EFI partition and a freebsd-boot partition.  For the EFI partition, FreeBSD ships a pre-formatted FAT image that can be written directly to the partition (/boot/boot1.efifat).  However, the formatted filesystem in this image is relatively small, and I wanted to use a larger EFI partition to match a recent change in FreeBSD 11 (200MB).  Instead of using the formatted filesystem, I formatted the EFI partition directly and copied the /boot/boot1.efi binary into the right subdirectory.  Ideally I think bsdinstall should do this as well rather than using the pre-formatted image.

# gpart add -t efi -s 200M -a 4k ada2
# newfs_msdos -L EFI /dev/ada2p1
# mount -t msdos /dev/ada2p1 /mnt
# mkdir -p /mnt/efi/boot
# cp /boot/boot1.efi /mnt/efi/boot/BOOTx64.efi
# umount /mnt

To handle BIOS booting, I installed the /boot/pmbr MBR bootstrap and /boot/gptzfsboot into a freebsd-boot partition.

# gpart bootcode -b /boot/pmbr ada2
# gpart add -t freebsd-boot -s 512k -a 4k ada2
# gpart bootcode -p /boot/gptzfsboot -i 2 ada2

Finally, I added partitions for swap and ZFS:

# gpart add -t freebsd-swap -a 4k -s 16G ada2
# gpart add -t freebsd-zfs -a 4k ada2

At this point the disk layout looked like this:

# gpart show ada2
=>        34  1953525101  ada2  GPT  (932G)
          34           6        - free -  (3.0K)
          40      409600     1  efi  (200M)
      409640        1024     2  freebsd-boot  (512K)
      410664    33554432     3  freebsd-swap  (16G)
    33965096  1919560032     4  freebsd-zfs  (915G)
  1953525128           7        - free -  (3.5K)

Step 2: Laying out ZFS

Now that partitioning was complete, the next step was to create a ZFS pool.  The ultimate plan was to add the "good" remaining disk as a mirror of the new disk, but I started with a single-device pool backed by the new disk.  I would have liked to use the existing zfsboot script from FreeBSD's installer to create the pool and lay out the various filesystems, but trying to use bsdconfig to do this just resulted in confusion.  It refused to do anything when I first ran the disk editor from the bsdconfig menu because no filesystem was marked as '/'.  Once I marked the new ZFS partition as '/' the child partedit process core dumped and bsdconfig returned to its main menu.  So, I punted and did this step all by hand as well.

I assumed that the instructions on FreeBSD's wiki from the old sysinstall days were stale as they predated the use of boot environments in FreeBSD.  Thankfully, Kevin Bowling has more recent instructions here.

Of course, one important step is that you need the ZFS kernel module to use ZFS.  The custom kernel I used on my desktop had a stripped down set of kernel modules so I had to add ZFS to the list and reinstall the kernel.
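
If the module list is trimmed with MODULES_OVERRIDE in the kernel config file, the change amounts to extending that list with zfs and its opensolaris dependency alongside whatever modules were already listed, something like:

makeoptions	MODULES_OVERRIDE="opensolaris zfs"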

First, I created the pool:

# mkdir /tmp/zroot
# zpool create -o altroot=/tmp/zroot -O compress=lz4 -O atime=off -m none zroot /dev/ada2p4

Next, I added the various datasets (basically copied from Kevin's instructions):

# zfs create -o mountpoint=none zroot/ROOT
# zfs create -o mountpoint=/ zroot/ROOT/default
# zfs create -o mountpoint=/tmp -o exec=on -o setuid=off zroot/tmp
# zfs create -o mountpoint=/usr -o canmount=off zroot/usr
# zfs create zroot/usr/home
# zfs create -o setuid=off zroot/usr/ports
# zfs create -o mountpoint=/var -o canmount=off zroot/var
# zfs create -o exec=off -o setuid=off zroot/var/audit
# zfs create -o exec=off -o setuid=off zroot/var/crash
# zfs create -o exec=off -o setuid=off zroot/var/log
# zfs create -o atime=on zroot/var/mail
# zfs create -o setuid=off zroot/var/tmp
# zpool set bootfs=zroot/ROOT/default zroot
# chmod 1777 /tmp/zroot/tmp
# chmod 1777 /tmp/zroot/var/tmp

Step 3: Copy the Data

In the past when I've migrated UFS partitions across drives, I used 'dump | restore' which worked really well (preserved sparse files, etc.).  For this migration that wasn't an option.  Since I had separate UFS partitions I had to copy each one over:

# tar -cp --one-file-system -f - -C / . | tar -xSf - -C /tmp/zroot
# tar -cp --one-file-system -f - -C /var . | tar -xSf - -C /tmp/zroot/var
# tar -cp --one-file-system -f - -C /usr . | tar -xSf - -C /tmp/zroot/usr

Since I had been using UFS SU+J, the copies included the .sujournal files, so I deleted those:

# rm /tmp/zroot/.sujournal /tmp/zroot/var/.sujournal /tmp/zroot/usr/.sujournal

Step 4: Adjust Boot Configuration

I added the following to /etc/rc.conf:

zfs_enable="YES"

and to /boot/loader.conf:

zfs_load="YES"
kern.geom.label.disk_ident.enable=0
kern.geom.label.gptid.enable=0

I also removed all references to the old RAID1 mirror from /etc/fstab.

With all this done I was ready to reboot.

Step 5: Test Boot

My BIOS does not permit selecting a different hard disk at boot, so I had to change the default boot disk in the BIOS settings.  Once this was done the system booted to ZFS just fine.

Step 6: Convert to Mirror

After powering down the box I unplugged the dead drive and booted up.  I verified that the remaining drive's serial number did not match the drive that had reported errors previously.  (I actually got this wrong the first time so had to boot a few times.)  Once this was correct I proceeded to destroy the now-degraded RAID1 in preparation for reusing the disk as a mirror.
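
Checking the serial number of the remaining drive is straightforward; either of these shows it (with ada0 being the remaining original drive after the reboot):

# geom disk list ada0 | grep ident
# camcontrol identify ada0 | grep serial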

# graid delete raid/r0

At this point, the raw disk (/dev/ada0) still had the underlying data (in particular a GPT), so that had to be destroyed as well:

# gpart destroy -F ada0

Now the ada0 disk needed to be partitioned identically to the new disk (now ada1).  I was able to copy the GPT over to save a few steps.

# gpart backup ada1 | gpart restore ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 2 ada0
# newfs_msdos -L EFI /dev/ada0p1
# mount -t msdos /dev/ada0p1 /mnt
# mkdir -p /mnt/efi/boot
# cp /boot/boot1.efi /mnt/efi/boot/BOOTx64.efi
# umount /mnt

Next, I added the two swap partitions to /etc/fstab and ran the /etc/rc.d/swap and /etc/rc.d/dumpon scripts.
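
With the layout above, the swap partitions are p3 on each disk, so the new fstab lines and the commands were roughly:

/dev/ada0p3	none	swap	sw	0	0
/dev/ada1p3	none	swap	sw	0	0

# /etc/rc.d/swap start
# /etc/rc.d/dumpon start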

Finally, I attached the ZFS partition on ada0 to the pool as a mirror.  NB: I was warned previously to be sure to use 'zpool attach' and not 'zpool add' as the latter would simply concatenate the disks and not provide redundancy.

# zpool attach zroot /dev/ada1p4 /dev/ada0p4
Make sure to wait until resilver is done before rebooting.

If you boot from pool 'zroot', you may need to update
boot code on newly attached disk '/dev/ada0p4'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

 gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

The one nit about these otherwise helpful messages is that they are hardcoded to assume the freebsd-boot partition is partition 1.  I suspect it is not easy to auto-generate the correct command (otherwise it would already do so), but the wording could be tweaked to note that the partition index may also need updating, not just the disk name.  Also, this doesn't cover the EFI booting case (which admittedly is new in FreeBSD 11).

Anyway, the pool is now happily resilvering:

# zpool status
  pool: zroot
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
 continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Aug  1 08:12:58 2016
        1.63G scanned out of 207G at 98.1M/s, 0h35m to go
        1.63G resilvered, 0.79% done
config:

 NAME        STATE     READ WRITE CKSUM
 zroot       ONLINE       0     0     0
   mirror-0  ONLINE       0     0     0
     ada1p4  ONLINE       0     0     0
     ada0p4  ONLINE       0     0     0  (resilvering)

errors: No known data errors

Testing EFI will have to wait until I upgrade my desktop to 11.  Perhaps next weekend.

Updates

Some feedback from readers:
1) restore doesn't actually assume a UFS destination, so I probably could have used 'dump | restore'.
2) The 'zpool export/import' wasn't actually needed and has been removed (the create is sufficient).

Also, for the curious, the resilver finished in less than an hour:

# zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 207G in 0h53m with 0 errors on Mon Aug  1 09:06:26 2016