Thursday, October 25, 2018

Using bhyve for FreeBSD Development

Note: This was originally published as an article in the July/August 2014 issue of the FreeBSD Journal.  Some details have changed since the time of writing; for example, bhyve now supports AMD-V extensions on AMD CPUs.  However, I still use the network setup I describe today, and several folks have asked for an example of NATed VMs on a laptop, so I've republished the article below.

One of the exciting new features in FreeBSD 10.0 is the bhyve hypervisor.  Hypervisors and virtual machines are used in a wide variety of applications.  This article will focus on using bhyve as a
tool for aiding development of FreeBSD itself.  Not all of the details covered are specific to FreeBSD development, however, and many may prove useful for other applications.

Note that the bhyve hypervisor is under constant development and some of the features described have been added since FreeBSD 10.0 was released.  Most of these features should be present in FreeBSD 10.1.

The Hypervisor


The bhyve hypervisor requires a 64-bit x86 processor with hardware support for virtualization.  This requirement allows for a simple, clean hypervisor implementation, but it does require a fairly recent
processor.  The current hypervisor requires an Intel processor, but there is an active development branch with support for AMD processors.

The hypervisor itself contains both user and kernel components.  The kernel driver is contained in the vmm.ko module and can be loaded either at boot from the boot loader or at runtime.  It must
be loaded before any guests can be created.  When a guest is created, the kernel driver creates a device file in /dev/vmm which is used by the user programs to interact with the guest.
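
For example, the module can be loaded at boot via loader.conf(5) or at runtime with kldload(8):

/boot/loader.conf:

vmm_load="YES"

# kldload vmm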

The primary user component is the bhyve(8) program.  It constructs the emulated device tree in the guest and provides the implementation for most of the emulated devices.  It also calls the kernel driver to execute the guest.  Note that the guest always executes inside the driver itself, so guest execution time in the host is counted as system time in the bhyve process.

Currently, bhyve does not provide a system firmware interface to the guest (neither BIOS nor UEFI).  Instead, a user program running on the host is used to perform boot time operations including loading the guest operating system kernel into the guest's memory and setting the initial guest state so that the guest begins execution at the kernel's entry point.  For FreeBSD guests, the bhyveload(8) program can be used to load the kernel and prepare the guest for execution.  Support for some other operating systems is available via the grub2-bhyve program which is available via the sysutils/grub2-bhyve port or as a prebuilt package.

The bhyveload(8) program in FreeBSD 10.0 only supports 64-bit guests.  Support for 32-bit guests will be included in FreeBSD 10.1.

Network Setup


The network connections between the guests and the host can be configured in several different ways.  Two different setups are described below, but they are not the only possible configurations.

The only guest network driver currently supported by bhyve is the VirtIO network interface driver.  Each network interface exposed to the guest is associated with a tap(4) interface in the host.  The tap(4) driver allows a user program to inject packets into the network stack and accept packets from the network stack.  By design, each tap(4) interface will only pass traffic if it is opened by a user process and it is administratively enabled via ifconfig(8).  As a result, each tap(4) interface must be explicitly enabled each time a guest is booted.  This can be inconvenient for frequently
restarted guests.  The tap(4) driver can be changed to automatically enable an interface when it is opened by a user process by setting the net.link.tap.up_on_open sysctl to 1.
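
For example, to apply the setting immediately and make it persistent across reboots:

# sysctl net.link.tap.up_on_open=1

/etc/sysctl.conf:

net.link.tap.up_on_open=1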

Bridged Configuration


One simple network setup bridges the guest network interfaces directly onto a network to which the host is connected.  On the host, a single if_bridge(4) interface is created.  The tap(4) interfaces for the guest are added to the bridge along with the network interface on the host that is attached to the desired network.  Example 1 connects a guest using tap0 to a LAN on the host's re0 interface:

Example 1: Manually Connecting a Guest to the Host's LAN

# ifconfig bridge0 create
# ifconfig bridge0 addm re0
# ifconfig bridge0 addm tap0
# ifconfig bridge0 up

The guest can then configure the VirtIO network interface bound to tap0 for the LAN on the host's re0 interface using DHCP or a static address.

The /etc/rc.d/bridge script allows bridges to be configured automatically during boot by variables in /etc/rc.conf.  The autobridge_interfaces variable lists the bridge interfaces to configure.  For each bridge interface, an autobridge_<bridge> variable (e.g., autobridge_bridge0) lists the other network interfaces that should be added as bridge members.  The list can include shell globs to match multiple interfaces.  Note that /etc/rc.d/bridge will not create the named bridge interfaces.  They should be created by listing them in the cloned_interfaces variable along with the desired tap(4) interfaces.  Example 2 lists the /etc/rc.conf settings to create three tap(4) interfaces bridged to a local LAN on the host's re0 interface.

Example 2: Bridged Configuration

/etc/rc.conf:

autobridge_interfaces="bridge0"
autobridge_bridge0="re0 tap*"
cloned_interfaces="bridge0 tap0 tap1 tap2"
ifconfig_bridge0="up"

Private Network with NAT


A more complex network setup creates a private network on the host for the guests and uses network address translation (NAT) to provide limited access from guests to other networks.  This may be a more appropriate setup when the host is mobile or connects to untrusted networks.

This setup also uses an if_bridge(4) interface, but only the tap(4) interfaces used by guests are added as members of the bridge.  The guests' network interfaces are assigned addresses from a private subnet.  The bridge interface is assigned an address from the same subnet to connect the bridge to the host's network stack.  This allows the guests and the host to communicate over the private subnet used by the bridge.

The host acts as a router for the guests to permit guest access to remote systems.  IP forwarding is enabled on the host and guest connections are translated via natd(8).  The guests use the host's address in the private subnet as their default route.

Example 3 lists the /etc/rc.conf settings to create three tap(4) interfaces and a bridge interface using the 192.168.16.0/24 subnet.  It also translates network connections over the host's wlan0 interface using natd(8).

Example 3: Private Network Configuration

/etc/rc.conf:

autobridge_interfaces="bridge0"
autobridge_bridge0="tap*"
cloned_interfaces="bridge0 tap0 tap1 tap2"
ifconfig_bridge0="inet 192.168.16.1/24"
gateway_enable="YES"
natd_enable="YES"
natd_interface="wlan0"
firewall_enable="YES"
firewall_type="open"
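
In this setup, a guest that is configured statically uses an address in the private subnet and the host's bridge address as its default gateway.  For example, a guest's /etc/rc.conf might contain the following (vtnet0 is the guest's VirtIO network interface; the guest address here is an arbitrary choice from the subnet in Example 3):

ifconfig_vtnet0="inet 192.168.16.2/24"
defaultrouter="192.168.16.1"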

Using dnsmasq with a Private Network


The private network from the previous example works well, but it is a bit tedious to work with.  Guests must statically configure their network interfaces, and network connections between guests and the host must use hardcoded IP addresses.  The dnsmasq utility can alleviate much of the tedium.  It can be installed via the dns/dnsmasq port or as a prebuilt package.

The dnsmasq daemon provides a DNS forwarding server as well as a DHCP server.  It can serve local DNS requests to map the hostnames of its DHCP clients to the addresses it assigns to the clients.  For the private network setup, this means that each guest can use DHCP to configure its network interface.  In addition, all of the guests and the host can resolve each guest's hostname.

The dnsmasq daemon is configured by settings in the /usr/local/etc/dnsmasq.conf configuration file. A sample configuration file is installed by the port by default.  The configuration file suggests enabling the domain-needed and bogus-priv settings for the DNS server to avoid sending useless DNS requests to upstream DNS servers.  To enable the DHCP server, interface must be set to the network interface on the host where the server should run, and dhcp-range must be set to configure the range of IP addresses that can be assigned to guests.

Example 4 instructs the dnsmasq daemon to run a DHCP server on the bridge0 interface and assign a subset of the 192.168.16.0/24 subnet to guests.

Example 4: Enabling dnsmasq's DNS and DHCP Servers

/usr/local/etc/dnsmasq.conf:

domain-needed
bogus-priv
interface=bridge0
dhcp-range=192.168.16.10,192.168.16.200,12h

In addition to providing local DNS names for DHCP clients, dnsmasq also provides DNS names for any entries in /etc/hosts on the host.  An entry in /etc/hosts that maps the IP address assigned to bridge0 to a hostname (e.g. "host") will allow guests to use that hostname to contact the host.
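
For example, using the bridge address from Example 3 (the hostname "host" is arbitrary):

/etc/hosts:

192.168.16.1	host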

The last thing remaining is to configure the host machine to use dnsmasq's DNS server, which allows the host to resolve the name of each guest.  The dnsmasq daemon can use resolvconf(8) to seamlessly handle updates to the host's DNS configuration provided by DHCP or VPN clients.  This is implemented by having resolvconf(8) update two configuration files that are read by dnsmasq each time the host's DNS configuration changes.  Finally, the host should always use dnsmasq's DNS server and rely on it to forward requests to other upstream DNS servers.  Enabling all of this requires changes to both dnsmasq's configuration file and /etc/resolvconf.conf.  More details about configuring resolvconf(8) can be found in resolvconf.conf(5).  Example 5 gives the changes to both files to use dnsmasq as the host's name resolver.

Example 5: Use dnsmasq as the Host's Resolver

/usr/local/etc/dnsmasq.conf:

conf-file=/etc/dnsmasq-conf.conf
resolv-file=/etc/dnsmasq-resolv.conf

/etc/resolvconf.conf:

name_servers=127.0.0.1
dnsmasq_conf=/etc/dnsmasq-conf.conf
dnsmasq_resolv=/etc/dnsmasq-resolv.conf

Running Guests via vmrun.sh


Executing a guest requires several steps.  First, any state from a previous guest using the same name must be cleared before a new guest can begin.  This is done by passing the --destroy flag to
bhyvectl(8).  Second, the guest must be created and the guest's kernel must be loaded into its address space by bhyveload(8) or grub2-bhyve.  Finally, the bhyve(8) program is used to create virtual devices and provide runtime support for guest execution.  Doing this all by hand for each guest invocation can be a bit tedious, so FreeBSD ships with a wrapper script for FreeBSD guests:
/usr/share/examples/bhyve/vmrun.sh.

The vmrun.sh script manages a simple FreeBSD guest.  It performs the three steps above in a loop so that the guest restarts after a reboot, similar to real hardware.  It provides a fixed set of virtual devices to the guest including a network interface backed by a tap(4) interface, a local disk backed by a disk image, and an optional second disk backed by an install image.  To make guest installations easier, vmrun.sh checks the provided disk image for a valid boot sector.  If none is found, it instructs bhyveload(8) to boot from the install image; otherwise, it boots from the disk image.  In FreeBSD 10.1 and later, vmrun.sh will terminate its loop when the guest requests soft off via ACPI.

The simplest way to bootstrap a new FreeBSD guest is to install the guest from an install ISO image.  For a FreeBSD guest running 9.2 or later, the standard install process can be used: pass the normal install ISO as the optional install image to vmrun.sh.  FreeBSD 8.4 also works as a bhyve guest.  However, its installer does not fully support VirtIO block devices, so the initial install must be performed manually using steps similar to those from the RootOnZFS guide.  Example 6 creates a 64-bit guest named "vm0" and boots the install CD for FreeBSD 10.0-RELEASE.  Once the guest has been installed, the -I argument can be dropped to boot the guest from the disk image.

Example 6: Creating a FreeBSD/amd64 10.0 Guest

# mkdir vm0
# truncate -s 8g vm0/disk.img
# sh /usr/share/examples/bhyve/vmrun.sh -t tap0 -d vm0/disk.img \
  -I FreeBSD-10.0-RELEASE-amd64-disc1.iso vm0

The vmrun.sh script runs bhyve(8) synchronously and uses its standard file descriptors as the backend of the first serial port assigned to the guest.  This serial port is used as the system console device for
FreeBSD guests.  The simplest way to run a guest in the background using vmrun.sh is to use a tool such as screen or tmux.
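
For example, a guest can be started in a detached tmux session and its console attached later (the session name is arbitrary):

# tmux new-session -d -s vm0 \
    'sh /usr/share/examples/bhyve/vmrun.sh -t tap0 -d vm0/disk.img vm0'
# tmux attach -t vm0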

FreeBSD 10.1 and later treat the SIGTERM signal sent to bhyve(8) as a virtual power button.  If the guest supports ACPI, then sending SIGTERM interrupts the guest to request a clean shutdown.  The guest should then initiate an ACPI soft-off which will terminate the vmrun.sh loop.  If the guest does not respond to SIGTERM, the guest can still be forcefully terminated from the host via SIGKILL.  If the guest does not support ACPI, then SIGTERM will immediately terminate the guest.

The vmrun.sh script accepts several different arguments to control the behavior of bhyve(8) and bhyveload(8), but these arguments only permit enabling a subset of the features supported by these programs.  To control all available features or use alternate virtual device configurations (e.g. multiple virtual drives or network interfaces), either invoke bhyveload(8) and bhyve(8) manually or use vmrun.sh as the basis of a custom script.
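
For reference, a manual invocation roughly mirroring what vmrun.sh does might look like the following sketch; the memory size, CPU count, and device slot assignments are illustrative, and the bhyvectl step merely clears any previous instance of the same name:

# bhyvectl --vm=vm0 --destroy
# bhyveload -m 512M -d vm0/disk.img vm0
# bhyve -c 2 -m 512M -A -H -P \
    -s 0:0,hostbridge -s 1:0,lpc -s 2:0,virtio-net,tap0 \
    -s 3:0,virtio-blk,vm0/disk.img -l com1,stdio vm0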

Configuring Guests


FreeBSD guests do not require extensive configuration settings to run, and most settings can be set by the system installer.  However, there are a few conventions and additional settings which can be useful.

Out of the box, FreeBSD releases prior to 9.3 and 10.1 expect to use a video console and keyboard as the system console.  As such, they do not enable a login prompt on the serial console.  A login prompt
should be enabled on the serial console by editing /etc/ttys and marking the ttyu0 terminal
"on".  Note that this can be done from the host after the install has completed by mounting the disk image on the host using mdconfig(8).  (Note: Be sure the guest is no longer accessing the disk image
before mounting its filesystem on the host to avoid data corruption.)
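
A minimal sketch of that workflow, assuming the guest's root filesystem is the second GPT partition inside the image (edit /mnt/etc/ttys in the middle step to mark ttyu0 "on"):

# mdconfig -a -t vnode -f vm0/disk.img
md0
# mount /dev/md0p2 /mnt
# vi /mnt/etc/ttys
# umount /mnt
# mdconfig -d -u 0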

If a guest requires network access, it will require configuration similar to that of a normal host.  This includes configuring the guest's network interface (vtnet0) and assigning a hostname.  A useful convention is to re-use the name of the guest ("vm0" in Example 6) as the hostname.  The sendmail(8) daemon may hang attempting to resolve the guest's hostname during boot.  This can be worked around by completely disabling sendmail(8) in the guest.  Finally, most guests with network access will want to enable remote logins via sshd(8).

Example 7 lists the /etc/rc.conf file for a simple FreeBSD guest.

Example 7: Simple FreeBSD Guest Configuration

/etc/rc.conf:

hostname="vm0"
ifconfig_vtnet0="DHCP"
sshd_enable="YES"
dumpdev="AUTO"
sendmail_enable="NONE"

Using a bhyve Guest as a Target


One way bhyve can be used while developing FreeBSD is to allow a host to debug a guest as if the guest were a remote target.  Specifically, a test kernel can be built on the host, booted inside of the guest, and debugged from the host using kgdb(1).

Once a guest is created and configured and a test kernel has been built on the host, the next step is to boot the guest with the test kernel.  The traditional method is to install the kernel into the guest's filesystem by exporting the build directory to the guest via NFS, by copying the kernel into the guest over the network, or by mounting the guest's filesystem on the host directly via mdconfig(8).  An alternate method, similar to booting a test machine over the network, is available via bhyveload(8).

Using bhyveload(8)'s Host Filesystem


The bhyveload(8) program allows a directory on the host's filesystem to be exported to the loader environment.  This can be used to load a kernel and modules from a host filesystem rather than the guest's disk image.  The directory on the host's filesystem is passed to bhyveload(8) via the -h flag.  The bhyveload(8) program exports a host0: device to the loader environment.  The path passed to the host0: device in the loader environment is appended to the configured directory to generate a host pathname.  Note that the directory passed to bhyveload(8) must be an absolute pathname.

The vmrun.sh script in FreeBSD 10.1 and later allows the directory to be set via the -H argument.  The script will convert a relative pathname to an absolute pathname before passing it to bhyveload(8).

Booting a test kernel from the host inside of the guest involves the following three steps:
  1. Install the kernel into the directory on the host by setting the DESTDIR variable to the directory when invoking make install or make installkernel.  A non-root user with write access to the directory can perform this step directly by setting the KMODOWN make variable to the invoking user.
  2. Pass the directory's path to bhyveload(8) either via the -h flag to bhyveload(8) or the -H flag to vmrun.sh.
  3. Explicitly load the new kernel at the bhyveload(8) prompt via the loader path host0:/boot/kernel/kernel.

Example 8 installs a kernel with the kernel config GUEST into a host directory for the guest "vm0".  It uses vmrun.sh's -H argument to specify the host directory passed to bhyveload(8).  It also shows the commands used at the loader prompt to boot the test kernel.

Example 8: Booting a Kernel from the Host

> cd ~/work/freebsd/head/sys/amd64/compile/GUEST
> make install DESTDIR=~/bhyve/vm0/host KMODOWN=john
...
> cd ~/bhyve
> sudo sh vmrun.sh -t tap0 -d vm0/disk.img -H vm0/host vm0
...
OK unload
OK load host0:/boot/kernel/kernel
host0:/boot/kernel/kernel text=0x523888 data=0x79df8+0x10e2e8 syms=[0x8+0x9fb58+0x8+0xbaf41]
OK boot
...
Copyright (c) 1992-2014 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 11.0-CURRENT #6 r261528M: Fri Feb  7 09:55:45 EST 2014
    john@pippin.baldwin.cx:/usr/home/john/work/freebsd/head/sys/amd64/compile/GUEST amd64

The guest can be configured to load a kernel from the host0: filesystem on the next boot using nextboot(8).  To boot the host0:/boot/kernel/kernel kernel, run nextboot -e bootfile=host0:/boot/kernel/kernel before
rebooting.  Note that this will not adjust the module path used to load kernel modules, so it only works with a monolithic kernel.

Using bhyve(8)'s Debug Port


The bhyve(8) hypervisor provides an optional debug port that can be used by the host to debug the guest's kernel using kgdb(1).  To use this feature, the guest kernel must include the bvmdebug device driver, the KDB kernel debugger, and the GDB debugger backend.  The debug port must also be
enabled by passing the -g flag to bhyve(8).  The flag requires an argument to specify the local TCP port on which bhyve(8) will listen for a connection from kgdb(1).  The vmrun.sh script also accepts a -g flag which is passed through to bhyve(8).
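
A minimal sketch of the relevant guest kernel configuration lines; on amd64, GENERIC typically already includes KDB and GDB, so device bvmdebug is usually the only addition needed:

device          bvmdebug
options         KDB
options         GDB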

When the guest boots, its kernel will detect the debug port as an available GDB backend automatically.  To connect kgdb(1) on the host to the guest, first enter the kernel debugger by setting the debug.kdb.enter system control node to a non-zero value.  At the debugger prompt, invoke the gdb command.  On the host, run kgdb(1) using the guest's kernel as the kernel image.  The target remote command can be used to connect to the TCP port passed to bhyve(8).  Once kgdb(1) attaches to the remote target, it can be used to debug the guest kernel.  Examples 9 and 10
demonstrate these steps using a guest kernel built on the host.

Example 9: Using kgdb(1) with bvmdebug: In the Guest

> sudo sh vmrun.sh -t tap0 -d vm0/disk.img -H vm0/host -g 1234 vm0
...
OK load host0:/boot/kernel/kernel
host0:/boot/kernel/kernel text=0x523888 data=0x79df8+0x10e2e8 syms=[0x8+0x9fb58+0x8+0xbaf41]
OK boot
Booting...
GDB: debug ports: bvm
GDB: current port: bvm
...
root@vm0:~ # sysctl debug.kdb.enter=1
debug.kdb.enter: 0KDB: enter: sysctl debug.kdb.enter
[ thread pid 693 tid 100058 ]
Stopped at      kdb_sysctl_enter+0x87:  movq    $0,kdb_why
db> gdb
(ctrl-c will return control to ddb)
Switching to gdb back-end
Waiting for connection from gdb
 -> 0
root@vm0:~ #

Example 10: Using kgdb(1) with bvmdebug: On the Host

> cd ~/work/freebsd/head/sys/amd64/compile/GUEST
> kgdb kernel.debug 
...
(kgdb) target remote localhost:1234
Remote debugging using localhost:1234
warning: Invalid remote reply: 
kdb_sysctl_enter (oidp=<value optimized out>, arg1=<value optimized out>, 
    arg2=1021, req=<value optimized out>) at ../../../kern/subr_kdb.c:446
446                     kdb_why = KDB_WHY_UNSET;
Current language:  auto; currently minimal
(kgdb) c
Continuing.

Using kgdb(1) with a Virtual Serial Port


A serial port can also be used to allow the host to debug the guest's kernel.  This can be done by loading the nmdm(4) driver on the host and using a nmdm(4) device for the serial port used
for debugging.

To avoid spewing garbage on the console, connect the nmdm(4) device to the second serial port.  This is enabled in the hypervisor by passing -l com2,/dev/nmdm0B to bhyve(8).  The guest must be
configured to use the second serial port for debugging by setting the kernel environment variable hint.uart.1.flags=0x80 from bhyveload(8).  The kgdb(1) debugger on the host connects to the guest by using target remote /dev/nmdm0A.
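
Putting the pieces together, a session looks roughly like this (nmdm0 is simply the first nmdm(4) unit; any unused unit works):

# kldload nmdm
# bhyve ... -l com1,stdio -l com2,/dev/nmdm0B ... vm0

At the bhyveload(8) prompt in the guest:

OK set hint.uart.1.flags=0x80
OK boot

On the host, after entering the in-kernel debugger and its gdb command as in Example 9:

> kgdb kernel.debug
(kgdb) target remote /dev/nmdm0A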

Conclusion


The bhyve hypervisor is a nice addition to a FreeBSD developer's toolbox.  Guests can be used both to develop new features and to test merges to stable branches.  The hypervisor has a wide variety of uses beyond developing FreeBSD as well.

Wednesday, May 9, 2018

External GCC for FreeBSD/mips

FreeBSD currently supports two compiler toolchains in the base system: Clang/LLVM of various versions (presently 6.0 in 12-CURRENT and 11-STABLE) and GCC 4.2.1 + binutils 2.17.50 (the last versions released under GPLv2).  Clang/LLVM does not fully support all of FreeBSD's architectures, but the existing GNU toolchain is increasingly ancient.  It does not support C++11 or later standards, newer versions of DWARF, compressed debug symbols, ifuncs, etc.  For non-technical reasons, including a modern GNU toolchain in FreeBSD's base system is not feasible.  However, there has been ongoing work to support building a modern GNU toolchain from ports and installing it as the base system toolchain (e.g. /usr/bin/cc and /usr/bin/ld rather than /usr/local/bin/gcc<mumble>).

I've recently been working on this, trying to polish up the existing bits and add a few missing pieces to make it viable, using FreeBSD/mips (O32) as my initial testing platform since it runs fairly well under QEMU.

Building a sysroot

Cross building a toolchain via ports requires two things: an external toolchain to use on the build host and a system root ('sysroot') containing headers and libraries for the target host.  The external toolchain is available as a package from ports:

# pkg install mips-xtoolchain-gcc

The easiest way currently to build a sysroot is to build the world using the external toolchain and then install it:

# cd /usr/src
# make buildworld TARGET_ARCH=mips CROSS_TOOLCHAIN=mips-gcc WITHOUT_GCC=yes WITHOUT_BINUTILS=yes
# mkdir -p /ufs/mips/rootfs
# make installworld TARGET_ARCH=mips CROSS_TOOLCHAIN=mips-gcc WITHOUT_GCC=yes WITHOUT_BINUTILS=yes DESTDIR=/ufs/mips/rootfs

With these in place, one can now cross-build packages for a native toolchain.

Building Toolchain Packages

To build a native toolchain, three packages are required: ports-mgmt/pkg, base/binutils, and base/gcc.  Each package can be built via the following command:

# make CROSS_TOOLCHAIN=mips-gcc CROSS_SYSROOT=/ufs/mips/rootfs package

After the port finishes building the package, the package file can be found in the work/pkg subdirectory of the port.  Build all three packages and copy them someplace.  I put them in /ufs/mips/rootfs/root.

Using the Toolchain Packages

For my testing I fully populated the sysroot at /ufs/mips/rootfs (build and install MALTA kernel, 'make distribution', setup /etc/fstab and /etc/rc.conf) and then generated a disk image using makefs:

# makefs -M 32g -B be /ufs/mips/disk.img /ufs/mips/rootfs

I used this image as a disk for a QEMU instance:

# qemu-system-mips -kernel /ufs/mips/rootfs/boot/kernel/kernel \
    -nographic -drive file=/ufs/mips/disk.img,format=raw -m 2048 \
    -net nic -net user,hostfwd=tcp::8022-:22

Once QEMU is running, I can login to the virtual machine via ssh to port 8022 of the host.
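
For example:

> ssh -p 8022 localhost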

To use the toolchain packages, you need to install them using pkg(8).  However, this requires manually bootstrapping pkg(8) itself.  It is probably simpler if you first disable the default pkg(8) repository following the instructions in /etc/pkg/FreeBSD.conf.  With that out of the way, manually extract the pkg-static binary and use it to install pkg:

# tar xf pkg-1.10.5.txz -C / /usr/local/sbin/pkg-static
# pkg-static install pkg-1.10.5.txz
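
As an aside, the repository-disabling step mentioned above amounts to creating an override file as described in the comments in /etc/pkg/FreeBSD.conf; it looks something like this:

/usr/local/etc/pkg/repos/FreeBSD.conf:

FreeBSD: { enabled: no }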

Once pkg is installed, the binutils and gcc packages can be installed:

# pkg install freebsd-binutils-2.30_2.txz freebsd-gcc-6.3.0.txz
# cc --version
cc (GNU Collection for FreeBSD) 6.3.0
...
# ld --version
GNU ld (GNU Binutils) 2.30
...

Building a Native World

The next step, which I am still working on, is testing builds of newer worlds with the binutils and gcc packages installed.  Note that you have to be careful not to overwrite the packages you installed when eventually doing installworld, so to start with I added this to /etc/src.conf:

WITHOUT_GCC=yes
WITHOUT_BINUTILS=yes

I also wanted to force the build system to use /usr/bin/cc instead of trying to build GCC 4.2.1 and use it as the system compiler.  The current approach I'm using is a freebsd-gcc.mk cross-toolchain makefile in /usr/local/share/toolchains/freebsd-gcc.mk:

XCC=/usr/bin/cc
XCXX=/usr/bin/c++
XCPP=/usr/bin/cpp
CROSS_BINUTILS_PREFIX=/usr/bin/
X_COMPILER_TYPE=gcc

With this in place I tripped over one more bug: bsd.compiler.mk doesn't recognize the base/gcc compiler.  With that addressed, I can now kick off a buildworld:

# make CROSS_TOOLCHAIN=freebsd-gcc buildworld TARGET_CPUTYPE=mips3

Unfortunately, QEMU's timekeeping is a bit iffy, and jemalloc keeps crashing due to an internal assertion failure, which slows the build.  It does seem to make some progress on each run, though, so I keep restarting it with NO_CLEAN=yes.

Saturday, October 8, 2016

Using PCI pass through with bhyve

I have recently been using PCI pass through with bhyve while working on some device drivers for FreeBSD.  bhyve has included PCI pass through from day one, but I recently made a few changes to improve the experience for my particular workflow.

A Workflow

PCI pass through in bhyve requires a couple of steps. First, a PCI device must be attached to the ppt(4) device driver included as part of bhyve. This device driver claims a PCI device so that other drivers on the host cannot attach to it. It also interfaces with bhyve to ensure a device is only attached to a single VM and that it is correctly configured in the I/O MMU. Second, the PCI device must be added as a virtual PCI device of a bhyve instance.

Before the ppt(4) driver can be attached to a specific PCI device, the driver must be loaded. It is included as part of bhyve's vmm.ko:

kldload vmm.ko

Once the driver is loaded, the "set driver" command from devctl(8) can be used to attach the ppt(4) driver to the desired PCI device.  In this example, the PCI device at bus 3, slot 0, function 8 is our target device:

devctl set driver pci0:3:0:8 ppt

Finally, the bhyve instance must be informed about this device when starting.  The  /usr/share/examples/bhyve/vmrun.sh script has included a "-p" option to support this since r279925.  The command below uses vmrun.sh to start a VM running FreeBSD HEAD with the device passed through:

sh /usr/share/examples/bhyve/vmrun.sh -c 4 -d /dev/zvol/bhyve/head -p 3/0/8 head

Once I am finished with testing, I can revert the PCI device back to its normal driver (if one exists) by using devctl's "clear driver" command:

devctl clear driver -f ppt0

Relevant Commits

The workflow described above requires several recent changes.  Most of these have been merged to stable/10 and stable/11 but are not present in 10.3 or 11.0.  Some of them can be worked around or are optional however.

I/O MMU Fixes

One of the requirements of using PCI pass through is the use of an I/O MMU to adjust DMA requests made by the PCI device so that addresses in these requests are treated as guest physical addresses (GPAs) rather than host physical addresses (HPAs).  This means that the addresses in each DMA request must be mapped to the "real" host physical address using a set of page tables similar to the "nested page tables" used to map GPAs to HPAs in the CPU.

bhyve's PCI pass through support was originally designed to work with a static set of PCI devices assigned at boot time via a loader tunable.  As a result, it only enabled the I/O MMU if PCI pass through devices were configured when the kernel module was initialized.  It also didn't check to see if the I/O MMU was initialized when a PCI pass through device was added to a new VM.  Instead, if PCI pass through devices were only configured dynamically at runtime and added to a VM, their DMA requests were not translated and always operated on HPAs.  Generally this meant that devices simply didn't work as they were DMA'ing to/from the wrong memory (and quite possibly to memory not belonging to the VM).  This was resolved in r304858, though it can be worked around on older versions by setting the tunable hw.vmm.force_iommu=1 to enable the I/O MMU even if no PCI pass through devices are configured when the module is loaded.  One other edge case fixed by this commit is that bhyve will now fail to start a VM with a PCI pass through device if the I/O MMU fails to initialize (for example, it doesn't exist).
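
On those older versions, the workaround tunable can be set from loader.conf(5) (or with kenv(1) before loading vmm.ko):

/boot/loader.conf:

hw.vmm.force_iommu=1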

A related issue was that bhyve assumed that PCI devices marked for pass through would always be marked as pass through and would never be used on the host system.  The result was that PCI devices attached to the ppt(4) device driver were only permitted by the I/O MMU configuration to perform DMA while they were active in a VM.  In particular, if a PCI device were detached from the ppt(4) device driver on the host so that another driver could attach to the PCI device, the PCI device would not be able to function on the host.  This was fixed in r305485 by enabling ppt(4) devices to use DMA on the host while they are not active in a VM.

Finally, FreeBSD 11 introduced other changes in the PCI system that permitted the dynamic arrival and departure of PCI devices in a running system through support for SR-IOV (creating/destroying VFs) and native PCI-express HotPlug.  (Technically, older versions of FreeBSD also supported dynamic arrival and departure of PCI devices via CardBus, but most systems with CardBus are too old to support the I/O MMU included with VT-d.)  The I/O MMU support in bhyve assumed a static PCI device tree and performed a single scan of the PCI device tree during I/O MMU initialization to initialize the I/O MMU's device tables.  The I/O MMU was not updated if PCI devices were later added or removed.  Revision r305497 adds event handler hooks when PCI devices are added or removed.  bhyve uses these to update the I/O MMU device tables when PCI devices are added or removed.

All of these fixes were merged to stable/11 in r306471.  The first two were merged to stable/10 in r306472.

To assist with debugging these issues and testing the fixes, I also hacked together a quick utility to dump the device tables used by the I/O MMU.  This tool is present in FreeBSD's source tree at tools/tools/dmardump. (r303887, merged in r306467)

PCI Device Resets

Commit r305502 added support for resetting PCI-express devices passed through to a VM on VM startup and teardown via a Function Level Reset (FLR).  This ensures the device is idle and does not issue any DMA requests while being moved between the host and the VM memory domains.  It also ensures the device is quiesced to a reset state during VM startup and after VM teardown.  Support for this was merged to stable/10 and stable/11 in r306520.

devctl

The devctl utility first appeared in 10.3 including the "set driver" command.  The "clear driver" command was added more recently in r305034 (merged in r306533).  However, "clear driver" is only needed to revert a PCI pass through driver back to a regular driver on the host.  The "set driver" command can be used with the name of the regular driver as a workaround on older systems.

Future Work

The I/O MMU implementation under the ACPI_DMAR option added more recently to FreeBSD is more advanced than the earlier support used by bhyve.  At some point bhyve should be updated to use ACPI_DMAR to manage the I/O MMU rather than its own driver.

Monday, August 8, 2016

Using ZVOLs for bhyve VMs

One of the followup changes after converting my desktop from UFS to ZFS was to convert my bhyve VMs from raw "disk.img" files in my home directory to being backed by ZFS volumes (ZVOLs).   The conversion process was fairly simple.

First, I created a new dataset to hold them:

# zfs create zroot/bhyve

Next, for each disk image I created a new volume and dd'd the raw disk image over to it.  For example:

# zfs create -V 16G zroot/bhyve/head
# dd if=bhyve/head/disk.img of=/dev/zvol/zroot/bhyve/head bs=1m

I then booted the virtual machine passing "-d /dev/zvol/zroot/bhyve/head" to vmrun.sh.  Once that was successful I removed the old disk.img files.
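
For example, booting the "head" VM from the new volume looked roughly like this (tap0 stands in for whatever tap(4) interface the VM normally uses):

# sh /usr/share/examples/bhyve/vmrun.sh -t tap0 -d /dev/zvol/zroot/bhyve/head head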

By default, FreeBSD exports ZFS volumes as GEOM providers.  This means that the volumes can be tasted by other GEOM classes on the host.  For example:

# gpart show zvol/zroot/bhyve/head
=>      34  33554365  zvol/zroot/bhyve/head  GPT  (16G)
        34       128                      1  freebsd-boot  (64K)
       162  32714523                      2  freebsd-ufs  (16G)
  32714685    839714                      3  freebsd-swap  (410M)

One nice thing about this is that I can run fsck(8) directly against /dev/zvol/zroot/bhyve/headp2 without having to use mdconfig(8).  I could also choose to mount it on the host.
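
For example, a read-only check of the UFS partition inside the "head" volume:

# fsck_ufs -n /dev/zvol/zroot/bhyve/headp2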

Secondly, ZFS volumes show up in gstat(8) output.  With all the volumes the output can be quite long, but "gstat -p" can give you nice output breaking down I/O by VM:

dT: 1.001s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| cd0
    6    408    271   9934   42.9    138  17534    7.7  100.6| ada0
   12    443    272   5987   43.9    172  21844    5.7  100.0| ada1
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/head
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/head-i386
    0     18      2     64  393.8     16    321  119.4   90.7| zvol/zroot/bhyve/fbsd11-i386
    1     65     24   3068   53.0     41   3320   18.7  109.9| zvol/zroot/bhyve/fbsd11
    6     59     22    786   50.8     37    578   61.2  100.1| zvol/zroot/bhyve/fbsd10
    2     61      0      0    0.0     61   6609   77.0  101.8| zvol/zroot/bhyve/fbsd9
    1     18      8     94   60.4     10     33   10.6   51.3| zvol/zroot/bhyve/fbsd10-i386
    1     14      6     50   92.3      8    108  109.3   60.7| zvol/zroot/bhyve/fbsd9-i386
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/fbsd8-i386
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/fbsd8

Monday, August 1, 2016

Adventures in Disk Replacement

The Problem

A few years ago I built a new FreeBSD desktop at home.  For simplicity of booting, etc. I used the built-in RAID1 mirroring provided by the on-board SATA controller.  This worked fine.

Recently one of my drives began reporting SMART errors (I am running smartd from sysutils/smartmontools in daemon mode and it sends emails to root@ for certain types of errors).  Originally the drive logged two errors:

Device: /dev/ada0, 8 Currently unreadable (pending) sectors
Device: /dev/ada0, 8 Offline uncorrectable sectors

It logged these two (seemingly related?) errors once a month for the past three months.  This month it logged an additional error at which point I decided to swap out the drive:

Device: /dev/ada0, ATA error count increased from 0 to 5

The simple solution would be to just swap out the dying drive for the replacement, reboot, and let the rebuild chug away.  However, I decided to make a few changes which made things not quite so simple.

First, my existing system was laid out with UFS with separate partitions for /, /usr, and /var.  It was at least using GPT instead of MBR.  However, I wanted to switch from UFS to ZFS.  I'm not exactly ecstatic about how ZFS' ARC interfaces with FreeBSD's virtual memory subsystem (a bit of a square peg in a round hole).  However, for my desktop the additional data integrity of ZFS' integrated checksums is very compelling.  In addition, switching to ZFS enables more flexibility in the future for growing the pool as well as things like boot environments, ZFS integration with poudriere, zvols for my bhyve VMs, etc.

Second, since I was going to be doing a complicated data migration anyway, I figured I might as well redo my partitioning layout to support EFI booting.  In this case I wanted the flexibility to boot via legacy mode (CSM) if need be while keeping the option of switching to EFI.  This isn't that complicated (the install images for FreeBSD 11 are laid out for this), but FreeBSD's installer doesn't support this type of layout out of the box.

Step 1: Partitioning

I initially tried to see if I could do some of the initial setup using partedit from FreeBSD's installer.  However, I quickly ran into a few issues.  One, my desktop was still running 10-STABLE, whose partedit didn't support ZFS on GPT for booting.  (Even partedit in 11 doesn't seem to handle this by my reading of the source.)  Second, partedit in HEAD doesn't support creating a dual-mode (EFI and BIOS) disk.  Thus, I resorted to doing this all by hand.

First, I added a GPT table which is pretty simple (and covered in manual page examples):

# gpart create -s gpt ada2

To make the disk support dual-mode booting it needs both a EFI partition and a freebsd-boot partition.  For the EFI partition, FreeBSD ships a pre-formatted FAT image that can be written directly to the partition (/boot/boot1.efifat).  However, the formatted filesystem in this image is relatively small, and I wanted to use a larger EFI partition to match a recent change in FreeBSD 11 (200MB).  Instead of using the formatted filesystem, I formatted the EFI partition directly and copied the /boot/boot1.efi binary into the right subdirectory.  Ideally I think bsdinstall should do this as well rather than using the pre-formatted image.

# gpart add -t efi -s 200M -a 4k ada2
# newfs_msdos -L EFI /dev/ada2p1
# mount -t msdos /dev/ada2p1 /mnt
# mkdir -p /mnt/efi/boot
# cp /boot/boot1.efi /mnt/efi/boot/BOOTx64.efi
# umount /mnt

To handle BIOS booting, I installed the /boot/pmbr MBR bootstrap and /boot/gptzfsboot into a freebsd-boot partition.

# gpart bootcode -b /boot/pmbr ada2
# gpart add -t freebsd-boot -s 512k -a 4k ada2
# gpart bootcode -p /boot/gptzfsboot -i 2 ada2

Finally, I added partitions for swap and ZFS:

# gpart add -t freebsd-swap -a 4k -s 16G ada2
# gpart add -t freebsd-zfs -a 4k ada2

At this point the disk layout looked like this:

# gpart show ada2
=>        34  1953525101  ada2  GPT  (932G)
          34           6        - free -  (3.0K)
          40      409600     1  efi  (200M)
      409640        1024     2  freebsd-boot  (512K)
      410664    33554432     3  freebsd-swap  (16G)
    33965096  1919560032     4  freebsd-zfs  (915G)
  1953525128           7        - free -  (3.5K)

Step 2: Laying out ZFS

Now that partitioning was complete, the next step was to create a ZFS pool.  The ultimate plan is to add the "good" remaining disk as a mirror of the new disk, but I started with a single-device pool backed by the new disk.  I would have liked to use the existing zfsboot script from FreeBSD's installer to create the pool and lay out the various filesystems, but trying to use bsdconfig to do this just resulted in confusion.  It refused to do anything when I first ran the disk editor from the bsdconfig menu because no filesystem was marked as '/'.  Once I marked the new ZFS partition as '/' the child partedit process core dumped and bsdconfig returned to its main menu.  So, I punted and did this step all by hand as well.

I assumed that the instructions on FreeBSD's wiki from the old sysinstall days were stale as they predated the use of boot environments in FreeBSD.  Thankfully, Kevin Bowling has more recent instructions here.

Of course, one important step is that you need the ZFS kernel module to use ZFS.  The custom kernel I used on my desktop had a stripped down set of kernel modules so I had to add ZFS to the list and reinstall the kernel.

First, I created the pool:

# mkdir /tmp/zroot
# zpool create -o altroot=/tmp/zroot -O compress=lz4 -O atime=off -m none zroot /dev/ada2p4

Next, I added the various datasets (basically copied from Kevin's instructions):

# zfs create -o mountpoint=none zroot/ROOT
# zfs create -o mountpoint=/ zroot/ROOT/default
# zfs create -o mountpoint=/tmp -o exec=on -o setuid=off zroot/tmp
# zfs create -o mountpoint=/usr -o canmount=off zroot/usr
# zfs create zroot/usr/home
# zfs create -o setuid=off zroot/usr/ports
# zfs create -o mountpoint=/var -o canmount=off zroot/var
# zfs create -o exec=off -o setuid=off zroot/var/audit
# zfs create -o exec=off -o setuid=off zroot/var/crash
# zfs create -o exec=off -o setuid=off zroot/var/log
# zfs create -o atime=on zroot/var/mail
# zfs create -o setuid=off zroot/var/tmp
# zpool set bootfs=zroot/ROOT/default zroot
# chmod 1777 /tmp/zroot/tmp
# chmod 1777 /tmp/zroot/var/tmp

Step 3: Copy the Data

In the past when I've migrated UFS partitions between drives, I used 'dump | restore' which worked really well (preserved sparse files, etc.).  For this migration that wasn't an option.  Since I had separate UFS partitions I had to copy each one over:

# tar -cp --one-file-system -f - -C / . | tar -xSf - -C /tmp/zroot
# tar -cp --one-file-system -f - -C /var . | tar -xSf - -C /tmp/zroot/var
# tar -cp --one-file-system -f - -C /usr . | tar -xSf - -C /tmp/zroot/usr

Since I had been using UFS SU+J, I had copied over the .sujournal files, so I deleted those.

# rm /tmp/zroot/.sujournal /tmp/zroot/var/.sujournal /tmp/zroot/usr/.sujournal

Step 4: Adjust Boot Configuration

I added the following to /etc/rc.conf:

zfs_enable="YES"

and to /boot/loader.conf:

zfs_load="YES"
kern.geom.label.disk_ident.enable=0
kern.geom.label.gptid.enable=0

I also removed all references to the old RAID1 mirror from /etc/fstab.

With all this done I was ready to reboot.

Step 5: Test Boot

My BIOS does not permit selecting a different hard disk at boot, so I had to change the default boot disk in the BIOS settings.  Once this was done the system booted to ZFS just fine.

Step 6: Convert to Mirror

After powering down the box I unplugged the dead drive and booted up.  I verified that the remaining drive's serial number did not match the drive that had reported errors previously.  (I actually got this wrong the first time so had to boot a few times.)  Once this was correct I proceeded to destroy the now-degraded RAID1 in preparation for reusing the disk as a mirror.

# graid delete raid/r0

At this point, the raw disk (/dev/ada0) still had the underlying data (in particular a GPT), so that had to be destroyed as well:

# gpart destroy -F ada0

Now the ada0 disk needed to be partitioned identically to the new disk (now ada1).  I was able to copy the GPT over to save a few steps.

# gpart backup ada1 | gpart restore ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 2 ada0
# newfs_msdos -L EFI /dev/ada0p1
# mount -t msdos /dev/ada0p1 /mnt
# mkdir -p /mnt/efi/boot
# cp /boot/boot1.efi /mnt/efi/boot/BOOTx64.efi
# umount /mnt

Next, I added the two swap partitions to /etc/fstab and ran the /etc/rc.d/swap and /etc/rc.d/dumpon scripts.
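
That amounted to entries and commands along these lines (partition 3 is the freebsd-swap partition in the layout shown earlier):

/etc/fstab:

/dev/ada0p3	none	swap	sw	0	0
/dev/ada1p3	none	swap	sw	0	0

# /etc/rc.d/swap start
# /etc/rc.d/dumpon start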

Finally, I attached the ZFS partition on ada0 to the pool as a mirror.  NB: I was warned previously to be sure to use 'zpool attach' and not 'zpool add' as the latter would simply concatenate the disks and not provide redundancy.

# zpool attach zroot /dev/ada1p4 /dev/ada0p4
Make sure to wait until resilver is done before rebooting.

If you boot from pool 'zroot', you may need to update
boot code on newly attached disk '/dev/ada0p4'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

 gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

The one nit about the otherwise helpful messages is that they are hardcoded to assume the freebsd-boot partition is at partition 1.  I suspect it is not easy to auto-generate the correct command (as otherwise it would already do so), but it may need a language tweak to note that the index may also need updating, not just the disk name.  Also, this doesn't cover the EFI booting case (which admittedly is new in FreeBSD 11).

Anyway, the pool is now happily resilvering:

# zpool status
  pool: zroot
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
 continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Aug  1 08:12:58 2016
        1.63G scanned out of 207G at 98.1M/s, 0h35m to go
        1.63G resilvered, 0.79% done
config:

 NAME        STATE     READ WRITE CKSUM
 zroot       ONLINE       0     0     0
   mirror-0  ONLINE       0     0     0
     ada1p4  ONLINE       0     0     0
     ada0p4  ONLINE       0     0     0  (resilvering)

errors: No known data errors

Testing EFI will have to wait until I upgrade my desktop to 11.  Perhaps next weekend.

Updates

Some feedback from readers:
1) restore doesn't actually assume a UFS destination, so I probably could have used 'dump | restore'.
2) The 'zpool export/import' wasn't actually needed and has been removed (the create is sufficient).

Also, for the curious, the resilver finished in less than an hour:

# zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 207G in 0h53m with 0 errors on Mon Aug  1 09:06:26 2016