Monday, August 8, 2016

Using ZVOLs for bhyve VMs

One of the follow-up changes after converting my desktop from UFS to ZFS was to convert my bhyve VMs from raw "disk.img" files in my home directory to ZFS volumes (ZVOLs).  The conversion process was fairly simple.

First, I created a new dataset to hold them:

# zfs create zroot/bhyve

Next, for each VM I created a new volume and dd'd the raw disk image over to it.  For example:

# zfs create -V 16G zroot/bhyve/head
# dd if=bhyve/head/disk.img of=/dev/zvol/zroot/bhyve/head bs=1m

I then booted the virtual machine, passing "-d /dev/zvol/zroot/bhyve/head" to vmrun.sh.  Once that was successful, I removed the old disk.img files.
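
For reference, a full invocation looks something like this (the CPU count, memory size, tap device, and VM name here are just placeholders):

# sh /usr/share/examples/bhyve/vmrun.sh -c 2 -m 2G -t tap0 -d /dev/zvol/zroot/bhyve/head head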

By default, FreeBSD exports ZFS volumes as GEOM providers.  This means the volumes can be tasted on the host.  For example:

# gpart show zvol/zroot/bhyve/head
=>      34  33554365  zvol/zroot/bhyve/head  GPT  (16G)
        34       128                      1  freebsd-boot  (64K)
       162  32714523                      2  freebsd-ufs  (16G)
  32714685    839714                      3  freebsd-swap  (410M)

One nice thing about this is that I can run fsck(8) directly against /dev/zvol/zroot/bhyve/headp2 without having to use mdconfig(8).  I could also choose to mount it on the host.
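
For example, with the VM shut down, both work directly on the host:

# fsck_ufs -n /dev/zvol/zroot/bhyve/headp2
# mount -o ro /dev/zvol/zroot/bhyve/headp2 /mnt
# umount /mnt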

Second, ZFS volumes show up in gstat(8) output.  With all of the volumes (and their tasted partitions) the output can be quite long, but "gstat -p" gives a nice breakdown of I/O by VM:

dT: 1.001s  w: 1.000s
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0      0      0      0    0.0      0      0    0.0    0.0| cd0
    6    408    271   9934   42.9    138  17534    7.7  100.6| ada0
   12    443    272   5987   43.9    172  21844    5.7  100.0| ada1
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/head
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/head-i386
    0     18      2     64  393.8     16    321  119.4   90.7| zvol/zroot/bhyve/fbsd11-i386
    1     65     24   3068   53.0     41   3320   18.7  109.9| zvol/zroot/bhyve/fbsd11
    6     59     22    786   50.8     37    578   61.2  100.1| zvol/zroot/bhyve/fbsd10
    2     61      0      0    0.0     61   6609   77.0  101.8| zvol/zroot/bhyve/fbsd9
    1     18      8     94   60.4     10     33   10.6   51.3| zvol/zroot/bhyve/fbsd10-i386
    1     14      6     50   92.3      8    108  109.3   60.7| zvol/zroot/bhyve/fbsd9-i386
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/fbsd8-i386
    0      0      0      0    0.0      0      0    0.0    0.0| zvol/zroot/bhyve/fbsd8

Monday, August 1, 2016

Adventures in Disk Replacement

The Problem

A few years ago I built a new FreeBSD desktop at home.  For simplicity of booting, etc. I used the built-in RAID1 mirroring provided by the on-board SATA controller.  This worked fine.

Recently, one of my drives began reporting SMART errors.  (I run smartd from sysutils/smartmontools in daemon mode, and it sends email to root@ for certain types of errors.)  Originally the drive logged two errors:

Device: /dev/ada0, 8 Currently unreadable (pending) sectors
Device: /dev/ada0, 8 Offline uncorrectable sectors

It logged these two (seemingly related?) errors once a month for the past three months.  This month it logged an additional error, at which point I decided to swap out the drive:

Device: /dev/ada0, ATA error count increased from 0 to 5
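
(For anyone wanting the same setup: it amounts to smartd_enable="YES" in /etc/rc.conf plus per-device entries in /usr/local/etc/smartd.conf along these lines; the exact flags are a matter of taste.)

/dev/ada0 -a -m root
/dev/ada1 -a -m root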

The simple solution would be to just swap out the dying drive for the replacement, reboot, and let the rebuild chug away.  However, I decided to make a few changes which made things not quite so simple.

First, my existing system used UFS with separate partitions for /, /usr, and /var.  It was at least using GPT instead of MBR.  However, I wanted to switch from UFS to ZFS.  I'm not exactly ecstatic about how ZFS' ARC interfaces with FreeBSD's virtual memory subsystem (a bit of a square peg in a round hole), but for my desktop the additional data integrity of ZFS' integrated checksums is very compelling.  In addition, switching to ZFS gives more flexibility in the future for growing the pool, as well as things like boot environments, ZFS integration with poudriere, ZVOLs for my bhyve VMs, etc.

Second, since I was going to be doing a complicated data migration anyway, I figured I might as well redo my partitioning layout to support EFI booting.  In this case I wanted the flexibility to boot via legacy mode (CSM) if need be, while having the option of switching to EFI.  This isn't that complicated (the install images for FreeBSD 11 are laid out this way), but FreeBSD's installer doesn't support this type of layout out of the box.

Step 1: Partitioning

I first tried to see if I could do some of the setup using partedit from FreeBSD's installer.  However, I quickly ran into a few issues.  First, my desktop was still running 10-STABLE, which didn't support ZFS on GPT for booting.  (Even partedit in 11 doesn't seem to handle this by my reading of the source.)  Second, partedit in HEAD doesn't support creating a dual-mode (EFI and BIOS) disk.  Thus, I resorted to doing all of this by hand.

First, I added a GPT partition table, which is pretty simple (and covered in the gpart(8) manual page examples):

# gpart create -s gpt ada2

To make the disk support dual-mode booting it needs both an EFI partition and a freebsd-boot partition.  For the EFI partition, FreeBSD ships a pre-formatted FAT image that can be written directly to the partition (/boot/boot1.efifat).  However, the filesystem in this image is relatively small, and I wanted to use a larger EFI partition to match a recent change in FreeBSD 11 (200MB).  Instead of using the pre-formatted filesystem, I formatted the EFI partition directly and copied the /boot/boot1.efi binary into the right subdirectory.  Ideally, I think bsdinstall should do this as well rather than using the pre-formatted image.

# gpart add -t efi -s 200M -a 4k ada2
# newfs_msdos -L EFI /dev/ada2p1
# mount -t msdos /dev/ada2p1 /mnt
# mkdir -p /mnt/efi/boot
# cp /boot/boot1.efi /mnt/efi/boot/BOOTx64.efi
# umount /mnt

To handle BIOS booting, I installed the /boot/pmbr MBR bootstrap and /boot/gptzfsboot into a freebsd-boot partition.

# gpart bootcode -b /boot/pmbr ada2
# gpart add -t freebsd-boot -s 512k -a 4k ada2
# gpart bootcode -p /boot/gptzfsboot -i 2 ada2

Finally, I added partitions for swap and ZFS:

# gpart add -t freebsd-swap -a 4k -s 16G ada2
# gpart add -t freebsd-zfs -a 4k ada2

At this point the disk layout looked like this:

# gpart show ada2
=>        34  1953525101  ada2  GPT  (932G)
          34           6        - free -  (3.0K)
          40      409600     1  efi  (200M)
      409640        1024     2  freebsd-boot  (512K)
      410664    33554432     3  freebsd-swap  (16G)
    33965096  1919560032     4  freebsd-zfs  (915G)
  1953525128           7        - free -  (3.5K)

Step 2: Laying out ZFS

Now that partitioning was complete, the next step was to create a ZFS pool.  The ultimate plan was to add the "good" remaining disk as a mirror of the new disk, but I started with a single-device pool backed by the new disk.  I would have liked to use the existing zfsboot script from FreeBSD's installer to create the pool and lay out the various filesystems, but trying to use bsdconfig to do this just resulted in confusion.  It refused to do anything when I first ran the disk editor from the bsdconfig menu because no filesystem was marked as '/'.  Once I marked the new ZFS partition as '/', the child partedit process dumped core and bsdconfig returned to its main menu.  So, I punted and did this step all by hand as well.

I assumed that the instructions on FreeBSD's wiki from the old sysinstall days were stale, as they predated the use of boot environments in FreeBSD.  Thankfully, Kevin Bowling has more recent instructions here.

Of course, one important prerequisite is the ZFS kernel module.  The custom kernel I use on my desktop builds a stripped-down set of kernel modules, so I had to add ZFS to the list and reinstall the kernel.
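
If the module list is trimmed via MODULES_OVERRIDE (one common way to do this; the "..." stands in for the existing list and the kernel config name below is made up), the change amounts to something like:

makeoptions     MODULES_OVERRIDE="opensolaris zfs ..."

followed by a rebuild:

# cd /usr/src
# make buildkernel KERNCONF=DESKTOP
# make installkernel KERNCONF=DESKTOP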

First, I created the pool:

# mkdir /tmp/zroot
# zpool create -o altroot=/tmp/zroot -O compress=lz4 -O atime=off -m none zroot /dev/ada2p4

Next, I added the various datasets (basically copied from Kevin's instructions):

# zfs create -o mountpoint=none zroot/ROOT
# zfs create -o mountpoint=/ zroot/ROOT/default
# zfs create -o mountpoint=/tmp -o exec=on -o setuid=off zroot/tmp
# zfs create -o mountpoint=/usr -o canmount=off zroot/usr
# zfs create zroot/usr/home
# zfs create -o setuid=off zroot/usr/ports
# zfs create -o mountpoint=/var -o canmount=off zroot/var
# zfs create -o exec=off -o setuid=off zroot/var/audit
# zfs create -o exec=off -o setuid=off zroot/var/crash
# zfs create -o exec=off -o setuid=off zroot/var/log
# zfs create -o atime=on zroot/var/mail
# zfs create -o setuid=off zroot/var/tmp
# zpool set bootfs=zroot/ROOT/default zroot
# chmod 1777 /tmp/zroot/tmp
# chmod 1777 /tmp/zroot/var/tmp
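
A quick sanity check of the layout before copying any data (with the altroot in effect, the mountpoints show up under /tmp/zroot):

# zfs list -r -o name,canmount,mountpoint zroot
# zpool get bootfs zroot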

Step 3: Copy the Data

In the past when I've migrated UFS partitions across drives, I used 'dump | restore', which worked really well (it preserved sparse files, etc.).  For this migration that wasn't an option (but see the update below).  Since I had separate UFS partitions I had to copy each one over:

# tar -cp --one-file-system -f - -C / . | tar -xSf - -C /tmp/zroot
# tar -cp --one-file-system -f - -C /var . | tar -xSf - -C /tmp/zroot/var
# tar -cp --one-file-system -f - -C /usr . | tar -xSf - -C /tmp/zroot/usr

Since I had been using UFS SU+J, the copies brought over the .sujournal files, so I deleted those:

# rm /tmp/zroot/.sujournal /tmp/zroot/var/.sujournal /tmp/zroot/usr/.sujournal

Step 4: Adjust Boot Configuration

I added the following to /etc/rc.conf:

zfs_enable="YES"

and to /boot/loader.conf:

zfs_load="YES"
kern.geom.label.disk_ident.enable=0
kern.geom.label.gptid.enable=0

I also removed all references to the old RAID1 mirror from /etc/fstab.
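
The entries in question were of this general form (the partition numbers here are illustrative):

/dev/raid/r0p2	/	ufs	rw	1	1
/dev/raid/r0p3	/var	ufs	rw	2	2
/dev/raid/r0p4	/usr	ufs	rw	2	2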

With all this done I was ready to reboot.

Step 5: Test Boot

My BIOS does not permit selecting a different hard disk at boot, so I had to change the default boot disk in the BIOS settings.  Once this was done the system booted to ZFS just fine.

Step 6: Convert to Mirror

After powering down the box, I unplugged the failing drive and booted back up.  I verified that the remaining drive's serial number did not match the drive that had been reporting errors.  (I actually got this wrong the first time, so I had to boot a few times.)  Once this was correct, I proceeded to destroy the now-degraded RAID1 in preparation for reusing the disk as a mirror:

# graid delete raid/r0

At this point, the raw disk (/dev/ada0) still had the underlying data (in particular a GPT), so that had to be destroyed as well:

# gpart destroy -F ada0

Now the ada0 disk needed to be partitioned identically to the new disk (now ada1).  I was able to copy the GPT over to save a few steps.

# gpart backup ada1 | gpart restore ada0
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 2 ada0
# newfs_msdos -L EFI /dev/ada0p1
# mount -t msdos /dev/ada0p1 /mnt
# mkdir -p /mnt/efi/boot
# cp /boot/boot1.efi /mnt/efi/boot/BOOTx64.efi
# umount /mnt

Next, I added the two swap partitions to /etc/fstab and ran the /etc/rc.d/swap and /etc/rc.d/dumpon scripts.
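
The fstab additions are just the two freebsd-swap partitions (using the layout above):

/dev/ada0p3	none	swap	sw	0	0
/dev/ada1p3	none	swap	sw	0	0

and the scripts were run as:

# /etc/rc.d/swap start
# /etc/rc.d/dumpon start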

Finally, I attached the ZFS partition on ada0 to the pool as a mirror.  NB: I was warned previously to be sure to use 'zpool attach' and not 'zpool add', as the latter would add the disk as a separate top-level vdev (striping the pool across the two disks) rather than providing redundancy.

# zpool attach zroot /dev/ada1p4 /dev/ada0p4
Make sure to wait until resilver is done before rebooting.

If you boot from pool 'zroot', you may need to update
boot code on newly attached disk '/dev/ada0p4'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

 gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

The one nit about the otherwise helpful messages is that they are hardcoded to assume the freebsd-boot partition is at index 1.  I suspect it is not easy to auto-generate the correct command (as otherwise it would already do so), but the text may need a tweak to note that the partition index may also need updating, not just the disk name.  Also, this doesn't cover the EFI booting case (which admittedly is new in FreeBSD 11).

Anyway, the pool is now happily resilvering:

# zpool status
  pool: zroot
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
 continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Aug  1 08:12:58 2016
        1.63G scanned out of 207G at 98.1M/s, 0h35m to go
        1.63G resilvered, 0.79% done
config:

 NAME        STATE     READ WRITE CKSUM
 zroot       ONLINE       0     0     0
   mirror-0  ONLINE       0     0     0
     ada1p4  ONLINE       0     0     0
     ada0p4  ONLINE       0     0     0  (resilvering)

errors: No known data errors

Testing EFI will have to wait until I upgrade my desktop to 11.  Perhaps next weekend.

Updates

Some feedback from readers:
1) restore doesn't actually assume a UFS destination, so I probably could have used 'dump | restore' after all (a sketch of what that would have looked like is below).
2) The 'zpool export/import' wasn't actually needed and has been removed (the create is sufficient).
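
For the root filesystem that would have looked something like this (and similarly for /var and /usr):

# dump -0Laf - / | (cd /tmp/zroot && restore -rf -)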

Also, for the curious, the resilver finished in less than an hour:

# zpool status
  pool: zroot
 state: ONLINE
  scan: resilvered 207G in 0h53m with 0 errors on Mon Aug  1 09:06:26 2016