[talk] zfs disk outage and aftermath

N.J. Thomas njt at ayvali.org
Mon Feb 4 16:46:08 EST 2019


Well, yesterday morning I suffered my first drive failure on one of my
ZFS boxes (running FreeBSD 12.0); as it happens, it was my primary
backup server.

"zpool status" showed that my regularly scheduled scrub had found (and
fixed) some errors on one of the disks in a mirrored pair.
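
For anyone who hasn't run into this yet, the relevant commands are
roughly as follows ("tank" below is a placeholder pool name, not my
actual one):

    # show pool and device health, including the results of the last scrub
    zpool status -v tank

    # start a scrub by hand (mine run on a schedule, see below)
    zpool scrub tank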

I made sure that my replica box had up-to-date snapshots transferred
over, shut down the machine, and asked the datacenter team to check the
drive. They indeed found that it was faulty and replaced it.

It took about 4 hours for the drive to be resilvered, and that was it.
Back to normal with almost no issues -- apart from the few minutes that
the machine was down while its drive was being replaced.
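
In case it is useful to anyone, the swap itself boils down to a couple
of zpool commands (pool and device names here are placeholders, not the
ones I actually used):

    # take the failing disk out of service before it gets pulled
    zpool offline tank da2

    # once the new disk is in the same slot, start the resilver
    zpool replace tank da2

    # watch resilver progress
    zpool status tank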

My takeaways:

    - use ZFS

    - take regular snapshots

    - replicate your snapshots to another machine

    - scrub your disks regularly (unlike fsck, a scrub can run while the
      pool is mounted and in active use; see the command sketch after
      this list)

    - monitor zfs health (I use this script from Calomel.org:
      https://calomel.org/zfs_health_check_script.html)
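
For completeness, the snapshot, replication, and scrub pieces come down
to commands along these lines (pool, dataset, and host names are made-up
placeholders, and tools like zxfer or syncoid can automate the
send/receive part):

    # take a snapshot of a dataset
    zfs snapshot tank/data@2019-02-04

    # replicate the full snapshot to another box over ssh
    zfs send tank/data@2019-02-04 | ssh replica zfs receive -u backup/data

    # afterwards, send only the changes since the previous snapshot
    zfs send -i @2019-02-04 tank/data@2019-02-05 | \
        ssh replica zfs receive -u backup/data

    # on FreeBSD the stock periodic scripts can schedule scrubs;
    # in /etc/periodic.conf:
    #   daily_scrub_zfs_enable="YES"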

The first three points are kinda obvious; the last two I picked up from
other, more experienced ZFS users.

I had been waiting for this day since I first started using ZFS years
ago, and I am very happy with my decision to use this filesystem.

Thomas



