[nycbug-talk] Day 2, EuroBSDCon 2007 report

Miles Nordin carton at Ivy.NET
Sun Sep 16 09:00:02 EDT 2007


>>>>> "il" == Isaac Levy <ike at lesmuug.org> writes:

    il> Pawel Jakub Dawidek, "FreeBSD and ZFS".

I have been using it on Solaris for a little over a year, and it
really is ``that good''.  Some of the problems I blogged about last
December have been fixed between Nevada b44 and b71.  It's still not
perfect, though, and some of these problems will certainly spill over
into FreeBSD's implementation:

 * I'm still having problems where the machine panics if a disk goes
   away.  Panicking on strange filesystem conditions (and, in some
   cases, I think even corrupting kernel memory if an on-disk data
   structure is garbage?) was the norm with FFS, but that norm needs
   to end.

 * I still don't understand the state machine for mirrors: if half a
   mirror goes away and then comes back, when does ZFS notice it's out
   of sync, right away or only after a scrub?  (A rough test sequence
   is sketched after this list.)

    - the claim is that it notices right away, and yes, there is a
      mini-resilver that happens after the mirror is rejoined.  But if
      I run 'zpool scrub pool' after the mini-resilver finishes, scrub
      still finds inconsistencies.

    - errors reported by 'zpool status', including the ``please scrub
      me by hand'' mirror-inconsistency kind, tend to vanish after a
      reboot.  It forgets that it ever noticed the mirror was
      inconsistent, which doesn't seem okay.

   For things like iSCSI (restarting the daemon) or scratchy FireWire
   connections (targets go away and come back, at worst maybe even
   with a different device name), it's important to handle a mirror
   component that vanishes for, say, 2.5 seconds and then comes back
   in a solid and graceful way.  The real message here, though, is an
   optimistic one: ZFS provides an architecture and a style that make
   it possible to ask for something as ridiculous as ``please
   gracefully deal with targets that vanish for 2.5 seconds and
   re-appear on a different device node,'' which would be impossible
   with a regular LVM/geom/RAIDframe-type system, or even for a
   hardware system without a gigabyte of NVRAM.

 * Also, there is a missing feature which would be very nice:
   something like LVM's 'pvmove' command, to migrate data off one vdev
   onto other (possibly just-added) empty vdevs so that the whole old
   vdev can be safely removed from the pool.  But yes, ZFS is probably
   still better than everything else.
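
For what it's worth, the mirror test I keep poking at looks roughly
like the sketch below.  It's only a sketch: the pool and device names
('tank', 'c2t0d0') are made up, and 'zpool offline'/'zpool online'
only approximate a target that really drops off the bus on its own:

  zpool status -v tank        # start from a healthy two-way mirror
  zpool offline tank c2t0d0   # simulate half the mirror going away
    [write some data while the mirror is degraded]
  zpool online tank c2t0d0    # rejoin it; a mini-resilver should run
  zpool status -v tank        # wait for the resilver to finish
  zpool scrub tank            # then see whether scrub still finds
  zpool status -v tank        #   CKSUM inconsistencies afterward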

But there is so much obvious and non-obvious stuff that's fantastic
about it.  For example, I think the idea of scrubbing is a non-obvious
fantastic thing.  For near-line storage, it's common for disks to go
bad quietly: you don't know they're bad until you try to access the
seldom-read data, which is terrible because you end up with mirrors
and RAID5's that have multiple bad components.  So, in my opinion,
disks in a mirror or RAID5 should be tested every couple of months
with something that reads every block: 'dd if=/dev/disk of=/dev/null',
or some kind of SMART testing (offline testing or background
testing?).  But this practice is not common; it isn't even done by a
modern fsck invoked in the normal way, even when you suspect a drive
might be bad and fsck it.  It hasn't been done since the ancient days
of disks that didn't remap bad blocks.  The practice needs to come
back: not necessarily reading the unallocated areas of the disk, but
at least every block that's holding data should get a test-read every
couple of months.  'zpool scrub' does exactly that, in what I think is
an O(n) way, and running it regularly is a common ZFS best practice.
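
If it helps, here is roughly what that schedule could look like as
root crontab entries.  This is only a sketch: the pool name 'tank' and
the device path are made up, and the dd line is just the whole-disk
read test mentioned above, nothing ZFS-specific:

  # scrub the pool at 03:00 on the 1st of every other month
  0 3 1 1,3,5,7,9,11 * /usr/sbin/zpool scrub tank
  # for non-ZFS disks, read every block and throw it away once a month
  0 4 1 * * dd if=/dev/rdsk/c1t2d0s0 of=/dev/null bs=1024k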

People always have strange, complicated, long stories about how they
lose their data, but my impression is that home users tend to lose
everything about once every one or two years, and experienced Unix
people maybe every five years or so?  I think something like scrubbing
a giant near-line array, versus not doing it, can increase the array's
in-practice lifespan by many years.

I haven't lost everything yet, but I do have these habitual
mini-disasters that need to stop.  I had an unmirrored single-disk
ZFS go bad recently: the drive was still working but had read and
write errors.  I go through this marginal-disk problem a lot, and the
answer for me is usually:

  dd if=/dev/broken of=/dev/newdisk bs=512 conv=noerror,sync
  fsck /dev/newdisk

So, with ZFS, this becomes more like:

  dd if=/dev/broken of=/dev/newdisk bs=512 conv=noerror,sync
  zpool import
    [with no arguments, it lists importable pools and their numeric ids]
  zpool import -f 73710598603223
  zpool scrub pool

There were two regions of read errors on the disk.  When ZFS's scrub
finished, 'zpool status' gave me the pathnames of the files that had
been corrupted by dd replacing the unreadable sectors with zeroes.  I
didn't need the files, so I deleted them, and now ZFS shows this:

bash-3.00# zpool status -v pool
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     0
          c3t1d0s3  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        pool/export:<0x3a10e>
        pool/export:<0x226cb6>
        pool/export:<0x28cf7b>

(it used to show pathnames, I promise.)  With Linux ext3, I discovered
these errors a year later when some .avi wouldn't play.  Having
pathnames is obviously great, because I can go hunting all over the
Internet or through my disk clutter for another copy of the corrupt
file, and I can do my hunting before next year, when other copies of
the file have become more scarce.  I can feel confident the rest of
the disk is in good shape instead of worrying that maybe I should
reinstall the operating system.  This saves me so much time and lets
me be lazy.  It's good to have accurate, sanely-displayed information
about exactly which data was lost, rather than the 'inode 98fed94
CLEARED!!!' you get from fsck'ing an unclean filesystem.  FFS has been
keeping my valuable data since 1999, and now, after a year of testing
ZFS, I think I will move that data onto ZFS for the next eight years.