[nycbug-talk] ZFS and firewire - conditions for a perfect storm
Isaac Levy
ike at lesmuug.org
Sun Jun 29 19:11:36 EDT 2008
Hi All,
This story could have been avoided, but... I had a fairly 'perfect
storm' of data loss at home, thought I'd share with the list.
I figure perhaps someone here could suggest a plan of action to fix
this long-term?
If you don't care about firewire, feel free to skip this message- I've
replicated *everything below* with a SATA controller and ZFS/RAID-Z
works as expected, flawlessly. (ZFS is really astounding on FreeBSD!)
--
I've been a big fan (and little user) of ZFS for about a year now;
it's of course excellent. With that, I built a *very* cheap 2nd file
server from gear I had:
- A mini-PC (no PCI slots, almost laptop specs)
- 4 Firewire drives (long-time mac user, have the cases)
- daisy-chained setup, no Firewire hub
- ZFS RAID-Z (FreeBSD 7; was running HEAD, now RELEASE- pool created
roughly as sketched below)
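For reference, the pool setup was roughly this (a sketch from memory-
the pool name and device names match the zpool status output further
down, any other options I've long since forgotten):

  # four firewire drives, attached via sbp(4) as da0-da3
  zpool create Z raidz da0 da1 da2 da3
  zpool status Z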
This rig worked so well that it quickly became my primary SMB
workhorse. A Mac with an Apple software RAID became the backup
system (rsync over SSH rocks for me- the job is basically the
one-liner sketched below). The firewire drive cases are metal and
act as a heat-sink, so no noisy fans for home use :)
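The backup job itself is nothing fancy- a nightly one-liner along
these lines (paths and share names here are made up for illustration):

  # pull the ZFS server's share down onto the Mac's software RAID
  rsync -av --delete -e ssh blackowl:/Z/share/ /Volumes/Backup/share/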
The firewire drives seem to have issues hanging up ZFS if one fails,
but I figured *what the heck, it's got a backup on an altogether
different filesystem*, and at least I don't ever have to fsck 4TB
volumes! Whee! This has been a very productive tag-team for my uses-
until now...
--
The perfect storm:
I needed to pull 2 drives from the Mac, so I shut down SMB on the
ZFS server. I rebuilt the RAID on the Mac and brought it back
online.
- I went to immediately begin copying files back from the ZFS machine,
- A physical drive enclosure power board shorted out (perhaps the
fatal moment?)
- The FreeBSD system stayed up, but all I/O to the ZFS volumes would
just hang-
- df reported the volumes still mounted
- saw the firewire drive disappear in /var/log/messages (see the
device check sketched just below), but,
- 'zpool status' reported that ALL WAS FINE
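For what it's worth, a device-level sanity check like this would have
shown the drive was really gone, even while zpool status claimed
otherwise (a sketch of what I *should* have run, not what I did):

  # firewire disks attach through sbp(4)/CAM, so a dead one should
  # drop out of the peripheral list and show up in the logs
  camcontrol devlist
  grep -i sbp /var/log/messages | tail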
I freaked out, and did nothing to the system but watch it (and chain
smoke) for 2 hours- hoping perhaps that something would move. All
disk I/O to the ZFS volume stayed in a hung state.
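In hindsight, the way to confirm the I/O was truly wedged (and not
just crawling) is to look at the wait state of the stuck processes-
something like this (again, just a sketch):

  # anything touching the pool sits in state 'D' (disk wait) and
  # never comes back; wchan hints at what it's blocked on
  ps -axo pid,stat,wchan,command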
Finally, I rebooted the system- and it hung during the shutdown
sequence. I sat with it for another 30 minutes, hoping something
would change, to no avail.
I finally power-cycled the machine (perhaps the fatal moment?).
Before it came back up, I pulled the dead drive out of the chain,
expecting the ZFS pool to come back online in a RAID-Z degraded
state. Instead, I found the entire ZFS pool was hosed (insert
wailing and gnashing of teeth here):
[root@blackowl /usr/home/ike]# zpool status
  pool: Z
 state: FAULTED
status: One or more devices could not be used because the label is
        missing or invalid. There are insufficient replicas for the
        pool to continue functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        Z           FAULTED      0     0     0  corrupted data
          da0       FAULTED      0     0     0  corrupted data
          da1       FAULTED      0     0     0  corrupted data
          da2       FAULTED      0     0     0  corrupted data
          da3       ONLINE       0     0     0
[root@blackowl /usr/home/ike]#
Anyhow, I have a few theories about what happened, and some subsequent
tests.
1) The firewire bus could possibly be losing track of which device is
which- and confusing ZFS. In my daisy-chain setup, when one drive in
the chain dies (say, da2) and it's removed from the chain, it seems
to come back as the previous device name (e.g. da1).
However, this may not be the whole story- even when I remove the last
drive in the chain (where no renumbering should happen), the RAID-Z
pool is still destroyed, and it still hangs just like above.
ZFS definitely gets confused when I take an existing daisy-chain ZFS
pool and reboot the machine with all drives plugged into a firewire
hub:
[root@blackowl /usr/home/ike]# zpool status
  pool: Z
 state: UNAVAIL
status: One or more devices could not be used because the label is
        missing or invalid. There are insufficient replicas for the
        pool to continue functioning.
action: Destroy and re-create the pool from a backup source.
   see: http://www.sun.com/msg/ZFS-8000-5E
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        Z           UNAVAIL      0     0     0  insufficient replicas
          raidz1    UNAVAIL      0     0     0  insufficient replicas
            da0     FAULTED      0     0     0  corrupted data
            da1     FAULTED      0     0     0  corrupted data
            da2     FAULTED      0     0     0  corrupted data
            da3     FAULTED      0     0     0  corrupted data
[root@blackowl /usr/home/ike]#
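Since ZFS identifies pool members by the labels written on the disks
rather than by device name, an export/import is supposed to re-scan
everything and reassemble the pool no matter which daN each drive
lands on. I can't say it would have saved my pool, but it's the
obvious first thing to try after shuffling drives onto a hub (sketch):

  zpool export Z
  zpool import        # scans all disks for pool labels
  zpool import -f Z   # force the import if the pool shows up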
As an aside, for those not familiar with Firewire and Apple/OSX
stuff: this is not the case on a Mac with HFS+ volumes- they always
retain their identity regardless of how/where they are plugged in...
Stupid simple.
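The closest thing I know of on FreeBSD to that kind of stable
identity is glabel(8)- if I ever rebuild this pool I'd label the
disks first and build the RAID-Z on the labels, so the daN shuffle
can't matter (a sketch, label names made up):

  glabel label fw0 da0    # writes a label to the disk's last sector
  glabel label fw1 da1
  glabel label fw2 da2
  glabel label fw3 da3
  zpool create Z raidz /dev/label/fw0 /dev/label/fw1 \
      /dev/label/fw2 /dev/label/fw3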
2) The firewire driver may not send the right signal to the kernel
when a drive fails or is removed, or perhaps ZFS doesn't pick it up
correctly. This would explain the hanging and subsequent data
corruption when even the last drive is simply unplugged from the
chain. (My expected result would simply be an 'UNAVAIL' ZFS device
and proper RAID-Z degraded functionality.)
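One controlled test that separates the driver question from ZFS
itself: tell ZFS directly that a disk is gone, with no firewire
involved, and the pool should degrade exactly the way I'd expect the
unplug to behave (sketch):

  zpool offline Z da2   # pool should go DEGRADED, not hang or FAULT
  zpool status Z
  zpool online Z da2    # resilvers and returns to ONLINE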
--
Resolution/Conclusions:
1) Don't bank on Firewire/ZFS on FreeBSD. Apple hasn't even sorted
the firewire issues out for OSX... :) (Anyone know about OpenSolaris/
Firewire/ZFS? How's that for esoteric :)
ZFS ROCKS HARD in my experience using SATA and well-worn (and even
cheap) SATA controllers, so for the future, I'll be sticking to the
MASSIVE amount of that hardware out there.
2) Should I email a FreeBSD dev list with this, and if so, which one?
Firewire? ZFS?
3) Should I email a Sun/ZFS list with this, and if so, does anyone
know that scene?
Rocket-
.ike