[nycbug-talk] Approach for the NFS cluster

Miles Nordin carton at Ivy.NET
Tue May 19 18:20:07 EDT 2009


>>>>> "il" == Isaac Levy <isaac at diversaform.com> writes:

    il> I was under the impression that disabling the ZIL was a
    il> developer debugging thing- it's dangerous, period.

no, 100% incorrect.  Disabling the ZIL does not increase the
likelihood of losing the pool at all.  It does break the NFSv3
transaction/commit system if the server reboots (meaning: you lose
recently written data, and the client ``acts weird'' until you umount
and remount the NFS shares involved), and it also breaks fsync(), so
it's not safe for filesystems-on-filesystems (databases, VM guest
backing stores).  However, I think the dirty truth is that most VMs
suppress syncs to win performance and are unsafe to guest filesystems
if the host reboots, with or without a ZIL.  For databases and mail it
obviously matters.

The way they explained it on the ZFS list is that the ZIL is always
present in RAM, even when disabled, and is part of how the POSIX
abstraction in ZFS is implemented (which is layered on top of the
object store, a sibling of zvols).  Normally the ZIL is committed to
the regular part of the disk, the bulk part, when each TXG commits
every 30 seconds or so.  When you call fsync(), or when NFSv3 commits
or closes a file, the relevant part of the ZIL in RAM is flushed to
disk.  It's flushed to a separate special disk area that acts as a
log, so it is write-only unless there is a crash.  Eventually, when
the next TXG commits, the prior ZIL flush is superseded, and the
blocks both in RAM and on disk are free for reuse.  Often that special
area is incorrectly called ``the ZIL'', and writing to it is what you
disable.  So disabling it doesn't endanger data written more than 30
seconds before the crash.
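
For concreteness, that separate disk area is just a log vdev you
attach to the pool; something like this (pool and device names are
made up):

  # give the pool a dedicated log (slog) device
  zpool add tank log c4t1d0        # Solaris-style device name
  zpool status tank                # it shows up under ``logs''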

but disabling the ZIL does break fsync(), so you shouldn't do it just
for fun.  Also, I think it's a global setting, not per-filesystem,
which kind of blows chunks.
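
For reference, at the time the knob really was a global tunable;
roughly how it looked (Solaris /etc/system and FreeBSD loader.conf
spellings of that era, so treat the exact syntax as an assumption):

  # Solaris/OpenSolaris: /etc/system, takes effect on reboot
  set zfs:zil_disable = 1

  # FreeBSD: /boot/loader.conf
  vfs.zfs.zil_disable="1"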

    il> I really feel that some Ggate-ish thing could be written for
    il> the Geom subsystem which allowed for multiple writes?  Or
    il> something which did writes according to some transactional
    il> model- (locking files, etc...)

There are two kinds of multiple writers.  The first kind is at the
SCSI layer.

The better Linux iSCSI targets (like SCST, which I haven't used yet)
support multiple initiators.  This is a SCSI term.  When you activate
an SCST target, a blob of SCST springs to life in the kernel,
intercepting ALL SCSI commands headed toward the disk.  Applications
and filesystems running on the same box as SCST represent one
initiator and get routed through this blob.  The actual iSCSI
initiators out on the network become the second and further
initiators.  So, even if you have only one iSCSI initiator hitting
your target, SCST's multiple-initiator features are active.
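
For reference, the remote initiators in that picture are just the
ordinary iSCSI client stack; with Linux open-iscsi that side looks
roughly like this (address and IQN are made up):

  iscsiadm -m discovery -t sendtargets -p 192.0.2.10
  iscsiadm -m node -T iqn.2009-05.com.example:store0 -p 192.0.2.10 --login
  # the LUN then shows up as an ordinary /dev/sd* disk on the initiator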

There are many multi-initiator features in the SCSI standard.  I don't
completely understand any of them.

One is the reservation protocol, which can be used as a sort of
heartbeat.  However, since a smart and physically-sealed device is
managing the heartbeat rather than a mess of cabling and switches,
split-brain is probably less likely when all nodes are checking in
with one of their disks than when they ping each other over a network.
The disk then becomes a single point of failure, so you need a quorum
of 3 disks.

I think maybe the reservation protocol can also block access to the
disk, but I'm not sure.  That is not its most important feature.
Sun's cluster stuff has bits in the host driver stack to block access
to a disk when a cluster node isn't active and isn't supposed to have
access, and I suspect it can work at the slice/partition level.
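
As a concrete sketch of what poking at reservations looks like from
Linux (this is the SCSI-3 persistent-reservation flavor, via
sg_persist from sg3_utils; the device name and keys are made up):

  # see who is registered and who holds the reservation
  sg_persist --in --read-keys /dev/sdb
  sg_persist --in --read-reservation /dev/sdb

  # register a key, then take a write-exclusive reservation with it
  sg_persist --out --register --param-sark=0xdead0001 /dev/sdb
  sg_persist --out --reserve --param-rk=0xdead0001 --prout-type=1 /dev/sdb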

A second kind of multi-initiator feature is to clean up all the
standards-baggage of extra, seldom-used SCSI features.  For example,
if one node turns on the write cache and sets a read-ahead threshold,
the change will affect all nodes, but the other nodes won't know about
it.  SCSI has some feature to broadcast changes to mode pages.
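
Those settings live in mode pages, which any one initiator can
inspect and flip; roughly (device name is made up, tools are sg_modes
and sdparm on Linux):

  sg_modes /dev/sdb                # dump the mode pages
  sdparm --get=WCE /dev/sdb        # is the write cache on?
  sdparm --set=WCE /dev/sdb        # turn it on; other initiators
                                   # sharing the LUN won't be told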

A third multi-initiator feature is to actually support reads and
writes from multiple hosts.  SCST claims they re-implement TCQ in
their multi-initiator blob, to support SATA disks which have either no
exposed queues or queue commandsets which don't work with multiple
initiators.

But SEPARATE FROM ALL THIS MULTI-INITIATOR STUFF is the second kind
of multiple writer.  Two hosts being able to write to the disk won't
help you with any traditional filesystem, including ZFS.  You can't
mount a filesystem from two hosts over the same block device.  This is
certain---it won't work without completely rearchitecting the
filesystem.  Filesystems aren't designed to accept input from
underneath them.
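
ZFS will even let you shoot yourself in the foot here if you insist:
if host A has a pool imported and you force-import it on host B
anyway, the pool gets corrupted, because neither copy of ZFS knows
the other is writing (pool name made up):

  # on host B, while host A still has the pool imported: don't do this
  zpool import -f tank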

It's possible to envision such a filesystem.  For example, the
Berkeley/Sleepycat/Oracle BDB library can have multiple processes open
the same database.  But they cheat!  They have shared memory regions
so the multiple processes communicate with each other directly, and a
huge chunk of the library's complexity exists to support this feature
(since it's often why programmers turn to the library in the first
place).

And it's been done with filesystems, too.  RedHat GFS, Oracle OCFS,
and Sun QFS all work this way, and all of them also cheat: you need a
``metadata'' node which has direct and exclusive access to a little
bit of local disk (the metadata node might be a single-mounter
active/passive HA cluster like we were talking about before).  The
point of these filesystems isn't so much availability as switching
density.  It's not that you want an active/active cluster so you can
feel better---it's that there's so much filesystem traffic it can't be
funneled through a single CPU.  When clients open big files, yes,
granted, the opens are still funneled through the single metadata
server, but all the other bulk access to the meat inside the files can
go straight to the disks.  It's not so much for clustered servers
serving non-cluster clients as for when EVERYTHING is part of the
cluster.

The incentive to design filesystems this way, having clients use the
SCSI multi-initiator features directly, is the possibility of
extremely high-bandwidth, high-density, high-price FC-SW storage from
EMC/Hitachi/NetApp.  Keeping both nodes in an HA cluster active is
only slightly helpful, because you still have to be able to handle
your whole workload on 1 node, and because 2 is only a little bigger
than 1.  But if you can move more of your work into the cluster, so
only the interconnect is shared and not the kernel image, it's
possible to grow much further, with the interconnect joining:

   [black box] storage system
   2 metadata nodes
   n work nodes

The current generation of split metadata/data systems (Google FS,
pNFS, GlusterFS, Lustre) uses filesystems rather than SCSI devices as
the data backing store, so now the interconnect joins:

  m storage nodes 
  2 metadata nodes
  n work nodes

and you do not use SCSI-2 style multi-initiator at all, except maybe
on the 2 metadata nodes.  All of them (not sure about GoogleFS, but
all the others) have separate metadata servers like GFS/OCFS/QFS.  The
difference is
that the data part is also a PeeCee with another filesystem like ext4
or ZFS between the disks and the clients, instead of disks directly.
I think this approach has got the future nailed down, especially if
the industry manages to deliver lossless fabric between PeeCees like
infiniband or CEE, and especially because everyone seems to say GFS
and OCFS don't work.

I think QFS does work though.  It's very old.  And Sun has been
mumbling about ``emancipating'' it.

  http://www.auc.edu.au/myfiles/uploads/Conference/Presentations%202007/Duncan_Ian.pdf
  http://www.afp548.com/filemgmt/visit.php?lid=64&ei=nyMTSsuQKJiG8gTmq8iBBA
  http://wikis.sun.com/display/SAMQFS/Home

    il> something which did writes according to some transactional
    il> model- (locking files, etc...)

I've heard of many ways to take ``crash consistent'' backups at the
block layer.  Without unmounting the filesystem, you can back it up
while the active node is still using it, with no cooperation from the
filesystem.  There are also storage layers that use this philosophy to
make live backups: they watch the filesystem do its work and replicate
it asynchronously over a slow connection offsite, without making the
local filesystem wait (thus, better than gmirror by far).  They are
supposed to work fine if you accumulate hours of backlog during the
day, then catch up overnight.

  * Linux LVM2
    multiple levels of snapshot.  not sure they can be writeable
    though.  (snapshot sketch after this list)

    + drbd.org - replicate volumes to a remote site.  not sure how
      integrated it is with LVM2 though, maybe not at all.

  * ZFS zvols
    multiple levels, can be writeable

    + zfs send/recv for replication (sketched after this list).  good
      for replication, bad for stored backups!

  * vendor storage (EMC Hitachi NetApp)

    they can all do it, not sure all the quirks

    some ship kits you can install in windows crap to ``quiesce'' the
    filesystems or SQL Server stores.  it is supposed to be
    crash-consistent on its own, always, in case you actually did
    crash, but I guess NTFS and SQL Server are goofy and don't meet
    this promise, so there is a whole mess of FUD and confusing terms
    and modules to buy.

  * Sun/Storagetek AVS ii (instant image)

    this isn't a proper snapshot because you only get 2 layers, only
    ``current'' and ``snap''.  There is a ``bitmap'' volume to mark
    which blocks are dirty.  no trees of snapshots.

    ii is a key part of AVS block-layer replication.  only with ii is
    it possible to safely fail back from the secondary to the primary.

    + AVS (availability suite, a.k.a. sun cluster geographic edition,
      maybe other names) is like drbd

      AVS is old and mature and can do things like put multiple
      volumes into consistency groups that share a single timeline.
      If you have a database that uses multiple volumes at once, or
      if you are mirroring underneath a RAID/raidz layer, then you
      need to put all the related volumes into a consistency group or
      else there's no longer such a thing as crash-consistency.  If
      you think about it this makes sense.
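
To make the ZFS and LVM2 entries above concrete, the
snapshot-and-replicate cycle looks roughly like this (pool,
volume-group, and host names are all made up):

  # ZFS: snapshot, then replicate incrementally to the offsite box
  zfs snapshot tank/data@monday
  zfs send tank/data@monday | ssh offsite zfs recv backup/data
  zfs snapshot tank/data@tuesday
  zfs send -i tank/data@monday tank/data@tuesday | ssh offsite zfs recv backup/data

  # LVM2: crash-consistent block-level snapshot of a live volume,
  # copied off at leisure while the origin stays mounted
  lvcreate --snapshot --size 5G --name data-snap /dev/vg0/data
  dd if=/dev/vg0/data-snap of=/backup/data.img bs=1M
  lvremove -f /dev/vg0/data-snap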
