[nycbug-talk] Approach for the NFS cluster
Isaac Levy
isaac at diversaform.com
Wed May 20 02:16:45 EDT 2009
On May 19, 2009, at 6:20 PM, Miles Nordin wrote:
>>>>>> "il" == Isaac Levy <isaac at diversaform.com> writes:
>
> il> I was under the impression that disabling the ZIL was a
> il> developer debugging thing- it's dangerous, period.
>
> no, 100% incorrect.
OK- I'll concede being incorrect (and I learned something new here),
but I'd like to only concede 99% incorrect:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ZIL
Sun has tons of docs about using non-volatile RAM (a solid-state
drive) or a dedicated log drive to increase ZFS performance, but in
reading them, none speak clearly about disabling the ZIL.
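For reference, setting up the dedicated log device those docs describe
is a one-liner against the pool. A rough, untested sketch of scripting
it (the pool name and device path below are made up):

# Rough sketch (untested): attach a dedicated log device (slog) to a
# pool, per the Evil Tuning Guide, instead of disabling the ZIL outright.
# Pool name "tank" and the device path are invented for illustration.
import subprocess

POOL = "tank"
SLOG_DEVICE = "/dev/da2"   # hypothetical SSD / NVRAM-backed device

def add_slog(pool: str, device: str) -> None:
    """Add a separate intent-log device so synchronous writes land on
    fast stable storage instead of the main vdevs."""
    subprocess.run(["zpool", "add", pool, "log", device], check=True)

def show_pool(pool: str) -> None:
    """Print the pool layout; the log device shows up under 'logs'."""
    subprocess.run(["zpool", "status", pool], check=True)

if __name__ == "__main__":
    add_slog(POOL, SLOG_DEVICE)
    show_pool(POOL)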
> Disabling the ZIL does not increase likelihood of
> losing the pool at all. It does break the NFSv3 transaction/commit
> system if the server reboots (meaning: you lose recently written data,
> the client ``acts weird'' until you umount and remount the NFS shares
> involved), and also it breaks fsync() so it's not safe for
> filesystems-on-filesystems (databases, VM guest backing stores).
> However, I think the dirty truth is that most VMs suppress syncs to
> win performance and are unsafe to guest filesystems if the host
> reboots, with or without a ZIL. For databases and mail it obviously
> matters.
I can see that disabling the ZIL could yield better performance, but
even on ordinary filesystems (without filesystems-on-filesystems), I
can see potential for data loss- losing the box while files are being
written, or were recently written. So to me, disabling the ZIL
doesn't seem rational for most circumstances.
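To make my worry concrete, here's the pattern that stops meaning
anything with the ZIL off- a writer that fsync()s and only then tells
its client the data is safe (plain Python, nothing ZFS-specific, the
path is made up):

# Toy example: the fsync() contract a database or mail server relies on.
# With the ZIL disabled, fsync() returns success but the data may only
# reach disk at the next TXG, so a crash in that window loses an
# already-acknowledged write.
import os

def durable_append(path: str, record: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)
        os.fsync(fd)   # only after this returns may we ACK the client
    finally:
        os.close(fd)

durable_append("/var/db/example.log", b"txn 42 committed\n")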
>
>
> The way they explained it on ZFS list, is that the ZIL is always
> present in RAM, even when disabled, and is part of how the POSIX
> abstraction in ZFS is implemented (which is layered on top of the
> object store, a sibling of zvol's). Normally the ZIL is committed to
> the regular part of the disk, the bulk part, when each TXG commits
> every 30 seconds or so. When you call fsync(), or when NFSv3 commits
> or closes a file, the relevant part of the ZIL in RAM is flushed to
> disk. It's flushed to a separate special disk area that acts as a
> log, so it is write-only unless there is a crash. Eventually when the
> next TXG commits, the prior ZIL flush is superseded, and the blocks
> both in RAM and on disk are free for reuse. Often that special area
> is incorrectly called ``the ZIL'', and writing to it is what you
> disable. So disabling it doesn't endanger data written more than 30
> seconds before the crash.
Hrm. I'm chewing on this. To me, 30 seconds of lost files feels
funny- I can't think of an acceptable case for this happening (but I'm
not saying there is no case where it would be acceptable).
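To chew on it in code, this is the mental model I took away from your
explanation- a toy sketch of the in-RAM intent log, the fsync/NFS-commit
flush to the slog, and the ~30 second TXG commit that supersedes it.
Obviously nothing like the real implementation, just the bookkeeping:

# Toy model of the ZIL/TXG relationship as I understand it.  Not ZFS
# code; it only shows what a crash with the ZIL disabled can lose.
import time

class ToyPool:
    TXG_INTERVAL = 30          # seconds, roughly

    def __init__(self):
        self.in_ram_log = []   # the ZIL always lives here, even if "disabled"
        self.slog_on_disk = [] # the separate log device people call "the ZIL"
        self.main_pool = []    # the bulk of the disk
        self.flushed = 0       # how much of the RAM log has hit the slog
        self.last_txg = time.time()
        self.zil_disabled = False

    def write(self, record):
        self.in_ram_log.append(record)
        self._maybe_commit_txg()

    def fsync(self):
        # fsync()/NFSv3 COMMIT: flush the RAM log to the slog so the write
        # survives a crash before the next TXG.  Disabling the ZIL skips this.
        if not self.zil_disabled:
            self.slog_on_disk.extend(self.in_ram_log[self.flushed:])
            self.flushed = len(self.in_ram_log)

    def _maybe_commit_txg(self):
        if time.time() - self.last_txg >= self.TXG_INTERVAL:
            # TXG commit: data reaches the main pool; the slog copy is now
            # superseded and its blocks can be reused.
            self.main_pool.extend(self.in_ram_log)
            self.in_ram_log.clear()
            self.slog_on_disk.clear()
            self.flushed = 0
            self.last_txg = time.time()

    def crash(self):
        # Only the main pool and the slog survive; with the ZIL disabled,
        # anything newer than the last TXG (up to ~30s of fsync'd data) is gone.
        return self.main_pool + self.slog_on_disk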
>
>
> but it does break fsync() so you shouldn't do it just for fun. Also I
> think it's a global setting, not per-filesystem, which kind of blows
> chunks.
>
> il> I really feel that some Ggate-ish thing could be written for
> il> the Geom subsystem which allowed for multiple writes? Or
> il> something which did writes according to some transactional
> il> model- (locking files, etc...)
>
> There are two kinds of multiple writers. The first kind is at the SCSI layer.
>
> The better Linux iSCSI targets (like SCST, which I haven't used yet)
> support multiple initiators. This is a SCSI term. When you activate
> an SCST target, a blob of SCST springs to life in the kernel
> intercepting ALL scsi commands headed toward the disk. Applications
> and filesystems running on the same box as SCST represent one
> initiator and get routed through this blob. The actual iSCSI
> initiators out on the network become the second and further
> initiators. so, even if you have only one iSCSI initiator hitting
> your target, SCST multiple-initiator features are active.
>
> There are many multi-initiator features in the SCSI standard. I don't
> completely understand any of them.
>
> One is the reservation protocol, which can be used as a sort of
> heartbeat. However since a smart and physically-sealed device is
> managing the heartbeat rather than a mess of cabling and switches,
> split-brain is probably less likely when all nodes are checking in
> with one of their disks rather than pinging each other over a network.
> The disk then becomes single point of failure so then you need a
> quorum of 3 disks.
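That quorum-of-3 point finally clicked for me. Here's how I picture
it, as an untested sketch- plain files stand in for the shared disks
since I'm not modeling real SCSI reservations, and the paths are
invented:

# Sketch of the quorum idea: each node stamps a heartbeat onto the
# shared disks and only considers itself in the cluster if a majority
# of the 3 disks accepted the write.  Files stand in for shared LUNs.
import os, time

QUORUM_DISKS = ["/dev/quorum0", "/dev/quorum1", "/dev/quorum2"]  # hypothetical
NODE_ID = "node-a"

def stamp(disk_path: str) -> bool:
    try:
        fd = os.open(disk_path, os.O_WRONLY)
        try:
            os.write(fd, f"{NODE_ID} {time.time()}\n".encode())
            os.fsync(fd)
            return True
        finally:
            os.close(fd)
    except OSError:
        return False

def have_quorum() -> bool:
    ok = sum(stamp(d) for d in QUORUM_DISKS)
    return ok > len(QUORUM_DISKS) // 2   # majority: 2 of 3

if __name__ == "__main__":
    if not have_quorum():
        print("lost quorum; fence myself / stop serving")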
>
> I think maybe the reservation protocol can also block access to the
> disk, but I'm not sure. That is not its most important feature.
> Sun's cluster stuff has bits in the host driver stack to block access
> to a disk when the cluster node isn't active and isn't supposed to
> have access, and I suspect it can work on slice/partition level.
>
> A second kind of multi-initiator feature is to clean up all the
> standards-baggage of extra seldom-used SCSI features. For example if
> one node turns on the write cache and sets a read-ahead threshold,
> the change will affect all nodes, but the other nodes won't know about
> it. SCSI has some feature to broadcast changes to mode pages.
>
> A third multi-initiator feature is to actually support reads and
> writes from multiple hosts. SCST claims they re-implement TCQ in
> their multi-initiator blob, to support SATA disks which have either no
> exposed queues or queue command sets which don't work with multiple
> initiators.
>
> but SEPARATE FROM ALL THIS MULTI-INITIATOR STUFF, is the second kind
> of multiple writer. Two hosts being able to write to the disk won't
> help you with any traditional filesystem, including ZFS. You can't
> mount a filesystem from two hosts over the same block device. This is
> certain---it won't work without completely rearchitecting the
> filesystem. Filesystems aren't designed to accept input from
> underneath them.
>
> It's possible to envision such a filesystem. For example, the
> Berkeley/Sleepycat/Oracle BDB library can have multiple processes open
> the same database. But they cheat! They have shared memory regions
> so the multiple processes communicate with each other directly, and a
> huge chunk of their complexity is to support this feature (since it's
> often why programmers turn to the library in the first place).
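Right- and the cheat only works because the writers share a kernel. A
toy illustration: multiple local processes can share one file safely
by coordinating through a lock, which two hosts writing the same LUN
have no equivalent of:

# Toy illustration of the "cheat": multiple processes on ONE host can
# share a database file safely because they coordinate through the same
# kernel (here a simple flock).  Two hosts writing the same block device
# have no such channel, which is why ordinary filesystems can't be
# dual-mounted.
import fcntl, os

def append_record(path: str, record: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # serialize writers on this host
        os.write(fd, record)
        os.fsync(fd)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

append_record("/tmp/shared.db", b"row from pid %d\n" % os.getpid())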
>
> And it's been done with filesystems, too. RedHat GFS, Oracle OCFS,
> Sun QFS, all work this way, and all of them also cheat: you need a
> ``metadata'' node which has direct and exclusive access to a little
> bit of local disk (the metadata node might be a single-mounter
> active/passive HA cluster like we were talking about before). The point
> of these filesystems isn't so much availability as switching density.
> It's not that you want an active/active cluster so you can feel
> better---it's that there's so much filesystem traffic it can't be
> funneled through a single CPU. When clients open big files, yes,
> granted, the opens still get funneled through the single metadata
> server, but all the other bulk access to the meat inside the files can go
> straight to the disks. It's not so much for clustered servers serving
> non-cluster clients as for when EVERYTHING is part of the cluster.
>
> The incentive to design filesystems this way, having clients use the
> SCSI multi-initiator features directly, is the possibility of
> extremely high-bandwidth high-density high-price FC-SW storage from
> EMC/Hitachi/NetApp. Keeping both nodes in an HA cluster active is
> only slightly helpful, because you still need to handle your work on 1
> node, and because 2 is only a little bigger than 1. But if you can
> move more of your work into the cluster, so only the interconnect is
> shared and not the kernel image, it's possible to grow much further,
> with the interconnect joining:
>
> [black box] storage system
> 2 metadata nodes
> n work nodes
>
> The current generation of split metadata/data systems (Google FS,
> pNFS, GlusterFS, Lustre) uses filesystems rather than SCSI devices as
> the data backing store, so now the interconnect joins:
>
> m storage nodes
> 2 metadata nodes
> n work nodes
>
> and you do not use SCSI-2 style multi-initiator at all, except maybe
> on the 2 metadata nodes. All (not sure for GoogleFS but all others)
> have separate metadata servers like GFS/OCFS/QFS. The difference is
> that the data part is also a PeeCee with another filesystem like ext4
> or ZFS between the disks and the clients, instead of disks directly.
> I think this approach has got the future nailed down, especially if
> the industry manages to deliver lossless fabric between PeeCees like
> infiniband or CEE, and especially because everyone seems to say GFS
> and OCFS don't work.
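The shape of those split designs, as I read it, is roughly this- a
totally schematic, in-memory toy with made-up classes, no real
protocol:

# Schematic of the metadata/data split (QFS/OCFS-style, and the newer
# GoogleFS/pNFS/Lustre generation): clients ask a small metadata service
# where the bytes live, then move the bulk data straight to/from the
# storage nodes, so the metadata box only carries the small lookups.
# Everything here is an in-memory toy; no real protocols or APIs.

class StorageNode:
    def __init__(self):
        self.blocks = {}                    # block_id -> bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

class MetadataServer:
    def __init__(self, nodes):
        self.nodes = nodes
        self.files = {}                     # path -> [(node_index, block_id)]

    def allocate(self, path, nblocks):
        # the only thing funneled through this box: small placement decisions
        layout = [(i % len(self.nodes), (path, i)) for i in range(nblocks)]
        self.files[path] = layout
        return layout

    def layout(self, path):
        return self.files[path]

def client_write(md, nodes, path, chunks):
    for (node_idx, block_id), chunk in zip(md.allocate(path, len(chunks)), chunks):
        nodes[node_idx].write(block_id, chunk)   # bulk data bypasses the metadata server

def client_read(md, nodes, path):
    return b"".join(nodes[i].read(b) for i, b in md.layout(path))

nodes = [StorageNode() for _ in range(3)]
md = MetadataServer(nodes)
client_write(md, nodes, "/bigfile", [b"aaa", b"bbb", b"ccc"])
assert client_read(md, nodes, "/bigfile") == b"aaabbbccc"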
>
> I think QFS does work though. It's very old. And Sun has been
> mumbling about ``emancipating'' it.
>
> http://www.auc.edu.au/myfiles/uploads/Conference/Presentations%202007/Duncan_Ian.pdf
> http://www.afp548.com/filemgmt/visit.php?lid=64&ei=nyMTSsuQKJiG8gTmq8iBBA
> http://wikis.sun.com/display/SAMQFS/Home
>
> il> something which did writes according to some transactional
> il> model- (locking files, etc...)
>
> I've heard of many ways to take ``crash consistent'' backups at block
> layer. Without unmounting the filesystem, you can back it up while
> the active node is still using it, with no cooperation from the
> filesystem. There are also storage layers that use this philosophy to
> make live backups: they watch the filesystem do its work and
> replicate it asynchronously over a slow connection offsite without
> making the local filesystem wait (thus, better than gmirror by far).
> They are supposed to work fine if you accumulate hours of backlog
> during the day, then catch up overnight.
>
> * Linux LVM2
> multiple levels of snapshot. not sure they can be writeable
> though.
>
> + drbd.org - replicate volumes to a remote site. not sure how
> integrated it is with LVM2 though, maybe not at all.
>
> * ZFS zvol's
> multiple levels, can be writeable
>
> + zfs send/recv for replication. good for replication, bad for
> stored backups!
>
> * vendor storage (EMC Hitachi NetApp)
>
> they can all do it; not sure about all the quirks
>
> some ship kits you can install in Windows crap to ``quiesce'' the
> filesystems or SQL Server stores. It is supposed to be
> crash-consistent on its own, always, in case you actually did
> crash, but I guess NTFS and SQL Server are goofy and don't meet
> this promise, so there is a whole mess of FUD and confusing terms
> and modules to buy.
>
> * Sun/Storagetek AVS ii (instant image)
>
> this isn't a proper snapshot because you only get 2 layers,
> ``current'' and ``snap''. There is a ``bitmap'' volume to mark
> which blocks are dirty. no trees of snapshots.
>
> ii is a key part of AVS block-layer replication. only with ii is
> it possible to safely fail back from the secondary to the primary.
>
> + AVS (availability suite, a.k.a. sun cluster geographic edition,
> maybe other names) is like drbd
>
> AVS is old and mature and can do things like put multiple
> volumes into consistency groups that share a single timeline.
> If you have a database that uses multiple volumes at once, or
> if you are mirroring underneath a RAID/raidz layer, then you
> need to put all the related volumes into a consistency group or
> else there's no longer such a thing as crash-consistency. If
> you think about it this makes sense.
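On the zfs send/recv bullet above, I picture the asynchronous catch-up
pattern looking roughly like this- untested, with invented pool,
dataset, and host names:

# Rough sketch of asynchronous offsite replication with zfs send/recv:
# snapshot locally, ship the increment over ssh, and it's fine if the
# remote side lags hours behind.  All names below are invented.
import subprocess, time

DATASET = "tank/vols/nfs0"            # hypothetical zvol or filesystem
REMOTE = "backup.example.com"
REMOTE_DATASET = "backup/nfs0"

def snapshot(name: str) -> str:
    snap = f"{DATASET}@{name}"
    subprocess.run(["zfs", "snapshot", snap], check=True)
    return snap

def seed(snap: str) -> None:
    # first time only: full send of the baseline snapshot
    send = subprocess.Popen(["zfs", "send", snap], stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "zfs", "recv", REMOTE_DATASET],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    send.wait()

def replicate(prev_snap: str, new_snap: str) -> None:
    # incremental stream from prev_snap to new_snap, received on the far side
    send = subprocess.Popen(["zfs", "send", "-i", prev_snap, new_snap],
                            stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "zfs", "recv", "-F", REMOTE_DATASET],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    send.wait()

if __name__ == "__main__":
    prev = snapshot("rep-%d" % int(time.time()))
    seed(prev)
    time.sleep(3600)                  # accumulate an hour of changes
    cur = snapshot("rep-%d" % int(time.time()))
    replicate(prev, cur)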
Holy moses, thanks for the overview in this thread, Miles! I learned
20 new things today :)
Rocket-
.ike