[nycbug-talk] Approach for the NFS cluster
Isaac Levy
isaac at diversaform.com
Wed May 20 02:16:45 EDT 2009
On May 19, 2009, at 6:20 PM, Miles Nordin wrote:
>>>>>> "il" == Isaac Levy <isaac at diversaform.com> writes:
>
> il> I was under the impression that disabling the ZIL was a
> il> developer debugging thing- it's dangerous, period.
>
> no, 100% incorrect.
OK- I'll concede being incorrect (and I learned something new here),
but I'd like to only concede 99% incorrect:
http://www.solarisinternals.com/wiki/index.php/ZFS_Evil_Tuning_Guide#ZIL
Sun has tons of docs about using non-volatile RAM (a solid-state
drive) or a dedicated log drive to increase ZFS performance, but in
reading them, none speak clearly about disabling the ZIL.
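For reference, setting up the dedicated log device those docs describe
is a one-liner against the pool. A rough, untested sketch of scripting
it (the pool name and device path below are made up):

# Rough sketch (untested): attach a dedicated log device (slog) to a
# pool, per the Evil Tuning Guide, instead of disabling the ZIL outright.
# Pool name "tank" and the device path are invented for illustration.
import subprocess

POOL = "tank"
SLOG_DEVICE = "/dev/da2"   # hypothetical SSD / NVRAM-backed device

def add_slog(pool: str, device: str) -> None:
    """Add a separate intent-log device so synchronous writes land on
    fast stable storage instead of the main vdevs."""
    subprocess.run(["zpool", "add", pool, "log", device], check=True)

def show_pool(pool: str) -> None:
    """Print the pool layout; the log device shows up under 'logs'."""
    subprocess.run(["zpool", "status", pool], check=True)

if __name__ == "__main__":
    add_slog(POOL, SLOG_DEVICE)
    show_pool(POOL)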
> Disabling the ZIL does not increase likelihood of
> losing the pool at all. It does break the NFSv3 transaction/commit
> system if the server reboots (meaning: you lose recently written data,
> the client ``acts weird'' until you umount and remount the NFS shares
> involved), and also it breaks fsync() so it's not safe for
> filesystems-on-filesystems (databases, VM guest backing stores).
> However, I think the dirty truth is that most VMs suppress syncs to
> win performance and are unsafe to guest filesystems if the host
> reboots, with or without a ZIL. For databases and mail it obviously
> matters.
I can see that disabling the ZIL could yield better performance, but
even on ordinary filesystems (without filesystems-on-filesystems), I
can see potential for data loss- losing the box while files are being
written, or were recently written. So to me, disabling the ZIL
doesn't seem rational for most circumstances.
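To make my worry concrete, here's the pattern that stops meaning
anything with the ZIL off- a writer that fsync()s and only then tells
its client the data is safe (plain Python, nothing ZFS-specific, the
path is made up):

# Toy example: the fsync() contract a database or mail server relies on.
# With the ZIL disabled, fsync() returns success but the data may only
# reach disk at the next TXG, so a crash in that window loses an
# already-acknowledged write.
import os

def durable_append(path: str, record: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, record)
        os.fsync(fd)   # only after this returns may we ACK the client
    finally:
        os.close(fd)

durable_append("/var/db/example.log", b"txn 42 committed\n")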
>
>
> The way they explained it on ZFS list, is that the ZIL is always
> present in RAM, even when disabled, and is part of how the POSIX
> abstraction in ZFS is implemented (which is layered on top of the
> object store, a sibling of zvol's). Normally the ZIL is committed to
> the regular part of the disk, the bulk part, when each TXG commits
> every 30 seconds or so. When you call fsync(), or when NFSv3 commits
> or closes a file, the relevant part of the ZIL in RAM is flushed to
> disk. It's flushed to a separate special disk area that acts as a
> log, so it is write-only unless there is a crash. Eventually when the
> next TXG commits, the prior ZIL flush is superseded, and the blocks
> both in RAM and on disk are free for reuse. Often that special area
> is incorrectly called ``the ZIL'', and writing to it is what you
> disable. So disabling it doesn't endanger data written more than 30
> seconds before the crash.
Hrm. I'm chewing on this. To me, 30 seconds of lost files feels
funny- I can't think of an acceptable case for this happening (but I'm
not saying there is no case where it would be acceptable).
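To chew on it in code, this is the mental model I took away from your
explanation- a toy sketch of the in-RAM intent log, the fsync/NFS-commit
flush to the slog, and the ~30 second TXG commit that supersedes it.
Obviously nothing like the real implementation, just the bookkeeping:

# Toy model of the ZIL/TXG relationship as I understand it.  Not ZFS
# code; it only shows what a crash with the ZIL disabled can lose.
import time

class ToyPool:
    TXG_INTERVAL = 30          # seconds, roughly

    def __init__(self):
        self.in_ram_log = []   # the ZIL always lives here, even if "disabled"
        self.slog_on_disk = [] # the separate log device people call "the ZIL"
        self.main_pool = []    # the bulk of the disk
        self.flushed = 0       # how much of the RAM log has hit the slog
        self.last_txg = time.time()
        self.zil_disabled = False

    def write(self, record):
        self.in_ram_log.append(record)
        self._maybe_commit_txg()

    def fsync(self):
        # fsync()/NFSv3 COMMIT: flush the RAM log to the slog so the write
        # survives a crash before the next TXG.  Disabling the ZIL skips this.
        if not self.zil_disabled:
            self.slog_on_disk.extend(self.in_ram_log[self.flushed:])
            self.flushed = len(self.in_ram_log)

    def _maybe_commit_txg(self):
        if time.time() - self.last_txg >= self.TXG_INTERVAL:
            # TXG commit: data reaches the main pool; the slog copy is now
            # superseded and its blocks can be reused.
            self.main_pool.extend(self.in_ram_log)
            self.in_ram_log.clear()
            self.slog_on_disk.clear()
            self.flushed = 0
            self.last_txg = time.time()

    def crash(self):
        # Only the main pool and the slog survive; with the ZIL disabled,
        # anything newer than the last TXG (up to ~30s of fsync'd data) is gone.
        return self.main_pool + self.slog_on_disk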
>
>
> but it does break fsync() so you shouldn't do it just for fun. Also I
> think it's a global setting, not per-filesystem, which kind of blows
> chunks.
>
> il> I really feel that some Ggate-ish thing could be written for
> il> the Geom subsystem which allowed for multiple writes? Or
> il> something which did writes according to some transactional
> il> model- (locking files, etc...)
>
> There are two kinds of multiple writers. The first kind is at the SCSI layer.
>
> The better Linux iSCSI targets (like SCST, which I haven't used yet)
> support multiple initiators. This is a SCSI term. When you activate
> an SCST target, a blob of SCST springs to life in the kernel
> intercepting ALL scsi commands headed toward the disk. Applications
> and filesystems running on the same box as SCST represent one
> initiator and get routed through this blob. The actual iSCSI
> initiators out on the network become the second and further
> initiators. so, even if you have only one iSCSI initiator hitting
> your target, SCST multiple-initiator features are active.
>
> There are many multi-initiator features in the SCSI standard. I don't
> completely understand any of them.
>
> One is the reservation protocol, which can be used as a sort of
> heartbeat. However since a smart and physically-sealed device is
> managing the heartbeat rather than a mess of cabling and switches,
> split-brain is probably less likely when all nodes are checking in
> with one of their disks rather than pinging each other over a network.
> The disk then becomes single point of failure so then you need a
> quorum of 3 disks.
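That quorum-of-3 point finally clicked for me. Here's how I picture
it, as an untested sketch- plain files stand in for the shared disks
since I'm not modeling real SCSI reservations, and the paths are
invented:

# Sketch of the quorum idea: each node stamps a heartbeat onto the
# shared disks and only considers itself in the cluster if a majority
# of the 3 disks accepted the write.  Files stand in for shared LUNs.
import os, time

QUORUM_DISKS = ["/dev/quorum0", "/dev/quorum1", "/dev/quorum2"]  # hypothetical
NODE_ID = "node-a"

def stamp(disk_path: str) -> bool:
    try:
        fd = os.open(disk_path, os.O_WRONLY)
        try:
            os.write(fd, f"{NODE_ID} {time.time()}\n".encode())
            os.fsync(fd)
            return True
        finally:
            os.close(fd)
    except OSError:
        return False

def have_quorum() -> bool:
    ok = sum(stamp(d) for d in QUORUM_DISKS)
    return ok > len(QUORUM_DISKS) // 2   # majority: 2 of 3

if __name__ == "__main__":
    if not have_quorum():
        print("lost quorum; fence myself / stop serving")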
>
> I think maybe the reservation protocol can also block access to the
> disk, but I'm not sure. That is not its most important feature.
> Sun's cluster stuff has bits in the host driver stack to block access
> to a disk when the cluster node isn't active and isn't supposed to
> have access, and I suspect it can work on slice/partition level.
>
> A second kind of multi-initiator feature is to clean up all the
> standards-baggage of extra seldom-used SCSI features. For example if
> one node turns on the write cache and sets a read-ahead threshold,
> the change will affect all nodes, but the other nodes won't know about
> it. SCSI has some feature to broadcast changes to mode pages.
>
> A third multi-initiator feature is to actually support reads and
> writes from multiple hosts. SCST claims they re-implement TCQ in
> their multi-initiator blob, to support SATA disks which have either no
> exposed queues or queue command sets which don't work with multiple
> initiators.
>
> but SEPARATE FROM ALL THIS MULTI-INITIATOR STUFF, is the second kind
> of multiple writer. Two hosts being able to write to the disk won't
> help you with any traditional filesystem, including ZFS. You can't
> mount a filesystem from two hosts over the same block device. This is
> certain---it won't work without completely rearchitecting the
> filesystem. Filesystems aren't designed to accept input from
> underneath them.
>
> It's possible to envision such a filesystem. For example, the
> Berkeley/Sleepycat/Oracle BDB library can have multiple processes open
> the same database. But they cheat! They have shared memory regions
> so the multiple processes communicate with each other directly, and a
> huge chunk of their complexity is to support this feature (since it's
> often why programmers turn to the library in the first place).
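Right- and the cheat only works because the writers share a kernel. A
toy illustration: multiple local processes can share one file safely
by coordinating through a lock, which two hosts writing the same LUN
have no equivalent of:

# Toy illustration of the "cheat": multiple processes on ONE host can
# share a database file safely because they coordinate through the same
# kernel (here a simple flock).  Two hosts writing the same block device
# have no such channel, which is why ordinary filesystems can't be
# dual-mounted.
import fcntl, os

def append_record(path: str, record: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # serialize writers on this host
        os.write(fd, record)
        os.fsync(fd)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)

append_record("/tmp/shared.db", b"row from pid %d\n" % os.getpid())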
>
> And it's been done with filesystems, too. RedHat GFS, Oracle OCFS,
> Sun QFS, all work this way, and all of them also cheat: you need a
> ``metadata'' node which has direct and exclusive access to a little
> bit of local disk (the metadata node might be a single-mounter
> active/passive HA cluster like we were talking about before). The point
> of these filesystems isn't so much availability as switching density.
> It's not that you want an active/active cluster so you can feel
> better---it's that there's so much filesystem traffic it can't be
> funneled through a single CPU. When clients open big files, yes,
> granted, the opens still get funneled through the single metadata
> server, but all the other bulk access to the meat inside the files can go
> straight to the disks. It's not so much for clustered servers serving
> non-cluster clients as for when EVERYTHING is part of the cluster.
>
> The incentive to design filesystems this way, having clients use the
> SCSI multi-initiator features directly, is the possibility of
> extremely high-bandwidth high-density high-price FC-SW storage from
> EMC/Hitachi/NetApp. Keeping both nodes in an HA cluster active is
> only slightly helpful, because you still need to handle your work on 1
> node, and because 2 is only a little bigger than 1. But if you can
> move more of your work into the cluster, so only the interconnect is
> shared and not the kernel image, it's possible to grow much further,
> with the interconnect joining:
>
> [black box] storage system
> 2 metadata nodes
> n work nodes
>
> The current generation of split metadata/data systems (Google FS,
> pNFS, GlusterFS, Lustre) uses filesystems rather than SCSI devices as
> the data backing store, so now the interconnect joins:
>
> m storage nodes
> 2 metadata nodes
> n work nodes
>
> and you do not use SCSI-2 style multi-initiator at all, except maybe
> on the 2 metadata nodes. All (not sure for GoogleFS but all others)
> have separate metadata servers like GFS/OCFS/QFS. The difference is
> that the data part is also a PeeCee with another filesystem like ext4
> or ZFS between the disks and the clients, instead of disks directly.
> I think this approach has got the future nailed down, especially if
> the industry manages to deliver lossless fabric between PeeCees like
> infiniband or CEE, and especially because everyone seems to say GFS
> and OCFS don't work.
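The shape of those split designs, as I read it, is roughly this- a
totally schematic, in-memory toy with made-up classes, no real
protocol:

# Schematic of the metadata/data split (QFS/OCFS-style, and the newer
# GoogleFS/pNFS/Lustre generation): clients ask a small metadata service
# where the bytes live, then move the bulk data straight to/from the
# storage nodes, so the metadata box only carries the small lookups.
# Everything here is an in-memory toy; no real protocols or APIs.

class StorageNode:
    def __init__(self):
        self.blocks = {}                    # block_id -> bytes

    def write(self, block_id, data):
        self.blocks[block_id] = data

    def read(self, block_id):
        return self.blocks[block_id]

class MetadataServer:
    def __init__(self, nodes):
        self.nodes = nodes
        self.files = {}                     # path -> [(node_index, block_id)]

    def allocate(self, path, nblocks):
        # the only thing funneled through this box: small placement decisions
        layout = [(i % len(self.nodes), (path, i)) for i in range(nblocks)]
        self.files[path] = layout
        return layout

    def layout(self, path):
        return self.files[path]

def client_write(md, nodes, path, chunks):
    for (node_idx, block_id), chunk in zip(md.allocate(path, len(chunks)), chunks):
        nodes[node_idx].write(block_id, chunk)   # bulk data bypasses the metadata server

def client_read(md, nodes, path):
    return b"".join(nodes[i].read(b) for i, b in md.layout(path))

nodes = [StorageNode() for _ in range(3)]
md = MetadataServer(nodes)
client_write(md, nodes, "/bigfile", [b"aaa", b"bbb", b"ccc"])
assert client_read(md, nodes, "/bigfile") == b"aaabbbccc"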
>
> I think QFS does work though. It's very old. And Sun has been
> mumbling about ``emancipating'' it.
>
> http://www.auc.edu.au/myfiles/uploads/Conference/Presentations%202007/Duncan_Ian.pdf
> http://www.afp548.com/filemgmt/visit.php?lid=64&ei=nyMTSsuQKJiG8gTmq8iBBA
> http://wikis.sun.com/display/SAMQFS/Home
>
> il> something which did writes according to some transactional
> il> model- (locking files, etc...)
>
> I've heard of many ways to take ``crash consistent'' backups at block
> layer. Without unmounting the filesystem, you can back it up while
> the active node is still using it, with no cooperation from the
> filesystem. There are also storage layers that use this philosophy to
> make live backups: they watch the filesystem do its work and
> replicate it asynchronously over a slow connection offsite without
> making the local filesystem wait (thus, better than gmirror by far).
> They are supposed to work fine if you accumulate hours of backlog
> during the day, then catch up overnight.
>
> * Linux LVM2
> multiple levels of snapshot. not sure they can be writeable
> though.
>
> + drbd.org - replicate volumes to a remote site. not sure how
> integrated it is with LVM2 though, maybe not at all.
>
> * ZFS zvol's
> multiple levels, can be writeable
>
> + zfs send/recv for replication. good for replication, bad for
> stored backups!
>
> * vendor storage (EMC Hitachi NetApp)
>
> they can all do it; not sure about all the quirks
>
> some ship kits you can install in Windows crap to ``quiesce'' the
> filesystems or SQL Server stores. It is supposed to be
> crash-consistent on its own, always, in case you actually did
> crash, but I guess NTFS and SQL Server are goofy and don't meet
> this promise, so there is a whole mess of FUD and confusing terms
> and modules to buy.
>
> * Sun/Storagetek AVS ii (instant image)
>
> this isn't a proper snapshot because you only get 2 layers,
> ``current'' and ``snap''. There is a ``bitmap'' volume to mark
> which blocks are dirty. no trees of snapshots.
>
> ii is a key part of AVS block-layer replication. only with ii is
> it possible to safely fail back from the secondary to the primary.
>
> + AVS (availability suite, a.k.a. sun cluster geographic edition,
> maybe other names) is like drbd
>
> AVS is old and mature and can do things like put multiple
> volumes into consistency groups that share a single timeline.
> If you have a database that uses multiple volumes at once, or
> if you are mirroring underneath a RAID/raidz layer, then you
> need to put all the related volumes into a consistency group or
> else there's no longer such a thing as crash-consistency. If
> you think about it this makes sense.
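On the zfs send/recv bullet above, I picture the asynchronous catch-up
pattern looking roughly like this- untested, with invented pool,
dataset, and host names:

# Rough sketch of asynchronous offsite replication with zfs send/recv:
# snapshot locally, ship the increment over ssh, and it's fine if the
# remote side lags hours behind.  All names below are invented.
import subprocess, time

DATASET = "tank/vols/nfs0"            # hypothetical zvol or filesystem
REMOTE = "backup.example.com"
REMOTE_DATASET = "backup/nfs0"

def snapshot(name: str) -> str:
    snap = f"{DATASET}@{name}"
    subprocess.run(["zfs", "snapshot", snap], check=True)
    return snap

def seed(snap: str) -> None:
    # first time only: full send of the baseline snapshot
    send = subprocess.Popen(["zfs", "send", snap], stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "zfs", "recv", REMOTE_DATASET],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    send.wait()

def replicate(prev_snap: str, new_snap: str) -> None:
    # incremental stream from prev_snap to new_snap, received on the far side
    send = subprocess.Popen(["zfs", "send", "-i", prev_snap, new_snap],
                            stdout=subprocess.PIPE)
    subprocess.run(["ssh", REMOTE, "zfs", "recv", "-F", REMOTE_DATASET],
                   stdin=send.stdout, check=True)
    send.stdout.close()
    send.wait()

if __name__ == "__main__":
    prev = snapshot("rep-%d" % int(time.time()))
    seed(prev)
    time.sleep(3600)                  # accumulate an hour of changes
    cur = snapshot("rep-%d" % int(time.time()))
    replicate(prev, cur)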
Holy moses, thanks for the overview in this thread, Miles! I learned
20 new things today :)
Rocket-
.ike