[nycbug-talk] Approach for the NFS cluster
Isaac Levy
isaac at diversaform.com
Tue May 19 15:46:48 EDT 2009
Steve and Miles both have the key points,
On May 19, 2009, at 1:23 PM, Miles Nordin wrote:
>>>>>> "sk" == Steven Kreuzer <skreuzer at exit2shell.com> writes:
>
> sk> This is an extremely oversimplified explanation of how you
> sk> could provide HA NFS.
>
> Except that I don't know of any actual
> clustering software built over ggate, and it's not something you roll
> yourself with shell scripts.
Actually, it kind of is something one can roll with shell scripts - I've
done it (at home, no real/heavy usage, just messing around).
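The basic building block is just ggated(8) on the box with the disks
and ggatec(8) on the head node - a minimal sketch, with hostnames and
device names made up:

    # on the storage box: let the head node use the disk read-write
    echo "head1 RW /dev/da1" > /etc/gg.exports
    ggated

    # on the head node: attach the remote disk, shows up as /dev/ggate0
    ggatec create -o rw storage1 /dev/da1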
Thing is, it sucked:
> The volume cannot be mounted on both
> nodes at the same time because obviously the filesystem doesn't
> support that,
This, in the end, is the big problem. I've shell-scripted the hell
out of this before - it sucked, (and not just my scripts - the act of
discovering -> mounting -> cleaning up is riddled with annoyance).
The takeover dance ends up looking roughly like the sketch below
(device and mountpoint names are made up, and the real scripts need
far more error handling):
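    # on the node taking over, once it decides it's master:
    ggatec create -o rw storage1 /dev/da1   # attach the shared disk
    fsck -p -t ufs /dev/ggate0              # pick up after the dead node
    mount /dev/ggate0 /export
    /etc/rc.d/mountd onestart               # bring NFS service back up
    /etc/rc.d/nfsd onestart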
Perhaps gmirror can be set up to trick the filesystems? Something
like mirroring a local disk against the other node's disk over ggate
(another sketch, not battle-tested):
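    gmirror load                             # make sure geom_mirror is in
    ggatec create -o rw othernode /dev/da1   # the remote half
    gmirror label -v shared /dev/da0 /dev/ggate0
    newfs /dev/mirror/shared

Though that only replicates the bits - it still doesn't let both nodes
mount the filesystem at the same time.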
> so, like other HA stuff, there has to be a heartbeat
> network connection or a SCSI reservation scheme or some such magic so
> the inactive node knows it's time to take over the storage,
> fsck/log-roll it, mount it, export the NFS. It's not like they can
> both be ready all the time, and CARP will decide which one gets the
> work---not possible.
Actually, when I asked this very question, Mickey said there were some
hooks in CARP somewhere to force master to flip, but I haven't
particularly hacked into it... This part definitely seems to be a
'read the source' kind of maneuver.
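The closest thing I've found in the shipping bits is preemption plus
advskew - with preempt on, you can demote the master by making its
advertisements worse (a sketch; interface names and skew values are
assumptions):

    # on both nodes, allow a better-skewed box to take over
    sysctl net.inet.carp.preempt=1
    # on the current master (backup running at advskew 100): demote it
    ifconfig carp0 advskew 240
    # or just yank it out of the election entirely
    ifconfig carp0 down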
> Also the active node has to notice if, for some
> reason, it has lost control by the rules of the heartbeat/reservation
> scheme even though it doesn't feel crashed, and in that case it should
> crash itself.
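That self-shooting bit can be as crude as a loop watching the
heartbeat link - a sketch (the heartbeat peer address and the panic
knob are assumptions, and you'd only run it while holding master):

    #!/bin/sh
    # if we can't reach the heartbeat peer 3 tries in a row, panic
    # the box rather than risk serving stale NFS state to clients
    fails=0
    while sleep 5; do
        if ping -c 1 -t 2 10.0.0.2 > /dev/null 2>&1; then
            fails=0
        else
            fails=$((fails + 1))
            [ "$fails" -ge 3 ] && sysctl debug.kdb.panic=1
        fi
    done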
>
> There may also be some app-specific magic in NFS. The feature that
> lets clients go through server reboots without losing any data, even
> on open files, should make it much easier to clusterify than SMB: on
> NFS this case is explicitly supported by, among other things, all the
> write caches in the server filesystem and disks are kept in duplicate
> in the clients so they can be re-rolled if the server crashes. But
> there may be some tricky corner cases the clustering software needs to
> handle. For example, on Solaris if using ZFS, you can ``disable the
> ZIL'' to improve NFS performance in the case where you're opening,
> writing, closing files frequently, but the cost of disabling is that
> you lose this stateless-server-reboot feature.
I was under the impression that disabling the ZIL was a developer
debugging thing - it's dangerous, period.
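For reference, the knob being talked about on Solaris of that era is
the zil_disable tunable, set system-wide - part of why it's so scary:

    # Solaris: turn off the ZFS intent log at next boot (dangerous!)
    echo 'set zfs:zil_disable = 1' >> /etc/system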
>
>
> sk> suggest you look at Isilon, NetApp and Sun,
>
> The solaris clustering stuff may actually be $0. I'm not sure though,
> never run it. The clustering stuff is not the same thing as the pNFS
> stuff. +1 on Steven's point that you can do this with regular NFS on
> the clients---only the servers need to be special. But they need to
> be pretty special. The old clusters used a SCSI chain with two host
> adapters, one at each end of the bus, so there's no external
> terminator (just the integrated terminator in the host adapters).
> These days probably you will need a SAS chassis with connections for
> two initiators, unless the ggate thing works, but there's a need to
> flush write buffers deterministically when told to for the NFS corner
> case, and some clusters use this SCSI-2 reservation command,
> so...shared storage is not so much this abstract modular good-enough
> blob.
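For the curious, on FreeBSD one can poke at SCSI-2 reservations by
hand with camcontrol(8)'s raw-CDB mode - a sketch, device name
assumed:

    # grab a SCSI-2 reservation on da0 (RESERVE(6) is opcode 0x16)
    camcontrol cmd da0 -v -c "16 00 00 00 00 00"
    # hand it back (RELEASE(6) is opcode 0x17)
    camcontrol cmd da0 -v -c "17 00 00 00 00 00"

Real cluster software obviously wraps a whole lot of policy around
that primitive.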
Data state. This is the big deal...
I really feel like some ggate-ish thing could be written for the GEOM
subsystem which allowed for multiple writers? Or something which did
writes according to some transactional model (locking files, etc...)
Hrm...
Rocket-
.ike