[nycbug-talk] Approach for the NFS cluster

Isaac Levy isaac at diversaform.com
Tue May 19 15:46:48 EDT 2009

Steve and Miles both have the key points,

On May 19, 2009, at 1:23 PM, Miles Nordin wrote:

>>>>>> "sk" == Steven Kreuzer <skreuzer at exit2shell.com> writes:
>    sk> This is an extremely oversimplified explanation of how you
>    sk> could provide HA NFS.
> Except that I don't know of any actual
> clustering software built over ggate, and it's not something you roll
> yourself with shell scripts.

Actually, it kind of is something one can roll with shell scripts- I've  
done it, (at home, no real/heavy usage, just messing around).

Thing is, it sucked:

> The volume cannot be mounted on both
> nodes at the same time because obviously the filesystem doesn't
> support that,

This, in the end, is the big problem.  I've shell-scripted the hell  
out of this before- it sucked, (and not just my scripts- the whole  
discover -> mount -> clean-up cycle is riddled with annoyance).
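For the archives, the takeover sequence I kept re-scripting looked roughly like this- a sketch only, assuming the primary has been mirroring onto the backup's local disk; the hostname, device path, and the ping "heartbeat" are all made up for illustration:

```shell
#!/bin/sh
# Hypothetical takeover sketch for the backup node (FreeBSD).
# Assumes the primary was mirroring onto our local disk (e.g. via
# ggated/gmirror), so on failover we promote the local replica.
# PEER, DISK, and MNT are placeholders for your own setup.

PEER=primary
DISK=/dev/da1        # local replica of the shared data
MNT=/export

# 1. "Heartbeat": decide the primary is really dead.  A ping is
#    NOT a safe heartbeat- this is exactly the part that sucked.
ping -c 3 -t 5 "$PEER" > /dev/null 2>&1 && exit 0

# 2. The primary died mid-write, so the filesystem is dirty.
fsck -p "$DISK" || exit 1

# 3. Mount the replica and re-export it over NFS.
mount "$DISK" "$MNT"
/etc/rc.d/mountd onestart
/etc/rc.d/nfsd onestart
```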

Perhaps gmirror can be set up to trick the filesystems?
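(For anyone who hasn't seen the trick: the usual ggate + gmirror combination mirrors a local disk against a remote one- something like the following, where the hostnames, device names, and export entry are illustrative only:)

```shell
# On the storage node, export the raw disk with ggated.
# /etc/gg.exports (example entry- host and device are made up):
#   active.example.com RW /dev/da1
ggated

# On the active node, attach the remote disk and mirror it
# with the local one; gmirror then keeps the two in sync.
ggatec create -o rw storage.example.com /dev/da1   # prints e.g. ggate0
gmirror label -v shared /dev/da0 /dev/ggate0
newfs /dev/mirror/shared
mount /dev/mirror/shared /export
```

But only one node can have the filesystem mounted at a time, which is the whole problem.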

> so, like other HA stuff, there has to be a heartbeat
> network connection or a SCSI reservation scheme or some such magic so
> the inactive node knows it's time to take over the storage,
> fsck/log-roll it, mount it, export the NFS.  It's not like they can
> both be ready all the time, and CARP will decide which one gets the
> work---not possible.

Actually, when I asked this very question, Mickey said there were some  
hooks in CARP somewhere to force the master to flip, but I haven't  
particularly hacked into it...  This part definitely seems to be a  
'read the source' kind of maneuver.
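(Without reading the source, the blunt userland way to force a flip is just to bounce the carp interface, or demote it via advskew- the interface name and vhid below are made-up placeholders:)

```shell
# On the current master, drop the carp interface so the
# backup node takes over.  carp0 / vhid 1 are placeholders.
ifconfig carp0 down          # backup becomes master
ifconfig carp0 up            # rejoin as backup/master per advskew

# With preemption on, raising our advskew makes the peer
# (with its lower advskew) steal the master role:
sysctl net.inet.carp.preempt=1
ifconfig carp0 vhid 1 advskew 254
```

Whether there are cleaner in-kernel hooks than this, I can't say- hence 'read the source'.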

> Also the active node has to notice if, for some
> reason, it has lost control by the rules of the heartbeat/reservation
> scheme even though it doesn't feel crashed, and in that case it should
> crash itself.
> There may also be some app-specific magic in NFS.  The feature that
> lets clients go through server reboots without losing any data, even
> on open files, should make it much easier to clusterify than SMB: on
> NFS this case is explicitly supported by, among other things, all the
> write caches in the server filesystem and disks are kept in duplicate
> in the clients so they can be re-rolled if the server crashes.  But
> there may be some tricky corner cases the clustering software needs to
> handle.  For example, on Solaris if using ZFS, you can ``disable the
> ZIL'' to improve NFS performance in the case where you're opening,
> writing, closing files frequently, but the cost of disabling is that
> you lose this stateless-server-reboot feature.

I was under the impression that disabling the ZIL was a developer  
debugging thing- it's dangerous, period.
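(For reference, the knob being discussed was the old system-wide Solaris tunable- there was no supported per-filesystem switch at the time, which fits the "developer debugging" reading:)

```shell
# Old Solaris-wide ZIL kill switch.  Persistently, in /etc/system:
#   set zfs:zil_disable = 1
# Or live, via the kernel debugger (debugging only- synchronous
# write semantics, and thus NFS crash recovery, go out the window):
echo "zil_disable/W0t1" | mdb -kw
```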

>    sk> suggest you look at Isilon, NetApp and Sun,
> The solaris clustering stuff may actually be $0.  I'm not sure though,
> never run it.  The clustering stuff is not the same thing as the pNFS
> stuff.  +1 on Steven's point that you can do this with regular NFS on
> the clients---only the servers need to be special.  But they need to
> be pretty special.  The old clusters used a SCSI chain with two host
> adapters, one at each end of the bus, so there's no external
> terminator (just the integrated terminator in the host adapters).
> These days probably you will need a SAS chassis with connections for
> two initiators.  unless the ggate thing works, but there's a need to
> flush write buffers deterministically when told to for the NFS corner
> case, and some clusters use this SCSI-2 reservation command,
> so...shared storage is not so much this abstract modular good-enough
> blob.

Data state.  This is the big deal...

I really feel that some ggate-ish thing could be written for the GEOM  
subsystem which allowed for multiple writers?  Or something which did  
writes according to some transactional model- (locking files, etc...)


