[nycbug-talk] Approach for the NFS cluster

Aleksandar Kacanski kacanski_s at yahoo.com
Tue May 19 17:28:55 EDT 2009


>>>>> "sk" == Steven Kreuzer <skreuzer at exit2shell.com> writes:


    sk> This is an extremely oversimplified explanation of how you
    sk> could provide HA NFS.

yeah, though this is coming from someone who's never done it, that
sounds like a good summary.  Except that I don't know of any actual
clustering software built over ggate, and it's not something you roll
yourself with shell scripts.  The volume cannot be mounted on both
nodes at the same time because obviously the filesystem doesn't
support that, so, like other HA stuff, there has to be a heartbeat
network connection or a SCSI reservation scheme or some such magic so
the inactive node knows it's time to take over the storage,
fsck/log-roll it, mount it, and export it over NFS.  It's not like
they can both be ready all the time with CARP deciding which one gets
the work; that's not possible.  Also the active node has to notice if,
for some
reason, it has lost control by the rules of the heartbeat/reservation
scheme even though it doesn't feel crashed, and in that case it should
crash itself.
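
To make the take-over ``magic'' a bit more concrete, here is roughly
the kind of loop the standby node would run.  This is just my own
Python pseudologic, assuming a heartbeat link and some reservation
scheme; the heartbeat/reservation checks are stubs, and the
fsck/mount/rc.d commands are only illustrative, not anything that
ships with ggate or CARP:

    # Rough sketch of a standby node's failover loop (not real cluster code).
    # The heartbeat and reservation checks are hypothetical stubs; real
    # clusters wire these to a dedicated heartbeat link, SCSI reservations,
    # STONITH hardware, or similar.

    import subprocess
    import sys
    import time

    HEARTBEAT_TIMEOUT = 10   # seconds of silence before we assume the peer died
    DISK = "/dev/da1"        # placeholder shared device
    MOUNTPOINT = "/export"   # placeholder export point

    def peer_heartbeat_age():
        """Hypothetical: seconds since the peer was last heard from."""
        raise NotImplementedError("wire this to your heartbeat link")

    def own_reservation():
        """Hypothetical: do we still hold the storage reservation?"""
        raise NotImplementedError("wire this to your reservation scheme")

    def take_over():
        # Take-over sequence: repair the filesystem (or roll its log),
        # mount it, and start serving NFS.  Fencing the peer off the disk
        # would have to happen before any of this.
        subprocess.check_call(["fsck", "-p", DISK])
        subprocess.check_call(["mount", DISK, MOUNTPOINT])
        subprocess.check_call(["/etc/rc.d/nfsd", "onestart"])

    def main():
        active = False
        while True:
            if not active and peer_heartbeat_age() > HEARTBEAT_TIMEOUT:
                take_over()
                active = True
            elif active and not own_reservation():
                # We don't feel crashed, but the rules say we lost the
                # storage: halt rather than risk two writers on one volume.
                sys.exit("lost storage ownership; halting")
            time.sleep(1)

    if __name__ == "__main__":
        main()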

There may also be some app-specific magic in NFS.  The feature that
lets clients go through server reboots without losing any data, even
on open files, should make it much easier to clusterify than SMB: on
NFS this case is explicitly supported because, among other things, the
writes cached in the server's filesystem and disks are kept duplicated
in the clients so they can be replayed if the server crashes.  But
there may be some tricky corner cases the clustering software needs to
handle.  For example, on Solaris if using ZFS, you can ``disable the
ZIL'' to improve NFS performance in the case where you're opening,
writing, closing files frequently, but the cost of disabling is that
you lose this stateless-server-reboot feature.
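
Concretely, that client-side duplication is the NFSv3
unstable-write/COMMIT dance: the client keeps its own copy of dirty
data until a COMMIT succeeds, and if the write verifier the server
returns changes (meaning the server rebooted and lost its cache), the
client re-sends whatever it was still holding.  A toy model of that
logic follows; it's not any real client's code, and the ``server''
object is just a stand-in:

    # Toy model of NFSv3 asynchronous writes and the commit/verifier check.
    # "server" is a stand-in object, not a real RPC client.

    class Nfs3FileWriter:
        def __init__(self, server, filehandle):
            self.server = server
            self.fh = filehandle
            self.uncommitted = []   # (offset, data) we may still have to re-send
            self.verifier = None    # write verifier from the server's last reply

        def write(self, offset, data):
            # UNSTABLE write: the server may be holding this only in RAM.
            verf = self.server.write_unstable(self.fh, offset, data)
            if self.verifier is not None and verf != self.verifier:
                # Verifier changed: the server rebooted and lost its cache,
                # so re-send everything we were still holding.
                self._replay()
            self.verifier = verf
            self.uncommitted.append((offset, data))

        def commit(self):
            verf = self.server.commit(self.fh)
            if verf != self.verifier:
                self._replay()
                verf = self.server.commit(self.fh)
            self.verifier = verf
            # Only now is it safe to drop our copy of the data.
            self.uncommitted = []

        def _replay(self):
            for offset, data in self.uncommitted:
                self.verifier = self.server.write_unstable(self.fh, offset, data)

Disabling the ZIL breaks exactly the guarantee this replay depends
on: COMMIT returns before the data is really on stable storage, so a
crash after the client has dropped its copy can still lose writes.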

    sk> suggest you look at Isilon, NetApp and Sun,

The Solaris clustering stuff may actually be $0.  I'm not sure,
though; I've never run it.  The clustering stuff is not the same
thing as the pNFS
stuff.  +1 on Steven's point that you can do this with regular NFS on
the clients---only the servers need to be special.  But they need to
be pretty special.  The old clusters used a SCSI chain with two host
adapters, one at each end of the bus, so there's no external
terminator (just the integrated terminator in the host adapters).
These days you will probably need a SAS chassis with connections for
two initiators, unless the ggate thing works; but there's a need to
flush write buffers deterministically when told to, for that NFS
corner case, and some clusters use the SCSI-2 reservation command,
so...shared storage is not just some abstract, modular, good-enough
blob.
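
For what it's worth, the SCSI-2 reservation behaviour those clusters
lean on is simple enough to model.  Here's a toy in-memory model (no
real I/O, device names, or CDBs) showing why the fenced-off node has
to back off, and why a bus reset is the classic weak spot compared to
SCSI-3 persistent reservations:

    # Toy model of SCSI-2 RESERVE/RELEASE semantics as used for fencing.
    # No real I/O here; it just shows what the loser of the race sees.

    class Scsi2Disk:
        RESERVATION_CONFLICT = 0x18   # status returned to the other initiator

        def __init__(self):
            self.reserved_by = None

        def reserve(self, initiator):
            if self.reserved_by not in (None, initiator):
                return self.RESERVATION_CONFLICT
            self.reserved_by = initiator
            return 0                  # GOOD status

        def io(self, initiator):
            if self.reserved_by not in (None, initiator):
                return self.RESERVATION_CONFLICT
            return 0

        def bus_reset(self):
            # A reset drops the reservation, which is the classic weakness
            # of SCSI-2 reservations versus SCSI-3 persistent reservations.
            self.reserved_by = None

    disk = Scsi2Disk()
    assert disk.reserve("node-a") == 0
    assert disk.reserve("node-b") == Scsi2Disk.RESERVATION_CONFLICT
    assert disk.io("node-b") == Scsi2Disk.RESERVATION_CONFLICT  # B is fenced off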

Miles,
pNFS provides protocol support for taking advantage of clustered
server deployments, including the ability to provide scalable
parallel access to files distributed among multiple servers, thus
removing the single point of failure.  The approach is different, but
the metadata is clustered.

From CITI - UM:
...enables direct client access to heterogeneous parallel file
systems.  Linux pNFS features a pluggable client architecture that
harnesses the potential of pNFS as a universal and scalable metadata
protocol by enabling dynamic support for layout format, storage
protocol, and file system policies.

Experiments with the Linux pNFS architecture demonstrate that using
the page cache inflicts an I/O performance penalty and that I/O
performance is highly subject to I/O transfer size.  In addition,
Linux pNFS can use bi-directional parallel I/O to raise data transfer
throughput between parallel file systems.
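
If it helps to picture the pNFS data path: the client asks the
metadata server for a striping layout (LAYOUTGET) and then talks to
the data servers directly, in parallel.  A toy sketch of that flow
with stand-in objects, not the Linux client's actual code:

    # Toy sketch of the pNFS read path: get a striping layout from the
    # metadata server, then read the stripes directly from the data servers.
    # The MetadataServer/DataServer objects are stand-ins, not a real client.

    from concurrent.futures import ThreadPoolExecutor

    class DataServer:
        def __init__(self, blocks):
            self.blocks = blocks      # stripe index -> bytes held by this server

        def read_stripe(self, index):
            return self.blocks[index]

    class MetadataServer:
        def __init__(self, layout):
            self.layout = layout      # ordered list of (data_server, stripe_index)

        def layoutget(self, filehandle):
            # Real pNFS hands back a layout via the LAYOUTGET operation;
            # here it's just a precomputed stripe map.
            return self.layout

    def pnfs_read(mds, filehandle):
        layout = mds.layoutget(filehandle)
        # The point of pNFS: data moves between client and data servers in
        # parallel, without funnelling through the metadata server.
        with ThreadPoolExecutor(max_workers=4) as pool:
            stripes = pool.map(lambda entry: entry[0].read_stripe(entry[1]), layout)
            return b"".join(stripes)

    ds1 = DataServer({0: b"hello ", 2: b"pNFS"})
    ds2 = DataServer({1: b"from "})
    mds = MetadataServer([(ds1, 0), (ds2, 1), (ds1, 2)])
    print(pnfs_read(mds, "fh0"))      # prints b'hello from pNFS'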

Either way, I will not be able to experiment with NFSv4.1 and pNFS on
the mission-critical systems, but DRBD on Linux seems like the right
thing at this time.


      


