[nycbug-talk] Approach for the NFS cluster

Miles Nordin carton at Ivy.NET
Wed May 20 18:50:43 EDT 2009


>>>>> "bc" == Brian Cully <bcully at gmail.com> writes:

    bc> CouchDB works is because the application programmer has to
    bc> take on the duty of handling things like aborted
    bc> transactions. You simply cannot solve the problem in a generic
    bc> sense and POSIX has no facilities for communicating the
    bc> problem or its possible solutions.

yeah I guess CouchDB is very unlike BDB.  I'm not ready to agree it's
impossible.  but I think in both cases it might be ~always silly.

    bc> 	You can put a POSIX FS in a DB pretty trivially, but
    bc> the reverse is not true. Using a DB (with ACID qualities)
    bc> requires entirely new APIs which are strictly more powerful
    bc> than what you get out of POSIX.

This doesn't make sense.  ~all existing databases run inside POSIX
filesystems.  Are you saying they're all broken?  In what way?

    bc> The fundamental problem here is that in an HA setup
    bc> you typically have N nodes talking to the same set of M
    bc> discs.

for HA NFS you typically have 1 node talking to M disks.  The 2nd node
is passive.  It should be possible to make a logically-equivalent HA
cluster using a robotic arm and a cold spare:

1. primary fails.
2. robot notices somehow.
3. robot unplugs primary's power cord.
4. robot moves all disks from primary chassis to secondary chassis.
5. robot plugs in secondary's power cord.
6. robot goes to sleep.

not much magic.
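
Roughly, if the robot were a script (a sketch only: the health check,
the fencing command, the pool name, and the rc invocations below are
stand-ins, not any particular platform's recipe):

    import subprocess, time

    def primary_alive():
        # step 2: the "robot notices somehow" part, here a crude ping of
        # the primary's heartbeat address (a made-up hostname)
        return subprocess.call(["ping", "-c", "1", "primary-hb"]) == 0

    def take_over():
        # step 3: fence the primary so it cannot keep writing (the cord yank)
        subprocess.check_call(["/usr/local/sbin/fence-primary"])   # stand-in fencing script
        # step 4: grab the shared disks (the robot moving them between chassis)
        subprocess.check_call(["zpool", "import", "-f", "tank"])   # "tank" is a stand-in pool
        # step 5: bring up the service address and start serving again
        subprocess.check_call(["ifconfig", "em0", "inet", "10.0.0.10/32", "alias"])
        subprocess.check_call(["/etc/rc.d/nfsd", "onestart"])

    while True:
        if not primary_alive():
            take_over()
            break
        time.sleep(5)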

Most of the magic is in the NFS protocol itself, which allows servers
to reboot, even cord-yank reboot, without clients noticing or losing
any data at all---in fact, if the implementation is really solid,
regular old NFS clients will safely regrab their POSIX advisory locks
across a server reboot.  SMB does not have any of this magic.  Thus
the HA part with NFS is relatively small, just to make the ``reboot''
step faster, and to add management so you have some idea when it's safe
to fail back.  It's bigger than just CARP, but it's not big like
Oracle, and not magical.
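
The regrab trick works because a rebooted server holds off new lock
requests for a grace period and honors only reclaims from previous
holders.  A toy model of that behavior (not the real NLM/NSM code, and
the 90-second window is just illustrative):

    import time

    GRACE_SECONDS = 90

    class LockServer:
        def __init__(self):
            self.boot_time = time.time()
            self.locks = {}                   # path -> client id

        def in_grace(self):
            return time.time() - self.boot_time < GRACE_SECONDS

        def lock(self, path, client, reclaim=False):
            if self.in_grace() and not reclaim:
                return "DENIED_GRACE_PERIOD"  # new locks wait out the grace period
            if path in self.locks and self.locks[path] != client:
                return "DENIED_CONFLICT"
            self.locks[path] = client
            return "GRANTED"

    srv = LockServer()                                        # freshly rebooted server
    print(srv.lock("/export/db", "clientA", reclaim=True))    # old holder regrabs: GRANTED
    print(srv.lock("/export/db", "clientB"))                  # newcomer: DENIED_GRACE_PERIOD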

It is worth doing, though, to create NFS-to-mysteryFileMesh gateways:
multiple gateways onto a single filesystem, so that a pool of clients
can mount a single unified filesystem even if their aggregate
bandwidth is too big to pass through a single CPU.  This is the point
of pNFS, of exporting regular NFS from multiple clustered Lustre
clients, and it seems to be the Isilon pitch if I read them right.
but doing this involves larger and different magic, at least two
layers of it: one layer to create this new fancy kind of filesystem
itself, and a second layer to make the multiple NFS servers present
themselves to clients cohesively, including moving lock custodianship
from one server to another when a client switches to a different
gateway.  You do not invoke said magic just for the availability
improvement the OP wanted; you do it to break through a performance
wall.
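
For a feel of that second layer (the coherent multi-gateway front
end), here is a toy of lock custodianship following a client between
gateways.  Everything here is made up for illustration; the shared
table is a plain dict where a real cluster runs its own distributed
service:

    class LockTable:
        def __init__(self):
            self.custodian = {}      # client id -> gateway holding its lock state

        def attach(self, client, gateway):
            old = self.custodian.get(client)
            if old and old is not gateway:
                # migrate the client's lock state to the new gateway
                gateway.locks[client] = old.locks.pop(client, set())
            self.custodian[client] = gateway

    class Gateway:
        def __init__(self, name, table):
            self.name, self.table = name, table
            self.locks = {}          # client id -> set of locked paths

        def lock(self, client, path):
            self.table.attach(client, self)
            self.locks.setdefault(client, set()).add(path)

    table = LockTable()
    gw1, gw2 = Gateway("gw1", table), Gateway("gw2", table)
    gw1.lock("clientA", "/shared/file")    # client locks through gw1
    gw2.lock("clientA", "/shared/other")   # client moves to gw2; its locks follow
    print(gw2.locks["clientA"])            # both locks now live at gw2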

    bc> What do you do when two nodes try to write to the same part of
    bc> disc at the same time?  Who wins?

the lockholder?
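
In POSIX terms that's just advisory locking: whoever gets the lock
writes, everyone else waits or is refused.  A minimal Python
illustration using the stock fcntl module (the path is made up):

    import fcntl, os

    fd = os.open("/shared/region.dat", os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # try to become the lockholder; raises OSError if someone else already is
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        os.write(fd, b"this writer won")
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)
        os.close(fd)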

    bc> What if the overwritten data is part of a complex structure
    bc> and thus locking any individual set of blocks may still not
    bc> prevent corruption?

The sentence unravels because ``all the blocks on the disk'' is also a
set of blocks.  You can safely update whatever you like---you just
don't scale close to linearly with any obvious approach.

I don't know what Oracle RAC does.  The filesystems like QFS, GFS,
OCFS attack this to a first order by storing only unstructured data on
the shared SCSI targets.  All the structured data (metadata) goes
through an active/passive HA pair.

The metadata server can also arbitrate locks and cache-leases, or
clients can run a distributed locking protocol amongst themselves.
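
In toy form, the arbitration half looks something like this (made-up
classes and a made-up 30-second lease; just the shape of the idea, not
how QFS/GFS/OCFS actually speak):

    import time

    class MetadataServer:
        LEASE_SECONDS = 30

        def __init__(self):
            self.leases = {}                  # file -> (client, expiry)

        def request_lease(self, client, file):
            holder = self.leases.get(file)
            if holder and holder[1] > time.time() and holder[0] != client:
                return None                   # someone else holds a live lease
            self.leases[file] = (client, time.time() + self.LEASE_SECONDS)
            return self.leases[file]

    mds = MetadataServer()
    print(mds.request_lease("clientA", "bigfile"))   # granted; clientA does its own block I/O
    print(mds.request_lease("clientB", "bigfile"))   # None until clientA's lease expires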