[nycbug-talk] Approach for the NFS cluster

Wed May 20 17:43:51 EDT 2009

On 19-May-2009, at 23:31, Miles Nordin wrote:
> Second, Oracle RAC is state-of-the-art for expensive databases, and it
> does not scale anything close to linearly.  CouchDB, Hadoop, and the
> next round of clustered filesystems, I've the impression, scale much
> closer to linearly and might be far more interesting in the long run.

	I was thinking of CouchDB specifically, but the reasons it can do what
it does apply to the larger context. The reason POSIX will
never be able to support what people want is because what people want  
is impossible. The reason CouchDB works is because the application  
programmer has to take on the duty of handling things like aborted  
transactions. You simply cannot solve the problem in a generic sense  
and POSIX has no facilities for communicating the problem or its  
possible solutions. You're certainly not going to get a drop-in POSIX  
FS replacement with ACID qualities. If you want that stuff, you have
to write the code in your own app; nowhere near the kernel is an
appropriate place for it.
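
	To make "the application has to handle aborted transactions"
concrete, here is a minimal sketch of revision-checked writes with an
application-level retry. It is a toy in-memory store, not the CouchDB
API: Store, Conflict and add_to_counter are hypothetical names made up
for illustration.

```python
class Conflict(Exception):
    """Raised when a write is based on a revision that is no longer current."""


class Store:
    """Toy in-memory document store with per-document revisions (hypothetical)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (revision, value)

    def get(self, doc_id):
        # A document that was never written has revision 0 and no value.
        return self._docs.get(doc_id, (0, None))

    def put(self, doc_id, expected_rev, value):
        current_rev, _ = self.get(doc_id)
        if current_rev != expected_rev:
            # The store refuses the write. It does not guess how to merge.
            raise Conflict(doc_id)
        self._docs[doc_id] = (current_rev + 1, value)
        return current_rev + 1


def add_to_counter(store, doc_id, amount, max_tries=5):
    """Application-level policy: on conflict, re-read and re-apply the change."""
    for _ in range(max_tries):
        rev, value = store.get(doc_id)
        try:
            return store.put(doc_id, rev, (value or 0) + amount)
        except Conflict:
            continue  # someone else won the race; try again on fresh data
    raise Conflict(doc_id)
```

The store only detects the conflict. What to do about it (retry, merge,
give up) lives in add_to_counter, which is app code, and a different
app would want a different policy. There is no POSIX call that could
hand that decision back to the caller.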

> Third, you can use BDB as a backing-store (a ``storage brick'') for
> GlusterFS.  I think the idea is, for tiny files, but not having
> actually used it I am not sure if it really does make sense for
> certain workloads, or if it's something they implemented as a toy.
> But it is proof-of-concept: POSIX filesystem inside a database.  It
> may be a pointless exercise, but I'm not sure it's difficult or
> out-of-reach exercise.

	You can put a POSIX FS in a DB pretty trivially, but the reverse is  
not true. Using a DB (with ACID qualities) requires entirely new APIs  
which are strictly more powerful than what you get out of POSIX.
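
	As a rough illustration of the easy direction, here is a sketch of
a few file-like operations layered on SQLite through Python's sqlite3
module. The table layout and the function names (open_fs, write_file,
read_file, rename) are assumptions made up for this example; it has
nothing to do with how the GlusterFS BDB backend is actually built.

```python
import sqlite3


def open_fs(db_path=":memory:"):
    """Open a toy 'filesystem': one table mapping path -> file contents."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        "path TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    return conn


def write_file(conn, path, data):
    # 'with conn' commits on success and rolls back on an exception.
    with conn:
        conn.execute(
            "INSERT OR REPLACE INTO files (path, data) VALUES (?, ?)",
            (path, data),
        )


def read_file(conn, path):
    row = conn.execute(
        "SELECT data FROM files WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        raise FileNotFoundError(path)
    return bytes(row[0])


def rename(conn, old, new):
    # Replace 'new' with 'old' in one transaction. If 'old' does not
    # exist, the exception rolls back the DELETE and 'new' is untouched.
    with conn:
        conn.execute("DELETE FROM files WHERE path = ?", (new,))
        cur = conn.execute(
            "UPDATE files SET path = ? WHERE path = ?", (new, old)
        )
        if cur.rowcount == 0:
            raise FileNotFoundError(old)
```

Every operation here is a small transaction the DB already knows how
to do. Going the other way, there is no POSIX call that means "commit
these three writes together or none of them".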

	The fundamental problem here is that in an HA setup you typically  
have N nodes talking to the same set of M discs. Suddenly, time is a  
strictly localized concept and many of the operations you took for  
granted are no longer possible in the general sense. What do you do  
when two nodes try to write to the same part of disc at the same time?  
Who wins? What if the overwritten data is part of a complex structure  
and thus locking any individual set of blocks may still not prevent  
corruption? How do you tell the other node to go fuck itself? How can  
you prevent that kind of access in light of the many and varied types  
of topology splits that have opened before you?
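
	One common partial answer to "who wins" is a fencing token: the
storage refuses writes from any node holding an older token than the
newest one it has seen. The sketch below is a toy model of that idea
only, not something from this thread; LockService, FencedDisk and
StaleToken are hypothetical names, and it assumes both a single token
authority and storage smart enough to check tokens, which is a lot to
assume during a split.

```python
class StaleToken(Exception):
    """Raised when a node writes with a token older than one already seen."""


class LockService:
    """Hands out strictly increasing tokens (hypothetical single authority)."""

    def __init__(self):
        self._token = 0

    def acquire(self):
        self._token += 1
        return self._token


class FencedDisk:
    """Toy shared disk that remembers the highest token it has accepted."""

    def __init__(self):
        self.blocks = {}
        self.highest_token = 0

    def write(self, token, block_no, data):
        if token < self.highest_token:
            # An older lock holder woke up late; refuse instead of corrupting.
            raise StaleToken(token)
        self.highest_token = token
        self.blocks[block_no] = data
```

For example, if node A acquires token 1, stalls, and node B then
acquires token 2 and writes, a late write from A with token 1 is
rejected. Note what this does not cover: it orders writers, but it
says nothing about multi-block structures left half-updated by the
fenced node, and it just moves the split-brain problem into whatever
issues the tokens.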

-bjc