[nycbug-talk] Approach for the NFS cluster
Brian Cully
bcully at gmail.com
Wed May 20 17:43:51 EDT 2009
On 19-May-2009, at 23:31, Miles Nordin wrote:
> Second, Oracle RAC is state-of-the-art for expensive databases, and it
> does not scale anything close to linearly. CouchDB, Hadoop, and the
> next round of clustered filesystems, I've the impression, scale much
> closer to linearly and might be far more interesting in the long run.
I was thinking of CouchDB specifically, but the reasons it can do what
it does apply to the larger context. The reason POSIX will
never be able to support what people want is that what people want
is impossible. The reason CouchDB works is that the application
programmer has to take on the duty of handling things like aborted
transactions. You simply cannot solve the problem in a generic sense
and POSIX has no facilities for communicating the problem or its
possible solutions. You're certainly not going to get a drop-in POSIX
FS replacement with ACID qualities. If you want that stuff, you have
to write the code in your own app; nowhere near the kernel is an
appropriate place for it.
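To make "the application takes on the duty" concrete, here is a minimal sketch in Python. It is not CouchDB's API: RevStore, Conflict and add_to_counter are hypothetical names, and the store is a toy in-memory dict. The only idea borrowed from CouchDB is that a write based on a stale revision is refused rather than merged, and the caller decides what to do next.

```python
class Conflict(Exception):
    """Raised when a write is based on a stale revision."""


class RevStore:
    """Toy in-memory document store with per-document revision numbers."""

    def __init__(self):
        self._docs = {}  # key -> (rev, value)

    def get(self, key):
        # Unknown keys read as revision 0 with no value.
        return self._docs.get(key, (0, None))

    def put(self, key, based_on_rev, value):
        current_rev, _ = self._docs.get(key, (0, None))
        if based_on_rev != current_rev:
            # The store refuses the write; it does not guess how to merge.
            raise Conflict(key)
        self._docs[key] = (current_rev + 1, value)
        return current_rev + 1


def add_to_counter(store, key, amount, max_tries=5):
    """Application-level policy: re-read, recompute, try again."""
    for _ in range(max_tries):
        rev, value = store.get(key)
        try:
            return store.put(key, rev, (value or 0) + amount)
        except Conflict:
            continue
    raise Conflict(key)
```

Retrying is only one possible policy, and it is only correct here because adding to a counter can safely be recomputed from fresh data. Something like write(2) has no way to report "your view was stale", let alone to ask the caller which policy it wants.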
> Third, you can use BDB as a backing-store (a ``storage brick'') for
> GlusterFS. I think the idea is, for tiny files, but not having
> actually used it I am not sure if it really does make sense for
> certain workloads, or if it's something they implemented as a toy.
> But it is proof-of-concept: POSIX filesystem inside a database. It
> may be a pointless exercise, but I'm not sure it's difficult or
> out-of-reach exercise.
You can put a POSIX FS in a DB pretty trivially, but the reverse is
not true. Using a DB (with ACID qualities) requires entirely new APIs
which are strictly more powerful than what you get out of POSIX.
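As a rough illustration of the "FS in a DB" direction, here is a sketch using Python's standard sqlite3 module. DbFiles and the files table are made-up names, and it covers only whole-file write, read and rename, with no directories, permissions or file descriptors. It is nowhere near a real POSIX layer; the point is only that the mapping is mechanical.

```python
import sqlite3


class DbFiles:
    """A few file-like operations stored as rows in one table."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS files ("
            "path TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )

    def write(self, path, data):
        # 'with conn' commits on success and rolls back on an exception.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO files (path, data) VALUES (?, ?)",
                (path, data),
            )

    def read(self, path):
        row = self.conn.execute(
            "SELECT data FROM files WHERE path = ?", (path,)
        ).fetchone()
        if row is None:
            raise FileNotFoundError(path)
        return bytes(row[0])

    def rename(self, old, new):
        # Replaces any existing row at 'new', like rename(2) over a file.
        with self.conn:
            cur = self.conn.execute(
                "UPDATE OR REPLACE files SET path = ? WHERE path = ?",
                (new, old),
            )
            if cur.rowcount == 0:
                raise FileNotFoundError(old)
```

Usage would look like DbFiles(sqlite3.connect(":memory:")).write("/etc/motd", b"hello"). Going the other way you get stuck quickly: the database can wrap several statements in one transaction and report that it aborted, and there is no POSIX call to express either half of that.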
The fundamental problem here is that in an HA setup you typically
have N nodes talking to the same set of M discs. Suddenly, time is a
strictly localized concept and many of the operations you took for
granted are no longer possible in the general sense. What do you do
when two nodes try to write to the same part of disc at the same time?
Who wins? What if the overwritten data is part of a complex structure
and thus locking any individual set of blocks may still not prevent
corruption? How do you tell the other node to go fuck itself? How can
you prevent that kind of access in light of the many and varied types
of topology splits that have opened before you?
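The "complex structure" case can be shown with a deterministic toy simulation in Python. Everything here is hypothetical: Disc is a dict of blocks, block 0 holds an entry count, and blocks 1..count hold the entries. Each single block read or write is atomic, which is about the best per-block locking could give you, and two uncoordinated nodes still lose data.

```python
class Disc:
    """Shared storage: block number -> contents. Block 0 is the entry count."""

    def __init__(self):
        self.blocks = {0: 0}

    def read(self, n):
        return self.blocks.get(n)

    def write(self, n, value):
        self.blocks[n] = value


def append_steps(disc, entry):
    """One node appending an entry, pausing after each block operation."""
    count = disc.read(0)           # step 1: read the header
    yield
    disc.write(count + 1, entry)   # step 2: write the new entry block
    yield
    disc.write(0, count + 1)       # step 3: update the header


def run_interleaved(*appends):
    """Advance each node one block operation at a time, round-robin."""
    pending = list(appends)
    while pending:
        for steps in list(pending):
            try:
                next(steps)
            except StopIteration:
                pending.remove(steps)


disc = Disc()
run_interleaved(append_steps(disc, "from node A"),
                append_steps(disc, "from node B"))
print(disc.blocks)
```

Both nodes read a count of 0, both write block 1, and both set the count to 1, so node A's entry is silently overwritten. Neither node sees an error, and no single block operation was torn. Preventing this means coordinating the whole read-modify-write across nodes, which is exactly where the partition questions start.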
-bjc