[nycbug-talk] BSD Cluster Filesystem Roundup

Miles Nordin carton at Ivy.NET
Mon Feb 23 09:48:35 EST 2009


>>>>> "sk" == Steven Kreuzer <skreuzer at exit2shell.com> writes:

    sk> I am curious if anyone has played around with HAMMER

+1, also curious.

The way Dillon talked about HAMMER before its release made me really
suspicious: lots of kool-aid bragging about buzzwords and the sizes
of various bitfields.

The max size of a filesystem is limited by things like:

 * can you replace and resilver several disks at the same time?  If
   you have enough disks, you'll constantly be replacing one or two.

 * can you scrub/check/backup the filesystem in multiple threads?
   These operations are O(size) so you need to be able to scale their
   speed as size increases.

 * how does mount after an unclean shutdown scale in time and RAM
   consumption?  If it isn't close to constant, the mount will never
   finish at large sizes.  FreeBSD's O(n^2) ``background fsck'' is
   just a home-user gimmick when it comes to scaling to very large
   sizes.

 * serious MTTDL (mean time to data loss) projections.  For many
   plausible configurations, RAID5 is worse than a single drive, even
   on paper, and you can easily wind up with better-than-even odds of
   losing the filesystem within a year.  After six months drives are
   breaking all over the place, and you're scribbling away in your
   notepad on some Failure Analysis trying to explain why it wasn't
   the software's ``fault'' because it behaved the way you thought it
   would: each drive failed and returned latent sector errors more or
   less as the Google/NetApp papers predicted, and you excuse the
   filesystem layout by calling it ``our extremely bad luck.''

 * can you network-mount the filesystem in such a way that all the
   data does not have to pass through a single node?
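To put rough numbers on the RAID5 point, here's a back-of-the-envelope
sketch.  All the parameters (drive count, capacity, annual failure
rate, URE rate) are illustrative assumptions of mine, not measurements
from the Google/NetApp papers, and the model deliberately ignores
second-drive failures during rebuild, so it's optimistic:

```python
import math

# Back-of-the-envelope MTTDL-style estimate: chance of losing a RAID5
# array within a year, counting only unrecoverable read errors (UREs)
# hit while rebuilding after a single drive failure.

def p_rebuild_hits_ure(n_drives, capacity_bytes, ure_per_bit):
    """Probability that at least one URE occurs while reading the
    n-1 surviving drives end-to-end during a rebuild."""
    bits_read = (n_drives - 1) * capacity_bytes * 8
    return 1.0 - math.exp(-ure_per_bit * bits_read)

def p_array_loss_per_year(n_drives, capacity_bytes, afr, ure_per_bit):
    """Approximate yearly data-loss probability for classic RAID5:
    expected drive failures per year times the chance each rebuild
    trips over a latent sector error."""
    expected_failures = n_drives * afr
    return min(1.0, expected_failures *
               p_rebuild_hits_ure(n_drives, capacity_bytes, ure_per_bit))

# Illustrative numbers: 8 x 2 TB drives, 5% annual failure rate per
# drive, vendor-datasheet URE rate of 1e-14 per bit.
raid5 = p_array_loss_per_year(8, 2e12, 0.05, 1e-14)
single = 0.05   # a lone drive just loses data at its own AFR
print(f"RAID5 loss/year ~ {raid5:.2f}, single drive ~ {single:.2f}")
```

With those (made-up but plausible) inputs the array's yearly loss
probability comes out several times higher than the lone drive's,
which is the ``worse than a single drive, even on paper'' situation.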

When I saw him rant on for three pages about how he divided some
stupid bitfield, then mention that it's all ``distributed'' like some
new-fangled checked-feature-box buzzword, I thought: this guy is
another crusty BSD dinosaur who still does not get it.  He never
explained that distribution is a necessary feature for very large
filesystems, because the only thing that can scale read/write
bandwidth linearly with the increasing size is the backplane of a
network switch.  Given that ZFS understands some of these things but
doesn't really get it either, I've not much hope.
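The arithmetic behind the bandwidth point is simple: any O(size)
operation (scrub, check, backup) needs aggregate read bandwidth
proportional to the filesystem's size, and past some point no single
node's NIC can supply it.  A quick sketch, with illustrative numbers
of my own choosing:

```python
import math

# How many nodes does it take to scan a filesystem end-to-end within
# a fixed window?  Aggregate bandwidth must grow linearly with size,
# which is why only a switch backplane can keep up at the high end.

def nodes_needed(fs_bytes, window_seconds, node_bytes_per_sec):
    required = fs_bytes / window_seconds        # aggregate B/s needed
    return math.ceil(required / node_bytes_per_sec)

# Illustrative: weekly scrub over 10 GbE nodes (~1.25 GB/s each).
print(nodes_needed(1e15, 7 * 86400, 1.25e9))   # 1 PB/week
print(nodes_needed(1e16, 7 * 86400, 1.25e9))   # 10 PB/week
```

At 1 PB a single 10 GbE node already isn't enough for a weekly pass,
and at 10 PB you need more than a dozen nodes reading in parallel,
i.e. the data cannot all flow through one machine.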

    sk> It makes use of FUSE which seems a bit suspect,

Yeah.  They do have a lot of stuff on their Testimonials page, though.

Still, will any of that RDMA stuff actually work on BSD?

...and if you're willing to ditch BSD, there are also GFS, OCFS, Lustre, ...

