[nycbug-talk] mapreduce, hadoop, and UNIX

Sat Aug 25 12:23:39 EDT 2007

>>>>> "il" == Isaac Levy <ike at lesmuug.org> writes:

    il> I mean: applying classical computing paradigms and methodology
    il> at wild scales.  (E.G.  with the Google Filesystem, it's
    il> simply a filesystem where the disk blocks exist as network
    il> resources, etc...

I think their ``big trick'' is working on jobs until they are finished
and not teetering piles of dung like OpenMOSIX.  The actual stuff they
do sounds like it's not complicated or complete enough to become part
of Unix---it's just some local webapp or cheesy daemon.

I suspect half of their mapreduce tool is management infrastructure.
Like, there is definitely a webapp the presenter showed us that draws
a bar graph showing how far along each node has gotten in the current
job.  IIRC if a node lags for too long, its job (and bar) gets
reassigned to another node.  And he said often bars don't rise fast
enough because, when a node's disk goes bad, first the node becomes
slow for a day or two before the disk dies completely.

And this management focus pushes its way down to the API---jobs _must_
be composed of restartable chunks with no side-effects that can be
assigned to a node, then scrapped and reassigned.  For example, a
mapreduce job cannot be ``deliver an email to the 200,000 recipients
on this mailing list,'' because if a node dies after sending 1000
messages successfully, mapreduce will restart that job and deliver
duplicates of those 1000 messages.

so it is rather an unimpressive tool.  The virtue of it is how well
they've hammered the quirks out of it, its ability to work on
unreliable hardware (prerequisite for scaling to thousands of cheap
nodes), the simplicity of the interface for new developers/sysadmins,
and support for multiple languages (Python u.s.w.).

    il> - Sun ZFS?  

there's no distributed aspect to ZFS, only an analagous hype aspect.
(honestly it's pretty good though.  this tool I actually _am_ using
for once, not just ranting about.)

    il> - AFS and the like?  
    il> - RH GFS and the like?

It's unfair to compare these to the google GFS because according to
the presentation I saw, google's requires using very large blocks.
It's a bit of a stretch to call Google's a ``file system''.  It's more
of a ``megabyte chunk fetcher/combiner/replicator.''  If you wanted to
store the result of a crawl in one giant file, that's possible, but
for most other tasks you will have to re-implement a filesystem inside
your application to store the tiny files you need inside the one giant
file you are allowed to performantly use.

I think Sun's QFS may be GFS-ish, but with small blocks allowed, and
on a SAN instead of on top of a bunch of cheesy daemons listening on
sockets.  There are variants of this idea from several of the old
proprietary Unix vendors: one node holds the filesystem's metadata,
and all the other nodes connect to it over Ethernet.  but metadata
only.  Data, they hold on a Fibre-channel SAN, and all nodes connect
to it over FC-SW.  so, it is like NFS, but if you will use giant files
like GFS does, and open them on only one node at a time, most of the
traffic passes directly from the client to the disk, without passing
through any file server's CPU.  They had disgusting license terms
where you pay per gigabyte and stuff.

Another thing to point out about RedHat GFS, is that in a Mosix
cluster a device special file points to a device _on a specific
cluster node_.  If I open /dev/hda on NFS, I just get whatever is
/dev/hda on the NFS client.  But on OpenMOSIX/GFS, the device methods
invoked on the GFS client, the open/read/write/ioctl, get wrapped in
TCP and sent to whichever node owns that device filename.  That's
needed for Mosix, but Google doesn't do anything nearly that
ambitious.

    il> distributed computing and RPC systems

i heard erlang is interesting, but haven't tried any of these---just a
bibliography.  so it might be silly, or broken.

I think it's kind of two different tasks, if you want to write a
distributed scientific-computing program from scratch?  or you want to
manage a scripty-type job built from rather large existing programs
(which I suspect is what you want).

One last thing.  when my friend tried OpenMOSIX, he was really excited
for about a month.  Then he slowly realized that the overall system
was completely unreliable---processes quietly, randomly dying, and
other such stuff.  That's the worst anti-recommendation I can think
of.  If he'd said ``I tried it---it didn't work,'' then I might try it
again.  but, ``I tried it.  It wasted a month or two of my time before
I found serious show-stopping problems about which the authors and
evangelists were completely dishonest.''  so personally I want to see
it in action publicly before I spend any time on it, and in action
doing some job where you cannot afford to have processes randomly die.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 304 bytes
Desc: not available
URL: <https://lists.nycbug.org:8443/pipermail/talk/attachments/20070825/d6048f5b/attachment.bin>