[nycbug-talk] mapreduce, hadoop, and UNIX

Isaac Levy ike at lesmuug.org
Sat Aug 25 16:42:37 EDT 2007


On Aug 25, 2007, at 12:23 PM, Miles Nordin wrote:

>>>>>> "il" == Isaac Levy <ike at lesmuug.org> writes:
>     il> I mean: applying classical computing paradigms and methodology
>     il> at wild scales.  (E.G.  with the Google Filesystem, it's
>     il> simply a filesystem where the disk blocks exist as network
>     il> resources, etc...
> I think their ``big trick'' is working on jobs until they are finished
> and not teetering piles of dung like OpenMOSIX.  The actual stuff they
> do sounds like it's not complicated or complete enough to become part
> of Unix---it's just some local webapp or cheesy daemon.

Sure- it's all a small problem scaled up to massive size.

Some of the results are no doubt impressive, but yes- nothing lasting  
enough to become a part of Unix.

> I suspect half of their mapreduce tool is management infrastructure.
> Like, there is definitely a webapp the presenter showed us that draws
> a bar graph showing how far along each node has gotten in the current
> job.  IIRC if a node lags for too long, its job (and bar) gets
> reassigned to another node.  And he said often bars don't rise fast
> enough because, when a node's disk goes bad, first the node becomes
> slow for a day or two before the disk dies completely.
> And this management focus pushes its way down to the API---jobs _must_
> be composed of restartable chunks with no side-effects that can be
> assigned to a node, then scrapped and reassigned.  For example, a
> mapreduce job cannot be ``deliver an email to the 200,000 recipients
> on this mailing list,'' because if a node dies after sending 1000
> messages successfully, mapreduce will restart that job and deliver
> duplicates of those 1000 messages.
> so it is rather an unimpressive tool.  The virtue of it is how well
> they've hammered the quirks out of it, its ability to work on
> unreliable hardware (prerequisite for scaling to thousands of cheap
> nodes), the simplicity of the interface for new developers/sysadmins,
> and support for multiple languages (Python etc.).

Interesting and spot-on take on things- you put a lot in perspective  
here, in the context of the generated hype.
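
To make that restartable-chunks constraint concrete, here's a minimal
sketch in Python- nothing to do with Google's actual API, all names
made up.  Each chunk writes to a private temp file and atomically
renames it on success, so the master can reassign a lagging chunk and
a restarted duplicate does no harm:

import multiprocessing
import os
import tempfile

def map_chunk(args):
    """Process one chunk; side-effect-free and restartable.  Output
    goes to a private temp file that is atomically renamed on success,
    so a reassigned duplicate of this chunk is harmless.  (Contrast
    with delivering mail- that side effect can't be taken back.)"""
    chunk_id, lines = args
    fd, tmp_path = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as out:
        for line in lines:
            out.write(f"{line.strip()}\t1\n")   # toy map: emit (word, 1)
    final_path = f"chunk-{chunk_id:04d}.out"
    os.rename(tmp_path, final_path)             # atomic commit
    return final_path

if __name__ == "__main__":
    chunks = [(i, ["foo", "bar", "foo"]) for i in range(8)]
    with multiprocessing.Pool(4) as pool:
        # Collect results as workers finish; a real master would also
        # time out stragglers and resubmit their chunks elsewhere,
        # exactly as in the bar-graph story above.
        for path in pool.imap_unordered(map_chunk, chunks):
            print("committed", path)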

>     il> - Sun ZFS?
> there's no distributed aspect to ZFS, only an analogous hype aspect.
> (honestly it's pretty good though.  this tool I actually _am_ using
> for once, not just ranting about.)
>     il> - AFS and the like?
>     il> - RH GFS and the like?
> It's unfair to compare these to the google GFS because according to
> the presentation I saw, google's requires using very large blocks.
> It's a bit of a stretch to call Google's a ``file system''.  It's more
> of a ``megabyte chunk fetcher/combiner/replicator.''  If you wanted to
> store the result of a crawl in one giant file, that's possible, but
> for most other tasks you will have to re-implement a filesystem inside
> your application to store the tiny files you need inside the one giant
> file you are allowed to performantly use.
> I think Sun's QFS may be GFS-ish, but with small blocks allowed, and
> on a SAN instead of on top of a bunch of cheesy daemons listening on
> sockets.  There are variants of this idea from several of the old
> proprietary Unix vendors: one node holds the filesystem's metadata,
> and all the other nodes connect to it over Ethernet.  but metadata
> only.  Data, they hold on a Fibre-channel SAN, and all nodes connect
> to it over FC-SW.  so, it is like NFS, but if you will use giant files
> like GFS does, and open them on only one node at a time, most of the
> traffic passes directly from the client to the disk, without passing
> through any file server's CPU.  They had disgusting license terms
> where you pay per gigabyte and stuff.
> Another thing to point out about RedHat GFS, is that in a Mosix
> cluster a device special file points to a device _on a specific
> cluster node_.  If I open /dev/hda on NFS, I just get whatever is
> /dev/hda on the NFS client.  But on OpenMOSIX/GFS, the device methods
> invoked on the GFS client, the open/read/write/ioctl, get wrapped in
> TCP and sent to whichever node owns that device filename.  That's
> needed for Mosix, but Google doesn't do anything nearly that
> ambitious.
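
That giant-blocks point is worth pausing on: if the store only
performs well on huge files, your application ends up keeping its own
(offset, length) index inside one big file- a little filesystem
re-implemented in userland.  A rough Python sketch of the shape of it
(names invented; the index here lives only in memory, and persisting
it is exactly the burden Miles describes):

class ChunkFile:
    """Many small records packed into one large file, found again
    through an application-level index of (offset, length) pairs."""
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}                  # name -> (offset, length)

    def put(self, name, data):
        self.f.seek(0, 2)                # appends always land at EOF
        offset = self.f.tell()
        self.f.write(data)
        self.index[name] = (offset, len(data))

    def get(self, name):
        offset, length = self.index[name]
        self.f.seek(offset)
        return self.f.read(length)

store = ChunkFile("crawl.chunk")
store.put("page-0001", b"<html>...</html>")
print(store.get("page-0001"))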

On Aug 25, 2007, at 11:34 AM, Alex Pilosov wrote:
>> Distributed filesystems are hard, compared to writing an OS.
>> -alex


>     il> distributed computing and RPC systems
> i heard erlang is interesting, but haven't tried any of these---just a
> bibliography.  so it might be silly, or broken.
> I think it's kind of two different tasks, if you want to write a
> distributed scientific-computing program from scratch?  or you want to
> manage a scripty-type job built from rather large existing programs
> (which I suspect is what you want).

That's exactly what I want- but right now, as I'm just barely getting  
going with this personal book-scanning project (which I suspect will  
take years to feel 'complete'), I'm willing to explore any path that  
others have had success with.

Right now, that means tying things together from existing programs-  
and it looks like I have plenty of choices, and room for growth with  
each!
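
For the curious, the scripty path I have in mind looks roughly like
this- not any particular framework, just a hedged sketch in Python:
fan existing command-line programs out over ssh and requeue failures.
The hostnames and the ocr-one-page program are placeholders for
whatever the scanning pipeline actually ends up running:

import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["node1", "node2", "node3"]      # hypothetical hostnames
PAGES = [f"page-{i:04d}.tiff" for i in range(30)]

def process(job):
    node, page = job
    # Run an existing program remotely; a nonzero ssh exit status
    # tells us the page should be retried, perhaps on another node.
    cmd = ["ssh", node, "ocr-one-page", page]    # placeholder program
    return page, subprocess.call(cmd)

jobs = [(NODES[i % len(NODES)], p) for i, p in enumerate(PAGES)]
with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    for page, status in pool.map(process, jobs):
        if status != 0:
            print(f"{page} failed; requeue it")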

> One last thing.  when my friend tried OpenMOSIX, he was really excited
> for about a month.  Then he slowly realized that the overall system
> was completely unreliable---processes quietly, randomly dying, and
> other such stuff.  That's the worst anti-recommendation I can think
> of.  If he'd said ``I tried it---it didn't work,'' then I might try it
> again.  but, ``I tried it.  It wasted a month or two of my time before
> I found serious show-stopping problems about which the authors and
> evangelists were completely dishonest.''  so personally I want to see
> it in action publicly before I spend any time on it, and in action
> doing some job where you cannot afford to have processes randomly die.

Miles, thanks for this lengthy reply- all these practical experiences  
are worth gold to me right now.

