[nycbug-talk] mapreduce, hadoop, and UNIX

Isaac Levy ike at lesmuug.org
Sat Aug 25 16:42:37 EDT 2007


On Aug 25, 2007, at 12:23 PM, Miles Nordin wrote:

>>>>>> "il" == Isaac Levy <ike at lesmuug.org> writes:
>     il> I mean: applying classical computing paradigms and methodology
>     il> at wild scales.  (E.G.  with the Google Filesystem, it's
>     il> simply a filesystem where the disk blocks exist as network
>     il> resources, etc...
> I think their ``big trick'' is working on jobs until they are finished
> and not teetering piles of dung like OpenMOSIX.  The actual stuff they
> do sounds like it's not complicated or complete enough to become part
> of Unix---it's just some local webapp or cheesy daemon.

Sure- it's all a small problem scaled up to massive size.

Some of the results are no doubt impressive, but yes- nothing lasting  
enough to become a part of Unix.

> I suspect half of their mapreduce tool is management infrastructure.
> Like, there is definitely a webapp the presenter showed us that draws
> a bar graph showing how far along each node has gotten in the current
> job.  IIRC if a node lags for too long, its job (and bar) gets
> reassigned to another node.  And he said often bars don't rise fast
> enough because, when a node's disk goes bad, first the node becomes
> slow for a day or two before the disk dies completely.
> And this management focus pushes its way down to the API---jobs _must_
> be composed of restartable chunks with no side-effects that can be
> assigned to a node, then scrapped and reassigned.  For example, a
> mapreduce job cannot be ``deliver an email to the 200,000 recipients
> on this mailing list,'' because if a node dies after sending 1000
> messages successfully, mapreduce will restart that job and deliver
> duplicates of those 1000 messages.
> so it is rather an unimpressive tool.  The virtue of it is how well
> they've hammered the quirks out of it, its ability to work on
> unreliable hardware (prerequisite for scaling to thousands of cheap
> nodes), the simplicity of the interface for new developers/sysadmins,
> and support for multiple languages (Python etc.).

Interesting and spot-on take on things- you put a lot in perspective  
here, in the context of the generated hype.
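
To make that restartable-chunks constraint concrete, here's a minimal
sketch in Python- nothing to do with Google's actual API, all names
made up.  Each chunk writes to a private temp file and atomically
renames it on success, so the master can reassign a lagging chunk and
a restarted duplicate does no harm:

import multiprocessing
import os
import tempfile

def map_chunk(args):
    """Process one chunk; side-effect-free and restartable.  Output
    goes to a private temp file that is atomically renamed on success,
    so a reassigned duplicate of this chunk is harmless.  (Contrast
    with delivering mail- that side effect can't be taken back.)"""
    chunk_id, lines = args
    fd, tmp_path = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as out:
        for line in lines:
            out.write(f"{line.strip()}\t1\n")   # toy map: emit (word, 1)
    final_path = f"chunk-{chunk_id:04d}.out"
    os.rename(tmp_path, final_path)             # atomic commit
    return final_path

if __name__ == "__main__":
    chunks = [(i, ["foo", "bar", "foo"]) for i in range(8)]
    with multiprocessing.Pool(4) as pool:
        # Collect results as workers finish; a real master would also
        # time out stragglers and resubmit their chunks elsewhere,
        # exactly as in the bar-graph story above.
        for path in pool.imap_unordered(map_chunk, chunks):
            print("committed", path)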

>     il> - Sun ZFS?
> there's no distributed aspect to ZFS, only an analogous hype aspect.
> (honestly it's pretty good though.  this tool I actually _am_ using
> for once, not just ranting about.)
>     il> - AFS and the like?
>     il> - RH GFS and the like?
> It's unfair to compare these to the google GFS because according to
> the presentation I saw, google's requires using very large blocks.
> It's a bit of a stretch to call Google's a ``file system''.  It's more
> of a ``megabyte chunk fetcher/combiner/replicator.''  If you wanted to
> store the result of a crawl in one giant file, that's possible, but
> for most other tasks you will have to re-implement a filesystem inside
> your application to store the tiny files you need inside the one giant
> file you are allowed to performantly use.
> I think Sun's QFS may be GFS-ish, but with small blocks allowed, and
> on a SAN instead of on top of a bunch of cheesy daemons listening on
> sockets.  There are variants of this idea from several of the old
> proprietary Unix vendors: one node holds the filesystem's metadata,
> and all the other nodes connect to it over Ethernet.  but metadata
> only.  Data, they hold on a Fibre-channel SAN, and all nodes connect
> to it over FC-SW.  so, it is like NFS, but if you will use giant files
> like GFS does, and open them on only one node at a time, most of the
> traffic passes directly from the client to the disk, without passing
> through any file server's CPU.  They had disgusting license terms
> where you pay per gigabyte and stuff.
> Another thing to point out about RedHat GFS, is that in a Mosix
> cluster a device special file points to a device _on a specific
> cluster node_.  If I open /dev/hda on NFS, I just get whatever is
> /dev/hda on the NFS client.  But on OpenMOSIX/GFS, the device methods
> invoked on the GFS client, the open/read/write/ioctl, get wrapped in
> TCP and sent to whichever node owns that device filename.  That's
> needed for Mosix, but Google doesn't do anything nearly that
> ambitious.
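
That giant-blocks point is worth pausing on: if the store only
performs well on huge files, your application ends up keeping its own
(offset, length) index inside one big file- a little filesystem
re-implemented in userland.  A rough Python sketch of the shape of it
(names invented; the index here lives only in memory, and persisting
it is exactly the burden Miles describes):

class ChunkFile:
    """Many small records packed into one large file, found again
    through an application-level index of (offset, length) pairs."""
    def __init__(self, path):
        self.f = open(path, "a+b")
        self.index = {}                  # name -> (offset, length)

    def put(self, name, data):
        self.f.seek(0, 2)                # appends always land at EOF
        offset = self.f.tell()
        self.f.write(data)
        self.index[name] = (offset, len(data))

    def get(self, name):
        offset, length = self.index[name]
        self.f.seek(offset)
        return self.f.read(length)

store = ChunkFile("crawl.chunk")
store.put("page-0001", b"<html>...</html>")
print(store.get("page-0001"))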

On Aug 25, 2007, at 11:34 AM, Alex Pilosov wrote:
>> Distributed filesystems are hard, compared to writing an OS.
>> -alex


>     il> distributed computing and RPC systems
> i heard erlang is interesting, but haven't tried any of these---just a
> bibliography.  so it might be silly, or broken.
> I think it's kind of two different tasks, if you want to write a
> distributed scientific-computing program from scratch?  or you want to
> manage a scripty-type job built from rather large existing programs
> (which I suspect is what you want).

That's exactly what I want- but right now, as I'm just barely getting  
going with this personal book-scanning project (which I suspect will  
take years to feel 'complete'), I'm willing to explore any path that  
others have had success with.

Right now, that means tying things together from existing programs-  
and it looks like I have plenty of choices, and room for growth with  
each!
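
For the curious, the scripty path I have in mind looks roughly like
this- not any particular framework, just a hedged sketch in Python:
fan existing command-line programs out over ssh and requeue failures.
The hostnames and the ocr-one-page program are placeholders for
whatever the scanning pipeline actually ends up running:

import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = ["node1", "node2", "node3"]      # hypothetical hostnames
PAGES = [f"page-{i:04d}.tiff" for i in range(30)]

def process(job):
    node, page = job
    # Run an existing program remotely; a nonzero ssh exit status
    # tells us the page should be retried, perhaps on another node.
    cmd = ["ssh", node, "ocr-one-page", page]    # placeholder program
    return page, subprocess.call(cmd)

jobs = [(NODES[i % len(NODES)], p) for i, p in enumerate(PAGES)]
with ThreadPoolExecutor(max_workers=len(NODES)) as pool:
    for page, status in pool.map(process, jobs):
        if status != 0:
            print(f"{page} failed; requeue it")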

> One last thing.  when my friend tried OpenMOSIX, he was really excited
> for about a month.  Then he slowly realized that the overall system
> was completely unreliable---processes quietly, randomly dying, and
> other such stuff.  That's the worst anti-recommendation I can think
> of.  If he'd said ``I tried it---it didn't work,'' then I might try it
> again.  but, ``I tried it.  It wasted a month or two of my time before
> I found serious show-stopping problems about which the authors and
> evangelists were completely dishonest.''  so personally I want to see
> it in action publicly before I spend any time on it, and in action
> doing some job where you cannot afford to have processes randomly die.

Miles, thanks for this lengthy reply- all these practical experiences  
are worth gold to me right now.

