[nycbug-talk] mapreduce, hadoop, and UNIX
Isaac Levy
ike at lesmuug.org
Sat Aug 25 16:42:37 EDT 2007
Wow,
On Aug 25, 2007, at 12:23 PM, Miles Nordin wrote:
>>>>>> "il" == Isaac Levy <ike at lesmuug.org> writes:
>
> il> I mean: applying classical computing paradigms and methodology
> il> at wild scales. (E.G. with the Google Filesystem, it's
> il> simply a filesystem where the disk blocks exist as network
> il> resources, etc...
>
> I think their ``big trick'' is working on jobs until they are finished
> and not teetering piles of dung like OpenMOSIX. The actual stuff they
> do sounds like it's not complicated or complete enough to become part
> of Unix---it's just some local webapp or cheesy daemon.
Sure, it's all a small problem scaled up to massive size.
Some of the results are no doubt impressive, but yes, nothing lasting
enough to become a part of Unix.
>
> I suspect half of their mapreduce tool is management infrastructure.
> Like, there is definitely a webapp the presenter showed us that draws
> a bar graph showing how far along each node has gotten in the current
> job. IIRC if a node lags for too long, its job (and bar) gets
> reassigned to another node. And he said often bars don't rise fast
> enough because, when a node's disk goes bad, first the node becomes
> slow for a day or two before the disk dies completely.
>
> And this management focus pushes its way down to the API---jobs _must_
> be composed of restartable chunks with no side-effects that can be
> assigned to a node, then scrapped and reassigned. For example, a
> mapreduce job cannot be ``deliver an email to the 200,000 recipients
> on this mailing list,'' because if a node dies after sending 1000
> messages successfully, mapreduce will restart that job and deliver
> duplicates of those 1000 messages.
>
> so it is rather an unimpressive tool. The virtue of it is how well
> they've hammered the quirks out of it, its ability to work on
> unreliable hardware (prerequisite for scaling to thousands of cheap
> nodes), the simplicity of the interface for new developers/sysadmins,
> and support for multiple languages (Python etc.).
Interesting and spot-on take on things, you put a lot in perspective
here, in the context of the generated hype.
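To make the restartable-chunk point concrete for anyone following
along: here's a tiny sketch of the shape a mapreduce job has to take,
pure map and reduce functions over input records, so the framework can
scrap a lagging or failed task and re-run it elsewhere without side
effects.  The names and the toy single-process driver are my own
illustration, not the Google or Hadoop API:

# Hypothetical sketch of the map/reduce contract -- not the real
# Google or Hadoop API.  The point: map_fn() and reduce_fn() are pure
# functions of their inputs, so the framework can scrap any chunk and
# re-run it on another node without duplicating work.

from collections import defaultdict

def map_fn(doc_id, text):
    # Emits (key, value) pairs; no side effects, so re-running is harmless.
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Combines all values for one key; again safe to re-execute.
    yield word, sum(counts)

def run_local(inputs):
    # Toy single-process driver standing in for the cluster scheduler.
    intermediate = defaultdict(list)
    for doc_id, text in inputs:
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)
    results = {}
    for key, values in sorted(intermediate.items()):
        for out_key, out_value in reduce_fn(key, values):
            results[out_key] = out_value
    return results

if __name__ == "__main__":
    docs = [("a", "the quick brown fox"), ("b", "the lazy dog")]
    print(run_local(docs))

The 200,000-email example breaks that contract precisely because
delivery is a side effect the framework can't take back when it
restarts a chunk.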
>
> il> - Sun ZFS?
>
> there's no distributed aspect to ZFS, only an analogous hype aspect.
> (honestly it's pretty good though. this tool I actually _am_ using
> for once, not just ranting about.)
>
> il> - AFS and the like?
> il> - RH GFS and the like?
>
> It's unfair to compare these to the google GFS because according to
> the presentation I saw, google's requires using very large blocks.
> It's a bit of a stretch to call Google's a ``file system''. It's more
> of a ``megabyte chunk fetcher/combiner/replicator.'' If you wanted to
> store the result of a crawl in one giant file, that's possible, but
> for most other tasks you will have to re-implement a filesystem inside
> your application to store the tiny files you need inside the one giant
> file you are allowed to performantly use.
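For what it's worth, here's roughly what "re-implement a filesystem
inside your application" ends up looking like: pack the tiny records
into one big append-only file and keep your own index of offsets,
since the underlying store only performs well with huge files.  Purely
my own illustration with made-up names, nothing to do with actual GFS
code:

# Illustrative only: many small records packed into one large
# append-only file, with an application-level index of
# (offset, length).  Roughly the bookkeeping you end up doing when the
# store only likes very large files.

class PackedChunk:
    def __init__(self, path):
        self.path = path
        self.index = {}                   # name -> (offset, length)
        open(self.path, "wb").close()     # start with an empty chunk file

    def append(self, name, data):
        with open(self.path, "ab") as f:
            offset = f.tell()             # append mode: positioned at EOF
            f.write(data)
        self.index[name] = (offset, len(data))

    def read(self, name):
        offset, length = self.index[name]
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(length)

if __name__ == "__main__":
    chunk = PackedChunk("/tmp/bigchunk.dat")
    chunk.append("page1", b"<html>crawl result one</html>")
    chunk.append("page2", b"<html>crawl result two</html>")
    print(chunk.read("page2"))

(And of course the index itself has to live somewhere and survive
crashes, which is exactly the filesystem work you were hoping to
avoid.)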
>
> I think Sun's QFS may be GFS-ish, but with small blocks allowed, and
> on a SAN instead of on top of a bunch of cheesy daemons listening on
> sockets. There are variants of this idea from several of the old
> proprietary Unix vendors: one node holds the filesystem's metadata,
> and all the other nodes connect to it over Ethernet. but metadata
> only. Data, they hold on a Fibre-channel SAN, and all nodes connect
> to it over FC-SW. so, it is like NFS, but if you will use giant files
> like GFS does, and open them on only one node at a time, most of the
> traffic passes directly from the client to the disk, without passing
> through any file server's CPU. They had disgusting license terms
> where you pay per gigabyte and stuff.
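In case the metadata/data split isn't obvious, a toy sketch of the
idea (invented names, not QFS code): the client asks the one metadata
node where the blocks live, then pulls the data straight off the
shared storage, so the bulk traffic never touches the file server's
CPU.

# Illustration of the control/data path split described above.  All
# names are invented for the sketch; this is not QFS or any vendor API.

class MetadataServer:
    """Stands in for the one node that owns the filesystem metadata."""
    def __init__(self):
        self.block_map = {"bigfile": [("lun0", 0), ("lun0", 1), ("lun1", 0)]}

    def lookup(self, name):
        # Only small metadata answers travel over this (Ethernet) path.
        return self.block_map[name]

class SharedStorage:
    """Stands in for the FC SAN that every node can reach directly."""
    def __init__(self):
        self.luns = {"lun0": [b"hello ", b"world "], "lun1": [b"!"]}

    def read_block(self, lun, index):
        # Bulk data moves on this path, client to disk, no file server CPU.
        return self.luns[lun][index]

def read_file(mds, san, name):
    return b"".join(san.read_block(lun, idx) for lun, idx in mds.lookup(name))

if __name__ == "__main__":
    print(read_file(MetadataServer(), SharedStorage(), "bigfile"))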
>
> Another thing to point out about RedHat GFS, is that in a Mosix
> cluster a device special file points to a device _on a specific
> cluster node_. If I open /dev/hda on NFS, I just get whatever is
> /dev/hda on the NFS client. But on OpenMOSIX/GFS, the device methods
> invoked on the GFS client, the open/read/write/ioctl, get wrapped in
> TCP and sent to whichever node owns that device filename. That's
> needed for Mosix, but Google doesn't do anything nearly that
> ambitious.
On Aug 25, 2007, at 11:34 AM, Alex Pilosov wrote:
>> Distributed filesystems are hard, compared to writing an OS.
>>
>> -alex
:)
>
>
> il> distributed computing and RPC systems
>
> i heard erlang is interesting, but haven't tried any of these---just a
> bibliography. so it might be silly, or broken.
>
> I think it's kind of two different tasks, if you want to write a
> distributed scientific-computing program from scratch? or you want to
> manage a scripty-type job built from rather large existing programs
> (which I suspect is what you want).
That's exactly what I want- but right now, as I'm just barely getting
going with this personal book-scanning project (which I suspect will
take years to feel 'complete'), I'm willing to explore any path that
others have had success with.
For now, that means tying things together from existing programs-
and it looks like I have plenty of choices, and room for growth with
each!
>
> One last thing. when my friend tried OpenMOSIX, he was really excited
> for about a month. Then he slowly realized that the overall system
> was completely unreliable---processes quietly, randomly dying, and
> other such stuff. That's the worst anti-recommendation I can think
> of. If he'd said ``I tried it---it didn't work,'' then I might try it
> again. but, ``I tried it. It wasted a month or two of my time before
> I found serious show-stopping problems about which the authors and
> evangelists were completely dishonest.'' so personally I want to see
> it in action publicly before I spend any time on it, and in action
> doing some job where you cannot afford to have processes randomly die.
Miles, thanks for this lengthy reply, all these practical experiences
are worth gold to me right now.
Rocket-
.ike