[nycbug-talk] mapreduce, hadoop, and UNIX

H. G. tekronis at gmail.com
Sat Aug 25 11:56:36 EDT 2007

On 8/25/07, Isaac Levy <ike at lesmuug.org> wrote:
> Hey All,
> This *may* be a thread aimed at the wrong list, but I thought it
> was appropriate.  Feel free to yell at me to take the thread
> elsewhere.
> --
> Pretext:
> I've been working on a personal project lately which has landed me
> with some home-scale monster number-crunching tasks, as well as some
> quickly scaling massive storage requirements.
> (Fun fun fun, I'm trying to scan and OCR my personal book collection,
> and I'm getting scared out of my mind now that I'm making some
> headway :)
> Anyhow, I've been looking really closely at Google's MapReduce system/
> algorithm spec, which seems to be at the heart of how they make their
> massive clusters work.  This seems to be the current hot topic in
> macho computing.
> http://en.wikipedia.org/wiki/MapReduce
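
The pattern described above is easy to sketch in a few lines of Python.
This is only the programming model on a single machine, not Google's
distributed runtime: the real system shards map tasks across machines
and shuffles intermediate pairs by key over the network.

```python
from collections import defaultdict

# Toy, single-machine sketch of the MapReduce pattern (word count).
# Illustrates the model only; fault tolerance and the distributed
# shuffle are what the real implementations add.

def map_fn(document):
    """Emit (word, 1) for every word in the input."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Combine all counts emitted for one word."""
    return (word, sum(counts))

def mapreduce(documents, map_fn, reduce_fn):
    # "Shuffle" phase: group intermediate values by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return dict(reduce_fn(k, v) for k, v in groups.items())

result = mapreduce(["a rose is a rose", "a daisy"], map_fn, reduce_fn)
# result["a"] == 3, result["rose"] == 2
```

The point of the split is that every map_fn call is independent, so the
map phase parallelizes trivially; the shuffle is the only global step.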
> Fun for UNIX folks: Rob Pike made an awk-like utility/language called
> 'Sawzall' which uses Google's internal MapReduce API; I think it's
> pretty interesting.
> http://labs.google.com/papers/sawzall.html
> With that, I've also found that Yahoo is putting massive support into
> an implementation of the MapReduce idea, Open Source as a part of the
> Apache Project:
> http://lucene.apache.org/hadoop/
> There are other implementations cropping up all over, it seems.  Like
> I said, it's the current buzz...
> With all that, many of us have noticed that Google is good at scaling
> patterns in computing- to bastardize their whole tech. operation,
> it's their big trick.  When I say scaling patterns, I mean: applying
> classical computing paradigms and methodology at wild scales.  (E.G.
> with the Google Filesystem, it's simply a filesystem where the disk
> blocks exist as network resources, etc...  You see what I mean with
> scale?)
> --
> Question:
> Anyhow, I'm looking for more patterns in this MapReduce stuff,
> because I'm simply not one to dive headfirst into 'buzz-tech'.  With
> that, aside from the map and reduce functions found in many
> programming languages,
> http://en.wikipedia.org/wiki/Map_%28higher-order_function%29
> http://en.wikipedia.org/wiki/Fold_%28higher-order_function%29
> can anyone shed some light on similar prior works in distributed
> computing and RPC systems which are 'old classics' in UNIX?  These
> distributed computing problems simply can't be new.
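
For reference, the two higher-order functions linked above are both in
Python's standard toolkit: map() applies a function to each element
independently (the embarrassingly parallel half), and reduce() folds
the results into one value.

```python
from functools import reduce

# map: apply a function to every element, independently.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# reduce (a left fold): combine the mapped results into one value.
total = reduce(lambda acc, x: acc + x, squares, 0)

# squares == [1, 4, 9, 16], total == 30
```

MapReduce-the-system is essentially these two functions with the map
calls farmed out to a cluster and the fold done per key.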
> To be really straight, what I'm getting at is: why is this more or
> less useful than intelligently piping commands through ssh?  What
> about older UNIX RPC mechanisms?  Aren't there patterns, even in
> kernel source code, which match this work, or which are
> computationally more sophisticated and advanced?
> From kernel to userland to network, I'm dying to find similar works,
> any help is much appreciated!
> Rocket-
> .ike
> ---
> p.s.
> If anyone is interested in book-scanning stuff, Google currently
> hosts the Open Source 'tesseract' project, a very nice OCR
> package- a very clean command-line application for OCR processing.
> (I was actually inspired to start all this stuff based on how much
> fun I had screwing around with tesseract.)  Apache license, /me shrugs.
> http://code.google.com/p/tesseract-ocr/
> _______________________________________________
> % NYC*BUG talk mailing list
> http://lists.nycbug.org/mailman/listinfo/talk
> %Be sure to check out our Jobs and NYCBUG-announce lists
> %We meet the first Wednesday of the month

You might want to look into Starfish, which is a MapReduce implementation
for Ruby ( http://rufy.com/starfish/doc/ ).  It should be much simpler
than dealing with Hadoop (but that's just my personal opinion).