[nycbug-talk] mapreduce, hadoop, and UNIX
ike at lesmuug.org
Sat Aug 25 10:48:21 EDT 2007
This *may* be a thread aimed at the wrong list, but I
think it's appropriate. Feel free to yell at me to take the thread
I've been working on a personal project lately which has landed me
with some home-scale monster number-crunching tasks, as well as some
quickly scaling massive storage requirements.
(Fun fun fun, I'm trying to scan and OCR my personal book collection,
and I'm getting scared out of my mind now that I'm making some
Anyhow, I've been looking really closely at Google's MapReduce system/
algorithm spec, which seems to be at the heart of how they make their
massive clusters work. This seems to be the current hot topic in
Fun for UNIX folks, Rob Pike made an awk-like utility/language called
'Sawzall' which uses Google's internal MapReduce API- I think it's
With that, I've also found that Yahoo is putting massive support into
an implementation of the MapReduce idea, Open Source as a part of the
There are other implementations cropping up all over, it seems. Like I
said, it's the current buzz...
With all that, many of us have noticed that Google is good at scaling
patterns in computing; to oversimplify their whole tech operation,
it's their big trick. When I say scaling patterns, I mean applying
classical computing paradigms and methodology at wild scales. (E.g.
the Google Filesystem is simply a filesystem where the disk
blocks exist as network resources, etc... You see what I mean with
Anyhow, I'm looking for more patterns in this MapReduce stuff,
because I'm simply not one to dive headfirst into 'buzz-tech'. With
that, aside from the map and reduce functions found in many
can anyone shed some light on similar prior works in distributed
computing and RPC systems which are 'old classics' in UNIX? These
distributed computing problems simply can't be new.
To be really straight, what I'm getting at is: why is this more or
less useful than intelligently piping commands through ssh? What
about older UNIX RPC mechanisms? Aren't there patterns, even in
kernel source code, which match this work, or are even computationally
more sophisticated and advanced?
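
For comparison, here is a minimal sketch of the map/reduce pattern itself,
as a single-process word count in Python. It is only an illustration: the
function names (map_phase, shuffle, reduce_phase, word_count) are made up
for this example and are not Google's or Hadoop's API, and there is no
distribution or fault tolerance here at all.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word on one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for what a framework does
    between the map and reduce steps (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all values seen for one key into a total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence, in one process."""
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "The lazy dog"]))
```

This is roughly what a `tr | sort | uniq -c` pipeline already does on one
box, with sort playing the part of the shuffle. What the frameworks seem to
add is running the map and reduce steps across many machines.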
From kernel to userland to network, I'm dying to find similar works,
any help is much appreciated!
If anyone is interested in book-scanning stuff, Google happens to
currently host the Open Source 'tesseract' project, a very nice piece
of OCR software and a very clean command-line application for OCR processing.
(I was actually inspired to start all this stuff based on how much
fun I had screwing around with tesseract). Apache licence, /me shrugs.