[nycbug-talk] mapreduce, hadoop, and UNIX

Isaac Levy ike at lesmuug.org
Sat Aug 25 10:48:21 EDT 2007

Hey All,

This *may* be a thread which is aimed at the wrong list, but I  
thought it was appropriate.  Feel free to yell at me to take the  
thread elsewhere.

I've been working on a personal project lately which has landed me  
with some home-scale monster number-crunching tasks, as well as some  
quickly scaling massive storage requirements.
(Fun fun fun, I'm trying to scan and OCR my personal book collection,  
and I'm getting scared out of my mind now that I'm making some  
headway :)

Anyhow, I've been looking really closely at Google's MapReduce system/ 
algorithm spec, which seems to be at the heart of how they make their  
massive clusters work.  This seems to be the current hot topic in  
macho computing.
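To make the idea concrete, here's a toy, single-process sketch of the MapReduce pattern doing word counting.  This is just an illustration of the algorithm's shape, not Google's API - the function names (map_fn, reduce_fn, mapreduce) are mine, and a real implementation distributes each phase across machines:

```python
from collections import defaultdict

def map_fn(doc):
    # Map phase: emit (key, value) pairs - one (word, 1) per word.
    for word in doc.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Reduce phase: aggregate all values seen for one key.
    return (key, sum(values))

def mapreduce(docs):
    # "Shuffle" phase: group intermediate values by key,
    # then hand each group to the reducer.
    groups = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(mapreduce(docs))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The whole trick is that map and reduce are independent per key, so each phase can be farmed out to thousands of machines with the shuffle in between.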

Fun for UNIX folks, Rob Pike made an awk-like utility/language called  
'Sawzall' which uses Google's internal MapReduce API- I think it's  
pretty interesting.

With that, I've also found that Yahoo is putting massive support into  
an Open Source implementation of the MapReduce idea, Hadoop, as part  
of the Apache Project.

There are other implementations cropping up all over, it seems.  Like  
I said, it's the current buzz...

With all that, many of us have noticed that Google is good at scaling  
patterns in computing - to bastardize their whole tech operation,  
it's their big trick.  When I say scaling patterns, I mean: applying  
classical computing paradigms and methodologies at wild scales.  
(E.g. the Google File System is simply a filesystem where the disk  
blocks exist as network resources.  You see what I mean with the  
pattern.)

Anyhow, I'm looking for more patterns in this MapReduce stuff,  
because I'm simply not one to dive headfirst into 'buzz-tech'.  With  
that, aside from the map and reduce functions found in many  
programming languages, can anyone shed some light on similar prior  
work in distributed computing and RPC systems - the 'old classics' of  
UNIX?  These distributed computing problems simply can't be new.
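For what it's worth, the per-language building blocks the name comes from are the map and reduce (fold) primitives from Lisp and other functional languages.  In Python, for instance:

```python
from functools import reduce

# map applies a function to every element independently...
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

# ...and reduce folds the results together into one value.
total = reduce(lambda a, b: a + b, squares, 0)       # 30
```

Google's contribution isn't these primitives - it's running them over petabytes on unreliable commodity hardware.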

To be really straight, what I'm getting at is this: why is this  
approach more or less useful than intelligently piping commands  
through ssh?  What about older UNIX RPC mechanisms?  Aren't there  
patterns, even in kernel source code, which match this work, or are  
even computationally more sophisticated and advanced?
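To sketch what I mean by the ssh approach - a hand-rolled fan-out of work to remote hosts.  The hostnames and filenames here are hypothetical, and it assumes key-based ssh auth is already set up:

```python
import subprocess

hosts = ["node1", "node2"]              # hypothetical worker hosts
chunks = ["chunk-a.txt", "chunk-b.txt"]  # one input chunk per host

def ssh_argv(host, cmd):
    # Build the argv for running `cmd` on `host` over ssh.
    return ["ssh", host] + cmd

def run_remote(host, cmd):
    # Execute the command remotely and return its stdout.
    return subprocess.run(ssh_argv(host, cmd),
                          capture_output=True, text=True).stdout

# e.g. fan out a word count, one chunk per host:
argvs = [ssh_argv(h, ["wc", "-w", c]) for h, c in zip(hosts, chunks)]
```

As far as I can tell, what the MapReduce frameworks add on top of this is exactly the boring hard stuff: scheduling, retrying failed workers, data locality, and the shuffle/sort between the map and reduce phases.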

From kernel to userland to network, I'm dying to find similar works;  
any help is much appreciated!


If anyone is interested in book-scanning stuff, Google currently  
hosts the Open Source 'Tesseract' project, a very nice piece of OCR  
software - a very clean command-line application for OCR processing.  
(I was actually inspired to start all this based on how much fun I  
had screwing around with Tesseract.)  Apache license, /me shrugs.
