<span class="gmail_quote"><span class="e" id="q_1149dbc436a2040b_0"><span class="gmail_quote">On 8/25/07, <b class="gmail_sendername">Isaac Levy</b> <<a href="mailto:ike@lesmuug.org" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

ike@lesmuug.org</a>> wrote:</span><blockquote class="gmail_quote" style="margin-top: 0; margin-right: 0; margin-bottom: 0; margin-left: 0; margin-left: 0.80ex; border-left-color: #cccccc; border-left-width: 1px; border-left-style: solid; padding-left: 1ex">

 Hey All,<br><br>This *may* be a thread which is aimed at the wrong list, but I<br>thought it's appropriate.  Feel free to yell at me to take the thread<br>elsewhere.<br><br>--<br>Pretext:<br>I've been working on a personal project lately which has landed me 

<br>with some home-scale monster number-crunching tasks, as well as some<br>quickly scaling massive storage requirements.<br>(Fun fun fun, I'm trying to scan and OCR my personal book collection,<br>and I'm getting scared out of my mind now that I'm making some 

<br>headway :)<br><br>Anyhow, I've been looking really closely at Google's MapReduce system/<br>algorithm spec, which seems to be at the heart of how they make their<br>massive clusters work.  This seems to be the current hot topic in 

<br>macho computing.<br><a href="http://en.wikipedia.org/wiki/MapReduce" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://en.wikipedia.org/wiki/MapReduce</a><br><br>Fun for UNIX folks, Rob Pike made an awk-like utility/language called

<br>'Sawzall' which uses Google's internal MapReduce API- I think it's <br>pretty interesting.<br><a href="http://labs.google.com/papers/sawzall.html" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

http://labs.google.com/papers/sawzall.html</a><br><br>With that, I've also found that Yahoo is putting massive support into<br>an implementation of the MapReduce idea, Open Source as a part of the <br>Apache Project:<br>

<a href="http://lucene.apache.org/hadoop/" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://lucene.apache.org/hadoop/</a><br><br>There are other implementations cropping up all over, it seems.  Like I

<br>said, it's the current buzz... <br><br>With all that, many of us have noticed that Google is good at scaling<br>patterns in computing- to bastardize their whole tech. operation,<br>it's their big trick.  When I say scaling patterns, I mean: applying 

<br>classical computing paradigms and methodology at wild scales.  (E.G.<br>with the Google Filesystem, it's simply a filesystem where the disk<br>blocks exist as network resources, etc...  You see what I mean with<br>

 scale?)<br><br><br>--<br>Question:<br><br>Anyhow, I'm looking for more patterns in this MapReduce stuff,<br>because I'm simply not one to dive headfirst into 'buzz-tech'.  With<br>that, aside from the map, and reduce, functions found in many 

<br>programming languages,<br><a href="http://en.wikipedia.org/wiki/Map_%28higher-order_function%29" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://en.wikipedia.org/wiki/Map_%28higher-order_function%29

</a><br><a href="http://en.wikipedia.org/wiki/Fold_%28higher-order_function%29" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)"> http://en.wikipedia.org/wiki/Fold_%28higher-order_function%29</a><br>

<br>can anyone shed some light on similar prior works in distributed<br>computing and RPC systems which are 'old classics' in UNIX?  These<br>distributed computing problems simply can't be new. <br><br>To be really straight, what I'm getting at, is why is this more or

less useful than intelligently piping commands through ssh?  What about older UNIX rpc mechanisms?  Aren't there patterns in even  kernel source code which match this work, or are even computationally more sophisticated and advanced?

<br><br> From kernel to userland to network, I'm dying to find similar works,<br>any help is much appreciated!<br><br>Rocket-<br>.ike<br><br><br>---<br>p.s.<br>If anyone is interested in book-scanning stuff, Google happens to

<br>currently host the Open Source 'tesseract' project, a very nice OCR<br>software- a very clean command-line application for OCR processing. <br>(I was actually inspired to start all this stuff based on how much

<br>fun I had screwing around with tesseract).  Apache licence, /me shrugs.<br><a href="http://code.google.com/p/tesseract-ocr/" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://code.google.com/p/tesseract-ocr/ 

</a><br><br><br>_______________________________________________<br>% NYC*BUG talk mailing list<br><a href="http://lists.nycbug.org/mailman/listinfo/talk" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">

http://lists.nycbug.org/mailman/listinfo/talk</a> %Be sure to check out our Jobs and NYCBUG-announce lists  %We meet the first Wednesday of the month </blockquote> You might want to look into Starfish, which is a MapReduce implementation for Ruby.

<br>( <a href="http://rufy.com/starfish/doc/" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">http://rufy.com/starfish/doc/ </a> )  It should be much simpler than dealing with Hadoop<br>(but that's just my personal opinion). 

<br>