Once upon a time, I called MapReduce and Sawzall “major force multipliers”. At work, I’m learning the hard way how much the Sawzall half of that combination matters. It’s great to have a scalable distributed programming model and massive data storage engines, but data querying and manipulation are the secret sauce where the magic happens. Trying to do query-type work at the Java API level is “teh suck”.
According to Wikipedia, Sawzall never hit the open source big time, but at least Pig, Hive, and Cascalog came to be. I’m still pinin’ for a good open source graph query language and runtime baked into the Hadoop ecosystem, though.