I enjoyed Paul Lamere’s write-up of how he processed a very large music dataset using Amazon S3 and Amazon Elastic MapReduce. Paul was hacking on the Million Song Dataset, but the info seems highly relevant to my noodlings with the MemeTracker data. It’s pretty amazing that anybody off the street can harness that much processing power for 10 bucks. And the job finished for Lamere in 20 minutes.
One thing I’ve learned from munging a few tens of gigabytes: you start to time everything. Once things start taking longer than 10 minutes or so, you want to 1) figure out whether you’re doing something stupid, and 2) figure out how to make it faster.
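A minimal sketch of what timing everything can look like in Python, using only the standard library. The `timed` helper and the `count_lines` step are hypothetical names invented for illustration, not the actual MemeTracker processing code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block and print the elapsed wall-clock seconds.

    `results` is an optional dict that collects timings by label
    (a hypothetical convenience, handy for comparing steps later).
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.2f}s")


def count_lines(lines):
    # Stand-in for a real munging step over a large file.
    return sum(1 for _ in lines)


timings = {}
with timed("count_lines", timings):
    n = count_lines(f"record {i}\n" for i in range(100_000))
```

Wrapping each stage of a pipeline this way makes it easy to spot the one step that crosses the 10 minute mark, which is usually where the "something stupid" is hiding.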
Personally, I’ve got a short-term goal of getting my data up onto S3. Then I’ll be digging into MrJob, which I’ve run across previously.