Boinging the Common Crawl

Posted on: Mon 31 December 2012

Here’s an interesting hack I really don’t have time to execute on. Take a Boing Boing data dump and use the Common Crawl dataset to dissect either references to Boing Boing pages or outlinks from the archive. There might be some interesting intersections, or you could trace out the patterns of Boing Boing influence.
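As a rough starting point for the outlink side, here is a minimal sketch that tallies the external hosts linked from a chunk of post HTML. It assumes the post bodies are available as HTML strings; the names `LinkCollector` and `outlink_hosts` and the `boingboing.net` default are mine for illustration, not anything defined by the dump or by Common Crawl.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def outlink_hosts(html, own_host="boingboing.net"):
    """Count external hosts linked from one HTML fragment.

    own_host is an assumption: links to it (or its subdomains) are
    treated as internal and skipped, as are relative links.
    """
    parser = LinkCollector()
    parser.feed(html)
    hosts = Counter()
    for link in parser.links:
        host = urlparse(link).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host and host != own_host and not host.endswith("." + own_host):
            hosts[host] += 1
    return hosts
```

Summing these counters across all posts would give a list of hosts to go looking for in Common Crawl.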

A tangential side project: build up a set of mrjob modules to work with either dataset. The Boing Boing dump comes as one big file, so it might be useful for someone to bust it up into smaller units and put them on Amazon S3, or otherwise make them publicly available.
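The busting-up step could look something like the sketch below. It assumes the dump can be treated as UTF-8 text with one record per line, which may well not hold for the real file (if a record spans several lines, you would need to split on record boundaries instead). The function name, chunk size, and `part-NNNNN.txt` naming are all hypothetical, and uploading the parts to S3 is left out.

```python
import os


def split_by_lines(src_path, out_dir, lines_per_chunk=10000):
    """Split a text file into numbered chunks of at most lines_per_chunk lines.

    Returns the list of chunk paths in order. Assumes one record per line.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    with open(src_path, "r", encoding="utf-8") as src:
        for i, line in enumerate(src):
            # Start a new chunk file every lines_per_chunk lines.
            if i % lines_per_chunk == 0:
                if out is not None:
                    out.close()
                path = os.path.join(
                    out_dir, "part-%05d.txt" % (i // lines_per_chunk)
                )
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
            out.write(line)
    if out is not None:
        out.close()
    return paths
```

Smaller line-oriented files like these are also a convenient input shape for line-at-a-time MapReduce jobs.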