5 Billion Pages

Posted on: Wed 16 May 2012

Ever since I stopped my personal Twitter data collection project, I’ve been mentally casting about for a new dataset to build my Mad Data Skillz ™ (Boyeeeee!). Obviously, just restarting the tweet inflow is an option, but something involving more scale with less work would be nice.

Enter Common Crawl, a non-profit making a large crawl (5 billion Web pages) publicly and freely available on Amazon S3. How juicy! A big dataset conveniently located within the premier, openly available, utility computing infrastructure in the world. Definitely has potential to put the Skillz to the test. Common Crawl even has a convenient series of blog posts explaining how to process their page repository.