Scott Robertson put together an index of the URLs in the Common Crawl dataset, so that no one has to trawl the entire contents just to find where the links are:
I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).
In keeping with Common Crawl tradition, we're making the entire index available as a giant download. Fear not, there's no need to rack up bandwidth bills downloading the entire thing. We've implemented it as a prefixed b-tree, so you can access parts of it randomly from S3 using byte-range requests. At the same time, you're free to download the entire beast and work with it directly if you desire.
Information about the format, along with Python samples for accessing it, is available on GitHub. Feel free to post questions in the issue tracker and wiki there.
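The byte-range trick is worth appreciating: because the index is a prefixed b-tree sitting in S3, you can look things up by fetching only the blocks you need over plain HTTP. As a rough sketch of that idea (this is not the official client; the URL and block size below are made up, so check the GitHub repo for the real ones):

```python
import requests

# Hypothetical location of the index file on S3 -- the real bucket/key and
# block size are documented in the project's GitHub repo.
INDEX_URL = "https://commoncrawl.s3.amazonaws.com/example/url-index"
BLOCK_SIZE = 2 ** 16  # assumed block size for illustration only


def fetch_block(block_number):
    """Fetch a single fixed-size block of the index with an HTTP Range
    request, so the whole multi-gigabyte file never has to be downloaded."""
    start = block_number * BLOCK_SIZE
    end = start + BLOCK_SIZE - 1
    resp = requests.get(INDEX_URL, headers={"Range": "bytes=%d-%d" % (start, end)})
    resp.raise_for_status()
    return resp.content


if __name__ == "__main__":
    # In a prefixed b-tree layout, the first block(s) hold the header/root
    # information you would use to walk down to the leaf containing a URL.
    block = fetch_block(0)
    print("fetched %d bytes" % len(block))
```

The real lookup logic (parsing the header, walking interior nodes by URL prefix) lives in the Python samples linked above; the point here is just that each step is one small ranged GET.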
Combine it with a previous hare-brained scheme to good effect.