Pavel Repin copiously documents his initial foray into processing the Common Crawl data set:
At my company, we are building infrastructure that enables us to perform computations involving large bodies of text data.
To get familiar with the tech involved, I started with a simple experiment: using the Common Crawl metadata corpus, count crawled URLs grouped by top-level domain (TLD).
…
It’s not a very exciting query. To be blunt, it’s a pretty boring one. But that’s the point: this is new ground for me, so I start simple and capture what I’ve learned.
This gist is a fully-fledged Git repo with all the code necessary to run this query, so if you want to play with the code yourself, go ahead and clone this thing.
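For a sense of what the query boils down to, here is a minimal sketch of the TLD-grouping step in plain Python. It assumes the crawled URLs have already been pulled out of the metadata into a local text file (one URL per line, a hypothetical urls.txt); Pavel's actual run works against the full Common Crawl metadata corpus with the code in his repo, so this is only an illustration of the grouping logic, not his pipeline.

```python
# Sketch: count URLs grouped by top-level domain (TLD).
# Assumes a local file with one URL per line (hypothetical urls.txt),
# not the actual Common Crawl metadata corpus.
import sys
from collections import Counter
from urllib.parse import urlparse


def tld_of(url: str) -> str:
    """Return the last hostname label, e.g. 'com' for http://example.com/page."""
    host = urlparse(url.strip()).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else host


def count_tlds(lines) -> Counter:
    """Group non-empty lines (URLs) by TLD and count them."""
    return Counter(tld_of(line) for line in lines if line.strip())


if __name__ == "__main__":
    # Usage: python count_tlds.py urls.txt
    with open(sys.argv[1]) as f:
        for tld, n in count_tlds(f).most_common(20):
            print(f"{tld}\t{n}")
```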