About a week and a half ago I got possessed with the loony notion to collect 1 million tweets. I've worked with the Twitter streaming API in the past, so I thought it would simply be a matter of keeping the connection up for an extended period of time.
I started with the simplest possible thing that might work, directly from the Twitter streaming API documentation:
curl -d @locations https://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password
My first couple of cracks didn't quite work. As documented, the HTTP connection is subject to various disruptions, so I wrapped the curl invocation in a Python script that re-executes it as needed.
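Something along these lines, to give a sense of the approach; this is a minimal sketch rather than the exact script, and the "locations" file, output filename, and placeholder credentials are assumptions:

import subprocess
import time

# Same curl invocation as above, with hypothetical credentials and a
# "locations" file holding the bounding-box parameters.
CURL_CMD = [
    "curl", "-d", "@locations",
    "https://stream.twitter.com/1/statuses/filter.json",
    "-uAnyTwitterUser:Password",
]

while True:
    # Append whatever the stream delivers until curl exits (i.e. the
    # connection drops or the request fails).
    with open("tweets.json", "a") as out:
        subprocess.call(CURL_CMD, stdout=out)
    # Back off briefly, then reconnect and keep collecting.
    time.sleep(10)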
The script has now been running for about five days. I've got about 647 MB of data totaling north of 250,000 tweets. I can't tell whether the script has needed to restart, but I'm impressed it's lived this long.
And 250K tweets, with metadata, should make for some interesting analysis.