So Gowalla and Instagram have real-time streaming APIs. Not quite as easy as Twitter since those services use PubSubHubbub to stream their updates. It also means running a notification receiver that accepts incoming HTTP requests on my Linode. So for my next trick, I’d like to see if I can’t get a million updates out of each of those services.
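For the curious, here is a minimal sketch of what such a notification receiver might look like, using only the Python standard library. It assumes the usual PubSubHubbub handshake, where the hub sends a GET carrying hub.mode and hub.challenge and the subscriber echoes the challenge back, followed by POSTs carrying the updates. The port, the updates.log file name, and the function names are all made up for illustration, and a real receiver would also want to check the topic and any verify token.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def verification_reply(query_string):
    """Return (status, body) for a hub verification GET.

    The subscriber confirms a subscription by echoing hub.challenge
    back with a 2xx status; anything else is rejected with a 404.
    """
    params = parse_qs(query_string)
    mode = params.get("hub.mode", [None])[0]
    challenge = params.get("hub.challenge", [None])[0]
    if challenge is None or mode not in ("subscribe", "unsubscribe"):
        return 404, b""
    return 200, challenge.encode("utf-8")


class Receiver(BaseHTTPRequestHandler):
    def do_GET(self):
        # Verification handshake initiated by the hub.
        status, body = verification_reply(urlparse(self.path).query)
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # An update notification: append the raw payload to a log file.
        length = int(self.headers.get("Content-Length", 0))
        payload = self.rfile.read(length)
        with open("updates.log", "ab") as out:  # hypothetical file name
            out.write(payload + b"\n")
        self.send_response(204)
        self.end_headers()


def serve(port=8080):  # hypothetical port
    """Block forever, handling hub requests on the given port."""
    HTTPServer(("", port), Receiver).serve_forever()
```

Calling serve() starts the listener. Keeping the handshake logic in a plain function makes it easy to poke at without a live hub.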
Passed the 500K mark in tweets collected. Halfway home. I’m not so much amazed that I’ve gotten this far, but that there haven’t been more bumps. Basically, I’ve had a single script running for well over a week. That script has only needed to launch two child processes and only recover once from a child failure. The same script is a paltry 51 lines of code including empty and comment lines. Meanwhile the data file is now well over 1GB.
That’s a testament to 1) Python’s concision, 2) Linux/UNIX’s simple clean design, 3) Twitter’s clean streaming API and 4) curl’s robustness. I’ve just glued a bunch of really solid parts together to good effect.
Now that I’m on the downhill, I’m starting to think of datastores to hold, index, and query the data with. MongoDB seems like it might be the best fit of the NoSQL camp since the Twitter streaming API pumps data out in JSON. MongoDB also seems to have a nice query language, to have indexing for time along with geography, and to be production ready at scale.
Let’s see, even numbered week. Oh that must mean VICTORY! It was big scoring, for a bye week, but not a big victory. Then again, I was essentially a man down, what with Minnesota as my DEF delivering a -1 on the scorebook. Yes, I would have been better off with an empty spot.
Still won with a comfortable 9 point spread. I’ll give myself a GM pat on the back for picking up the Giants’ Ahmad Bradshaw. Loving his three TDs. Aaron Rodgers met expectations while Buffalo’s Fred Jackson doubled up his projection. After that it was a bunch of mediocre performances.
My opponent must be wondering how he lost. He had the second highest score in the league this week. And the opposing team spotted him a point, and then effectively played 7 guys versus his 8. We were both expecting a track meet between Dallas and New England, but he was hurt worse when a defensive struggle broke out. On my side Miles Austin and Jason Witten were solid, but on the other side Tom Brady and Wes Welker were significantly down. But enough of him, I’m enjoying a win.
Holding down the playoffs in fourth position. Now can I put together back to back victories?
Obmention: William Gibson’s Setup. Not all that interesting to be honest.
Attended the Big Data DC meetup this evening. The meeting was surprisingly conveniently located for me and I enjoyed both speakers, even though I only had surface knowledge of the night’s topic. The presentations were focused on the Cassandra NoSQL data store. We got an overview of the fresh off the press Cassandra 1.0 release, and some insights into the CQL query language for Cassandra.
Didn’t mind the dive into infrastructure, but I’m more of a straight-up data and analytics guy. Looking forward to more Big Data DC meetups though.
Actually, I’ve got over 335K tweets and counting captured on this collection. There were two keys to making progress over my last effort. First, I moved my collector to a personal Linode virtual private server (VPS). As opposed to my ancient home Linux workstation, the VPS can stay up and on the network for days and weeks at a time. No outages due to brief power shutdowns or arbitrary Verizon glitches. Second, my script actually recovered from a network hiccup that killed its child curl process. Picked up the right error code, forked off a new curl, and kept on trucking.
The back of the envelope rate of collection is about 60K+ tweets per 24 hours. At this rate, it’ll take me about another week and a half to hit the million mark. I’m not holding my breath, but glad to see a significant advance over the last milestone.
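The envelope math, spelled out as a tiny sketch. The figures are the ones quoted above; the function name is made up for illustration.

```python
def days_to_target(collected, target, per_day):
    """Days of collecting still needed at a steady daily rate."""
    return (target - collected) / per_day


# 335K collected so far, 1 million target, roughly 60K tweets per day.
remaining = days_to_target(collected=335_000, target=1_000_000, per_day=60_000)
print(f"about {remaining:.0f} days to go")
```

That works out to roughly 11 days, which squares with "about another week and a half".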
P.S. After a few months, can thoroughly recommend Linode’s services.
Belay that MacBook Air lust. My poor little white MacBook seemed to hit a severe thrashing wall. (Load avg 23!!) A little inspection and it turns out a bunch of Chrome processes were taking up all the CPU time. I could have gone tab killing, but I had a hunch about another culprit.
A little less than a year ago, John Gruber dumped Flash. I’ve always sympathized with that approach. My laptop’s fan starting up seems to correlate with playing Flash content. Now I suspected Flash was the real villain in my MacBook slowdown. What to do?
Turns out you can configure Chrome to not autoload Flash content. Subsequently you click on demand for playback. Works wonders.
Still got the same number of tabs open in Chrome, but now my machine is qualitatively much more useful.
Hat tip to Chris Kasten
Fingers crossed, someday I’ll meet my target of collecting 1 million geotagged tweets. So then what? Well I’m not sure, but I’m guessing some geospatial based analysis might be fun.
Enter SimpleGeo. I don’t quite know what their business model is, but I like their products/APIs:
- Context: gives you contextual information about a place.
- Places: allows you to search for further Points of Interest (POIs) near a given location
- Storage: lets you store, index, and query geospatially tagged data
Plus SimpleGeo’s pricing seems pretty reasonable for a hobbyist tinkerer like me. Not to mention they make the PolyMaps library, which seems mighty handy.
Intermittently over the past few years, I’ve been writing various command line apps in Python. Things like split, but with a little more complexity and in a higher level language.
I’ve often felt I wasn’t quite writing these in a Pythonic fashion. I’d stashed away Guido van Rossum’s BDFL blessed idiom for main, but never really put it to use.
Steve Lott has a much better and simpler style of writing Python main functions, that seems more easy to ingest and adopt. He even includes a nice example of how to tie option parsing into logging.
Now it would be great if someone could come up with a cookbook for using the argparse module.
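In the meantime, here is a minimal sketch of one common way to arrange a main function with argparse and logging tied together. It is not necessarily Lott's or Guido's exact style, just an illustration; the tool name, the paths argument, and the helper names are hypothetical.

```python
import argparse
import logging
import sys

log = logging.getLogger("chunker")  # hypothetical tool name


def parse_args(argv):
    """Turn a list of command line words into a namespace of options."""
    parser = argparse.ArgumentParser(description="Hypothetical file tool.")
    parser.add_argument("paths", nargs="+", help="input files")
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="repeat for more detail (-v, -vv)")
    return parser.parse_args(argv)


def log_level(verbosity):
    """Map the number of -v flags to a logging level."""
    return {0: logging.WARNING, 1: logging.INFO}.get(verbosity, logging.DEBUG)


def main(argv=None):
    """Parse options, configure logging, do the work, return an exit code."""
    args = parse_args(sys.argv[1:] if argv is None else argv)
    logging.basicConfig(stream=sys.stderr, level=log_level(args.verbose))
    for path in args.paths:
        log.info("processing %s", path)  # the real work would go here
    return 0
```

A script would typically end with the usual guard that calls sys.exit(main()). Passing argv explicitly keeps main easy to call from a test or an interactive session.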
Need to chop a 7 million line file into 1 million line chunks? There’s a UNIX command line utility for that: split.
Does what it says on the tin and does it quite well.
In the spirit of Dennis Ritchie. Godspeed kind sir.
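For comparison, here is a rough Python sketch of the same line-chunking job that split does with its -l option. It is only an illustration: the function name and the numbered chunk_000 style of output names are made up, whereas the real utility defaults to names like xaa and xab.

```python
def split_lines(path, lines_per_chunk, prefix="chunk_"):
    """Copy `path` into consecutive files of at most `lines_per_chunk`
    lines each and return the list of chunk file names."""
    names = []
    dst = None
    with open(path, "rb") as src:
        for number, line in enumerate(src):
            if number % lines_per_chunk == 0:
                # Start a new chunk file every `lines_per_chunk` lines.
                if dst:
                    dst.close()
                name = f"{prefix}{number // lines_per_chunk:03d}"
                dst = open(name, "wb")
                names.append(name)
            dst.write(line)
    if dst:
        dst.close()
    return names
```

It streams one line at a time, so a 7 million line file never has to fit in memory.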
Sigh. The beat goes on. Lose, win big, lose, win big, lose. And of course Julio Jones adds to the injury streak, going down with, you guessed it, a bad hammy.
About the best that can be said for this past week is that I had my optimal lineup in. Of course my bench was fully parked with Must Start level guys on bye. Makes filling out the lineup card easy.
Did have a little hope, holding a lead going into Monday Night Football. But when the opponent has Matthew Stafford and Calvin Johnson… it was over mercifully quick.
Still 9 games left, and only a game out of the last playoff spot. A-Listers back in the lineup in a big way. Every streak starts with one win.
Link parkin’: GeoNames
The GeoNames geographical database covers all countries and contains over eight million placenames that are available for download free of charge.
To: Mark Zuckerberg
From: C. Ross Jam
Re: “You Have Notifications Pending”
Zuck dude, all I know is your unsubscribe link better work. Because I skipped out on those activities on purpose. Remind me again and I may dump you wholesale.
That is all.
I’m in violent agreement with Jason Snell’s thoughts regarding Tim McCarver’s recent illness:
I wish Mr. McCarver a speedy recovery and a healthy, long life. I also wish him a lengthy and joyous retirement.
By which I mean, Tim, this broadcasting game. It’s too stressful. You have just had a “minor” heart procedure—but how minor can anything regarding your heart be?
…So Tim. We’re your pals. Quit this broadcasting racket. …
Might be the best thing I’ve seen on American McCarver yet.
Despite my little tantrum, turns out one round of my tweet harvesting experiment spanned the evening of Oct 1st to the afternoon of Oct 6th. Geofocused on New York City and San Francisco, there’s a good chance I caught a solid burst of interesting tweets.
Specialization is for insects. —Robert A. Heinlein in Time Enough For Love.
I really enjoyed Ted Leung’s recap of the Surge 2011 conference. I don’t know squat about scalability matters, the purview of the DevOps community, but there was one quote from Leung’s recap that really struck me:
specialization is an industrial age notion and needs to be discounted in spaces where we operate at the boundary of the known versus unknown.
This was from Ben Fried, Google’s CIO, commenting on how his organization’s structure failed to solve a problem with a very complex system. Turns out no one could really understand the system from a global, end-to-end perspective, which was the key to solving the problem.
The quote really resonated, because on the day job I’ve been working on a team trying to strategically address a particularly thorny innovation challenge. Many of the technology components are known, and pairwise compositions well understood. But the unknown solution feels like it needs to emerge from a global, end-to-end perspective. If you agree with Fried, and I think I do, we’ll need some generalists to rise to the task.
And of course our company is chock full of stovepiped specialists.
Today, I declare myself The Most Unsentimental Man In The World ™. Think Different my friends.
Moment of silence for Rev. Fred Shuttlesworth though. Godspeed kind sir.
About a week and a half ago I got possessed with the loony notion to collect 1 million tweets. I’ve worked with the Twitter streaming API in the past, so thought it would simply be a matter of keeping the connection up for an extended period of time.
I started with the simplest possible thing that might work, directly from the Twitter streaming API documentation:
curl -d @locations https://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password.
First couple of cracks didn’t quite work. As documented, the HTTP connection is subject to various disruptions, so I wrapped the curl invocation in a Python script that would re-execute as needed.
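The actual wrapper script isn't reproduced here, but the re-execute-as-needed idea can be sketched along these lines. This is a minimal illustration under stated assumptions: the function name, the pause between restarts, and the max_restarts knob (there so the loop can be bounded when experimenting) are all hypothetical.

```python
import subprocess
import time


def run_forever(command, output_path, max_restarts=None, pause=5.0):
    """Run `command`, appending its stdout to `output_path`, and start it
    again whenever it exits. Returns the exit codes seen, oldest first.

    With max_restarts=None the loop never ends on its own.
    """
    exit_codes = []
    while max_restarts is None or len(exit_codes) <= max_restarts:
        with open(output_path, "ab") as out:
            # Blocks until the child exits, then hands back its exit code.
            code = subprocess.call(command, stdout=out)
        exit_codes.append(code)
        time.sleep(pause)  # brief pause so a hard failure doesn't spin
    return exit_codes
```

Something like run_forever(["curl", "-d", "@locations", "https://stream.twitter.com/1/statuses/filter.json", "-uAnyTwitterUser:Password"], "tweets.json") would then keep the collector alive, with tweets.json being a made-up output file name.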
The script has now been running for about 5 days. I’ve got about 647 MB of data totaling north of 250,000 tweets. I can’t tell if the script has needed to do a restart, but I’m impressed it’s lived this long.
And 250K tweets, with metadata, should make for some interesting analysis.
A maxim to live by, from yours truly:
There’s no need to sit in a parked car, listening to sports talk radio. It’s just sports blather. They’ll make more.
Occasionally I have to catch myself on this one.
All I have to say is, I’m glad Aaron Rodgers is on my fantasy football team. With the scoring rules in my league, he dropped 51 fantasy points, far and away the top scorer in the league.
Which was pretty handy, seeing as how the opposing fantasy QB was Cam Newton, who wasn’t too shabby at 30 points himself. Plus my opponent also played Greg Jennings for 19 points, Ray Rice for 20, and Darren McFadden for 14.
Besides Rodgers, the key support on my team came from WRs Santana Moss for 12 points, Julio Jones for 18, Fred Jackson for 18, and Jason Witten for 19. Between Rodgers and Witten, I cruised to an easy 26 point lead, although I had to wait out a potential Ray Rice explosion.
The league only has 8 teams in it, so there’s plenty of talent to go around. The key is getting one or two really outstanding performances, which is a combination of matchups and luck.
At this point I’m 2-2, fighting for the 4th playoff spot. Still plenty of games to go, but I have to put a little streak together. I think I’ve got a viable core, but I’m always scuffling on a spot or two. At least this week, I don’t think anyone got injured.
The NFL is heading into its stretch of bye weeks, which puts the bench and the waiver wire at a premium. I think I’m configured for success during these challenging weeks, but we’ll see.
Update: Title modified for the correct team in 2011. Yeesh! I’m slipping.
Update 2: Oops! Forgot about Rashard Mendenhall pulling a hammy, and possibly missing next week. Then again, he was due for a benching anyway.
Ben Lorica runs the @bigdata Twitter stream, which routinely provides good big data related nuggets. I don’t actually follow the stream because he occasionally summarizes the best tweets on his blog.
Speaking of PostgreSQL, it’s also worth mentioning PostGIS, which is a fabulous geospatial extension for PostgreSQL. Nelson Minar recommends the book PostGIS in Action to supplement the PostGIS documentation. As an example, PostGIS is improving in the area of doing nearest neighbor searches.
Alternatively, one can try SpatiaLite which embeds GIS capabilities in the self-contained SQLite RDBMS. However, I found SpatiaLite had a lot of moving parts and didn’t get a lot of mileage out of it.
Sideways observation: Pinboard doesn’t make individual item URLs obviously visible in its HTML. A little digging in an RSS feed reveals their existence. Turns out, on the person’s web page, the dates are the permalinks for the bookmarks. Useful for giving credit.
Been noodling around with PostgreSQL at work. Great relational database system. Glad I got a chance to get back to it. Gets even better with age.
For the longest time, I’d never really used stored procedures in an RDBMS because the languages to write the procedures were unappealing. However, PostgreSQL has embedded scripting languages, including Python, as procedural languages. Makes things a lot more fun.
One caveat: it was pretty tricky to convince the PostgreSQL build process on MacOS Lion not to use the system Python. That’s an important point if you don’t have root privileges, since being able to install modules comes in handy. I’ll have to look up the mods and post them since it could save someone a little bit of time.
Well the Fantasy Football gods giveth, and the Fantasy Football gods taketh away. The Rangers got beat by a little under 48 points. I was saved from the worst beating award by someone who got shellacked for a 60 point loss.
This is the third week in a row I’ve had a player incur an injury that put them out the next week. I lucked out in that Steven Jackson and Miles Austin had decent and great games. Kenny Britt wasn’t so lucky, done for the season with no production.
I’m patting myself on the back for picking up Buffalo’s Fred Jackson off the waiver wire. Only he and Aaron Rodgers scored in double digits for me. Meanwhile the opposing team had five 15+ point scorers, including Jermichael Finley going for 30. Note that it’s pretty difficult in fantasy football to gain ground when your quarterback is throwing TDs to the other guy’s tight end.
Biggest disappointment was Rashard Mendenhall, who couldn’t do anything against the Colts. The Pittsburgh running game is giving me the willies.
Anyhew, back to the waiver wire. Gotta scrounge another receiver.
Link parkin’: DJ Patil, who built LinkedIn’s data team from the ground up, expounds on Building Data Science Teams
I don’t follow too many linkblogs these days, but Nelson Minar established a Pinboard account to capture his URLs of note. He claims it’s better than his blog. Don’t know if that’s true but in NetNewsWire I’ve been flagging a lot of links he’s been pointing to. Plenty of good, solid, diverse hacking content. Time to give the stream a little publicity.
I’m sort of binary on Woody Allen films. If he’s in the film, I can’t stand it. If he’s not I generally enjoy the flick: Match Point, Cassandra’s Dream, Vicky Cristina Barcelona, Celebrity. The only exception has been Manhattan, but then that’s a very exceptional film. His typical nebbish take drives me batshit.
I made the mistake of watching Anything Else, even though I knew Allen had a role, principally because I like looking at Christina Ricci. It was doubly horrible because Jason Biggs filled the Allen lead role to accurate effect, meaning you had twice the nebbishness. Yeesh!
Glad someone else is in violent agreement, calling it Woody Allen’s worst film. Oddly enough, Tarantino apparently loved Anything Else. No accounting for taste.
Okay, for all I know they probably don’t have the biggest catalog, but HDNet Movies has made a big impression on me over the last couple of weeks. First it was just the delight of seeing Salma Hayek in After the Sunset. Then it was being reintroduced to The Rundown.
Did I mention the movies are uninterrupted (c.f. the suckage of IFC)? Unmodified? Except for the occasional, tastefully displayed HDNet Movies logo. Letterboxed, to boot, so no pan and scan.
But the closer is that somebody in the programming department thought today would be a great day for a Quentin Tarantino double bill: Pulp Fiction and True Romance. Yeah, I know Tony Scott directed True Romance, but Tarantino wrote the script and Scott retained most of Tarantino’s crime, violence, and dialogue oeuvre. It’s essentially a Tarantino film.
Brilliant. Two of my favorite films of all time, and Pulp Fiction is in my top 3.
Set DVRs to stun.
Grrrr. My trusty MacBook is feeling a bit pokey for the first time ever. I may have reached the limits of its resources what with a bazillion tabs open in Chrome, a few in Firefox, and a handful of other apps running. Plus, I’m using Spaces.
Maybe some tab clearance will help out, but whenever the next edition of MacBooks or MacBook Airs come out, I will seriously be lusting.
I realized this weekend that whatever success I have in fantasy football is based upon the following principle: during the currently playing weekend, I start planning for the next weekend.
By halfway through the first game on a Sunday afternoon, I’m already scanning the waiver wire for promising mid-week pickups. In leagues where you can still make afternoon free agent claims, I’ve jumped on key players after early injuries.
A second, minor philosophy is to consider the real world matchups, but not overthink my lineup. Don’t mess with players on bad teams (c.f. Indianapolis Colts 2011). Studs always play.
Didn’t manage to follow up on the week 1 quick hit, but the key to that week was subbing out about 20+ points that could have made the difference. Picked the wrong kicker. Picked the wrong DEF. Stuck with Rashard Mendenhall against the Ravens. Then dumped Ahmad Bradshaw, with a decent matchup, thinking I might be able to make a late afternoon waiver move. Oops! All free agents hit the waiver wire at the start of Sunday, so nobody can make moves during the day. Know your league’s rules son!
Week 2 was a much happier affair. I had some plush match ups, and actually made some good GM moves.
- Figured Miles Austin was going to have a big day against SF. Didn’t know he was going to be the top WR for our league this week.
- Made the brilliant move of picking up and starting Buffalo’s Fred Jackson. Finished in the top 5 and only 0.8 points out of the top RB scoring position.
- Rob Bironas kicked his way back into form, and Houston’s defense did a decent job against Miami.
- Stuck with regulars Aaron Rodgers, Jason Witten, Rashard Mendenhall, and Santana Moss. Everybody delivered double digit fantasy points, with Moss surprisingly totaling 14.
With everyone on my team scoring double digits, I’ve cruised to a 40+ point victory. The only downside is that Austin looks like he’s going to be on the shelf for a few games. That’s okay though, there’s a couple of emerging WRs, with good matchups, I’ve got my eye on. Been working hard to make up for my draft day bye blunders. I’m feeling that waiver wire magic again, revamping my team week to week.
The Rundown popped up on HDNet this past week. A little over 8 years since its release, the movie has held up surprisingly well. Actually, it shouldn’t be all that surprising. The Rundown is a pretty basic action film, not overly enhanced with sci-fi techno elements, or CG special effects. Fights, double-crosses, chases, captures. Hero wins in the end.
Herewith some of the reasons I enjoyed watching it yet again:
- Dwayne “The Rock” Johnson actually evinced a little acting depth.
- The director, Peter Berg, tightly paced the flick with no fat. Comes in at a clean, efficient hour and a half. Plus there’s actually a distinctive visual style to the whole affair.
- Christopher Walken doing a great, batshit exploitative warlord.
- Rosario Dawson is, as always, easy on the eyes.
- The film is actually funny in quite a few spots. Go away monkey!! Seann William Scott’s scruffy ass actually adds a solid comedic lift.
- The kick ass opening sequence. Heck, all the fight sequences. Well done and not your run of the mill punch out.
You could do quite a bit worse than The Rundown on a slow, weekend afternoon.
Update. Forgot about the Arnie cameo.
Well that didn’t last too long. I just locked onto the Stathead flow back in June. While I found the signal to noise ratio wasn’t very high, I got a solid nugget or two.
Of course it takes a lot of human effort to filter the stats oriented portion of the sportsosphere. And the humans have just given up.
As recently as 2005, the NFL’s primetime games were quite often blowout duds. Especially late in the season when matchups that looked exciting in July fizzled because one or both teams had disintegrated. This eventually led to the flex system which allowed the networks to schedule more attractive late season games into the primetime slots.
I don’t know what deal the league made with the devil for the 2011 season, but the night games for the season opening were amazing. To wit:
- The last two Super Bowl champs, Green Bay and New Orleans, go toe-to-toe in a last play, goal line stand thriller. Rodgers and Brees come out firing on all cylinders. Plus there was a 108 yard kickoff return for a TD. That’s your season opener.
- The Sunday night game showcases America’s Team, Dallas, versus the New York Jets. In New York. On the tenth anniversary of 9/11. Home team goes on a dramatic 17 point 4th quarter scoring spree to pull out the comeback victory. Said spree includes a blocked punt for a TD.
- ESPN has a double header on Monday night, with a decidedly lukewarm tilt between New England and Miami to start off. Miami hangs in for about 3 quarters. The Fish were especially helped by a pick-6 off of Tom Brady, his first interception in 350+ attempts. Then the Pats explode to put Miami down, but along the way Wes Welker goes for a 99 yard TD reception and Brady totals 517 yards in the air.
- The second ESPN game is the Raiders vs Denver, two mediocre teams that hate each other. Per usual there’s a scrum after just about every play. While outplaying the Broncos, the Raiders can’t pull away. A 90 yard punt return sparks the Broncos in the second half, but the Raiders manage to hold on for the win. Did I mention that Sebastian Janikowski kicked a league record tying 63 yard field goal?
No real duds, lots of riveting action, and even some historical performances. Way to kick off the season NFL.
Taking some “vacation” time this week, and what am I doing? Uploading multi-gigabyte files to Amazon S3. So I’m a nerd.
Turns out though that uploading files greater than 5GB requires multipart uploading, which breaks the file up into chunks for better throughput and reliability. Client support for this part of the S3 API is not obvious.
Hat tip to Joe Miller for cataloging a few S3 clients that do support multipart uploads, including Cyberduck. Open source, free as in beer, cross platform, and eminently usable, Cyberduck is the Swiss Army knife of file uploading. Just for this capability I will be making a donation shortly.
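To make the chunking concrete, here is a small sketch of how a client might carve a big file into parts before uploading. It only plans byte ranges and talks to no service. The 5 MiB minimum part size and the 10,000 part cap are S3's documented limits; the function name and the 64 MiB default are assumptions for illustration, not what Cyberduck actually does.

```python
MIN_PART = 5 * 1024 ** 2   # S3's minimum size for every part but the last
MAX_PARTS = 10_000         # S3's cap on the number of parts per upload


def plan_parts(file_size, part_size=64 * 1024 ** 2):
    """Return (part_number, offset, length) tuples covering `file_size` bytes.

    Part numbers start at 1, matching how S3 numbers uploaded parts.
    """
    if part_size < MIN_PART:
        raise ValueError("part_size is below the 5 MiB minimum")
    parts = []
    offset = 0
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((len(parts) + 1, offset, length))
        offset += length
    if len(parts) > MAX_PARTS:
        raise ValueError("too many parts; use a larger part_size")
    return parts
```

Each tuple maps to one upload request, so a failed part can be retried on its own rather than restarting the whole multi-gigabyte transfer.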
I can’t speak for every working parent, but The Job(™) and The Kid(™) have definitely killed my outbound movie going experience. At home, you’re bombarded by film choices for your in-residence studio: Premium Cable, Pay Per View, Netflix, iTunes, Amazon Instant Watch, not to mention good old Blu-Ray and CDs for the old timers.
For the classic movie theatre experience? Overpriced, blockbustered up, and inconvenient? Not so much. Until today in the theater I’d seen exactly 1 movie in the last 24 months: Toy Story 3. (Surprisingly good for the third of a trio).
Every now and then though, one has to partake of “cinema” just to remember what it was like. Since I was taking a day off I caught Rise of The Planet of The Apes (surprisingly good for a reboot). Yeah, I gouged myself even on a matinee ($8.50?! Don’t ask about the popcorn and soda I got), but I was the only one in the viewing at Leesburg’s spanking new Cobb 12 Theater, which is quite nicely appointed. Think full service restaurant, full bar, spacious accommodations attached to your friendly neighborhood multiplex. No great shakes for the rest of the country (Evanston, IL had the same setup when I lived there over 5 years ago), but relatively new to good old Loudoun County.
The movie was good, not great. Didn’t feel stupider having seen it. No commercial ads, reasonable number of trailers. Nobody talked, cried, texted, or took a call during the screening. The seats were comfortable and the space clean. To my eyes the projectionist didn’t screw up. A 5 minute drive from my humble abode. What’s not to like? Other than the prices.
I’ll probably never get back to my grad school heyday of seeing multiple double bill classics at the Berkeley Theater in a week (The Godfather, and The Godfather, Part II back-to-back, $7, FTW! RIP Berkeley). But there’s still a little life in The Big Screen business.
Deeper analysis later in the week, but suffice it to say the Rangers went down already (no Monday night players left), and went down hard. Major disappointment from my running backs. While Steven Jackson managed to get in the end zone, he hurt himself along the way. Rashard Mendenhall just stunk.
Shot myself in the foot by reading the Yahoo projections and switching the Bears defense out and the Browns defense in.
And I gotta get more than 2 points out of my kicker.
Another Premiership Saturday has come and gone. After the first month we’re starting to get a little insight.
Manchester United is looking pretty good. Then again Chelsea started out with 6 straight wins last year. Not that the Blues completely collapsed, or that I’d bet against the Red Devils, but it’s a long season.
Chicharito is a bad dude.
Sheik Mansour is getting his money’s worth out of the Man City club. And Carlos Tevez just hit the pitch for his first match.
Maybe it’s just Fox Soccer Channel house style, but I don’t like how after halftime the announcers don’t reset who’s on the pitch.
Premier League games are a welcome replacement for the early NCAA talk shows. That’ll probably work well for the NFL yakfests as well. The matches are great because they move at a good pace, have no commercial interruptions, and finish within 2 hours.
Link parkin’: Warehouse-Scale Computing: Entering the Teenage Decade. Comes highly recommended. Warning though, the material is a 50 minute, 1 GB QuickTime file, password protected by an ACM account. You can Flash stream it as well.
Seems like real-time at global scale is the new frontier.
© 2008-2025 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.