Coming to you live from the new toy my wife got me for my birthday, a new, 3rd generation, iPad. Not quite sure what I’ll be primarily doing with it, but there is one must have …
Link parkin’: Nice passel of software engineering blogs collected by Rafe Colburn. Will have to add ’em all to the aggregator.
John Scalzi is one of my favorite authors. Recently visiting the DC metro area, he had a mental slip and lost track of his computer bag.
I’m glad to say the DC area came through, returning the bag, and full contents, to him. Good on ya’ DMV, there’s gotta be a karma bonus in there somewhere.
Back in 1988 when I was applying to graduate schools, one of my undergrad advisers suggested I check out the University of Washington. He called their Computer Science and Engineering department up and coming. I took his advice and sent in an application.
I got into 2 of the 3 places I applied, UDub and UC Berkeley, and visited both. I came away pretty impressed, especially because of some discussion time I had with Tom Anderson, who was “just” a doctoral student at the time. He gave me some great advice on my decision and career, some of which I managed to follow.
Cal was my eventual choice, primarily on rep and the fact that I had a number of friends on campus and in the area, and I’ll never regret that decision. From afar though, I’ve admired the UW CSE program as it’s closed in on the MIT/Berkeley/CMU/Stanford summit.
I haven’t kept up with graduate rankings since I exited my former life, but Jeff Heer and Daniela Rosner joining the UW faculty, not to mention attracting Carlos Guestrin, indicates their further push to reach the top. A key has been building productive relationships with Microsoft, Amazon, Google, and other locally based tech juggernauts.
Occasionally (but not often) makes me wonder “what if?”
Thanks to stumbling onto the Common Crawl blog, I’ve run across sites for Pete Warden and Mat Kelcey. Both look like interesting data hacking venues which update at a reasonable frequency. Particularly like Warden’s Five Short Links series. Subscribed. May even have to follow them on Twitter.
Link parkin’: A field guide to Twitter Platform objects
Like any ecosystem, the Twitter platform has a variety of flora and fauna. Use this field guide to better understand the most frequently observed wild objects.
Don’t know how long this has been in place, but now I don’t have to guess about various details of the Twitter API. At least as long as the docs stay in synch with the code.
How Web Maps Work: Does what it says on the tin. Nice overview.
Via Nelson Minar’s linkblog
I didn’t give Chelsea much of a chance to beat Bayern Munich in the UEFA Champions League Final, but they somehow managed to pull off the victory. Oddly they’ve captured a funky double also winning the FA Cup. Meanwhile, 6th place is the best The Blues could do in the Premiership.
Didier Drogba figured large, scoring an 88th minute goal to keep Chelsea alive, giving up a penalty kick that was eventually saved by Cech, and hitting the final and clinching PK. That’s a long way from the horrible injury Drogba suffered back in August.
Wes McKinney’s Python for Data Analysis is in Early Release from O’Reilly. If I could carve out an extended period of time to get some initial experience with pandas I’d grab it. Might have to do so anyway.
Amund Tveit attended Accel’s Big Data Conference and came away with some interesting takeaways. This spurred him to do a mental exercise on what it would take to store a year’s worth of Twitter’s tweets in RAM:
Keeping 1 year worth of tweets (including metadata) and (a crude) index of them in-memory is costly, but not too bad. I.e. 1.36 Million USD to keep 1 years worth of tweets (124 billion tweets) for 1 year in an (distributed) in-memory hashtable (or the same amount of tweets stored in the same hashtable for one day costs approximately 3732 USD).
Q: So, is it time to reconsider using hard drives and SSDs and consider going for RAM instead
A: yes, at least consider it and combine with Hadoop. …
Obviously $1.36 million is a heap o’ kale, but there are probably businesses with that scale of data and a competitive need for that speed of processing. It’s extreme now, but 15 years ago buying terabytes off the Costco shelf was unimaginable to most people.
Other nuggets that struck me between the two posts include the fact that Twitter sees 340 million tweets per day and that realtime processing is a hot topic. I personally had the feeling that “realtime at scale” is the new frontier but this is a shred of confirming evidence.
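As a quick sanity check on Tveit’s numbers, here is a back-of-the-envelope sketch in Python. The 340 million tweets per day and the $1.36 million annual price tag come from the posts; the 365-day year and the variable names are my own assumptions.

```python
# Back-of-the-envelope check of the tweets-in-RAM figures quoted above.
TWEETS_PER_DAY = 340_000_000   # figure quoted from the conference notes
ANNUAL_COST_USD = 1_360_000    # quoted (rounded) cost of a year in RAM
DAYS_PER_YEAR = 365            # assumption: a non-leap year

tweets_per_year = TWEETS_PER_DAY * DAYS_PER_YEAR
cost_per_day = ANNUAL_COST_USD / DAYS_PER_YEAR
cost_per_million_tweets = ANNUAL_COST_USD / (tweets_per_year / 1_000_000)

print(f"tweets per year: {tweets_per_year / 1e9:.1f} billion")
print(f"cost per day: ${cost_per_day:,.0f}")
print(f"cost per million tweets: ${cost_per_million_tweets:.2f}")
```

That lands on roughly 124 billion tweets, matching the quote, and about $3,726 per day, a hair under the quoted $3,732, presumably because the $1.36 million figure is itself rounded.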
Don’t know about that “combine with Hadoop” comment though. The more I find out about Hadoop the less impressed I am.
Ever since I stopped my personal Twitter data collection project, I’ve been mentally casting about for a new dataset to build my Mad Data Skillz ™ (Boyeeeee!). Obviously, just restarting the tweet inflow is an option, but something involving more scale with less work would be nice.
Enter Common Crawl, a non-profit making a large crawl of 5 billion Web pages publicly and freely available on Amazon S3. How juicy! A big dataset conveniently located within the premier, openly available, utility computing infrastructure in the world. Definitely has potential to put the Skillz to the test. Common Crawl even has a convenient series of blog posts instructing one on how to process their page repository.
Machine Learning for Hackers vs Machine Learning in Action to be precise. Two books. One topic. Different languages. John D. Cook compares and contrasts:
Both books are about the same size and cover many of the same topics. One difference between the two books is choice of programming language: ML for Hackers uses R for its examples, ML in Action uses Python.
I was somewhat interested in ML for Hackers since I’m familiar with and admire Drew Conway’s online writings. The use of Python better aligns ML in Action with my interests though.
Daniel Greenfeld compares and contrasts Django toolkits for creating REST APIs. I’m interested in other perspectives on this topic as I’ve talked about tastypie before and actually put it in practice at work.
The on-site comments for the post and those over at HackerNews are useful as well. In particular, I have to agree with a few folks that tastypie is pretty good for fairly standard Django models, but gets a little tricky for non-ORM or search based resources. For me, dealing with object dehydration and response URIs was a bit opaque. I may be hallucinating, but when I first got started with tastypie, this documentation node on the request/response cycle didn’t exist. Maybe it’ll clear up my confusion.
I’d still recommend tastypie, but for advanced uses prepare to spend some time digging into module source code and doing a lot of experimentation to get the right results. As you’d expect!
Gee, guess I called that one. If I didn’t know better I would have thought Manchester City executed their highwire snatching of defeat from the jaws of victory then victory from the jaws of defeat, just to explicitly taunt their Mancunian neighbors. Of course they might have killed a fan or two of their own in the process. A more fitting cherry on top of the Premiership season would have been harder to script, what with The Citizens going into stoppage time to score two goals and rescue their campaign from ignominy. Time had literally run out on the season before they came back from the dead yet again.
Unfortunately, I didn’t get to see the match live as I was out and about for Mother’s Day. Driving through DC and constantly checking the iPhone for updates is not particularly safe, I can confirm. I’m hoping ESPN goes Instant Classic with the broadcast or there’s an on-demand recording available.
Link parkin’: the Nova Makers group is right around the corner in Reston, Virginia. Looks like they even have a hacker space, Nova Labs, to support a wide variety of activities.
The NOVA Makers meetup is dedicated to creating and supporting a community of makers in Northern Virginia.
Link parkin’: Datavisualization.ch selected tools
Datavisualization.ch Selected Tools is a collection of tools that we, the people behind Datavisualization.ch, work with on a daily basis and recommend warmly. This is not a list of everything out there, but instead a thoughtfully curated selection of our favourite tools that will make your life easier creating meaningful and beautiful data visualizations.
Hat tip Chris Diehl via Twitter.
So the end of the 2011-2012 Barclays Premier League campaign is upon us tomorrow. There are pretty much three races. Man City vs Man U for the title. Arsenal, Tottenham, and Newcastle United for the third and fourth positions, with at least one making the next Champions League. And finally Bolton Wanderers and Queens Park Rangers are trying to avoid relegation.
I like how the Premiership schedules their last weekend. Everyone plays and every game starts at the same time. No team sits in their locker room, rooting for an outcome. If you need a result, all you can do is your part.
Still got it in my gut that something wacky is going to happen at the top. Maybe Rangers pull a draw against The Citizens or the Red Devils actually lose to clinch it for Man City anyway. It would only be fitting if all three races were still in doubt going into the second half of the matches.
Cool! The Commonwealth of Virginia is going to be running an apps competition later this year. Longitudinal data regarding education will be the source fuel. The competition window, about a month starting in early August, leaves enough time for a part time hacker to crank out something interesting, even if they’re not interested in launching a startup.
With a bit of a break in the work storm last night, I tapped into two first round NBA Eastern Conference playoff series: ’76ers vs Bulls and Hawks vs Celtics. Both were elimination games so I was hoping for some high drama at the end.
As The Basketball Jones points out, both series ended in horrible thuds.
The Bulls, having been a smart, consistently hard working, high executing team, wasted a great defensive effort on bone-headedness. Sorry C. J. Watson, but that was just the wrong play.
The Hawks were just robbed. The officials messed up on both Joe Johnson’s drive and the in bounds play. Then the Hawks went all Hawks on us, acting stupid and choking on the line. They couldn’t even get off of a long range heave for three to try and tie it. Typical Hawks.
High drama indeed.
Ob Wiz. Please Lakers. Knock JaVale McGee out of the playoffs. However, I do admit it might be fun to continue the playoffs with the Lakers out and the Clippers in.
PyCon 2013 is going to be right back where it was in 2012, Santa Clara, California. I’m assuming the Santa Clara Convention Center again. Yeah!!
Sign me up! And I promise this time not to get sick, to deny any work requests, and to be a more active participant.
On my iPhone, took one of my infrequent trips into the “Random Hip-Hop” playlist for some listening pleasure. Shuffle landed me on Marley Marl’s “The Symphony, Part 1”. The beat is legendary but jeez, Big Daddy Kane brings it as the last Cold Chillin’ rapper with arguably one of the greatest raps of all time.
… And battlin’ me is hazardous to your health So put a quarter in your ass, cause ya played yourself Like a game in the arcade. You need a far aid I'm walkin’ the path that Allah made I’ll attend and then begin to send a speech to reach and teach So just say when So I can let lyrics blast like a bullet My mouth is the gun; on suckers I pull it The trigger, ya figure, my pockets gettin’ bigger Cause when it comes to money, yo, Grant's my nigga! ...
And words on the screen can’t even begin to do justice to Kane’s enunciation and delivery. Classic.
Ob moment of silence for MCA (Adam Yauch). After odes to Heavy D, Guru, and Malcolm McLaren, I’m giving up obits in this space though.
“Scarcity brings clarity.” Boy is that ringing true for me with a big crunch at work, a couple of holidays coming up, and my wife soon out of town for a week. When time is scarce you really start to prioritize.
I’ll sleep when I’m dead.
I recently tapped into the MapBox blog and they announced TileMill 0.9.1:
We just released TileMill 0.9.1, which adds support for PostGIS 2.0, runs on the latest Node.js 0.6.17 release, and provides packages for the latest Ubuntu Long Term Support (LTS) distribution: 12.04 (Precise Pangolin). TileMill 0.9.1 is the culmination of a several month sprint on stability, with over 80 tickets closed. The full list of fixes and advances for this release can be found in the changelog. Here are a few highlights.
TileMill sounds so cool but I really have no idea what you do with it other than “make maps”. Ah, here we go:
TileMill is an application for making beautiful maps. Whether you’re a journalist, web designer, researcher, or seasoned cartographer, TileMill is the design studio you need to create compelling, interactive maps.
My only question is whether it also eases the effort to serve your maps for web clients? After one has made their maps can you just point a browser at an obvious server and go to town?
Seems like something to learn. Could be another personal project.
Shout out to MapBox as a DC area concern.
Craig Kerstiens continues to catalog reasons to use PostgreSQL. Like the additions, although a quick and dirty test-drive of Multicorn failed miserably for me. Trying to build the sample application actually crashed the postgres db server, which is pretty tough to do. Probably some embedded Python and dynamic library badness, but still. Maybe I just need to go back and be a little more careful about my build.
Well I guess that the next generation MacBook Pro announcement I was hoping for didn’t really pan out. Haven’t heard a peep out of Apple about anything MacBook related recently, even though the Intel Ivy Bridge announcement happened a few weeks ago. Although as MacWorld points out, the ultrabook version of the processor, slated for release later in the year, would be more appropriate for MacBooks. And Apple typically doesn’t pre-announce stuff so the timing would be in line. If they announce it, you can buy it.
Still waiting patiently
So get this. Liverpool has already won the Carling Cup. Chelsea just beat Liverpool for the FA Cup. And The Blues could double through the Champions League finals, although I don’t give them much of a chance against Bayern Munich in Munich. Meanwhile, Manchester City is basically one game away, in which they are heavily favored, from beating out Manchester United for the Premier League title.
The weird thing is that Chelsea and Liverpool are definitely also-rans in the Premiership this campaign. Liverpool is in ninth place in the tables as I write this. If Chelsea doesn’t win the UEFA championship, they might not be in at all next year. Man City stunk it up in knockout play, European play, and were left for Premiership dead a month ago. One of their top players acts like a spoiled child and another took an extended golf vacation in the middle of the season.
I don’t quite know what to make of “underachievers” taking home so many trophies, but methinks they might be giving out a bit too much hardware in international football.
The premise of Michaelangelo Matos’ “How Chicago house got its groove back” might be a bit flawed, but I found it worth a read. Feels like Matos at least did quite a bit of interviewing and background research, including talking to folks like DJ Sneak, Derrick Carter, and Cajmere in depth. If accurate, it fills in some details of mid-90’s House music I wasn’t aware of.
The comments are somewhat illuminating as well, with Carter himself chiming in with some corrections and lamentations. Writing such a piece is always fraught, partially due to the obscurity of what’s trying to be covered making it hard to get the story right, partially because space limitations mean leaving out part of the story, and partially because there are always irate fans who know better.
Ob. disclosure. When Matos mentions Curtis A. Jones forsaking graduate school in Chemical Engineering, I was literally there with Cajmere at UC Berkeley. Part of a small cohort of black engineering students, we met at a College of Engineering function and started hitting the SF scene for parties. There are brushes with greatness, but I can definitely say, “I knew him when…”.
Hat tip, @CajualRecords
P.S. As predicted, DJ Sneak’s Fabric 62 didn’t do a whole lot for me. This is why I have a bit of a problem with the notion that Sneak somehow led a revival in Chicago House Music.
Interesting post by Young Hahn of MapBox on “Rendering The World”. The problem Hahn discusses is the rendering of map tiles at high zoom levels for the entire world. The obvious and straightforward way quickly becomes unscalable for the zoom levels MapBox wants to achieve due to exponential, recursive explosion.
Turns out the actual space of unique tiles, by content, is orders of magnitude smaller than the number of tiles needed a.k.a. there’s a high level of redundancy. For example, many tiles at any zoom level simply represent all blue patches of water. Capturing and exploiting this redundancy is the key to getting scalable performance.
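To make the redundancy argument concrete, here is a toy Python sketch. It assumes the usual web-map quadtree scheme, in which each zoom level has four times as many tiles as the one before (4^z tiles at zoom z), and it fakes rendering with a hypothetical render_tile function in which most of the world is water. None of this is MapBox’s actual code; it only illustrates the idea of storing tiles by content hash so identical tiles are kept once.

```python
import hashlib


def tile_count(zoom):
    """Number of tiles at a zoom level in a quadtree scheme: 4**zoom."""
    return 4 ** zoom


def render_tile(zoom, x, y):
    """Hypothetical stand-in renderer: a thin strip of 'land', the rest water."""
    n = 2 ** zoom
    if x < n // 8:  # pretend the leftmost eighth of the world is land
        return f"land:{zoom}/{x}/{y}".encode()  # every land tile is unique
    return b"water"  # every ocean tile has identical content


def render_level(zoom):
    """Render one full zoom level, storing each distinct tile image only once."""
    blobs = {}  # content hash -> tile bytes
    index = {}  # (zoom, x, y) -> content hash
    n = 2 ** zoom
    for x in range(n):
        for y in range(n):
            data = render_tile(zoom, x, y)
            key = hashlib.sha1(data).hexdigest()
            blobs.setdefault(key, data)
            index[(zoom, x, y)] = key
    return blobs, index


blobs, index = render_level(5)
print(tile_count(5), len(index), len(blobs))
print(tile_count(16))
```

At zoom 5 this toy world has 1,024 tiles but only 129 distinct images. The brute-force count at zoom 16 is already over four billion, which is why the naive approach stops scaling and the deduplication matters.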
This page had been sitting in my Chrome tabs for quite some time, but it was well worth the read once I got around to it.
Link parkin’: Data Journalism Handbook
This book is intended to be a useful resource for anyone who thinks that they might be interested in becoming a data journalist, or dabbling in data journalism.
Link parkin’: Postgres Guide
We here are very big fans of Postgres as a database and believe it is often the best database for the job. For many though, working with and maintaining Postgres involves a steep learning curve. This guide is designed as an aid for beginners and experienced users to find specific tips and explore tools available within Postgres.
Via Craig Kerstiens who outlines a number of reasons why you might actually want to use PostgreSQL. I heartily concur.
Holy smokes! How the ?!?!!*! did it get to be May already?
Looking at my monthly archives on the right over there tells me I’ve probably had my best posting run ever. Pretty much seven (7 wow!) months straight minus a singular brain cramp in early December.
That’s at least one post every day, including weekends and holidays. Through illness, while traveling, while my wife’s traveling, when work is bursting, and even when I think I just don’t have anything to say.
It’s been an interesting and worthwhile challenge to tackle, not to mention helping me keep one of my New Year’s resolutions. And there’s more yet to come. Not declaring victory just yet.
Now if I could only apply the same tenacity to a couple of my other resolutions. But it’ll come. I can feel an extended personal hacking run launching this summer.
Good to check back in on the resolutions. Some progress made, but more to do.
After my spoiler adventures with the UEFA Champions League, I learned my lesson and avoided sports and news outlets on my way home from work. Worked like a charm as my DVR recording of The Manchester Derby went off without a hitch.
I knew it was a big match but I was somewhat surprised at the level of hyperbole even for ESPN announcers. “The biggest match in the 20 year history of the Premiership!” Okay. If you say so.
It wasn’t an epic match in terms of play, thanks mostly to Sir Alex Ferguson going all conservative and playing guys like Scholes, Giggs, and Park ??!? How Valencia gets all of 15 minutes and Chicharito doesn’t see the field is beyond me. But the strategy did stifle the Man City creativity, and if it wasn’t for the Kompany header, the Red Devils might be on their way to yet another title.
Since I’m a bit of a Man U hater, I quite enjoyed the Citizens going to the top of the table with their 1-0 victory. Let’s see if they can hang on through these last two games.
atbr seems like an interesting approach to really large scale in-memory key/value stores, otherwise known as dictionaries, or dicts, in Python.
…atbr is basically a thin swig-wrapper around Google’s (memory efficient) opensource sparsehash (written in C++). Atbr also supports relatively efficient loading of tsv key value files (tab separated files) since loading mapreduce output data quickly is one of our main use cases.
While the authors seem a little more focused on Hadoop integration, I’ve got another interesting use case. NetworkX is a well developed Python module for graph representation, manipulation, and algorithms. The module uses Python’s built-in dicts as the primary data structure to represent these graphs. In my experience, NetworkX tends to fall over a bit with big graphs. Maybe using atbr as a replacement underneath NetworkX would improve both memory usage and execution speed. Yet another personal hacking project I could adopt.
Also of interest was some of the benchmarking that inspired atbr and demonstrated that Python dicts are actually pretty decent.
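The NetworkX idea above can be sketched in plain Python. The TinyGraph class below is a hypothetical stand-in, not NetworkX itself: it keeps its adjacency as a dict of dicts and takes the mapping type as a parameter. The thought is that a dict-like wrapper around atbr could be passed in place of dict. I haven’t checked what atbr’s actual interface looks like, or whether it can hold nested mappings at all, so that part is purely an assumption.

```python
class TinyGraph:
    """Toy undirected graph stored as a mapping of mappings."""

    def __init__(self, mapping_factory=dict):
        # mapping_factory is hypothetical plumbing: any callable returning a
        # dict-like object (item get/set, `in`, len, iteration) would do.
        self._factory = mapping_factory
        self._adj = mapping_factory()

    def add_edge(self, u, v, weight=1):
        # Record the edge in both directions since the graph is undirected.
        for a, b in ((u, v), (v, u)):
            if a not in self._adj:
                self._adj[a] = self._factory()
            self._adj[a][b] = weight

    def neighbors(self, node):
        return sorted(self._adj[node])

    def degree(self, node):
        return len(self._adj[node])


g = TinyGraph()
g.add_edge("a", "b")
g.add_edge("a", "c", weight=3)
print(g.neighbors("a"))
print(g.degree("b"))
```

Every node costs one inner mapping and every edge two entries, so the per-entry overhead of the mapping type is what dominates memory on big graphs. That’s the spot where a more compact hash table might pay off.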
Interesting post by Sean Gorman of GeoIQ, on “Just in Time Analytics” a.k.a. kanban analyses, especially in the context of Big Data:
The presentation was on the concept of how analysis can evolve to better take advantage of real time data streams. The community currently does lots of fascinating analysis of real time data from Twitter, mobiles devices, sensors etc., but it is inevitably a post mortem. By that I mean we do the analysis well after the event itself is over. If we think of a data stream as a living organism that is constantly changing we focus our analysis on the history that has already past.
The post summarizes a presentation (which I need to partake of) at the O’Reilly Where 2012 conference.
And just like that, the Bulls’ championship hopes swirl down the drain. I thought they couldn’t get past the Heat, but Derrick Rose blowing out his ACL seals the deal.
I wouldn’t be completely surprised if they make the Eastern Conference Finals though.
Link parkin’: Miso
Miso is an open source toolkit designed to expedite the creation of high-quality interactive storytelling and data visualisation content.
The first release under the Miso Project is Dataset, a JavaScript client-side data management and transformation library.
Link parkin’: cliff — Command Line Interface Formulation Framework
cliff is a framework for building command line programs. It uses plugins to define sub-commands, output formatters, and other extensions.
So I’m not a radical anti-spoiler type, but I did DVR the second leg of the Chelsea-Barca UEFA tilt this past Tuesday and on the way home was avoiding hearing the final outcome. So of course, being a dolt, I get home, open my feedreader, and click on the Sports folder, only to see at the top:
Fernando Torres Scores In Extra Time, Sends Chelsea To Champions League Final
Then I compounded the problem by actually reading the item from SportsGrid which revealed Chelsea playing with 10 men, Lionel Messi (GREATEST FOOTBALLER EVER ™, not) missing a penalty kick, and Chelsea scoring in extra time of the first half. Great! Why even bother to watch the recording now?
Well I did, and it was definitely worth it. First, Ramires’ goal at the end of the first half was sheer brilliance. Great run, perfectly placed ball. Second, not only did Messi miss a PK, he also hit the post on a near-miss that might have been helped by Petr Cech. Third, apparently no one told SportsGrid that down 2-1, Chelsea still would have advanced, tied on aggregate but with an away goal as the tiebreaker. So there was definite drama as FC Barcelona kept extended possession around the Chelsea box, probing and searching for that third goal. I half expected Barca to flop, errr luck into, another PK. The Torres goal sealed the deal but in no way did it “send Chelsea to Champions League Final”. The brilliant, disciplined, and a bit lucky, Chelsea defense did that.
So much for spoilers.
Now if only I’d not been listening to sports talk radio before the Bayern Munich-Real Madrid match…
Washington Wizards General Manager Ernie Grunfeld received a contract extension today.
Grumble.
Previously in these here parts, I claimed that five people needed to depart the organization: Andray Blatche, Flip Saunders, Nick Young, JaVale McGee, and … Ernie Grunfeld. Grunfeld has managed to at least clean up a little this year by getting rid of the aforementioned four, but I still really fear his draft prowess, or lack thereof.
As a Wizards fan, it’s difficult to see how this was in any sense “earned”. There’s a nine year track record of Ernie’s ability and while there was a nice post-season streak, some spectacularly bad personnel moves provide counterbalance. We’ll have to defer to owner privilege on this one, but I don’t know too many DC folks who aren’t scratching their heads.
I do have precedent for sustained loyalty in the face of silliness like this. After Jordan retired I adopted the Chicago Bulls along with oft embattled GM Jerry Krause. Krause managed to scorch the franchise to the ground (Kornel David era anybody?) and get some interesting pieces (Elton Brand, Tyson Chandler, Eddy Curry, Jay Williams, Ron Artest, Jamal Crawford) that never really panned out. Things didn’t really pick up for the Bulls until Krause was long gone and they lucked into Derrick Rose.
So I’m not gonna’ let Ernie sticking around bum me out. Maybe lightning will strike this year and their top pick will be a superstar (the first since Wes Unseld?). Please though Ted, get Ernie some draft evaluation assistance. And stay away from the European picks!!
Tim Bray drops science on tab management in Chrome and Safari. Read and learn grasshopper.
© 2008-2025 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.