
Data Journalism Handbook

Link parkin’: Data Journalism Handbook

This book is intended to be a useful resource for anyone who thinks that they might be interested in becoming a data journalist, or dabbling in data journalism.


Postgres Guide

Link parkin’: Postgres Guide

We here are very big fans of Postgres as a database and believe it is often the best database for the job. For many though, working with and maintaining Postgres involves a steep learning curve. This guide is designed as an aid for beginners and experienced users to find specific tips and explore tools available within Postgres.

Via Craig Kerstiens who outlines a number of reasons why you might actually want to use PostgreSQL. I heartily concur.


Best Run Evah!

Holy smokes! How the ?!?!!*! did it get to be May already?

Looking at my monthly archives on the right over there tells me I’ve probably had my best posting run ever. Pretty much seven (7 wow!) months straight minus a singular brain cramp in early December.

That’s at least one post every day, including weekends and holidays. Through illness, while traveling, while my wife’s traveling, when work is bursting, and even when I think I just don’t have anything to say.

It’s been an interesting and worthwhile challenge to tackle, not to mention helping me keep one of my New Year’s resolutions. And there’s more yet to come. Not declaring victory just yet.

Now if I could only apply the same tenacity to a couple of my other resolutions. But it’ll come. I can feel an extended personal hacking run launching this summer.

Good to check back in on the resolutions. Some progress made, but more to do.


No Spoiling The Derby

After my spoiler adventures with the UEFA Champions League, I learned my lesson and avoided sports and news outlets on my way home from work. Worked like a charm as my DVR recording of The Manchester Derby went off without a hitch.

I knew it was a big match but I was somewhat surprised at the level of hyperbole even for ESPN announcers. “The biggest match in the 20 year history of the Premiership!” Okay. If you say so.

It wasn’t an epic match in terms of play, thanks mostly to Sir Alex Ferguson going all conservative and playing guys like Scholes, Giggs, and Park ??!? How Valencia gets all of 15 minutes and Chicharito doesn’t see the field is beyond me. But the strategy did stifle the Man City creativity, and if it wasn’t for the Kompany header, the Red Devils might be on their way to yet another title.

Since I’m a bit of a Man U hater, I quite enjoyed the Citizens going to the top of the table with their 1-0 victory. Let’s see if they can hang on through these last two games.


Big Dicts

atbr seems like an interesting approach to really large scale in memory key/value store, otherwise known as dictionaries, or dicts, in Python.

…atbr is basically a thin swig-wrapper around Google’s (memory efficient) opensource sparsehash (written in C++). Atbr also supports relatively efficient loading of tsv key value files (tab separated files) since loading mapreduce output data quickly is one of our main use cases.

While the authors seem a little more focused on Hadoop integration, I’ve got another interesting use case. NetworkX is a well developed Python module for graph representation, manipulation, and algorithms. The module uses Python’s built-in dicts as the primary data structure to represent these graphs. In my experience, NetworkX tends to fall over a bit with big graphs. Maybe using atbr as a replacement underneath NetworkX would improve both memory usage and execution speed. Yet another personal hacking project I could adopt.
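To make the dict angle concrete, here's a minimal sketch of a graph held as a dict of dicts, in plain Python. The helper names (`add_edge`, `degree`) are made up for illustration and are not NetworkX's API, and nothing here touches atbr. The point is only that every node and every edge costs at least one dict entry, which is presumably where the memory goes on big graphs.

```python
# Toy adjacency structure: node -> {neighbor -> edge attributes}.
# Helper names are hypothetical, not taken from NetworkX or atbr.

def add_edge(graph, u, v, **attrs):
    """Add an undirected edge u-v, creating either node if missing."""
    graph.setdefault(u, {})[v] = attrs
    graph.setdefault(v, {})[u] = attrs


def degree(graph, node):
    """Number of neighbors recorded for node."""
    return len(graph[node])


graph = {}
add_edge(graph, "a", "b", weight=1.0)
add_edge(graph, "b", "c", weight=2.5)
```

Swapping the inner and outer dicts for a more compact mapping is the experiment I have in mind. Whether it actually helps is an open question.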

Also of interest was some of the benchmarking that inspired atbr and demonstrated that Python dicts are actually pretty decent.


kanban Analytics

Interesting post by Sean Gorman of GeoIQ, on “Just in Time Analytics” a.k.a. kanban analyses, especially in the context of Big Data:

The presentation was on the concept of how analysis can evolve to better take advantage of real time data streams. The community currently does lots of fascinating analysis of real time data from Twitter, mobiles devices, sensors etc., but it is inevitably a post mortem. By that I mean we do the analysis well after the event itself is over. If we think of a data stream as a living organism that is constantly changing we focus our analysis on the history that has already past.

The post summarizes a presentation (which I need to partake of) at the O’Reilly Where 2012 conference.


DRose Down

And just like that, the Bulls championship hopes swirl down the drain. I thought they couldn’t get past the Heat, but Derrick Rose blowing out his ACL seals the deal.

I wouldn’t be completely surprised if they make the Eastern Conference Finals though.


Miso

Link parkin’: Miso

Miso is an open source toolkit designed to expedite the creation of high-quality interactive storytelling and data visualisation content.

The first release under the Miso Project is Dataset, a JavaScript client-side data management and transformation library.

Via Flowing Data


cliff

Link parkin’: cliff — Command Line Interface Formulation Framework

cliff is a framework for building command line programs. It uses plugins to define sub-commands, output formatters, and other extensions.


Spoilage

So I’m not a radical anti-spoiler type, but I did DVR the second leg of the Chelsea-Barca UEFA tilt this past Tuesday and on the way home was avoiding hearing the final outcome. So of course, being a dolt, I get home, open my feedreader, and click on the Sports folder, only to see at the top:

Fernando Torres Scores In Extra Time, Sends Chelsea To Champions League Final

Then I compounded the problem by actually reading the item from SportsGrid which revealed Chelsea playing with 10 men, Lionel Messi (GREATEST FOOTBALLER EVER ™, not) missing a penalty kick, and Chelsea scoring in extra time of the first half. Great! Why even bother to watch the recording now?

Well I did, and it was definitely worth it. First, Ramires’ goal at the end of the first half was sheer brilliance. Great run, perfectly placed ball. Second, not only did Messi miss a PK, he also hit the post on a near-miss that might have been helped by Petr Cech. Third, apparently no one told SportsGrid that down 2-1, Chelsea still would have advanced, tied on aggregate but with an away goal as the tiebreaker. So there was definite drama as FC Barcelona kept extended possession around the Chelsea box, probing and searching for that third goal. I half expected Barca to flop, errr luck into, another PK. The Torres goal sealed the deal but in no way did it “send Chelsea to Champions League Final”. The brilliant, disciplined, and a bit lucky, Chelsea defense did that.

So much for spoilers.

Now if only I’d not been listening to sports talk radio before the Bayern Munich-Real Madrid match…


Ernie Will Be Back

Washington Wizards General Manager Ernie Grunfeld received a contract extension today.

Grumble.

Previously in these here parts, I claimed that five people needed to depart the organization: Andray Blatche, Flip Saunders, Nick Young, Javale McGee, and … Ernie Grunfeld. Grunfeld has managed to at least clean up a little this year by getting rid of the aforementioned four, but I still really fear his draft prowess, or lack thereof.

As a Wizards fan, it’s difficult to see how this was in any sense “earned”. There’s a nine year track record of Ernie’s ability and while there was a nice post-season streak, some spectacularly bad personnel moves provide counterbalance. We’ll have to defer to owner privilege on this one, but I don’t know too many DC folks who aren’t scratching their heads.

I do have precedent for sustained loyalty in the face of silliness like this. After Jordan retired I adopted the Chicago Bulls along with oft embattled GM Jerry Krause. Krause managed to scorch the franchise to the ground (Kornel David era anybody?) and get some interesting pieces (Elton Brand, Tyson Chandler, Eddy Curry, Jay Williams, Ron Artest, Jamal Crawford) that never really panned out. Things didn’t really pick up for the Bulls until Krause was long gone and they lucked into Derrick Rose.

So I’m not gonna’ let Ernie sticking around bum me out. Maybe lightning will strike this year and their top pick will be a superstar (the first since Wes Unseld?). Please though Ted, get Ernie some draft evaluation assistance. And stay away from the European picks!!


That’s A Nice Trick

Tim Bray drops science on tab management in Chrome and Safari. Read and learn grasshopper.


Manchester Derby, It’s On

So the Red Devils slipped up, letting Everton come back from 2 goals down to steal a point. At Old Trafford even. Meanwhile, Man City did their duty and picked up 3 points from Wolves. Next week, The Citizens face Manchester United in the Manchester Derby.

Man U. 3 clear with one big head-to-head to play against Man City. Definitely appointment TV.


D3 and Maps

A discussion on D3.js and mapping libraries, started by Nelson Minar, and with commentary from a few relatively well informed figures. For future reference.


Twitter Mining Recipes

Link parkin’: Matt Russell’s repo for his book “21 Recipes for Mining Twitter”

“This repository contains code for the 21 Recipes for Mining Twitter (O’Reilly, 2011.) As the name of the title suggests, it’s a short cookbook of recipes that’s designed to help you solve common problems when working with Twitter data. Some of the recipes are extracted from content presented in Mining the Social Web while others are completely new additions. In either case, they’re designed to be bite-sized and serve as the jetpack that you can strap onto that great Twitter mining idea you’ve been noodling on — whether it’s as simple as running some disposible scripts to crunch some numbers or as extensive as creating full-blown interactive web application.”


Pull Requesting

Rachel Nabors jumped into the deep end of the open source contributor pool and started making pull requests to fix issues. Of course in GitHub-land, it is not incumbent on the receiving end of the request to accept. Nabors didn’t have much luck. Seems like there are lots of prerequisites (check style guide, visit issue tracker, e-mail maintainers first) before you can realistically start firing fixes.

Part of the reason I’m becoming more interested in GitHub is that I’d someday like to be an open source contributor. But you’ve gotta know the tools. And clearly, as Nabors found out, the culture behind the commit log. Good to know.


Multiplexing

Rafe Colburn is just now getting into terminal multiplexing. All I can add is that I’ve been using GNU screen for a few years and it’s really been a boon to my development.

Usefully though, Colburn has a link detailing why tmux might be better than screen. The arguments are compelling. I might have to give the competition a test drive.


Python Text Munging

Link parkin’: csvfilter “a Python command-line tool for manipulating CSV data”

pyp “Pyp is a linux command line text manipulation tool similar to awk or sed, but which uses standard python string and list methods as well as custom functions evolved to generate fast results in an intense production environment.”

Pyp was even presented at PyCon 2012. Too bad it was when I was completely knocked out by the flu and near fainting in a parking lot. Luckily the video is on YouTube.

I could have used these a few weeks and months ago, when I was going hot and heavy on the Tweet processing. Probably still useful for the occasional text munging task.
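For my own future reference, here's roughly the kind of one-off munging I mean, sketched with nothing but standard string and list methods. This is not pyp or csvfilter syntax. The function name `pick_fields` and the sample lines are invented for illustration.

```python
def pick_fields(lines, indexes, sep="\t"):
    """Yield the chosen fields of each non-blank line, re-joined with sep."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(sep)
        # Silently ignore indexes past the end of a short line.
        yield sep.join(fields[i] for i in indexes if i < len(fields))


# Made-up sample input: user, count, text.
sample = ["alice\t42\thello world\n", "\n", "bob\t7\tgoodbye\n"]
picked = list(pick_fields(sample, [0, 2]))
```

Anything with quoted or embedded separators should go through the `csv` module instead of a bare `split`.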


TIL About CartoDB

Today I learned about CartoDB, thanks to a blog post, admittedly potentially biased, comparing Google’s Fusion Tables with the open source CartoDB. While it’s enticing for some work stuff, the GitHub repo looks like a big inhale with lots of sharp pointy bits. But cartodb.com provides a hosted version possibly worth kicking the tires on.


Whaddup Wigan?

So not only do The ’Letics disrupt the very top of the Premiership table by bumping off Manchester United, but then they back it up with a road win, downing The Gunners at The Emirates. This now puts 3rd place in play, bringing Arsenal back within range of Tottenham and Newcastle United. Chelsea remains right behind both, but at 2 more points back they don’t look like contenders for third. Thanks Wigan Athletic for putting some excitement into the end of the Premier League campaign.

Now off to watch my DVR of Bayern Munich and Real Madrid in The Champions League. I really wish there was a good option for getting the Bundesliga on Verizon FiOS.


TIL About dajax

Today I learned about the dajaxproject: django+ajax. Good to know about given the baseline knowledge needed for client side, errr, Front-End developers.

Thanks Kevin Veroneau. Looking forward to the tutorial.


Spike Lee’s Joints

There’s an inspired programmer over at HBO, who’s been scheduling a lot of Spike Lee’s oeuvre, especially some of the earlier studio films. Good to see lesser known works like Crooklyn, Clockers, and Jungle Fever get some airtime, along with the American classic, Do the Right Thing.


Check-Ins Dead?

Jon Mitchell of ReadWriteWeb makes a reasonable argument that the “check-in” as a location based concept is “dead”. Doesn’t look like consumers are adopting check-in applications at a rate that can lead to large scale commercial businesses. Fodder for thought, especially in contrast with Anil Dash’s effusive praise for Foursquare which is not all that stale.

I tend to concur with the latter part of the article that speculates that this failure is just part of an experimental, evolutionary process to find a model for continuous location sensing. What we’ve just learned is that continuous location announcement into our social circles isn’t particularly useful or attractive.

Now personalized capture and derived insights could be a winner. However, that’s a model hard to scale in a big way fast, which is what VCs these days are looking for.


Pure GeoIndexing

LazyWeb I beseech you.

I could use a pure geographic bounding box library or server that does for geoqueries what Sphinx does for full text search. Maybe a bit prickly, idiosyncratically extensible, horizontally scalable, and fast as hell by default. Throw in uniquely identified polygons, take queries for all intersections or all contains, return ids. That is all. No transformations or manipulations, just insert, delete, index and answer queries.

PostGIS is great, but the performance of the geographic indexing seems opaque and hard to optimize, at least to this simple soul.
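To pin down the interface I’m asking for, here’s a toy sketch in Python. It is a linear scan over axis-aligned bounding boxes only, so it is neither fast nor scalable, and it ignores real polygons entirely. The class and method names are invented for illustration. A real implementation would presumably sit on top of something like an R-tree.

```python
class BBoxIndex:
    """Toy index: id -> (min_x, min_y, max_x, max_y). Queries scan every box."""

    def __init__(self):
        self._boxes = {}

    def insert(self, obj_id, box):
        """Store (or overwrite) the bounding box for obj_id."""
        self._boxes[obj_id] = box

    def delete(self, obj_id):
        """Remove obj_id if present; do nothing otherwise."""
        self._boxes.pop(obj_id, None)

    def intersecting(self, query):
        """Ids of boxes that overlap (or touch) the query box."""
        qx0, qy0, qx1, qy1 = query
        return sorted(
            obj_id
            for obj_id, (x0, y0, x1, y1) in self._boxes.items()
            if x0 <= qx1 and qx0 <= x1 and y0 <= qy1 and qy0 <= y1
        )

    def contained_in(self, query):
        """Ids of boxes lying entirely inside the query box."""
        qx0, qy0, qx1, qy1 = query
        return sorted(
            obj_id
            for obj_id, (x0, y0, x1, y1) in self._boxes.items()
            if qx0 <= x0 and qy0 <= y0 and x1 <= qx1 and y1 <= qy1
        )


index = BBoxIndex()
index.insert("a", (0, 0, 2, 2))
index.insert("b", (1, 1, 5, 5))
index.insert("c", (10, 10, 11, 11))
hits = index.intersecting((0, 0, 3, 3))
inside = index.contained_in((0, 0, 3, 3))
```

Insert, delete, two query types, ids out. That’s the whole wish list. The hard part is making it fast.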


Good Flow

For whatever reason, my various knowledge serendipity flows (a.k.a. webfeeds, although not exclusively RSS) have had a burst of high quality items recently. Obviously I want to save some items for further blogging, but a number of them are looking like this quality post on “A Year with MongoDB” found via Hacker News. Obviously this is one anecdotal experience, but it somewhat confirms my experience with MongoDB. Interesting to start, painful in practice.

On a side note, I’m looking at some of the data storage numbers the HNers are mentioning in the comments and feeling a sense of pride that a project at work has a dataset easily comparable.

Yeehah! I get to hang out with the cool kids.


Greg Still Geeking

I’ve been a fan of Greg Linden for a loooong time. Heck, he even tipped me off to MapReduce back when it was just a moderately interesting distributed programming model and well before it was a hip ecosystem. Although now he’s much more intermittent, I still enjoy his posts which are much more link dumps, such as this, these days.

I suspect he’s adding more depth over at his G+ abode, but that’s One Social Network Too Many © for me.


Peak Opacity

Jamais Cascio does his usual great job of mentally exploring the future, contradicting the notion that “data is the new oil”. His reasoning is that data is massively increasing in quantity, especially in the face of corporate interests singularly focused on collecting as much personally identifying information as possible. Instead he projects that the ability to obscure inspection of oneself, a.k.a. opacity, is what’s rapidly becoming scarce. Opacity is even more analogous to oil in that it can be hazardous, a pollutant, and difficult to extract.

Cascio proposes three potential opacity regimes emerging in concert and conflict:

  • Top down regulation: slow moving and hard to get right
  • Bottom up protections: individually powerful but clearly against the interest of large powerful concerns: corporations and governments
  • Emergent disruption through pollution: hard to stop, hard to reverse, and chock full of unintended consequences

An interesting thought experiment.


The EPL Plot Thickens

So just when I was about to check out on the English Premier League, what with ManU up by 8 points with 6 matches left, the Red Devils spit the bit. In an interesting mid-week set of matches, Wigan Athletic beat Manchester United, 1-0. Meanwhile, Manchester City rolled on West Bromwich, 4-0, putting The Citizens 5 points back with 5 to play. Plus, there’s still another leg of the Manchester Derby at the Etihad, where City could conceivably pick up 3 points. Definitely appointment viewing for that match. One more toestub by ManU and things get really interesting.

Arsenal also continued their incredible revival, squashing Wolves 3-0. The Gunners are close to locking up 3rd and Wolverhampton close to locking up relegation.


NexGen MacBook Pros

ArsTechnica, which is fairly reputable, reports that 15” MacBook Pros seem to be in short supply:

“15” MacBook Pros are starting to become scarce among popular resellers, suggesting an Ivy Bridge update could be coming as soon as the end of April. Users hoping for updated 13” and possibly 17” models will likely have to wait until at least June, however.”

Touting an April 29th announcement for new 15-inchers is getting my hopes up. However, I’m not sure “slimmer … sans optical drive” definitely means MacBook Air slim, which is what I’m really fiending for.

And 8GB RAM.

And 512GB SSD for a reasonable price.

A man can dream can’t he.


Stupefyingly Bad

No, not the Washington Wizards. Yes, your 2011-2012 Charlotte Bobcats.

Tonight’s game against the Wizards was the first I’d seen of the Bobcats on an “extended basis”. In front of an empty Monday night house in Charlotte, the Wizards had a 30 point lead well into the fourth quarter. And it didn’t even feel that close.

There’s nothing I can really put my finger on in terms of why the Bobcats suck so bad. They don’t really have a bona-fide star other than Kemba Walker. Then again, the Wizards really only have John Wall.

But enough of those losers. Yes I know the Wizards aren’t really winning. The level of play has taken an uptick though. They’re not getting mocked on SportsCenter on a daily basis. Guys like Kevin Seraphin, James Singleton, and Cartier Martin are proving surprisingly serviceable, although on a good team they’d be on the end of the bench.

And at the end of the day, Jan Vesely is showing real signs he might pan out, which would be a coup for Ernie Grunfeld (still needs to go).


Loud Leadership

Ryan Tomayko’s take on his management style as Director of Engineering at GitHub is something to aspire to. Check it out and read it all. Choice quote for me:

“I actually don’t show people how to make decisions and ship product in any real direct way. There’s no How To Ship Product training class or anything like that. Instead, I just do work.”


Hooking GitHub

Tarek Ziadé notes a burgeoning trend based on distributed version control that I think is quite important:

“There’s a trend these days on Github-based online services. That is — point me your Github repo and I’ll do something with it everytime you push a change.”

DVCSs, among other things, fundamentally make completely explicit the act of creating a delta on a repository. The explicit commit presents all sorts of opportunities for automation over a code repository. As Tarek points out, it’s not exactly a new trend, but seems to be accelerating with the popularity of services such as GitHub.

His point about dashboards is intriguing. I’d extend it to whole ecosystems of repositories. Betcha GitHub has all sorts of interesting dashboards for their internal QA monitoring.
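As a rough sketch of what “do something on every push” might look like on the receiving end, here’s a Python function that boils a push notification down to a summary. The field names (`ref`, `commits`, `added`, `modified`, `removed`) are my assumption of what GitHub’s push payload carries, so check the hook documentation before trusting them. The function name and sample payload are invented.

```python
import json


def summarize_push(raw_body):
    """Reduce a push payload (JSON text) to a ref, commit count, and touched files."""
    payload = json.loads(raw_body)
    commits = payload.get("commits", [])
    touched = set()
    for commit in commits:
        # Assumed per-commit keys; missing keys are treated as empty lists.
        for key in ("added", "modified", "removed"):
            touched.update(commit.get(key, []))
    return {
        "ref": payload.get("ref"),
        "commit_count": len(commits),
        "files_touched": sorted(touched),
    }


# Hypothetical payload, trimmed to the assumed fields.
example_body = json.dumps({
    "ref": "refs/heads/master",
    "commits": [
        {"id": "abc123", "added": ["README"], "modified": [], "removed": []},
        {"id": "def456", "added": [], "modified": ["README", "setup.py"], "removed": []},
    ],
})
summary = summarize_push(example_body)
```

Wire something like this behind an HTTP endpoint and every explicit commit becomes an event you can test, build, or chart.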


DJ Sneak : Fabric 62

Amazon’s e-mail notices finally brought me something useful. An alert that DJ Sneak’s Fabric 62 had been released. With digital download immediately available to boot.

Sneak occupies an odd place in my house music affections. I love, Love, LOVE, his cut Show Me the Way, off of The Polyester E.P.. It may be one of my favorite anthems ever. On the other hand, I’ve never really fallen for any of his mix cd’s. They’ve all been passable, and I got Fabric 62 just in case there’s a breakthrough, but none of them have ever been on continuous repeat for me. That’s been reserved for folks like Lil’ Louie Vega and Evol Intent.

After first listen, seems like Fabric 62 won’t hit heavy rotation. Lost my headphones and haven’t been able to give it a repeat, but we’ll give it a chance to breathe.


pandas PyCon Tutorial

Link parkin’: Video recording of Wes McKinney’s tutorial on pandas at PyCon 2012. Hard to get a better source than the project’s lead developer.

Serendipitously picked up the full catalog of NextDayVideo’s PyCon 2012 recordings.


A Top Notch Sports Week

Was talking with a colleague at work and noted that April’s first week, Monday to Monday, might be the best week in sports:

  • NCAA Men’s Basketball Championship: Okay Kentucky was impressively the best team in the land, but I’m still waiting for this one to be vacated.
  • NCAA Women’s Basketball Championship: I don’t care who you are, 40 games in a row is spectacular. But Baylor also beat two #1 seeds in the Final Four including Notre Dame for the second time in the season. They also beat perennial blue bloods Tennessee (twice also!) and UConn, along with last year’s national champion Texas A&M (twice also!). Definitely not a creampuff schedule.
  • Major League Baseball Opening Day: We’ll ignore whatever the hell that was over in Japan. Frankly, I’m not sure baseball has opened until the Cincinnati Red Stockings have played.
  • The Masters: A tradition unlike any other. Can’t say I’m a huge golf fan (yet) but Tiger got me hooked on watching, especially on Sunday.
  • NHL Regular Season Finale: At least here in Washington, given the Caps situation this year, the NHL playoffs have already started.

The only other week that’s comparable is Saturday to Saturday of the last week of March, which includes the first day of the Final Four, but leaves out the last day of The Masters.


Milan v Barca 2

DVRed the second leg of the AC Milan versus FC Barcelona UEFA Champions League quarterfinals match. Can’t say I was captivated but did see Messi’s brilliance. That dude can accelerate with the ball, like nobody’s business. Two PKs weren’t thrilling but at least Milan put in a little spice by scaring Barca with a leveling goal.

Like I said though, bet on Barca.


PostGIS 2.0

After well over 2 years of development, there’s a new release, 2.0, of PostGIS. The old graybeard general wisdom was that one never took an x.0 release seriously as there were bound to be bugs. Still, I’m mildly intrigued as at work I have multiple millions of geolocated objects stored in PostGIS. There are a few queries that could use any help they can get. Then again, I really need to sit down, do some analysis and benchmarking, and really understand the distribution of my data.

Still, maybe a squeaky new PostGIS can help in the performance arena. Hopefully, the query selectivity has been improved at least.


AdoptedArt

Speaking of the remix project, I finally figured out the perfect name:

The AdoptedArt Project

TA DAH!! Adopted being the antonym of abandoned.

I’ve even gone ahead and snagged the domain name AbandonedArt.org to eventually provide a Web home for the effort. Nothing to see there as of yet though. Move along.


Slowly Getting Git

I’ve been trying to use, or more importantly absorb the ethos, of git off and on for a while now. It’s one thing to read about basic branching and merging in a book, and another to internalize an intuitive feel for how to put the facility to use.

Recently, between work and an initial start on the proposed AbandonedArt remix project, I’ve been getting a heavier dose of git usage. “Practice makes perfect,” and that’s definitely happening here. I’ve finally internalized that branches are coding excursions, you have to check out the branch you want to merge into, and then name the branch you want to merge in.
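To burn the order into my brain, here’s a small Python sketch that spells out the two git commands involved. The branch names are made up, and the helper names (`merge_commands`, `run_merge`) are mine, not anything git ships. Only `run_merge` actually shells out, and it assumes git is installed and you’re pointing at a real repository.

```python
import subprocess


def merge_commands(target, source):
    """Commands to merge `source` into `target`: check out the target first."""
    return [
        ["git", "checkout", target],  # stand on the branch receiving the changes
        ["git", "merge", source],     # then name the branch being merged in
    ]


def run_merge(target, source, repo_dir="."):
    """Run the commands in repo_dir, stopping at the first failure."""
    for cmd in merge_commands(target, source):
        subprocess.run(cmd, cwd=repo_dir, check=True)


# Hypothetical branch names; nothing is executed here.
commands = merge_commands("master", "excursion")
```

Merge conflicts will still stop `git merge` cold, which is exactly the part I’d like more coverage on.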

Now if I could only get a sustainable working model of remote repositories. I’m functional, but there are definitely useful bits I’ve missed and I still get hung up on a jagged edge here or there.

And I think resolving merge conflicts is an area where most git coverage could use some extended attention. I’m guessing conflicts are supposed to be rare but they pop up enough, and are tricky enough, that more detail would be helpful to this git apprentice.


ProGit

Link parkin’: Scott Chacon’s ProGit. A handy reference on the Web for using git. Help him out and buy a copy.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.