TIL EIN

Posted on: Fri 01 February 2013

I’ve been looking for a decent mashup of IPython and Emacs. Will definitely test drive EIN:

Emacs IPython Notebook (EIN) provides a IPython Notebook client and integrated REPL (like SLIME) in Emacs. While EIN makes notebook editing very powerful by allowing you to use any Emacs features, it also expose IPython features such as code evaluation, object inspection and code completion to the Emacs side. These features can be accessed anywhere in Emacs and improve Python code editing and reading in Emacs.

Via John D. Cook

A Mallet in Your Pig

Posted on: Fri 01 February 2013

Diving deeper into the Hadoop ecosystem, I’m starting to appreciate languages like Hive and Pig much more. The ability to extend each with UDFs really expands their possibilities.

Here’s an old, but still neat, hack from Jacob Perkins, The Data Chef. He’s folding the capabilities of Mallet, an open source topic modeling toolkit, into the Pig language:

I’m going to use Apache Pig and Mallet, a java based machine learning and natural language processing library to discover topics in the 20 newsgroups data set. This corpus is nice since each document already belongs to a newsgroup (a topic) and so it gives us a way of checking how well our topic discovery is doing. …

So it’s clear that we’re going to need a java udf to do the actual topic clustering. Right? This udf will operate on a DataBag of documents and return a DataBag containing the discovered topics.

Extensibility For The Win.

44

Posted on: Thu 31 January 2013

Finally broke the 40-posts-in-a-month barrier, this being the 44th, so I thought it only appropriate to give a nod to the reelection and inauguration of the 44th President of the United States.

One thing I can say is, thank God all that political advertising has gone away for a while here in battleground Virginia. Still traumatized from the bombardment.

Wiz Logo

Posted on: Thu 31 January 2013

If nothing else, the Wizards have a cool logo this year. Good enough that I bought my first piece of sports logo apparel in years. A winter hat with this logo as a patch, for the rest of DC’s wintry days. Nice bonus that it’s not a walking brand advertisement to boot, shades of Cayce Pollard.

Indiepocalypse Now

Posted on: Thu 31 January 2013

Andy Baio thinks we’ve seen a phase change for creative endeavors:

Digital distribution subverted the monopolies held by physical distribution, bypassing distribution deals with record stores entirely, allowing artists to sell directly to fans. Social media and online music services changed the way people discover music, making the payola systems of MTV and radio airplay feel quaint. Production costs dropped dramatically as computers became more powerful and audio editing software got dirt cheap, along with new services for printing on demand. And, finally, Kickstarter and other crowdfunding platforms offset the financial risk to artists.

Most importantly, each new platform let artists find, communicate, and sell directly to their fans.

I want to believe, but whenever these transformational moments involve transition for mega-corporate concerns (if not outright downfall), I have this thought that there’s still an Empire Strikes Back chapter to be written.

Time To Learn Vagrant…

Posted on: Wed 30 January 2013

’Cuz Pete Warden said so:

I have fallen in love with Vagrant over the last year, it turns an entire logical computer as a single unit of software. In simple terms, you can easily set up, run, and maintain a virtual machine image with all the frameworks and data dependencies pre-installed. You can wipe it, copy it to a different system, branch it to run experimental changes, keep multiple versions around, easily share it with other people, and quickly deploy multiple copies when you need to scale up. It’s as revolutionary as the introduction of distributed source control systems, you’re suddenly free to innovate because mistakes can be painlessly rolled back, and you can collaborate other people without worrying that anything will be overwritten.

People Are Cluing In

Posted on: Wed 30 January 2013

The Basketball Jones discovers that our Wiz don’t completely suck:

Over the last few weeks, though, there’s been life, legitimate life out of DC. Not only did they win five straight at home, but they beat a handful of pretty good teams — the Thunder, Hawks and the previously mentioned Bulls on that Saturday-nighter — and absolutely ran a couple lousy ones off the court, destroying the Magic 120-91, and drubbing the Timberwolves 114-101 in a game that wasn’t even as close as the final score indicated. Even on their five-game West Coast road trip in between home stints, they went a respectable 2-3, beating two solid teams in the Nuggets and Blazers, and not losing any game by more than seven points. The improvement from the team that started 4-28 was evident, to say the least.

But as of this writing, at the end of the third quarter, Andrew’s ’76ers are in good position. Wiz with 63 lousy points.

Even worse, Nick Young is having a good revenge game, although it’s still “good bye, good riddance”.

Mining the News

Posted on: Tue 29 January 2013

#realtime alerts from @msftresearch: forecasting events of interest from a corpus of 22 yrs of news stories http://t.co/SGayXx5H
— Ben Lorica 罗瑞卡 (@bigdata) January 26, 2013

We describe and evaluate methods for learning to forecast forthcoming events of interest from a corpus containing 22 years of news stories. We consider the examples of identifying significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world. We provide details of methods and studies, including the automated extraction and generalization of sequences of events from news corpora and multiple web resources. We evaluate the predictive power of the approach on real-world events withheld from the system.

Mining the Web to Predict Future Events (PDF heads-up)

Reads cool, but for two things. First, they’re only mining scraped stories from The New York Times. Not quite The Web in my book. Okay, slight credit for exploiting Linked Data, but still.

Second, scraping The New York Times?! C’mon Redmond. You’re better than that. Ha, ha! Only serious!

Me 1 - Shark 1

Posted on: Tue 29 January 2013

Previously, UC Berkeley’s Shark had been giving me build fits. Finally tamed the savage beast. Had to resort to java -verbose:class to figure out where various classes and jars were coming from. Then a little surgery to put the appropriate Hadoop jars in the proper place and voilà, I can run the Shark examples against an HDFS repository.
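For the jar hunting, a small script can summarize where classes came from. This is only a sketch: it assumes the older JVM output shape of "[Loaded <class> from <source>]" lines, and the sample paths and jar names below are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape printed by older JVMs with -verbose:class:
#   [Loaded java.lang.Object from /usr/lib/jvm/jre/lib/rt.jar]
LOADED = re.compile(r"^\[Loaded (?P<cls>\S+) from (?P<src>.+)\]$")


def classes_by_source(lines):
    """Map each loaded class name to the jar or directory it came from."""
    origins = {}
    for line in lines:
        match = LOADED.match(line.strip())
        if match:
            origins[match.group("cls")] = match.group("src")
    return origins


def jar_counts(origins):
    """Count how many classes each source contributed."""
    return Counter(origins.values())


# Hypothetical sample output; these paths are illustrative only.
SAMPLE = """\
[Loaded java.lang.Object from /usr/lib/jvm/jre/lib/rt.jar]
[Loaded org.apache.hadoop.fs.FileSystem from file:/opt/shark/lib/hadoop-core-A.jar]
[Loaded org.apache.hadoop.conf.Configuration from file:/opt/shark/lib/hadoop-core-A.jar]
some unrelated log line
""".splitlines()

origins = classes_by_source(SAMPLE)
for source, count in jar_counts(origins).most_common():
    print(count, source)
```

Looking up one suspicious class in the resulting dictionary shows which jar actually won on the classpath, which is usually the question with a client library mismatch.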

With that out of the way, I’m now getting sucked into Apache Hive development. Yum, data crunching for fun and profit. I didn’t read too closely but I like what I perused of Programming Hive by Dean Wampler. Feels comprehensive and up to date.

Lurv’ Me Some SSD

Posted on: Mon 28 January 2013

The first step in resuscitating Ye Olde MacBook has taken place. I replaced its 320 GB spinning disk hard drive with a 500 GB version of the Samsung SSD pictured to the right.

The thing is way overkill, with a 6 Gb / sec transfer rate on a bus that can only do 1.5 Gb / sec tops. But lordy, has it made my measly old 4 GB RAM laptop feel usable. Recently it had been pretty neglected, grinding to a halt when I had both Chrome and Firefox open with a bunch of tabs. Now it feels great with nary a pause or spinning beach ball. A larger sample size is needed, but if this keeps up, the machine may rise again to prime blogging platform.
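To put those line rates into bytes, here is a rough conversion. It assumes the link uses 8b/10b encoding (10 bits on the wire per byte of payload), and it says nothing about what the drive itself can sustain.

```python
def payload_mb_per_s(line_rate_gbps, wire_bits_per_byte=10):
    """Convert a serial line rate in Gb/s to payload MB/s.

    Assumes 8b/10b encoding, i.e. 10 wire bits per payload byte.
    """
    return line_rate_gbps * 1e9 / wire_bits_per_byte / 1e6


for rate in (1.5, 6.0):
    print(f"{rate} Gb/s link -> about {payload_mb_per_s(rate):.0f} MB/s of payload")
```

Under that assumption the old bus tops out at a quarter of what the drive's interface could carry.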

Next up, I need to deconstruct the machine’s innards, replace the Apple-issued SuperDrive with an OptiBay, and put 1 TB of spinning platters inside. That should do the trick for at least a year or two.

Update. But I’m also feeling the lack of the Retina display. To be continued.

Pandas High Performance File Parsing

Posted on: Sun 27 January 2013

An oldie but goodie deep dive into improving pandas’ csv file parsing by Wes McKinney

TL;DR I’ve finally gotten around to building the high performance parser engine that pandas deserves. It hasn’t been released yet (it’s in a branch on GitHub) but will after I give it a month or so for any remaining buglets to shake out:

That was from way back in October 2012. Since we’re up to pandas 0.10.1 this feature should be pretty stable.

Wiz Bulls

Posted on: Sat 26 January 2013

The Wizards. They won.
— Wiz Win Last Night? (@DidTheWizWin) January 27, 2013

Surprising to me, the Wiz pulled away in the third quarter, stifling the Bulls’ offense for over eight minutes. They coasted the rest of the way, shutting down a few feeble surges. It should be noted that the Bulls were playing without Luol Deng.

Still need a larger sample size, but the Wiz have at least moved from under the barrel to the bottom of the barrel. And that’s within shouting distance of the playoffs in the Eastern Conference. Who’d a thunk?!

Maps In Tweets

Posted on: Fri 25 January 2013

MapBox has a new feature that lets you embed a map in a tweet. I just think this is dang cool! But as a studied Twitter observer, I’m also interested in the various media types that can be distributed through Twitter. Seems like there’s a research project or two in there.

MapBox maps are now fully integrated with Twitter. When you tweet a link to any map hosted on MapBox, it will show up inline on Twitter.com and in Twitter’s mobile apps. Your custom styles and markers will all show up as inline images with the title and description of your map. And when you click on them, they’ll take you to the full interactive map on MapBox.com.

Freedom To Tinker Predictions

Posted on: Thu 24 January 2013

Had to let ’em simmer for a bit, but Freedom To Tinker’s annual predictions are always a good read. If it all happens, it’s gonna be a bad year, but particularly in the skies:

Civilian versions of military UAVs, like the Predator, will gain broader approval for use in domestic airspace and will be rapidly adopted by the obvious government agencies (e.g. police departments, border patrol, USGS) as well as all manner of unexpected non-government applications (e.g. traffic reporting, aerial banner advertising). “Deconflicting” airspace will become a hot topic for discussion.

You have to search and scroll a bit, but the earlier years’ predictions often had a surprisingly prescient nugget. For example, there are certain bets regarding the CFAA that are quite on point given the recent Aaron Swartz tragedy.

And of course everyone has a clunker or two:

(28) Facebook will be sold for $4 billion and Mark Zuckerberg will step down as CEO.

That’s the price of prognostication.

Swimming With Shark

Posted on: Wed 23 January 2013

Oy! I had forgotten how painful the bleeding edge of academic research can be. Spent most of the day wrasslin’ with Shark from the Berkeley AMPLab, Go Bears! Shark = Spark + Hive. I’m trying to build the BigRAM (TM) chops and this seems to be a relatively straightforward way to do so.

Thanks to a version mismatch between HDFS client libraries, I got sucked into rebuilding most of the Shark stack. Managed to tame Scala, Hive, Spark, and sbt, but couldn’t eliminate the pesky incompatibility. Being behind a firewall while having lots of remote dependencies didn’t help things. I swear Java has too many different ways to define an HTTP proxy. Currently Shark 1 - Me 0.

Oh well. It’s off to jar hunting tomorrow.

More Evidence

Posted on: Tue 22 January 2013

In-memory data processing of massive data needs a catchy nickname. I hereby christen thee Big RAM (™). Probably won’t stick but maybe I can find some unsuspecting proposal to slap it into. Unfortunately it’s too close to “bigram” to be distinctively visible in search engines. Might be able to make some useful social media handles out of that.

In any event, herewith two more nuggets regarding the march of the RAM/SSD combo architecture. First a bit of a rant on SSDs, from the not disinterested Brian Bulkowski:

The economics of flash memory are staggering. If you’re not using SSD, you are doing it wrong.

Not quite true, but close. Some small applications fit entirely in memory – less than 100GB – great for in-memory solutions. There’s a place for rotational drives (HDD) in massive streaming analytics and petabytes of data. But for the vast space between, flash has become the only sensible option.

In my quick scan of the comments I didn’t see anyone unearthing blindingly fundamental errors, other than the reliance on Dell Enterprise SSD pricing as a baseline, but read with an appropriate grain of salt.

Then today, yet another announcement from Amazon Web Services:

Our new High Memory Cluster Eight Extra Large (cr1.8xlarge) instance type is designed to host applications that have a voracious need for compute power, memory, and network bandwidth such as in-memory databases, graph databases, and memory intensive HPC.

Here are the specs:

Two Intel E5-2670 processors running at 2.6 GHz with Intel Turbo Boost and NUMA support.

244 GiB of RAM.

Two 120 GB SSD for instance storage.

10 Gigabit networking with support for Cluster Placement Groups.

HVM virtualization only.

Support for EBS-backed AMIs only.

This is a real workhorse instance, with a total of 88 ECU (EC2 Compute Units). You can use it to run applications that are hungry for lots of memory and that can take advantage of 32 Hyperthreaded cores (16 per processor). We expect this instance type to be a great fit for in-memory analytics systems like SAP HANA and memory-hungry scientific problems such as genome assembly.

At $3.50 an hour, rent 6 for $21/hour, and get off on eight hours of 192 cores with an aggregate of roughly 1.4TB of RAM, 1.4TB of SSD, and $168 out of your pocket.

Now that would be a fun day at the office!
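The back-of-the-envelope numbers for that six-instance day can be checked with a few lines. The per-instance figures come straight from the AWS announcement quoted above; the function and variable names are just for illustration.

```python
# Per-instance figures for cr1.8xlarge, taken from the quoted announcement.
PRICE_PER_HOUR_USD = 3.50
HYPERTHREADED_CORES = 32
RAM_GIB = 244
SSD_GB = 2 * 120


def cluster_totals(instances, hours):
    """Aggregate cost and capacity for a rented cluster."""
    return {
        "cost_usd": PRICE_PER_HOUR_USD * instances * hours,
        "cores": HYPERTHREADED_CORES * instances,
        "ram_gib": RAM_GIB * instances,
        "ssd_gb": SSD_GB * instances,
    }


totals = cluster_totals(instances=6, hours=8)
for name, value in totals.items():
    print(name, value)
```

Six instances come to 1,464 GiB of RAM and 1,440 GB of SSD, so the aggregate lands comfortably past the 1TB mark on both counts.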

Timely Advice

Posted on: Mon 21 January 2013

From last year’s post on MLK’s Drum Major Instinct speech:

The drum major instinct is real. (Yes!) And you know what else it causes to happen? It often causes us to live above our means. (Make it plain!) It’s nothing but the drum major instinct. Do you ever see people buy cars that they can’t even begin to buy in terms of their income? (Amen!) [laughter] You’ve seen people riding around in Cadillacs and Chryslers who don’t earn enough to have a good T-Model Ford. (Make it plain!) But it feeds a repressed ego. …

The drum major instinct can lead to exclusivism in one’s thinking and can lead one to feel that because he has some training, he’s a little better than that person who doesn’t have it. Or because he has some economic security, that he’s a little better than that person who doesn’t have it. And that’s the uncontrolled, perverted use of the drum major instinct.

Still good advice.

Mesos Tech Talk

Posted on: Mon 21 January 2013

I need to watch this AirBnB Tech Talk on Apache Mesos:

Video: http://www.youtube.com/watch?v=Hal00g8o1iY

Benjamin Hindman is one of the creators of Mesos, a platform for building and running distributed systems. He has done research in the areas of programming languages and distributed systems as a graduate student at Berkeley, where he hopes to one day finish his PhD. These days he spends most of his time hacking on Mesos at Twitter where it is being used in production.

For some reason I’m really intrigued by the potential of Mesos in production. Go Bears!

SSD Projections

Posted on: Mon 21 January 2013

From the Ars Technica article: SSD predictions at CES: fewer OEMs, lower prices

Storage enthusiasts sitting on the edge of your seats for revolutionary SSD announcements out of this year’s CES can rest easy: there’s not anything mind-blowing coming up that you need to be worried about. Ars sat down today with both LSI/Sandforce and Samsung, and while both had plenty of neat stuff to talk about with regard to their current product line, neither had anything earthshaking to share. Like the headline says, this isn’t necessarily a bad thing: now’s an excellent time to buy an SSD if you don’t already have one, and the ever-present enthusiast fear of buying something that will soon be obsolete or out of date isn’t one that really applies for solid state disks.

Having seen at work the performance delta that SSDs can provide, I jumped on this bandwagon at home and bought a 500 GB Samsung 840. Grand total of $345. Also grabbed a Western Digital 1 TB drive for $80. I’m following my aforementioned plan of souping up Ye Olde MacBook to squeeze out another year or two as my personal laptop.

I don’t claim to be the sharpest tool in the shed, but this SSD trend has serious implications for advanced data analysis and processing.

O’Reilly In-Memory Report

Posted on: Sun 20 January 2013

Looking forward to this upcoming big data product from O’Reilly:

In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries on large, distributed data stores. Established technology companies have had interesting offerings, but what initially caught our attention were open source projects that started gaining traction last year.

Wiz Nuggets

Posted on: Sat 19 January 2013

WIZ GOT IT IN TONIGHT
— Wiz Win Last Night? (@DidTheWizWin) January 19, 2013

The Wiz managed to hang on against the Denver Nuggets for a win. Notably, the game was on the road against a decent team. The Wizards started letting it slip away, but guys like John Wall and Bradley Beal stepped up big in some late possessions.

This is progress.

BlinkDB

Posted on: Fri 18 January 2013

Cool! BlinkDB provides Interactive timescale querying of massive data through declaring how much error you’re willing to tolerate in the answer. Hive on top, Hadoop and/or Spark underneath.

Today’s web is predominantly data-driven. People increasingly depend on enormous amounts of data (spanning terabytes or even petabytes in size) to make intelligent business and personal decisions. Often the time it takes to make these decisions is critical. However, unfortunately, quickly analyzing large volumes of data poses significant challenges. For instance, scanning 1TB of data may take minutes, even when the data is spread across hundreds of machines and read in parallel. BlinkDB is a massively parallel, sampling-based approximate query engine for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make perfect decisions.
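The general idea of trading accuracy for speed can be sketched with a plain uniform sample. This is not BlinkDB's algorithm, whose sampling strategy is more sophisticated; it is a minimal illustration of answering an aggregate from a sample and reporting an error bound, using a normal approximation and made-up data.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a uniform random sample.

    Returns (estimate, margin), where margin is an approximate 95%
    confidence half-width under a normal approximation.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


def sample_size_for(margin, stdev_guess, z=1.96):
    """Rows to sample so the half-width is about `margin`."""
    return math.ceil((z * stdev_guess / margin) ** 2)


# Made-up "table": one million integer values.
values = range(1_000_000)
estimate, margin = approximate_mean(values, sample_size=10_000)
print(f"mean is about {estimate:.0f} +/- {margin:.0f}, from 1% of the rows")
print("rows needed for +/- 10000:", sample_size_for(10_000, stdev_guess=288_675))
```

The second function is the "declare your error tolerance" half of the bargain: a looser tolerance means fewer rows to touch, which is where the interactive response time comes from.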

Paper preprint available. Go Bears!!

Via @bigdata

BlinkDB: massively parallel, sampling-based approximate query engine for running interactive queries on #bigdata http://t.co/xCD2GwGQ
— Ben Lorica 罗瑞卡 (@bigdata) January 15, 2013

PgSQL Indexes

Posted on: Thu 17 January 2013

Two recent Postgres performance nuggets, both touching on indexes.

First, Craig Kerstiens talks about performance measurement with pg_stat, but lands on conditional indexes:

To further optimize this we would great a conditional OR composite index. A conditional would be where only current = true, where as the composite would index both values. A conditional is commonly more valuable when you have a smaller set of what the values may be, meanwhile the composite is when you have a high variability of values.

Next, the Instagram team illustrates some of how they’ve scaled with Postgres. They provide functional indexes as one of their tips:

On some of our tables, we need to index strings (for example, 64 character base64 tokens) that are quite long, and creating an index on those strings ends up duplicating a lot of data. For these, Postgres’ functional index feature can be very helpful:

Both very good to know.
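For a feel of the three index flavors mentioned above, here is a sketch using SQLite from the Python standard library as a stand-in. It assumes a reasonably recent SQLite with partial and expression index support. Postgres syntax and planner behavior differ in the details, and the table, column, and index names are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        owner_id INTEGER,
        current INTEGER,
        token TEXT
    );
    -- Conditional (partial) index: only rows where current is true.
    CREATE INDEX idx_owner_when_current ON accounts (owner_id) WHERE current = 1;
    -- Composite index: both columns, for every row.
    CREATE INDEX idx_owner_current ON accounts (owner_id, current);
    -- Functional (expression) index: a short prefix of a long string.
    CREATE INDEX idx_token_prefix ON accounts (substr(token, 1, 8));
""")

# Made-up rows: 64-character tokens whose first 8 characters are distinct.
rows = [(i % 100, int(i % 10 == 0), f"{i:08d}" + "x" * 56) for i in range(1000)]
conn.executemany(
    "INSERT INTO accounts (owner_id, current, token) VALUES (?, ?, ?)", rows
)


def lookup_by_token(conn, token):
    """Filter on the indexed prefix first, then confirm the full value."""
    return conn.execute(
        "SELECT id FROM accounts WHERE substr(token, 1, 8) = ? AND token = ?",
        (token[:8], token),
    ).fetchall()


index_names = {
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
}
print(sorted(index_names))
print(lookup_by_token(conn, f"{42:08d}" + "x" * 56))
```

The prefix index stores 8 characters per row instead of 64, which is the data duplication saving the Instagram tip is after. The partial index only carries the rows with current set, so it stays small when most rows are historical.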

TPM IdeaLab, Subscribed

Posted on: Wed 16 January 2013

Been meaning to put Talking Points Memo’s Idea Lab in the ole’ feedroll since Carl Franzen was one of the few to cover the most recent State of the Map:

The tensions facing the community were on full display at OpenStreetMap’s annual “State of the Map USA” conference in Portland, Oregon from October 13 through 14, a frenzied, jam-packed series of over 50 presentations and countless other informal talks between avid geographers and programmers who sprawled over a few generally overcrowded rooms at the Oregon Convention Center, fueled by coffee (beer at night) and their boundless enthusiasm for using and improving the vast and increasingly vital public map.

Subscribed!

Data Hacking Books

Posted on: Tue 15 January 2013

An interesting collection of books concerning “big data” and “data science”. I already have a couple in hand, but Russell Jurney’s Agile Data looks quite interesting.

Working with data is HARD. Let’s face it, you’re brave to even attempt it, let alone make it your everyday job.

Fortunately, some incredibly talented people have taken the time to compile and share their deep knowledge for you.

Here are 7 books we recommend for picking up some new skills in 2013:

Via Mortar Data

7 books to supercharge your data education http://t.co/X8sNm9Me
— Mortar Data (@mortardata) January 15, 2013

MF Urban Coyotes

Posted on: Mon 14 January 2013

Now I can stop complaining. My Mark Farina, Urban Coyotes CD is physically in hand. This is the first physical CD I’ve bought since Farina’s Mushroom Jazz 7 and I haven’t even downloaded much over the past year and a half. So I’m looking forward to getting something new in the mix. Don’t have time to rip it tonight, but it’ll be on the Jesusphone 5 posthaste.

20 Mile Marchin’

Posted on: Sun 13 January 2013

The Art of Manliness dug into the book Great by Choice, and pulled a common theme of successful organizations, “The 20 Mile March”:

Collins and Morten dubbed the slow and steady approach taken by Southwest and other 10X companies “The 20 Mile March.” They took this moniker from imagining a man determined to walk across the United States, and how he could accomplish his goal faster by committing to walking 20 miles every single day – rain or shine — rather than walking for 40-50 miles in good weather and then very few miles or not at all during inclement conditions.

My biggest achievements last year, collecting lots of Tweets and blogging continuously, inadvertently embraced The 20 Mile March ethos. I’ll be looking to build my 2013 resolutions with even more awareness of the components of this approach. Weight reduction/exercise and side-hacking are going to make repeat appearances this year. Let’s see if applying 20 Mile March techniques helps increase my level of success.

Wiz Hawks

Posted on: Sat 12 January 2013

John Wall solves all.
— Wiz Win Last Night? (@DidTheWizWin) January 13, 2013

I don’t know if John Wall is going to make that much of a difference overall, but his return definitely put a pep in the step of the Wiz. They actually rolled to a relatively easy victory over the Hawks.

Two in a row! Who’d a thunk the Wiz would have a winning streak this year?

Sad

Posted on: Sat 12 January 2013

Sad news from the campus newspaper of my alma mater:

Computer activist Aaron H. Swartz committed suicide in New York City yesterday, Jan. 11, according to his uncle, Michael Wolf, in a comment to The Tech. Swartz was 26.

…

The accomplished Swartz co-authored the now widely-used RSS 1.0 specification at age 14, founded Infogami which later merged with the popular social news site reddit, and completed a fellowship at Harvard’s Ethics Center Lab on Institutional Corruption. In 2010, he founded DemandProgress.org, a “campaign against the Internet censorship bills SOPA/PIPA.”

I didn’t know Swartz, but knew of him, from his precocious emergence as a teen in Chicago (when I lived there) to his various Web, Internet, and political activities. He was of that ultra-principled, uber-nerd ilk reminiscent of Richard Stallman, Philip Greenspun, Olin Shivers, Eric Raymond, and Mark Pilgrim. It’s one thing to be able to hack, but quite another to be able to translate that focus into the public sphere.

We should all be so lucky to have done as much as he did in the time that we have been given.

Diggin’ On GraphLab

Posted on: Fri 11 January 2013

What is GraphLab?

GraphLab is a graph-based, high performance, distributed computation framework written in C++. While GraphLab was originally developed for Machine Learning tasks, it has found great success at a broad range of other data-mining tasks; out-performing other abstractions by orders of magnitude.

In the process of experimenting with alternatives to Hadoop for work, I revisited GraphLab. It’s come a long way since its 1.0 release. A lot more usable and the performance is off the charts. Let me put it to you this way, I’m able to run graph algorithms on graphs with over 60 million nodes and 1 billion edges that complete in minutes. Often in the time it would take a Hadoop job to even get started.

And often on a 2 year old Dell Optiplex with 2 lousy cores and desktop grade I/O. Okay, it can’t do billions of edges fast, but an order of magnitude down is quite reasonable.

You can do a lot of damage with a decent amount of RAM and a handful of cheap, stock SSDs. The floor at which you really need to buy into distributed, horizontal scaling is much higher than a lot of people think.

pluggin’ away

Posted on: Thu 10 January 2013

As I get more experience building “command shells” in Python, I’ve become interested in making them easily extensible. In the same way that cliff is my go-to command line building toolkit, maybe stevedore can plug this hole:

Python makes loading code dynamically easy, allowing you to configure and extend your application by discovering and loading extensions (“plugins”) at runtime. Many applications implement their own library for doing this, using import or importlib. stevedore avoids creating yet another extension mechanism by building on top of setuptools entry points. The code for managing entry points tends to be repetitive, though, so stevedore provides manager classes for implementing common patterns for using dynamically loaded extensions.
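The entry point mechanism underneath can be illustrated with the standard library alone. This sketch does not use stevedore itself. It builds entry points by hand instead of discovering them from installed package metadata, and the group name, plugin names, and FormatterManager class are all hypothetical.

```python
from importlib.metadata import EntryPoint

# Hypothetical plugin declarations. In a real package these would live in
# the project's packaging metadata under a group name, not in code.
DECLARED = [
    EntryPoint(name="json", value="json:dumps", group="demo.formatters"),
    EntryPoint(name="repr", value="builtins:repr", group="demo.formatters"),
]


class FormatterManager:
    """Tiny stand-in for a plugin manager: loads entry points by name."""

    def __init__(self, entry_points):
        # load() imports the module and resolves the attribute after the colon.
        self._plugins = {ep.name: ep.load() for ep in entry_points}

    def names(self):
        return sorted(self._plugins)

    def format(self, name, obj):
        return self._plugins[name](obj)


manager = FormatterManager(DECLARED)
print(manager.names())
print(manager.format("json", {"a": 1}))
print(manager.format("repr", [1, 2]))
```

The repetitive part the quote mentions is everything around this: finding the declared entry points for a group, handling load failures, and picking one or many plugins, which is what a manager class wraps up.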

Common Crawl’s New URL Index

Posted on: Wed 09 January 2013

Scott Robertson put together an index of the URLs in the Common Crawl dataset, so everyone doesn’t have to trawl the whole contents looking for where the links are:

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).

Keeping with Common Crawl tradition we’re making the entire index available as a giant download. Fear not, there’s no need to rack up bandwidth bills downloading the entire thing. We’ve implemented it as a prefixed b-tree so you can access parts of it randomly from S3 using byte range requests. At the same time, you’re free to download the entire beast and work with it directly if you desire.

Information about the format, and samples of accessing it using python are available on github. Feel free to post questions in the issue tracker and wikis there.
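The byte range trick is ordinary HTTP. As a rough sketch, fetching one block of a large remote file means sending a Range header. The URL and block size below are placeholders, not the real index location or layout, and the request is built but never sent.

```python
from urllib.request import Request


def range_request(url, offset, length):
    """Build (but do not send) an HTTP GET for `length` bytes at `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    return Request(url, headers={"Range": f"bytes={offset}-{last_byte}"})


# Hypothetical values for illustration only.
INDEX_URL = "https://example.com/url-index.bin"
BLOCK_SIZE = 65536

request = range_request(INDEX_URL, offset=3 * BLOCK_SIZE, length=BLOCK_SIZE)
print(request.get_header("Range"))
```

Walking a tree stored this way is then a handful of such requests, one per block visited, instead of a download of the whole file.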

Combine with a previous hare-brained scheme to good effect.

blekko izik

Posted on: Tue 08 January 2013

blekko unleashed a tablet search application. Might be worth checking out. Good to know blekko is still competing hard and creatively in the search space. Maybe their zag will be to make search a lot more usable everywhere instead of gobbling up every search in the world.

Friends of blekko!

We are very pleased to announce izik, our new tablet search app. We launched izik on Friday, and today is it currently the #3 free reference app in the Apple app store.

We believe that the move to the tablet from the desktop/laptop is an environmental shift in how people consume web content. We have developed a search product that addresses the following unique problems of tablet search:

New Bitly APIs

Posted on: Tue 08 January 2013

I'm so excited about this: Announcing the Bitly Social Data APIs http://t.co/8w48MIRN
— Hilary Mason (@hmason) January 8, 2013

For whatever the reason, today was chock full of link fodder. Some of it I’ll be saving for later in the week, but Hilary Mason’s announcement of new, real-time, streaming APIs tickles my fancy.

We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.

Already scheming some data collection and analysis projects.

Wiz Thunder

Posted on: Tue 08 January 2013

Can you Beal it?
— Wiz Win Last Night? (@DidTheWizWin) January 8, 2013

I was there in person. Beal for 2 to take the lead @ 0.3 seconds in the fourth. Didn’t realize how undermanned the Wiz were. It was a bit more exciting than the BCS National Championship Game.

Python Hadooping

Posted on: Mon 07 January 2013

In my day job, I’ve been using Yelp’s mrjob framework to run a lot of Hadoop Map-Reduce jobs. Takes a lot of the Java pain away. So it was with quite a bit of interest that I dug into Uri Laserson’s A Guide to Python Frameworks for Hadoop:

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

His bottom line is that straight Hadoop Streaming with Python has the best performance. Meanwhile, mrjob is well maintained, active, and quite productive, but comes with a significant performance hit.
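For reference, the "straight Hadoop Streaming" style boils down to scripts that read lines and write tab-separated key/value lines, with a sort by key in between. Here is a minimal word count sketch of that shape, run locally on made-up input rather than on a cluster. The function names are my own, not part of any framework.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> sort -> reduce in one process."""
    return list(reducer(sorted(mapper(lines))))


result = run_locally(["the elephant and the snake", "The snake"])
for line in result:
    print(line)
```

On a real cluster the mapper and reducer would be separate scripts reading sys.stdin and printing to stdout, and Hadoop would do the sorting between them. Frameworks like mrjob wrap that plumbing, which is where both the convenience and the overhead come from.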

A small sample size, but worth reading to know all your options if you’re into mixing elephants and snakes.

A Decade of NFL Play-by-Play

Posted on: Sun 06 January 2013

This archive of NFL play-by-play data popped back up on Hacker News. More grist for the sports data hacking mill:

I’ve recently completed a project to compile publicly-available NFL play-by-play data. It took a while, but now it’s ready.

The resulting database comprises nearly all non-preseason games from the 2002 through [edit: 2012] seasons. I have not performed any analysis on the data, so what you’ll get are the only basics—time, down, distance, yard line, play description, and score. It’s almost exactly what I started with. I’ll leave any analysis up to you.

Radar Themes 2013

Posted on: Sat 05 January 2013

I get a lot of my good tech nuggets through the O’Reilly Radar blog. This year they’ve given a heads-up on their themes, and the process behind their selection, for 2013:

In our discussions we worked through about 35 potential themes and narrowed them down to 10 that we are actively working on in the first quarter of 2013. Here’s the list along with the Radar team member who is focused on it.

Topics like In-Memory Data Management, Consumer AI, and Future of Programming look really interesting, but I’m really curious to see how Ben Lorica and Nat Torkington tie in.

Scalable Message Queueing

Posted on: Fri 04 January 2013

At work I started digging into ActiveMQ, mostly because I wanted to whale on some folks who I thought had an unhealthy devotion to some mediocre, unscalable middleware. Then I stumbled upon the ActiveMQ Apollo project:

ActiveMQ Apollo is a faster, more reliable, easier to maintain messaging broker built from the foundations of the original ActiveMQ. It accomplishes this using a radically different threading and message dispatching architecture. Like ActiveMQ, Apollo is a multi-protocol broker and supports STOMP, Openwire, MQTT, SSL, and WebSockets.

If Apollo’s benchmarks prove out in actual deployment, holy smokes!! And a 1.x release implies a certain amount of stability to boot. I would like to know how it scales with a high number of concurrent subscribers, though.

And ActiveMQ isn’t all that mediocre although it has some clear scaling limitations.

Slow Data

Posted on: Fri 04 January 2013

Stephen Few is one of those useful contrarians who actually propose serious alternatives in addition to eloquently railing against the status quo. I actually believe there are some interesting technical trends and advances (e.g. newly accessible distributed programming) underlying the buzzword-compliant “Big Data” movement. But Few’s Slow Data Movement is worth considering:

I’d like to introduce a set of goals that should sit alongside the 3Vs to keep us on course as we struggle to enter the information age—an era that remains elusive. May I present to you the 3Ss: small, slow, and sure.