
Welcome Back NBA

With any luck, I will be unleashing this post while sitting in the Verizon Center during the Washington Wizards 2012-13 home opener against the Boston Celtics. After watching bits of a few games during opening week, here are a few observations:

  • The Wizards are still a bunch of piece parts, and I’m not sure John Wall will make all that much difference. At least they’re a bit more professional and likable than last year.
  • Based upon how close they played the San Antonio Spurs, the New Orleans Hornets might be quite surprising this year. They’ve got four rookies, Anthony Davis, Al-Farouq Aminu, Austin Rivers, and Darius Miller, who look like they can play. Then there’s a core of good but not great veterans, e.g. Ryan Anderson (Go Bears!), as support. Wouldn’t be surprised to see them challenge for the last playoff spot in the West.
  • The Los Angeles Clippers, yes Clippers, have a hell of a bench.
  • James Harden may go off for a while as the #1 guy for the Houston Rockets, but then the Association will get a book on him. Let’s see what happens then.
  • The Knicks are still stuck in neutral, and somehow they feel even older than last year.
  • Can’t wait for DRose to get back: [embed]http://www.youtube.com/watch?v=dtj-D8HT9BY&w=400&h=225[/embed]

Dadgum PostGIS!!

I’ve been enjoying the power of PostGIS at work, although it confounds me to no end. Given the amount of data I’m trying to query against, typically upwards of tens of millions of rows, I haven’t found writing efficient spatial queries to be straightforward. This week provided an opportunity to develop a hypothesis about why.

On one machine, I have my spatial DB on a traditional spinning hard disk. A query I wrote took about 6 hours. I’ve taken up the radical experiment (for our research org, we move hella fast ;-/) of putting the same DB on a consumer grade solid-state drive (SSD) to see what would happen. Query time dropped to about 10 minutes. My back of the envelope calculation shows a 36x improvement. Caveat this with the understanding that I have in no way conducted a scientific comparison. Apples to oranges and all that.
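
For the record, the envelope math as a tiny Python sketch. The two timings are just the rough figures quoted above, not careful measurements, and the variable names are made up for illustration.

```python
# Back-of-the-envelope speedup from the two rough timings quoted above.
# Both numbers are approximate; this is arithmetic, not a benchmark.
spinning_disk_minutes = 6 * 60   # about 6 hours on the spinning disk
ssd_minutes = 10                 # about 10 minutes on the SSD

speedup = spinning_disk_minutes / ssd_minutes
print(f"approximate speedup: {speedup:.0f}x")
```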

Still, the query went from doable to damn useful. Why? My guess is that spatial data and indexing are hard to lay out for good sequential access. Random disk seeks wind up being the order of the day. Thus, the advantages of SSDs really start to shine.

Just a hunch, but I really need to conduct some deeper investigation. And maybe attend some local geo meetups to commiserate with fellow travelers.


Wayne Enterprises Chronicles: Week 8

Victory! Back to back wins to cross over into the second half of the season. Moves me comfortably into second place in the league.

I got a lot of nice contributions from every position except TE. RGIII was bottled up by the Pittsburgh defense but still managed double figure fantasy points. Tony Gonzalez (Go Bears!) had his first stinker fantasy game with only 4.4 points.

The deficit against projection was made up at RB, starting Darren McFadden and Stevan Ridley, along with the WR spots of Miles Austin and Eric Decker. Ridley and Decker were the big wins, both being second tier players but coming in at +6. Add in my kicker, Sebastian Janikowski, at +6 as well and I actually came in +5 against my projection.

My opponent put me on pins and needles Monday night, with the San Francisco DEF as his last player. They would have needed to score 18 points to wreck my night, but it wasn’t out of the realm of possibility given the strength of the 49er D and the crappiness of the Arizona QB.

But it all played out well, and ended in a seven point victory.

On another note, just more evidence of the emergence of computational journalism. Yahoo! has outsourced automating game summaries for fantasy football matches (!!). And the results aren’t all that bad.

Computational journalism cribbed from Irfan Essa.


Real-Time and Big Data

[embed]https://twitter.com/lintool/statuses/263699888842346496[/embed]

Preprint paper from the guys at Twitter: “Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture”

We present the architecture behind Twitter’s real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time “twist”: after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of “big data”.

Via @lintool


C’Ya Sandy

Bye bye Hurricane Sandy! You got our little development here in Leesburg for two hours of power outage, and our townhouse for a little roof leakage, but otherwise you weren’t too bad. You didn’t even get after my Linode up in the Newark datacenter.

I was highly impressed with your wind gusts though.


A Bad Way To Go Out

I know I’m a bit late to the party, but poor Ozzie Guillen got let go in what may be an MLB first. The Marlins posted his firing on Twitter: [embed]https://twitter.com/Marlins/status/260829731627347968[/embed]

Yikes! Amongst Chicagoans, there will always be a soft spot for Oz though, thanks to that 2005 World Series. But I had a feeling it would end badly, from the day he signed on.

P.S. More evidence that it’s all just media now. No need to use that anachronistic “social media”.


Basic Common Crawl Processing

Pavel Repin copiously documents his initial foray into processing the Common Crawl data set:

At my company, we are building infrastructure that enables us to perform computations involving large bodies of text data.

To get familiar with the tech involved, I started with a simple experiment: using Common Crawl metadata corpus, count crawled URLs grouped by top level domain (TLD).

It’s not a very exciting query. To be blunt, it’s a pretty boring one. But that’s the point: this is a new ground for me, so I start simple, and capture what I’ve learned.

This gist is a fully-fledged Git repo with all the code necessary to run this query, so if you want to play with the code yourself, go ahead clone this thing.

Via Pete Warden
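
To make the "boring query" concrete, here is a minimal single-machine sketch in Python of counting URLs grouped by TLD. It is not Repin's code and says nothing about how his gist reads the Common Crawl metadata; the sample URLs and the function names `tld_of` and `count_by_tld` are made up for illustration, and "TLD" here naively means the last dot-separated label of the hostname (so an IP address host would be miscounted).

```python
from collections import Counter
from urllib.parse import urlsplit


def tld_of(url):
    """Return the last dot-separated label of the URL's host, or None."""
    host = urlsplit(url).hostname  # lowercased by urlsplit, port stripped
    if not host or "." not in host:
        return None
    return host.rsplit(".", 1)[-1]


def count_by_tld(urls):
    """Count URLs grouped by their (naively extracted) top level domain."""
    counts = Counter()
    for url in urls:
        tld = tld_of(url)
        if tld is not None:
            counts[tld] += 1
    return counts


# Made-up sample input, standing in for crawled URLs.
sample_urls = [
    "http://example.com/a",
    "https://www.example.org/b?x=1",
    "http://blog.example.com:8080/c",
    "http://example.co.uk/",
]
tld_counts = count_by_tld(sample_urls)
print(tld_counts.most_common())
```

The real job is the same group-and-count, just spread over a cluster and a much bigger pile of URLs.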


RESTful or Restless?

In my REST API expeditions at work, I’ve been using Flask-Restless. Now, via Python Weekly, I find out about Flask-RESTful. Normally I’d just scan and move on, but RESTful is from folks at Twilio and may have a bit more polish. To wit:

While Flask provides easy access to request data (i.e. querystring or POST form encoded data), it’s still a pain to validate form data. Flask-RESTful has built-in support for request data validation using a library similar to argparse.
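
To give a feel for what "similar to argparse" buys you, here is a hedged sketch that abuses the standard library's argparse itself to validate a dict of form values. This is not Flask-RESTful's API; the field names (`rate`, `name`) and the helpers `ValidationError`, `QuietParser`, and `validate` are all invented for illustration.

```python
import argparse


class ValidationError(Exception):
    """Raised instead of argparse's default behavior of exiting the process."""


class QuietParser(argparse.ArgumentParser):
    def error(self, message):
        # argparse normally prints usage and exits; raise so callers can recover.
        raise ValidationError(message)


parser = QuietParser(add_help=False, allow_abbrev=False)
parser.add_argument("--rate", type=int, required=True, help="rate must be an integer")
parser.add_argument("--name", type=str, default="anonymous")


def validate(form):
    """Validate a dict of form values by feeding it through the parser."""
    argv = []
    for key, value in form.items():
        argv += [f"--{key}", str(value)]
    return vars(parser.parse_args(argv))


print(validate({"rate": "5"}))
```

Declaring each field once, with a type and a required flag, and getting coerced values or a clean error back is the whole appeal.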

The only hitch I see is no examples of connecting with ORM based models, admittedly after only 10 minutes with the docs. Restless actually handles this use case pretty well.

Alternative approaches are always good to know about.

Also have to say thumbs up on Python Weekly. Once a week to my Inbox, an easy read, at least one good link.


jq JSON Processing

Link parkin’: jq: a lightweight and flexible command-line JSON processor.

jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text
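
For comparison, a small Python sketch of the kind of slice, filter, and map that a jq filter along the lines of `.[] | select(.active) | .name` expresses in one line. The JSON document and its field names are made up for illustration.

```python
import json

# Made-up input, standing in for whatever JSON you would pipe into jq.
raw = """
[
  {"name": "ada", "active": true, "logins": 3},
  {"name": "bob", "active": false, "logins": 0},
  {"name": "cy", "active": true, "logins": 5}
]
"""
records = json.loads(raw)

# Filter and map: names of the active records.
active_names = [r["name"] for r in records if r["active"]]

# Transform: reshape each record down to the fields of interest.
summaries = [{"who": r["name"], "n": r["logins"]} for r in records]

# Reduce: total logins across all records.
total_logins = sum(r["logins"] for r in records)

print(active_names, total_logins)
```

Perfectly doable, but that is a script to write and save, versus a one-liner at the shell prompt.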


bitmapist

Link parkin’: bitmapist

[embed]https://twitter.com/pypi/statuses/261516455638597632[/embed]

Implements a powerful analytics library using Redis bitmaps.

This library makes it possible to implement real-time, highly scalable analytics that can answer following questions:

  • Has user 123 been online today? This week? This month?
  • Has user 123 performed action “X”?
  • How many users have been active have this month? This hour?
  • How many unique users have performed action “X” this week?
  • How many % of users that were active last week are still active?
  • How many % of users that were active last month are still active this month?
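
The trick underneath is one bit per user id, per event, per time bucket. Here is a minimal sketch of that idea using plain Python integers as bitmaps, so it runs without Redis. It is not bitmapist's API; the names `mark`, `was_marked`, `count`, and `retained_percent` are invented, and the comments point at the Redis commands (SETBIT, GETBIT, BITCOUNT, BITOP) that play the same roles.

```python
# key (event, period) -> Python int used as a bitmap; bit i stands for user id i
bitmaps = {}


def mark(event, period, user_id):
    """Set the user's bit, roughly what Redis SETBIT does."""
    key = (event, period)
    bitmaps[key] = bitmaps.get(key, 0) | (1 << user_id)


def was_marked(event, period, user_id):
    """Test the user's bit, roughly what Redis GETBIT does."""
    return bool((bitmaps.get((event, period), 0) >> user_id) & 1)


def count(event, period):
    """Count set bits, roughly what Redis BITCOUNT does."""
    return bin(bitmaps.get((event, period), 0)).count("1")


def retained_percent(event, earlier, later):
    """Percent of users marked in `earlier` who are also marked in `later` (a BITOP AND)."""
    before = bitmaps.get((event, earlier), 0)
    both = before & bitmaps.get((event, later), 0)
    total = bin(before).count("1")
    return 100.0 * bin(both).count("1") / total if total else 0.0


# Made-up activity: which user ids were active in two consecutive weeks.
for uid in (1, 2, 3, 123):
    mark("active", "week-42", uid)
for uid in (2, 3, 7):
    mark("active", "week-43", uid)

print(was_marked("active", "week-42", 123), count("active", "week-42"))
print(retained_percent("active", "week-42", "week-43"))
```

Every question in the list above boils down to a bit test, a bit count, or an AND/OR of two bitmaps.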

Wayne Enterprises Chronicles: Week 7

Victory!! It was a trouncing, with an 80+ point margin of victory. The other guy mailed it in, starting three players that were on bye this week. Then he got nothing out of his tight end, leading to a grand total of 31 points.

Not much to dig into, but I’m glad to break the losing streak.


Mac Tweetbot

The Tapbots finally released their official for-pay version of Tweetbot for the Mac. The $20 price, driven by some of Twitter’s token limiting practices, induced some sticker shock. However, I didn’t think long before pulling the trigger.

Lex Friedman has a pretty good review of Tweetbot explaining why it’s worth the price:

Of course, this is a review of the excellent new Tweetbot for Mac (Mac App Store link), not a review of Twitter’s business practices. But I bring up the latter here because one of the effects of Twitter’s new developer restrictions—specifically, the finite limit on how many users a given Twitter client can support—is that developer Tapbots is charging more for the new app than originally planned. Specifically, Tweetbot for Mac will cost you $20, at a time when many similar apps can be found for $10 or less. Which means that for many readers, the question isn’t just whether Tweetbot is good, but whether it’s worth the price.

My answer: Yes.

Like Friedman, I’ve sprung for Tweetbot on MacOS, iPad, and iPhone.


Stoked and Chapped

Forgive me Saint Steven for I have sinned. As a recently realized Apple fanboy I failed to tune in for today’s announcements. My employer sent me to occupy a lonely (but very shiny!) booth in the bowels of a second rate hotel with an exceedingly lackluster conference. But I should have found A Way!

The only in-depth coverage I have been able to read so far, is Marco Arment’s assorted thoughts. They have me stoked:

Prior to this, the computer I recommended for nearly everyone was the 13” MacBook Air. But the new 13” Retina MacBook Pro is only about 0.6 pounds heavier, has much higher CPU, RAM, and storage options, and has the much nicer Retina screen. It commands a premium of about $500, which is significant, but you get much more for it.

That’s pretty much the machine I’ve been fiending for for the longest time. I’ll nitnoid about not being able to amp the DRAM to 16 GB, but that’s the only thing I could quibble about. However, between lack of funds, and dampened laptop usage due to life and other gear, now I may just hang on to Ye Olde MacBook indefinitely. Or at least there is clear provocation.

But I’m simultaneously chapped:

The timing of the update — just 6 months after the iPad 3, instead of the usual year — will anger a lot of iPad 3 owners. But the previous March releases of the iPad 2 and 3 were more problematic.

You mean The New Toy is already not the new hotness? Ya killin’ me.

At least I have my new Jesusphone 5 to keep me warm at night.


Mushroom Jazz Mixtapes

Just do as the man says.

[embed]https://twitter.com/djmarkfarina/statuses/260860717647921153[/embed]

[embed]https://twitter.com/OmRecords/statuses/260850819618979840[/embed]

Four classic Mushroom Jazz recordings, for free. What’s not to like?


Now With More Sparks

@bigdata continues to sing the praises of Spark.

[embed]https://twitter.com/bigdata/statuses/259710022173466625[/embed]

In an earlier post I listed a few reasons why I’ve come to embrace and use Spark. In particular I described why Spark is well-suited for many distributed Big Data Analytics tasks such as iterative computations and interactive queries, where it outperforms Hadoop. With version 0.6, Spark becomes even faster and easier to use. The release notes contain all the detailed changes, but as you’ll see from the highlights below, version 0.6 is a substantial release. Another good sign is the growth in number of contributors, with now over a third of the developers coming from outside the core team in Berkeley.


Wayne Enterprises Chronicles: Week 6

Defeat. Sigh. This three game losing streak is getting me down. There’s really not much analysis to be done this week. RGIII kept me in the game with a big day, but when four out of eight spots are subpar, pulling out a win is difficult. At least my guys took it into the Monday night game where it basically boiled down to Eric Decker on my side versus Antonio Gates for the opponent. I don’t know what the Broncos were thinking, but they weren’t thinking about guarding Antonio Gates.

Bye week hell this round, but I’m pretty decently set. And the opponent is in a tough spot, with three of his key players on the Falcons who are off this week.


Discogs Doubleplus Good

Partially because it wasn’t obvious to me where to find it, but mostly because it’s cool:

Download Discogs Data

Here you will find monthly dumps of Discogs Release, Artist, and Label data. The data is in XML format and formatted according to the API spec: http://www.discogs.com/developers/

This data is released under the Public Domain license: http://creativecommons.org/licenses/publicdomain/

And funny how we just have to rediscover some things.


Python API Building

Nice 4 step tutorial on building a RESTful API using Python, by K. P. Kaiser. The best tip was a pointer to elasticutils, an ElasticSearch library:

It’s always a good idea to see what your options are in a library. Initially, when I was building this integration I saw pyes, a very well written library, but the code to use it seemed a bit ugly for my tastes.

Luckily, after a bit more searching, I found elasticutils, which is, in my opinion, a much cleaner interface to the very simple elasticsearch server. It always pays to take a few minutes to read the introduction, and example code before deciding on a library. Elasticutils actually uses pyes under the covers.


FDWs FTW

Craig Kerstiens on putting Redis in Postgres

SQL is an expressive language, though people are often okay with accessing Mongo data through its own ORM. The real value is that you could actually query the data from within Postgres then join across your data stores, without having to do some ETL process to move data around.

From experience, here be dragons, but used judiciously Postgres’ Foreign Data Wrappers are a great feature.


Summarizing Streams

[embed]https://twitter.com/bigdata/statuses/258963194163380224[/embed]

From A Framework for Summarizing and Analyzing Twitter Feeds (PDF alert)

In this paper, we present a dynamic pattern driven approach to summarize data produced by Twitter feeds. We develop a novel approach to maintain an in-memory summary while retaining sufficient information to facilitate a range of user-specific and topic-specific temporal analytics. We empirically compare our approach with several state-of-the-art pattern summarization approaches along the axes of storage cost, query accuracy, query flexibility, and efficiency using real data from Twitter. We find that the proposed approach is not only scalable but also outperforms existing approaches by a large margin.

My quick half-ass summary:

  1. use frequent item set mining, to come up with a code book
  2. recode your content with the code book
  3. compress, which exploits the redundancy uncovered by the coding
  4. profit!!
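
A toy Python sketch of the first three steps, to pin down what I mean. This is my own simplification, not the paper's algorithm: it only mines frequent pairs of tokens, the code words (`#0`, `#1`, ...) and function names are invented, and the "tweets" are made-up token lists.

```python
import json
import zlib
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(transactions, min_support):
    """Step 1: find token pairs that co-occur in at least min_support transactions."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return [pair for pair, n in counts.most_common() if n >= min_support]


def build_code_book(frequent_pairs):
    """Step 1, continued: give each frequent pair a short code word."""
    return {pair: f"#{i}" for i, pair in enumerate(frequent_pairs)}


def recode(items, code_book):
    """Step 2: replace any frequent pair found in the transaction with its code word."""
    remaining = set(items)
    coded = []
    for pair, code in code_book.items():
        if remaining.issuperset(pair):
            coded.append(code)
            remaining -= set(pair)
    return coded + sorted(remaining)


# Made-up "tweets", already tokenized.
tweets = [
    ["nba", "wizards", "celtics"],
    ["nba", "wizards", "opener"],
    ["nba", "wizards", "wall"],
    ["postgis", "ssd"],
]
code_book = build_code_book(mine_frequent_pairs(tweets, min_support=2))
recoded = [recode(t, code_book) for t in tweets]

# Step 3: hand the recoded stream to a general purpose compressor.
packed = zlib.compress(json.dumps(recoded).encode("utf-8"))
print(code_book, recoded)
```

Step 4 is left as an exercise for the reader.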

But I need to read the paper more closely. And the real-time summarization and topic tracking aspects seem really cool.


Discogs oEmbed

Since I’ve been heavily using the WordPress embed feature for tweets, I had a thought that it would be nice if you could do the same thing for referencing musical releases in Discogs.com. But alas, they seem not to support oEmbed.

However Discogs does seem to have a nice client API, with a supporting Python module to boot. Seems like a Discogs oEmbed proxy server might be a nice self-contained hacking project.
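
A rough sketch of the core of such a proxy: turning release metadata into an oEmbed "rich" response. The response keys (`version`, `type`, `html`, `width`, `height`, and the optional `title`, `author_name`, `provider_name`, `provider_url`) come from the oEmbed spec. Everything else is an assumption: the shape of the `release` dict, the markup, and the function name `oembed_for_release` are made up, and fetching the release from the Discogs API is left out entirely.

```python
import json
from html import escape


def oembed_for_release(release, maxwidth=400, maxheight=200):
    """Build an oEmbed 'rich' response dict from a hypothetical release dict."""
    snippet = (
        '<blockquote class="discogs-release">'
        '<a href="{url}">{artist} - {title}</a> ({year})'
        "</blockquote>"
    ).format(
        url=escape(release["url"], quote=True),
        artist=escape(release["artist"]),
        title=escape(release["title"]),
        year=escape(str(release["year"])),
    )
    return {
        "version": "1.0",   # required by the oEmbed spec
        "type": "rich",     # rich responses must carry html, width, height
        "provider_name": "Discogs",
        "provider_url": "http://www.discogs.com/",
        "title": release["title"],
        "author_name": release["artist"],
        "html": snippet,
        "width": maxwidth,
        "height": maxheight,
    }


# Placeholder metadata; a real proxy would look this up via the Discogs API.
example_release = {
    "url": "http://example.com/release/1",
    "artist": "Example Artist",
    "title": "Example <Release>",
    "year": 2012,
}
response = oembed_for_release(example_release)
print(json.dumps(response, indent=2))
```

Wrap that in a tiny web app that answers a `url` query parameter with this JSON and the self-contained hacking project is most of the way there.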


This Must Happen!

The Little Guy (TM) has been watching a lot of Yo Gabba Gabba! when it hit me.

Lady Miss Kier must do a dancey dance!! That would be AWESOOOOME!!

[embed]http://www.youtube.com/watch?v=Ibk4Diagkok[/embed]

If they can find a spot on the show for Metta World Peace, they can put a little groove in the heart.


Atomic News

[embed]https://twitter.com/mathewi/statuses/257899167073050625[/embed]

Circa seems pretty interesting even though it’s YAPTRN

Part of the thinking behind Circa comes from ideas that have been described by author and journalism professor Jeff Jarvis, as well as media-startup veteran David Cohn, who is also a founding partner of Circa and acts as its editor-in-chief. The main idea is that the traditional article or story format that newspapers and other news outlets have produced for so many years no longer fits with the way we produce or consume information now. The standard “inverted pyramid”-style article was designed for the days when people might only see one report about a news event, printed on dead trees and without links, so it had to include virtually everything.

The gravity of mobile is counteracting the inertia of print in terms of news distribution, leading to some interesting approaches. The Web in general provided the initial escape velocity, but couldn’t quite get news out of print’s orbit. The confluence of current economics and mobile adoption feels like it could provide enough acceleration to do the trick.

The key is that Circa has pushed back into the newsroom’s production process and if successful in a big way will be a sea change.

Feh. Even though I wrote it, that next to last paragraph was way too weasel wordy, wonky, and jargony.


Network Structure Taxonomies

Link parkin’: Taxonomies of networks from community structure [embed]https://twitter.com/bigdata/statuses/254974410363133953[/embed]

The study of networks has become a substantial interdisciplinary endeavor that encompasses myriad disciplines in the natural, social, and information sciences. Here we introduce a framework for constructing taxonomies of networks based on their structural similarities.

This paper may set a record for number and word count of author affiliations.


Nerds, Geeks, Dorks

Glyph Lefkowitz wrestles with some definitions:

So I always feel a twinge when I identify myself as a “geek”. I usually prefer to say that I am - or at least aspire to be - a nerd.

Have to say I’m quite sympathetic. Suppressing my inner dork is a daily, even hourly, challenge.


The Flask Stack

At work, I’ve been giving Flask a whirl, with a focus on building a REST based API emitting JSON responses. A much different approach from Django, much more amenable to building an API to my mind. Much less magic, but much lighter weight. And at least for me, a bit easier to understand how to generate RESTful responses.

The biggest part of the learning curve has been SQLAlchemy. Between the SQL Core for building DB queries, to the object relational mapper (with two styles of declaring schemas no less), to engines, to sessions, there’s a lot going on. Using PostGIS is exacerbating the issue as 1) the geo part needs extensions like GeoAlchemy, and 2) I take advantage of Postgres’ range types, which aren’t baked into SQLAlchemy and are best used with a non-standard operator (@>). So imagine coming to this Swiss Army chainsaw of object relational mapping system and you’re immediately into figuring how to extend the framework. Fun!
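
For anyone who hasn’t met range types, here is a plain Python sketch of what the `@>` "contains element" test means for a range with an inclusive lower bound and exclusive upper bound, which is the default bound style for Postgres ranges. It doesn’t touch SQLAlchemy or the database at all; the class name `HalfOpenRange` and the sample dates are made up. (On the SQLAlchemy side, the generic `op()` hook is one way to emit a non-standard operator, though I make no claim that’s the best route.)

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class HalfOpenRange:
    """A [lower, upper) range: lower bound included, upper bound excluded."""

    lower: date
    upper: date

    def contains(self, value):
        # Mirrors the meaning of `range @> element` for '[)' bounds.
        return self.lower <= value < self.upper


# Made-up validity window for some observation.
october = HalfOpenRange(date(2012, 10, 1), date(2012, 11, 1))

print(october.contains(date(2012, 10, 31)))
print(october.contains(date(2012, 11, 1)))
```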

Some helpful Flask extensions: Flask-Restless and Flask-SQLAlchemy


Wayne Enterprises Chronicles: Week 5

TRAGEDY!! I went into the Monday night game trailing by 24 points and a skoosh. A tough get, but I had the overall #1 pick, Arian Foster, going. 120 yards and 2 touchdowns was definitely doable.

152 on the ground. 1 touchdown. 1 reception for 16 yards. 23 fantasy points.

And I lose by a lousy half a point!!

Foster did his part. The actual killer was Stevan Ridley of the Patriots giving up a fourth quarter fumble that cost me 2 points.

Given that my quarterback, Robert Griffin III, got knocked out and only supplied 3 points, I probably shouldn’t have been in the game anyway. But Tony Gonzalez had another big day with 24 points. Meanwhile, my kicker and defense provided double digits as well.

Need to get more production out of my wide receivers, but dang did that one sting!


djay Unleashed?

If I read this right, a vexing problem I had with djay as virtual turntables has been solved:

There are actually two points here. You have the ability to support multichannel audio interfaces in djay, previously unavailable, and you also can route headphone output and main output separately – so even without a multichannel interface, you could use, say, both the headphone jack and an HDMI or USB out. That’s a big deal for DJs, because finally, you can pre-cue tracks, but also should mean apps with multichannel recording and playback, other tools with separate cueuing, and, heck, even iPad-based surround rigs if you want them

Will have to rekick the tires.


Another Coder’s Font

Source Code Pro. Currently making the rounds on the Intarwebs:

Source Code Pro is a set of OpenType fonts that have been designed to work well in user interface (UI) environments. In addition to a functional OpenType font, this open source project provides all of the source files that were used to build this OpenType font by using the AFDKO makeotf tool.

Via Mikko Ohtamaa, said post also having useful commentary and comments.


OSM & PGSQL

Michal Migurski drops knowledge on how to get OpenStreetMap data into PostgreSQL/PostGIS:

At first glance, OSM data and Postgres (specifically PostGIS) seem like a natural, easy fit for one another: OSM is vector data, PostGIS stores vector data. OSM has usernames and dates-modified, PostGIS has columns for storing those things in tables. OSM is a worldwide dataset, PostGIS has fast spatial indexes to get to the part you want. When you get to OSM’s free-form tags, though, the row/column model of Postgres stops making sense and you start to reach for linking tables or advanced features like hstore.


pandas 0.9.0

[embed]https://twitter.com/wesmckinn/statuses/255290426477641728[/embed]

Not much of a point-zero release fan, but I might make an exception for pandas.


Capability Challenge

Rafe Colburn has been on fire recently, but a recent post on capabilities really hit home for me:

China has deployed its own aircraft carrier. The vessel is a rehabilitated Ukrainian carrier that China purchased in 1998. Unfortunately, China does not have pilots who have practiced landing a plane on an aircraft carrier, nor do they have any planes that are capable of landing on an aircraft carrier.

What’s the point? That it’s a lot more difficult to develop a capability than it is to build something.

I work in an industry where “capability” is a bit of jargon and I work in an organization that likes to bandy the term about willy nilly. The industry is definitely cognizant of the meaning but there are quite a few people I work with who need to carry a printed version of this around.


Jesusphone 5

I ordered an iPhone 5 a couple of weeks ago. It arrived at my local AT&T this past Thursday, and I picked it up yesterday. I’ll admit it. I made a halfass attempt to wait in line on the first day. There was enough of a line to turn me away, even in the hinterlands of Northern Virginia. If I’d stuck it out, given I was going top shelf for a 64 GB model, I might have actually gotten one to gloat about. But I had other stuff to do.

So here’s my pure, initial impulse reaction based on less than 36 hours of ownership.

I’d have been real $?!& salty if I’d stood in line a couple of hours for this thing. Physically longer and lighter? Big deal. Ear pods still don’t stay in my ears. Yet another cable to carry, woohoo! And I sort of liked the iPhone 4’s backglass, which gave it a nice heft. Maybe you can be too thin.

But. But! BUT!! … The combination of the new A6 processor and 4G networking makes the iPhone 5 fast as hell. Stuff I used to twiddle my thumbs waiting for on the 4? Done!

And the more you use it, the more you notice.

The biggest hurdle is that I transitioned from an early 2000’s poor excuse for a “smart phone” to the iPhone 4, a quantum leap. So anything less could feel like falling short.

But I think the 5 might be a keeper.


Cogs Command Line Toolkit

[embed]https://twitter.com/pypi/statuses/254276578270400512[/embed]

I doubt if I’ll switch from cliff, but it’s always good to be aware of your options. If I could just figure out how to use cliff without distribute or a setuptools install I’d be set.


The Cloud Unit of Computation

ZeroVM is to virtualization what SQLite is to DBMS.

Diggin’ in the feed cratez, I ran across a piece by Ben Lorica: “How ZeroVM changes analytics in the cloud”. As Lorica points out, ZeroVM is more akin to the Java Virtual Machine than virtualization containers. However, there’s an interesting implication:

Converged Storage in the Cloud: The amount of time it takes to transfer data between two specialized clusters has led to storage systems with compute capabilities. A recent example is storage vendor CleverSafe including Hadoop MapReduce into its dispersed storage network. Users of Hadoop MapReduce who have played with cloud computing are familiar with this issue: performing big data analysis in the cloud usually means having to first transfer data from storage systems (S3) to compute resources (EC2). This means that if lowering latency is an issue, bandwidth and data size limits what you can do. In contrast (assuming cloud services providers install it) ZeroVM lets you perform computations on the storage cluster!

Anyone who’s done any significant Big Data or parallel computation has run up against the issue of moving data versus moving computation. A computation container that’s cheap to deploy but isolates like an entire PC could be pretty handy. Throw in modern deployment tools like Chef and Puppet with Amazon Web Service style APIs and things get really interesting. Any chance AWS itself could get commoditized from below?


Tagging Tweets

We’re pleased to announce a new release of the CMU ARK Twitter Part-of-Speech Tagger, version 0.3.

I’ve used the ARK Tagger and can vouch highly for it.


Wayne Enterprises Chronicles: Week 4

Failure. The Enterprises went down to defeat this past fantasy weekend. Run-DMC, Oakland’s Darren McFadden, was the key blow, only scoring 4.3 fantasy points. Arian Foster had a solid 15.9 points, but that’s not enough for a RB1. So my running backs basically did me in.

Not to mention the opponent had Tom Brady storm back for 35+ points in addition to great days for Victor Cruz and Marshawn Lynch.

The real killer is that I lost by about 22 points, but I easily had that much extra left on my bench. As soon as I taunt Cam Newton and sit him down, he goes (fantasy) off against Atlanta. Stevan Ridley joined in with Brady in abusing the Bills and would have been a nice replacement for McFadden.

But that’s fantasy football for you. Still in first, barely on point differential.


Hittin’ The Trifacta

[embed]https://twitter.com/joe_hellerstein/statuses/253872168100843520[/embed]

More details from Gigaom.

Although Prof. Hellerstein was after my graduate school time, Go Bears!, and all that.


Postgres Performance Tips

For many application developers their database is a black box. Data goes in, comes back out and in between there developers hope its a pretty short time span. Without becoming a DBA there’s a few pieces of data that most application developers can easily grok which will help them understand if their database is performing adequately. This post will provide some quick tips that allow you to determine whether your database performance is slowing down your app, and if so what you can do about it.

Craig Kerstiens provides some handy Postgres SQL that reveals how well the RDBMS is handling your queries. Optimization is left as an exercise for the reader.


Brendan O’Connor on Powerset

An interesting insider look, by Brendan O’Connor, at Powerset, a failed natural language search startup:

There’s a lot to say about Powerset, the short-lived natural language search company (2005-2008) where I worked after college. AI overhype, flying too close to the sun, the psychology of tech journalism and venture capitalism, etc. A year or two ago I wrote the following bit about Powerset’s technology in response to a question on Quora. I’m posting a revised version here.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.