
Python API Building

Posted on: Sat 20 October 2012

Nice 4 step tutorial on building a RESTful API using Python, by K. P. Kaiser. The best tip was a pointer to elasticutils, an ElasticSearch library:

It’s always a good idea to see what your options are in a library. Initially, when I was building this integration I saw pyes, a very well written library, but the code to use it seemed a bit ugly for my tastes.

Luckily, after a bit more searching, I found elasticutils, which is, in my opinion, a much cleaner interface to the very simple elasticsearch server. It always pays to take a few minutes to read the introduction, and example code before deciding on a library. Elasticutils actually uses pyes under the covers.

FDWs FTW

Posted on: Fri 19 October 2012

Craig Kerstiens on putting Redis in Postgres

SQL is an expressive language, though people are often okay with accessing Mongo data through its own ORM. The real value is that you could actually query the data from within Postgres then join across your data stores, without having to do some ETL process to move data around.

From experience, here be dragons, but used judiciously Postgres’ Foreign Data Wrappers are a great feature.

Summarizing Streams

Posted on: Thu 18 October 2012

Efficient stream summarization framework to incrementally build summaries of Twitter message streams w/ one-pass http://t.co/GJ3z6cmi
— Ben Lorica 罗瑞卡 (@bigdata) October 18, 2012

From A Framework for Summarizing and Analyzing Twitter Feeds (PDF alert)

In this paper, we present a dynamic pattern driven approach to summarize data produced by Twitter feeds. We develop a novel approach to maintain an in-memory summary while retaining sufficient information to facilitate a range of user-specific and topic-specific temporal analytics. We empirically compare our approach with several state-of-the-art pattern summarization approaches along the axes of storage cost, query accuracy, query flexibility, and efficiency using real data from Twitter. We find that the proposed approach is not only scalable but also outperforms existing approaches by a large margin.

My quick half-ass summary:

use frequent item set mining, to come up with a code book
recode your content with the code book
compress, which exploits the redundancy uncovered by the coding
profit!!
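The steps above can be sketched in a few lines of Python. This is a toy illustration and not the paper's algorithm: counting frequent word pairs stands in for real frequent item set mining, and the names `build_codebook`, `recode`, `summarize`, and `min_support` are made up for this sketch.

```python
import zlib
from collections import Counter
from itertools import combinations


def build_codebook(messages, min_support=2):
    """Map each frequent word pair to a short code (toy item set mining)."""
    counts = Counter()
    for msg in messages:
        words = sorted(set(msg.split()))
        counts.update(combinations(words, 2))
    frequent = [pair for pair, n in counts.most_common() if n >= min_support]
    return {pair: "#%d" % i for i, pair in enumerate(frequent)}


def recode(message, codebook):
    """Replace any codebook pair fully present in the message with its code."""
    words = set(message.split())
    codes = []
    for pair, code in codebook.items():
        if words.issuperset(pair):
            codes.append(code)
            words -= set(pair)
    return sorted(codes) + sorted(words)


def summarize(messages, min_support=2):
    """Build the code book, recode every message, then compress the result."""
    codebook = build_codebook(messages, min_support)
    recoded = [" ".join(recode(m, codebook)) for m in messages]
    blob = zlib.compress("\n".join(recoded).encode("utf-8"))
    return codebook, recoded, blob
```

Word order is thrown away here, so this is a lossy summary. The real framework presumably keeps enough information to answer temporal queries, which this sketch does not attempt.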

But I need to read the paper more closely. And the real-time summarization and topic tracking aspects seem really cool.

Discogs oEmbed

Posted on: Wed 17 October 2012

Since I’ve been heavily using the WordPress embed feature for tweets, I had a thought that it would be nice if you could do the same thing for referencing musical releases in Discogs.com. But alas, they seem not to support oEmbed.

However Discogs does seem to have a nice client API, with a supporting Python module to boot. Seems like a Discogs oEmbed proxy server might be a nice self-contained hacking project.
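The core of such a proxy would be turning release metadata into an oEmbed response. Here is a minimal sketch, assuming the release has already been fetched and flattened into a plain dict. The dict keys (`artist`, `title`, `year`, `url`), the function name, and the HTML snippet are all hypothetical; only the response fields (`version`, `type`, `html`, `width`, `height`, and friends) come from the oEmbed spec.

```python
import json
from html import escape


def oembed_for_release(release, maxwidth=400, maxheight=120):
    """Build an oEmbed 'rich' response dict for a release (hypothetical keys)."""
    title = "%s - %s" % (release["artist"], release["title"])
    html = '<blockquote class="discogs-release"><a href="%s">%s</a> (%s)</blockquote>' % (
        escape(release["url"], quote=True),
        escape(title),
        escape(str(release["year"])),
    )
    return {
        "version": "1.0",   # required by the oEmbed spec
        "type": "rich",     # 'rich' responses must carry html, width, height
        "title": title,
        "provider_name": "Discogs",
        "provider_url": "http://www.discogs.com/",
        "html": html,
        "width": maxwidth,
        "height": maxheight,
    }


# Placeholder data; a real proxy would look this up via the Discogs API.
example = {"artist": "Example Artist", "title": "Example Release",
           "year": 1990, "url": "http://example.com/release/1"}
print(json.dumps(oembed_for_release(example), indent=2))
```

A real server would also need to parse the `url`, `maxwidth`, and `maxheight` query parameters and do the actual Discogs lookup, all of which is left out here.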

This Must Happen!

Posted on: Tue 16 October 2012

The Little Guy (TM) has been watching a lot of Yo Gabba Gabba! when it hit me.

Lady Miss Kier must do a dancey dance!! That would be AWESOOOOME!!

OEmbed Link rot on URL: http://www.youtube.com/watch?v=Ibk4Diagkok

If they can find a spot on the show for Metta World Peace, they can put a little groove in the heart.

Atomic News

Posted on: Mon 15 October 2012

new from me at GigaOM: “Circa wants to rethink the news on a sub-atomic level” http://t.co/ylNAIgEj
— Mathew Ingram (@mathewi) October 15, 2012

Circa seems pretty interesting even though it’s YAPTRN

Part of the thinking behind Circa comes from ideas that have been described by author and journalism professor Jeff Jarvis, as well as media-startup veteran David Cohn, who is also a founding partner of Circa and acts as its editor-in-chief. The main idea is that the traditional article or story format that newspapers and other news outlets have produced for so many years no longer fits with the way we produce or consume information now. The standard “inverted pyramid”-style article was designed for the days when people might only see one report about a news event, printed on dead trees and without links, so it had to include virtually everything.

The gravity of mobile is counteracting the inertia of print in terms of news distribution, leading to some interesting approaches. The Web in general provided the initial escape velocity, but couldn’t quite get news out of print’s orbit. The confluence of current economics and mobile adoption feels like it could provide enough acceleration to do the trick.

The key is that Circa has pushed back into the newsroom’s production process and if successful in a big way will be a sea change.

Feh. Even though I wrote it, that next to last paragraph was way too weasel wordy, wonky, and jargony.

Network Structure Taxonomies

Posted on: Sun 14 October 2012

Link parkin’: Taxonomies of networks from community structure

Functional Classification of Networks based on community structure: mesoscopic structure used to derive taxonomies http://t.co/U3Ov3gD9
— Ben Lorica 罗瑞卡 (@bigdata) October 7, 2012

The study of networks has become a substantial interdisciplinary endeavor that encompasses myriad disciplines in the natural, social, and information sciences. Here we introduce a framework for constructing taxonomies of networks based on their structural similarities.

This paper may set a record for number and word count of author affiliations.

Nerds, Geeks, Dorks

Posted on: Sat 13 October 2012

Glyph Lefkowitz wrestles with some definitions:

So I always feel a twinge when I identify myself as a “geek”. I usually prefer to say that I am - or at least aspire to be - a nerd.

Have to say I’m quite sympathetic. Suppressing my inner dork is a daily, even hourly, challenge.

The Flask Stack

Posted on: Fri 12 October 2012

At work, I’ve been giving Flask a whirl, with a focus on building a REST based API emitting JSON responses. A much different approach from Django, much more amenable to building an API to my mind. Much less magic, but much lighter weight. And at least for me, a bit easier to understand how to generate RESTful responses.

The biggest part of the learning curve has been SQLAlchemy. Between the SQL Core for building DB queries, to the object relational mapper (with two styles of declaring schemas no less), to engines, to sessions, there’s a lot going on. Using PostGIS is exacerbating the issue as 1) the geo part needs extensions like GeoAlchemy, and 2) I take advantage of Postgres’ range types, which aren’t baked into SQLAlchemy and are best used with a non-standard operator (@>). So imagine coming to this Swiss Army chainsaw of an object relational mapping system and you’re immediately into figuring out how to extend the framework. Fun!

Some helpful Flask extensions: Flask-Restless and Flask-SQLAlchemy

Wayne Enterprises Chronicles: Week 5

Posted on: Thu 11 October 2012

TRAGEDY!! I went into the Monday night game trailing by 24 points and a skoosh. A tough get, but I had the overall #1 pick, Arian Foster going. 120 yards and 2 touchdowns was definitely doable.

152 on the ground. 1 touchdown. 1 reception for 16 yards. 23 fantasy points.

And I lose by a lousy half a point!!

Foster did his part. The actual killer was Stevan Ridley of the Patriots giving up a fourth quarter fumble that cost me 2 points.

Given that my quarterback, Robert Griffin III, got knocked out and only supplied 3 points, I probably shouldn’t have been in the game anyway. But Tony Gonzalez had another big day with 24 points. Meanwhile, my kicker and defense provided double digits as well.

Need to get more production out of my wide receivers, but dang did that one sting!

djay Unleashed?

Posted on: Thu 11 October 2012

If I read this right, a vexing problem I had with djay as virtual turntables has been solved:

There are actually two points here. You have the ability to support multichannel audio interfaces in djay, previously unavailable, and you also can route headphone output and main output separately – so even without a multichannel interface, you could use, say, both the headphone jack and an HDMI or USB out. That’s a big deal for DJs, because finally, you can pre-cue tracks, but also should mean apps with multichannel recording and playback, other tools with separate cueing, and, heck, even iPad-based surround rigs if you want them.

Will have to rekick the tires.

Another Coder’s Font

Posted on: Wed 10 October 2012

Source Code Pro. Currently making the rounds on the Intarwebs:

Source Code Pro is a set of OpenType fonts that have been designed to work well in user interface (UI) environments. In addition to a functional OpenType font, this open source project provides all of the source files that were used to build this OpenType font by using the AFDKO makeotf tool.

Via Mikko Ohtamaa, whose post also has useful commentary and comments.

OSM & PGSQL

Posted on: Tue 09 October 2012

Michal Migurski drops knowledge on how to get OpenStreetMap data into PostgreSQL/PostGIS:

At first glance, OSM data and Postgres (specifically PostGIS) seem like a natural, easy fit for one another: OSM is vector data, PostGIS stores vector data. OSM has usernames and dates-modified, PostGIS has columns for storing those things in tables. OSM is a worldwide dataset, PostGIS has fast spatial indexes to get to the part you want. When you get to OSM’s free-form tags, though, the row/column model of Postgres stops making sense and you start to reach for linking tables or advanced features like hstore.

pandas 0.9.0

Posted on: Mon 08 October 2012

pandas 0.9.0 has been released; a definite must-upgrade for all users! http://t.co/4Q1KRVpz #pydata
— Wes McKinney (@wesmckinn) October 8, 2012

Not much of a point-zero release fan, but I might make an exception for pandas.

Capability Challenge

Posted on: Mon 08 October 2012

Rafe Colburn has been on fire recently, but a recent post on capabilities really hit home for me:

China has deployed its own aircraft carrier. The vessel is a rehabilitated Ukrainian carrier that China purchased in 1998. Unfortunately, China does not have pilots who have practiced landing a plane on an aircraft carrier, nor do they have any planes that are capable of landing on an aircraft carrier.

What’s the point? That it’s a lot more difficult to develop a capability than it is to build something.

I work in an industry where “capability” is a bit of jargon and I work in an organization that likes to bandy the term about willy nilly. The industry is definitely cognizant of the meaning but there are quite a few people I work with who need to carry a printed version of this around.

Jesusphone 5

Posted on: Mon 08 October 2012

I ordered an iPhone 5 a couple of weeks ago, it arrived at my local AT&T this past Thursday, and I picked it up yesterday. I’ll admit it. I made a halfass attempt to wait in line on the first day. There was enough of a line to turn me away, even in the hinterlands of Northern Virginia. If I’d stuck it out, given I was going top shelf for a 64 GB model, I might have actually gotten one to gloat about. But I had other stuff to do.

So here’s my pure, initial impulse reaction based on less than 36 hours of ownership.

I’d have been real $?!& salty if I’d stood in line a couple of hours for this thing. Physically longer and lighter? Big deal. Ear pods still don’t stay in my ears. Yet another cable to carry, woohoo! And I sort of liked the iPhone 4’s backglass, which gave it a nice heft. Maybe you can be too thin.

But. But! BUT!! … The combination of the new A6 processor and 4G networking makes the iPhone 5 fast as hell. Stuff I used to twiddle my thumbs waiting for on the 4? Done!

And the more you use it, the more you notice.

The biggest hurdle is that I transitioned from an early 2000s poor excuse for a “smart phone” to the iPhone 4, a quantum leap. So anything less could feel like falling short.

But I think the 5 might be a keeper.

Cogs Command Line Toolkit

Posted on: Sun 07 October 2012

OEmbed Link rot on URL: https://twitter.com/pypi/statuses/254276578270400512

I doubt if I’ll switch from cliff, but it’s always good to be aware of your options. If I could just figure out how to use cliff without distribute or a setuptools install I’d be set.

The Cloud Unit of Computation

Posted on: Sat 06 October 2012

ZeroVM is to virtualization what SQLite is to DBMS.

Diggin’ in the feed cratez, I ran across a piece by Ben Lorica: “How ZeroVM changes analytics in the cloud”. As Lorica points out, ZeroVM is more akin to the Java Virtual Machine than virtualization containers. However, there’s an interesting implication:

Converged Storage in the Cloud: The amount of time it takes to transfer data between two specialized clusters has led to storage systems with compute capabilities. A recent example is storage vendor CleverSafe including Hadoop MapReduce into its dispersed storage network. Users of Hadoop MapReduce who have played with cloud computing are familiar with this issue: performing big data analysis in the cloud usually means having to first transfer data from storage systems (S3) to compute resources (EC2). This means that if lowering latency is an issue, bandwidth and data size limits what you can do. In contrast (assuming cloud services providers install it) ZeroVM lets you perform computations on the storage cluster!

Anyone who’s done any significant Big Data or parallel computation has run up against the issue of moving data versus moving computation. A computation container that’s cheap to deploy but isolates like an entire PC could be pretty handy. Throw in modern deployment tools like Chef and Puppet with Amazon Web Service style APIs and things get really interesting. Any chance AWS itself could get commoditized from below?

Tagging Tweets

Posted on: Fri 05 October 2012

We’re pleased to announce a new release of the CMU ARK Twitter Part-of-Speech Tagger, version 0.3.

I’ve used the ARK Tagger and can vouch highly for it.

Wayne Enterprises Chronicles: Week 4

Posted on: Thu 04 October 2012

Failure. The Enterprises went down to defeat this past fantasy weekend. Run-DMC, Oakland’s Darren McFadden, was the key blow, only scoring 4.3 fantasy points. Arian Foster had a solid 15.9 points, but that’s not enough for a RB1. So my running backs basically did me in.

Not to mention the opponent had Tom Brady storm back for 35+ points in addition to great days for Victor Cruz and Marshawn Lynch.

The real killer is that I lost by about 22 points, but I easily had that much extra left on my bench. As soon as I taunt Cam Newton and sit him down, he goes (fantasy) off against Atlanta. Stevan Ridley joined in with Brady in abusing the Bills and would have been a nice replacement for McFadden.

But that’s fantasy football for you. Still in first, barely on point differential.

Hittin’ The Trifacta

Posted on: Thu 04 October 2012

Proud to launch http://t.co/kdGpimwT with @jeffrey_heer & @seankandel. Building some beautiful, powerful, useful stuff. @trifacta
— Joe Hellerstein (@joe_hellerstein) October 4, 2012

More details from Gigaom.

Although Prof. Hellerstein was after my graduate school time, Go Bears!, and all that.

Postgres Performance Tips

Posted on: Wed 03 October 2012

For many application developers their database is a black box. Data goes in, comes back out and in between there developers hope its a pretty short time span. Without becoming a DBA there’s a few pieces of data that most application developers can easily grok which will help them understand if their database is performing adequately. This post will provide some quick tips that allow you to determine whether your database performance is slowing down your app, and if so what you can do about it.

Craig Kerstiens provides some handy Postgres SQL that reveals how well the RDBMS is handling your queries. Optimization is left as an exercise for the reader.

Brendan O’Connor on Powerset

Posted on: Wed 03 October 2012

An interesting insider look, by Brendan O’Connor, at Powerset, a failed natural language search startup:

There’s a lot to say about Powerset, the short-lived natural language search company (2005-2008) where I worked after college. AI overhype, flying too close to the sun, the psychology of tech journalism and venture capitalism, etc. A year or two ago I wrote the following bit about Powerset’s technology in response to a question on Quora. I’m posting a revised version here.

Solr vs Elasticsearch

Posted on: Tue 02 October 2012

Haven’t had a chance to dig in to this comparison of elasticsearch and Solr, but I’m link parkin’ to keep an eye on the alternatives. Have to say I’m currently enjoying elasticsearch at work, but it’s still early in my experimentation.

A good Solr vs. ElasticSearch coverage is long overdue. We make good use of our own Search Analytics and pay attention to what people search for. Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch.

As the Apache Lucene 4.0 release approaches and with it Solr 4.0 release as well, we thought it would be beneficial to take a deeper look and compare the two leading open source search engines built on top of Lucene – Apache Solr and ElasticSearch. Because the topic is very wide and can go deep, we are publishing our research as a series of blog posts starting with this post, which provides the general overview of the functionality provided by both search engines.

Useful That

Posted on: Mon 01 October 2012

A Few Useful Things to Know about Machine Learning http://t.co/03cFbKxE by Pedro Domingos :: added to my recommended “must read” list for ML
— Jimmy Lin (@lintool) October 1, 2012

I have it on good reputation that this Domingos guy knows a thing or two about machine learning.

Ha, ha! Only serious.

Ruh-roh Apple?

Posted on: Sun 30 September 2012

I upgraded my lil’ ole MacBook to Mac OS X 10.7.5 (not worthy of Mountain Lion support) and now the poor thing has been crashing. And it borked my alpha of Tweetbot for Mac (fixed). May just be age, but I’m hoping there’s a pending OS update to fix some obscure issue.

I upgraded my iPhone 4 to iOS 6 and now I think I’m suffering the battery life issues many others have been seeing. Haven’t done a scientific investigation, and an uptick in audio streaming over 3G may be responsible, but still irritating.

Not really Apple’s fault but …, I went to the AT&T store a day after the iPhone 5 launch to put in my order. Okay, it’s going to take 3-4 weeks, but I’m a patient guy. However, it shouldn’t take me 30 minutes to get a hold of a sales rep and complete the pre-order. Yikes!

And then there’s that maps issue.

Stuff used to “just work”.

More itertools

Posted on: Sat 29 September 2012

OEmbed Link rot on URL: https://twitter.com/pypi/status/250302968228880384

Here I’ve collected several routines I’ve reached for but not found. Since they are deceptively tricky to get right, I’ve wrapped them up into a library. We’ve also included implementations of the recipes from the itertools documentation. Enjoy! Any additions are welcome; just file a pull request.

Manna

Posted on: Sat 29 September 2012

I finally gave up on the “coffee shop” options here in greater Leesburg, VA: 3 Starbucks, a combo cafe and sushi bar, and a retrofitted cobbler’s storefront. The latter’s somewhat charming, but doesn’t quite feed the need.

What’s a former illycaffè barista to do? At home, I could do a standard drip brewer, or a French press, or even a small espresso machine. But all of those are too inconvenient. When I’m at home and feel the urge, I don’t need top of the line java, just something to meet the craving.

So I bought a Keurig B70 Platinum K-Cup coffee brewer this past weekend. The Keurig scratches the itch when I get a few free moments at home to read, or hack, or just stargaze. Doing that in a public place with interesting foot traffic? That’ll have to wait until our next abode.

elasticsearch Cool?

Posted on: Fri 28 September 2012

I’ve had an eye on elasticsearch for a while now but didn’t really have a good reason to use the Lucene-based search engine. Until today that is, when at work I was looking at the use cases I was applying MongoDB towards. I’m not a hater, but MongoDB just wasn’t fitting the bill.

Time to give elasticsearch a shot. At least Luca Cavanna of orange11 thinks elasticsearch is cool:

First of all, what you’ll notice as soon as you start up is how easy elasticsearch is to use. You index your JSON documents, then you can make a query and retrieve them, with no configuration needed. One of the reasons is that it’s schema-less, which means it uses some nice defaults to index your data unless you specify your own mapping. For more precision it has an automatic type guessing mechanism which detects the type of the fields you are indexing, and it uses by default the Lucene StandardAnalyzer while indexing the string fields.

If you need something beyond the standard choices you can always define your own mapping, simply using the Put Mapping API. In fact every feature is exposed as a REST API.

Seems auspicious, but we’ll put the engine to the test.

Thirty For Twelve

Posted on: Thu 27 September 2012

Milestone. Twelve months straight of 30+ posts per month in this here venue. Sometimes it’s a grind, but routine is good. And I’ve stashed away a heck of a bunch of good information. Still need to get cranking on a project that I can narrate.

eGenix PyRun

Posted on: Thu 27 September 2012

eGenix PyRun looks like it might come in handy some day.

Our new eGenix PyRun™ combines a Python interpreter with an almost complete Python standard library into a single easy-to-use executable, that does not require a system wide installation and is fully relocatable.

eGenix PyRun’s executable only needs 11MB, but still supports most Python application and scripts - and it can be further compressed to 3-4MB using gzexe or upx.

Compared to a regular Python installation of typically 100MB on disk, this makes eGenix PyRun ideal for applications and scripts that need to be distributed to many target machines, client installations or customers.

Ask, Answer, Tell, [Suggest?]

Posted on: Wed 26 September 2012

Sean J. Taylor interned with the LinkedIn Data Science team and had some cogent observations on what the work can really boil down to:

Most people who describe data science are actually describing what it takes to get the job. (e.g. Take statistics and machine learning courses. Learn such-and-such languages/packages. Hack.). Or they describe how the job is growing in importance. I’m going to describe the actual practice of data science.

The Data Science Loop:

Ask a good question.

Answer the question while economizing on resources.

Communicate your results.

(Sometimes) Make recommendations to engineers or managers.

The whole thing is worth a read, but most importantly, the closing graf.

Wayne Enterprises Chronicles: Week 3

Posted on: Tue 25 September 2012

Victory!, although Cam Newton had me scared on Thursday with a lousy eight points. That’s 3 and 0 for those counting at home. Leading the league at this moment, although as in all things fantasy, my luck could turn at any minute.

The big surprises this week were Darren McFadden with 18.5 points and Tony Gonzalez with 19.6 points. Great production out of RB2 and the TE position. McFadden scored more than Arian Foster (!), my (and the league’s) overall number one pick. Both my wide receivers were solid with double digit performances and even K Robbie Gould pitched in with 13 points.

I had a comfortable enough lead going into the Sunday night game, that I could even leave the Ravens DEF in against the New England offense. Luckily they didn’t go negative or I really would have been sweating. Serious consideration was given to benching the Ravens.

102 points without much production from a key position, QB, just shows the strength of this team. And Cam Newton might get benched for RGIII this weekend!

pyDAWG

Posted on: Mon 24 September 2012

Okay, tons of background knowledge is great, until you find yourself having to search 20 GB / 270+ million lines to find anything in that mountain of data. Full text indexing seems like the right choice, but again, who wants to deal with Solr/Lucene, ElasticSearch, or Sphinx just to get started?

Enter DAWG

This package provides DAWG-based dictionary-like read-only objects for Python (2.x and 3.x).

String data in a DAWG (Directed Acyclic Word Graph) may take 200x less memory than in a standard Python dict or list and the raw lookup speed is comparable. DAWG may be even faster than built-in dict for some operations. It also provides fast advanced methods like prefix search.

Based on dawgdic C++ library.

I’m going to give it a shot at work and see what happens. The big upside is fast prefix search (hopefully) over and above a key/value store’s fast key lookup. Only obvious downside I can see is constantly hearing Xzibit in my head when I read the docs for this module.
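To make “prefix search” concrete, here is a stdlib-only stand-in using a sorted list and `bisect`. It is not the DAWG package’s API and has none of its memory savings; the function name `keys_with_prefix` is made up for this sketch.

```python
from bisect import bisect_left


def keys_with_prefix(sorted_keys, prefix):
    """Return every key in an already-sorted list that starts with prefix."""
    # All keys sharing a prefix sit next to each other in sorted order,
    # beginning at the first position where the prefix itself would go.
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for i in range(start, len(sorted_keys)):
        if not sorted_keys[i].startswith(prefix):
            break
        matches.append(sorted_keys[i])
    return matches


keys = sorted(["da", "dawg", "dawn", "day", "dog"])
print(keys_with_prefix(keys, "daw"))
```

A plain dict gives fast exact lookups but would have to scan every key to answer the same question, which is the gap a prefix-capable structure is meant to close.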

What A Trip!

Posted on: Sun 23 September 2012

OEmbed Link rot on URL: https://twitter.com/Dope_Den/status/249882120196079618

Two decades ago, Chicago-native Mark Farina began experimenting at his DJ gigs, dropping slower, deeper tracks along the lines of disco classics, acid-jazz, hip hop, and downtempo. In Chicago, which is primarily known as a house music town, his selections were deemed more appropriate for home listening than for the dance-floor. Despite this, Farina continued to develop his sound, which soon after became coined as Mushroom Jazz. A few mixtapes followed, but his experiment-turned-endeavor finally got off the ground in his new home, San Francisco.

Farina juggled between his two residencies of Chicago and San Francisco until he made a permanent move to SF in 1994. He saw opportunities in the city with its vibrant music scene, which featured a slew of promoters catering to fans across a broad spectrum of music styles including hip hop, jazz, house, and reggae to Wicked-style breaks and techno. A couple years before he made his move, he teamed up with Patty Ryan-Smith to throw his own Mushroom Jazz club night.

I’ve done some dumb things in my life. Not going to a Mushroom Jazz event, when I was essentially a very gradual student at Berkeley, has to rank pretty high on the list.

Diggin’ On Whoosh

Posted on: Sat 22 September 2012

I’m starting to observe that when doing data exploration, right after summary statistics, keyword-style searching is high up on the TODO list. Until you really need them though, pulling out the big boys like Solr/Lucene or Sphinx is sort of a pain. When you’re in iterative exploration mode the tax of dealing with enterprise scalable software is substantial. YAGNI probably applies. However, if you’re of a Pythonic mind, then Whoosh is a nice, lightweight starter toolkit.

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.

I’ve been incorporating Whoosh into some data analysis on a tiny data set and it’s been a blast. So much so I’ll soon try it out on a bigger, but not massive, pile of bits. A nice feature of a pure Pythonic search library is that you can stash arbitrary Python data structures in the index. This really increases the utility of dealing with search results as opposed to having to go to another store to retrieve more complex non-indexed objects.

Whoosh is also useful for embedding in Swiss Army command shells built using cliff.

python-omgeo

Posted on: Fri 21 September 2012

Link parkin’:

OEmbed Link rot on URL: https://twitter.com/pypi/statuses/248871216629313536

Background Knowledge

Posted on: Fri 21 September 2012

Diggin’ in the feed cratez, ran across From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas, announced by Google’s Valentin Spitkovsky and Peter Norvig:

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

And the data is readily accessible, with a publication capturing the details.

The Data Kalashnikov

Posted on: Thu 20 September 2012

CSV is the data Kalashnikov: not pretty, but many wars have been fought with it and kids can use it. #okfest
— @pudo@berlin.social (@pudo) September 19, 2012

Word. Although I will say that today I managed to export two Excel worksheets to tab-separated value text files encoded in UTF-16 and then safely turn them into CSVs using Python’s csvkit. No children were harmed.
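For the curious, the same conversion can be roughed out with nothing but the standard library’s `csv` module. This is a sketch in Python 3 syntax rather than the csvkit route described above, and the function name and file paths are placeholders.

```python
import csv


def tsv_utf16_to_csv(src_path, dst_path):
    """Convert a UTF-16 tab-separated export into a UTF-8 CSV file.

    Returns the number of rows written.
    """
    count = 0
    # newline="" lets the csv module handle line endings itself.
    with open(src_path, newline="", encoding="utf-16") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, dialect="excel-tab")
        writer = csv.writer(dst)
        for row in reader:
            writer.writerow(row)
            count += 1
    return count
```

Letting the `csv` writer handle quoting is the point: fields that contain commas come out properly quoted instead of silently shifting columns.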

psycopg2 and copy_from

Posted on: Wed 19 September 2012

Yesterday I was developing some code to do bulk loading of data into PostgreSQL. The db’s COPY command is the best, if not exactly most inviting way to do this, modulo possible extensions, but those haven’t worked for me.

Unfortunately, COPY reads from files local to the server or over standard input. I was dreading having to jerry-rig something out of SSH when I revisited the psycopg2 module. Turns out psycopg2 has a copy_from method that does what you’d expect, whether the DB is local or remote. Python FTW yet again.
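A minimal sketch of how that might look, assuming `cursor` is a psycopg2 cursor on an open connection. The helper names `bulk_load` and `_copy_field` are made up for illustration; `copy_from` and its `sep`, `null`, and `columns` arguments are psycopg2’s own.

```python
import io


def _copy_field(value):
    """Render one value in COPY text format; None becomes the \\N marker."""
    if value is None:
        return "\\N"
    text = str(value)
    # Escape the characters that COPY's text format treats specially.
    return (text.replace("\\", "\\\\")
                .replace("\t", "\\t")
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def bulk_load(cursor, table, columns, rows):
    """Send rows to the server through cursor.copy_from in one COPY."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(_copy_field(v) for v in row) + "\n")
    buf.seek(0)  # copy_from reads from the current file position
    cursor.copy_from(buf, table, sep="\t", null="\\N", columns=columns)
```

The rows are buffered in memory and streamed over the existing database connection, so nothing has to land on the server’s filesystem first. For very large loads a generator-backed file-like object would be kinder to memory than one big `StringIO`.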

Probably should have been using psycopg2 from the get go.