PgSQL Bulk Ingest of JSON

After months of sometimes painful experimentation here’s a technique others may find useful. This is for those who need to ingest millions of JSON objects, collected from the wild, stored as line ordered text files, into a PostgreSQL table.

  1. Transform your JSON lines into base64 encoded lines
  2. Start a transaction
  3. Create a temp table that is dropped on commit
  4. COPY from your base64 encoded file into the temp table
  5. Run an INSERT into your destination table that draws from a SELECT that base64 decodes the JSON column
  6. COMMIT, dropping the temp table
  7. Profit!!
  8. For bonus points, use the temp table to delete duplicate records or other cleanup
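
The steps above might look roughly like this in Python with psycopg2. Treat it as a sketch under assumptions, not the exact code behind the post: the table names staging and tweets, the column names payload and doc, the json type on the destination column, and the helper names are all made up for illustration. Only encode_jsonl runs without a database; bulk_load expects an open psycopg2 connection.

```python
import base64
import io

# Hypothetical names: staging/payload and tweets/doc are illustrative only.
CREATE_STAGING = "CREATE TEMP TABLE staging (payload text) ON COMMIT DROP"
COPY_STAGING = "COPY staging (payload) FROM STDIN"
INSERT_DECODED = (
    "INSERT INTO tweets (doc) "
    "SELECT convert_from(decode(payload, 'base64'), 'UTF8')::json FROM staging"
)


def encode_jsonl(lines):
    """Step 1: base64 encode each non-empty JSON line.

    The base64 alphabet has no tabs, backslashes, or newlines, so COPY's
    text format has nothing to escape.
    """
    out = []
    for line in lines:
        text = line.rstrip("\n")
        if text:
            out.append(base64.b64encode(text.encode("utf-8")).decode("ascii"))
    return out


def bulk_load(conn, encoded_lines):
    """Steps 2-6 in one transaction. `conn` is assumed to be a psycopg2 connection."""
    buf = io.StringIO("".join(e + "\n" for e in encoded_lines))
    with conn:  # psycopg2 commits on success, rolls back on error
        with conn.cursor() as cur:
            cur.execute(CREATE_STAGING)
            cur.copy_expert(COPY_STAGING, buf)
            cur.execute(INSERT_DECODED)
```

The decode happens inside PostgreSQL, so whatever odd characters the JSON carries never pass through COPY's escaping rules.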

The benefit of this approach is that it saves you from dealing with escaping and encoding horrors. Basically, JSON can contain any character so it’s almost impossible to safely outsmart stupid users and PgSQL strictness. Once you’re over a million records, you’ll get bit by one or the other.

Reliably applied to over 300 million Tweets from the Twitter streaming API.

Update. If you’re doing this with Python make sure to use psycopg2 since you can control the transaction isolation and it supports COPY.


For Budding Data Scientists

Hilary Mason has some advice:

I get quite a few e-mail messages from very smart people who are looking to get started in data science. Here’s what I usually tell them:

The best way to get started in data science is to DO data science!

First, data scientists do three fundamentally different things: math, code (and engineer systems), and communicate. Figure out which one of these you’re weakest at, and do a project that enhances your capabilities. Then figure out which one of these you’re best at, and pick a project which shows off your abilities.

Good input, although I’m still somewhat partial to Sean Taylor’s “Ask, Answer, Tell” for what data scientists do. I think coding is important, but while engineering systems can be scientific, the practice is much more pragmatic and provides much different challenges than scientific exploration. Probably a minor quibble but engineering is quite different from science.

Still, getting your hands dirty is better than noodling in the abstract on the sidelines.


Scala and IntelliJ

If I’m going to get serious about learning Scala, I’m going to need a serious development environment. My default would be to figure something out with Emacs, but it’s time to check out the modern kit.

Jeethu Rao documented getting IntelliJ IDEA set up for Scala:

The class used the Typesafe Scala IDE, which is built on Eclipse. I’ve been using JetBrains PyCharm for most of my python work and when they had IntelliJ 12 Ultimate Edition on sale at a 75% discount last week, I bought it. Last night, I spent a bit of time setting up IntelliJ to work with Scala on my Retina MBP. I’m posting this as a future reference.

Hope IDEA is a bit lighter weight than Eclipse, whose enormity has always confounded me.


DBPedia and SQLite

DBpedia has always felt like a great resource to me, but I’ve never taken the time to play with the data, being somewhat daunted by RDF and triplestores.

Jean-Louis Fuchs has done yeoman’s work in creating a repository of DBpedia infoboxes in SQLite and releasing his import script. May have to grab and experiment with.

When I learned about DBpedia, I wanted to have it installed locally, I read the tutorials on sparql and how to install DBpedia: that has to be simpler. I kind of worship Simple and I didn’t want to learn yet another query language. So I hacked an SQLite import script for DBpedia types and infobox-properties in one evening. The script takes about 50 hours to read instance_types_en.nt and infobox_properties_en.nt. It creates a table per type and a table per property, after tables are created everything gets an index, the whole database is analyzed and finally vacuumed.
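
For a feel of what a table-per-type import involves, here’s a small sqlite3 sketch. It is not Fuchs’s script: the regex only handles simple N-Triples lines whose object is a URI, and the table naming rule and function names are my own assumptions.

```python
import re
import sqlite3

# Matches simple N-Triples lines whose object is a URI: <s> <p> <o> .
TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def table_name(uri):
    """Turn the last URI segment into a table name (illustrative rule only)."""
    tail = re.split(r"[/#]", uri.rstrip("/"))[-1]
    return "type_" + re.sub(r"\W", "_", tail)


def load_types(conn, lines):
    """Create one table per type, insert subjects, then index, analyze, vacuum."""
    tables = set()
    for line in lines:
        match = TRIPLE.match(line)
        if not match:
            continue  # skip comments, literal objects, malformed lines
        subject, _predicate, type_uri = match.groups()
        name = table_name(type_uri)
        if name not in tables:
            conn.execute(f'CREATE TABLE IF NOT EXISTS "{name}" (subject TEXT)')
            tables.add(name)
        conn.execute(f'INSERT INTO "{name}" (subject) VALUES (?)', (subject,))
    for name in sorted(tables):
        conn.execute(f'CREATE INDEX IF NOT EXISTS "idx_{name}" ON "{name}" (subject)')
    conn.commit()  # VACUUM cannot run inside an open transaction
    conn.execute("ANALYZE")
    conn.commit()
    conn.execute("VACUUM")
    return sorted(tables)
```

Indexing after the inserts, rather than before, follows the order described in the quote: tables first, then indexes, then analyze and vacuum.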

An aside thought, wonder if there’s a straightforward way to take DBpedia data and stuff it into a full text indexing system like ElasticSearch? You’d give up the reasoning properties of a SPARQL based triplestore, but would have quick and dirty access to a significant semi-structured knowledge repository.


2 mrjob Thoughts

I’ve been putting mrjob to the task quite a bit recently. Two quick thoughts on a highly useful package.

First, Map/Reduce is a fairly handy data processing framework even if you don’t need a ton of scalability. Any complex UNIX text file processing involving transforms, filters, and aggregations might be more easily expressed as a Map/Reduce computation. mrjob makes that quite simple to do on a single machine at a high level.
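
To make the shape concrete, here’s a plain Python sketch of a map, group, reduce pass over text lines on one machine. It deliberately does not use mrjob’s API; the function names and the three-field log line format are made up for illustration.

```python
from collections import defaultdict


def mapper(line):
    """Transform and filter: emit (status, 1) for lines shaped like 'METHOD PATH STATUS'."""
    parts = line.split()
    if len(parts) == 3:
        yield parts[2], 1


def reducer(key, values):
    """Aggregate: total count per key."""
    yield key, sum(values)


def run_local(lines):
    """Single-machine map -> group by key -> reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    results = {}
    for key in sorted(groups):
        for out_key, out_value in reducer(key, groups[key]):
            results[out_key] = out_value
    return results
```

The mapper and reducer are the only parts that change from job to job, which is most of the appeal over an ad hoc pipeline of sort, uniq, and awk.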

Second, mrjob supports the bundling and uploading of custom code and data to support the job. I’ve mainly been using it to add Python extension modules, but there’s no reason one couldn’t include supplemental data in a binary form, like a Python pickle, or an SQLite DB. Very handy for map side augmentation or filtering presuming the data set isn’t too large.


Mr. Potter’s a Real Ass

Yeah, I knew he was a top-N villain from It’s A Wonderful Life but jeez, Potter knew Uncle Billy lost that money, knew George Bailey was covering for him, and still had to kick poor George while he was down. Jerk!

Greatest Christmas movie ever. If you haven’t had a shitty day like George Bailey, you’re not human. And hopefully you made it over the hump.

Cheers!!


D3.0

Be it resolved for 2013, learn and apply D3.js, the recent 3.0 release taking the JavaScript visualization library to 11:

In addition to all of the above, D3 includes a handful of bug fixes, performance improvements, and minor new features. A new d3.shuffle method provides the Fisher–Yates shuffle. The d3.format class now supports align and fill, and both d3.format and d3.time.format support POSIX localization at build time. The d3.layout.treemap now supports the “slice-and-dice” algorithm. Lastly, all of the D3 examples have been moved to bl.ocks.org and GitHub Gist for easier browsing & forking.


Original Content

As the year winds down I’ve been looking at some of my older MPR content. One noticeable difference is the amount of original content. Definitely has decreased over time. There’s been a discernible increase in embeds and pull quotes, probably correlating to increased posting from iOS devices.

Something to think about going into next year. On the one hand quoting and linking makes it easier to post consistently. On the other, I’d really rather not publish a link blog although I could be accused of doing that already.


Bamboo

Looking for a good time series data storage and analysis service. Maybe it’s bamboo: https://twitter.com/pypi/status/282975117741527040

bamboo is an application that systematizes realtime data analysis. bamboo provides an interface for merging, aggregating and adding algebraic calculations to dynamic datasets. Clients can interact with bamboo through a REST web interface and through Python.

bamboo supports a simple querying language to build calculations (e.g. student teacher ratio) and aggregations (e.g. average number of students per district) from datasets. These are updated as new data is received.


UltraJSON

Need to check out UltraJSON at work:

UltraJSON is an ultra fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3.


The Next MF Mix CD

URBAN COYOTES - Mixed By Mark Farina - Digi-Pak CD

PRE-ORDER Your copy of Mark Farina’s latest CD featuring 15 tracks all from the Coyote Cuts imprint mixed by the man himself with all the fixins. Released as a 4 panel digi-pak with art by Aaron Miller and Josh Adams. CD’s will begin to ship around the last week of December.

Done!


Spark Deployment

Wow! I didn’t realize how easy it was to run a Spark/Mesos cluster on Amazon EC2:

The spark-ec2 script, located in Spark’s ec2 directory, allows you to launch, manage and shut down Spark clusters on Amazon EC2. It automatically sets up Mesos, Spark and HDFS on the cluster for you. This guide describes how to use spark-ec2 to launch clusters, how to run jobs on them, and how to shut them down. It assumes you’ve already signed up for an EC2 account on the Amazon Web Services site.

Might be fun to just spin one up and run against some Common Crawl data.


IPython Tutorials

Jean-Louis Fuchs has collected some essential IPython videos:

These are videos you have to see, if you don’t know ipython yet! I’ve worked as a programmer for more than 15 years and I thought I’ve seen it by now, but ipython got me as excited as I was when I dialed into a BBS first time… it bursts your horizon.


A Python Distutils Tip

Parkin’ this in case it helps someone else, including me in the future. Oddly difficult to find documentation on the feature, the closest being an aside on distutils config files.

When building a Python package that has a C extension you can build the extension separately, using the build_ext command. This allows you to control the include directories, link directories, and link libraries. Handy for those ornery libraries that you just can’t install in /usr/lib or thereabouts. I recently needed it for geos and Shapely.
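
As a sketch, here’s a small helper that assembles the build_ext command line. The /opt/geos prefix and the geos_c library name are assumptions for illustration, so substitute wherever your library actually lives. The --include-dirs, --library-dirs, and --libraries options are the long forms of -I, -L, and -l.

```python
import os
import sys


def build_ext_command(include_dirs, library_dirs, library):
    """Build the argv for `setup.py build_ext` with custom search paths."""
    return [
        sys.executable,
        "setup.py",
        "build_ext",
        # Multiple directories are joined with the platform path separator.
        "--include-dirs=" + os.pathsep.join(include_dirs),
        "--library-dirs=" + os.pathsep.join(library_dirs),
        "--libraries=" + library,
    ]


# Hypothetical install prefix; run the result from the package directory,
# e.g. with subprocess.check_call(cmd).
cmd = build_ext_command(["/opt/geos/include"], ["/opt/geos/lib"], "geos_c")
```

The same settings can also go in a [build_ext] section of setup.cfg, which is the config file route mentioned above.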


Tail Recursive Acronym

I don’t know if the games are any good, but I’ll give Michael Cook a +1 for a moby hack. Growing games is sort of cool.

But +11 for a tail-recursive acronym:

On Friday, however, Cook released a game called A Puzzling Present which shows off much of the hard work Cook’s put into Angelina (a tail-recursive acronym for “A Novel Game-Evolving Labrat I’ve Named ANGELINA”). Instead of a simple platformer with maps and obstacles, the latest games by Angelina now include new mechanics for the player (in this particular seasonal game, the character is Santa Claus) at each level.


ICWSM 2013

The call for papers for ICWSM, the International Conference on Weblogs and Social Media, 2013 is out. I’ve been a reviewer and attendee in the past and may actually be able to submit something this year. Yeah!!

But, according to Danah Boyd there might have been a bit of a botch:

In the end, I found that there was such extensive miscommunication that the gulf between what I had stated upfront as being key to me running the conference was miles away from what AAAI would find acceptable. Both of us felt as though we were contorting ourselves to make this work and it left both of us very bitter and unable to work with one another. I decided that I could not run a conference that wasn’t accessible to many of the core parts of my research community just to please other parts of the research community. I felt trapped and realized that it would be better to walk away than to let go of my principles.

Boooo!!


Wayne Enterprises Chronicles: Week 14

Victory!! Yours truly ends the season on a five game winning streak and takes the regular season title on point differential. Onwards to the Fantasy Playoffs!!

The matchup was unnecessarily tense in that I started RGIII (17.47 points) over Cam Newton (39.17 points). Given the weather, rainy, and opponent, Ravens, I was seriously leaning towards benching Griffin. Instead I went with the “don’t get cute” mantra. Similarly, I went with an oversold Greg Jennings instead of reliable Randall Cobb.

Thankfully, I made a great waiver wire pickup in Knowshon Moreno for 24 points. And no one else underperformed. Thankfully Arian Foster collected just enough garbage time points to get me over the hump. A five point squeaker that had me sweating into the second half of Monday Night Football.

Feeling good about my playoff chances although my opponent has some scary explosive players in Josh Freeman (hot, good matchup), Torrey Smith (overdue), Calvin Johnson (he’s Megatron), and Adrian Peterson (’nuff said). I’m guessing I’ll need someone to go way over projection to pull this one out.


Verizon vs AT&T

Criminy, I know it’s the Verizon Center, but there’s no need to jam the AT&T cells!!

Or at least that’s what it feels like whenever I come to a Wizards game. All I want is to check some stats, look up random trivia, and pen the occasional blog post. Is that too much to ask? #firstworldproblems


IPython Funding

Cool! Hoping the IPython notebook gets even more useful and interesting.


PyPgDay

I like Python. I like PostgreSQL. I like PyCon. I should like PyPgDay.

In coordination with the Postgres community and PyCon US 2013, PyPgDay is occurring in Santa Clara on March 13 from 9am-6pm.

PyPgDay will be a full-day event with seven talks about PostgreSQL and Python, including talks by contributors to PostgreSQL, Django, PostGIS, and Python. Half the talks will help PostgreSQL DBAs, and the other half will focus on developing Python applications using Postgres features.


Crawley

Link parkin’: Crawley Project

Crawley is Pythonic Crawling / Scraping framework intended to change the way you think about extracting data from the internet

It’s never actually materialized, but I’m still hoping that user grade, commodity cost, focused crawlers become a reality. Seems like the tech has caught up with the concept. Maybe something like Crawley plus Python’s copious machine learning and data mining toolkits could provide the foundation.

I believe not actually telling folks you’re exploiting focused crawling is one of the key tricks.


Good Luck Doug!

I’m not in the “everybody should code” cargo cult, but a bit of knowledgeable advocacy for one’s discipline is always welcome. Looks like Douglas Rushkoff is getting a chance to do some stumping for Computer Science education:

This week is Computer Science Education Week, which is being observed around the United States with events aimed at highlighting the promise — and paucity — of digital education. The climax of the festivities, for me anyway, will be the opportunity to address members of Congress and their staffers on Wednesday in Washington about the value of digital literacy. I’ve been an advocate of digital culture for the past 20 years, and this feels like the culmination of a lifetime of arguing.

Knock ’em dead!

Memo to self, read Program or Be Programmed over the holidays.


graph-tool

Link parkin’: graph-tool

graph-tool is an efficient python module for manipulation and statistical analysis of graphs (a.k.a. networks). Contrary to most other python modules with similar functionality, the core data structures and algorithms are implemented in C++, making extensive use of template metaprogramming, based heavily on the Boost Graph Library. This confers a level of performance which is comparable (both in memory usage and computation time) to that of a pure C/C++ library.

The Boost Graph Library gives me the willies though. I’ve never had much luck building or using it embedded within Python. Weird linking or dynamic library loading issues got me. At a 2.x release though, this module might be worth a test drive.

Via @hmason


Models vs Resources

This nugget from Jacob Kaplan-Moss is a bit old, and from a Django-centric perspective, but captures what’s been tripping me up with Python web app frameworks and building RESTful APIs:

Conflating models and resources

In the REST world, the resource is key, and it’s really tempting to simply look at a Django model and make a direct link between resources and models — one model, one resource. This fails, though, as soon as you need to provide any sort of aggregated resource, and it really fails with highly denormalized models. Think about a Superhero model: a single GET /heros/superman/ ought to return all his vital stats along with a list of related Power objects, a list of his related Friend objects, etc. So the data associated with a resource might actually come out of a bunch of models. Think select_related(), except arbitrary.
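
A tiny sketch of that idea, with plain dataclasses standing in for Django models. None of this is Django’s API, and the class and function names are made up. The point is just that one resource is assembled from several models.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Power:
    name: str


@dataclass
class Friend:
    name: str


@dataclass
class Superhero:
    slug: str
    name: str
    powers: List[Power] = field(default_factory=list)
    friends: List[Friend] = field(default_factory=list)


def hero_resource(hero):
    """One resource built from three models: the hero, the powers, the friends."""
    return {
        "url": f"/heros/{hero.slug}/",
        "name": hero.name,
        "powers": [p.name for p in hero.powers],
        "friends": [fr.name for fr in hero.friends],
    }
```

The mapping lives in hero_resource rather than in any one model, so the resource can grow or aggregate without the models changing.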


Wayne Enterprises Chronicles: Week 13

Victory! A satisfying win over the league leader, avenging an earlier loss this fantasy season. That makes for four wins in a row. RG III closed the door on Monday Night Football, but if I’d started Cam Newton, the deal would have been done Sunday night.

On my side, it was mostly pedestrian performances with RG III scoring 18, Arian Foster and Stevan Ridley for 13 apiece, and Tony Gonzalez (Go Bears!!) for 17. I will pat myself on the back though for grabbing Cecil Shorts off of the waiver wire and getting 17 points out of him.

I also lucked into my opponent’s QB, Drew Brees, having an uncharacteristically bad game, delivering 1.37 points. Brees is usually good for 20+. Hey, whatever it takes.

Now I’m locked into the first or second seed for our fantasy playoffs with one regular season game to go. Need a little help, but I already like how week 14 is working out.


Wiz, Heat

Somehow I lucked out and had a ticket for the second Washington Wizards victory of the season. Against reigning NBA champs, the Miami Heat, no less.

Have to say I was fairly subdued until the final minute, when I realized the Wiz could actually win. Otherwise, I was anticipating collapse, but the Heat never really turned on the defense.


That Dang Lockscreen

I’m not rooting for Windows 8, but I hope it spurs Apple to do something interesting with the well nigh useless (for me) iPhone lockscreen. Brent Caswell takes a crack at a redesign:

What isn’t very straightforward is the lockscreen. I set out to make the lockscreen flexible and open to the apps on your device, without throwing everything that works really well out the window. But before I get to my ideas, how does the lockscreen work now?

Seems like a missed opportunity for both high utility and magnificent delight.

Via Daring Fireball

P.S. Without having to jailbreak the device.


simplekml

Link parkin’: https://twitter.com/pypi/status/275454924970680320

simplekml is a python package which enables you to generate KML with as little effort as possible.

Took a quick look at the docs and seems quite elegant, thorough, and useful.


Sleepin’

On the drive in to work today I was listening to the local guy talk, faux sports talk, AM show. They opened a segment with a freestyle rap which I could immediately ID as Q-Tip but couldn’t place the track name.

DJ finally announced it as Excursions from The Low End Theory. Hit the parking garage, got outside of the building, and had it downloaded in 5 minutes through the modern miracle of LTE.

Life’s good, but why the hell have I been sleepin’ on The Low End Theory? Gotta’ get that on my Jesusphone 5 post haste.


Jacque Vaughn?!

TIL that Jacque Vaughn is the head coach of the Orlando Magic. I remember how I used to love college basketball and Vaughn’s Kansas Jayhawks team was one to watch every game.

Man I’m old.


PyCon 2013 Talks

The list of PyCon 2013 talks was announced:

As you may already know, this was an incredibly hard decision for the Program Committee: we had over 450 submissions for only 114 slots on the program. Further, the quality of submissions was very high; the committee debated each and every talk very closely. I want to sincerely thank everyone who submitted a talk: the quality of PyCon comes from our speakers, and this year you all blew it out of the water.

Sounds like the potential for a great program. Unfortunately, I can’t be as excited as I was last year since I’m not as confident that I can get away. Still, I’m quite interested in the more hardcore groupings like “Big Data” (yeah I know buzzword compliant) and even more so in the tutorials lineup.


Python REST API Toolkits and Search

Following the Python PyPI Twitter stream reveals a number of toolkits that help you build RESTful APIs. These modules usually get you most of the way mapping models from an ORM into an API. However, where they’ve always fallen short for me is in easing the generation of search end points. Maybe I’m doing something wrong, but I need more help in 1) handling incoming query args, 2) turning them into searches against the model(s), and 3) generating a RESTful response, especially the resource URLs.
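
For the record, here’s the shape of what I mean, as a stdlib-only sketch. The in-memory rows stand in for an ORM table, and every name here (PLAYERS, SEARCHABLE, the /players URL prefix) is hypothetical. A real toolkit would do the same three steps against actual models.

```python
from urllib.parse import parse_qs

# Hypothetical in-memory rows standing in for an ORM-backed model.
PLAYERS = [
    {"id": 1, "name": "Arian Foster", "team": "HOU", "position": "RB"},
    {"id": 2, "name": "Randall Cobb", "team": "GB", "position": "WR"},
    {"id": 3, "name": "Greg Jennings", "team": "GB", "position": "WR"},
]
SEARCHABLE = {"team", "position"}


def parse_query(query_string):
    """1) Keep only whitelisted args, taking the first value of each."""
    args = parse_qs(query_string)
    return {k: v[0] for k, v in args.items() if k in SEARCHABLE}


def search(rows, filters):
    """2) Apply the filters as exact-match predicates."""
    return [r for r in rows if all(r[k] == v for k, v in filters.items())]


def respond(rows, base="/players"):
    """3) Build a response that carries resource URLs."""
    return {
        "count": len(rows),
        "results": [{"url": f"{base}/{r['id']}/", "name": r["name"]} for r in rows],
    }
```

Whitelisting the searchable fields up front keeps stray query args from turning into accidental filters.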

LazyWeb, make it so!


Geospatial, Time series, and Torque

Torque is a (relatively) recently announced toolkit built on top of CartoDB 2.0 for performant browser rendering of large geospatially oriented time series data:

Torque allows you to create beautiful visualizations with big temporal datasets by bundling HTML5 browser rendering technologies with a generic and efficient temporal data transfer format created using the CartoDB SQL API. Torque visualisations work on desktop and ipads, and work well on temporal datasets with hundreds of thousands or even millions of datapoints. In anticipation of the Strata Conference starting this week in London, we have prepared some examples to share. Simon Tokumine will be there, so ping us for a private demo there.

Previously, I had found CartoDB a bit daunting for a hosted self-installation, but looking at the requirements now it doesn’t seem that bad. Feels like one of those tools where if you absorb the pain and get in on the ground floor, one can build a distinct competitive advantage. Working on tablets and desktops is a nice tease.


Diggin’ On Supervisord

Finally got a chance to put supervisord into action at work. It’s very handy for organizing, launching, and managing a bunch of server processes. I’m especially liking the interactive command shell provided by supervisorctl. Plus I think it’s easily launchable from a cron @reboot action which is great for this non-root user.


Pydoop

How have I not heard of Pydoop until now?! https://twitter.com/pypi/status/274179982736121856

Welcome to Pydoop. Pydoop is a package that provides a Python API for Hadoop MapReduce and HDFS. Pydoop has several advantages over Hadoop’s built-in solutions for Python programming, i.e., Hadoop Streaming and Jython: being a CPython package, it allows you to access all standard library and third party modules, some of which may not be available for other Python implementations – e.g., NumPy; in addition, Pydoop provides a Python HDFS API which, to the best of our knowledge, is not available in other solutions.


Sans Irony

Rafe Colburn’s reference to Christy Wampole’s How to Live Without Irony finally pushed me over the edge and I had to read her opinion piece. A good screed but before one gets all self-congratulatory about not falling for hipsterism, please check out David Foster Wallace’s “E Unibus Pluram: Television and U.S. Fiction”.

Admittedly Wampole is probably quite aware of the essay, but I’m not sure how many in the target audience are. This was the most impactful of the essays I read from DFW’s collection “A Supposedly Fun Thing I’ll Never Do Again” and has stuck with me for a very long time. Pretty much convinced me of the corrosiveness of television beyond the silliness of much of the content. Mule Variations has a good look at the importance of “E Unibus Pluram”, but it’s not a bad way to “ease” into DFW, if there is such a thing.

Hipsters are one thing but good luck assailing irony within mass media.


Uptime

374 days of uptime, a bit over one year, is tricky with any machine and limited sysadmin attention. Pattin’ myself on the back, and a thumbs up for Linode. Now I can get around to doing a distribution upgrade.

P.S. That’s the machine that hosts this here very blog.


Wayne Enterprises Chronicles: Week 12

Victory!! A week ago, I said there wasn’t much sweeter than going into the Sunday night game, knowing you had a win in hand. Well, that feeling is topped by going into Sunday, period, fairly comfortable you’re going to win and clinch a playoff spot.

I didn’t really plan it this way, but I had two fantasy players in each of Thanksgiving Day’s three NFL games:

  • Robert Griffin III lit up Dallas, and delivered 35 fantasy points.
  • Miles Austin, thanks for nothing
  • Arian Foster, got his usual 20+, 26 in fact, befitting the #1 overall pick
  • Shayne Graham, got the benefit of overtime for 11 points out of the kicker position
  • Stevan Ridley, picked up some garbage points in New England’s shellacking of the Jets, delivering 15.7 fantasy points
  • I learned my lesson last week about the New England DEF and they rewarded me this week with 25 in the bank

Rolled into Sunday with 113 points on the board, and 2 players left. Going into Sunday night, Aaron Rodgers would have needed a historic fantasy performance without giving up much to Randall Cobb, who I was starting. Neither did particularly well, which was fine by me.

That’s three wins in a row. And if my math’s correct, I’m back in the fantasy playoffs.


Think Sports Stats

Wondering how hard it would be to build a good set of autodidact materials for learning probability and statistics similar to Think Stats and Baseball Hacks. The twist would be to use sports repositories à la Retrosheet, cleanly integrated into pandas for analysis (bonus points for IPython HTML notebook usage), and then have a follow up course on building data oriented, interactive web sites.

Might be a market for that.


Curious

Completely unscientific, coming from my mother-in-law’s basement in Palos Heights, Illinois, using speedtest.net:

  • iPhone 5, AT&T LTE, 8.5 Mbps down, 0.9 up
  • iPad 3gen, Verizon LTE, 24.2 Mbps down, 15.8 up
  • iPhone 5, Comcast XFinity Wi-Fi, 0.5 Mbps down, 2.5 up

The XFinity pipe might be carrying the burden of IPTV on the cable box, but doesn’t seem like it’s living up to its billing. Upside is there’s no explicit bandwidth cap.

Verizon Wow!!

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.