
Swimming With Shark

Oy! I had forgotten how painful the bleeding edge of academic research can be. Spent most of the day wrasslin’ with Shark from the Berkeley AMPLab, Go Bears! Shark = Spark + Hive. I’m trying to build the BigRAM (TM) chops and this seems to be a relatively straightforward way to do so.

Thanks to a version mismatch between HDFS client libraries, I got sucked into rebuilding most of the Shark stack. Managed to tame Scala, Hive, Spark, and sbt, but couldn’t eliminate the pesky incompatibility. Being behind a firewall while having lots of remote dependencies didn’t help things. I swear Java has more ways to define an HTTP proxy than any platform needs. Currently Shark 1 - Me 0.

Oh well. It’s off to jar hunting tomorrow.


More Evidence

In-memory data processing of massive data needs a catchy nickname. I hereby christen thee Big RAM (™). Probably won’t stick, but maybe I can find some unsuspecting proposal to slap it into. Unfortunately, it’s too close to bigram to be distinctly visible in search engines. Might be able to make some useful social media handles out of that.

In any event, herewith two more nuggets regarding the march of the RAM/SSD combo architecture. First a bit of a rant on SSDs, from the not disinterested Brian Bulkowski:

The economics of flash memory are staggering. If you’re not using SSD, you are doing it wrong.

Not quite true, but close. Some small applications fit entirely in memory – less than 100GB – great for in-memory solutions. There’s a place for rotational drives (HDD) in massive streaming analytics and petabytes of data. But for the vast space between, flash has become the only sensible option.

My quick scan of the comments didn’t turn up anyone unearthing glaring fundamental errors, other than relying on Dell Enterprise SSD pricing as a baseline, so read with an appropriate grain of salt.

Then today, yet another announcement from Amazon Web Services:

Our new High Memory Cluster Eight Extra Large (cr1.8xlarge) instance type is designed to host applications that have a voracious need for compute power, memory, and network bandwidth such as in-memory databases, graph databases, and memory intensive HPC.

Here are the specs:

  • Two Intel E5-2670 processors running at 2.6 GHz with Intel Turbo Boost and NUMA support.
  • 244 GiB of RAM.
  • Two 120 GB SSD for instance storage.
  • 10 Gigabit networking with support for Cluster Placement Groups.
  • HVM virtualization only.
  • Support for EBS-backed AMIs only.

This is a real workhorse instance, with a total of 88 ECU (EC2 Compute Units). You can use it to run applications that are hungry for lots of memory and that can take advantage of 32 Hyperthreaded cores (16 per processor). We expect this instance type to be a great fit for in-memory analytics systems like SAP HANA and memory-hungry scientific problems such as genome assembly.

At $3.50 an hour, rent six for $21/hour, and eight hours gets you 192 cores with an aggregate of nearly 1.5 TB of RAM and 1.4 TB of SSD, all for $168 out of your pocket.

Now that would be a fun day at the office!


Timely Advice

From last year’s post on MLK’s Drum Major Instinct speech:

The drum major instinct is real. (Yes!) And you know what else it causes to happen? It often causes us to live above our means. (Make it plain!) It’s nothing but the drum major instinct. Do you ever see people buy cars that they can’t even begin to buy in terms of their income? (Amen!) [laughter] You’ve seen people riding around in Cadillacs and Chryslers who don’t earn enough to have a good T-Model Ford. (Make it plain!) But it feeds a repressed ego. …

The drum major instinct can lead to exclusivism in one’s thinking and can lead one to feel that because he has some training, he’s a little better than that person who doesn’t have it. Or because he has some economic security, that he’s a little better than that person who doesn’t have it. And that’s the uncontrolled, perverted use of the drum major instinct.

Still good advice.


Mesos Tech Talk

I need to watch this Airbnb Tech Talk on Apache Mesos:

Video: http://www.youtube.com/watch?v=Hal00g8o1iY (embedded video; link has since rotted)

Benjamin Hindman is one of the creators of Mesos, a platform for building and running distributed systems. He has done research in the areas of programming languages and distributed systems as a graduate student at Berkeley, where he hopes to one day finish his PhD. These days he spends most of his time hacking on Mesos at Twitter where it is being used in production.

For some reason I’m really intrigued by the potential of Mesos in production. Go Bears!


SSD Projections

From the Ars Technica article “SSD predictions at CES: fewer OEMs, lower prices”:

Storage enthusiasts sitting on the edge of your seats for revolutionary SSD announcements out of this year’s CES can rest easy: there’s not anything mind-blowing coming up that you need to be worried about. Ars sat down today with both LSI/Sandforce and Samsung, and while both had plenty of neat stuff to talk about with regard to their current product line, neither had anything earthshaking to share. Like the headline says, this isn’t necessarily a bad thing: now’s an excellent time to buy an SSD if you don’t already have one, and the ever-present enthusiast fear of buying something that will soon be obsolete or out of date isn’t one that really applies for solid state disks.

Having seen at work the performance delta that SSDs can provide, I jumped on this bandwagon at home and bought a 500 GB Samsung 840 for a grand total of $345. Also grabbed a Western Digital 1 TB drive for $80. I’m following my aforementioned plan of souping up Ye Olde MacBook to squeeze out another year or two as my personal laptop.

I don’t claim to be the sharpest tool in the shed, but this SSD trend has serious implications for advanced data analysis and processing.


O’Reilly In-Memory Report

Looking forward to this upcoming big data product from O’Reilly:

In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries on large, distributed data stores. Established technology companies have had interesting offerings, but what initially caught our attention were open source projects that started gaining traction last year.


Wiz Nuggets

The Wiz managed to hang on against the Denver Nuggets for a win. Notably, the game was on the road against a decent team. The Wizards started letting it slip away, but guys like John Wall and Bradley Beal stepped up big in some late possessions.

This is progress.


BlinkDB

Cool! BlinkDB provides querying of massive data at interactive timescales, by letting you declare how much error you’re willing to tolerate in the answer. Hive on top, Hadoop and/or Spark underneath.

Today’s web is predominantly data-driven. People increasingly depend on enormous amounts of data (spanning terabytes or even petabytes in size) to make intelligent business and personal decisions. Often the time it takes to make these decisions is critical. However, unfortunately, quickly analyzing large volumes of data poses significant challenges. For instance, scanning 1TB of data may take minutes, even when the data is spread across hundreds of machines and read in parallel. BlinkDB is a massively parallel, sampling-based approximate query engine for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can often make perfect decisions in the absence of perfect answers.
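
None of this is BlinkDB’s actual interface, but the core statistical idea fits in a few lines of Python: aggregate over a uniform sample and report an error bound, rather than scanning everything. A minimal sketch:

    import math
    import random

    def approx_mean(values, sample_size, z=1.96):
        # Uniform random sample, then a ~95% confidence half-width via
        # the normal approximation: z * s / sqrt(n).
        sample = random.sample(values, sample_size)
        n = len(sample)
        mean = sum(sample) / float(n)
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        return mean, z * math.sqrt(var / n)

    data = [random.random() for _ in range(1000000)]
    est, err = approx_mean(data, 10000)
    print("mean ~ %.4f +/- %.4f" % (est, err))

The bound shrinks like 1/sqrt(n), which is why sampling a small fraction of a huge table can answer within a declared tolerance at interactive speed.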

Paper preprint available. Go Bears!!

Via @bigdata


PgSQL Indexes

Two recent Postgres performance nuggets, both touching on indexes.

First, Craig Kerstiens talks about performance measurement with pg_stat, but lands on conditional indexes:

To further optimize this we would create a conditional OR composite index. A conditional would be where only current = true, whereas the composite would index both values. A conditional is commonly more valuable when you have a smaller set of what the values may be, meanwhile the composite is when you have a high variability of values.

Next, the Instagram team shares some of how they’ve scaled with Postgres. They offer functional indexes as one of their tips:

On some of our tables, we need to index strings (for example, 64 character base64 tokens) that are quite long, and creating an index on those strings ends up duplicating a lot of data. For these, Postgres’ functional index feature can be very helpful:
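
The Instagram quote trails off right where their SQL sat. Here’s a rough sketch of the three flavors (conditional, composite, and functional) issued from Python with psycopg2; every table and column name is invented for illustration:

    import psycopg2

    # All schema names here are invented for illustration.
    conn = psycopg2.connect("dbname=example")
    cur = conn.cursor()

    # Conditional (partial) index: only rows with current = true get
    # indexed, keeping the index small when most rows are historical.
    cur.execute(
        "CREATE INDEX events_current_idx ON events (user_id) "
        "WHERE current = true")

    # Composite index: covers queries filtering on both columns,
    # better when the values are highly variable.
    cur.execute(
        "CREATE INDEX events_user_current_idx ON events (user_id, current)")

    # Functional index, Instagram-style: index a short prefix of a
    # long base64 token instead of duplicating the whole string.
    cur.execute(
        "CREATE INDEX tokens_prefix_idx ON tokens (substr(token, 0, 8))")

    conn.commit()

One caveat on the functional flavor: Postgres only uses it for queries that filter on the identical expression, e.g. WHERE substr(token, 0, 8) = substr('...', 0, 8).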

Both very good to know.


TPM IdeaLab, Subscribed

Been meaning to put Talking Points Memo’s Idea Lab in the ole’ feedroll since Carl Franzen was one of the few to cover the most recent State of the Map:

The tensions facing the community were on full display at OpenStreetMap’s annual “State of the Map USA” conference in Portland, Oregon from October 13 through 14, a frenzied, jam-packed series of over 50 presentations and countless other informal talks between avid geographers and programmers who sprawled over a few generally overcrowded rooms at the Oregon Convention Center, fueled by coffee (beer at night) and their boundless enthusiasm for using and improving the vast and increasingly vital public map.

Subscribed!


Data Hacking Books

An interesting collection of books concerning “big data” and “data science”. I already have a couple in hand, but Russell Jurney’s Agile Data looks especially intriguing.

Working with data is HARD. Let’s face it, you’re brave to even attempt it, let alone make it your everyday job.

Fortunately, some incredibly talented people have taken the time to compile and share their deep knowledge for you.

Here are 7 books we recommend for picking up some new skills in 2013:

Via Mortar Data


MF Urban Coyotes

Now I can stop complaining. My Mark Farina Urban Coyotes CD is physically in hand. This is the first physical CD I’ve bought since Farina’s Mushroom Jazz 7, and I haven’t even downloaded much over the past year and a half. So I’m looking forward to getting something new in the mix. Don’t have time to rip it tonight, but it’ll be on the Jesusphone 5 posthaste.


20 Mile Marchin’

The Art of Manliness dug into the book Great by Choice, and pulled a common theme of successful organizations, “The 20 Mile March”:

Collins and Morten dubbed the slow and steady approach taken by Southwest and other 10X companies “The 20 Mile March.” They took this moniker from imagining a man determined to walk across the United States, and how he could accomplish his goal faster by committing to walking 20 miles every single day – rain or shine — rather than walking for 40-50 miles in good weather and then very few miles or not at all during inclement conditions.

My biggest achievements last year, collecting lots of Tweets and blogging continuously, inadvertently embraced The 20 Mile March ethos. I’ll be looking to build my 2013 resolutions with even more awareness of the components of this approach. Weight reduction/exercise and side-hacking are going to make repeat appearances this year. Let’s see if applying 20 Mile March techniques helps increase my level of success.


Wiz Hawks

I don’t know if John Wall is going to make that much of a difference overall, but his return definitely put a pep in the step of the Wiz. They actually rolled to a relatively easy victory over the Hawks.

Two in a row! Who’d a thunk the Wiz would have a winning streak this year.


Sad

Sad news from the campus newspaper of my alma mater:

Computer activist Aaron H. Swartz committed suicide in New York City yesterday, Jan. 11, according to his uncle, Michael Wolf, in a comment to The Tech. Swartz was 26.

The accomplished Swartz co-authored the now widely-used RSS 1.0 specification at age 14, founded Infogami which later merged with the popular social news site reddit, and completed a fellowship at Harvard’s Ethics Center Lab on Institutional Corruption. In 2010, he founded DemandProgress.org, a “campaign against the Internet censorship bills SOPA/PIPA.”

I didn’t know Swartz, but knew of him, from his precocious emergence as a teen in Chicago (when I lived there) to his various Web, Internet, and political activities. He was of that ultra-principled, uber-nerd ilk reminiscent of Richard Stallman, Philip Greenspun, Olin Shivers, Eric Raymond, and Mark Pilgrim. It’s one thing to be able to hack, but quite another to be able to translate that focus into the public sphere.

We should all be so lucky to have done as much as he did in the time that we have been given.


Diggin’ On GraphLab

What is GraphLab?

GraphLab is a graph-based, high performance, distributed computation framework written in C++. While GraphLab was originally developed for Machine Learning tasks, it has found great success at a broad range of other data-mining tasks; out-performing other abstractions by orders of magnitude.

In the process of experimenting with alternatives to Hadoop for work, I revisited GraphLab. It’s come a long way since its 1.0 release. A lot more usable, and the performance is off the charts. Let me put it to you this way: I’m able to run graph algorithms on graphs with over 60 million nodes and 1 billion edges that complete in minutes. Often in the time it would take a Hadoop job to even get started.

And often on a 2-year-old Dell Optiplex with 2 lousy cores and desktop-grade I/O. Okay, it can’t do billions of edges fast, but an order of magnitude down is quite reasonable.

You can do a lot of damage with a decent amount of RAM and a handful of cheap, stock SSDs. The floor at which you really need to buy into distributed, horizontal scaling is much higher than a lot of people think.
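
Nothing GraphLab-specific in that observation, but to make the in-memory point concrete, here’s a toy PageRank over an edge list held in RAM. Real frameworks beat this naive Python by orders of magnitude, which is exactly the pitch:

    from collections import defaultdict

    def pagerank(edges, iters=20, d=0.85):
        # Toy PageRank over an in-memory edge list; ignores
        # dangling-node mass, which is fine for illustration.
        out = defaultdict(list)
        nodes = set()
        for src, dst in edges:
            out[src].append(dst)
            nodes.update((src, dst))
        n = len(nodes)
        rank = dict.fromkeys(nodes, 1.0 / n)
        for _ in range(iters):
            nxt = dict.fromkeys(nodes, (1.0 - d) / n)
            for src, dsts in out.items():
                share = d * rank[src] / len(dsts)
                for dst in dsts:
                    nxt[dst] += share
            rank = nxt
        return rank

    print(pagerank([(1, 2), (2, 3), (3, 1), (3, 2)]))

The whole working set is a couple of dicts, so tens of millions of edges fit comfortably in the RAM of a single commodity box.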


pluggin’ away

As I get more experience building “command shells” in Python, I’ve become interested in making them easily extensible. In the same way that cliff is my go-to command-line building toolkit, maybe stevedore can plug this hole:

Python makes loading code dynamically easy, allowing you to configure and extend your application by discovering and loading extensions (“plugins”) at runtime. Many applications implement their own library for doing this, using import or importlib. stevedore avoids creating yet another extension mechanism by building on top of setuptools entry points. The code for managing entry points tends to be repetitive, though, so stevedore provides manager classes for implementing common patterns for using dynamically loaded extensions.
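
A minimal sketch of the pattern, with the namespace and plugin names invented: plugins advertise themselves through setuptools entry points, and stevedore’s ExtensionManager discovers and instantiates them at runtime.

    from stevedore import extension

    # Plugins would register under this (hypothetical) namespace in
    # their own setup.py, e.g.:
    #   entry_points={
    #       'myshell.commands': ['hello = myshell_hello:HelloCommand'],
    #   }
    mgr = extension.ExtensionManager(
        namespace='myshell.commands',
        invoke_on_load=True,  # instantiate each plugin class on discovery
    )

    # ext.name is the entry point name; ext.obj is the live instance.
    def report(ext):
        return (ext.name, ext.obj)

    for name, plugin in mgr.map(report):
        print("loaded plugin %s: %r" % (name, plugin))

The shell itself never imports the plugin modules by name, which is the whole point: drop a new package into the environment and the command shows up.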


Common Crawl’s New URL Index

Scott Robertson put together an index of the URLs in the Common Crawl dataset, so everyone doesn’t have to trawl the whole contents looking for where the links are:

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).

Keeping with Common Crawl tradition we’re making the entire index available as a giant download. Fear not, there’s no need to rack up bandwidth bills downloading the entire thing. We’ve implemented it as a prefixed b-tree so you can access parts of it randomly from S3 using byte range requests. At the same time, you’re free to download the entire beast and work with it directly if you desire.

Information about the format, and samples of accessing it using python are available on github. Feel free to post questions in the issue tracker and wikis there.
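
The byte-range trick is plain HTTP. A hedged sketch in Python, with the URL a stand-in since the real bucket and key layout live in the github docs:

    import requests

    # Stand-in URL; consult the Common Crawl URL Index docs for the
    # real bucket and key names.
    INDEX_URL = "https://example-bucket.s3.amazonaws.com/url-index"

    def fetch_block(url, offset, size):
        # Fetch one block of the index via an HTTP Range request,
        # rather than downloading the whole file.
        headers = {"Range": "bytes=%d-%d" % (offset, offset + size - 1)}
        resp = requests.get(url, headers=headers)
        resp.raise_for_status()  # expect 206 Partial Content
        return resp.content

    # Read the first 4 KB, enough for a header or root block in a
    # typical prefixed b-tree layout.
    root = fetch_block(INDEX_URL, 0, 4096)

From there you walk down the tree, issuing one small ranged GET per node, instead of racking up bandwidth bills on the giant download.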

Combine with a previous hare-brained scheme to good effect.


blekko izik

blekko unleashed a tablet search application. Might be worth checking out. Good to know blekko is still competing hard and creatively in the search space. Maybe their zag will be to make search a lot more usable everywhere instead of gobbling up every search in the world.

Friends of blekko!

We are very pleased to announce izik, our new tablet search app. We launched izik on Friday, and today it is currently the #3 free reference app in the Apple app store.

We believe that the move to the tablet from the desktop/laptop is an environmental shift in how people consume web content. We have developed a search product that addresses the following unique problems of tablet search:


New Bitly APIs

For whatever reason, today was chock full of link fodder. Some of it I’ll be saving for later in the week, but Hilary Mason’s announcement of new, real-time, streaming APIs tickles my fancy.

We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.

Already scheming some data collection and analysis projects.


Wiz Thunder

I was there in person. Beal for 2 to take the lead @ 0.3 seconds in the fourth. Didn’t realize how undermanned the Wiz were. It was a bit more exciting than the BCS National Championship Game.


Python Hadooping

In my day job, I’ve been using Yelp’s mrjob framework to run a lot of Hadoop Map-Reduce jobs. Takes a lot of the Java pain away. So it was with quite a bit of interest that I dug into Uri Laserson’s A Guide to Python Frameworks for Hadoop:

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

His bottom line is that straight Hadoop Streaming with Python has the best performance. Meanwhile, mrjob is well maintained, active, and quite productive, but comes with a significant performance hit.
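
For reference, raw streaming really is just stdin and stdout. Here’s a minimal sketch of the usual word-count pair; the file names and details are mine, not Laserson’s:

    #!/usr/bin/env python
    # mapper.py: Hadoop Streaming feeds lines on stdin; emit
    # tab-separated key/value pairs on stdout.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word.lower(), 1))

    #!/usr/bin/env python
    # reducer.py: Hadoop sorts mapper output by key, so equal keys
    # arrive consecutively and fold neatly with groupby.
    import sys
    from itertools import groupby

    def pairs(stream):
        for line in stream:
            key, value = line.rstrip("\n").split("\t", 1)
            yield key, int(value)

    for key, group in groupby(pairs(sys.stdin), key=lambda kv: kv[0]):
        print("%s\t%d" % (key, sum(v for _, v in group)))

The same pipeline runs with no Hadoop at all: cat input.txt | python mapper.py | sort | python reducer.py.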

A small sample size, but worth reading to know all your options if you’re into mixing elephants and snakes.


A Decade of NFL Play-by-Play

This archive of NFL play-by-play data popped back up on Hacker News. More grist for the sports data hacking mill:

I’ve recently completed a project to compile publicly-available NFL play-by-play data. It took a while, but now it’s ready.

The resulting database comprises nearly all non-preseason games from the 2002 through [edit: 2012] seasons. I have not performed any analysis on the data, so what you’ll get are only the basics—time, down, distance, yard line, play description, and score. It’s almost exactly what I started with. I’ll leave any analysis up to you.


Radar Themes 2013

I get a lot of my good tech nuggets through the O’Reilly Radar blog. This year they’ve given a heads-up on their themes, and the process behind their selection, for 2013:

In our discussions we worked through about 35 potential themes and narrowed them down to 10 that we are actively working on in the first quarter of 2013. Here’s the list along with the Radar team member who is focused on it.

Topics like In-Memory Data Management, Consumer AI, and Future of Programming look really interesting, but I’m really curious to see how Ben Lorica and Nat Torkington tie in.


Scalable Message Queueing

At work I started digging into ActiveMQ, mostly because I wanted to whale on some folks who I thought had an unhealthy devotion to some mediocre, unscalable middleware. Then I stumbled upon the ActiveMQ Apollo project:

ActiveMQ Apollo is a faster, more reliable, easier to maintain messaging broker built from the foundations of the original ActiveMQ. It accomplishes this using a radically different threading and message dispatching architecture. Like ActiveMQ, Apollo is a multi-protocol broker and supports STOMP, Openwire, MQTT, SSL, and WebSockets.

If Apollo’s benchmarks prove out in actual deployment, holy smokes!! With a 1.x release implying a certain amount of stability to boot. I would like to know how it scales with a high number of concurrent subscribers though.

And ActiveMQ isn’t all that mediocre, although it has some clear scaling limitations.


Slow Data

Stephen Few is one of those useful contrarians who actually propose serious alternatives in addition to eloquently railing against the status quo. I do believe there are some interesting technical trends and advances (e.g. newly accessible distributed programming) underlying the buzzword compliant “Big Data” movement. But Few’s Slow Data Movement is worth considering:

I’d like to introduce a set of goals that should sit alongside the 3Vs to keep us on course as we struggle to enter the information age—an era that remains elusive. May I present to you the 3Ss: small, slow, and sure.


Free Data Science Books

Carl Anderson pulls together a nice list of links to quality textbooks online regarding information retrieval and machine learning:

I’ve been impressed in recent months by the number and quality of free data science/machine learning books available online. I don’t mean free as in some guy paid for a PDF version of an O’Reilly book and then posted it online for others to use/steal, but I mean genuine published books with a free online version sanctioned by the publisher. That is, “the publisher has graciously agreed to allow a full, free version of my book to be available on this site.”

Here are a few in my collection:

Of course, be sure to help the authors out by picking up a dead trees version or two. Donate to some school in the developing world, or even the developed world, if you don’t really need the paper.


Where’s Mine?

@djmarkfarina urban coyotes could of been MJ8 :) — sick mix Mark! everyone should own. — http://t.co/BfZuHqPT #house http://t.co/UjenYV8p

Patiently awaiting my next Mark Farina mix CD. Sort of, grumble.


TIL pythonbrew

Today I learned of pythonbrew

Tweet from @pypi: https://twitter.com/pypi/status/286494923291770882 (link has rotted)

pythonbrew is a program to automate the building and installation of Python in the user’s $HOME

How did I, of all people, not know of pythonbrew?!


Scala Vice Python

Jean Yang brings an interesting perspective on why one might want to learn Scala:

Scala is also relatively easy to learn, especially compared to other strongly statically typed languages (OCaml; Haskell). I have previously said that Python is the most marketable language for beginners, but I am beginning to change my mind. Like Python, Scala has clean syntax, nice libraries, good online documentation, and lots of people in industry using it. Unlike Python, Scala also has a static type system that can prevent you from doing bad things (whereas in Python the bad things just happen when you run the program).

You could say I’m somewhat biased towards PhDs and MIT folks though, as they do tend to take their programming languages seriously.

Cribbed from the TypeSafe Blog.


@bigdata’s 2012 Toolset

I’m slowly following in Ben Lorica’s big data footsteps. On his recommendation I’ve decided to jump into Scala, starting with Odersky et al.’s Programming in Scala. Like what I’m seeing so far: functional programming, with a somewhat Pythonic sensibility, and performant exploitation of the JVM.


Was It Achieved?

Before I generate a set of resolutions for 2013, I thought it would be worthwhile to check in on the results of last year’s goals:

  • Improve Physical Health. Wash. Lost some weight, pretty close to target, but didn’t really incorporate exercise. Improved the diet, but failed to play any ultimate.
  • Maintain Financial Health. Achieved. Hit all targets.
  • Shine The Skillz. Wash. Did not get into any real programmer events, but work offered a lot more hacking opportunity than I anticipated.
  • Expand The Network. Fail. I joined the MIT DC Alumni Club, but haven’t had a chance to attend any events. Missed on everything else.
  • Get Rid of Stuff. Complete Fail.
  • Build A Tribe. I’ll call it a fail, although the jury’s still out, pending my performance reviews at work. Even without that feedback, I’m still disappointed I haven’t been able to energize a really high performing group of people, with a good sense of camaraderie.
  • Less Watch, More Do. Achieved. Took a sabbatical from college football to good effect. Also severely cut down on the channel surfing.

A couple of wins, a couple of washes, and three fails, so an overall “enh” from me. These were intended to be stretch goals so I can’t complain too much.

Given a few other things that happened in my life in 2012, I’ll happily bid goodbye to the annum, but hold back on the good riddance.


40 Posts Per

Interesting. Despite being a pretty consistent poster over the last 15 months or so, I’ve never had a 40-post month. New goal for 2013.


Boinging the Common Crawl

Here’s an interesting hack I really don’t have time to execute on. Take a BoingBoing data dump and dissect either references to BoingBoing pages or outlinks from the archive using the Common Crawl dataset. Might be some interesting intersections, or you could trace out the patterns of BoingBoing influence.

Tangential side project: build up a set of mrjob modules to work with either dataset. The BoingBoing stuff comes in one big file, so it might be useful for someone else to bust it up into some smaller units and stuff them onto Amazon S3, or otherwise make them publicly available.


Wayne Enterprises Chronicles: Playoffs

DEFEAT!! After a great run at the end of the season, I got hit with one of those vexing fantasy defeats to flush me out of title contention. Appropriate on this final weekend of the NFL regular season that I document my collapse.

On the one hand, I got beat by a sizeable margin, 122.7 to 85.3. So it would appear my roster choices really wouldn’t have made a difference. On the other hand, consider this:

  • I listened to the fantasy “experts” and at the last minute substituted the red hot Danario Alexander instead of my usual Eric Decker. Alexander goose egg, Decker 23 fantasy points, me as GM -23.
  • On the same expert advice I played the St. Louis Rams DEF for a grand total of 1 point. I could have easily played one or two other units for 6-9 more points.
  • I stuck with Lawrence Tynes as my kicker, admittedly with no encouragement, for the only game where the Giants got shut out. That means Tynes got 0 points too.
  • Arian Foster came up short of his projection by 8 points.
  • I pooh-poohed a roster move by my opponent, benching Torrey Smith and playing Brandon Lloyd. +23 for him.
  • The opposition survived a 0.1 performance from his QB position.

So as I float into the fantasy off-season, I’ll console myself with the thought that my optimal lineup wouldn’t have won. But three self-inflicted zeros leave a quite bitter taste. If I had added 30 points, who knows what could have happened.


PgSQL Bulk Ingest of JSON

After months of sometimes painful experimentation, here’s a technique others may find useful. It’s for those who need to ingest millions of JSON objects, collected from the wild and stored as line-oriented text files, into a PostgreSQL table.

  1. Transform your JSON lines into base64-encoded lines
  2. Start a transaction
  3. Create a temp table that deletes on commit
  4. COPY from your hex encoded file into the temp table
  5. Run an INSERT into your destination table that draws from a SELECT that base64 decodes the JSON column
  6. COMMIT, dropping the temp table
  7. Profit!!
  8. For bonus points, use the temp table to delete duplicate records or other cleanup

The benefit of this approach is that it saves you from dealing with escaping and encoding horrors. Basically, JSON can contain any character, so it’s almost impossible to safely outsmart stupid users and PgSQL strictness. Once you’re over a million records, you’ll get bit by one or the other.

Reliably applied to over 300 million Tweets from the Twitter streaming API.

Update: If you’re doing this with Python, make sure to use psycopg2, since you can control the transaction isolation and it supports COPY.
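
For the record, a minimal sketch of the recipe with psycopg2; the destination table tweets(doc json) and the file names are placeholders:

    import base64
    import psycopg2

    # Placeholders throughout: a destination table tweets(doc json)
    # and a local file of JSON objects, one per line.
    conn = psycopg2.connect("dbname=example")
    cur = conn.cursor()

    # Step 1: base64-encode each line so COPY never sees a problem
    # character (b64encode emits no newlines).
    with open("tweets.json", "rb") as src, open("tweets.b64", "wb") as dst:
        for line in src:
            dst.write(base64.b64encode(line.strip()) + b"\n")

    # Steps 2-4: psycopg2 opens the transaction implicitly; the temp
    # table disappears when it commits.
    cur.execute("CREATE TEMP TABLE tweets_stage (doc_b64 text) ON COMMIT DROP")
    with open("tweets.b64") as f:
        cur.copy_expert("COPY tweets_stage (doc_b64) FROM STDIN", f)

    # Step 5: decode back to text and cast to json on the way in.
    cur.execute("""
        INSERT INTO tweets (doc)
        SELECT convert_from(decode(doc_b64, 'base64'), 'UTF8')::json
          FROM tweets_stage
    """)

    conn.commit()  # step 6: commit and drop the staging table

For the bonus-point dedupe, slip a DELETE or an anti-join against the destination table between the COPY and the INSERT.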


For Budding Data Scientists

Hilary Mason has some advice:

I get quite a few e-mail messages from very smart people who are looking to get started in data science. Here’s what I usually tell them:

The best way to get started in data science is to DO data science!

First, data scientists do three fundamentally different things: math, code (and engineer systems), and communicate. Figure out which one of these you’re weakest at, and do a project that enhances your capabilities. Then figure out which one of these you’re best at, and pick a project which shows off your abilities.

Good input, although I’m still somewhat partial to Sean Taylor’s “Ask, Answer, Tell” for what data scientists do. I think coding is important, but while engineering systems can be scientific, the practice is much more pragmatic and presents quite different challenges than scientific exploration. Probably a minor quibble, but engineering is quite different from science.

Still, getting your hands dirty is better than noodling in the abstract on the sidelines.


Scala and IntelliJ

If I’m going to get serious about learning Scala, I’m going to need a serious development environment. My default would be to figure something out with Emacs, but it’s time to check out the modern kit.

Jeethu Rao documented getting IntelliJ IDEA set up for Scala:

The class used the Typesafe Scala IDE, which is built on Eclipse. I’ve been using JetBrains PyCharm for most of my python work and when they had IntelliJ 12 Ultimate Edition on sale at a 75% discount last week, I bought it. Last night, I spent a bit of time setting up IntelliJ to work with Scala on my Retina MBP. I’m posting this as a future reference.

Hope IDEA is a bit lighter weight than Eclipse, whose enormity has always confounded me.


DBPedia and SQLite

DBpedia has always felt like a great resource to me, but I’ve never taken the time to play with the data, being somewhat daunted by RDF and triplestores.

Jean-Louis Fuchs has done yeoman’s work in creating a repository of DBpedia infoboxes in SQLite and releasing his import script. May have to grab it and experiment.

When I learned about DBpedia, I wanted to have it installed locally. I read the tutorials on sparql and how to install DBpedia and thought: that has to be simpler. I kind of worship Simple and I didn’t want to learn yet another query language. So I hacked an SQLite import script for DBpedia types and infobox-properties in one evening. The script takes about 50 hours to read instance_types_en.nt and infobox_properties_en.nt. It creates a table per type and a table per property, after tables are created everything gets an index, the whole database is analyzed and finally vacuumed.

An aside thought: wonder if there’s a straightforward way to take DBpedia data and stuff it into a full-text indexing system like ElasticSearch? You’d give up the reasoning properties of a SPARQL-based triplestore, but would have quick-and-dirty access to a significant semi-structured knowledge repository.


2 mrjob Thoughts

I’ve been putting mrjob to the task quite a bit recently. Two quick thoughts on a highly useful package.

First, Map/Reduce is a fairly handy data processing framework even if you don’t need a ton of scalability. Any complex UNIX text file processing involving transforms, filters, and aggregations might be more easily expressed as a Map/Reduce computation. mrjob makes that quite simple to do on a single machine at a high level.
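
As a tiny illustration of that simplicity, the canonical counting job; the module and class names are mine:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        # Runs inline on one machine by default; the identical code
        # scales up with -r hadoop or -r emr.

        def mapper(self, _, line):
            # Raw input protocol hands us lines; the key is ignored.
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

python mr_word_count.py input.txt runs the whole pipeline in a single process, no Hadoop installation required.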

Second, mrjob supports the bundling and uploading of custom code and data to support the job. I’ve mainly been using it to add Python extension modules, but there’s no reason one couldn’t include supplemental data in binary form, like a Python pickle or an SQLite DB. Very handy for map-side augmentation or filtering, presuming the data set isn’t too large.
