
Hadoop Munging OSM

Michal Migurski crisply documents his Hadoop/Elastic Map Reduce process for working with Open Street Map data:

Usable line generalization for OSM roads and routes has been a hobby project of mine for several years now, since I argued for it at the first U.S. State of the Map conference in Atlanta, 2010. I’ve finally put the last piece of this project in place with the use of Hadoop to parallelize the geometry processing.

I’ve learned a lot about moving geographic data between Postgres and Hadoop. The result is available at Streets and Routes.

The end result is something I’m super happy about: a complete worldwide dataset of simplified roads and routes that’s suitable for high-quality labels and route shields.

A couple of punch lines to that final statement. First, Migurski was a relative newbie to Hadoop, and didn’t use anything more sophisticated than Hadoop Streaming and straight Python with Shapely. Second, his expensive run cost him $9.40 and 7 hours of waiting.
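For flavor, here’s a minimal sketch of what a Hadoop Streaming mapper for line generalization might look like. This is not Migurski’s code. The record layout (an id, a tab, then comma-separated "x y" pairs) is a made-up assumption, and the hand-rolled Douglas-Peucker routine is a standard-library stand-in for the simplification that Shapely would normally handle.

```python
import sys
from math import hypot


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b, in planar coordinates."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Douglas-Peucker: keep the endpoints, recurse on the farthest vertex."""
    if len(points) < 3:
        return list(points)
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], points[0], points[-1])
        if d > dmax:
            index, dmax = i, d
    if dmax <= tolerance:
        return [points[0], points[-1]]
    left = simplify(points[: index + 1], tolerance)
    right = simplify(points[index:], tolerance)
    return left[:-1] + right  # drop the duplicated split vertex


def map_line(line, tolerance=1.0):
    """Simplify one hypothetical 'id<TAB>x1 y1,x2 y2,...' record."""
    key, coords = line.rstrip("\n").split("\t", 1)
    points = [tuple(float(v) for v in pair.split()) for pair in coords.split(",")]
    kept = simplify(points, tolerance)
    return key + "\t" + ",".join("%g %g" % p for p in kept)


def run_mapper(lines, out, tolerance=1.0):
    """Streaming contract: records in on one stream, records out on another."""
    for line in lines:
        if line.strip():
            out.write(map_line(line, tolerance) + "\n")
```

Used as an actual streaming mapper, the script would end with a call like run_mapper(sys.stdin, sys.stdout). Since each road is processed independently, the job parallelizes with no reducer logic at all.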


Posterous Done

Hey, this blog has been around long enough to see even YCombinator grade companies be born, exit, and shut down:

Posterous launched in 2008. Our mission was to make it easier to share photos and connect with your social networks. Since joining Twitter almost one year ago, we’ve been able to continue that journey, building features to help you discover and share what’s happening in the world – on an even larger scale.

On April 30th, we will turn off posterous.com and our mobile apps in order to focus 100% of our efforts on Twitter. This means that as of April 30, Posterous Spaces will no longer be available either to view or to edit.

Right now and over the next couple months until April 30th, you can download all of your Posterous Spaces including your photos, videos, and documents.

I always thought Posterous had promise, and even used it lightly for some personal side-blogging, but it never really quite clicked or demonstrated a “wow” feature moment. I think the two towers of WordPress and Tumblr have pretty much sucked up all the air of blogging tools.

A worthy attempt though!


M*F*in Flu

They were not kidding this year. This flu is kicking my ass, and I even had the dang’ preventative shot.

At least I’m not passing out on The Left Coast, and no, fingers crossed, pink eye.


Graph Based Recommendation

OEmbed Link rot on URL: https://twitter.com/andyhickl/status/303691512284327937

Color me skeptical, but Reco4j might be worth a test drive:

Reco4j is an open source project that aims at developing a recommendation framework based on graph data sources. We choose graph databases for several reasons. They are NoSQL databases that are “schemaless”. This means that it is possible to extend the basic data structure with intermediate information, i.e. similarity value between item and so on. Moreover, since every information are expressed with some properties, nodes and relations, the recommendation process can be customized to work on every graph.

Hmmmm. But are they high quality recommendations for every graph?


Snap Polling

Saw this post by Carl Franzen a while ago on IdeaLab. Found it interesting that there may be a new “polling science” that could emerge out of real-time social media analytics. The trick will be having some real science, backed by validation.

Social media is a big draw for over 2.3 billion users around the globe, but its true value is only starting to be unlocked, according to Poptip, a New York-based startup company has seen early success running quick, realtime text analysis of tweets on Twitter as a kind of hi-tech snap polling method.

Mainstream companies from Pepsi to People Magazine to ESPN have begun experimenting with Poptip’s first and to date, only product, an app that allows users to post their questions on Twitter, then tracks and displays responses from respondents in realtime.

Even curiouser, the original post seems to have disappeared from IdeaLab. Factually incorrect? Scammed? Weird. Need to do some digging.


Using psql

Craig Kerstiens lets loose with his preferred interface to Postgres:

Sometimes it leans more to, what is the Sequel Pro equivilant for Postgres. My default answer is I just use psql, though I do have to then go on to explain how I use it. For those just interested you can read more below or just get the highlights here:

There’s a few handy TILs in there.


Precog in PgSQL

Love to do as much as possible within Postgres:

Today, the Precog team has released a free implementation of Precog for PostgreSQL. Precog for PostgreSQL empowers users to easily perform data science on PostgreSQL.


Morphology Is King

Quoting Lucas Gonze, who’s always been a good read. Might be worth a sub:

I’m blogging about new form factors for computing on a Tumblr blog at formfactor.gonze.com. My topic there is post-mobile contexts like biohacking, head mounted displays, wearable computing, smart TV, haptics, and neural interfaces. The subtext is new applications enabled by changes in the physical shape of devices.


Pandas Whirlwind

Gotta go easy tonight. Some bug has gotten hold of me.


Data Science Toolkit Details

I sort of already knew of Pete Warden’s Data Science Toolkit, but Ajay Ohri gives a nice overview of what’s actually inside:

The Data Science Toolkit is a collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command line, Python and Javascript interfaces. Available as a self-contained Vagrant VM or EC2 AMI that you can deploy yourself.

The Data Science Toolkit is essentially a specialized Linux distribution, with a lot of useful data software pre-installed and exposing a simple interface. Developer documentation is quite nicely done.


Urban Coyotes Quick Thoughts

Mark Farina Urban Coyotes Cover

So I’ve finally been able to give Mark Farina’s Urban Coyotes mix a few listens. Enjoying the blend although my current stance is that it’s at the quality of some of his Great Lakes Audio podcast mixes. This isn’t to say that Urban Coyotes is bad, just not super, duper, mega, head nodding awesome like some other Farina House music mixes. In particular, the only peak for me is Flapjackers Candy.

So far, not at the top of the pantheon, but I need to give it some more wear for fit. I’m definitely not sending it back though.

Am I True Chicago (™) though in immediately trainspotting Ron Magers on the voice overs? The “coyotes” pronunciation bugs me however. Should be three syllables, not two, with a long e.


Will It Python?

A series on data analysis, from R to Python, definitely worth tracking:

Will it Python? posts are my attempts to port data analyses originally done in R into Python.

The objective isn’t to just make a key that translates functions and methods in R into Python equivalents. Instead, the goal is to reproduce the results and insights of the analysis in idiomatic Python (to the extent I’m qualified to judge such a thing). Sometimes there will be a direct translation from a line of R to a line of Python; other times Python will suggest an altogether different approach to the problem.


More Linden Links

Greg Linden is back with another collection of interesting links. A skoosh more business oriented than is my wont, and probably more than any one human should read in a single sitting, but still some good nuggets to be had:

The future of maps on smartphones: “It’ll be like you’re a local everywhere you go. You’ll know your way through the back alleys and hutongs of Beijing, you’ll know your way all around Paris even if you’ve never been before. Signs will seem to translate themselves for you. This kind of extra-smartness is coming to people.” (1)


Revisiting Prismatic

Previously, Prismatic was pretty low on the Yet Another Place To Find News depth chart. Feeds are Number 1, Twitter is Number 2, Hacker News 3, a few sub-Reddits 4, and nothing else really mattered.

However I’m getting disillusioned with Hacker News and need a break. Not enough good, core tech nuggets, and too much aspirational and political content reaching the front page. And I also trawl the new submission page but that’s pretty hit or miss.

So I’m going to provisionally bump Prismatic up to the three hole, and try to replace Hacker News with it. I think Prismatic’s link saving/stashing mechanism has greatly improved since my last serious usage, so maybe it’ll feel much more useful.

Update. Forgot about one reason I stopped using Prismatic regularly. Whatever JavaScript Fu it’s doing, Prismatic routinely crashes Safari on my iPad. Downer.


Wiz Nets

That’s 3 home wins in 1 week, including taking out Manhattan and Brooklyn. Got any more boroughs for us to beat NYC?


Other Python and Geo Playthings

Some other things that caught my fancy recently

Geobases

This project provides tools to play with geographical data. It also works with non-geographical data, except for map visualizations :).

Glue

Glue is a Python library to explore relationships within and among related datasets.

On the current state of reverse geocoding:

At Flickr we spent a while working on turning the point where a photo was set on map (or whose GPS coordinates were shoved into the EXIF) into a place. The work of reverse geocoding is about taking a point, and finding out which polygon its in. This is a well solved problem. With two caveats:

  1. places don’t have neat boundaries, but overlap all over each other. And people disagree about the overlaps

  2. even if places had neat boundaries, and people agreed on them, availability of information about those boundaries is variable at best.
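To make the "which polygon is it in" step concrete, here’s a minimal sketch using even-odd ray casting. It assumes simple planar rings with no holes, ignores things like the antimeridian, and the place names and coordinates are invented for illustration. It returns every match, since per caveat 1 places overlap.

```python
def point_in_polygon(lon, lat, ring):
    """Even-odd ray casting; ring is a list of (lon, lat) vertices."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray cast from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def reverse_geocode(lon, lat, places):
    """Return the names of all polygons that contain the point."""
    return [name for name, ring in places if point_in_polygon(lon, lat, ring)]


# Hypothetical, deliberately overlapping places.
PLACES = [
    ("Downtown", [(0, 0), (4, 0), (4, 4), (0, 4)]),
    ("Riverside", [(3, 1), (7, 1), (7, 5), (3, 5)]),
]
```

A linear scan like this is fine for a handful of polygons. A real reverse geocoder would put a spatial index in front of it so only nearby candidates get the full test.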

Once I made a request for a pure geo indexing capability. Ask and ye shall receive.

A Java reverse geocoder that uses Geotools and JTS to index Flickr Shapefiles to map (latitude, longitude) to WOEID.

The code has no external server dependencies and runs in memory. You may have to specify a larger max heap size if you use the largest files (e.g. the localities dataset uses around 450Mb on my Mac).


Mongo Geo Indexes

New Geospatial Indexes with GeoJSON and Improved Spherical Geometry

The 2.3 series adds a new type of geospatial index that supports improved spherical queries and GeoJSON.

MongoDB may have its failings and critics, but it’s here, working and getting better. New geospatial indexing features might make me take another look for some ad hoc hackery.


Fast, Cheap, and Persistent

Operating System Implications of Fast, Cheap, Non-Volatile Memory

The existence of two basic levels of storage (fast/volatile and slow/non-volatile) has been a long-standing premise of most computer systems, influencing the design of OS components, including file systems, virtual memory, scheduling, execution models, and even their APIs. Emerging resistive memory technologies – such as phase-change memory (PCM) and memristors – have the potential to provide large, fast, non-volatile memory systems, changing the assumptions that motivated the design of current operating systems. This paper examines the implications of non-volatile memories on a number of OS mechanisms, functions, and properties.

Starting to think UltraRAM rolls off the tongue better than my tongue-in-cheek moniker BigRAM. Besides being cheap enough to splurge on, RAM will be different in the future.


Big Data Etymology

This is just of general interest to me, but there’s also the specific angle that I’ve personally met the guy who trademarked “bigdata”. Smart guy. Was wondering where that term actually originated. Will be stuff of lore for years to come.

The term Big Data is so generic that the hunt for its origin was not just an effort to find an early reference to those two words being used together. Instead, the goal was the early use of the term that suggests its present connotation — that is, not just a lot of data, but different types of data handled in new ways.


The Funniest Tweet

… of last night, Super Bowl Sunday, at least for me. Then again it was timed perfectly for the blackout, I’m a big Dark Knight fan, and I’m really easily amused.


A Chihuahua in Your Pig

That would be Pig UDF for GraphChi. A complete hack of the best kind:

Pig is a powerful query language for Hadoop commonly used for large scale data processing. Now it is possible to run GraphChi programs as parts of Pig-scripts, with just one line of script! This allows easy huge scale graph computation with data stored in HDFS (Hadoop File System). As GraphChi will ultimately execute only on a single Hadoop machine (see HowGraphChiForPigWorks), the size of the Hadoop Cluster is not a limiting factor.

GraphChi for Pig is a viable alternative to Giraph, which is a distributed graph engine built on top of Hadoop. With GraphChi, you can develop your algorithms on your laptop (with realistically sized data) and then deploy them to run the big cluster. GraphChi will also often run faster and uses much less resources than alternatives.


Stuper Sunday

From the best (only?!) post to American McCarver in a while:

But football is a contest and picking sides is what competition is about. So who does a disinterested party root for this weekend?

Both teams are populated with terrible human beings, so that doesn’t help. (I’m tempted to give the edge to the Ravens, because at least Ray Lewis knew what he was doing when he denied it later.) The normally reliable heuristic of rooting against a Harbaugh doesn’t work. Colin Kaepernick is the obvious hero from every football movie ever, but nobody should have to cheer for someone with a chin beard. Seriously, Colin, stop taking grooming tips from Brian Wilson.

This is the least interest I’ve had in a Super Bowl since that Denver-Atlanta stinker in 1999. I absentmindedly listened to that one on radio and slept through part of it. In 2013, I have no sympathy for any of the players. As an East Bay aficionado, I can’t stand the ’Niners and their fans. Not to mention their jerk of a former Stanford coach. Even though I have Baltimore ties, the Ravens are the Browns, not the Colts. And Ray Ray’s act and history irk me to no end.

Guess I’ll hold my nose and go for the Ravens just for Ozzie Newsome’s sake.

Bonus. Sapp before Haley? I’m really just starting to dislike the NFL.


Mission Accomplished

MacBook HDs Snap

With the help of an iFixit YouTube video and repair guide, I managed to complete the upgrade of Ye Olde MacBook as previously advertised. Unfortunately, the guide delivered with the MCE OptiBay seems specific to MacBook Pros, with no info for us old MacBook peasants. The iFixit material did the job in a pinch though. Way more tiny screws were involved than I had planned for, but the machine currently seems none the worse for the experience.

Now I need to put the HD through its paces a little, making it the home of my iTunes Library. But I feel good this little MacBook can make it to its fifth anniversary and well beyond.


TIL EIN

I’ve been looking for a decent mash up of IPython and Emacs. Will definitely test drive EIN:

Emacs IPython Notebook (EIN) provides a IPython Notebook client and integrated REPL (like SLIME) in Emacs. While EIN makes notebook editing very powerful by allowing you to use any Emacs features, it also expose IPython features such as code evaluation, object inspection and code completion to the Emacs side. These features can be accessed anywhere in Emacs and improve Python code editing and reading in Emacs.

Via John D. Cook


A Mallet in Your Pig

Diving deeper into the Hadoop ecosystem, I’m starting to appreciate languages like Hive and Pig much more. The ability to extend each with UDFs really expands their possibilities.

Here’s an old, but still neat, hack from Jacob Perkins, The Data Chef. He’s folding the capabilities of Mallet, an open source topic modeling toolkit, into the Pig language:

I’m going to use Apache Pig and Mallet, a java based machine learning and natural language processing library to discover topics in the 20 newsgroups data set. This corpus is nice since each document already belongs to a newsgroup (a topic) and so it gives us a way of checking how well our topic discovery is doing. …

So it’s clear that we’re going to need a java udf to do the actual topic clustering. Right? This udf will operate on a DataBag of documents and return a DataBag containing the discovered topics.

Extensibility For The Win.


44

Barack Obama Button

Finally broke the 40 post in a month barrier, this being the 44th, so I thought it only appropriate to give a nod to the reelection and inauguration of the 44th President of the United States.

One thing I can say is, thank God all that political advertising has gone away for a while here in battleground Virginia. Still traumatized from the bombardment.


Wiz Logo

Wizards Ball Logo

If nothing else, the Wizards have a cool logo this year. Good enough that I bought my first piece of sports logo apparel in years. A winter hat with this logo as a patch, for the rest of DC’s wintry days. Nice bonus that it’s not a walking brand advertisement to boot, shades of Cayce Pollard.


Indiepocalypse Now

Andy Baio thinks we’ve seen a phase change for creative endeavors:

Digital distribution subverted the monopolies held by physical distribution, bypassing distribution deals with record stores entirely, allowing artists to sell directly to fans. Social media and online music services changed the way people discover music, making the payola systems of MTV and radio airplay feel quaint. Production costs dropped dramatically as computers became more powerful and audio editing software got dirt cheap, along with new services for printing on demand. And, finally, Kickstarter and other crowdfunding platforms offset the financial risk to artists.

Most importantly, each new platform let artists find, communicate, and sell directly to their fans.

I want to believe, but whenever these transformational moments involve transition for mega-corporate concerns (if not outright downfall), I have this thought that there’s still an Empire Strikes Back chapter to be written.


Time To Learn Vagrant…

’Cuz Pete Warden said so:

I have fallen in love with Vagrant over the last year, it turns an entire logical computer as a single unit of software. In simple terms, you can easily set up, run, and maintain a virtual machine image with all the frameworks and data dependencies pre-installed. You can wipe it, copy it to a different system, branch it to run experimental changes, keep multiple versions around, easily share it with other people, and quickly deploy multiple copies when you need to scale up. It’s as revolutionary as the introduction of distributed source control systems, you’re suddenly free to innovate because mistakes can be painlessly rolled back, and you can collaborate other people without worrying that anything will be overwritten.


People Are Cluing In

The Basketball Jones discovers that our Wiz don’t completely suck:

Over the last few weeks, though, there’s been life, legitimate life out of DC. Not only did they win five straight at home, but they beat a handful of pretty good teams — the Thunder, Hawks and the previously mentioned Bulls on that Saturday-nighter — and absolutely ran a couple lousy ones off the court, destroying the Magic 120-91, and drubbing the Timberwolves 114-101 in a game that wasn’t even as close as the final score indicated. Even on their five-game West Coast road trip in between home stints, they went a respectable 2-3, beating two solid teams in the Nuggets and Blazers, and not losing any game by more than seven points. The improvement from the team that started 4-28 was evident, to say the least.

But as of this writing, at the end of the third quarter, Andrew’s ’76ers are in good position. Wiz with 63 lousy points.

Even worse, Nick Young is having a good revenge game, although it’s still “good bye, good riddance”.


Mining the News

We describe and evaluate methods for learning to forecast forthcoming events of interest from a corpus containing 22 years of news stories. We consider the examples of identifying significant increases in the likelihood of disease out- breaks, deaths, and riots in advance of the occurrence of these events in the world. We provide details of methods and studies, including the automated extraction and generalization of sequences of events from news corpora and multiple web resources. We evaluate the predictive power of the approach on real-world events withheld from the system.

Mining the Web to Predict Future Events (PDF heads-up)

Reads cool, but for two things. First, they’re only mining scraped stories from The New York Times. Not quite The Web in my book. Okay, slight credit for exploiting Linked Data, but still.

Second, scraping The New York Times?! C’mon Redmond. You’re better than that. Ha, ha! Only serious!


Me 1 - Shark 1

Previously, UC Berkeley’s Shark had been giving me build fits. Finally tamed the savage beast. Had to resort to java -verbose:class to figure out where various classes and jars were coming from. Then a little surgery to put the appropriate Hadoop jars in the proper place and voilà, I can run the Shark examples against an hdfs repository.
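The jar hunt can be semi-automated. Here’s a minimal sketch that groups loaded classes by where they came from, assuming the "[Loaded <class> from <source>]" line shape that older HotSpot JVMs print for -verbose:class (newer JVMs format this differently). The sample lines and paths are hypothetical.

```python
import re

# Assumed line shape from older HotSpot JVMs, e.g.
#   [Loaded org.apache.hadoop.fs.Path from file:/opt/hadoop/hadoop-core.jar]
LOADED = re.compile(r"^\[Loaded (\S+) from (.+)\]$")


def classes_by_source(lines, prefix=""):
    """Map each jar or directory to the classes loaded from it."""
    sources = {}
    for line in lines:
        match = LOADED.match(line.strip())
        if match and match.group(1).startswith(prefix):
            sources.setdefault(match.group(2), []).append(match.group(1))
    return sources


# Hypothetical output: two different Hadoop jars on the same classpath.
SAMPLE = [
    "[Opened /usr/lib/jvm/jre/lib/rt.jar]",
    "[Loaded java.lang.Object from /usr/lib/jvm/jre/lib/rt.jar]",
    "[Loaded org.apache.hadoop.fs.Path from file:/opt/shark/lib/hadoop-core-a.jar]",
    "[Loaded org.apache.hadoop.conf.Configuration from file:/opt/hadoop/hadoop-core-b.jar]",
]
```

Filtering on a package prefix such as "org.apache.hadoop." and finding more than one source in the result is a decent hint that mismatched jars are sharing a classpath.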

With that out of the way, I’m now getting sucked into Apache Hive development. Yum, data crunching for fun and profit. I didn’t read too closely but I like what I perused of Programming Hive by Dean Wampler. Feels comprehensive and up to date.


Lurv’ Me Some SSD

Samsung SSD

The first step in resuscitating Ye Olde MacBook has taken place. I replaced its 320 GB spinning disk hard drive with a 500 GB version of the Samsung SSD pictured to the right.

The thing is way overkill, with a 6 Gb / sec transfer rate on a bus that can only do 1.5 Gb / sec tops. But lordy’, has it made my measly old 4 GB RAM laptop feel usable. Recently it had been pretty neglected, grinding to a halt when I had both Chrome and Firefox open with a bunch of tabs. Now it feels great with nary a pause or spinning beach ball. A larger sample size is needed, but if this keeps up, the machine may rise again to prime blogging platform.

Next up, I need to deconstruct the machine’s innards, replace the Apple-issued SuperDrive with an OptiBay, and put 1 TB of spinning platters inside. That should do the trick for at least a year or two.

Update. But I’m also feeling the lack of the Retina display. To be continued.


Pandas High Performance File Parsing

An oldie but goodie deep dive into improving pandas’ csv file parsing by Wes McKinney:

TL;DR I’ve finally gotten around to building the high performance parser engine that pandas deserves. It hasn’t been released yet (it’s in a branch on GitHub) but will after I give it a month or so for any remaining buglets to shake out:

That was from way back in October 2012. Since we’re up to pandas 0.10.1 this feature should be pretty stable.


Wiz Bulls

Surprising to me the Wiz pulled away in the third quarter, stifling the Bulls offense for over eight minutes. They coasted the rest of the way, shutting down a few feeble surges. It should be noted that the Bulls were playing without Luol Deng.

Still need a larger sample size, but the Wiz have at least moved from under the barrel to the bottom of the barrel. And that’s within shouting distance of the playoffs in the Eastern Conference. Who’d a thunk?!


Maps In Tweets

MapBox has a new feature that lets you embed a map in a tweet. I just think this is dang cool! But as a studied Twitter observer, I’m also interested in the various media types that can be distributed through Twitter. Seems like there’s a research project or two in there.

MapBox maps are now fully integrated with Twitter. When you tweet a link to any map hosted on MapBox, it will show up inline on Twitter.com and in Twitter’s mobile apps. Your custom styles and markers will all show up as inline images with the title and description of your map. And when you click on them, they’ll take you to the full interactive map on MapBox.com.


Freedom To Tinker Predictions

Had to let ’em simmer for a bit, but Freedom To Tinker’s annual predictions are always a good read. If it all happens, it’s gonna be a bad year, but particularly in the skies:

Civilian versions of military UAVs, like the Predator, will gain broader approval for use in domestic airspace and will be rapidly adopted by the obvious government agencies (e.g. police departments, border patrol, USGS) as well as all manner of unexpected non-government applications (e.g. traffic reporting, aerial banner advertising). “Deconflicting” airspace will become a hot topic for discussion.

You have to search and scroll a bit, but the earlier years’ predictions often had a surprisingly prescient nugget. For example, there are certain bets regarding the CFAA that are quite on point given the recent Aaron Swartz tragedy.

And of course everyone has a clunker or two:

(28) Facebook will be sold for $4 billion and Mark Zuckerberg will step down as CEO.

That’s the price of prognostication.


Swimming With Shark

Oy! I had forgotten how painful the bleeding edge of academic research can be. Spent most of the day wrasslin’ with Shark from the Berkeley AMPLab, Go Bears! Shark = Spark + Hive. I’m trying to build the BigRAM (TM) chops and this seems to be a relatively straightforward way to do so.

Thanks to a version mismatch between HDFS client libraries, I got sucked into rebuilding most of the Shark stack. Managed to tame Scala, Hive, Spark, and sbt, but couldn’t eliminate the pesky incompatibility. Being behind a firewall while having lots of remote dependencies didn’t help things. I swear Java has more different ways to define an http proxy. Currently Shark 1 - Me 0.

Oh well. It’s off to jar hunting tomorrow.


More Evidence

In-memory data processing of massive data needs a catchy nickname. I hereby christen thee Big RAM (™). Probably won’t stick but maybe I can find some unsuspecting proposal to slap it into. Unfortunately it’s too close to “bigram” to be distinctively visible in search engines. Might be able to make some useful social media handles out of that.

In any event, herewith two more nuggets regarding the march of the RAM/SSD combo architecture. First a bit of a rant on SSDs, from the not disinterested Brian Bulkowski:

The economics of flash memory are staggering. If you’re not using SSD, you are doing it wrong.

Not quite true, but close. Some small applications fit entirely in memory – less than 100GB – great for in-memory solutions. There’s a place for rotational drives (HDD) in massive streaming analytics and petabytes of data. But for the vast space between, flash has become the only sensible option.

My quick scan of the comments didn’t see anyone unearthing blinding fundamental errors, other than relying on Dell Enterprise SSD pricing as a baseline, but read with an appropriate grain of salt.

Then today, yet another announcement from Amazon Web Services:

Our new High Memory Cluster Eight Extra Large (cr1.8xlarge) instance type is designed to host applications that have a voracious need for compute power, memory, and network bandwidth such as in-memory databases, graph databases, and memory intensive HPC.

Here are the specs:

  • Two Intel E5-2670 processors running at 2.6 GHz with Intel Turbo Boost and NUMA support.
  • 244 GiB of RAM.
  • Two 120 GB SSD for instance storage.
  • 10 Gigabit networking with support for Cluster Placement Groups.
  • HVM virtualization only.
  • Support for EBS-backed AMIs only.

This is a real workhorse instance, with a total of 88 ECU (EC2 Compute Units). You can use it to run applications that are hungry for lots of memory and that can take advantage of 32 Hyperthreaded cores (16 per processor). We expect this instance type to be a great fit for in-memory analytics systems like SAP HANA and memory-hungry scientific problems such as genome assembly.

At $3.50 an hour each, rent 6 for $21/hour, and get off on eight hours of 192 hyperthreaded cores with an aggregate of more than 1 TB of RAM and more than 1 TB of SSD, for $168 out of your pocket.

Now that would be a fun day at the office!
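For the skeptical, the back-of-the-envelope math can be checked in a few lines. The per-instance figures come from the AWS specs quoted above, the $3.50 hourly price is the one used in this post, and the six-instance, eight-hour day is the hypothetical splurge.

```python
# Per-instance figures for cr1.8xlarge, taken from the quoted specs.
PRICE_PER_HOUR = 3.50   # USD per instance-hour, as used in this post
RAM_GIB = 244
SSD_GB = 2 * 120        # two 120 GB SSDs
HYPERTHREADS = 32       # 16 per processor

# The hypothetical fun day at the office.
instances = 6
hours = 8

hourly_cost = instances * PRICE_PER_HOUR   # 21.0 USD/hour
total_cost = hourly_cost * hours           # 168.0 USD
total_threads = instances * HYPERTHREADS   # 192
total_ram_gib = instances * RAM_GIB        # 1464 GiB
total_ssd_gb = instances * SSD_GB          # 1440 GB
```

So the 1 TB figures are on the conservative side: six instances come to roughly 1.4 TiB of RAM and 1.4 TB of SSD.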


Timely Advice

From last year’s post on MLK’s Drum Major Instinct speech:

The drum major instinct is real. (Yes!) And you know what else it causes to happen? It often causes us to live above our means. (Make it plain!) It’s nothing but the drum major instinct. Do you ever see people buy cars that they can’t even begin to buy in terms of their income? (Amen!) [laughter] You’ve seen people riding around in Cadillacs and Chryslers who don’t earn enough to have a good T-Model Ford. (Make it plain!) But it feeds a repressed ego. …

The drum major instinct can lead to exclusivism in one’s thinking and can lead one to feel that because he has some training, he’s a little better than that person who doesn’t have it. Or because he has some economic security, that he’s a little better than that person who doesn’t have it. And that’s the uncontrolled, perverted use of the drum major instinct.

Still good advice.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.