
Other Python and Geo Playthings

Some other things that caught my fancy recently

Geobases

This project provides tools to play with geographical data. It also works with non-geographical data, except for map visualizations :).

Glue

Glue is a Python library to explore relationships within and among related datasets.

On the current state of reverse geocoding:

At Flickr we spent a while working on turning the point where a photo was set on a map (or whose GPS coordinates were shoved into the EXIF) into a place. The work of reverse geocoding is about taking a point, and finding out which polygon it’s in. This is a well solved problem. With two caveats:

  1. places don’t have neat boundaries, but overlap all over each other. And people disagree about the overlaps

  2. even if places had neat boundaries, and people agreed on them, availability of information about those boundaries is variable at best.
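For the curious, here is a minimal sketch of the "which polygon is it in" step, using the classic ray-casting test. It is not Flickr's code: the place names and coordinates are made up, and it treats longitude/latitude as flat x/y, which is only tolerable for small areas. Note that it returns every matching place, since (per caveat 1) places overlap.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]   # (x, y), e.g. (longitude, latitude)
Polygon = List[Point]         # ring of vertices; the closing edge is implied


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: count the edges crossed by a ray heading east from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def reverse_geocode(point: Point, places: Dict[str, Polygon]) -> List[str]:
    """Return every place whose polygon contains the point (places may overlap)."""
    return [name for name, poly in places.items() if point_in_polygon(point, poly)]


# Hypothetical, deliberately overlapping places.
PLACES = {
    "downtown": [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)],
    "old_town": [(3.0, 3.0), (6.0, 3.0), (6.0, 6.0), (3.0, 6.0)],
}

print(reverse_geocode((3.5, 3.5), PLACES))
print(reverse_geocode((1.0, 1.0), PLACES))
```

A real reverse geocoder would put a spatial index in front of this loop so it only tests a handful of candidate polygons per point.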

Once I made a request for a pure geo indexing capability. Ask and ye shall receive.

A Java reverse geocoder that uses Geotools and JTS to index Flickr Shapefiles to map (latitude, longitude) to WOEID.

The code has no external server dependencies and runs in memory. You may have to specify a larger max heap size if you use the largest files (e.g. the localities dataset uses around 450Mb on my Mac).


Mongo Geo Indexes

New Geospatial Indexes with GeoJSON and Improved Spherical Geometry

The 2.3 series adds a new type of geospatial index that supports improved spherical queries and GeoJSON.

MongoDB may have its failings and critics, but it’s here, working and getting better. New geospatial indexing features might make me take another look for some ad hoc hackery.


Fast, Cheap, and Persistent

[embed]https://twitter.com/bigdata/status/297397203150970882[/embed]

Operating System Implications of Fast, Cheap, Non-Volatile Memory

The existence of two basic levels of storage (fast/volatile and slow/non-volatile) has been a long-standing premise of most computer systems, influencing the design of OS components, including file systems, virtual memory, scheduling, execution models, and even their APIs. Emerging resistive memory technologies – such as phase-change memory (PCM) and memristors – have the potential to provide large, fast, non-volatile memory systems, changing the assumptions that motivated the design of current operating systems. This paper examines the implications of non-volatile memories on a number of OS mechanisms, functions, and properties.

Starting to think UltraRAM rolls off the tongue better than my tongue-in-cheek moniker BigRAM. Besides being cheap enough to splurge on, RAM will be different in the future.


Big Data Etymology

This is just of general interest to me, but there’s also the specific angle that I’ve personally met the guy who trademarked “bigdata”. Smart guy. Was wondering where that term actually originated. Will be stuff of lore for years to come.

The term Big Data is so generic that the hunt for its origin was not just an effort to find an early reference to those two words being used together. Instead, the goal was the early use of the term that suggests its present connotation — that is, not just a lot of data, but different types of data handled in new ways.


The Funniest Tweet

[embed]https://twitter.com/attackerman/status/298244044126044161[/embed]

… of last night, Super Bowl Sunday, at least for me. Then again it was timed perfectly for the blackout, I’m a big Dark Knight fan, and I’m really easily amused.


A Chihuahua in Your Pig

That would be Pig UDF for GraphChi. A complete hack of the best kind:

Pig is a powerful query language for Hadoop commonly used for large scale data processing. Now it is possible to run GraphChi programs as parts of Pig-scripts, with just one line of script! This allows easy huge scale graph computation with data stored in HDFS (Hadoop File System). As GraphChi will ultimately execute only on a single Hadoop machine (see HowGraphChiForPigWorks), the size of the Hadoop Cluster is not a limiting factor.

GraphChi for Pig is a viable alternative to Giraph, which is a distributed graph engine built on top of Hadoop. With GraphChi, you can develop your algorithms on your laptop (with realistically sized data) and then deploy them to run the big cluster. GraphChi will also often run faster and uses much less resources than alternatives.


Stuper Sunday

From the best (only?!) post to American McCarver in a while:

But football is a contest and picking sides is what competition is about. So who does a disinterested party root for this weekend?

Both teams are populated with terrible human beings, so that doesn’t help. (I’m tempted to give the edge to the Ravens, because at least Ray Lewis knew what he was doing when he denied it later.) The normally reliable heuristic of rooting against a Harbaugh doesn’t work. Colin Kaepernick is the obvious hero from every football movie ever, but nobody should have to cheer for someone with a chin beard. Seriously, Colin, stop taking grooming tips from Brian Wilson.

This is the least interest I’ve had in a Super Bowl since that Denver-Atlanta stinker in 1999. I absentmindedly listened to that one on radio and slept through part of it. In 2013, I have no sympathy for any of the players. As an East Bay aficionado, I can’t stand the ’Niners and their fans. Not to mention their jerk of a former Stanford coach. Even though I have Baltimore ties, the Ravens are the Browns, not the Colts. And Ray Ray’s act and history irk me to no end.

Guess I’ll hold my nose and go for the Ravens just for Ozzie Newsome’s sake.

Bonus. Sapp before Haley? I’m really just starting to dislike the NFL.


Mission Accomplished

MacBook HDs Snap

With the help of an iFixit YouTube video and repair guide, I managed to complete the upgrading of Ye Olde MacBook as previously advertised. Unfortunately, the guide delivered with the MCE OptiBay seems specific to MacBook Pros, with no info for us old MacBook peasants. The iFixit material did in a pinch though. Way more tiny screws were involved than I planned, but the machine currently seems none the worse for the experience.

Now I need to put the HD through its paces a little, making it the home of my iTunes Library. But I feel good this little MacBook can make it to its fifth anniversary and well beyond.


TIL EIN

I’ve been looking for a decent mash up of IPython and Emacs. Will definitely test drive EIN:

Emacs IPython Notebook (EIN) provides a IPython Notebook client and integrated REPL (like SLIME) in Emacs. While EIN makes notebook editing very powerful by allowing you to use any Emacs features, it also expose IPython features such as code evaluation, object inspection and code completion to the Emacs side. These features can be accessed anywhere in Emacs and improve Python code editing and reading in Emacs.

Via John D. Cook


A Mallet in Your Pig

Diving deeper into the Hadoop ecosystem, I’m starting to appreciate languages like Hive and Pig much more. The ability to extend each with UDFs really expands their possibilities.

Here’s an old, but still neat, hack from Jacob Perkins, The Data Chef. He’s folding the capabilities of Mallet, an open source topic modeling toolkit, into the Pig language:

I’m going to use Apache Pig and Mallet, a java based machine learning and natural language processing library to discover topics in the 20 newsgroups data set. This corpus is nice since each document already belongs to a newsgroup (a topic) and so it gives us a way of checking how well our topic discovery is doing. …

So it’s clear that we’re going to need a java udf to do the actual topic clustering. Right? This udf will operate on a DataBag of documents and return a DataBag containing the discovered topics.

Extensibility For The Win.


44

Barack Obama Button

Finally broke the 40-posts-in-a-month barrier, this being the 44th, so I thought it only appropriate to give a nod to the reelection and inauguration of the 44th President of the United States.

One thing I can say is, thank God all that political advertising has gone away for a while here in battleground Virginia. Still traumatized from the bombardment.


Wiz Logo

Wizards Ball Logo

If nothing else, the Wizards have a cool logo this year. Good enough that I bought my first piece of sports logo apparel in years. A winter hat with this logo as a patch, for the rest of DC’s wintry days. Nice bonus that it’s not a walking brand advertisement to boot, shades of Cayce Pollard.


Indiepocalypse Now

Andy Baio thinks we’ve seen a phase change for creative endeavors:

Digital distribution subverted the monopolies held by physical distribution, bypassing distribution deals with record stores entirely, allowing artists to sell directly to fans. Social media and online music services changed the way people discover music, making the payola systems of MTV and radio airplay feel quaint. Production costs dropped dramatically as computers became more powerful and audio editing software got dirt cheap, along with new services for printing on demand. And, finally, Kickstarter and other crowdfunding platforms offset the financial risk to artists.

Most importantly, each new platform let artists find, communicate, and sell directly to their fans.

I want to believe, but whenever these transformational moments involve transition for mega-corporate concerns (if not outright downfall), I have this thought that there’s still an Empire Strikes Back chapter to be written.


Time To Learn Vagrant…

’Cuz Pete Warden said so:

I have fallen in love with Vagrant over the last year, it turns an entire logical computer as a single unit of software. In simple terms, you can easily set up, run, and maintain a virtual machine image with all the frameworks and data dependencies pre-installed. You can wipe it, copy it to a different system, branch it to run experimental changes, keep multiple versions around, easily share it with other people, and quickly deploy multiple copies when you need to scale up. It’s as revolutionary as the introduction of distributed source control systems, you’re suddenly free to innovate because mistakes can be painlessly rolled back, and you can collaborate other people without worrying that anything will be overwritten.


People Are Cluing In

The Basketball Jones discovers that our Wiz don’t completely suck:

Over the last few weeks, though, there’s been life, legitimate life out of DC. Not only did they win five straight at home, but they beat a handful of pretty good teams — the Thunder, Hawks and the previously mentioned Bulls on that Saturday-nighter — and absolutely ran a couple lousy ones off the court, destroying the Magic 120-91, and drubbing the Timberwolves 114-101 in a game that wasn’t even as close as the final score indicated. Even on their five-game West Coast road trip in between home stints, they went a respectable 2-3, beating two solid teams in the Nuggets and Blazers, and not losing any game by more than seven points. The improvement from the team that started 4-28 was evident, to say the least.

But as of this writing, at the end of the third quarter, Andrew’s ’76ers are in good position. Wiz with 63 lousy points.

Even worse, Nick Young is having a good revenge game, although it’s still “good bye, good riddance”.


Mining the News

[embed]https://twitter.com/bigdata/statuses/295222087457583105[/embed]

We describe and evaluate methods for learning to forecast forthcoming events of interest from a corpus containing 22 years of news stories. We consider the examples of identifying significant increases in the likelihood of disease outbreaks, deaths, and riots in advance of the occurrence of these events in the world. We provide details of methods and studies, including the automated extraction and generalization of sequences of events from news corpora and multiple web resources. We evaluate the predictive power of the approach on real-world events withheld from the system.

Mining the Web to Predict Future Events (PDF heads-up)

Reads cool, but for two things. First, they’re only mining scraped stories from The New York Times. Not quite The Web in my book. Okay, slight credit for exploiting Linked Data, but still.

Second, scraping The New York Times?! C’mon Redmond. You’re better than that. Ha, ha! Only serious!


Me 1 - Shark 1

Previously, UC Berkeley’s Shark had been giving me build fits. Finally tamed the savage beast. Had to resort to java -verbose:class to figure out where various classes and jars were coming from. Then a little surgery to put the appropriate Hadoop jars in the proper place and voilà, I can run the Shark examples against an hdfs repository.

With that out of the way, I’m now getting sucked into Apache Hive development. Yum, data crunching for fun and profit. I didn’t read too closely but I like what I perused of Programming Hive by Dean Wampler. Feels comprehensive and up to date.


Lurv’ Me Some SSD

Samsung SSD

The first step in resuscitating Ye Olde MacBook has taken place. I replaced its 320 GB spinning disk hard drive with a 500 GB version of the Samsung SSD pictured to the right.

The thing is way overkill, with a 6 Gb / sec transfer rate on a bus that can only do 1.5 Gb / sec tops. But lordy’, has it made my measly old 4 GB RAM laptop feel usable. Recently it had been pretty neglected, grinding to a halt when I had both Chrome and Firefox open with a bunch of tabs. Now it feels great with nary a pause or spinning beach ball. A larger sample size is needed, but if this keeps up, the machine may rise again to prime blogging platform.

Next up, I need to deconstruct the machine’s innards, replace the Apple issued SuperDrive with an OptiBay, and put 1 TB of spinning platters inside. That should do the trick for at least a year or two.

Update. But I’m also feeling the lack of the Retina display. To be continued.


Pandas High Performance File Parsing

An oldie but goodie deep dive into improving pandas’ csv file parsing by Wes McKinney:

TL;DR I’ve finally gotten around to building the high performance parser engine that pandas deserves. It hasn’t been released yet (it’s in a branch on GitHub) but will after I give it a month or so for any remaining buglets to shake out:

That was from way back in October 2012. Since we’re up to pandas 0.10.1 this feature should be pretty stable.


Wiz Bulls

[embed]https://twitter.com/DidTheWizWin/status/295354859379621889[/embed]

Surprising to me the Wiz pulled away in the third quarter, stifling the Bulls offense for over eight minutes. They coasted the rest of the way, shutting down a few feeble surges. It should be noted that the Bulls were playing without Luol Deng.

Still need a larger sample size, but the Wiz have at least moved from under the barrel to the bottom of the barrel. And that’s within shouting distance of the playoffs in the Eastern Conference. Who’d a thunk?!


Maps In Tweets

MapBox has a new feature that lets you embed a map in a tweet. I just think this is dang cool! But as a studied Twitter observer, I’m also interested in the various media types that can be distributed through Twitter. Seems like there’s a research project or two in there.

MapBox maps are now fully integrated with Twitter. When you tweet a link to any map hosted on MapBox, it will show up inline on Twitter.com and in Twitter’s mobile apps. Your custom styles and markers will all show up as inline images with the title and description of your map. And when you click on them, they’ll take you to the full interactive map on MapBox.com.


Freedom To Tinker Predictions

Had to let ’em simmer for a bit, but Freedom To Tinker’s annual predictions are always a good read. If it all happens, it’s gonna be a bad year, but particularly in the skies:

Civilian versions of military UAVs, like the Predator, will gain broader approval for use in domestic airspace and will be rapidly adopted by the obvious government agencies (e.g. police departments, border patrol, USGS) as well as all manner of unexpected non-government applications (e.g. traffic reporting, aerial banner advertising). “Deconflicting” airspace will become a hot topic for discussion.

You have to search and scroll a bit, but the earlier years’ predictions often had a surprisingly prescient nugget. For example, there are certain bets regarding the CFAA that are quite on point given the recent Aaron Swartz tragedy.

And of course everyone has a clunker or two:

(28) Facebook will be sold for $4 billion and Mark Zuckerberg will step down as CEO.

That’s the price of prognostication.


Swimming With Shark

Oy! I had forgotten how painful the bleeding edge of academic research can be. Spent most of the day wrasslin’ with Shark from the Berkeley AMPLab, Go Bears! Shark = Spark + Hive. I’m trying to build the BigRAM (TM) chops and this seems to be a relatively straightforward way to do so.

Thanks to a version mismatch between HDFS client libraries, I got sucked into rebuilding most of the Shark stack. Managed to tame Scala, Hive, Spark, and sbt, but couldn’t eliminate the pesky incompatibility. Being behind a firewall while having lots of remote dependencies didn’t help things. I swear Java has more different ways to define an http proxy. Currently Shark 1 - Me 0.

Oh well. It’s off to jar hunting tomorrow.


More Evidence

In-memory data processing of massive data needs a catchy nickname. I hereby christen thee Big RAM (™). Probably won’t stick but maybe I can find some unsuspecting proposal to slap it into. Unfortunately it’s too close to bigram to be distinctively visible in search engines. Might be able to make some useful social media handles out of that.

In any event, herewith two more nuggets regarding the march of the RAM/SSD combo architecture. First a bit of a rant on SSDs, from the not disinterested Brian Bulkowski:

The economics of flash memory are staggering. If you’re not using SSD, you are doing it wrong.

Not quite true, but close. Some small applications fit entirely in memory – less than 100GB – great for in-memory solutions. There’s a place for rotational drives (HDD) in massive streaming analytics and petabytes of data. But for the vast space between, flash has become the only sensible option.

My quick scan of the comments didn’t see anyone unearthing blinding fundamental errors, other than relying on Dell Enterprise SSD pricing as a baseline, but read with an appropriate grain of salt.

Then today, yet another announcement from Amazon Web Services:

Our new High Memory Cluster Eight Extra Large (cr1.8xlarge) instance type is designed to host applications that have a voracious need for compute power, memory, and network bandwidth such as in-memory databases, graph databases, and memory intensive HPC.

Here are the specs:

  • Two Intel E5-2670 processors running at 2.6 GHz with Intel Turbo Boost and NUMA support.
  • 244 GiB of RAM.
  • Two 120 GB SSD for instance storage.
  • 10 Gigabit networking with support for Cluster Placement Groups.
  • HVM virtualization only.
  • Support for EBS-backed AMIs only.

This is a real workhorse instance, with a total of 88 ECU (EC2 Compute Units). You can use it to run applications that are hungry for lots of memory and that can take advantage of 32 Hyperthreaded cores (16 per processor). We expect this instance type to be a great fit for in-memory analytics systems like SAP HANA and memory-hungry scientific problems such as genome assembly.

At $3.50 an hour, rent 6 for $21/hour, and get off on eight hours of 192 cores with an aggregate of roughly 1.4TB of RAM and 1.4TB of SSD, for $168 out of your pocket.

Now that would be a fun day at the office!
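For anyone who wants to check the back-of-the-envelope math, here is the arithmetic spelled out. The per-instance figures come straight from the AWS specs quoted above; the six-instance, eight-hour scenario is just my daydream, and the price is the on-demand figure at the time of writing.

```python
# Per-instance figures from the cr1.8xlarge specs quoted above.
PRICE_PER_HOUR = 3.50          # USD per instance-hour
CORES_PER_INSTANCE = 32        # hyperthreaded cores (16 per processor)
RAM_GIB_PER_INSTANCE = 244
SSD_GB_PER_INSTANCE = 2 * 120  # two 120 GB SSDs

# The hypothetical fun day at the office.
INSTANCES = 6
HOURS = 8

hourly_cost = INSTANCES * PRICE_PER_HOUR
total_cost = hourly_cost * HOURS
total_cores = INSTANCES * CORES_PER_INSTANCE
total_ram_gib = INSTANCES * RAM_GIB_PER_INSTANCE
total_ssd_gb = INSTANCES * SSD_GB_PER_INSTANCE

print(f"${hourly_cost:.2f}/hour, ${total_cost:.2f} for {HOURS} hours")
print(f"{total_cores} cores, {total_ram_gib} GiB RAM, {total_ssd_gb} GB SSD")
```

So the cluster lands at 192 cores, 1464 GiB of RAM, and 1440 GB of SSD, which is comfortably past the 1TB mark on both counts.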


Timely Advice

From last year’s post on MLK’s Drum Major Instinct speech:

The drum major instinct is real. (Yes!) And you know what else it causes to happen? It often causes us to live above our means. (Make it plain!) It’s nothing but the drum major instinct. Do you ever see people buy cars that they can’t even begin to buy in terms of their income? (Amen!) [laughter] You’ve seen people riding around in Cadillacs and Chryslers who don’t earn enough to have a good T-Model Ford. (Make it plain!) But it feeds a repressed ego. …

The drum major instinct can lead to exclusivism in one’s thinking and can lead one to feel that because he has some training, he’s a little better than that person who doesn’t have it. Or because he has some economic security, that he’s a little better than that person who doesn’t have it. And that’s the uncontrolled, perverted use of the drum major instinct.

Still good advice.


Mesos Tech Talk

I need to watch this AirBnB Tech Talk on Apache Mesos:

[embed]http://www.youtube.com/watch?v=Hal00g8o1iY[/embed]

Benjamin Hindman is one of the creators of Mesos, a platform for building and running distributed systems. He has done research in the areas of programming languages and distributed systems as a graduate student at Berkeley, where he hopes to one day finish his PhD. These days he spends most of his time hacking on Mesos at Twitter where it is being used in production.

For some reason I’m really intrigued by the potential of Mesos in production. Go Bears!


SSD Projections

From the Ars Technica article: SSD predictions at CES: fewer OEMs, lower prices

Storage enthusiasts sitting on the edge of your seats for revolutionary SSD announcements out of this year’s CES can rest easy: there’s not anything mind-blowing coming up that you need to be worried about. Ars sat down today with both LSI/Sandforce and Samsung, and while both had plenty of neat stuff to talk about with regard to their current product line, neither had anything earthshaking to share. Like the headline says, this isn’t necessarily a bad thing: now’s an excellent time to buy an SSD if you don’t already have one, and the ever-present enthusiast fear of buying something that will soon be obsolete or out of date isn’t one that really applies for solid state disks.

Having seen at work the performance delta that SSDs can provide, I jumped on this bandwagon at home and bought a 500 GB Samsung 840. Grand total of $345. Also grabbed a Western Digital 1 TB drive for $80. I’m following my aforementioned plan of souping up Ye Olde MacBook to squeeze out another year or two as my personal laptop.

I don’t claim to be the sharpest tool in the shed, but this SSD trend has serious implications for advanced data analysis and processing.


O’Reilly In-Memory Report

Looking forward to this upcoming big data product from O’Reilly:

In a forthcoming report we will highlight technologies and solutions that take advantage of the decline in prices of RAM, the popularity of distributed and cloud computing systems, and the need for faster queries on large, distributed data stores. Established technology companies have had interesting offerings, but what initially caught our attention were open source projects that started gaining traction last year.


Wiz Nuggets

[embed]https://twitter.com/DidTheWizWin/status/292494965521661953[/embed]

The Wiz managed to hang on against the Denver Nuggets for a win. Notably, the game was on the road against a decent team. The Wizards started letting it slip away, but guys like John Wall and Bradley Beal stepped up big in some late possessions.

This is progress.


BlinkDB

Cool! BlinkDB provides interactive timescale querying of massive data by letting you declare how much error you’re willing to tolerate in the answer. Hive on top, Hadoop and/or Spark underneath.

Today’s web is predominantly data-driven. People increasingly depend on enormous amounts of data (spanning terabytes or even petabytes in size) to make intelligent business and personal decisions. Often the time it takes to make these decisions is critical. However, unfortunately, quickly analyzing large volumes of data poses significant challenges. For instance, scanning 1TB of data may take minutes, even when the data is spread across hundreds of machines and read in parallel. BlinkDB is a massively parallel, sampling-based approximate query engine for running interactive queries on large volumes of data. The key observation in BlinkDB is that one can make perfect decisions.

Paper preprint available. Go Bears!!
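To get a feel for the error-for-speed trade, here is a toy sketch of sampling-based approximation. This is not BlinkDB's algorithm or API, just the underlying statistical idea: answer an aggregate from a random sample and report a rough 95% margin alongside it. The function name, the seed, and the data are all made up for illustration.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=0):
    """Estimate the mean from a random sample and return (estimate, margin).

    The margin is z standard errors, a rough 95% bound when z is 1.96.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_error


# Stand-in for a "massive" table: one million integers.
data = range(1_000_000)

# Touch only 1% of the rows.
estimate, margin = approximate_mean(data, 10_000)
exact = sum(data) / len(data)

print(f"approximate: {estimate:.1f} +/- {margin:.1f}")
print(f"exact:       {exact:.1f}")
```

Quadrupling the sample size roughly halves the margin, which is the knob you would be turning when you tell the system how much error you can live with.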

Via @bigdata

[embed]https://twitter.com/bigdata/status/291228309910614016[/embed]


PgSQL Indexes

Two recent Postgres performance nuggets, both touching on indexes.

First, Craig Kerstiens talks about performance measurement with pg_stat, but lands on conditional indexes:

To further optimize this we would create a conditional OR composite index. A conditional would be where only current = true, whereas the composite would index both values. A conditional is commonly more valuable when you have a smaller set of what the values may be, meanwhile the composite is when you have a high variability of values.

Next, the Instagram team illustrates some of how they’ve scaled with Postgres. They provide functional indexes as one of their tips:

On some of our tables, we need to index strings (for example, 64 character base64 tokens) that are quite long, and creating an index on those strings ends up duplicating a lot of data. For these, Postgres’ functional index feature can be very helpful:

Both very good to know.
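Here is a small sketch of all three index flavors mentioned above. To keep it self-contained I'm using SQLite through Python's sqlite3 module as a stand-in, since it also supports partial and expression indexes; the table and column names are my own inventions, and Postgres syntax differs in the details (for example, its conditional index would read WHERE current = true).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE accounts (id INTEGER PRIMARY KEY, email TEXT, is_current INTEGER);
CREATE TABLE tokens (id INTEGER PRIMARY KEY, token TEXT);

-- Conditional (partial) index: only rows with is_current = 1 are indexed.
CREATE INDEX idx_accounts_current ON accounts (email) WHERE is_current = 1;

-- Composite index: both columns are indexed together.
CREATE INDEX idx_accounts_email_current ON accounts (email, is_current);

-- Functional (expression) index: index a short prefix of a long string.
CREATE INDEX idx_tokens_prefix ON tokens (substr(token, 1, 8));
""")

# A made-up long token; the lookup filters on the indexed prefix first,
# then confirms the full value.
token = "QUJDREVGR0hJSktMTU5PUFFSU1RVVldYWVo="
conn.execute("INSERT INTO tokens (token) VALUES (?)", (token,))
row = conn.execute(
    "SELECT id FROM tokens WHERE substr(token, 1, 8) = ? AND token = ?",
    (token[:8], token),
).fetchone()

index_names = sorted(
    r[0] for r in conn.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
)
print(index_names)
print(row)
```

The prefix index stays small because it stores eight characters per row rather than the whole token, at the cost of an extra equality check on the full string.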


TPM IdeaLab, Subscribed

Been meaning to put Talking Points Memo’s Idea Lab in the ole’ feedroll since Carl Franzen was one of the few to cover the most recent State of the Map:

The tensions facing the community were on full display at OpenStreetMap’s annual “State of the Map USA” conference in Portland, Oregon from October 13 through 14, a frenzied, jam-packed series of over 50 presentations and countless other informal talks between avid geographers and programmers who sprawled over a few generally overcrowded rooms at the Oregon Convention Center, fueled by coffee (beer at night) and their boundless enthusiasm for using and improving the vast and increasingly vital public map.

Subscribed!


Data Hacking Books

An interesting collection of books concerning “big data” and “data science”. I already have a couple in hand, but Russell Jurney’s Agile Data looks quite interesting.

Working with data is HARD. Let’s face it, you’re brave to even attempt it, let alone make it your everyday job.

Fortunately, some incredibly talented people have taken the time to compile and share their deep knowledge for you.

Here are 7 books we recommend for picking up some new skills in 2013:

Via Mortar Data

[embed]https://twitter.com/mortardata/status/291206882968891393[/embed]


MF Urban Coyotes

Now I can stop complaining. My Mark Farina, Urban Coyotes CD is physically in hand. This is the first physical CD I’ve bought since Farina’s Mushroom Jazz 7 and I haven’t even downloaded much over the past year and a half. So I’m looking forward to getting something new in the mix. Don’t have time to rip it tonight, but it’ll be on the Jesusphone 5 posthaste.


20 Mile Marchin’

The Art of Manliness dug into the book Great by Choice, and pulled a common theme of successful organizations, “The 20 Mile March”:

Collins and Morten dubbed the slow and steady approach taken by Southwest and other 10X companies “The 20 Mile March.” They took this moniker from imagining a man determined to walk across the United States, and how he could accomplish his goal faster by committing to walking 20 miles every single day – rain or shine — rather than walking for 40-50 miles in good weather and then very few miles or not at all during inclement conditions.

My biggest achievements last year, collecting lots of Tweets and blogging continuously, inadvertently embraced The 20 Mile March ethos. I’ll be looking to build my 2013 resolutions with even more awareness of the components of this approach. Weight reduction/exercise and side-hacking are going to make repeat appearances this year. Let’s see if applying 20 Mile March techniques helps increase my level of success.


Wiz Hawks

[embed]https://twitter.com/DidTheWizWin/status/290283226562428928[/embed]

I don’t know if John Wall is going to make that much of a difference overall, but his return definitely put a pep in the step of the Wiz. They actually rolled to a relatively easy victory over the Hawks.

Two in a row! Who’d a thunk the Wiz would have a winning streak this year?


Sad

Sad news from the campus newspaper of my alma mater:

Computer activist Aaron H. Swartz committed suicide in New York City yesterday, Jan. 11, according to his uncle, Michael Wolf, in a comment to The Tech. Swartz was 26.

The accomplished Swartz co-authored the now widely-used RSS 1.0 specification at age 14, founded Infogami which later merged with the popular social news site reddit, and completed a fellowship at Harvard’s Ethics Center Lab on Institutional Corruption. In 2010, he founded DemandProgress.org, a “campaign against the Internet censorship bills SOPA/PIPA.”

I didn’t know Swartz, but knew of him, from his precocious emergence as a teen in Chicago (when I lived there) to his various Web, Internet, and political activities. He was of that ultra-principled, uber-nerd ilk reminiscent of Richard Stallman, Philip Greenspun, Olin Shivers, Eric Raymond, and Mark Pilgrim. It’s one thing to be able to hack, but quite another to be able to translate that focus into the public sphere.

We should all be so lucky to have done as much as he did in the time that we have been given.


Diggin’ On GraphLab

What is GraphLab?

GraphLab is a graph-based, high performance, distributed computation framework written in C++. While GraphLab was originally developed for Machine Learning tasks, it has found great success at a broad range of other data-mining tasks; out-performing other abstractions by orders of magnitude.

In the process of experimenting with alternatives to Hadoop for work, I revisited GraphLab. It’s come a long way since its 1.0 release. A lot more usable and the performance is off the charts. Let me put it to you this way, I’m able to run graph algorithms on graphs with over 60 million nodes and 1 billion edges that complete in minutes. Often in the time it would take a Hadoop job to even get started.

And often on a 2 year old Dell Optiplex with 2 lousy cores and desktop grade I/O. Okay, it can’t do billions of edges fast, but an order of magnitude down is quite reasonable.

You can do a lot of damage with a decent amount of RAM and a handful of cheap, stock SSDs. The floor at which you really need to buy into distributed, horizontal scaling is much higher than a lot of people think.


pluggin’ away

As I get more experience building “command shells” in Python, I’ve become interested in making them easily extensible. In the same way cliff is my go-to command line building toolkit, maybe stevedore can plug this hole:

Python makes loading code dynamically easy, allowing you to configure and extend your application by discovering and loading extensions (“plugins”) at runtime. Many applications implement their own library for doing this, using import or importlib. stevedore avoids creating yet another extension mechanism by building on top of setuptools entry points. The code for managing entry points tends to be repetitive, though, so stevedore provides manager classes for implementing common patterns for using dynamically loaded extensions.
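Since stevedore builds on entry points, here is a rough sketch of the raw mechanism it wraps. This uses only the standard library's importlib.metadata, not stevedore itself, and the group name is hypothetical: a real plugin package would declare entries under that group in its packaging metadata.

```python
from importlib.metadata import entry_points


def load_plugins(group):
    """Discover and load every installed entry point registered under `group`.

    Returns a dict mapping each entry point's name to the loaded object.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Pythons expose a selectable collection.
        selected = eps.select(group=group)
    else:
        # Older Pythons return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}


# Hypothetical group name; nothing is registered under it here,
# so this prints an empty dict.
plugins = load_plugins("myshell.commands")
print(plugins)
```

This is the repetitive part the quote mentions; the appeal of a manager class is not rewriting this loop, plus the error handling around it, in every command shell.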


Common Crawl’s New URL Index

Scott Robertson put together an index of the URLs in the Common Crawl dataset, so everyone doesn’t have to trawl the whole contents looking for where the links are:

I’m happy to announce the first public release of the Common Crawl URL Index, designed to solve the problem of finding the locations of pages of interest within the archive based on their URL, domain, subdomain or even TLD (top level domain).

Keeping with Common Crawl tradition we’re making the entire index available as a giant download. Fear not, there’s no need to rack up bandwidth bills downloading the entire thing. We’ve implemented it as a prefixed b-tree so you can access parts of it randomly from S3 using byte range requests. At the same time, you’re free to download the entire beast and work with it directly if you desire.

Information about the format, and samples of accessing it using python are available on github. Feel free to post questions in the issue tracker and wikis there.

Combine with a previous hare-brained scheme to good effect.
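The byte range trick is worth a quick illustration. Below is a minimal sketch that builds, but does not send, an HTTP request for a slice of a large file. The URL, offset, and length are placeholders of my own; the real index layout and block sizes are documented in the project's GitHub repository, and I'm not reproducing them here.

```python
import urllib.request


def build_range_request(url, offset, length):
    """Build a GET request for `length` bytes starting at byte `offset`."""
    last_byte = offset + length - 1  # HTTP byte ranges are inclusive
    request = urllib.request.Request(url)
    request.add_header("Range", f"bytes={offset}-{last_byte}")
    return request


# Placeholder URL and offsets, not the real index location or layout.
INDEX_URL = "https://example.com/hypothetical/url-index"
request = build_range_request(INDEX_URL, offset=4096, length=1024)

print(request.get_header("Range"))
```

Handing the request to urllib.request.urlopen would fetch just that slice from any server that honors range requests (it answers 206 Partial Content), which is how you can walk a b-tree one block at a time without downloading the whole thing.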

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.