Hadoop and GIS

Posted on: Thu 28 March 2013

Very convenient. I’ve been on the lookout for an approach to integrating geospatial information processing into Hadoop. Looks like ESRI is open sourcing some GIS Tools for Hadoop:

Data that includes location, and that is enhanced with geographic information in a structured form, is often referred to as Spatial Data. Doing Analysis on Spatial data requires an understanding of geometry and operations that can be performed on it. Enabling Hadoop to include spatial data and spatial analysis is the goal of this Esri Open Source effort.

GIS Tools for Hadoop is an open source toolkit intended for Big Spatial Data Analytics. The toolkit provides different libraries:

Via NoSQL Weekly

Lily

Posted on: Wed 27 March 2013

Link parkin’: Lily looks like an interesting mashup of various NoSQL projects:

Lily is a data management platform combining planet-sized data storage, indexing and search with on-line, real-time usage tracking, audience analytics and content recommendations. It’s a one-stop-platform for any organization confronted with Big Data challenges that seeks rapid implementation, rock-solid performance at scale, and efficiency at management.

Lily unifies Apache HBase, Hadoop and Solr into a comprehensively integrated, interactive data platform with easy-to-use access APIs, a high-level data model and schema language, flexible, real-time indexing and the expressive search power of Apache Solr. Best of all, Lily is open source - allowing anyone to explore and learn what Lily can do.

The Cloudera Blog

Posted on: Wed 27 March 2013

Speaking of Cloudera, I’m a dumbass for not subscribing to the Cloudera blog. Their posts tend to be quite meaty and well written. Over and above the Cloudera ML announcement I’ve enjoyed:

Problem fixed!

Bonus: TIL about the DEBS 2013 Challenge. Mmmmmmm, data.

Cloudera ML

Posted on: Tue 26 March 2013

Some command line tools I wrote with Crunch that make Mahout easier to use for new data scientists: http://t.co/tayYRNIPRp
— JosH100 (@josh_wills) March 22, 2013

Cloudera ML, at least making Apache Mahout a bit more usable

Today, I’m pleased to introduce Cloudera ML, an Apache licensed collection of Java libraries and command line tools to aid data scientists in performing common data preparation and model evaluation tasks. Cloudera ML is intended to be an educational resource and reference implementation for new data scientists that want to understand the most effective techniques for building robust and scalable machine learning models on top of Hadoop.

Good on ya, Mr. Wills.

Robert Wilensky

Posted on: Tue 26 March 2013

Sad news from Soda Hall:

Robert Wilensky, professor emeritus of computer science at the University of California, Berkeley, and one of the campus’s first faculty members in artificial intelligence when the field was just taking off, has died at age 61.

…

During Wilensky’s tenure at UC Berkeley, he served as chair of the Computer Science Division, director of the Berkeley Cognitive Science Program, director of the Berkeley Artificial Intelligence Research Project, and board member of the International Computer Science Institute.

Even though I wasn’t an AI guy, Prof. Wilensky was chair for a significant part of my graduate career. While I didn’t have a plethora of interactions with him, I always found them enjoyable. He was great for the Computer Science Division, presiding over the department during The Great Migration (TM) from Evans Hall, or shortly thereafter (need to confirm). Whatever. I just remember him being one of the earliest occupants of that office on the first floor, and suavely striding out, being one of the few faculty to pull off a suit on a daily basis. Not necessarily business, but definitely a notch up from elbow patches.

Godspeed, kind sir.

Wiz Griz

Posted on: Mon 25 March 2013

47-8-7.
— Wiz Win Last Night? (@DidTheWizWin) March 26, 2013

I’ve always said truly great players often seem to have explosive nights, early in their careers, that really demonstrate their potential. Think Jordan lighting up the Celtics for 63 and juking Larry Bird in a first round playoff loss in Jordan’s rookie season. Not 100% true, took a while for Steve Nash to get going, but I bet there’s a high correlation.

John Wall’s not quite there yet, but 47 points, with 8 assists, in carrying a decimated Wizards to victory over a good Grizzlies team is a solid sign. And that so quickly after torching the Lakers at Staples.

There’s hope here in DC.

The PyData Ecosystem

Posted on: Sun 24 March 2013

Ben Lorica posts some observations from the recent PyData conference:

It’s getting easier to use the Python data stack:

There are tools that facilitate the dissemination and sharing of code and programming environments. IPython notebooks allow Python code and markup in the same document. Notebooks are used to record and share complex workflows and are used heavily for (conference) tutorials. As the data stack grows, one of the major pain points is getting all the packages to work properly together (version compatibility is a common issue). In particular setting up environments where all the pieces work together can be a pain. There are now a few solutions that address this issue: Anaconda and cloud-based Wakari from Continuum Analytics, and cloud computing platform PiCloud.

One of the things I find most exciting is the quality incorporation of Python into the Hadoop ecosystem and varied spinoffs. While Java will always be on the top of the depth chart, since it’s the implementation language and Java is so enterprisey, Python is a reasonably first class citizen. I’ve gone a decent ways building Hadoop processing pipelines with essentially no Java programming.

Markdown Fieldguide

Posted on: Sun 24 March 2013

Like me, are you looking to become a more sophisticated Markdown user? You might want to check out MacSparky’s Markdown Fieldguide. Only 10 bucks!

Markdown started as a clever way to write for the web but has become so much more. This book demystifies Markdown, making it easy for anybody to learn. This book includes 130 pages and 27 screencasts totaling more than one and a half hours of video. There is also an additional hour of audio interviews. This book will take you from zero knowledge of Markdown to being a Markdown pro and change the way you write for the better.

I’ll be checking out the PDF version since I’m a Kindle Store sucke… err, fan.

Update: Decided to go with the iBookstore version just for a little variety.

Via Daring Fireball

Emacs 24.3 and Python

Posted on: Sat 23 March 2013

https://twitter.com/wesmckinn/status/314928646068531201

I’m with you Wes. Python mode in Emacs 24.3 doesn’t feel very polished. May have to fall back to your solution as well.

https://twitter.com/wesmckinn/status/314932693752242177

Wiz Lakers

Posted on: Sat 23 March 2013

Love beating the Lakers.
— Wiz Win Last Night? (@DidTheWizWin) March 23, 2013

C’mon @DidTheWizWin! You can do better than that. The Wiz came from double digits down, at the Staples Center, against a fully loaded Lakers with Pau and Kobe. Plus the Wizards went without Bradley Beal, Emeka Okafor, and AJ Price. Only nine guys got in the box score.

And John Wall showed something with 24 points and 16 assists in 44 minutes under the bright lights.

Weird box score observation. The Wiz totaled only 236 minutes, but by my math every regulation game should have 240. Guessing due to round down of fractional minutes.
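The rounding guess is easy to sanity check. A rough sketch, using made-up per-player minutes (these are illustrative numbers, not the actual Wizards box score): five players on the floor for 48 minutes is 240 player-minutes, and dropping the fractional part of each of nine players' minutes can shave off anything up to just under nine.

```python
import math

PLAYERS_ON_FLOOR = 5
REGULATION_MINUTES = 48

# Every regulation game hands out 5 * 48 = 240 player-minutes per team.
expected_total = PLAYERS_ON_FLOOR * REGULATION_MINUTES

# Hypothetical exact minutes for nine players (invented for illustration).
# They are chosen so that they sum to exactly 240.
exact_minutes = [44.0, 38.5, 35.75, 30.25, 28.5, 24.5, 18.75, 12.75, 7.0]

# A box score that rounds each player's minutes down to a whole number.
listed_minutes = [math.floor(m) for m in exact_minutes]
listed_total = sum(listed_minutes)

print(expected_total, sum(exact_minutes), listed_total)
```

With these invented numbers the listed total comes out four short of 240, so a 236 in a nine-man box score is consistent with plain truncation.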

Spotify’s Luigi

Posted on: Sat 23 March 2013

Been looking around for a flexibly engineered batch scheduling tool that’s Hadoop friendly. Spotify open sourced its Luigi framework which looks like it might fit the bill:

Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.

The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes. You want to chain many tasks, automate them, and failures will happen. These tasks can be anything, but typically long running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else.

There’s also an attendant post on other uses of Python at Spotify.
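To make the "dependency resolution" part concrete, here is a minimal sketch in plain Python. It is not Luigi's API: the `Task` class, `run_pipeline` function, and the task names are all hypothetical, and it skips everything Luigi adds on top (failure handling, visualization, command line integration). It only shows the core idea of running each task after the tasks it requires.

```python
class Task:
    """A named unit of work plus the tasks it depends on (hypothetical class)."""

    def __init__(self, name, requires=()):
        self.name = name
        self.requires = list(requires)

    def run(self):
        # A real task would launch a Hadoop job, dump a table, train a model...
        print("running", self.name)


def run_pipeline(target):
    """Run `target` after everything it depends on; return the run order."""
    done = set()
    order = []

    def visit(task, in_progress):
        if task.name in done:
            return  # already ran, e.g. a dependency shared by two tasks
        if task.name in in_progress:
            raise ValueError("dependency cycle at " + task.name)
        for dep in task.requires:
            visit(dep, in_progress | {task.name})
        task.run()
        done.add(task.name)
        order.append(task.name)

    visit(target, frozenset())
    return order


# Hypothetical pipeline: dump a database, crunch it, train on it, report.
dump_db = Task("dump_db")
hadoop_job = Task("hadoop_job", requires=[dump_db])
train_model = Task("train_model", requires=[hadoop_job])
report = Task("report", requires=[hadoop_job, train_model])

order = run_pipeline(report)
```

Asking for `report` pulls in the other three tasks in dependency order, and the shared `hadoop_job` runs only once.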

Iterate Like a Superhero

Posted on: Fri 22 March 2013

I enjoyed reading Ned Batchelder’s presentation on idiomatic iteration in Python, even though I knew of everything he presented. Made me feel like I actually knew something about Python! It’s good baseline knowledge any Pythonista should comprehend and internalize.
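For flavor, a short sketch of the kind of idioms I have in mind when I say idiomatic iteration. The data is invented and the selection is mine, not a summary of the slides.

```python
names = ["alpha", "beta", "gamma"]
sizes = [10, 20, 30]

# enumerate() hands back the index alongside the item,
# so there is no need for range(len(names)).
numbered = ["%d: %s" % (i, name) for i, name in enumerate(names, start=1)]

# zip() walks two sequences in lockstep.
size_by_name = dict(zip(names, sizes))


def evens(numbers):
    """A generator: the filtering logic lives here, the caller just loops."""
    for n in numbers:
        if n % 2 == 0:
            yield n


first_evens = list(evens(range(10)))
```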

He also ingeniously made it available in multiple formats.

This is a presentation I gave at PyCon 2013. You can read the slides and text on this page, or open the actual presentation in your browser (use right and left arrows to advance the slides), or watch the video:

Bracketless

Posted on: Fri 22 March 2013

Thank goodness I’ve retired from filling out NCAA Men’s Basketball Tournament brackets. Georgetown, New Mexico, and Wisconsin would have given me headaches, and if the Hoyas don’t come back, whatever bracket I had would be busted.

Now I can just kick back, root for Cal, and keep my blood pressure low.

rvm

Posted on: Thu 21 March 2013

Link parkin’: rvm, it’s like virtualenv for Ruby.

RVM is a command-line tool which allows you to easily install, manage, and work with multiple ruby environments from interpreters to sets of gems.

Stupid Feed Tricks

Posted on: Wed 20 March 2013

I was amused by Brent Simmons’ republishing of Brian Reischl’s Stupid Feed Tricks:

Brian is intimately acquainted with the different ways feeds can be screwed up. So he posted Stupid Feed Tricks on Google Docs.

I quote the entire thing below for people like me who don’t have Google accounts. The below is all by Brian:

… Putting a tiny number of posts in the feed (sometimes just one). These types then usually publish 10 articles in the space of two minutes, and wonder why you’re missing 9 of them. …

Now ponder scaling these issues up to (tens? hundreds?) of millions of feeds and users, and you can see the challenge facing anyone providing a Google scale feed synch ecosystem.

Also, don’t miss Brent’s favorite way to screw up a feed. Really, hotel connectivity needs to die a horrible death at the hands of 4G and LTE. Conference provisioning is “only” way overpriced. Lodger oriented is overpriced and busted.

Ahhh the wonders of the Web. Thankfully, it’s the least worst, global, open, networked, hypermedia system we have.

Who I Was Waiting to Hear From

Posted on: Wed 20 March 2013

With the GReader Apocalypse upon us, I was waiting to hear what the vendors of my two key feedreading tools were going to do. Got an answer now:

First, Black Pixel and The Return of NetNewsWire:

First, we intend to bring sync to future versions of NetNewsWire. It’s too soon to go into details about this, but you should know that we recognize how extremely important it is and that it is a top priority for us.

Second, even though we’ve been quiet about it, we have been working on new versions of NetNewsWire for Mac, iPhone, and iPad. We have some great new features and a modern design that we can’t wait to show you.

Next up, Oliver Fürniß, publisher of Mr. Reader, and his thoughts:

In my opinion all of those Google Reader alternatives should support this API, instead of going the easy way and creating their own. This would be a big win for millions of Google Reader users. Those services will have native clients within a short time-frame across many different platforms and devices. It’s not a big deal for the client developers to adjust the API endpoint. I’ve already changed my code (not uploaded yet) so that the API endpoint can be edited like suggested by Marco (Instapaper developer). Yes, I already hear your feedback that some alternatives provide many additional and awesome features. But do we really need features like ‘folders in folders in folders’ like those implemented in, for example, Tiny Tiny RSS?

Basically, uncertainty. I want to believe Black Pixel, but it’s been so long since a NetNewsWire update, I can’t have complete confidence. But I’ll keep on using it until it’s completely busted.

I really do think it would be great if Yahoo! jumped in and provided GReader compatible infrastructure. Microsoft could probably pull it off as well and maybe such an ecosystem could help out Bing.

A little over three months until D-Day. I’ll be looking to see how it all shakes out.

Dangerous Paywalls

Posted on: Tue 19 March 2013

Man, if Garret can’t hack it in the slowly diminishing, open news ecosystem, it’s a really bad sign:

So, to sum up: I soon may not be able to get the information I want in a form optimized for linkblogging. I can’t offer ‘added value’ because the most-worthy ‘valuables’ are hidden behind paywalls. Trying to overcome these two issues to maintain quality is already taking more time than I can spare. The few items I can succeed with to my satisfaction, are not enough to attract and keep an audience. Quality and volume were a hallmark of my blogging style. The internet is now actively working against me on those two points. I can’t go back to having long lists of links on my site — I just can’t. That was so time-consuming as to be idiotic. It was only the excitement of the early metacosm that kept me going in that fashion.

I’ll also toss in the pernicious creep of embed culture, a kissing cousin of animated GIFs, as another factor, even though I’m a judicious offender myself. Cedes a lot of control back to the content provider and breaks “fixity”.

vbox

Posted on: Mon 18 March 2013

Link parkin’: vbox, “Yet another Python library of Python bindings for Virtual Box CLI (Command Line Interface).” https://twitter.com/pypi/status/313271372153491456

If I ever get super dissatisfied with vagrant, and it does have a wart here or there, I may have to take another look at vbox.

Continuum.io’s Anaconda

Posted on: Mon 18 March 2013

Currently kicking the tires on Anaconda, Continuum.io’s Python distribution focused on high performance compute processing:

Completely free enterprise-ready Python distribution for large-scale data processing, predictive analytics, and scientific computing

Looks really nice and well put together although it doesn’t seem to play well with virtualenv and has its own environment model and tool, conda:

… Our users need to work with different versions of Python, NumPy, SciPy, and a variety of other packages. Moreover, they must be able to easily share live, runnable versions of their work, including all supporting packages, to their colleagues or the general public.

We created the conda package and environment management system to solve these problems. It allows users to install multiple versions of binary packages (and any required libraries) appropriate for their platform and easily switch between them, as well as easily download updates from an upstream repository. …

Bit of a shame, as virtualenv is really a core part of the Python ecosystem and personally makes my life a lot easier. But I can understand the tradeoff.

Anyway, Continuum provides the baseline Anaconda, which feels pretty competitive with the Enthought Python Distribution, for free and with a reasonable (YMMV), proprietary friendly license.

TIL HTSQL

Posted on: Sun 17 March 2013

Via Catherine Devlin, and her PyCon lightning talk, I just learned of HTSQL:

HTSQL is a comprehensive navigational query language for relational databases.

HTSQL is designed for data analysts and other accidental programmers who have complex business inquiries to solve and need a productive tool to write and share database queries.

Not A Bad Thought

Posted on: Sun 17 March 2013

Maybe Yahoo! could pick up the ball that Google Reader will be dropping:

But Yahoo now has a huge opportunity to make itself relevant again to the types of consumers who wrote the company off long ago. Now that Yahoo is referring to itself as a tech company rather than a media firm, it can bolster its techie cred with a product that provides utility that Google won’t match.

I say it’s not a bad thought because Yahoo! probably still has enough engineering talent and infrastructure to execute on this at scale. Mashable (really Todd Wasserman) misses an even bigger opportunity in that a small team could clone the unofficial GReader API and provide an alternative feed synching ecosystem. That would really be hot for mobile.

Fabric or Capistrano?

Posted on: Sat 16 March 2013

So at this point in my multi-vm explorations, it’s pretty clear I’ll need some configuration automation beyond provisioning. Sometimes you just need to shut down and restart services across a bunch of machines in a certain order. SSHing into each one and doing it by hand is now sufficiently painful to consider how the pros do it.

Capistrano has the advantage of getting me deeper into Ruby. Fabric, based on Python, means an easier learning curve. Choices, choices.
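For reference, the by-hand version I am trying to get away from looks roughly like the sketch below. It uses neither Fabric nor Capistrano, only `ssh` via the standard library. The host names and service names are hypothetical, and it assumes the remote machines accept `sudo service <name> stop|start`.

```python
import subprocess

# Hypothetical hosts and services, listed in the order they must start.
RESTART_PLAN = [
    ("namenode.local", "hadoop-namenode"),
    ("datanode1.local", "hadoop-datanode"),
    ("datanode2.local", "hadoop-datanode"),
]


def ssh_command(host, service, action):
    """Build the argument list for one remote service command."""
    return ["ssh", host, "sudo", "service", service, action]


def build_commands(plan):
    """Stop services in reverse order, then start them in forward order."""
    stops = [ssh_command(host, svc, "stop") for host, svc in reversed(plan)]
    starts = [ssh_command(host, svc, "start") for host, svc in plan]
    return stops + starts


def run(plan, dry_run=True):
    """Print the commands, or execute them one at a time when dry_run is False."""
    for cmd in build_commands(plan):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises on the first failure


commands = build_commands(RESTART_PLAN)
```

Calling `run(RESTART_PLAN)` only prints the six commands. The appeal of a real tool is everything this leaves out: parallelism, roles, credentials, and recovering from a failure halfway through.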

TIL dnsmasq

Posted on: Fri 15 March 2013

Today I Learned about dnsmasq:

Dnsmasq is a lightweight, easy to configure DNS forwarder and DHCP server. It is designed to provide DNS and, optionally, DHCP, to a small network. It can serve the names of local machines which are not in the global DNS. The DHCP server integrates with the DNS server and allows machines with DHCP-allocated addresses to appear in the DNS with names configured either in each host or in a central configuration file. Dnsmasq supports static and dynamic DHCP leases and BOOTP/TFTP/PXE for network booting of diskless machines.

Dnsmasq is targeted at home networks using NAT and connected to the internet via a modem, cable-modem or ADSL connection but would be a good choice for any smallish network (up to 1000 clients is known to work) where low resource use and ease of configuration are important.

So far dnsmasq has come in real handy for multivm, host-only, networks of virtual machines. It seems to cleanly handle reverse DNS lookups and is nicely packaged for most distros.

GReader’s Real Value

Posted on: Thu 14 March 2013

Chuck Shotton gets to the heart of why Google Reader shutting down will be painful:

Here is the incredibly powerful thing that Google Reader provides that will leave a huge, gaping hole in my daily RSS reading:

Synchronization.

Google Reader was at best an average RSS reader. But it excelled at keeping all of my other 3rd party RSS reader apps in sync. By providing a set of APIs that allowed remote readers to mark/unmark individual articles as read, it let me start reading news on my phone with Feeddler, continue on my desktop with Google Reader, and switch to Flipboard on my tablet later in the day without having to wade through the same news articles twice. What was marked as read on my phone never showed up as unread on my tablet. It also gave me centralized management of all my RSS feeds. When I nuked an entire feed on my desktop computer, it disappeared from my mobile devices.

I wouldn’t hold my breath, Chuck, on someone filling the void. Matching the scale, speed, availability, and cross-client API support of GReader will be tough. Then again, scalable data processing techniques and infrastructure have advanced far faster than the RSS reading population has. So maybe a small team can take this on as a side project, or even lifestyle business hustle.
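The data model behind the synchronization Chuck describes is small, which is part of why a clone seems plausible. Here is a toy, in-memory sketch of it. The `SyncService` class and its method names are hypothetical and have nothing to do with the actual Google Reader API; the hard part (scale, availability, many clients) is exactly what this leaves out.

```python
class SyncService:
    """Toy central store for subscriptions and per-article read state."""

    def __init__(self):
        self.subscriptions = set()
        self.read_marks = set()  # (feed_url, article_id) pairs

    def subscribe(self, feed_url):
        self.subscriptions.add(feed_url)

    def unsubscribe(self, feed_url):
        # Nuking a feed on one device removes it for every device.
        self.subscriptions.discard(feed_url)
        self.read_marks = {(f, a) for f, a in self.read_marks if f != feed_url}

    def mark_read(self, feed_url, article_id):
        self.read_marks.add((feed_url, article_id))

    def mark_unread(self, feed_url, article_id):
        self.read_marks.discard((feed_url, article_id))

    def unread(self, feed_url, article_ids):
        """Filter a client's article list down to what is still unread."""
        if feed_url not in self.subscriptions:
            return []
        return [a for a in article_ids if (feed_url, a) not in self.read_marks]


service = SyncService()
FEED = "http://example.com/feed.xml"  # placeholder feed URL
articles = ["post-1", "post-2", "post-3"]

service.subscribe(FEED)
service.mark_read(FEED, "post-1")             # read on the phone
tablet_view = service.unread(FEED, articles)  # the tablet asks later

service.unsubscribe(FEED)                     # nuked on the desktop
after_unsubscribe = service.unread(FEED, articles)
```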

The Apocalypse Is Nigh

Posted on: Wed 13 March 2013

NOOOOOO: Google Reader to shut down July 1st http://t.co/EQ4p2LEBSR
— Hilary Mason (@hmason) March 13, 2013

Google Reader, dead July 1, 2013. Victim of another Google Spring Cleaning.

We launched Google Reader in 2005 in an effort to make it easy for people to discover and keep tabs on their favorite websites. While the product has a loyal following, over the years usage has declined. So, on July 1, 2013, we will retire Google Reader. Users and developers interested in RSS alternatives can export their data, including their subscriptions, with Google Takeout over the course of the next four months.

As a feedaholic from the earliest days, this’ll be the end of an era. But maybe it’s the extinction event that will spur a new round of innovation with RSS. I’d do a Munch-like Scream, but everyone knew this was coming.

I would like to heartily thank the GReader team for a boatload of utility over the years.

On Timestamps

Posted on: Tue 12 March 2013

All timestamps should be in UTC and ISO 8601 format. If I were CEO of the world, I'd fire anyone who disagrees. http://t.co/zmwmGrHDrq
— John Rauser (@jrauser) February 27, 2013

What he said. ISO 8601 everywhere please.
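In Python this costs about two lines. A minimal sketch using only the standard library `datetime` module; the fixed date is just this post's date with an arbitrary time.

```python
from datetime import datetime, timezone

# Current time as a timezone-aware UTC datetime, rendered as ISO 8601.
now_utc = datetime.now(timezone.utc)
stamp = now_utc.isoformat(timespec="seconds")

# The string parses back into the same aware datetime (to the second).
parsed = datetime.fromisoformat(stamp)

# A fixed example so the format is easy to see.
fixed = datetime(2013, 3, 12, 14, 30, tzinfo=timezone.utc).isoformat()
print(fixed)  # 2013-03-12T14:30:00+00:00
```

The explicit `+00:00` offset is the point: nobody downstream has to guess which timezone the timestamp was written in.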

OneTab For Chrome

Posted on: Mon 11 March 2013

As a tabaholic, maybe the OneTab plug-in can help with my addiction. It might also get me back on Chrome more regularly although really my main issue is the inconsistent behavior of the 1Password extension.

Whenever you find yourself with too many tabs, click the OneTab icon to convert all of your tabs into a list. When you need to access the tabs again, you can either restore them individually or all at once.

When your tabs are in the OneTab list, you will save up to 95% of memory because you will have reduced the number of tabs open in Google Chrome.

Emacs 24.3

Posted on: Mon 11 March 2013

Looks like Emacs 24.3 is an official release now. Mickey at Mastering Emacs has an overview of some of the changes:

A new version of python.el, which provides several new features, including: per-buffer shells, better indentation, Python 3 support, and improved shell-interaction compatible with iPython (and virtually any other text based shell).

I blogged about a new python mode a long time ago and it seems it’s made it into trunk. That’s probably good news for most Python users, but as I haven’t yet explored the new python.el mode, I will cover this in much greater detail in another post.

Better iPython integration would be a boon although I’ll admit the Emacs IPython Notebook is pretty saucy. Definitely, a must test drive.

Sparkin’ EMR

Posted on: Sun 10 March 2013

Link parkin’. How to layer Spark within an Amazon Elastic MapReduce cluster. Basic idea is to use a bootstrap script to deploy the toolkit.

In this article, we’ll explain how to install Shark and Spark on a cluster managed by Amazon EMR. By combining these technologies, you’ll be able to enjoy the speed enhancements of the Shark data warehouse as well as the operational and financial advantages of running your cluster on Amazon EMR.

Bad Ball

Posted on: Sun 10 March 2013

The Sports Gods had a bad night last night. Of the seven NBA games on Saturday, March 9, six had a victory margin of 10 or more points. That’s the league’s definition of a blowout I believe. The Knicks-Utah game was over in 6 minutes. The Wizards even got a laugher over the Bobcats, who look horribly awful.

Plus the Caps got blown out by the Islanders. Georgetown crushed Syracuse. North Carolina stunk it up, on their home court, on Senior Night, against Duke. Man City won 5-1 in their FA Cup match. Tiger Woods is starting to pull away in the PGA Tour stop.

Sunday was looking sort of weak as well, with Man United going up 2-0 early in the first half against Chelsea. At least The Blues got their act together, turned it on in the second half, and put a lot of exciting pressure on the Red Devils.

Thankful for a taste of quality this weekend. Maybe they’re saving up for NCAA Conference tournaments and March Madness.

NetflixGraph

Posted on: Sat 09 March 2013

Interesting. If you have relatively small and relatively static graph data, you can easily ship it around a distributed processing platform thanks to NetflixGraph:

NetflixGraph is a compact in-memory data structure used to represent directed graph data. You can use NetflixGraph to vastly reduce the size of your application’s memory footprint, potentially by an order of magnitude or more. If your application is I/O bound, you may be able to remove that bottleneck by holding your entire dataset in RAM. This may be possible with NetflixGraph; you’ll likely be very surprised by how little memory is actually required to represent your data.

NetflixGraph provides an API to translate your data into a graph format, compress that data in memory, then serialize the compressed in-memory representation of the data so that it may be easily transported across your infrastructure.
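NetflixGraph itself is a Java library, and I have not looked at its internals. To get a feel for why a packed representation is so much smaller than a pile of objects, here is a Python sketch of a different, textbook technique: compressed sparse row (CSR) storage, where a directed graph becomes two flat integer arrays that serialize to a byte string. All the function names are mine, and the byte format uses native byte order, so it is only good between like machines.

```python
from array import array


def build_csr(num_nodes, edges):
    """Pack a directed graph into two flat unsigned-int arrays (CSR layout)."""
    out_degree = [0] * num_nodes
    for src, _ in edges:
        out_degree[src] += 1

    # offsets[n]..offsets[n + 1] is node n's slice of the targets array.
    offsets = array("I", [0] * (num_nodes + 1))
    for node in range(num_nodes):
        offsets[node + 1] = offsets[node] + out_degree[node]

    targets = array("I", [0] * len(edges))
    cursor = list(offsets[:num_nodes])  # next free slot per source node
    for src, dst in edges:
        targets[cursor[src]] = dst
        cursor[src] += 1
    return offsets, targets


def neighbors(offsets, targets, node):
    """Outgoing neighbors of `node`."""
    return list(targets[offsets[node]:offsets[node + 1]])


def serialize(offsets, targets):
    """Two length fields followed by the raw array bytes."""
    header = array("I", [len(offsets), len(targets)])
    return header.tobytes() + offsets.tobytes() + targets.tobytes()


def deserialize(blob):
    size = array("I").itemsize
    header = array("I")
    header.frombytes(blob[:2 * size])
    n_offsets, n_targets = header
    offsets = array("I")
    offsets.frombytes(blob[2 * size:(2 + n_offsets) * size])
    targets = array("I")
    targets.frombytes(blob[(2 + n_offsets) * size:(2 + n_offsets + n_targets) * size])
    return offsets, targets


# Tiny made-up graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
offsets, targets = build_csr(3, edges)
blob = serialize(offsets, targets)
restored_offsets, restored_targets = deserialize(blob)
```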

Beware The Ides of Data

Posted on: Fri 08 March 2013

An interesting interview with Kate Crawford:

Kate Crawford: I’m currently researching how big data practices are affecting different industries, from news to crisis recovery to urban design. This talk was based on that upcoming work, touching on questions of smartphones as sensors, on dealing with disasters (like Hurricane Sandy), and new epistemologies — or ways we understand knowledge — in an era of big data.

When “Six Provocations for Big Data” came out in 2011, we were critiquing the very early stages of big data and social media. In the two years since, the issues we raised are even more prominent.

I’m now looking beyond social media to a range of other areas where big data is raising questions of social justice and privacy. I’m also editing a special issue on critiques of big data, which will be coming out later this year in the International Journal of Communications.

Bonus: Nassim Nicholas Taleb’s take on Big Data.

Only In Berkeley

Posted on: Thu 07 March 2013

Seen in #Berkeley: Dick Karp at Nefeli reading Les Valiant's book.
— Joe Hellerstein (@joe_hellerstein) March 7, 2013

One Turing Award winner, Karp, ~~reading~~ reviewing another, Valiant, at a friendly little neighborhood cafe. Of course, I have a soft spot for Nefeli, since I used to work there. Wondering if I’m the only Computer Science Division student to have that privilege. I’ll have to ask Nasos the next time I’m in town.

Hellerstein is no slouch either.

Reinout’s vagrant setup

Posted on: Wed 06 March 2013

More vagrant-fu from Reinout van Rees:

I said in using vagrant for developing on OSX: why? that I chose vagrant for setting up my development environment. Now it’s time for some specifics.

I’ll also point out that vagrant’s pretty good at multivm specification, provisioning, and booting. After you get the hang of it, it’s not too bad setting up a virtualized Hadoop cluster.

basebox

Posted on: Wed 06 March 2013

I’ve managed to use veewee to successfully build baseboxes for vagrant, but didn’t exactly find the experience pleasant. Mainly because the veewee post install scripts are totally isolated within the vm. I really wanted to replace the default ssh keys for the default vagrant account which are hardwired to be downloaded from a public url on the net. Feels like a recipe for disaster to me, but I could only come up with a kludgy post-basebox-build fixup script to solve the problem.

Maybe the Python based basebox can make this a little cleaner:

Basebox is a small Python library for building and interacting with Vagrant boxes using Fabric. Its goals are somewhat similar to the veewee project, but is specifically geared toward developing and testing Fabric deployments.

Definitely On The DL

Posted on: Tue 05 March 2013

Unlike this guy, I’m not seriously considering dumping the NFL, but I’m definitely creeping with The Premiership. And The Champions League. La Liga every now and then. Plus macking on the FA Cup once in a while.

(6) A team owned by an insane Russian oligarch, who considers money no object to pursuing success (I think this is Chelsea, though it might be Man City. In any case the other one is owned by a sheik who consider money no object etc.)

Hey, at least I know Man City is owned by the Sheik who considers money no object to pursuing success.

The best part of world football is that the dang games are done in two hours, guaranteed. Plus the time zone shift means the slate is essentially done by 2 PM Eastern, so you really can’t blow an entire day lying on the couch watching live game action.

Now if I could only get the Bundesliga’s phone number.

Strata Trip Reports

Posted on: Mon 04 March 2013

Michael Malak does yeoman’s work writing up his observations (Days 0 & 1, Day 2, Day 3) of the Strata Conference Santa Clara. First knock on Storm I’ve heard of:

Spark streaming, according to the presenter, beats its competitor Storm at calculating metrics on the fly as they come off a queue like Kafka or Flume because Spark has fault-tolerance through node redundancy and because Spark avoids Storm’s problem of double-counting events by maintaining full historical data in memory for the specified desired window (e.g. 10 minutes). He said there is a layer over Storm that can prevent double-counting, but it achieves it by wrapping each individual event in its own transaction, and most users just abandon that solution for being non-performant.

…

In a barely audible aside during the presentation, they confirmed the weakness of Storm that was stated during the previous day’s Spark Streaming presentation, which is that the layer on top of Storm, Trident, that prevents double-counting is not performant.
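The double-counting complaint is easier to see with a toy. The sketch below is my own illustration of the idea as described in the quote (keep the window's events around, keyed by event, and derive the count from them). It is not how Spark Streaming, Storm, or Trident is actually implemented, and the `WindowedCounter` class and event IDs are hypothetical.

```python
class WindowedCounter:
    """Counts distinct events seen within the last `window_seconds` (toy model)."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = {}  # event_id -> timestamp in seconds

    def add(self, event_id, timestamp):
        # A replayed event just overwrites its own entry,
        # so redelivery after a failure cannot inflate the count.
        self.events[event_id] = timestamp

    def count(self, now):
        cutoff = now - self.window_seconds
        # Forget events that have aged out of the window.
        expired = [eid for eid, ts in self.events.items() if ts < cutoff]
        for eid in expired:
            del self.events[eid]
        return len(self.events)


counter = WindowedCounter(window_seconds=600)  # the 10 minute window
counter.add("evt-a", 0)
counter.add("evt-b", 10)
counter.add("evt-a", 0)  # the same event delivered a second time

count_early = counter.count(now=20)
count_late = counter.count(now=605)  # evt-a has aged out by now
```

A bare running total (`total += 1` per delivery) would have reported three events at the first check. The price of the windowed version is memory proportional to the window.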

Vagrant DSTK

Posted on: Mon 04 March 2013

Pete Warden built a Data Science Toolkit Vagrant basebox:

I have fallen in love with Vagrant over the last year, it turns an entire logical computer into a single unit of software. In simple terms, you can easily set up, run, and maintain a virtual machine image with all the frameworks and data dependencies pre-installed. You can wipe it, copy it to a different system, branch it to run experimental changes, keep multiple versions around, easily share it with other people, and quickly deploy multiple copies when you need to scale up. It’s as revolutionary as the introduction of distributed source control systems, you’re suddenly free to innovate because mistakes can be painlessly rolled back, and you can collaborate with other people without worrying that anything will be overwritten.

Before I discovered Vagrant, I’d attempted to do something similar with my Data Science Toolkit package, distributing a VMware image of a full linux system with all the software and data it required pre-installed. It was a large download, and a lot of people used it, but the setup took more work than I liked. Vagrant solved a lot of the usability problems around downloading VMs, so I’ve been eager to create a compatible version of the DSTK image. I finally had a chance to get that working over the weekend, so you can create your own local geocoding server just by running:

vagrant box add dstk http://static.datasciencetoolkit.org/dstk_0.41.box

vagrant init

Cool! I’m becoming more of a fan of vagrant as well. This may have to be the first basebox I try out on Ye ’Olde MacBook. I was thinking a CartoDB 2.0 basebox build would be fun to do, but someone already beat me to it.

Harlem Shake-Off

Posted on: Sun 03 March 2013

http://www.youtube.com/watch?v=Ir2TdfSwH8g

Speaking of the Miami Heat, I’m usually immune to Internet memes, but I got sucked in by the Miami Heat’s version of the Harlem Shake. My Little Guy (™) had to watch it 20 times in a row. Where’s Andy Baio when we need him?

That’s a lot of high-priced talent having a big old goof. LeBron James seems to really get into it. But what I really want to know is who’s wearing the championship belt? My guess is Joel Anthony, but more investigation is needed.

However, I’m actually somewhat partial to the Kansas Jayhawks edition which I suspect might have inspired the Heat. Via Mario Chalmers? Could be a skoosh longer but has the upside of a Bill Self appearance:

Strata Observations

Posted on: Sun 03 March 2013

New post => Data Science Tools: Fast, easy to use, & scalable http://t.co/wBRYGiSXYh (observations from #strataconf)
— Ben Lorica 罗瑞卡 (@bigdata) March 3, 2013

More Ben Lorica with some observations coming out of the recent O’Reilly Strata Conference:

Here are a few observations based on conversations I had during the just concluded Strata Santa Clara conference.

Spark is attracting attention

I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed (slides here and here) and I talked to many people who were itching to try the BDAS stack. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The release of version 0.7 adds a Python API to Spark’s native Scala interface and Java API.

I’m already in the tank for Spark, but Lorica’s got a couple of other interesting observations to add.