BDAS Tutorial

Posted on: Sat 02 March 2013

Batch/Interactive/Realtime analytics combined: Strong interest in BDAS at #strataconf Tue morning session was packed! http://t.co/8gxqk7a8Dj
— Ben Lorica 罗瑞卡 (@bigdata) February 27, 2013

This tutorial, the first of a two-part series, will provide an introduction to BDAS, the Berkeley Data Analytics Stack. BDAS is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab whose current components include Spark, Shark and Mesos. We will start by covering Spark, a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100x thanks to its ability to perform computations in memory. Spark provides concise, high-level APIs in both Scala and Java, and is in use at Foursquare, Conviva, Klout, Quantifind, and other companies. We will provide an overview of the Spark architecture, typical data analytics workflows (e.g., loading data from HDFS into memory and interactively querying it), and how users are applying Spark. In addition, we will introduce Shark, a port of Apache Hive onto Spark that is compatible with existing Hive warehouses and queries. Shark can answer HiveQL queries up to 100x faster than Hive without modification to the data and queries, and is also open source as part of BDAS.

Tutorial Part 1 (with PowerPoint slides) and Part 2.

AtBat ’13

Posted on: Fri 01 March 2013

Erica Ogg at GigaOM did a quick review of MLB’s AtBat ’13 mobile app:

Opening Day of the Major League Baseball season is still about a month away. But the best app for following all the games is here already.

I agree and have already anted up for my season pass. It was well worth it last year. By far the best and most useful of the professional sports apps.

API Monetizing

Posted on: Thu 28 February 2013

Jacob Perkins describes turning a side hacking project into something “profitable”:

Mashape was just what I needed to monetize the text-processing API, and it’s improved tremendously since I started using it. They handle all the necessary details, plus a lot more, like usage charts, latency & uptime measurements, and automatic client library generation. This last is one of my favorite features, because the client libraries are generated using your API documentation, which provides a great incentive to accurately document the ins & outs of your API. Once you’ve documented your API, downloadable libraries in 5 different programming languages are immediately available, making it that much easier for new users to consume your API. As of this writing, those languages are Java, PHP, Python, Ruby, and Objective C.

Spark and Amazon EMR

Posted on: Thu 28 February 2013

Howto on deploying Spark into Amazon’s elastic environment:

A common business scenario is the need to store and query large data sets. You can do this by running a data warehouse on a cluster of computers. By distributing the data over many computers, you return results quickly because the computers share the load of processing the query. One limitation on the speed at which queries can be returned, however, is the time it takes to retrieve the data from disk.

You can increase the speed of queries returned from a data warehouse by using the Shark data warehouse system. Shark runs on top of Spark, an open-source cluster computing system optimized for speed. Spark speeds up data analytics by loading data into memory, providing much faster performance than a disk-based system like Hadoop. For more information on Spark, see http://spark-project.org/.

Spark and Python

Posted on: Wed 27 February 2013

New API for Spark in Python, released today (no Scala necessary). #strataConf
— julia ferraioli (@juliaferraioli) February 27, 2013

Confirmed. PySpark and Streaming Spark in the same release? I may have died and gone to heaven. Go Bears!

Updated link to point to Spark release notes

LibShortText

Posted on: Wed 27 February 2013

Looks like this might be a useful library for processing social media:

LibShortText is an open source tool for short-text classification and analysis. It can handle the classification of, for example, titles, questions, sentences, and short messages.

Poppin’ Soda Can

Posted on: Tue 26 February 2013

Sometimes you just know.

I’ve only given it two spins on the iPhone, but Mark Farina’s Soda Pop Can edition of the Great Lakes Audio podcast is the most listenable thing I’ve got my hands on in approaching TWO YEARS!! The closest recently comparable mix I’ve acquired was Evol Intent’s Us Against the World, which I got back in late April of 2011.

Soda Pop Can starts out deceptively languorous and down tempo, then maintains pretty close to the same BPM just about all the way throughout. This would typically be a recipe for monotony, but Farina is at the top of his selectin’ game, so the entire mix is an end-to-end toe tapping, head nodder. Peak moment is reserved for K&B’s Let’s Stay Together (a B-side no less), promptly followed by Ben La Desh’s hypnotic Motion.

One hour, seventeen minutes of pure house?, downtempo? bliss.

You… must…download…now!

vagrant and veewee

Posted on: Mon 25 February 2013

As promised, I’m getting up to speed on vagrant at work. It’s working out about as well as expected, with an anticipated steep learning curve. The big hurdle is having to learn a bit of automated configuration management using Puppet at the same time. All in all I’m surprised at how quickly I’ve managed to get a multi-VM Hadoop virtual cluster going.

There is one concern though, adequately stated, and solved by the veewee project:

Vagrant is a great tool to test out new things or changes in a Virtual Machine (Virtualbox) using either chef or puppet.

The first step to build a new Virtual Machine is to download an existing ‘base box’.

I believe this scares a lot of people as they don’t know who or how this box was built. Therefore lots of people end up first building their own base box which is time consuming and often cumbersome.

Veewee aims to automate all the steps for building base boxes and to collect best practices in a transparent way.

Veewee is next up on the stack of tools to learn.

Waiting For NetNewsWire

Posted on: Sun 24 February 2013

Sigh. Took a look at the last release date for NetNewsWire in the App Store. Nov 12th, 2010. The product still works for me but it’s feeling a bit pre-Retina at least, and it would be nice to see a couple of new features. With the potential GReaderpocalypse, may have to cast around for a backup iOS feedreader.

Columnar Storage Overview

Posted on: Sat 23 February 2013

A one-post review of some of the principles behind columnar storage formats, which underlie some of the advanced Big Data query engines:

You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and show how columnar storage – which is a simple idea – can drastically improve performance for certain workloads.
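To make the IO argument concrete for myself, here is a toy sketch in plain Python. It's my own illustration, not code from the post, and the records and field names are made up. The same three records are held once as rows and once as columns, and summing a single field only has to touch one list in the columnar layout.

```python
# Toy comparison of row-oriented vs column-oriented layouts (illustrative only).
# The records and field names below are invented for the example.
rows = [
    {"user": "a", "country": "US", "bytes": 120},
    {"user": "b", "country": "DE", "bytes": 300},
    {"user": "c", "country": "US", "bytes": 80},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per field."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def sum_bytes_row_store(records):
    # Every whole record is visited even though only one field is needed.
    return sum(record["bytes"] for record in records)


def sum_bytes_column_store(columns):
    # Only the "bytes" column is scanned; the other columns are never read.
    return sum(columns["bytes"])


columns = to_columns(rows)
total_from_rows = sum_bytes_row_store(rows)
total_from_columns = sum_bytes_column_store(columns)
```

On disk the difference is bigger than this in-memory toy suggests, since a real columnar format can skip the bytes of unneeded columns entirely and compress each column on its own.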

New D3 Book

Posted on: Fri 22 February 2013

Scott Murray is winding down on Interactive Data Visualization for the Web. Will make a nice partner on my Kindle shelf for my copy of Getting Started With D3 which I, uhhh, really need to get to work on reading.

Murray’s book started out as a collection of D3 Tutorials.

Emacs For Python

Posted on: Fri 22 February 2013

Emacs-for-python is a new toolkit/package/collection of Python Emacs-lisp. Purports to have an in-development mode for virtualenvs, which the world really needs and deserves:

I’m collecting and customizing the perfect environment for python developement, using the most beautiful emacs customization to obtain a really modern and exciting (yet stable) way to edit text files.

In the package are included also a lot of other packages and configurations, it’s an upstart for clean emacs installations, these configuration however are very similar to emacs-starter-kit and I suggest you to give it a try, emacs-for-python is designed to work with it (instruction below).

Hadoop Munging OSM

Posted on: Thu 21 February 2013

Michal Migurski crisply documents his Hadoop/Elastic Map Reduce process for working with Open Street Map data:

Usable line generalization for OSM roads and routes has been a hobby project of mine for several years now, since I argued for it at the first U.S. State of the Map conference in Atlanta, 2010. I’ve finally put the last piece of this project in place with the use of Hadoop to parallelize the geometry processing.

I’ve learned a lot about moving geographic data between Postgres and Hadoop. The result is available at Streets and Routes.

…

The end result is something I’m super happy about: a complete worldwide dataset of simplified roads and routes that’s suitable for high-quality labels and route shields.

A couple of punch lines to that final statement. First, Migurski was a relative newbie to Hadoop, and didn’t use anything more sophisticated than Hadoop Streaming and straight Python with Shapely. Second, his expensive run cost him $9.40 and 7 hours of waiting.
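For flavor, here is roughly what a Hadoop Streaming mapper for line generalization could look like. This is my sketch, not Migurski's code. He used Shapely, while I've swapped in a hand-rolled Douglas-Peucker simplifier so the example runs on the standard library alone. The input format (road id, a tab, then comma-separated "x y" pairs) and the names simplify, map_line and run are all assumptions made up for illustration.

```python
import io
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify(points, tolerance):
    """Douglas-Peucker: keep only points farther than tolerance from the chord."""
    if len(points) < 3:
        return list(points)
    distances = [point_segment_distance(p, points[0], points[-1])
                 for p in points[1:-1]]
    worst = max(range(len(distances)), key=distances.__getitem__)
    if distances[worst] <= tolerance:
        return [points[0], points[-1]]
    split = worst + 1  # index of the farthest point in the full list
    left = simplify(points[:split + 1], tolerance)
    right = simplify(points[split:], tolerance)
    return left[:-1] + right  # drop the duplicated split point


def map_line(line, tolerance=1.0):
    """Turn one 'id<TAB>x y,x y,...' record into its simplified form."""
    road_id, coords = line.rstrip("\n").split("\t")
    points = [tuple(float(v) for v in pair.split()) for pair in coords.split(",")]
    kept = simplify(points, tolerance)
    return road_id + "\t" + ",".join(f"{x:g} {y:g}" for x, y in kept)


def run(stream_in, stream_out, tolerance=1.0):
    """Streaming-style loop: one record per input line, one per output line."""
    for line in stream_in:
        if line.strip():
            stream_out.write(map_line(line, tolerance) + "\n")


# Local dry run with in-memory streams standing in for stdin/stdout.
demo_out = io.StringIO()
run(io.StringIO("r1\t0 0,1 0.1,2 0\nr2\t0 0,1 5,2 0\n"), demo_out)
```

As an actual streaming mapper, the script would end with run(sys.stdin, sys.stdout) instead of the dry run. Each road is simplified independently of every other road, which is why the job parallelizes so cheaply.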

Posterous Done

Posted on: Wed 20 February 2013

Hey, this blog has been around long enough to see even YCombinator grade companies be born, exit, and shut down:

Posterous launched in 2008. Our mission was to make it easier to share photos and connect with your social networks. Since joining Twitter almost one year ago, we’ve been able to continue that journey, building features to help you discover and share what’s happening in the world – on an even larger scale.

On April 30th, we will turn off posterous.com and our mobile apps in order to focus 100% of our efforts on Twitter. This means that as of April 30, Posterous Spaces will no longer be available either to view or to edit.

Right now and over the next couple months until April 30th, you can download all of your Posterous Spaces including your photos, videos, and documents.

I always thought Posterous had promise, and even used it lightly for some personal side-blogging, but it never really quite clicked or demonstrated a “wow” feature moment. I think the two towers of WordPress and Tumblr have pretty much sucked up all the air of blogging tools.

A worthy attempt though!

MFin Flu

Posted on: Tue 19 February 2013

They were not kidding this year. This flu is kicking my ass, and I even had the dang’ preventative shot.

At least I’m not passing out on The Left Coast, and no, fingers crossed, pink eye.

Graph Based Recommendation

Posted on: Mon 18 February 2013

OEmbed Link rot on URL: https://twitter.com/andyhickl/status/303691512284327937

Color me skeptical, but Reco4j might be worth a test drive:

Reco4j is an open source project that aims at developing a recommendation framework based on graph data sources. We choose graph databases for several reasons. They are NoSQL databases that are “schemaless”. This means that it is possible to extend the basic data structure with intermediate information, i.e. similarity value between item and so on. Moreover, since every information are expressed with some properties, nodes and relations, the recommendation process can be customized to work on every graph.

Hmmmm. But are they high quality recommendations for every graph?
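To pin down what "recommendation on a graph" can mean at its simplest, here is a tiny sketch of my own. It is not Reco4j's API or algorithm, and the users and items are invented. It treats likes as edges in a user-item graph and scores unseen items by how many two-hop paths lead to them through users with overlapping tastes.

```python
from collections import Counter

# Hypothetical user -> liked-items edges of a tiny bipartite graph.
likes = {
    "ann": {"spark", "pig"},
    "bob": {"spark", "hive", "pig"},
    "cat": {"hive", "mongo"},
}


def recommend(user, graph, top_n=3):
    """Rank items the user has not seen by paths through similar users."""
    seen = graph[user]
    scores = Counter()
    for other, items in graph.items():
        if other == user:
            continue
        overlap = len(seen & items)  # shared items link the two users
        if overlap == 0:
            continue
        for item in items - seen:
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


ann_recs = recommend("ann", likes)
cat_recs = recommend("cat", likes)
```

Which is sort of my point about quality: path counting like this works on any graph, but whether the output is any good depends entirely on what the edges mean.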

Snap Polling

Posted on: Sun 17 February 2013

Saw this post by Carl Franzen a while ago on IdeaLab. Found it interesting that there may be a new “polling science” that could emerge out of real-time social media analytics. The trick will be having some real science, backed by validation.

Social media is a big draw for over 2.3 billion users around the globe, but its true value is only starting to be unlocked, according to Poptip, a New York-based startup company has seen early success running quick, realtime text analysis of tweets on Twitter as a kind of hi-tech snap polling method.

Mainstream companies from Pepsi to People Magazine to ESPN have begun experimenting with Poptip’s first and to date, only product, an app that allows users to post their questions on Twitter, then tracks and displays responses from respondents in realtime.

Even curiouser, the original post seems to have disappeared from IdeaLab. Factually incorrect? Scammed? Weird. Need to do some digging.
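For what it's worth, the mechanical part of snap polling is not much code. Below is a guess at the bare minimum, assuming responses are tagged with hashtags. That's my assumption, not a description of how Poptip works, and the tweets are made up. It counts one vote per tweet for the first recognized option.

```python
import re
from collections import Counter


def tally(tweets, options):
    """Count at most one vote per tweet: the first option hashtag it mentions."""
    wanted = {option.lower() for option in options}
    votes = Counter()
    for text in tweets:
        for tag in re.findall(r"#(\w+)", text.lower()):
            if tag in wanted:
                votes[tag] += 1
                break  # ignore repeats and later tags in the same tweet
    return votes


# Invented sample responses.
sample = [
    "Definitely #Ravens tonight",
    "#niners all the way",
    "#ravens #ravens #ravens",
    "no idea, just here for the ads",
]
result = tally(sample, ["ravens", "niners"])
```

The counting is the easy part. Knowing who the respondents are and whether they represent anybody is where the real science would have to come in.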

Using psql

Posted on: Sat 16 February 2013

Craig Kerstiens lets loose with his preferred interface to Postgres:

Sometimes it leans more to, what is the Sequel Pro equivilant for Postgres. My default answer is I just use psql, though I do have to then go on to explain how I use it. For those just interested you can read more below or just get the highlights here:

There’s a few handy TILs in there.

Precog in PgSQL

Posted on: Fri 15 February 2013

Love to do as much as possible within Postgres:

Today, the Precog team has released a free implementation of Precog for PostgreSQL. Precog for PostgreSQL empowers users to easily perform data science on PostgreSQL.

Morphology Is King

Posted on: Thu 14 February 2013

Quoting Lucas Gonze, who’s always been a good read. Might be worth a sub:

I’m blogging about new form factors for computing on a Tumblr blog at formfactor.gonze.com. My topic there is post-mobile contexts like biohacking, head mounted displays, wearable computing, smart TV, haptics, and neural interfaces. The subtext is new applications enabled by changes in the physical shape of devices.

Pandas Whirlwind

Posted on: Wed 13 February 2013

A 10-minute whirlwind tour of pandas, presented from the #IPython notebook http://t.co/wL2hWJiP #pydata
— Wes McKinney (@wesmckinn) February 11, 2013

Gotta go easy tonight. Some bug has gotten hold of me.

Data Science Toolkit Details

Posted on: Tue 12 February 2013

I sort of already knew of Pete Warden’s Data Science Toolkit, but Ajay Ohri gives a nice overview of what’s actually inside:

The Data Science Toolkit is a collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command line, Python and Javascript interfaces. Available as a self-contained Vagrant VM or EC2 AMI that you can deploy yourself.

The Data Science Toolkit is essentially a specialized Linux distribution, with a lot of useful data software pre-installed and exposing a simple interface. Developer documentation is quite nicely done.

Urban Coyotes Quick Thoughts

Posted on: Mon 11 February 2013

So I’ve finally been able to give Mark Farina’s Urban Coyotes mix a few listens. Enjoying the blend although my current stance is that it’s at the quality of some of his Great Lakes Audio podcast mixes. This isn’t to say that Urban Coyotes is bad, just not super, duper, mega, head nodding awesome like some other Farina House music mixes. In particular, the only peak for me is Flapjackers Candy.

So far, not at the top of the pantheon, but I need to give it some more wear for fit. I’m definitely not sending it back though.

Am I True Chicago (™) though in immediately trainspotting Ron Magers on the voice overs? The “coyotes” pronunciation bugs me however. Should be three syllables, not two, with a long e.

Will It Python?

Posted on: Sun 10 February 2013

A series on data analysis, from R to Python, definitely worth tracking:

Will it Python? posts are my attempts to port data analyses originally done in R into Python.

The objective isn’t to just make a key that translates functions and methods in R into Python equivalents. Instead, the goal is to reproduce the results and insights of the analysis in idiomatic Python (to the extent I’m qualified to judge such a thing). Sometimes there will be a direct translation from a line of R to a line of Python; other times Python will suggest an altogether different approach to the problem.

Revisiting Prismatic

Posted on: Sat 09 February 2013

Previously, Prismatic was pretty low on the Yet Another Place To Find News depth chart. Feeds are Number 1, Twitter is Number 2, Hacker News 3, a few sub-Reddits 4, and nothing else really mattered.

However I’m getting disillusioned with Hacker News and need a break. Not enough good, core tech nuggets, and too much aspirational and political content reaching the front page. And I also trawl the new submission page but that’s pretty hit or miss.

So I’m going to provisionally bump Prismatic up to the three hole, and try to replace my Hacker News with it. I think Prismatic’s link saving/stashing mechanism has greatly improved since my last serious usage, so maybe it’ll feel much more useful.

Update: Forgot about one reason I stopped using Prismatic regularly. Whatever JavaScript Fu it’s doing, Prismatic routinely crashes Safari on my iPad. Downer.

Wiz Nets

Posted on: Fri 08 February 2013

Wiz wun? Wuttttttt
— Wiz Win Last Night? (@DidTheWizWin) February 9, 2013

That’s 3 home wins in 1 week, including taking out Manhattan and Brooklyn. Got any more boroughs for us to beat NYC?

Other Python and Geo Playthings

Posted on: Thu 07 February 2013

Some other things that caught my fancy recently:

Geobases

This project provides tools to play with geographical data. It also works with non-geographical data, except for map visualizations :).

Glue

Glue is a Python library to explore relationships within and among related datasets.

On the current state of reverse geocoding:

At Flickr we spent a while working on turning the point where a photo was set on map (or whose GPS coordinates were shoved into the EXIF) into a place. The work of reverse geocoding is about taking a point, and finding out which polygon its in. This is a well solved problem. With two caveats:

places don’t have neat boundaries, but overlap all over each other. And people disagree about the overlaps

even if places had neat boundaries, and people agreed on them, availability of information about those boundaries is variable at best.
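The "well solved" part, finding which polygon a point is in, fits in a few lines. Here is a sketch of the standard ray casting test, with made-up square "places" that deliberately overlap to mirror the first caveat. The function names and data are mine and have nothing to do with Flickr's code.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: toggle on each edge crossed by a ray heading east."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        if (y1 > lat) != (y2 > lat):  # edge straddles the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def reverse_geocode(lon, lat, places):
    """Return every place whose polygon contains the point (places may overlap)."""
    return [name for name, ring in places.items()
            if point_in_polygon(lon, lat, ring)]


# Made-up boxes, not real boundaries; they overlap on purpose.
places = {
    "downtown": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "old town": [(3, 3), (6, 3), (6, 6), (3, 6)],
}
hits = reverse_geocode(3.5, 3.5, places)
```

A point in the overlap comes back with both names, which is exactly the disagreement problem: the geometry is easy, deciding which answer to show is not. A real service would also put a spatial index in front of this rather than testing every polygon.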

Once I made a request for a pure geo indexing capability. Ask and ye shall receive.

A Java reverse geocoder that uses Geotools and JTS to index Flickr Shapefiles to map (latitude, longitude) to WOEID.

The code has no external server dependencies and runs in memory. You may have to specify a larger max heap size if you use the largest files (e.g. the localities dataset uses around 450Mb on my Mac).

Mongo Geo Indexes

Posted on: Thu 07 February 2013

New Geospatial Indexes with GeoJSON and Improved Spherical Geometry

The 2.3 series adds a new type of geospatial index that supports improved spherical queries and GeoJSON.

MongoDB may have its failings and critics, but it’s here, working and getting better. New geospatial indexing features might make me take another look for some ad hoc hackery.

Fast, Cheap, and Persistent

Posted on: Wed 06 February 2013

NVRAM could be revolutionary rather than evolutionary: I'm looking 4 folks 2 talk w/ re nonvolatile memory, thoughts? http://t.co/l0n5zBC2
— Ben Lorica 罗瑞卡 (@bigdata) February 1, 2013

Operating System Implications of Fast, Cheap, Non-Volatile Memory

The existence of two basic levels of storage (fast/volatile and slow/non-volatile) has been a long-standing premise of most computer systems, influencing the design of OS components, including file systems, virtual memory, scheduling, execution models, and even their APIs. Emerging resistive memory technologies – such as phase-change memory (PCM) and memristors – have the potential to provide large, fast, non-volatile memory systems, changing the assumptions that motivated the design of current operating systems. This paper examines the implications of non-volatile memories on a number of OS mechanisms, functions, and properties.

Starting to think UltraRAM rolls off the tongue better than my tongue in cheek moniker BigRAM. Besides being cheap enough to splurge on, RAM will be different in the future.

Big Data Etymology

Posted on: Tue 05 February 2013

This is just of general interest to me, but there’s also the specific angle that I’ve personally met the guy who trademarked “bigdata”. Smart guy. Was wondering where that term actually originated. Will be stuff of lore for years to come.

The term Big Data is so generic that the hunt for its origin was not just an effort to find an early reference to those two words being used together. Instead, the goal was the early use of the term that suggests its present connotation — that is, not just a lot of data, but different types of data handled in new ways.

The Funniest Tweet

Posted on: Mon 04 February 2013

PEOPLE OF GOTHAM
— Spencer Ackerman (@attackerman) February 4, 2013

… of last night, Super Bowl Sunday, at least for me. Then again it was timed perfectly for the blackout, I’m a big Dark Knight fan, and I’m really easily amused.

A Chihuahua in Your Pig

Posted on: Sun 03 February 2013

That would be Pig UDF for GraphChi. A complete hack of the best kind:

Pig is a powerful query language for Hadoop commonly used for large scale data processing. Now it is possible to run GraphChi programs as parts of Pig-scripts, with just one line of script! This allows easy huge scale graph computation with data stored in HDFS (Hadoop File System). As GraphChi will ultimately execute only on a single Hadoop machine (see HowGraphChiForPigWorks), the size of the Hadoop Cluster is not a limiting factor.

GraphChi for Pig is a viable alternative to Giraph, which is a distributed graph engine built on top of Hadoop. With GraphChi, you can develop your algorithms on your laptop (with realistically sized data) and then deploy them to run the big cluster. GraphChi will also often run faster and uses much less resources than alternatives.

Stuper Sunday

Posted on: Sun 03 February 2013

From the best (only?!) post to American McCarver in a while:

But football is a contest and picking sides is what competition is about. So who does a disinterested party root for this weekend?

Both teams are populated with terrible human beings, so that doesn’t help. (I’m tempted to give the edge to the Ravens, because at least Ray Lewis knew what he was doing when he denied it later.) The normally reliable heuristic of rooting against a Harbaugh doesn’t work. Colin Kaepernick is the obvious hero from every football movie ever, but nobody should have to cheer for someone with a chin beard. Seriously, Colin, stop taking grooming tips from Brian Wilson.

This is the least interest I’ve had in a Super Bowl since that Denver-Atlanta stinker in 1999. I absentmindedly listened to that one on radio and slept through part of it. In 2013, I have no sympathy for any of the players. As an East Bay aficionado, I can’t stand the ’Niners and their fans. Not to mention their jerk of a former Stanford coach. Even though I have Baltimore ties, the Ravens are the Browns, not the Colts. And Ray Ray’s act and history irk me to no end.

Guess I’ll hold my nose and go for the Ravens just for Ozzie Newsome’s sake.

Bonus. Sapp before Haley? I’m really just starting to dislike the NFL.

Mission Accomplished

Posted on: Sat 02 February 2013

With the help of an iFixit YouTube video and repair Guide, I managed to complete the upgrading of Ye Olde MacBook as previously advertised. Unfortunately, the guide delivered with the MCE OptiBay seems specific to MacBook Pros, with no info for us old MacBook peasants. The iFixit material did in a pinch though. Way more tiny screws were involved than I planned, but the machine currently seems none the worse for the experience.

Now I need to put the HD through its paces a little, making it the home of my iTunes Library. But I feel good this little MacBook can make it to its fifth anniversary and well beyond.

TIL EIN

Posted on: Fri 01 February 2013

I’ve been looking for a decent mash up of IPython and Emacs. Will definitely test drive EIN:

Emacs IPython Notebook (EIN) provides a IPython Notebook client and integrated REPL (like SLIME) in Emacs. While EIN makes notebook editing very powerful by allowing you to use any Emacs features, it also expose IPython features such as code evaluation, object inspection and code completion to the Emacs side. These features can be accessed anywhere in Emacs and improve Python code editing and reading in Emacs.

Via John D. Cook

A Mallet in Your Pig

Posted on: Fri 01 February 2013

Diving deeper into the Hadoop ecosystem, I’m starting to appreciate languages like Hive and Pig much more. The ability to extend each with UDFs really expands their possibilities.

Here’s an old, but still neat, hack from Jacob Perkins, The Data Chef. He’s folding the capabilities of Mallet, an open source topic modeling toolkit, into the Pig language:

I’m going to use Apache Pig and Mallet, a java based machine learning and natural language processing library to discover topics in the 20 newsgroups data set. This corpus is nice since each document already belongs to a newsgroup (a topic) and so it gives us a way of checking how well our topic discovery is doing. …

So it’s clear that we’re going to need a java udf to do the actual topic clustering. Right? This udf will operate on a DataBag of documents and return a DataBag containing the discovered topics.

Extensibility For The Win.

44

Posted on: Thu 31 January 2013

Finally broke the 40 post in a month barrier, this being the 44th, so I thought it only appropriate to give a nod to the reelection and inauguration of the 44th President of the United States.

One thing I can say is, thank God all that political advertising has gone away for a while here in battleground Virginia. Still traumatized from the bombardment.

Wiz Logo

Posted on: Thu 31 January 2013

If nothing else, the Wizards have a cool logo this year. Good enough that I bought my first piece of sports logo apparel in years. A winter hat with this logo as a patch, for the rest of DC’s wintry days. Nice bonus that it’s not a walking brand advertisement to boot, shades of Cayce Pollard.

Indiepocalypse Now

Posted on: Thu 31 January 2013

Andy Baio thinks we’ve seen a phase change for creative endeavors:

Digital distribution subverted the monopolies held by physical distribution, bypassing distribution deals with record stores entirely, allowing artists to sell directly to fans. Social media and online music services changed the way people discover music, making the payola systems of MTV and radio airplay feel quaint. Production costs dropped dramatically as computers became more powerful and audio editing software got dirt cheap, along with new services for printing on demand. And, finally, Kickstarter and other crowdfunding platforms offset the financial risk to artists.

Most importantly, each new platform let artists find, communicate, and sell directly to their fans.

I want to believe, but whenever these transformational moments involve transition for mega-corporate concerns (if not outright downfall), I have this thought that there’s still an Empire Strikes Back chapter to be written.