Sparkin’ EMR

Link parkin’. How to layer Spark within an Amazon Elastic MapReduce cluster. Basic idea is to use a bootstrap script to deploy the toolkit.

In this article, we’ll explain how to install Shark and Spark on a cluster managed by Amazon EMR. By combining these technologies, you’ll be able to enjoy the speed enhancements of the Shark data warehouse as well as the operational and financial advantages of running your cluster on Amazon EMR.


Bad Ball

The Sports Gods had a bad night last night. Of the seven NBA games on Saturday, March 9, six had a victory margin of 10 or more points. That’s the league’s definition of a blowout, I believe. The Knicks-Utah game was over in 6 minutes. The Wizards even got a laugher over the Bobcats, who look horribly awful.

Plus the Caps got blown out by the Islanders. Georgetown crushed Syracuse. North Carolina stunk it up, on their home court, on Senior Night, against Duke. Man City won 5-1 in their FA cup match. Tiger Woods is starting to pull away in the PGA Tour stop.

Sunday was looking sort of weak as well, with Man United going up 2-0 early in the first half against Chelsea. At least The Blues got their act together, turned it on in the second half, and put a lot of exciting pressure on the Red Devils.

Thankful for a taste of quality this weekend. Maybe they’re saving up for NCAA Conference tournaments and March Madness.


NetflixGraph

Interesting. If you have relatively small and relatively static graph data, you can easily ship it around a distributed processing platform thanks to NetflixGraph:

NetflixGraph is a compact in-memory data structure used to represent directed graph data. You can use NetflixGraph to vastly reduce the size of your application’s memory footprint, potentially by an order of magnitude or more. If your application is I/O bound, you may be able to remove that bottleneck by holding your entire dataset in RAM. This may be possible with NetflixGraph; you’ll likely be very surprised by how little memory is actually required to represent your data.

NetflixGraph provides an API to translate your data into a graph format, compress that data in memory, then serialize the compressed in-memory representation of the data so that it may be easily transported across your infrastructure.
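
To make the "compress, then serialize" idea concrete, here's a minimal Python sketch. It is not NetflixGraph's API (that library is Java); the names CompactGraph, encode_varint, and decode_varints are made up for illustration, and I'm assuming node IDs are non-negative integers. Each node's sorted neighbor list is stored as gaps between successive IDs, and each gap is written as a variable-length integer, so small gaps cost a single byte.

```python
"""Toy compact directed graph: delta + varint encoded adjacency lists.

Illustrative only. This is NOT the NetflixGraph API; all names are hypothetical.
Assumes node IDs are non-negative integers.
"""
import pickle


def encode_varint(n):
    """Encode a non-negative int 7 bits at a time; high bit set means more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def decode_varints(buf):
    """Decode a concatenation of varints back into a list of ints."""
    values, current, shift = [], 0, 0
    for byte in buf:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(current)
            current, shift = 0, 0
    return values


class CompactGraph:
    def __init__(self, edges):
        adjacency = {}
        for src, dst in edges:
            adjacency.setdefault(src, set()).add(dst)
        # Store each sorted neighbor list as gaps between successive IDs.
        self._lists = {}
        for src, dsts in adjacency.items():
            ordered = sorted(dsts)
            deltas = [ordered[0]] + [b - a for a, b in zip(ordered, ordered[1:])]
            self._lists[src] = b"".join(encode_varint(d) for d in deltas)

    def neighbors(self, src):
        """Rebuild the sorted neighbor IDs by running a cumulative sum over the gaps."""
        total, result = 0, []
        for delta in decode_varints(self._lists.get(src, b"")):
            total += delta
            result.append(total)
        return result

    def to_bytes(self):
        """Serialize the already-compressed lists (pickle: trusted data only)."""
        return pickle.dumps(self._lists)

    @classmethod
    def from_bytes(cls, blob):
        graph = cls([])
        graph._lists = pickle.loads(blob)
        return graph
```

Usage is roughly: build with CompactGraph(edges), ship graph.to_bytes() to another process, then call CompactGraph.from_bytes(blob).neighbors(node) on the other side. The compressed form is what travels, so nothing gets re-encoded on arrival.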


Beware The Ides of Data

An interesting interview with Kate Crawford

Kate Crawford: I’m currently researching how big data practices are affecting different industries, from news to crisis recovery to urban design. This talk was based on that upcoming work, touching on questions of smartphones as sensors, on dealing with disasters (like Hurricane Sandy), and new epistemologies — or ways we understand knowledge — in an era of big data.

When “Six Provocations for Big Data” came out in 2011, we were critiquing the very early stages of big data and social media. In the two years since, the issues we raised are even more prominent.

I’m now looking beyond social media to a range of other areas where big data is raising questions of social justice and privacy. I’m also editing a special issue on critiques of big data, which will be coming out later this year in the International Journal of Communications.

Bonus: Nassim Nicholas Taleb’s take on Big Data.


Only In Berkeley

Nefeli Caffe

[embed]https://twitter.com/joe_hellerstein/status/309768472823488513[/embed]

One Turing Award winner, Karp, reading and reviewing another, Valiant, at a friendly little neighborhood cafe. Of course, I have a soft spot for Nefeli, since I used to work there. Wondering if I’m the only Computer Science Division student to have that privilege. I’ll have to ask Nasos the next time I’m in town.

Hellerstein is no slouch either.


Reinout’s vagrant setup

More vagrant-fu from Reinout van Rees:

I said in using vagrant for developing on OSX: why? that I chose vagrant for setting up my development environment. Now it’s time for some specifics.

I’ll also point out that vagrant’s pretty good at multivm specification, provisioning, and booting. After you get the hang of it, it’s not too bad setting up a virtualized Hadoop cluster.


basebox

I’ve managed to use veewee to successfully build baseboxes for vagrant, but didn’t exactly find the experience pleasant, mainly because the veewee post-install scripts are totally isolated within the VM. I really wanted to replace the default ssh keys for the default vagrant account, which are hardwired to be downloaded from a public URL on the net. Feels like a recipe for disaster to me, but I could only come up with a kludgy post-basebox-build fixup script to solve the problem.

Maybe the Python based basebox can make this a little cleaner:

Basebox is a small Python library for building and interacting with Vagrant boxes using Fabric. Its goals are somewhat similar to the veewee project, but is specifically geared toward developing and testing Fabric deployments.


Definitely On The DL

Unlike this guy, I’m not seriously considering dumping the NFL, but I’m definitely creeping with The Premiership. And The Champions League. La Liga every now and then. Plus macking on the FA Cup once in a while.

(6) A team owned by an insane Russian oligarch, who considers money no object to pursuing success (I think this is Chelsea, though it might be Man City. In any case the other one is owned by a sheik who consider money no object etc.)

Hey, at least I know Man City is owned by the Sheik who considers money no object to pursuing success.

The best part of world football is that the dang games are done in two hours, guaranteed. Plus the time zone shift means the slate is essentially done by 2 PM Eastern, so you really can’t blow an entire day lying on the couch watching live game action.

Now if I could only get the Bundesliga’s phone number.


Strata Trip Reports

Michael Malak does yeoman’s work writing up his observations (Days 0 & 1, Day 2, Day 3) of the Strata Conference Santa Clara. First knock on Storm I’ve heard of:

Spark streaming, according to the presenter, beats its competitor Storm at calculating metrics on the fly as they come off a queue like Kafka or Flume because Spark has fault-tolerance through node redundancy and because Spark avoids Storm’s problem of double-counting events by maintaining full historical data in memory for the specified desired window (e.g. 10 minutes). He said there is a layer over Storm that can prevent double-counting, but it achieves it by wrapping each individual event in its own transaction, and most users just abandon that solution for being non-performant.

In a barely audible aside during the presentation, they confirmed the weakness of Storm that was stated during the previous day’s Spark Streaming presentation, which is that the layer on top of Storm, Trident, that prevents double-counting is not performant.


Vagrant DSTK

Pete Warden built a Data Science Toolkit Vagrant basebox:

I have fallen in love with Vagrant over the last year, it turns an entire logical computer into a single unit of software. In simple terms, you can easily set up, run, and maintain a virtual machine image with all the frameworks and data dependencies pre-installed. You can wipe it, copy it to a different system, branch it to run experimental changes, keep multiple versions around, easily share it with other people, and quickly deploy multiple copies when you need to scale up. It’s as revolutionary as the introduction of distributed source control systems, you’re suddenly free to innovate because mistakes can be painlessly rolled back, and you can collaborate with other people without worrying that anything will be overwritten.

Before I discovered Vagrant, I’d attempted to do something similar with my Data Science Toolkit package, distributing a VMware image of a full linux system with all the software and data it required pre-installed. It was a large download, and a lot of people used it, but the setup took more work than I liked. Vagrant solved a lot of the usability problems around downloading VMs, so I’ve been eager to create a compatible version of the DSTK image. I finally had a chance to get that working over the weekend, so you can create your own local geocoding server just by running:

vagrant box add dstk http://static.datasciencetoolkit.org/dstk_0.41.box

vagrant init

Cool! I’m becoming more of a fan of vagrant as well. This may have to be the first basebox I try out on Ye ’Olde MacBook. I was thinking a CartoDB 2.0 basebox build would be fun to do, but someone already beat me to it.


Harlem Shake-Off

[embed]http://www.youtube.com/watch?v=Ir2TdfSwH8g[/embed]

Speaking of the Miami Heat, I’m usually immune to Internet memes, but I got sucked in by the Miami Heat’s version of the Harlem Shake. My Little Guy (™) had to watch it 20 times in a row. Where’s Andy Baio when we need him?

That’s a lot of high-priced talent having a big old goof. LeBron James seems to really get into it. But what I really want to know is who’s wearing the championship belt? My guess is Joel Anthony, but more investigation is needed.

However, I’m actually somewhat partial to the Kansas Jayhawks edition, which I suspect might have inspired the Heat (via Mario Chalmers?). Could be a skoosh longer but has the upside of a Bill Self appearance:

[embed]http://www.youtube.com/watch?v=SbGFYjCbpRs[/embed]


Strata Observations

[embed]https://twitter.com/bigdata/status/308266084656619520[/embed]

More Ben Lorica with some observations coming out of the recent O’Reilly Strata Conference:

Here are a few observations based on conversations I had during the just concluded Strata Santa Clara conference.

Spark is attracting attention

I’ve written numerous times about components of the Berkeley Data Analytics Stack (Spark, Shark, MLbase). Two Spark-related sessions at Strata were packed (slides here and here) and I talked to many people who were itching to try the BDAS stack. Being able to combine batch, real-time, and interactive analytics in a framework that uses a simple programming model is very attractive. The release of version 0.7 adds a Python API to Spark’s native Scala interface and Java API.

I’m already in the tank for Spark, but Lorica’s got a couple of other interesting observations to add.


Top N Best Ever

Too often I hear pro sports commentators cut loose with “Flavor of the Month is one of the top n best ever…”. Drives me batty.

Jeff van Gundy, in today’s Heat-Knicks broadcast, let fly with “LeBron James is one of the 10 best NBA players ever.” Oh really Jeff? Here’s ten names of NBA greats, in no particular order:

  • Bill Russell
  • Wilt Chamberlain
  • Kareem Abdul-Jabbar
  • Shaquille O’Neal
  • Michael Jordan
  • Larry Bird
  • Magic Johnson
  • Jerry West
  • Oscar Robertson
  • Kobe Bryant
  • Honorable mention to: Tim Duncan, David Robinson, Isiah Thomas, Elgin Baylor, Hakeem Olajuwon, Karl Malone, Charles Barkley, and Pete Maravich

I love James’ physical gifts and talents, but right now, which one of those names are you replacing with LeBron James?

Oh, and could some destitute NBA team hire van Gundy so I don’t have to listen to him on the TV anymore?


BDAS Tutorial

[embed]https://twitter.com/bigdata/status/306810999380516864[/embed]

This tutorial, the first of a two-part series, will provide an introduction to BDAS, the Berkeley Data Analytics Stack. BDAS is an open source, next-generation data analytics stack under development at the UC Berkeley AMPLab whose current components include Spark, Shark and Mesos. We will start by covering Spark, a high-speed cluster computing system compatible with Hadoop that can outperform it by up to 100x thanks to its ability to perform computations in memory. Spark provides concise, high-level APIs in both Scala and Java, and is in use at Foursquare, Conviva, Klout, Quantifind, and other companies. We will provide an overview of the Spark architecture, typical data analytics workflows (e.g., loading data from HDFS into memory and interactively querying it), and how users are applying Spark. In addition, we will also introduce Shark, a port of Apache Hive onto Spark that is compatible with existing Hive warehouses and queries. Shark can answer HiveQL queries up to 100x faster than Hive without modification to the data and queries, and is also open source as part of BDAS.

Tutorial Part 1 (with PowerPoint slides) and Part 2.


AtBat ’13

Erica Ogg at GigaOM did a quick review of MLB’s AtBat ’13 mobile app:

Opening Day of the Major League Baseball season is still about a month away. But the best app for following all the games is here already.

I agree and have already anted up for my season pass. It was well worth it last year. By far the best and most useful of the professional sports apps.


API Monetizing

Jacob Perkins describes turning a side hacking project into something “profitable”:

Mashape was just what I needed to monetize the text-processing API, and it’s improved tremendously since I started using it. They handle all the necessary details, plus a lot more, like usage charts, latency & uptime measurements, and automatic client library generation. This last is one of my favorite features, because the client libraries are generated using your API documentation, which provides a great incentive to accurately document the ins & outs of your API. Once you’ve documented your API, downloadable libraries in 5 different programming languages are immediately available, making it that much easier for new users to consume your API. As of this writing, those languages are Java, PHP, Python, Ruby, and Objective C.


Spark and Amazon EMR

Howto on deploying Spark into Amazon’s elastic environment:

A common business scenario is the need to store and query large data sets. You can do this by running a data warehouse on a cluster of computers. By distributing the data over many computers, you return results quickly because the computers share the load of processing the query. One limitation on the speed at which queries can be returned, however, is the time it takes to retrieve the data from disk.

You can increase the speed of queries returned from a data warehouse by using the Shark data warehouse system. Shark runs on top of Spark, an open-source cluster computing system optimized for speed. Spark speeds up data analytics by loading data into memory, providing much faster performance than a disk-based system like Hadoop. For more information on Spark, see http://spark-project.org/.


Spark and Python

[embed]https://twitter.com/juliaferraioli/status/306844794636869634[/embed]

Confirmed. PySpark and Streaming Spark in the same release? I may have died and gone to heaven. Go Bears!

Updated link to point to Spark release notes


LibShortText

Looks like this might be a useful library for processing social media:

LibShortText is an open source tool for short-text classification and analysis. It can handle the classification of, for example, titles, questions, sentences, and short messages.


Poppin’ Soda Can

Sometimes you just know.

I’ve only given it two spins on the iPhone, but Mark Farina’s Soda Pop Can edition of the Great Lakes Audio podcast is the most listenable thing I’ve got my hands on in approaching TWO YEARS!! The closest recently comparable mix I’ve acquired was Evol Intent’s Us Against the World, which I got back in late April of 2011.

Soda Pop Can starts out deceptively languorous and down tempo, then maintains pretty close to the same BPM just about all the way throughout. This would typically be a recipe for monotony, but Farina is at the top of his selectin’ game, so the entire mix is an end-to-end toe-tapping head nodder. Peak moment is reserved for K&B’s Let’s Stay Together (a B-side no less), promptly followed by Ben La Desh’s hypnotic Motion.

One hour, seventeen minutes of pure house?, downtempo? bliss.

You… must…download…now!


vagrant and veewee

As promised, I’m getting up to speed on vagrant at work. It’s working out about as well as expected, with an anticipated steep learning curve. The big hurdle is having to learn a bit of automated configuration management using Puppet at the same time. All in all I’m surprised at how quickly I’ve managed to get a multi-VM Hadoop virtual cluster going.

There is one concern though, adequately stated, and solved by the veewee project:

Vagrant is a great tool to test out new things or changes in a Virtual Machine (Virtualbox) using either chef or puppet.

The first step to build a new Virtual Machine is to download an existing ‘base box’.

I believe this scares a lot of people as they don’t know who or how this box was built. Therefore lots of people end up first building their own base box which is time consuming and often cumbersome.

Veewee aims to automate all the steps for building base boxes and to collect best practices in a transparent way.

Veewee is next up on the stack of tools to learn.


Waiting For NetNewsWire

Sigh. Took a look at the last release date for NetNewsWire in the App Store. Nov 12th, 2010. The product still works for me but it’s feeling a bit pre-Retina at least, and it would be nice to see a couple of new features. With the potential GReaderpocalypse, may have to cast around for a backup iOS feedreader.


Columnar Storage Overview

A one-post review of some of the principles behind columnar storage formats, which underlie some of the advanced Big Data query engines:

You’re going to hear a lot about columnar storage formats in the next few months, as a variety of distributed execution engines are beginning to consider them for their IO efficiency, and the optimisations that they open up for query execution. In this post, I’ll explain why we care so much about IO efficiency and show how columnar storage – which is a simple idea – can drastically improve performance for certain workloads.
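
For my own benefit, here's a tiny Python sketch of the IO argument. It is not taken from the linked post; the three-field record, the fixed 16-byte name field, and the function names are all assumptions made for illustration. The same records are laid out row by row and column by column, then a single field is summed while counting how many bytes each layout forces you to read.

```python
"""Row layout vs. column layout for summing one field.

Illustrative only: the record shape and all names are hypothetical.
"""
import struct

# One record: 4-byte int id, 16-byte padded name, 8-byte float amount (28 bytes).
ROW = struct.Struct("<i16sd")

RECORDS = [
    (1, "alice", 34.5),
    (2, "bob", 12.0),
    (3, "carol", 99.25),
]


def row_layout(records):
    """Pack whole records back to back, the way a row store would."""
    return b"".join(ROW.pack(i, name.encode(), amount) for i, name, amount in records)


def column_layout(records):
    """Pack each field contiguously, the way a column store would."""
    n = len(records)
    return {
        "id": struct.pack("<%di" % n, *(r[0] for r in records)),
        "name": b"".join(struct.pack("16s", r[1].encode()) for r in records),
        "amount": struct.pack("<%dd" % n, *(r[2] for r in records)),
    }


def sum_amounts_rows(buf):
    """Summing one field still walks every byte of every record."""
    total = sum(amount for _id, _name, amount in ROW.iter_unpack(buf))
    return total, len(buf)


def sum_amounts_columns(columns):
    """Only the 'amount' column is read; the other columns are never touched."""
    buf = columns["amount"]
    amounts = struct.unpack("<%dd" % (len(buf) // 8), buf)
    return sum(amounts), len(buf)


if __name__ == "__main__":
    print(sum_amounts_rows(row_layout(RECORDS)))
    print(sum_amounts_columns(column_layout(RECORDS)))
```

Both functions return the same total, but the row version scans all 28 bytes of each record while the column version scans only the 8-byte amounts. With wide tables and queries that touch a few fields, that gap is the IO saving.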


New D3 Book

Scott Murray is winding down on Interactive Data Visualization for the Web. Will make a nice partner on my Kindle shelf for my copy of Getting Started With D3 which I, uhhh, really need to get to work on reading.

Murray’s book started out as a collection of D3 Tutorials.


Emacs For Python

Emacs-for-python is a new toolkit/package/collection of Python Emacs-lisp. Purports to have an in-development mode for virtualenvs, which the world really needs and deserves:

I’m collecting and customizing the perfect environment for python development, using the most beautiful emacs customization to obtain a really modern and exciting (yet stable) way to edit text files.

In the package are included also a lot of other packages and configurations, it’s an upstart for clean emacs installations, these configuration however are very similar to emacs-starter-kit and I suggest you to give it a try, emacs-for-python is designed to work with it (instruction below).


Hadoop Munging OSM

Michal Migurski crisply documents his Hadoop/Elastic Map Reduce process for working with Open Street Map data:

Usable line generalization for OSM roads and routes has been a hobby project of mine for several years now, since I argued for it at the first U.S. State of the Map conference in Atlanta, 2010. I’ve finally put the last piece of this project in place with the use of Hadoop to parallelize the geometry processing.

I’ve learned a lot about moving geographic data between Postgres and Hadoop. The result is available at Streets and Routes.

The end result is something I’m super happy about: a complete worldwide dataset of simplified roads and routes that’s suitable for high-quality labels and route shields.

A couple of punch lines to that final statement. First, Migurski was a relative newbie to Hadoop, and didn’t use anything more sophisticated than Hadoop Streaming and straight Python with Shapely. Second, his expensive run cost him $9.40 and 7 hours of waiting.


Posterous Done

Hey, this blog has been around long enough to see even YCombinator grade companies be born, exit, and shut down:

Posterous launched in 2008. Our mission was to make it easier to share photos and connect with your social networks. Since joining Twitter almost one year ago, we’ve been able to continue that journey, building features to help you discover and share what’s happening in the world – on an even larger scale.

On April 30th, we will turn off posterous.com and our mobile apps in order to focus 100% of our efforts on Twitter. This means that as of April 30, Posterous Spaces will no longer be available either to view or to edit.

Right now and over the next couple months until April 30th, you can download all of your Posterous Spaces including your photos, videos, and documents.

I always thought Posterous had promise, and even used it lightly for some personal side-blogging, but it never really quite clicked or demonstrated a “wow” feature moment. I think the two towers of WordPress and Tumblr have pretty much sucked up all the air of blogging tools.

A worthy attempt though!


M*F*in Flu

They were not kidding this year. This flu is kicking my ass, and I even had the dang’ preventative shot.

At least I’m not passing out on The Left Coast, and no, fingers crossed, pink eye.


Graph Based Recommendation

[embed]https://twitter.com/andyhickl/status/303691512284327937[/embed]

Color me skeptical, but Reco4j might be worth a test drive:

Reco4j is an open source project that aims at developing a recommendation framework based on graph data sources. We choose graph databases for several reasons. They are NoSQL databases that are “schemaless”. This means that it is possible to extend the basic data structure with intermediate information, i.e. similarity value between item and so on. Moreover, since every information are expressed with some properties, nodes and relations, the recommendation process can be customized to work on every graph.

Hmmmm. But are they high quality recommendations for every graph?


Snap Polling

Saw this post by Carl Franzen a while ago on IdeaLab. Found it interesting that there may be a new “polling science” that could emerge out of real-time social media analytics. The trick will be having some real science, backed by validation.

Social media is a big draw for over 2.3 billion users around the globe, but its true value is only starting to be unlocked, according to Poptip, a New York-based startup company that has seen early success running quick, realtime text analysis of tweets on Twitter as a kind of hi-tech snap polling method.

Mainstream companies from Pepsi to People Magazine to ESPN have begun experimenting with Poptip’s first and to date, only product, an app that allows users to post their questions on Twitter, then tracks and displays responses from respondents in realtime.

Even curiouser, the original post seems to have disappeared from IdeaLab. Factually incorrect? Scammed? Weird. Need to do some digging.


Using psql

Craig Kerstiens lets loose with his preferred interface to Postgres:

Sometimes it leans more to, what is the Sequel Pro equivalent for Postgres. My default answer is I just use psql, though I do have to then go on to explain how I use it. For those just interested you can read more below or just get the highlights here:

There’s a few handy TILs in there.


Precog in PgSQL

Love to do as much as possible within Postgres:

Today, the Precog team has released a free implementation of Precog for PostgreSQL. Precog for PostgreSQL empowers users to easily perform data science on PostgreSQL.


Morphology Is King

Quoting Lucas Gonze, who’s always been a good read. Might be worth a sub:

I’m blogging about new form factors for computing on a Tumblr blog at formfactor.gonze.com. My topic there is post-mobile contexts like biohacking, head mounted displays, wearable computing, smart TV, haptics, and neural interfaces. The subtext is new applications enabled by changes in the physical shape of devices.


Pandas Whirlwind

[embed]https://twitter.com/wesmckinn/status/301057668938887168[/embed]

Gotta go easy tonight. Some bug has gotten hold of me.


Data Science Toolkit Details

I sort of already knew of Pete Warden’s Data Science Toolkit, but Ajay Ohri gives a nice overview of what’s actually inside:

The Data Science Toolkit is a collection of the best open data sets and open-source tools for data science, wrapped in an easy-to-use REST/JSON API with command line, Python and Javascript interfaces. Available as a self-contained Vagrant VM or EC2 AMI that you can deploy yourself.

The Data Science Toolkit is essentially a specialized Linux distribution, with a lot of useful data software pre-installed and exposing a simple interface. Developer documentation is quite nicely done.


Urban Coyotes Quick Thoughts

So I’ve finally been able to give Mark Farina’s Urban Coyotes mix a few listens. Enjoying the blend although my current stance is that it’s at the quality of some of his Great Lakes Audio podcast mixes. This isn’t to say that Urban Coyotes is bad, just not super, duper, mega, head nodding awesome like some other Farina House music mixes. In particular, the only peak for me is Flapjackers Candy.

So far, not at the top of the pantheon, but I need to give it some more wear for fit. I’m definitely not sending it back though.

Am I True Chicago (™) though in immediately trainspotting Ron Magers on the voice overs? The “coyotes” pronunciation bugs me however. Should be three syllables, not two, with a long e.


Will It Python?

A series on data analysis, from R to Python, definitely worth tracking:

Will it Python? posts are my attempts to port data analyses originally done in R into Python.

The objective isn’t to just make a key that translates functions and methods in R into Python equivalents. Instead, the goal is to reproduce the results and insights of the analysis in idiomatic Python (to the extent I’m qualified to judge such a thing). Sometimes there will be a direct translation from a line of R to a line of Python; other times Python will suggest an altogether different approach to the problem.


More Linden Links

Greg Linden is back with another collection of interesting links. A skoosh more business oriented than is my wont, and probably more than any one human should read in a single sitting, but still some good nuggets to be had:

The future of maps on smartphones: “It’ll be like you’re a local everywhere you go. You’ll know your way through the back alleys and hutongs of Beijing, you’ll know your way all around Paris even if you’ve never been before. Signs will seem to translate themselves for you. This kind of extra-smartness is coming to people.” (1)


Revisiting Prismatic

Previously, Prismatic was pretty low on the Yet Another Place To Find News depth chart. Feeds are Number 1, Twitter is Number 2, Hacker News 3, a few sub-Reddits 4, and nothing else really mattered.

However I’m getting disillusioned with Hacker News and need a break. Not enough good, core tech nuggets, and too much aspirational and political content reaching the front page. And I also trawl the new submission page but that’s pretty hit or miss.

So I’m going to provisionally bump Prismatic up to the three hole, and try to replace my Hacker News with it. I think Prismatic’s link saving/stashing mechanism has greatly improved since my last serious usage, so maybe it’ll feel much more useful.

Update: Forgot about one reason I stopped using Prismatic regularly. Whatever JavaScript Fu it’s doing, Prismatic routinely crashes Safari on my iPad. Downer.


Wiz Nets

[embed]https://twitter.com/DidTheWizWin/status/300068006921379840[/embed]

That’s 3 home wins in 1 week, including taking out Manhattan and Brooklyn. Got any more boroughs for us to beat NYC?

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.