
Ricon East Talks

Wow! Basho’s Ricon East conference was a little more diverse and wide-ranging than I anticipated. This was evidenced by Anders Pearson’s summary of the talks he attended. For example, this lede on ZooKeeper for the Skeptical Architect by Camille Fournier, VP of Technical Architecture, Rent the Runway:

Camille presented ZooKeeper from the perspective of an architect who is a ZooKeeper committer, has done large deployments of it at her previous employer (Goldman Sachs), left to start her own company, and that company doesn’t use ZooKeeper. In other words, taking a very balanced engineering view of what ZooKeeper is appropriate for and where you might not want to use it.

Of the talks Pearson summarized, only two were by Basho employees, while the rest were by some pretty serious distributed systems folks such as Margo Seltzer and Theo Schlossnagle. Plus there was a healthy dose of industry war story experience at scale.

Good on Basho!

Via John Daily


HDFS Gets Snakebitten

OEmbed Link rot on URL: https://twitter.com/pypi/status/335412456396558336

Another good find from the PyPi Twitter stream. Had to do a quick Google search to get the real details on snakebite, a pure Python library for interacting with Hadoop’s HDFS:

Another annoyance we had with Hadoop (and in particular HDFS) is that interacting with it is quite slow. For example, when you run hadoop fs -ls /, a Java virtual machine is started, a lot of Hadoop JARs are loaded and the communication with the NameNode is done, before displaying the result. This takes at least a couple of seconds and can become slightly annoying. This gets even worse when you do a lot of existence checks on HDFS; something we do a lot with luigi, to see if the output of a job exists.

So, to circumvent slow interaction with HDFS and having a native solution for Python, we’ve created Snakebite, a pure Python HDFS client that only uses Protocol Buffers to communicate with HDFS. And since this might be interesting for others, we decided to Open Source it at http://github.com/spotify/snakebite.

Roger that on the annoyingly slow response of hadoop fs. Thanks Spotify.


Jepp, CPython and Java

TIL about Jepp:

Jepp embeds CPython in Java. It is safe to use in a heavily threaded environment, it is quite fast and its stability is a main feature and goal.

Could be handy for cutting down performance overhead at some points in the Hadoop stack where Python and Java come together. I’m looking at you, Hadoop Streaming. Also for helping Python out with the myriad of serialization formats that Java does oh so well.

Via Morten Petersen


Praising Data Engineering

Metamarkets’ M. E. Driscoll gives a shout out to those mucking about with the bits:

A stark but recurring reality in the business world is this: when it comes to working with data, statistics and mathematics are rarely the rate-limiting elements in moving the needle of value. Most firms’ unwashed masses of data sit far lower on Maslow’s hierarchy at the level of basic nurture and shelter. What is needed for this data isn’t philosophy, religion, or science — what’s needed is basic, scalable infrastructure.

The more data analysis I do, the more plain ’ole wrestling with the data becomes critical. And figuring out the plumbing and tools to make that happen becomes more interesting.

Via Rafe Colburn


EC2 Instance Primer

Amazon EC2 is a great service, but sometimes it’s hard to keep track of all the virtual machine types that are provided. Jeff Barr put together a handy, comprehensive backgrounder on Amazon EC2 instance families and types:

Over the past six or seven years I have had the opportunity to see customers of all sizes use Amazon EC2 to power their applications, including high traffic web sites, Genome analysis platforms, and SAP applications. I have learned that the developers of the most successful applications and services use a rigorous performance testing and optimization process to choose the right instance type(s) for their application.

In order to help you to do this for your own applications, I’d like to review some important EC2 concepts and then take a look at each of the instance types that make up the EC2 instance family.

Even better, he covers the intended use cases for each family and their designed performance tradeoffs. Keep it in your back pocket if you’re an EC2 hacker.


GraphLab Inc.

I’ve mentioned GraphLab and have been toying with it since before its 1.0 release. Now the stakes have been raised with a de-cloaking and a heap of venture capital. Good luck to Professor Guestrin and crew.


Truer Words

OEmbed Link rot on URL: https://twitter.com/UnlikelyWorlds/status/334384901547757568

Truer words were never spoken of your humble narrator. Would that he could get his outer pedant under control.


Python XML Processing

The Discogs.com data is in some humongous XML files, which is a little unruly for many data hacking tasks. Python has some great XML processing modules, but it’s always good to have a little guidance. Enter this oldie but goodie from Eli Bendersky on Processing XML in Python with ElementTree:

As I mentioned in the beginning of this article, XML documents tend to get huge and libraries that read them wholly into memory may have a problem when parsing such documents is required. This is one of the reasons to use the SAX API as an alternative to DOM.

We’ve just learned how to use ET to easily read XML into an in-memory tree and manipulate it. But doesn’t it suffer from the same memory hogging problem as DOM when parsing huge documents? Yes, it does. This is why the package provides a special tool for SAX-like, on the fly parsing of XML. This tool is iterparse.

I will now use a complete example to demonstrate both how iterparse may be used, and also measure how it fares against standard tree parsing.

If I were going to update Bendersky’s post, I wouldn’t change much, other than to mention lxml and lxml.etree, which provide high-performance streaming XML processing.
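The iterparse pattern Bendersky demonstrates is easy to sketch; here’s a minimal, self-contained example (the two-record document is made up for illustration) showing the stream-and-clear idiom that keeps memory flat on huge files:

```python
# Stream an XML document with iterparse instead of building the whole tree.
import io
import xml.etree.ElementTree as ET

xml_data = io.BytesIO(b"""<catalog>
  <release id="1"><title>First</title></release>
  <release id="2"><title>Second</title></release>
</catalog>""")

titles = []
# iterparse yields (event, element) pairs as the document streams by;
# on an "end" event the element and its children are fully parsed.
for event, elem in ET.iterparse(xml_data, events=("end",)):
    if elem.tag == "release":
        titles.append(elem.findtext("title"))
        elem.clear()  # free the subtree we no longer need

print(titles)  # ['First', 'Second']
```

The `elem.clear()` call is the key move: without it, iterparse still accumulates the whole tree behind your back.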


More Git Material

Haven’t finished working through them, but these git intros feel pretty useful. Slideshare alert if you’re allergic.

Introduction to git is by the venerable Randal Schwartz. It’s gathered a little dust, but if it’s up to typical Schwartz standards, it’s still well worth reading.

Lemi Orhan Ergin’s Git branching Model might be overly stylish, but looks like it goes into detail on merging in addition to branching.

Via Rajiv Pant.


SourceTree

Link parkin’: SourceTree, Atlassian’s desktop GUI DVCS client:

Full-powered DVCS

Say goodbye to the command line – use the full capability of Git and Mercurial in the SourceTree desktop app. Manage all your repositories, hosted or local, through SourceTree’s simple interface.


Mission Accomplished

Still checking for consistency, but it looks like I’ve nearly completed my mission of grabbing all the currently available Discogs.com data dumps. Have one more to grab and verify the checksum. Then I should be good to go. 45+ Gb (compressed) to romp through.

Oddly, it looks like we’re only getting releases updated for the month of May. Curious.


Emacs Temp Files in Their Place

Really handy tip from Emacs Redux:

Auto-backup is triggered when you save a file - it will keep the old version of the file around, adding a ~ to its name. So if you saved the file foo, you’d get foo~ as well.

auto-save-mode auto-saves a file every few seconds or every few characters …

Even though I’ve never actually had any use of those backups, I still think it’s a bad idea to disable them (most backups are eventually useful). I find it much more prudent to simply get them out of sight by storing them in the OS’s tmp directory instead.

I find the biggest pain with autosave files is getting git to ignore their existence. Yeah, I can fiddle around with .gitignore files, but that never quite seems to be universally applied correctly for me. Keeping emacs temp files out of project directories entirely makes the whole issue go away.


GLA38

Go get your latest Mark Farina podcast, NOW!


Continuous Partial Insanity

Playing off of continuous partial attention, a particularly bad patch of TV convinced me it’s just a medium for “continuous partial insanity”. Between The News, “reality shows”, the fictional programming, and the advertising, the only intent is to keep you in a state of intense emotional elation or despair. Mostly despair, since fear drives sales.

Criminy! Sports is a relative island of rationality, structure, and order.

Interestingly, a Google search for “continuous partial insanity” currently only brings up a long abandoned blog, parked on it as a tagline. Seems like an opportunity.


A Week of Google Glass

Luke Wroblewski takes interface design and user experience in a serious fashion. So his Google Glass experience was the first commentary I took seriously:

Almost a week ago I picked up my Glass explorer edition on Google’s campus in Mountain View. Since then I’ve put it into real-world use in a variety of places. I wore the device in three different airports, busy city streets, several restaurants, a secure federal building, and even a casino floor in Las Vegas. My goal was to try out Glass in as many different situations as possible to see how I would or could use the device.

During that time, Scott Jenson’s concise mandate of user experience came to mind a lot. As Scott puts it, “value must be greater than pain.” That is, in order for someone to use a product, it must be more valuable to them than the effort required to use it. Create enough value and pain can be high. But if you don’t create a lot of value, the pain of using something has to be really low. It’s through this lens that I can best describe Google Glass in its current state.

Definitely worth a full read, especially for the punch line.


Tell Us How You Really Feel

Like I said, I enjoy a good curmudgeonly rant. Stephen Few has not been having a good couple of months with publishers.

When I fell in love with words as a young man, I developed a respect for publishers that was born mostly of fantasy. I imagined venerable institutions filled with people of great intellect, integrity, and respect for ideas. I’m sure many people who fit this description still work for publishers, but my personal experience has mostly involved those who couldn’t think their way out of a wet paper bag and apparently have no desire to try.

Said most recent experience involves a bait and switch by Taylor & Francis (the publisher) on rights to some material Few was providing to an academic journal. Guy goes out of his way to put something together, I’m sure of high quality, and they want to reserve the right to modify his work. After they agreed in principle to his terms.

Something similar happened to Danah Boyd, and I’m noticing a pattern. Well-intentioned journal editor from academia agrees to reasonable terms from fellow academic. Publisher waits until the last minute to pull the okee-doke: “Well, we can’t really do that. If you don’t agree to our onerous terms we’ll have to pull your article.” If these guys didn’t have their hooks so tightly intertwined with the tenure process, this behavior would be so over.


Datastructures With Norman

OEmbed Link rot on URL: https://twitter.com/pypi/status/301268885045379072

Speaking of finding interesting things on @PyPi, here’s Norman:

Norman is a framework for advanced data structures in python using a database-like approach. The range of potential applications is wide, for example in-memory databases, multi-keyed dictionaries or node graphs.

For the longest time I’ve been thinking one could transliterate prefuse into Python to enable interactive visualization programming at a high level. The critical hurdle was prefuse’s table oriented datastructures and queries. In-memory sqlite could probably do the trick, but then you’ve got to deal with serialization and deserialization of Python objects.
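To make the in-memory sqlite idea concrete, here’s a minimal sketch of prefuse-style table-oriented storage and querying; the table and column names are invented for illustration:

```python
# An in-memory sqlite table standing in for a prefuse-style node table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT, degree INTEGER)"
)
conn.executemany(
    "INSERT INTO nodes (label, degree) VALUES (?, ?)",
    [("a", 3), ("b", 1), ("c", 5)],
)

# Query-driven selection, the way prefuse filters items for rendering:
# pick out the high-degree nodes.
rows = conn.execute(
    "SELECT label FROM nodes WHERE degree > 2 ORDER BY label"
).fetchall()
print([label for (label,) in rows])  # ['a', 'c']
```

The serialization headache mentioned above is real, though: anything richer than numbers and strings has to be flattened into columns or pickled into blobs.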

Norman looks like it might fit the bill better for a prefuse knockoff.


So That Explains It

I follow @PyPi on Twitter, which just streams Python package announcements. It’s a cheap way to get exposure to new and interesting modules. But every day it seems like there are a couple of newly minted 0.1 packages for “printing nested lists”. Curious, but not worth investigating.

Curiosity satisfied thanks to the Python Reddit, another useful place for Python developments:

They’re generated by people following along an example in the book Head First Python.

The book’s author has amended the lesson (through errata and next edition I guess) to point learners at testpypi.python.org (which didn’t exist at the time the book was written).

I run a cleanup script that deletes them every now and then. I haven’t run it for a while… I’ll put it on my looong TODO list…


IPython mini-Book

Will definitely have to shell out for Cyrille Rossant’s Learning IPython for Interactive Computing and Data Visualization:

This book is a beginner-level introduction to IPython for interactive Python programming, high-performance numerical computing, and data visualization. It assumes nothing more than familiarity with Python. It targets developers, students, teachers, hobbyists who know Python a bit, and who want to learn IPython for the extended console, the Notebook, and for more advanced scientific applications.

Too much good e-book tech material at a good price these days.


CloudPull

From GoldenHill Software:

CloudPull seamlessly backs up your Google account to your Mac. It supports Gmail, Google Contacts, Google Calendar, Google Drive (formerly Docs), and Google Reader. By default, the app backs up your accounts every hour and maintains old point-in-time snapshots of your accounts for 90 days.

Emphasis mine. Gonna’ try this out over the weekend.


Top Casts

Although I’ve fallen off the film viewing wagon, I’m always intrigued by movies with “all-star” casts. For example, Pulp Fiction has Travolta, Jackson, Thurman, Willis, Roth, Plummer, Rhames, Walken, Buscemi, Keitel, and of course Tarantino as actors. I’ve never seriously sat down and tried to quantify what this meant, but 10 “big time” stars seems like a reasonable threshold.

Then of course, the question is what’s “big time”? And there is the sticking point.

Today I had the brilliant idea that you could, relatively easily, define “top billing” based upon IMDB movie data. If an actor is listed as, say, one of the top 5 for their gender in the credits (for a few years?), call them an All-Star. Still a little squishy, but firmer. Then you can quantitatively evaluate each film, rank, and decide.
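A toy sketch of the idea, ignoring the per-gender and multi-year wrinkles; all the film and actor names here are made up:

```python
# "All-star cast" scoring: an actor is an All-Star if they ever appear in
# the top N billing slots; a film's score is how many All-Stars it features.
TOP_N = 2

films = {
    "Film A": ["Star1", "Star2", "Actor3"],
    "Film B": ["Star2", "Star4", "Star1"],
    "Film C": ["Actor5", "Star4"],
}

# Pass 1: collect everyone who gets top billing anywhere in the corpus.
all_stars = {actor for cast in films.values() for actor in cast[:TOP_N]}

# Pass 2: score each film by its All-Star count, then rank.
scores = {title: sum(a in all_stars for a in cast) for title, cast in films.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
# [('Film B', 3), ('Film A', 2), ('Film C', 2)]
```

With real IMDB credits data the two passes are the same; the judgment calls are picking N and the year window.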

Interesting challenge, and I wonder how it could apply to major league sports teams?


Making Data Progress

Slowly making headway downloading the Discogs data dumps. Got 19 complete months in hand. Now into the era of no masters files and release files less than 1GB. Current total storage is roughly 29Gb.

Looking forward to some serious data hacking.


The Next Tachyon

@bigdata goes deeper into Tachyon:

A release slated for the summer will include features that enable data sharing (users will be able to do memory-speed writes to Tachyon). With Tachyon, Spark users will have, for the first time, a high throughput way of reliably sharing files with other users. Moreover, despite being an external storage system, Tachyon is comparable to Spark’s internal cache. Throughput tests on a cluster showed that Tachyon can read 200x and write 300x faster than HDFS. (Tachyon can read and write 30x faster than FDS’ reported throughput.)

Similar to the resilient distributed datasets (RDD) fundamental within Spark, fault-tolerance in Tachyon also relies on the concept of lineage – logging the transformations used to build a dataset, and using those logs to rebuild datasets when needed. Additionally, as an external storage system, Tachyon also keeps track of binary programs used to generate datasets, and the input datasets required by those programs.

Terabyte scale analytics at interactive speeds. Coming soon to a laptop near you.


Why Tuples?

Steve Holden, who knows a thing or two about Python, explains the existence of the tuple datatype in the programming language:

And that, best beloved, is what tuples are for: they are ordered collections of objects, and each of the objects has, according to its position, a specific meaning (sometimes referred to as its semantics). If no behaviors are required then a tuple is “about the simplest thing that could work.”

Has some good insights, but I think tuple immutability and hashability is vastly undersold.
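To underscore the undersold part: because tuples are immutable, they’re hashable (when their elements are), which is what lets them serve as dictionary keys and set members. A quick sketch:

```python
# Tuples can key a dict; lists with the same contents cannot.
point_names = {(0, 0): "origin", (1, 0): "unit-x"}
print(point_names[(0, 0)])  # origin

# A list is mutable, hence unhashable, hence rejected as a key.
try:
    point_names[[0, 0]]
except TypeError as exc:
    print("lists can't be keys:", exc)
```

That dict-key trick is where tuples earn their keep in everyday code: composite keys like `(year, month)` fall out for free.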


Scoffing

I’m back in Philadelphia. Hotel Wi-Fi, I scoff at you again with your $12 (!!) a night charge. With Verizon and AT&T’s LTE on my side, I surf without fear. Unlike last time, didn’t even have to leave the room.


Scaling Mining Analytics

Just a quick scan of Jimmy Lin’s paper (PDF Warning) hints that there are some useful insights regarding logging at scale, which is currently an interest of mine:

A little about our backgrounds: The first author is an Associate Professor at the University of Maryland who spent an extended sabbatical from 2010 to 2012 at Twitter, primarily working on relevance algorithms and analytics infrastructure. The second author joined Twitter in early 2010 and was first a tech lead, then the engineering manager of the analytics infrastructure team. Together, we hope to provide a blend of the academic and industrial perspectives—a bit of ivory tower musings mixed with “in the trenches” practical advice. Although this paper describes the path we have taken at Twitter and is only one case study, we believe our recommendations align with industry consensus on how to approach a particular set of big data challenges.


Safari To Go

TIL there’s an iPad App for O’Reilly’s Safari Online library of books:

Now available for iOS and Android devices. Safari To Go is available for free and delivers full access to thousands of technology, digital media, business and personal development books and training videos from more than 100 of the world’s most trusted publishers. Search, navigate and organize your content on any WiFi or 3G/4G connection. Plus, cache up to three books to your offline bookbag to read when you can’t connect!

Works great for me since my employer provides Safari accounts!


Vincent Pandas

Two great tastes that taste great together:

The Pandas Time Series/Date tools and Vega visualizations are a great match; Pandas does the heavy lifting of manipulating the data, and the Vega backend creates nicely formatted axes and plots. Vincent is the glue that makes the two play nice, and provides a number of conveniences for making plot building simple.

Useful examples ensue.


Fun With Discogs Data

I’ve decided to try and pick up a “datadidact” habit, by regularly working with a large dataset. Even if it’s doing lowly basic characterization, this should force me to hone various skills and brush up on some basic knowledge.

Having spoken before of the Discogs.com dataset, their repository would appear to be a treasure trove and is completely unrelated to anything at work to boot. Thought siccing wget on http://www.discogs.com/data/ would be a no-brainer and a quick start. Except it’s blocked to crawlers based upon their robots.txt. ’Twould be nice if the HTTP Response for the data URL actually was more informative than a 500 error, but I can understand where Discogs is coming from.

However, I WILL NOT BE DENIED! Just have to do it tediously by hand through my browser. So be it. The longitudinal analysis possibilities are too intriguing.

Already have some initial data in hand. Not looking forward to dealing with 11Gb XML files, though. First item might be to convert the data into a record/line-oriented format.
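A sketch of what that conversion might look like: stream the XML and emit one JSON object per line. The element names here are simplified stand-ins for the real Discogs schema:

```python
# Flatten big XML into line-oriented JSON records with a streaming parse.
import io
import json
import xml.etree.ElementTree as ET

xml_data = io.BytesIO(b"""<releases>
  <release id="10"><title>Alpha</title><year>1999</year></release>
  <release id="11"><title>Beta</title><year>2003</year></release>
</releases>""")

lines = []
for _, elem in ET.iterparse(xml_data):
    if elem.tag == "release":
        record = {
            "id": elem.get("id"),
            "title": elem.findtext("title"),
            "year": elem.findtext("year"),
        }
        lines.append(json.dumps(record, sort_keys=True))
        elem.clear()  # keep memory flat on multi-gigabyte files

print("\n".join(lines))
```

Once the data is one record per line, the usual Unix and Hadoop tooling (grep, sort, streaming jobs) applies without any XML awareness.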


Ascription, Anathema, Enthusiasm

I am definitely glad that Ben Hyde has gotten back to posting at “Ascription is Anathema to Enthusiasm”:

I think the blog is back now. Six months ago it got hacked, in a very minor way; but I was very busy at the time. When my repair attempts stumbled it went onto the back burner. Something weird about mysql character set encodings, or WordPress evolutions in that area. In short, when I exported the database and then initialized a new one with the exported data, things got wonky.

Most noticeably sequences between sentences, but other things as well.

One way to address the very busy problem is to lose your job (yes I’ve checked under the bed – thank God it’s not there!).

Sorry about the job, but glad to see an old grizzled Lisp veteran bringing his perspective again.


Wallowing In Data

Matthew Hurst brings an interesting perspective to the intersection of agile development and data products:

The product of a data team in the context of a product like local search is somewhat specialized within the broader scope of ‘big data’. Data is our product (we create a model of a specific part of the real world - those places where you can perform on-site transactions), and we leverage large scale data assets to make that data product better.

The agile framework uses the limited time horizon (the ‘sprint’ or ‘iteration’) to ensure that unknowns are reduced appropriately and that real work is done in a manner aligned with what the customer wants. …

In a number of projects where we are being agile, we have modified the framework with a couple of new elements.

Love the concept of “the data wallow”, which is scheduled team time to deeply dig into the collected data. I’d be interested to hear about specific activities of the wallow and how to make that time productive.


Emacs Custom Shell Config

Another great tip, that I wasn’t aware of, from Emacs Redux:

Emacs sends the new shell the contents of the file ~/.emacs_shellname as input, if it exists, where shellname is the name of your shell - bash, zsh, etc. For example, if you use bash, the file sent to it is ~/.emacs_bash. If this file is not found, Emacs tries with ~/.emacs.d/init_shellname.sh.


Vega and Vincent

Link parkin’: Vincent

The folks at Trifacta are making it easy to build visualizations on top of D3 with Vega. Vincent makes it easy to build Vega with Python.


venv-0

Apparently, I’ve been bootstrapping virtualenv incorrectly. Eli Bendersky does it quite elegantly:

I had to install some packages (Sphinx and related tools) on a new machine into a virtualenv. But the machine only had a basic Python installation, without setuptools or distribute, and without virtualenv. These aren’t hard to install, but I wondered if there’s an easy way to avoid installing anything. Turns out there is.

The idea is to create a “bootstrap” virtual environment that would have all the required tools to create additional virtual environments. It turns out to be quite easy with the following script (inspired by the answer in this SO discussion):

My shtick was to install setuptools, pip, then virtualenv. Bendersky gets them all in one clean shot.


The Tachyon Filesystem

Go Bears! Link parkin’:

Tachyon

Tachyon is a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves its high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.

On April 10th, 2013, we put out a soft release of Tachyon 0.2.0 Alpha. The new version is more stable and also features significant performance improvements. It is, however, a soft release and we are working full time towards a full hard release that contains a stable version of the features that we expect will be core to Tachyon. Stay tuned.

And

Compatibility

Hadoop MapReduce and Spark can run with Tachyon without any modifications.


The Impala Inhale

Cloudera’s Impala sounds like an exciting way to query HDFS and HBase data at interactive speeds. But the installation dependencies are sort of painful, basically forcing either the use of Cloudera Manager or Cloudera’s packages for RedHat Enterprise Linux. The fun-filled requirements even include this gem:

Impala creates and uses a user and group named impala. Do not delete this account or group and do not modify the account’s or group’s permissions and rights. Ensure no existing systems obstruct the functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in a white-list, add these accounts to the list of permitted accounts.

So now the user account space gets littered, along with a bunch of other config files scattered across the filesystem. Yick!

Makes Shark look a lot more attractive for a small scale applied research project. I think you can do Shark, and Spark, pretty much from tarballs and at user level. And avoiding that enterprisey inhale is a good way to reduce complexity.


Wiz Finale

A bit out of gas tonight, but couldn’t let the end of the Wizards season go past without at least some mention. Maybe a longer post to come.

I don’t know if it was a good season for the Wiz, but it was an interesting one. A horrific start, a bumpy middle with some great highlights, and back to the losing to close out the season. Injuries were the big story of the season, followed by glimpses of potential from a backcourt of Wall and Beal.

Next season looks promising, but there’s some trepidation. The GM? Well, he’s still the GM. Made some good moves this past year, but the past choices still glare (read: Jan Vesely).

Props though to Randy Wittman for a good coaching job this year, under tough circumstances.


The EPL on NBC

I got stuck in my car for the commute this afternoon, and wound up catching a few segments with the local sports radio yakker. For one chunk, they had this guy Richard Deitsch talking about some big announcement NBC recently had about the English (err Barclay’s) Premier League.

What the? NBC? EPL? How did I miss that?

Turns out NBC Universal scooped Fox and ESPN for the US broadcast rights back in October and I flat out missed it. Then yesterday NBC showcased how they’re going to broadcast all of the games live, albeit across all platforms including the Internet. Hey, I get it. With exactly 380 matches, you gotta find a place for those QPR-Wigan tilts.

This is interesting from a number of angles. First, as one who follows international football on the downlow, looks like I’ll be getting comprehensive coverage with a more engaged broadcast partner. The Fox/ESPN combo was doing okay, but ESPN never quite seemed committed despite getting some top notch matches.

Second, someone actually outbid ESPN for a significant sports property. And here I was thinking they’d make a big push into soccer because it seems to me they’re in need of a programming hit. I don’t mean to slight women’s basketball, but the audience for high-school all-star games is probably tiny. And you can only have so many screaming head shows. Plus, ESPN did an admirable job with coverage of the last World Cup. Besides, a lot of rich people around the world are really into this sport and they probably haven’t hit the point of ESPN overload we here in the States are experiencing.

Finally, the end result will only be as good as the quality of the announcers, analysts, and production. I’m not too worried, as NBC takes the MLS seriously and looks like they’re attracting legitimate football announcers respected in the UK.

So I’m looking forward to August 2013 and the start of the next Premiership campaign, especially seeing as how Manchester United has all but tied off this one.

Now what’s the deal with the Bundesliga?


“The Human Division”, Done

Over the weekend I finished John Scalzi’s The Human Division in its serial form. In general I enjoyed it, although my complaints about it being quite talky still hold. Actually my biggest complaint is that Amazon couldn’t collapse my one click purchases, so my credit card statement is littered with a lot of noise. There were also quite a few loose threads.

If there’s one standout episode for me, it was This Must Be The Place. For some reason I really empathized with Hart Schmidt even though my family situation in no way resembled his. Maybe I’m misremembering the prior three Old Man’s War books, but there seems to be a much higher level of character development.


Unconscious Therapy

From 5 Magazine, Chicago’s own House Music chronicle, comes news of a newly released documentary

Centuries (okay, a few years) ago, Steven Harnell began recording footage for something tentatively given the very tentative title of An Untitled Documentary About House Music. A few people asked me about it, but the project seemed to be sliding into oblivion – until a couple of weeks ago, the site hadn’t been updated since around 2008.

So it was with some surprise that I realized that the Untitled Doc and Unconscious Therapy, a documentary debuting at the Chicago International Movies & Music Festival this Saturday, were one and the same.

Well gosh darn if dreams sometimes don’t come (mostly) true. Still need to see if it’s any good, and requires some sort of DC release, but the film is past a hurdle 90% of film projects don’t cross. So good on ’ya Steve Harnell.

And today all of us could use a little unconscious therapy. As he goes to put on a Mark Farina mix.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.