
Slowing Down

This post is one of a pair with 540. And thanks to Hokey Smokes, there’s even a little extra spice: I finally get a link, and then I’m going to put on the brakes!

Since today is my birthday, I try to reflect on things I can readily change up to stay out of unhealthy ruts or just to keep myself fresh. 540 days in a row is more than enough to prove that I can keep a posting streak alive. The conjunction of birthday, nice round number, and national holiday makes for more than auspicious timing to give up that streak. Plus, I’ve been at this blogging thing, off and on, for well over 10 years. (Remember when it was all about “social software”?)

Even though I disagree a bit with the post as a whole, Greg Linden recently captured a bit of where I’m at:

I find my blogging here to be too useful to me to stop doing it. I have also embraced microblogging in its many forms. Yet I am left wondering if there is something we are all missing, something shorter than blogging and longer than tweets and different than both, that would encourage thoughtful, useful, relevant mass communication.

We are still far from ideal. A few years ago, it used to be that millions of blog and press articles flew past, some of which might pile up in an RSS reader, a few of which might get read. Now, millions of tweets, thousands of Facebook posts, and millions of articles fly past, some of which might be seen in an app, a few of which might get read. Attention is random; being seen is luck of the draw. We are far from ideal.

I don’t think blogging is dead. I’m not sure blogging was always about journalism. And I personally haven’t embraced microblogging, although Twitter makes for a great link stream. But blogging is too useful, and fun!, for me to stop cold. I will, however, be slowing down a bit. Might be a couple of posts a week, but probably no less than once per week. That said, I’ll feel no obligation to any given frequency. So if you’ve been using this blog as your daily hit of excitement, I thank you for your attention, but encourage you to add another source or two as a replacement.

And even though the posting streak wasn’t particularly onerous in terms of time, I’ll be trying to turn the same habits of mind to side projects involving coding and data analysis. This also means my content should trend toward more technical topics, but we’ll see. With this year’s #3 pick in the NBA draft, the Washington Wizards’ luck is looking up, so they might be even more interesting to talk about in 2013-14. I also have this half-assed idea to do a series of 10 REM posts, reminiscing on 10 years of blogging by trawling through the archives of Mass Programming Resistance, New Media Hack, and out into the wider web.

Be seeing you!

P.S. Feels like Greg has another start-up thread within him even though there’s already a clear direction for Geeky Ventures!


540

Check this out

Enthought Python Distribution -- www.enthought.com
Version: 7.3-2 (32-bit)

Python 2.7.3 |EPD 7.3-2 (32-bit)| (default, Apr 12 2012, 11:28:34) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "credits", "demo" or "enthought" for more information.
>>> import datetime
>>> datetime.date(2013,05,26) - datetime.date(2011,12,3)
datetime.timedelta(540)
>>>

That’s my way of saying I’ve posted for 540 days straight. Also on the order of 599 out of 600. Yeah me!
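If you replay that transcript on a modern interpreter, note that Python 3 rejects integer literals with leading zeros, so the 05 above is a SyntaxError there. A Python 3 equivalent of the streak math:

import datetime

# Python 3 forbids leading zeros in decimal literals,
# so write the month as plain 5 rather than 05
streak = datetime.date(2013, 5, 26) - datetime.date(2011, 12, 3)
print(streak.days)  # 540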

In fall of 2011, just on a lark and as an experiment in behavior modification, I set a goal of posting for 365 days straight. I slipped up, got back on the horse and never looked back. Mission accomplished.

Got a few other things to say, so more after the break:

It’s fairly amazing some of the things I’ve managed to push past:

  • Holidays and holiday travel
  • “Vacations”
  • Crunch times at work
  • Work travel
  • Long days of work + kid + social
  • Personal illness
  • Oral surgery
  • When my wife was in the emergency room after a car accident
  • When I was in the emergency room after passing out from dehydration
  • Days when I thought I had nothing to say
  • Days when I really didn’t want to say anything
  • Days when the world had more important things going on

I learned a lot on this road. Every day I tried to push out some content at least of interest to me, and maybe to someone else. If you’re not in it for the money or the fame, that’s the best you can do. Getting worked up about some fictitious “audience” doesn’t do you any good.

A wide, but not overwhelming, variety of source feeds is necessary. Particular tools don’t make much difference, although posting from the iPhone came in handy a few times. I occasionally did the prepared-in-advance post thing, but 90% of them were fresh baked that day. Queueing ’em up always seemed a bit like cheating.

One final, interesting observation about a time-oriented goal like this one is that you can’t make it go any slower or faster. No way to speed it up and get it over with quicker. No way to put it on pause just because you’ve got an issue in your life. It’s all about grinding out the clock. But in a twist, you start to crave the daily achievement. You wake up thinking about what today’s post should be. And you get a bit fidgety late in the evening if you haven’t yet met the deadline. It becomes an addiction, but in this case a good one.


Hokey Smokes!

Yowsa! I actually got a link-out from dangerousmeta! I’m showing my blogging age here, but I’ve noted in the past my admiration for the site. Meanwhile, I don’t do any audience tracking or visit analytics at all for MPR. Pretty much have no idea who’s actually reading this stuff, if anyone. So it’s one of those old time, early ’oughts (yup, I go back that far) thrills to see the site title pop up in another feed.


Essential Formulae

Evan Miller’s statistical material for programmers might come in handy:

As my modest contribution to developer-kind, I’ve collected together the statistical formulas that I find to be most useful; this page presents them all in one place, a sort of statistical cheat-sheet for the practicing programmer.

Most of these formulas can be found in Wikipedia, but others are buried in journal articles or in professors’ web pages. They are all classical (not Bayesian), and to motivate them I have added concise commentary. I’ve also added links and references, so that even if you’re unfamiliar with the underlying concepts, you can go out and learn more. Wearing a red cape is optional.
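To give a flavor of what’s collected there, here’s one classical workhorse of the genre, a normal-approximation (Wald) confidence interval for a binomial proportion. This is my own sketch, not code lifted from Miller’s page:

import math

def proportion_ci(successes, trials, z=1.96):
    # z=1.96 gives ~95% coverage under the normal approximation;
    # for small samples or extreme proportions prefer the Wilson interval
    p = successes / float(trials)
    se = math.sqrt(p * (1 - p) / trials)
    return p - z * se, p + z * se

# e.g. 47 conversions out of 312 trials (made-up numbers)
low, high = proportion_ci(47, 312)
print("%.3f .. %.3f" % (low, high))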


Deep Into Partitions

Network partitions, that is, and their implications for some common, popular, open source datastores. Kyle Kingsbury has cooked up “Call Me Maybe”:

This article is part of Jepsen, a series on network partitions. We’re going to learn about distributed consensus, discuss the CAP theorem’s implications, and demonstrate how different databases behave under partition.

In-depth technical content on the Web. Who knew! You have been warned.


CommaFeed As Backup Plan

I was all set to put CommaFeed on the list of potential GReader replacements after seeing a mention coming across the MetaFilter feed. Then I started reading the MeFi comments and this one from Rhaomi really hit home:

It’s not just the interface and UI, which is pretty easy to clone. It’s the staggering infrastructure that powers it — the sophisticated search crawlers scouring the web and delivering near-real-time updates, the industrial-scale server farms that store untold petabytes of searchable text and images relevant to you (much of it from long-vanished sources), the ubiquitous Google name that makes the service a popular platform for innumerable third-party apps, scripts, and extensions.

It’s possible to code up something that looks and feels a lot like Reader in three months, with the same view types and shortcuts. But to replicate its core functionality — fast updates, archive search, stability, universal access, wide interoperability — takes Google-scale engineering I doubt anybody short of Microsoft/Yahoo can emulate. It was very nearly a public service, and it’s going to be frustrating trying to downsize expectations for such a core web service to what a startup — even a subscription-backed one — can accomplish.

Not to mention the current CommaFeed landing page annoyingly doesn’t have any type of “About” page, just a force funnel to registration. Hey, I like to at least be sweet talked a little before wasting a password!


NewsBlur Recco

Rafe Colburn is bullish on NewsBlur as a replacement for Google Reader, especially after the recent redesign:

The upside of NewsBlur has been that it works really well at its core purpose, fetching feeds and displaying them for the user. It also has solid native mobile clients, enabling you to keep read status in sync across devices.

That’s a good enough endorsement for me. With the clock ticking on the GReader shutdown, I’ll give NewsBlur the first crack at filling the void for me.


Cal Berkeley GraphX

C’mon Bears, cut it out. It’s getting embarrassing how much Spark-related output there has been recently. In a good way!

From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.

We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.

Need to drill in to see how GraphX stacks up against the current spate of “big data” graph toolkits, especially GraphLab. Ben Lorica reports that GraphX is oriented more toward programmer productivity than raw performance:

GraphX is a new, fault-tolerant framework that runs within Spark. Its core data structure is an immutable graph (Resilient Distributed Graph – or RDG), and GraphX programs are a sequence of transformations on RDGs (with each transformation yielding a new RDG). Transformations on RDGs can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computations). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower than those written in GraphLab/PowerGraph.


Ricon East Talks

Wow! Basho’s Ricon East conference was a little more diverse and wide-ranging than I anticipated. This was evident in Anders Pearson’s summary of the talks he attended. For example, this lede on ZooKeeper for the Skeptical Architect by Camille Fournier, VP of Technical Architecture, Rent the Runway:

Camille presented ZooKeeper from the perspective of an architect who is a ZooKeeper committer, has done large deployments of it at her previous employer (Goldman Sachs), left to start her own company, and that company doesn’t use ZooKeeper. In other words, taking a very balanced engineering view of what ZooKeeper is appropriate for and where you might not want to use it.

Of the talks Pearson summarized, only two were by Basho employees, while the rest were by some pretty serious distributed-systems folks such as Margo Seltzer and Theo Schlossnagle. Plus there was a healthy dose of industry war-story experience at scale.

Good on Basho!

Via John Daily


HDFS Gets Snakebitten

(Embedded tweet, link rotted: https://twitter.com/pypi/status/335412456396558336)

Another good find from the PyPi Twitter stream. Had to do a quick Google search to get the real details on snakebite, a pure Python library for interacting with Hadoop’s HDFS:

Another annoyance we had with Hadoop (and in particular HDFS) is that interacting with it is quite slow. For example, when you run hadoop fs -ls /, a Java virtual machine is started, a lot of Hadoop JARs are loaded and the communication with the NameNode is done, before displaying the result. This takes at least a couple of seconds and can become slightly annoying. This gets even worse when you do a lot of existence checks on HDFS; something we do a lot with luigi, to see if the output of a job exists.

So, to circumvent slow interaction with HDFS and having a native solution for Python, we’ve created Snakebite, a pure Python HDFS client that only uses Protocol Buffers to communicate with HDFS. And since this might be interesting for others, we decided to Open Source it at http://github.com/spotify/snakebite.

Roger that on the annoyingly slow response of hadoop fs. Thanks, Spotify.
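From a skim of the README, usage looks roughly like the following. Treat it as a sketch: the NameNode host, port, and paths are placeholders, not anything from the Spotify announcement.

from snakebite.client import Client

# talk straight to the NameNode over protobuf RPC; no JVM spin-up
client = Client('namenode.example.com', 8020)

# ls() takes a list of paths and yields a dict per entry
for entry in client.ls(['/user/hadoop']):
    print(entry['path'])

# the cheap existence check that motivated Spotify's luigi use case
print(client.test('/user/hadoop/output/_SUCCESS', exists=True))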


Jepp, CPython and Java

TIL about Jepp:

Jepp embeds CPython in Java. It is safe to use in a heavily threaded environment, it is quite fast and its stability is a main feature and goal.

Could be handy for cutting down performance overhead at some points in the Hadoop stack where Python and Java come together. I’m looking at you, Hadoop Streaming. Also for helping Python out with the myriad serialization formats that Java does oh so well.

Via Morten Petersen


Praising Data Engineering

Metamarkets’ M. E. Driscoll gives a shout out to those mucking about with the bits:

A stark but recurring reality in the business world is this: when it comes to working with data, statistics and mathematics are rarely the rate-limiting elements in moving the needle of value. Most firms’ unwashed masses of data sit far lower on Maslow’s hierarchy at the level of basic nurture and shelter. What is needed for this data isn’t philosophy, religion, or science — what’s needed is basic, scalable infrastructure.

The more data analysis I do, the more plain ol’ wrestling with the data becomes critical. And figuring out the plumbing and tools to make that happen becomes more interesting.

Via Rafe Colburn


EC2 Instance Primer

Amazon EC2 is a great service, but sometimes it’s hard to keep track of all the virtual machine types on offer. Jeff Barr put together a handy, comprehensive backgrounder on Amazon EC2 instance families and types:

Over the past six or seven years I have had the opportunity to see customers of all sizes use Amazon EC2 to power their applications, including high traffic web sites, Genome analysis platforms, and SAP applications. I have learned that the developers of the most successful applications and services use a rigorous performance testing and optimization process to choose the right instance type(s) for their application.

In order to help you to do this for your own applications, I’d like to review some important EC2 concepts and then take a look at each of the instance types that make up the EC2 instance family.

Even better, he covers the intended use cases for each family and their designed performance tradeoffs. Keep it in your back pocket if you’re an EC2 hacker.


GraphLab Inc.

I’ve mentioned GraphLab before and have been toying with it since before its 1.0 release. Now the stakes have been raised with a de-cloaking and a heap of venture capital. Good luck to Professor Guestrin and crew.


Truer Words

(Embedded tweet, link rotted: https://twitter.com/UnlikelyWorlds/status/334384901547757568)

Truer words were never spoken of your humble narrator. Would that he could get his outer pedant under control.


Python XML Processing

The Discogs.com data is in some humongous XML files, which is a little unruly for many data hacking tasks. Python has some great XML processing modules, but it’s always good to have a little guidance. Enter this oldie but goodie from Eli Bendersky on Processing XML in Python with ElementTree:

As I mentioned in the beginning of this article, XML documents tend to get huge and libraries that read them wholly into memory may have a problem when parsing such documents is required. This is one of the reasons to use the SAX API as an alternative to DOM.

We’ve just learned how to use ET to easily read XML into an in-memory tree and manipulate it. But doesn’t it suffer from the same memory hogging problem as DOM when parsing huge documents? Yes, it does. This is why the package provides a special tool for SAX-like, on the fly parsing of XML. This tool is iterparse.

I will now use a complete example to demonstrate both how iterparse may be used, and also measure how it fares against standard tree parsing.

If I were going to update Bendersky’s post, I wouldn’t change much, other than to mention lxml and lxml.etree, which provide high-performance streaming XML processing.
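For Discogs-scale files, the iterparse pattern from the article is the part worth committing to memory. A rough sketch, with the file name and element tag as stand-ins for whatever you’re actually chewing through:

import xml.etree.ElementTree as ET

# stream through a huge XML file without building the whole tree
count = 0
for event, elem in ET.iterparse('releases.xml', events=('end',)):
    if elem.tag == 'release':
        count += 1
        # ... pull out whatever fields you need here ...
        elem.clear()  # crucial: free each element, or memory balloons anyway
print(count)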


More Git Material

Haven’t finished working through them, but these git intros feel pretty useful. Slideshare alert if you’re allergic.

Introduction to git is by the venerable Randal Schwartz. It’s got a little dust on it, but if it’s up to typical Schwartz standards, it’s still well worth reading.

Lemi Orhan Ergin’s Git branching Model might be overly stylish, but looks like it goes into detail on merging in addition to branching.

Via Rajiv Pant.


SourceTree

Link parkin’: SourceTree, Atlassian’s desktop GUI DVCS client:

Full-powered DVCS

Say goodbye to the command line – use the full capability of Git and Mercurial in the SourceTree desktop app. Manage all your repositories, hosted or local, through SourceTree’s simple interface.


Mission Accomplished

Still checking for consistency, but it looks like I’ve nearly completed my mission of grabbing all the currently available Discogs.com data dumps. Have one more to grab and verify the checksum. Then I should be good to go. 45+ GB (compressed) to romp through.
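Verifying the checksums is easy enough to script with hashlib; a minimal sketch, with a hypothetical file name and the digest algorithm being whatever Discogs actually publishes:

import hashlib

def file_digest(path, algo='sha256', chunk_size=1 << 20):
    # hash a large file incrementally so it never sits in memory whole
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

# compare against the published digest for the dump
print(file_digest('discogs_20130501_releases.xml.gz'))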

Oddly, it looks like we’re only getting releases updated for the month of May. Curious.


Emacs Temp Files in Their Place

Really handy tip from Emacs Redux:

Auto-backup is triggered when you save a file - it will keep the old version of the file around, adding a ~ to its name. So if you saved the file foo, you’d get foo~ as well.

auto-save-mode auto-saves a file every few seconds or every few characters …

Even though I’ve never actually had any use of those backups, I still think it’s a bad idea to disable them (most backups are eventually useful). I find it much more prudent to simply get them out of sight by storing them in the OS’s tmp directory instead.

I find the biggest pain with autosave files is getting git to ignore their existence. Yeah, I can fiddle around with .gitignore files, but that never quite seems to get universally applied correctly for me. Keeping Emacs temp files out of project directories entirely makes the whole issue go away.
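For reference, the configuration in the linked post boils down to two settings, quoted here from memory, so double-check against the original:

;; store backup and auto-save files in the OS temp directory,
;; keeping foo~ and #foo# out of project trees entirely
(setq backup-directory-alist
      `((".*" . ,temporary-file-directory)))
(setq auto-save-file-name-transforms
      `((".*" ,temporary-file-directory t)))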


GLA38

Go get your latest Mark Farina podcast, NOW!


Continuous Partial Insanity

Playing off of continuous partial attention, a particularly bad patch of TV convinced me it’s just a medium for “continuous partial insanity”. Between The News, “reality shows”, the fictional programming, and the advertising, the only intent is to keep you in a state of intense emotional elation or despair. Mostly despair, since fear drives sales.

Criminy! Sports is a relative island of rationality, structure, and order.

Interestingly, a Google search for “continuous partial insanity” currently only brings up a long abandoned blog, parked on it as a tagline. Seems like an opportunity.


A Week of Google Glass

Luke Wroblewski approaches interface design and user experience in a serious fashion. So his Google Glass writeup was the first commentary I took seriously:

Almost a week ago I picked up my Glass explorer edition on Google’s campus in Mountain View. Since then I’ve put it into real-world use in a variety of places. I wore the device in three different airports, busy city streets, several restaurants, a secure federal building, and even a casino floor in Las Vegas. My goal was to try out Glass in as many different situations as possible to see how I would or could use the device.

During that time, Scott Jenson’s concise mandate of user experience came to mind a lot. As Scott puts it, “value must be greater than pain.” That is, in order for someone to use a product, it must be more valuable to them than the effort required to use it. Create enough value and pain can be high. But if you don’t create a lot of value, the pain of using something has to be really low. It’s through this lens that I can best describe Google Glass in its current state.

Definitely worth a full read, especially for the punch line.


Tell Us How You Really Feel

Like I said, I enjoy a good curmudgeonly rant. Stephen Few has not been having a good couple of months with publishers.

When I fell in love with words as a young man, I developed a respect for publishers that was born mostly of fantasy. I imagined venerable institutions filled with people of great intellect, integrity, and respect for ideas. I’m sure many people who fit this description still work for publishers, but my personal experience has mostly involved those who couldn’t think their way out of a wet paper bag and apparently have no desire to try.

Said most recent experience involves a bait and switch by Taylor & Francis (the publisher) on rights to some material Few was providing to an academic journal. The guy goes out of his way to put something together, I’m sure of high quality, and they want to reserve the right to modify his work, after they had agreed in principle to his terms.

Something similar happened to Danah Boyd, and I’m noticing a pattern. A well-intentioned journal editor from academia agrees to reasonable terms from a fellow academic. The publisher waits until the last minute to pull the okee-doke: “Well, we can’t really do that. If you don’t agree to our onerous terms we’ll have to pull your article.” If these guys didn’t have their hooks so tightly intertwined with the tenure process, this behavior would be so over.


Datastructures With Norman

(Embedded tweet, link rotted: https://twitter.com/pypi/status/301268885045379072)

Speaking of finding interesting things on @PyPi, here’s Norman:

Norman is a framework for advanced data structures in Python using a database-like approach. The range of potential applications is wide, for example in-memory databases, multi-keyed dictionaries or node graphs.

For the longest time I’ve been thinking one could transliterate prefuse into Python to enable interactive visualization programming at a high level. The critical hurdle was prefuse’s table-oriented data structures and queries. In-memory sqlite could probably do the trick, but then you’ve got to deal with serialization and deserialization of Python objects.

Norman looks like it might fit the bill better for a prefuse knockoff.
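For comparison, the in-memory sqlite route really is only a few lines, though the serialization question shows up as soon as column values get more interesting than strings and numbers:

import sqlite3

# an in-memory table standing in for a prefuse-style data table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE nodes (id INTEGER PRIMARY KEY, label TEXT, weight REAL)')
conn.executemany('INSERT INTO nodes (label, weight) VALUES (?, ?)',
                 [('a', 1.0), ('b', 2.5), ('c', 0.7)])

# query-driven access, prefuse style
for (label,) in conn.execute('SELECT label FROM nodes WHERE weight > ?', (0.9,)):
    print(label)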


So That Explains It

I follow @PyPi on Twitter, which just streams Python package announcements. It’s a cheap way to get exposure to new and interesting modules. But every day it seems like there are a couple of newly minted 0.1 packages for “printing nested lists”. Curious, but not worth investigating.

Curiosity satisfied thanks to the Python Reddit, another useful place for Python developments:

They’re generated by people following along an example in the book Head First Python.

The book’s author has amended the lesson (through errata and next edition I guess) to point learners at testpypi.python.org (which didn’t exist at the time the book was written).

I run a cleanup script that deletes them every now and then. I haven’t run it for a while… I’ll put it on my looong TODO list…


IPython mini-Book

Will definitely have to shell out for Cyrille Rossant’s Learning IPython for Interactive Computing and Data Visualization:

This book is a beginner-level introduction to IPython for interactive Python programming, high-performance numerical computing, and data visualization. It assumes nothing more than familiarity with Python. It targets developers, students, teachers, hobbyists who know Python a bit, and who want to learn IPython for the extended console, the Notebook, and for more advanced scientific applications.

Too much good e-book tech material at a good price these days.


CloudPull

From Golden Hill Software:

CloudPull seamlessly backs up your Google account to your Mac. It supports Gmail, Google Contacts, Google Calendar, Google Drive (formerly Docs), and Google Reader. By default, the app backs up your accounts every hour and maintains old point-in-time snapshots of your accounts for 90 days.

Emphasis mine. Gonna try this out over the weekend.


Top Casts

Although I’ve fallen off the film viewing wagon, I’m always intrigued by movies with “all-star” casts. For example, Pulp Fiction has Travolta, Jackson, Thurman, Willis, Roth, Plummer, Rhames, Walken, Buscemi, Keitel, and of course Tarantino as actors. I’ve never seriously sat down and tried to quantify what this meant, but 10 “big time” stars seems like a reasonable threshold.

Then of course, the question is what’s “big time”? And there is the sticking point.

Today I had the brilliant idea that you could, relatively easily, define “top billing” based upon IMDB movie data. If an actor is listed as, say, one of the top 5 for their gender in the credits (for a few years?), call them an All-Star. Still a little squishy, but firmer. Then you can quantitatively evaluate each film, rank, and decide.

Interesting challenge, and I wonder how it could apply to major league sports teams?
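To make the squishiness concrete, the scoring could start as dumb as this; the threshold, names, and toy data are all hypothetical stand-ins for real IMDB credits:

# film -> ordered cast list, top billing first
credits = {
    'Pulp Fiction': ['John Travolta', 'Samuel L. Jackson', 'Uma Thurman',
                     'Bruce Willis', 'Tim Roth', 'Harvey Keitel'],
}

# pretend this set was derived from top-5 billings over several years
all_stars = {'John Travolta', 'Samuel L. Jackson', 'Uma Thurman',
             'Bruce Willis', 'Harvey Keitel'}

def star_power(film):
    # count credited actors who qualify as all-stars
    return sum(1 for actor in credits[film] if actor in all_stars)

for film in sorted(credits, key=star_power, reverse=True):
    print('%s: %d' % (film, star_power(film)))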


Making Data Progress

Slowly making headway downloading the Discogs data dumps. Got 19 complete months in hand. Now into the era of no masters files and release files less than 1GB. Current total storage is roughly 29GB.

Looking forward to some serious data hacking.


The Next Tachyon

@bigdata goes deeper into Tachyon:

A release slated for the summer will include features that enable data sharing (users will be able to do memory-speed writes to Tachyon). With Tachyon, Spark users will have, for the first time, a high-throughput way of reliably sharing files with other users. Moreover, despite being an external storage system Tachyon is comparable to Spark’s internal cache. Throughput tests on a cluster showed that Tachyon can read 200x and write 300x faster than HDFS. (Tachyon can read and write 30x faster than FDS’ reported throughput.)

Similar to the resilient distributed datasets (RDD) fundamental within Spark, fault-tolerance in Tachyon also relies on the concept of lineage – logging the transformations used to build a dataset, and using those logs to rebuild datasets when needed. Additionally, as an external storage system, Tachyon also keeps track of the binary programs used to generate datasets, and the input datasets required by those programs.

Terabyte scale analytics at interactive speeds. Coming soon to a laptop near you.


Why Tuples?

Steve Holden, who knows a thing or two about Python, gives his explanation of the existence of the tuple datatype in the programming language:

And that, best beloved, is what tuples are for: they are ordered collections of objects, and each of the objects has, according to its position, a specific meaning (sometimes referred to as its semantics). If no behaviors are required then a tuple is “about the simplest thing that could work.”

Has some good insights, but I think tuple immutability and hashability are vastly undersold.
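The undersold part fits in a snippet: because tuples are immutable they are hashable, which is exactly what lets a positional record serve as a dictionary key or set member.

# positional semantics plus hashability: (lat, lon) as a dict key
capitals = {(38.90, -77.04): 'Washington, DC'}
print(capitals[(38.90, -77.04)])

# the mutable equivalent can't play this role
try:
    {[38.90, -77.04]: 'nope'}
except TypeError as e:
    print(e)  # unhashable type: 'list'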


Scoffing

I’m back in Philadelphia. Hotel Wi-Fi, I scoff at you again with your $12 (!!) a night charge. With Verizon and AT&T’s LTE on my side, I surf without fear. Unlike last time, didn’t even have to leave the room.


Scaling Mining Analytics

Just a quick scan of Jimmy Lin’s paper (PDF Warning) hints that there are some useful insights regarding logging at scale, which is currently an interest of mine:

A little about our backgrounds: The first author is an Associate Professor at the University of Maryland who spent an extended sabbatical from 2010 to 2012 at Twitter, primarily working on relevance algorithms and analytics infrastructure. The second author joined Twitter in early 2010 and was first a tech lead, then the engineering manager of the analytics infrastructure team. Together, we hope to provide a blend of the academic and industrial perspectives—a bit of ivory tower musings mixed with “in the trenches” practical advice. Although this paper describes the path we have taken at Twitter and is only one case study, we believe our recommendations align with industry consensus on how to approach a particular set of big data challenges.


Safari To Go

TIL there’s an iPad App for O’Reilly’s Safari Online library of books:

Now available for iOS and Android devices. Safari To Go is available for free and delivers full access to thousands of technology, digital media, business and personal development books and training videos from more than 100 of the world’s most trusted publishers. Search, navigate and organize your content on any WiFi or 3G/4G connection. Plus, cache up to three books to your offline bookbag to read when you can’t connect!

Works great for me since my employer provides Safari accounts!


Vincent Pandas

Two great tastes that taste great together:

The Pandas Time Series/Date tools and Vega visualizations are a great match; Pandas does the heavy lifting of manipulating the data, and the Vega backend creates nicely formatted axes and plots. Vincent is the glue that makes the two play nice, and provides a number of conveniences for making plot building simple.

Useful examples ensue.
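The pandas half of the pairing looks something like the sketch below, on fake data; I’ve left out the Vincent call itself since I haven’t dug into its API yet, and the resample spelling varies by pandas version:

import numpy as np
import pandas as pd

# a year of fake daily observations
idx = pd.date_range('2013-01-01', periods=365, freq='D')
series = pd.Series(np.random.randn(365).cumsum(), index=idx)

# pandas does the heavy lifting: daily noise down to monthly means
monthly = series.resample('M').mean()
print(monthly.head())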


Fun With Discogs Data

I’ve decided to try to pick up a “datadidact” habit by regularly working with a large dataset. Even if it’s doing lowly basic characterization, this should force me to hone various skills and brush up on some basic knowledge.

Having spoken before of the Discogs.com dataset, their repository would appear to be a treasure trove, and it’s completely unrelated to anything at work to boot. Thought siccing wget on http://www.discogs.com/data/ would be a no-brainer and a quick start. Except it’s blocked to crawlers based upon their robots.txt. ’Twould be nice if the HTTP response for the data URL was more informative than a 500 error, but I can understand where Discogs is coming from.

However, I WILL NOT BE DENIED! Just have to do it tediously by hand through my browser. So be it. The longitudinal analysis possibilities are too intriguing.

Already have some initial data in hand. Not looking forward to dealing with 11GB XML files, though. First item might be to convert the data into a record/line-oriented format.


Ascription, Anathema, Enthusiasm

I am definitely glad that Ben Hyde has gotten back to posting at “Ascription is Anathema to Enthusiasm”:

I think the blog is back now. Six months ago it got hacked, in a very minor way; but I was very busy at the time. When my repair attempts stumbled it went onto the back burner. Something weird about mysql character set encodings, or WordPress evolutions in that area. In short, when I exported the database and then initialized a new one with the exported data, things got wonky.

Most noticeably sequences between sentences, but other things as well.

One way to address the very busy problem is to lose your job (yes I’ve checked under the bed – thank God it’s not there!).

Sorry about the job, but glad to see an old grizzled Lisp veteran bringing his perspective again.


Wallowing In Data

Matthew Hurst brings an interesting perspective to the intersection of agile development and data products:

The product of a data team in the context of a product like local search is somewhat specialized within the broader scope of ‘big data’. Data is our product (we create a model of a specific part of the real world - those places where you can perform on-site transactions), and we leverage large scale data assets to make that data product better.

The agile framework uses the limited time horizon (the ‘sprint’ or ‘iteration’) to ensure that unknowns are reduced appropriately and that real work is done in a manner aligned with what the customer wants. …

In a number of projects where we are being agile, we have modified the framework with a couple of new elements.

Love the concept of “the data wallow”, which is scheduled team time to deeply dig into the collected data. I’d be interested to hear about specific activities of the wallow and how to make that time productive.


Emacs Custom Shell Config

Another great tip, that I wasn’t aware of, from Emacs Redux:

Emacs sends the new shell the contents of the file ~/.emacs_shellname as input, if it exists, where shellname is the name of your shell – bash, zsh, etc. For example, if you use bash, the file sent to it is ~/.emacs_bash. If this file is not found, Emacs tries with ~/.emacs.d/init_shellname.sh.
