Scaling Mining Analytics

Posted on: Fri 26 April 2013

“Scaling Big Data Mining Infrastructure” w/ @squarecog http://t.co/Kxxcpcml9B our recent Hadoop Summit Europe talk was based on this paper.
— Jimmy Lin (@lintool) April 26, 2013

Just a quick scan of Jimmy Lin’s paper (PDF Warning) hints that there are some useful insights regarding logging at scale, which is currently an interest of mine:

A little about our backgrounds: The first author is an Associate Professor at the University of Maryland who spent an extended sabbatical from 2010 to 2012 at Twitter, primarily working on relevance algorithms and analytics infrastructure. The second author joined Twitter in early 2010 and was first a tech lead, then the engineering manager of the analytics infrastructure team. Together, we hope to provide a blend of the academic and industrial perspectives—a bit of ivory tower musings mixed with “in the trenches” practical advice. Although this paper describes the path we have taken at Twitter and is only one case study, we believe our recommendations align with industry consensus on how to approach a particular set of big data challenges.

Safari To Go

Posted on: Thu 25 April 2013

TIL there’s an iPad App for O’Reilly’s Safari Online library of books:

Now available for iOS and Android devices. Safari To Go is available for free and delivers full access to thousands of technology, digital media, business and personal development books and training videos from more than 100 of the world’s most trusted publishers. Search, navigate and organize your content on any WiFi or 3G/4G connection. Plus, cache up to three books to your offline bookbag to read when you can’t connect!

Works great for me since my employer provides Safari accounts!

Vincent Pandas

Posted on: Wed 24 April 2013

Two great tastes, that taste great together:

The Pandas Time Series/Date tools and Vega visualizations are a great match; Pandas does the heavy lifting of manipulating the data, and the Vega backend creates nicely formatted axes and plots. Vincent is the glue that makes the two play nice, and provides a number of conveniences for making plot building simple.

Useful examples ensue.

Fun With Discogs Data

Posted on: Tue 23 April 2013

I’ve decided to try and pick up a “datadidact” habit, by regularly working with a large dataset. Even if it’s doing lowly basic characterization, this should force me to hone various skills and brush up on some basic knowledge.

Having spoken before of the Discogs.com dataset, their repository would appear to be a treasure trove and is completely unrelated to anything at work to boot. Thought siccing wget on http://www.discogs.com/data/ would be a no-brainer and a quick start. Except it’s blocked to crawlers based upon their robots.txt. ’Twould be nice if the HTTP Response for the data URL actually was more informative than a 500 error, but I can understand where Discogs is coming from.

However, I WILL NOT BE DENIED! Just have to do it tediously by hand through my browser. So be it. The longitudinal analysis possibilities are too intriguing.

Already have some initial data in hand. Not looking forward to dealing with 11 GB XML files, though. First item might be to convert the data into a record/line oriented format.
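
A minimal sketch of that conversion, assuming the dump is one big root element wrapping many repeated record elements. The `release` tag name and the flat child fields are guesses on my part, not something checked against the actual Discogs schema, and `xml_to_jsonl` is a made-up helper name. The idea is to stream the file with the standard library's `iterparse` rather than load it whole, writing one JSON object per line.

```python
import io
import json
import xml.etree.ElementTree as ET


def xml_to_jsonl(source, sink, record_tag="release"):
    """Stream records from a large XML file and write one JSON object per line.

    record_tag is an assumed element name; only attributes and simple
    (leaf) child elements of each record are kept.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    count = 0
    for event, elem in context:
        if event != "end" or elem.tag != record_tag:
            continue
        record = dict(elem.attrib)
        for child in elem:
            text = (child.text or "").strip()
            if len(child) == 0 and text:
                record[child.tag] = text
        sink.write(json.dumps(record, sort_keys=True) + "\n")
        count += 1
        root.clear()  # drop finished records so memory stays roughly flat
    return count


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<releases>"
        b"<release id='1'><title>A</title><country>US</country></release>"
        b"<release id='2'><title>B</title></release>"
        b"</releases>"
    )
    out = io.StringIO()
    xml_to_jsonl(sample, out)
    print(out.getvalue(), end="")
```

Nested structures (track lists, artist credits and the like) would need their own handling; this only flattens the easy fields so that line-oriented tools have something to chew on.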

Ascription, Anathema, Enthusiasm

Posted on: Mon 22 April 2013

I am definitely glad that Ben Hyde has gotten back to posting at “Ascription is Anathema to Enthusiasm”:

I think the blog is back now. Six months ago it got hacked, in a very minor way; but I was very busy at the time. When my repair attempts stumbled it went onto the back burner. Something weird about mysql character set encodings, or word press evolution’s in that area. In short when I exported the database and then initialized a new one with the exported data things got wonky.

Most noticeably sequences between sentences, but other things as well.

One way to address the very busy problem is to lose your job (yes I’ve checked under the bed – thank God it’s not there!).

Sorry about the job, but glad to see an old grizzled Lisp veteran bringing his perspective again.

Wallowing In Data

Posted on: Sun 21 April 2013

Matthew Hurst brings an interesting perspective to the intersection of agile development and data products:

The product of a data team in the context of a product like local search is somewhat specialized within the broader scope of ‘big data’. Data is our product (we create a model of a specific part of the real world - those places where you can perform on-site transactions), and we leverage large scale data assets to make that data product better.

The agile framework uses the limited time horizon (the ‘sprint’ or ‘iteration’) to ensure that unknowns are reduced appropriately and that real work is done in a manner aligned with what the customer wants. …

In a number of projects where we are being agile, we have modified the framework with a couple of new elements.

Love the concept of “the data wallow”, which is scheduled team time to deeply dig into the collected data. I’d be interested to hear about specific activities of the wallow and how to make that time productive.

Emacs Custom Shell Config

Posted on: Sun 21 April 2013

Another great tip, that I wasn’t aware of, from Emacs Redux:

Emacs sends the new shell the contents of the file ~/.emacs_shellname as input, if it exists, where shellname is the name of your shell - bash, zsh, etc. For example, if you use bash, the file sent to it is ~/.emacs_bash. If this file is not found, Emacs tries with ~/.emacs.d/init_shellname.sh.

Vega and Vincent

Posted on: Sat 20 April 2013

Link parkin’: Vincent

The folks at Trifacta are making it easy to build visualizations on top of D3 with Vega. Vincent makes it easy to build Vega with Python.

venv-0

Posted on: Sat 20 April 2013

Apparently, I’ve been bootstrapping virtualenv incorrectly. Eli Bendersky does it quite elegantly:

I had to install some packages (Sphinx and related tools) on a new machine into a virtualenv. But the machine only had a basic Python installation, without setuptools or distribute, and without virtualenv. These aren’t hard to install, but I wondered if there’s an easy way to avoid installing anything. Turns out there is.

The idea is to create a “bootstrap” virtual environment that would have all the required tools to create additional virtual environments. It turns out to be quite easy with the following script (inspired by the answer in this SO discussion):

My shtick was to install setuptools, pip, then virtualenv. Bendersky gets them all in one clean shot.

The Tachyon Filesystem

Posted on: Sat 20 April 2013

Go Bears! Link parkin’:

Tachyon

Tachyon is a fault tolerant distributed file system enabling reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce. It achieves its high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, and enables different jobs/queries and frameworks to access cached files at memory speed. Thus, Tachyon avoids going to disk to load datasets that are frequently read.

On April 10th, 2013, we put out a soft release of Tachyon 0.2.0 Alpha. The new version is more stable and also features significant performance improvements. It is, however, a soft release and we are working full time towards a full hard release that contains a stable version of the features that we expect will be core to Tachyon. Stay tuned.

And

Compatibility

Hadoop MapReduce and Spark can run with Tachyon without any modifications.

The Impala Inhale

Posted on: Fri 19 April 2013

Cloudera’s Impala sounds like an exciting way to query HDFS and HBase data at interactive speeds. But the installation dependencies are sort of painful, basically forcing either the use of Cloudera Manager or Cloudera’s packages for RedHat Enterprise Linux. Checking out the fun-filled requirements even includes this gem:

Impala creates and uses a user and group named impala. Do not delete this account or group and do not modify the account’s or group’s permissions and rights. Ensure no existing systems obstruct the functioning of these accounts and groups. For example, if you have scripts that delete user accounts not in a white-list, add these accounts to the list of permitted accounts.

So now the user account space gets littered, along with a bunch of other config files across the filesystem. Yick!

Makes Shark look a lot more attractive from a small scale applied research project level. I think you can do Shark, and Spark, pretty much from tarballs and at user level. And avoiding that enterprisey inhale is a good way to reduce complexity.

Wiz Finale

Posted on: Thu 18 April 2013

Naw.
— Wiz Win Last Night? (@DidTheWizWin) April 18, 2013

A bit out of gas tonight, but couldn’t let the end of the Wizards season go past without at least some mention. Maybe a longer post to come.

I don’t know if it was a good season for the Wiz, but it was an interesting one. A horrific start, a bumpy middle with some great highlights, and back to the losing to close out the season. Injuries were the big story of the season, followed by glimpses of potential from a backcourt of Wall and Beal.

Next season looks promising, but there’s some trepidation. The GM? Well he’s still the GM. Made some good moves this past year, but the past choices still glare, read Jan Vesely.

Props though to Randy Wittman for a good coaching job this year, under tough circumstances.

The EPL on NBC

Posted on: Wed 17 April 2013

I got stuck in my car for the commute this afternoon, and wound up catching a few segments with the local sports radio yakker. For one chunk, they had this guy Richard Deitsch talking about some big announcement NBC recently had about the English (err Barclays) Premier League.

What the? NBC? EPL? How did I miss that?

Turns out NBC Universal scooped Fox and ESPN for the US broadcast rights back in October and I flat out missed it. Then yesterday NBC showcased how they’re going to broadcast all of the games live, albeit across all platforms including the Internet. Hey, I get it. With 380 matches, exactly, you gotta find a place for those QPR-Wigan tilts.

This is interesting from a number of angles. First, as one who’s got international football on the downlow, looks like I’ll be getting comprehensive coverage with a more engaged broadcast partner. The Fox/ESPN combo was doing okay, but ESPN never quite seemed committed despite getting some top notch matches.

Second, someone actually outbid ESPN for a significant sports property. And here I was thinking they’d make a big push into soccer because it seems to me they’re in need of a programming hit. I don’t mean to slight women’s basketball, but the audience for high-school all-star games is probably tiny. And you can only have so many screaming head shows. Plus, ESPN did an admirable job with coverage of the last World Cup. Besides a lot of rich people around the world are really into this sport and they probably haven’t hit the point of ESPN overload we here in the States are experiencing.

Finally, the end result will only be as good as the quality of the announcers, analysts, and production. I’m not too worried as NBC takes the MLS seriously and looks like they’re attracting legitimate football announcers respectable in the UK.

So I’m looking forward to August 2013 and the start of the next Premiership campaign, especially seeing as how Manchester United has all but tied off this one.

Now what’s the deal with the Bundesliga?

“The Human Division”, Done

Posted on: Tue 16 April 2013

Over the weekend I finished John Scalzi’s The Human Division in its serial form. In general I enjoyed it, although my complaints about it being quite talky still hold. Actually my biggest complaint is that Amazon couldn’t collapse my one click purchases, so my credit card statement is littered with a lot of noise. There were also quite a few loose threads.

If there’s one standout episode for me, it was This Must Be The Place. For some reason I really empathized with Hart Schmidt even though my family situation in no way resembled his. Maybe I’m misremembering the prior three Old Man’s War books, but there seems to be a much higher level of character development.

Unconscious Therapy

Posted on: Mon 15 April 2013

Unconscious Therapy, New House Music Documentary, Debuts in Chicago (Video) // http://t.co/tbgeSrjeAN
— 5 Mag Dot Net (@5Magazine) April 15, 2013

From 5 Magazine, Chicago’s own House Music chronicle, comes news of a newly released documentary:

Centuries (okay, a few years) ago, Steven Harnell began recording footage for something tentatively given the very tentative title of An Untitled Documentary About House Music. A few people asked me about it, but the project seemed to be sliding into oblivion – until a couple of weeks ago, the site hadn’t been updated since around 2008.

So it was with some surprise that I realized that the Untitled Doc and Unconscious Therapy, a documentary debuting at the Chicago International Movies & Music Festival this Saturday, were one and the same.

Well gosh darn if dreams sometimes don’t come (mostly) true. Still need to see if it’s any good, and requires some sort of DC release, but the film is past a hurdle 90% of film projects don’t cross. So good on ’ya Steve Harnell.

And today all of us could use a little unconscious therapy. As he goes to put on a Mark Farina mix.

O’Reilly’s Graph Databases

Posted on: Sun 14 April 2013

O’Reilly Media has a pre-release version of a book on graph database technology:

Graph Databases, published by O’Reilly Media, discusses the problems that are well aligned with graph databases, with examples drawn from practical, real-world use cases. This book also looks at the ecosystem of complementary technologies, highlighting what differentiates graph databases from other database technologies, both relational and NOSQL.

Graph Databases is written by Ian Robinson, Jim Webber, and Emil Eifrém, graph experts and enthusiasts at Neo Technology, creators of Neo4j, the world’s leading graph database.

I’ll probably eventually pick up a copy, but I’m waiting for it to bake a little bit. Also wondering if it’s mainly going to be a promotional for Neo4j.

Sensoring The News

Posted on: Sat 13 April 2013

An interesting trend of news orgs exploiting low cost sensors, but I really like the play on words:

When I went to the 2013 SXSW Interactive Festival to host a conversation with NPR’s Javaun Moradi about sensors, society and the media, I thought we would be talking about the future of data journalism. By the time I left the event, I’d learned that sensor journalism had long since arrived and been applied. Today, inexpensive, easy-to-use open source hardware is making it easier for media outlets to create data themselves.

GitX-dev

Posted on: Fri 12 April 2013

My git-Fu is getting better, but I’m using it enough to start thinking about using a graphical client:

GitX-dev is a fork (variant) of GitX, a long-defunct GUI for the git version-control system. It has been maintained and enhanced with productivity and friendliness oriented changes, with effort focused on making a first-class, maintainable tool for today’s active developers.

Per usual anything that helps me deal with merges more efficiently and professionally gets my +1.

Diggin’ On Kafka

Posted on: Thu 11 April 2013

Putting Apache Kafka through its paces at the office and I’m starting to like what I see after some initial confusion.

Why we built this

Kafka is a messaging system that was originally developed at LinkedIn to serve as the foundation for LinkedIn’s activity stream and operational data processing pipeline. It is now used at a variety of different companies for various data pipeline and messaging uses.

Activity stream data is a normal part of any website for reporting on usage of the site. Activity data is things like page views, information about what content was shown, searches, etc. This kind of thing is usually handled by logging the activity out to some kind of file and then periodically aggregating these files for analysis. Operational data is data about the performance of servers (CPU, IO usage, request times, service logs, etc) and a variety of different approaches to aggregating operational data are used.

In recent years, activity and operational data has become a critical part of the production features of websites, and a slightly more sophisticated set of infrastructure is needed.

I like their particular set of design choices. The client consumer side takes a little getting used to if you’re a lazy pub/sub guy like me. Kafka makes the client do a little bit more work and manage its own topic state. The upside is good to great performance, horizontal scaling, and a client can implement the message delivery semantics (at most once, at least once, exactly once) the client needs, without spoiling it for everyone else.
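
A toy sketch of why having the client manage its own position makes the delivery semantics a client-side choice. Nothing here touches the real Kafka client API: the "log" is just a Python list, and `Crash`, `consume`, and `run_with_one_crash` are made-up names for illustration. The only thing that changes between the two modes is whether the offset is saved before or after the message is handled.

```python
class Crash(Exception):
    """Stands in for the consumer process dying mid-message."""


def consume(log, state, handled, commit_first, crash_at=None):
    """Read from the saved offset to the end of the log.

    commit_first=True saves the new offset before handling (at most once);
    commit_first=False saves it after handling (at least once).
    """
    while state["offset"] < len(log):
        offset = state["offset"]
        message = log[offset]
        if commit_first:
            state["offset"] = offset + 1
            if crash_at == offset:
                raise Crash()
            handled.append(message)
        else:
            handled.append(message)
            if crash_at == offset:
                raise Crash()
            state["offset"] = offset + 1


def run_with_one_crash(log, commit_first, crash_at):
    """Crash once at the given offset, then restart from the saved offset."""
    state = {"offset": 0}  # pretend this is stored somewhere durable
    handled = []
    try:
        consume(log, state, handled, commit_first, crash_at)
    except Crash:
        pass
    consume(log, state, handled, commit_first)
    return handled


if __name__ == "__main__":
    log = ["a", "b", "c"]
    print(run_with_one_crash(log, commit_first=True, crash_at=1))   # ['a', 'c']
    print(run_with_one_crash(log, commit_first=False, crash_at=1))  # ['a', 'b', 'b', 'c']
```

Saving the offset first loses "b" on the crash; saving it afterwards replays "b". Getting to exactly once generally means recording the offset atomically with whatever the handler produces, which is the extra work each client gets to decide is or isn't worth it.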

Great to see logical message offsets in the upcoming 0.8.x releases to make life a little easier.

The Glove and The Wall

Posted on: Wed 10 April 2013

Gary “The Glove” Payton was recently elected into the Naismith Basketball Hall of Fame. I saw a clip right after the announcement and Payton said a couple of interesting things. First, he said he thought of the current crop of players, John Wall had the most potential to equal The Glove defensively. Yow!

Second, Payton said he’d actually been doing some work with Wall. Never quite know what that really means, but it’s a good sign. And I think it’s starting to show in Wall’s game.

The 1TB SSD

Posted on: Tue 09 April 2013

ArsTechnica with the overview, and AnandTech with the details on the latest increment on consumer grade price and performance for SSDs. 960 GB usable is just a bit short of 1 TB, but give it a year. Then we’ll be well over 1 TB for 30% less cost.

Just in time for my next laptop purchase ;-)

And UltraRAM (PDF warning) is just off the horizon.

Python and vSphere

Posted on: Tue 09 April 2013

Link parkin’: Patrick Dunnigan kicks the tires on a number of Python modules that encapsulate the VMWare vSphere API.

I recently had the need to manage VMware vSphere from Python code so I went about looking for examples and open source libraries. The vCenter management server has a SOAP web service that exposes most (if not all) of the administrative capabilities that you can perform on vSphere. At first thought, this seemed like a simple endeavor. However once I got into it I found it not as simple as interacting with a RESTful web service.

…

Next I went searching for Python client libraries for the vSphere SOAP web service.

Here is a summary of what I found:

Thanks Patrick, ’cuz I’m about to have a close encounter with VMWare’s vSphere. Not looking forward to a reintroduction to SOAP though. Hope these libraries really hide the hurt.

Scalzi on Higher Learning

Posted on: Tue 09 April 2013

I swear I was just about to compose a reader request to John Scalzi regarding his thoughts on being from an “elite of the elite” university and advice he’d give to his daughter in these days of MOOCs etc. From my thoughts to Scalzi’s keyboard, although I would have been more entertaining than Steve, throwing in lots of Ivy class bonhomie and all that and dropping in a “where fun goes to die” or two:

Presuming my kid has the chops to get in where she wants to go — which I find a reasonable presumption, all things concerned — what I am likely to tell her is this: I’m willing to pay for an elite private institution (think generally but not exclusively the top 25 colleges and the top 25 universities in the US) because their reputations/networks are worth the additional expense in long run. But outside of those schools, why would I pay $40,000+ for a private school when I can pay $10,000 for Ohio State or Ohio University, or only slightly more for Miami University? The value add — the reputation/network — isn’t there in almost all those cases.

I strongly agree with this assessment and keep in mind I’ve been at just about every stage of the pipe: undergrad, grad, faculty, washed up faculty. Haven’t been an educational administrator at any level but I’m not sure I’d wish that on anyone ;-) (Ha! Ha! Only serious)

And we both believe that college can still have a lot of value for a lot of people in this age. I bring this up because there seems to be this virulent dismissive current in the tech and business communities that “higher learning is completely busted, worthless, and mega-disruptive entrepreneurialism is gonna save the day”. I get that there’s a lot of overpriced product out there, but if I see one more link about how a PhD is meaningless, I’m gonna barf. Even if it’s in Critical Literature, just because you can’t reap some huge financial windfall doesn’t mean the result has no value. If there’s a true research oriented dissertation, then at least the ball of humanity’s knowledge has been pushed forward the tiniest bit.

A Week With Feedly

Posted on: Mon 08 April 2013

I’m starting to keep an eye out for good GReader diaspora stories. Evan Dashevsky spent a week with Feedly:

To that end, I decided to try my luck with Feedly, a service that has quickly grown to prominence as the go-to replacement for Reader. After a week of test-driving the service in its Web and app versions, I have found it to be a serviceable—and in many ways, superior—replacement for my soon-to-expire Reader.

The one thing I’m looking for is the cross-client synching capabilities.

Emacs Redux Redux

Posted on: Mon 08 April 2013

Okay, Emacs Redux has been better than I expected, with lots of little tips that can greatly improve one’s Emacs utility, such as:

I’m fond of seeing somewhere the full path to the file I’m currently editing (as opposed to just the file name displayed in modeline). Emacs’s frame title seems like a good place to display such information, since by default it doesn’t show anything interesting. Here’s how we can achieve this:

and:

Emacs does not have a command backward-kill-line (which would kill the text from the point to the beginning of the line), but it doesn’t really need one anyways. Why so? Simple enough - invoking kill-line with a prefix argument 0 does exactly the same thing!

At least one good nugget per day.

I

Posted on: Sun 07 April 2013

The situation.

Disk A. 95% full with Postgres data.

Disk B. 75% full with archival data.

What to do about Disk A?

Move data from Disk B to another drive, getting it down to 20% full. Use Postgres’ CREATE TABLESPACE data definition statement to make a new tablespace on Disk B. Run a series of ALTER TABLE and ALTER INDEX statements to safely move half of the Postgres data to Disk B.
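
For the record, a rough sketch of what that series of statements looks like, generated from Python so the list of relations can be edited in one place. The tablespace name, directory, and table and index names are all placeholders, not the ones I actually used, and `tablespace_move_sql` is just a throwaway helper that assumes plain identifiers needing no quoting.

```python
def tablespace_move_sql(tablespace, location, tables, indexes):
    """Build the SQL to create a tablespace and move relations onto it.

    All names passed in are placeholders; they are interpolated as-is,
    so this assumes simple identifiers and a trusted caller.
    """
    statements = [f"CREATE TABLESPACE {tablespace} LOCATION '{location}';"]
    # Moving a table does not move its indexes, so each index gets its own statement.
    statements += [f"ALTER TABLE {name} SET TABLESPACE {tablespace};" for name in tables]
    statements += [f"ALTER INDEX {name} SET TABLESPACE {tablespace};" for name in indexes]
    return statements


if __name__ == "__main__":
    for statement in tablespace_move_sql(
        tablespace="disk_b",                # hypothetical tablespace name
        location="/mnt/disk_b/pgdata",      # hypothetical directory on Disk B
        tables=["events", "sessions"],      # hypothetical tables
        indexes=["events_pkey", "sessions_user_idx"],  # hypothetical indexes
    ):
        print(statement)
```

Each SET TABLESPACE physically copies the relation and locks it while it runs, so the big tables are best moved one at a time during a quiet period.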

Now both drives are at about 60% full and I’m a happy camper with plenty of disk headroom going forward. Bonus, both A and B are SSDs, so with an hour or so of effort, all my data stays fast and my queries might even get a little boost from disk parallelism to boot.

JZ Spoiled By SSD

Posted on: Sun 07 April 2013

JZ would be one Jeremy Zawodny, engineer at Craigslist.

But this particular task involves slurping ALL the data out of that cluster and onto a cluster of sharded Sphinx servers so I can re-index the roughly 3 billion documents. That’s all well and good, but since our MongoDB cluster isn’t terribly performance sensitive, it is built on old-fashioned (am I allowed to use that phrase?) spinning disks. And you know what that means, right?

Yeah, seek time matters. A lot.

If this was hitting our production MySQL clusters, I wouldn’t care nearly as much. Those all use one flavor or another of flash storage. In fact, we’ve been using SSDs long enough and in enough places that I’m spoiled at this point. I sort of cringe every time I have to deal with disk seeks. That’s so five years ago.

Zawodny does this at real scale for real money. So if Jeremy’s spoiled by SSD performance, the rest of us can just follow the trend.

Artisanal War

Posted on: Sat 06 April 2013

In tweets from @GreatDismal

First the digital came for the fighter pilot, then for the sniper. #artisanalwarfighting
— William Gibson (@GreatDismal) April 5, 2013

Bespoke sniping: You've a man in Mayfair who does it for you, the old-fashioned way. Let governments use their tacky automatons.
— William Gibson (@GreatDismal) April 5, 2013

Slouching towards Armageddon, you might say.

A MarsEdit Update

Posted on: Sat 06 April 2013

Daniel Jalkut has released a new version of MarsEdit, the tool I’m using at this very moment to post with:

MarsEdit 3.5.9 is now available. This is a free update for licensed MarsEdit customers. The update will be submitted to the Mac App Store today and will be available there when Apple approves the update.

This is quite a significant update, in spite of it being entirely composed of “bug fixes.” I’m still working a major update to MarsEdit that will accommodate the Mac App Store’s sandboxing requirements. Until that is ready, I’ll keep fixing bugs in the app but will not be able to add significant features.

I’m grateful to Red Sweater software for picking up and maintaining MarsEdit. It’s been in my toolbox for years and is coming back into prominence now that I’m using Ye Olde MacBook more. If you’ve noticed more original text here, it’s partially because I’m back to using a keyboard and MarsEdit.

Avoiding Facebook: Part N

Posted on: Fri 05 April 2013

Where N is quite large. From Sophos’ NakedSecurity blog:

Even if you have rejected particular apps from connecting from your Facebook profile, you have no control over what apps your friends and family have chosen to connect to their profiles.

Your friends and family may not be being as cautious as you are about Facebook apps - and you may not realise that when other Facebook users choose to install apps they can then share the information they can see about you with those apps.

Facebook argues that allowing other people to share your info with third-party apps makes the “experience better and more social”. Your opinion may vary from theirs, however.

Yes, my opinion definitely does. I’d normally just look the other way, but Facebook has become increasingly disturbing. He says with a surly frown, looking askance at Quora and other Facebook login miscreants.

Petrel

Posted on: Fri 05 April 2013

At work, I’m being overtaken by the Storm. But I’ll be damned if I get sucked back into Java programming without some kicking and screaming. I’ll definitely be giving Petrel a whirl:

Tools for writing, submitting, debugging, and monitoring Storm topologies in pure Python.

…

Petrel offers some important improvements over the storm.py module provided with Storm:

Topologies are implemented in 100% Python

Petrel’s packaging support automatically sets up a Python virtual environment for your topology and makes it easy to install additional Python packages.

“petrel.mock” allows testing of single components or single chains of related components.

Petrel automatically sets up logging for every spout or bolt and logs a stack trace on unhandled errors.

And just in case you’re looking for an alternative to Storm, don’t forget Yahoo!’s, now Apache’s, S4. While not as fashionable, S4 is definitely comparable.

Diggin’ On “The Human Division”

Posted on: Thu 04 April 2013

Considering how much I enjoyed Old Man’s War, you’d figure I’d be a sucker for John Scalzi’s The Human Division.

What does “not strictly a novel” mean here? Well, The Human Division is probably best described as an “episodic narrative” — it’s a collection of individual episodes that each tell a complete story, arranged chronologically, so that if you read them all in sequence you get a larger narrative arc. The closest analogy would be a season of a television show, and indeed The Human Division is arranged into thirteen “episodes,” including a double-length “pilot episode,” entitled “The ‘B’ Team.”

However, I’ve only just now started reading/buying the “episodes”. The question is, can I catch up with the release schedule? Through two and a half chapters, so far, so good, although I find it a touch heavy on the expository dialogue, enough to be uncomfortably noticeable. Also, the dialogue in Walk The Plank was a little too clean for me. If you’re traumatized from a crash landing, injured from an animal bite, and infected with an explosive microbe, I don’t expect you to string together sentences that well.

But they’re well paced with Scalzi’s good old sense of humor, and intriguing from a sci-fi perspective. So I will forge on.

Vega

Posted on: Wed 03 April 2013

Vega is a new browser-side visualization toolkit from Jeff Heer. Why?

Vega provides a higher-level visualization specification language on top of D3. By design, D3 will maintain an “expressivity advantage” and in many cases will be better suited for novel, highly interactive graphics. On the other hand, we hope that Vega will be convenient for a wide range of common yet customizable visualizations. Vega’s design builds on concepts we developed in both Protovis and D3, and is informed by multiple years of research at Stanford and our experiences building data-driven applications at Trifacta.

Still waiting for prefuse to rise from the grave and reclaim its rightful place in the visualization firmament.

Saddle Up

Posted on: Tue 02 April 2013

Pandas contributor Adam Klein has released a kindred in spirit toolkit, Saddle, for the Scala language:

Saddle is a high-performance data manipulation library for Scala.

Saddle provides array-backed, indexed, one- and two-dimensional data structures; vectorized numerical calculations; automatic data alignment; and robustness to missing values.

Scala, Pandas, Spark. Are you thinking what I’m thinking?

The Tipping Point?

Posted on: Mon 01 April 2013

I’ve been holding out on new traditional home computing gear (e.g. laptop or desktop) since I really haven’t had a good need, despite my lust for the finer things. I’ve even gone so far as to jack-up the hard drives in Ye Olde MacBook to good effect. Such good effect that it could easily be my primary personal laptop for at least another year and a half. I mean really, if MS Office 2011 runs decently on this old kit, what more can you ask for?

Except for that pesky desire to run virtual machines. I picked up a copy of Parallels Desktop in a recent MacUpdate promotion. Then I wanted to start kicking around some Hadoop virtual machines from vendors like Cloudera and Hortonworks. Guess what? You really need a free and clear 4GB of RAM to get started with these vms. My poor old machine only has 4GB total.

So Ye Olde MacBook is fine for the standard Ubuntu or CentOS distro, but the heavy duty data sciency type stuff is probably beyond its meager capabilities.

Something to chew on.

Towards Production

Posted on: Mon 01 April 2013

Another insightful gem from Rafe Colburn on taking a quick “hack” and getting the result into a production system:

Why write this up? It’s to point out that for most projects, getting something to work is just a small, small part of building a production service. Exporting a CSV file from a database query and uploading it to an FTP server takes just a few minutes. Converting that into a service that runs within the standard infrastructure, and handles failure conditions smoothly takes hours.

Echoes of the challenge of building capabilities.

Also, I’ve learned through hard won experience, don’t trust the environment cron provides your script at all. Make everything as explicit as possible. Even then it’s a difficult debugging process. Good luck!
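
One way to act on "make everything explicit", sketched with Python's standard `subprocess` module: a thin wrapper that hands the job a spelled-out environment and working directory instead of whatever cron happens to provide. The `EXPLICIT_ENV` values below are placeholders to adapt, and `run_explicit` is a made-up helper name.

```python
import subprocess
import sys

# Placeholder values; the point is that nothing is inherited from cron.
EXPLICIT_ENV = {
    "PATH": "/usr/local/bin:/usr/bin:/bin",
    "LANG": "C",
    "HOME": "/tmp",
}


def run_explicit(argv, cwd="/", env=None):
    """Run a command with a fully specified environment and working directory.

    argv should start with an absolute path so PATH lookups are not a factor.
    Raises subprocess.CalledProcessError on a non-zero exit status.
    """
    return subprocess.run(
        argv,
        cwd=cwd,
        env=dict(env if env is not None else EXPLICIT_ENV),
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    result = run_explicit(
        [sys.executable, "-c", "import os; print(os.environ['PATH'])"]
    )
    print(result.stdout, end="")
```

Capturing stdout and stderr and failing loudly on a bad exit code also gives you something concrete to log when the job misbehaves at 3 a.m.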

Hedlund on Unlearning

Posted on: Sun 31 March 2013

Good personal development bit by Marc Hedlund on leaving behind career tools that have been overtaken by events:

Advancement is not a collection of skills. Advancement is an awesome ability to adapt to a new situation. Experience is a toolbox to help you do that – but any new job, any new role, any environmental change should make you question whether the tool you know and rely on is still the right tool for the job. Think of your skills as disposable, and actively work on unlearning the ones that were right once but aren’t right now.

This also seems like a good feature, maybe even interview question, to seek in candidates to join your team. “Talk about a skill you really relied on in one position but had to give up or adapt as you moved into a new position. Why?”

python-mode.el

Posted on: Sun 31 March 2013

Sometimes going back to the basics is the best path. I just grabbed the python-mode.el 6.1.1 release from the python-mode.el launchpad site, which is maintained by some folks who use Python on a frequent basis. Fired up the mode and the current implementation seems much more consistent than some of the current new kids on the block, including Emacs 24.3’s built-in Python mode.

Even better 6.1.1 has support for virtualenvs.

emacsredux

Posted on: Sat 30 March 2013

Link parkin’: emacsredux, tagline: “Return to the Essence of Text Editing”. Subscribed.

Also of note, emacsrocks.

Titan 0.3

Posted on: Sat 30 March 2013

Keeping an eye on Titan, the Hadoop friendly graph data repository, they’ve version bumped to 0.3, with some interesting new features:

Titan 0.3.0 has been released and is ready for download. This release provides a complete performance-driven redesign of many core components. Furthermore, the primary outward facing feature is advanced indexing. The new indexing features are itemized below:

Geo: Search for elements using shape primitives within a 2D plane.

Full-text: Search elements for matching string and text properties.

Numeric range: Search for elements with numeric property values using intervals.

Edge: Edges can be indexed as well as vertices.

Related discussion over at Hacker News.

Via @gnat