
pandas 0.8

The Wes McKinney-led pandas project has just hit the 0.8 release point. I’ve been fiending for an excuse to use pandas and now an opportunity at work has popped up to do some timeseries-ish analysis. If nothing else, I’m looking forward to being able to easily generate sequences of timestamps:

New DatetimeIndex class supports both fixed frequency and irregular time series. Replaces now deprecated DateRange class
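As a rough sketch of the timestamp generation in question, assuming a reasonably current pandas where `date_range` returns a `DatetimeIndex` (the dates and values below are made up for illustration):

```python
import pandas as pd

# Fixed-frequency index: four consecutive days (dates are illustrative).
daily = pd.date_range("2012-07-01", periods=4, freq="D")

# Irregular index: arbitrary timestamps with no fixed spacing.
irregular = pd.DatetimeIndex(
    ["2012-07-01 09:30", "2012-07-03 16:00", "2012-07-10 12:15"]
)

# Either kind of index can label a time series.
series = pd.Series([1.0, 2.5, 4.0], index=irregular)
print(daily)
print(series)
```

The same index class covers both cases, which is the point of the release note quoted above.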


Crowning Champs

Now that the BCS has ended, with Big Time Football somehow looking slimier, I like what Chris Brown had to say about playoffs:

So what does a playoff give you, and why is it probably a better solution for crowning a National Champion? Let me say first that I think it would be a better system than the current BCS morass. But the advantage the playoff gives you is not anything metaphysically correct. It probably does not crown the best team. And it does not reward the best season (sorry Utah).

It merely gives you relative certitude. It’s not perfect — some clunker teams can be crowned, some historically great teams will get the relative shaft — but, before the season, during the season, and in the playoffs, everyone knows what it takes to be the champion: you must get into the playoffs, and you must win every game once you’re there. The Patriots couldn’t lobby for votes, they couldn’t say that they got jerked around, and they even couldn’t say that they didn’t get their chance. They played and they lost. They were probably better, they might only have had a bad day, but hey, you knew what you were getting into.

Emphasis mine. Everyone knows what’s going to be on the exam. Either you pass or you don’t and your answers are fully visible to everybody. No lyin’, no cryin’. Perfect? No. But it’s better than what we had.

My only suggestion is to go for broke and have a twelve team playoff. Six automatic spots for conference winners. Six at large bids. Top four get a bye. Could be done in a month. You could even have a four game New Year’s Day bowl fest. Everybody wins!!


csvkit

How have I gone so long without knowing about Python’s csvkit?

csvkit is a suite of utilities for converting to and working with CSV, the king of tabular file formats. …

csvkit is to tabular data what the standard Unix text processing suite (grep, sed, cut, sort) is to text. As such, csvkit adheres to the Unix philosophy.

csvkit usefully replaces the built-in Python csv module (pretty useful in and of itself) and also provides a really nice set of command line utilities for creating, slicing, and dicing CSV files.
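For a feel of the kind of slicing involved, here is a minimal sketch of selecting columns by name using only the standard library csv module. It is a stand-in for what a column-cutting tool does, not csvkit’s own code; `select_columns` and the sample data are hypothetical.

```python
import csv
import io


def select_columns(csv_text, columns):
    """Return CSV text containing only the named columns, in the given order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    # extrasaction="ignore" drops any columns that were not asked for.
    writer = csv.DictWriter(
        out, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


sample = "name,team,wins\nAda,Red,3\nBob,Blue,5\n"
print(select_columns(sample, ["name", "wins"]))
```

That is a fair amount of ceremony for one operation, which is roughly the argument for having a command line suite do it instead.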


A Book I’d Buy

The Scientific Practice of Large Scale Data Analytics

At work I’m seeing too many people getting a Heap ’O Data (TM), and then not being systematic about how they manage, process, and analyze those precious bits. Yours truly is a culprit, but I’m trying to get better. Even a basic primer on how to document your data sets would be helpful.

I have to imagine the Business Intelligence and Data Warehouse guys must have some recorded literature along with the DevOps, Scientific Computing, and Quantitative Finance communities. Probably where I need to start diggin’. Really, this process can’t be as haphazard as I’m seeing on a daily basis.

Feels like a good opportunity for O’Reilly Media.


MBP Retina Review Revue

Link parkin’: TidBITS has collected a number of reviews of the Retina Display MacBook Pro.


Nice Time Capsule

The film Phone Booth, starring Colin Farrell, has been knocking about HBO. I remember inadvertently seeing it on an airplane flight and being surprised at how enjoyable it was. The film’s stood up pretty well over a decade. In addition to Farrell, it also stars Kiefer Sutherland, Forest Whitaker, Radha Mitchell, Katie Holmes, a bunch of quality character actors, and a somewhat forgotten seamy side of New York City. Joel Schumacher directs for a taut, tight 81 minutes.

Most notable might be that Phone Booth straddled a time when phone booths weren’t quite dead, cell phones weren’t quite dominant, and Manhattan hadn’t been completely scrubbed clean. Yet even though it captures a moment in time, the overall themes and tension are actually quite timeless. A nice little gem of a movie that people will be watching for years to come.


Hollywood Narcissism

Well, Walter Jon Williams’ The Fourth Wall wasn’t exactly what I was expecting. Our intrepid heroine Dagmar is replaced as primary “protagonist” by Sean Makin. I use protagonist loosely as Makin is a stupefyingly self-absorbed former child acting star. The narrative devolves into a murder mystery regarding the Hollywood-based transmedia production Dagmar is helming with Makin in the lead role.

I think Williams was trying to explore the intersection of mass global entertainment and technology, but there wasn’t enough tech for me. And Makin is not a particularly good person but we have to spend 90% of the book locked into his inner dialogue.

Might be time to retire Dagmar and her crew.


Browser Proliferation

Boy, there are a lot of Web browsers embedded in various applications on my iPad. There’s Mobile Safari of course, a straight up Web browser. Then there’s a nice one hiding in 1Password for iPad. Mr. Reader, my RSS aggregator, has one, as does the IMDb app. The Google Search app has a browser embedded, as does Tweetbot.

Just an observation. Not sure if it’s good or bad. On the one hand, they’re all relatively consistent. On the other, my Web browsing gets sprawled all over the place.


An Air Review

Not just any MacBook Air review, but an in-depth Jacqui Cheng, Ars Technica look at Apple’s latest slim and thin laptops. Bottom line, the container is pretty much the same but the engine is vastly improved.


Et Tu Twitter

So now Twitter has joined Facebook in bombarding my inbox with useless notices about people I already track. Frankly, my email has pretty much become a cesspool of solicitation from corporate entities that I’ve had a prior “relationship” with. I’m almost surprised when I receive a human generated message directly for me. It’s just about enough to make an old UNIXhead give up the darn medium.

Time to go on a filtering rampage.


Thanksgiving In June

Been a rough patch on the home front for the past few days, but things are looking up. I highly recommend staying awake when driving up I-95. The guard rail alarm is really unpleasant.

Gotta give thanks though for all the family and friends providing support and well wishes. And thanks that the accident wasn’t worse. Fingers crossed we’ll get over the hump.


A New Contender

Given this, admittedly limited, comparison between the 13” MacBook Pro and 13” MacBook Air, I may have to mix it into my thought process about a new laptop. More horsepower and more memory expansion are tempting to trade off against the Air’s 50% lighter weight and better resolution. Then again, it feels a lot like settling for the middle of the road. But it’s a personal machine I’m not trying to make money with, so why not do something different and work through the challenges? Something to consider.


Father’s Day

If you’ve got kids and you’re positively impacting their lives, good on ’ya. I can verify it’s rewarding but demanding work.

Can’t say as how I got to take it easy and relax, due to a number of issues, but wanted to acknowledge all the positive male role models out there.


Titan Graph Store

Link parkin’: Titan

Titan is a distributed graph database optimized for storing and processing large-scale graphs within a multi-machine cluster.

Even better, it comes with a reasonably designed query API.


Build From Source

There will come a time when I need to build NumPy, and other packages on top of it, from source. On Mac OS X, it’s those finicky compiler options that get you. Thankfully, Jeet Sukumaran has written them down for when I need to look them up.


Linden’s Links

Great pile of links from Greg Linden:

Talk about geniuses at Facebook ignores the big problem that no one — not Google, Yahoo, Microsoft, Facebook, or any of the newspapers — knows how to solve this problem of making advertising relevant, effective, and lucrative without immediate purchase intent, despite years of work by thousands of brilliant people ([1] [2])


Memo to Zuck

Zuck, no matter how many times you tell me a friend has done posted a new photo or added a new friend on Facebook, I’m not coming back. And I already hit the opt out link once before, which clearly didn’t take, so I’m not falling for that again. So please, just leave it alone, and stop with the e-mails. I mean you’re averaging close to one a day.

Desperation is unbecoming.

Yours. The Management.


The eyeo Festival

Nelson Minar has some high praise for the eyeo Festival. Back when she was less well known, I had the pleasure of meeting Fernanda Viégas at a session of the Hawaii International Conference on System Sciences. Tough work if you can get it. Fernanda in the house is a sign of a top notch event. eyeo sounds like a confab I should make a point of attending before it gets overrun.


Air vs Pro

So the announcement that I’d been pining for came today, with Apple unveiling new products in both the MacBook Air and MacBook Pro lines. The new hotness of the top of the line 15” MacBook Pro, with Retina display, is pretty much my dream replacement for Ye Olde White MacBook. Heck, it can even be bumped up to 16 GB of RAM with 768 GB of SSD, for a small fee.

The 13” MacBook Airs deserve serious consideration though. First, while the Pro has slimmed down, the Airs are still 50% lighter. These days I’m extremely desirous of being leaner and meaner. The Airs can have their RAM upgraded to 8 GB. The kicker is a 13” MacBook Air kitted out with more RAM and 512 GB of SSD ($500, yikes!!) is still about $800 less than the equivalent 15” MacBook Pro. Obviously with the Pro you get more cores, more GHz, and more screen real estate, but $800 is 80% of the way to a Cinema Display for the desktop.

Interestingly, Apple claims the same battery life for both lines, roughly 7 hours, despite what must be a significant difference in power draw due to the Pro’s Retina display.

So serious top of the line mobile horsepower vs smaller, lighter, capable and significantly cheaper. Tough decision. I’ll wait for the in-depth nerd reviews and hold on until the 4th Macaversary.


Kindle Backlog

A nice side effect of installing the Amazon Kindle app on my iPad is that I was reminded of a number of e-books that I had bought. There was a total of nine, seven of which I hadn’t even started, plus two barely begun. Good works including Ready Player One, The Restoration Game, and You Are Not So Smart. Now I don’t have to spend a bunch of bucks on summer reading material.

I also went back and revisited a bunch of books I had bought from O’Reilly and uploaded as PDFs. O’Reilly also supports Amazon’s .mobi format and they let you redownload if needed. Gotta say that while PDFs look really nice, .mobi books shine in a big way. Good on O’Reilly for quality support as well.


LDA Intro

LDA stands for Latent Dirichlet Allocation, a machine learning approach to extracting hidden structure from document collections. I’ve been touting and applying LDA at work for a while now and think I have a pretty good handle on it. However, it’s always good to review the understanding. Edwin Chen has a nice, approachable introduction to LDA with a worked out example as the highlight. Thanks Edwin!


Easy Peasy EMR

Holy smokes! I just ran my first ever Elastic MapReduce (EMR) job flow a few minutes ago. Surprisingly, it didn’t crash or fail to complete. Ran a bit slow, which had me thinking it was gonna bomb. But nope, my little hashtag extraction script finished processing 100MB of data in about 7 minutes. Most of the time was spent shipping the 100MB up to the EMR Hadoop cluster once it got going.

Key was the usage of Yelp’s mrjob Python package. mrjob exploits Hadoop’s streaming mechanism to fit Python into the Java-based processing pipeline. What that costs in performance is more than made up for in flexibility and accessibility. At least for this big data hacking noob.
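To give the flavor of such a job, here is a plain-Python sketch of the map and reduce steps for counting hashtags. It deliberately uses neither mrjob nor Hadoop, so the function names, the hashtag regex, and the sample lines are illustrative assumptions, not mrjob’s API or the original script.

```python
import re
from collections import defaultdict

# Assumed definition of a hashtag: '#' followed by word characters.
HASHTAG = re.compile(r"#\w+")


def mapper(line):
    """Emit (hashtag, 1) for every hashtag found in one line of input."""
    for tag in HASHTAG.findall(line):
        yield tag.lower(), 1


def reducer(tag, counts):
    """Sum the counts collected for a single hashtag."""
    return tag, sum(counts)


def run(lines):
    """Simulate the shuffle locally: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for tag, count in mapper(line):
            grouped[tag].append(count)
    return dict(reducer(tag, counts) for tag, counts in grouped.items())


tweets = ["Loving #pandas and #Python", "more #python today"]
print(run(tweets))
```

In a real streaming job the grouping between the two steps happens across machines; here a dictionary stands in for it.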

And I’m waiting to check the charges, but those will probably be on the order of pennies. Gotta love having your own personal on-demand, dirt cheap cluster.


TIL lxml.objectify

Today I Learned about Python’s lxml.objectify module, which makes navigating XML in-memory tree representations a lot easier. Shame on me, I’ve been using lxml forever.

Thanks to Mike Driscoll
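A small sketch of what that navigation looks like, assuming lxml is installed. The XML document and its element names are made up for illustration.

```python
from lxml import objectify

# Hypothetical sample document; no XML declaration, so a plain str is fine.
xml = """
<feed>
  <entry><title>First</title><views>10</views></entry>
  <entry><title>Second</title><views>25</views></entry>
</feed>
"""

root = objectify.fromstring(xml)

# Child elements are reachable as attributes; repeated siblings index like a list.
first_title = root.entry.title.text
# objectify infers a Python type for leaf text, exposed as .pyval.
second_views = root.entry[1].views.pyval

print(first_title, second_views)
```

Compared with walking the tree via find calls or XPath, attribute access reads much closer to the shape of the document.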


Labeling Axes

Label Your Axes

Ha, ha, only serious, about one of my pet peeves.

Via fluff


0.5 Billion Rows

Typically I don’t find Hacker News discussion threads enlightening although I frequent the site. But this one on loading half a billion rows into MySQL was actually quite good. Mainly because there were a few links to other interesting technologies such as: Snowplow, Trecul, and crush-tools.

And the original link was useful as well even though I’m not a MySQL fan.


OK! OKC

Wow! Someone gave the Oklahoma City Thunder a defensive transplant in their victory over the San Antonio Spurs last night. They needed to hit some clutch shots to close out the win, but there were some periods of the game where the Spurs looked inept offensively. And the Spurs were the league leader in team scoring average!

A few folks have already, prematurely, anointed them as series victor, but it definitely feels like a changing of the guard in the Western Conference.


Further Intrigue

Link parkin’: Fredrik Håård highly recommends bpython as a Python REPL. Color me intrigued.

I’ve never quite cottoned to IPython, although the new HTML notebook is very sexy, because I’ve never gotten it to play well with emacs. Maybe bpython can be a better citizen.


Color Me Intrigued

PostgreSQL recently got some JSON object support baked in. It’s just a beginning though, with not much querying into JSON objects.

Jerry Sievert is attempting to replicate the features of MongoDB in Postgres. His first post just outlines how to implement the basic document insert and lookup features. If he can pull off the document query API of MongoDB, or even a reasonable subset, then things get interesting.

Still reserving judgement.


That’s No Toy

That’s no moon. It’s a space station.

Obi-Wan Kenobi

Earlier I called my 2012 birthday present a toy. Well, after a week and a day I can calmly declare that the 3rd Generation iPad is definitely not a toy.

For me at least, it doesn’t feel like a desktop replacement, but there are definitely some great features:

  • There’s enough screen real estate that Web apps are comfortable, although you may have to do a bit of zooming to make buttons big enough.

  • Mr. Reader, an iPad app for Google Reader, is delightful. Reasonably priced, a bit more functional, and definitely fresher, Mr. Reader is kicking NetNewsWire to the curb on the iPad.

  • Reading books, either through Amazon’s Kindle App or Apple’s iBooks, is gorgeous. Perusing PDFs is enjoyable, although I have to give iBooks bonus points for preserving URLs, whereas Kindle doesn’t.

  • Speaking of Kindle, the “Send to Kindle” app is very handy.

  • The device is a great size for watching videos.

  • Bonus: I have the Verizon LTE edition, which allows you to use the iPad as a mobile hotspot for no extra charge.

I’m definitely starting to love this thing although DJay turned out to be a bit of a disappointment, largely owing to the fact that you need monitor speakers and a cable to handle pre-cueing. Sort of kills the “mobile” in mobile dj. And authoring posts will probably stay on the desktop, at least until I get to a higher level of performance with the keyboard. The iPad hasn’t found a creative niche for me, but it’s currently a great media consumption device. Of course that’s only after one week, plenty of room to grow.


Tab Colorization

Many of my favorite tools support multiple tabs, including iTerm2. On Firefox, there’s a plug-in, Colorful Tabs, that automatically colorizes the tabs making for nice visual distinction. Not sure it’s that big a deal cognitively, but at least it’s pleasant.

Mikko Ohtamaa has put together some Python code that implements tab colorization on ANSI-escape compatible terminal applications such as iTerm2:

OSX’s iTerm 2, and maybe some other terminal applications, support ANSI control sequence extensions which allow shell to set the color of the terminal tab.

Below is a Python script which

  • Randomizes a color based on the server host name. The same hostname always results to the same color.
  • The color is randomized in HSL color space, so that only the hue component varies and saturation and lightness are locked. This prevents the creation of ugly color combinations like black text on black tab background.

Very handy!
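The idea in those two bullets can be sketched in a few lines. This is not Ohtamaa’s script: `hostname_to_rgb` and `tab_color_sequences` are hypothetical names, the saturation and lightness defaults are arbitrary, and the escape sequence format is an assumption based on iTerm2’s proprietary tab color codes that should be checked against its documentation.

```python
import colorsys
import hashlib


def hostname_to_rgb(hostname, saturation=0.6, lightness=0.5):
    """Map a hostname to a stable (r, g, b) tuple, varying only the hue."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # The first two bytes of the hash pick a hue in [0, 1).
    hue = int.from_bytes(digest[:2], "big") / 65536.0
    # colorsys takes its arguments in hue, lightness, saturation order.
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return tuple(round(channel * 255) for channel in (r, g, b))


def tab_color_sequences(rgb):
    """Build one escape sequence per channel (format is an assumption)."""
    names = ("red", "green", "blue")
    return "".join(
        "\033]6;1;bg;{};brightness;{}\a".format(name, value)
        for name, value in zip(names, rgb)
    )


color = hostname_to_rgb("example.org")
print(color)
```

Hashing the hostname, rather than seeding a random number generator, is one simple way to get the “same hostname, same color” property.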


May Gone

What the hell?!? May’s already over? We’re into June, the sixth month of the year? Time is flying in 2012.


Prismatic GReading

As a long time ponderer of “The Daily Me” concept, Prismatic née Woven has been very attractive. I’ve always been curious about how automated, personalized systems might play out, and I enjoyed Bradford Cross’ blogging when he had the time. But when they first launched, they needed access to one’s Twitter account to make recommendations. That was a non-starter for privacy reasons and the fact that I wasn’t really following anyone then. I do most of my infotrapping in Google Reader.

Fast forward a bit and now Prismatic can connect to your GReader account. So I’m on the bandwagon with yet another personalized news service. Maybe this one will work out better than the last such service did.


Clipboard Decloaks

If it wasn’t for the fact that Gary Flake is leading the team, I’d think Clipboard was a derivative, me-too product. His Going Public post makes it seem really interesting though. I’ll have to check it out, especially if they make an API available (mentioned but not discussed by Flake).


Force Multiplying

Once upon a time, I called MapReduce and Sawzall “major force multipliers”. At work, I’m learning the hard way about the Sawzall part of that combination. It’s great to have a scalable distributed programming model and massive data storage engines, but data querying and manipulation are the secret sauce where the magic happens. Trying to do querying type stuff at the Java API level is “teh suck”.

According to Wikipedia, Sawzall never hit the open source big time, but at least Pig, Hive, and Cascalog came to be. Pinin’ for a good open source graph query language and runtime baked into the Hadoop ecosystem though.


TIL resile

Today I Learned the word resile:

Verb: resile

  1. To start back; to recoil; to recede from a purpose.  

  2. To spring back; rebound; resume the original form or position, as an elastic body.

As in “Twitter Resiles From API-Driven Site”.

Via Nat Torkington.


@bigdata @ #hbasecon

Ben Lorica dropped in on HBaseCon and had a few cogent observations. Sounded like the tribal gathering of a smart, growing community around a core component of the Hadoop ecosystem:

HBase and HDFS: Past, Present, and Future: In a conference centered around a piece of technology, you need an overview of new and upcoming features. HBase and HDFS committer Todd Lipcon gave a good survey centered around reliability, availability and durability. I particularly appreciated the summary table at the end of his talk (I’ll post that table once Todd’s slides become available).


Rising

Previously I had threatened a post on the prologue trailer for The Dark Knight Rises but never followed through. What with the promotional push now kicking up (cue irritating intersplices with NBA playoff footage), this is a good time to call shots on the themes of the movie.

My interpretation of Christopher Nolan’s Batman Begins is that of one man rising up against despair and corruption (and the arrogance of Ra’s al Ghul). The Dark Knight was one man fighting against a force of chaos. In both cases though, it was one man as savior.

In the prologue trailer, all about Bane, the central villain of The Dark Knight Rises, it’s clear that Bane has built a movement. One of his henchmen actually sacrifices himself for the cause at Bane’s behest. As he goes to his death, the henchman is clearly reveling in his martyrdom. Yeah, Bane will be a physical and mental badass, but where he’ll have one-upped Batman is in motivating a disenfranchised underclass. And that’s something a loner cannot defeat no matter how powerful a symbol.

So picking up the tail end of the comic The Dark Knight Returns, Bruce Wayne will learn that enduring change is brought by building sustainable collective action. In so doing, after being broken by Bane, he will rise up from the darkness, defeat his enemy, bring lasting peace to Gotham, and ride off into the night.

And the opening of The Dark Knight Rises will punk The Avengers. Can’t wait until July 20th.


5 Skills

I’ve been meaning to give Matthew Hurst a shout out as one of the original social media hackers. His posts over at Data Mining regarding d8taplex and other data hacking exploits definitely provide an interesting perspective.

For an apprentice mad data scientist, Hurst’s 5 Hidden Skills for Big Data Scientists resonates. To wit:

3. Invest in Interactive Analytics, not Reporting. When you construct reports about your data products, you are answering a fixed set of questions. This is useful for monitoring, but it doesn’t provide a way to get at the unknown unknowns. It is only through interactions with data (often called slicing and dicing) that pockets of interest (problems and opportunities) are discovered. Rich, interactive tools may be perceived as a low priority and never quite got to. Avoid this peril!


Intro to Data Science

Link parkin’: CS194-16: Introduction to Data Science. Openly available from my alma mater. Taught by Jeff Hammerbacher and Mike Franklin. Looks like a fairly complete collection of slides and videos is present, including some top notch guest lectures.

Autodidacts unite!


The New Toy

Coming to you live from the new toy my wife got me for my birthday, a new, 3rd generation, iPad. Not quite sure what I’ll be primarily doing with it, but there is one must have …

Algoriddim’s DJay

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.