
Thanksgiving In June

Been a rough patch on the home front for the past few days, but things are looking up. I highly recommend staying awake when driving up I-95. The guard rail alarm is really unpleasant.

Gotta give thanks though for all the family and friends providing support and well wishes. And thanks that the accident wasn’t worse. Fingers crossed we’ll get over the hump.


A New Contender

Given this, admittedly limited, comparison between the 13” MacBook Pro and 13” MacBook Air, I may have to mix the 13” Pro into my thought process about a new laptop. More horsepower and more memory expansion are a tempting trade for the Air’s 50% lighter weight and better resolution. Then again, it feels a lot like settling for the middle of the road. But it’s a personal machine I’m not trying to make money with, so why not do something different and work through the challenges? Something to consider.


Father’s Day

If you’ve got kids and you’re positively impacting their lives, good on ya’. I can verify it’s rewarding but demanding work.

Can’t say as how I got to take it easy and relax, due to a number of issues, but wanted to acknowledge all the positive male role models out there.


Titan Graph Store

Link parkin’: Titan

Titan is a distributed graph database optimized for storing and processing large-scale graphs within a multi-machine cluster.

Even better, it comes with a reasonably designed query API.


Build From Source

There will come a time when I need to build NumPy, and other packages on top of it, from source. On Mac OS X, it’s those finicky compiler options that get you. Thankfully, Jeet Sukumaran has written them down for when I need to look them up.


Linden’s Links

Great pile of links from Greg Linden:

Talk about geniuses at Facebook ignores the big problem that no one — not Google, Yahoo, Microsoft, Facebook, or any of the newspapers — knows how to solve this problem of making advertising relevant, effective, and lucrative without immediate purchase intent, despite years of work by thousands of brilliant people ([1] [2])


Memo to Zuck

Zuck, no matter how many times you tell me a friend has done posted a new photo or added a new friend on Facebook, I’m not coming back. And I already hit the opt out link once before, which clearly didn’t take, so I’m not falling for that again. So please, just leave it alone, and stop with the e-mails. I mean you’re averaging close to one a day.

Desperation is unbecoming.

Yours. The Management.


The eyeo Festival

Nelson Minar has some high praise for the eyeo Festival. Back when she was less well known, I had the pleasure of meeting Fernanda Viégas at a session of the Hawaii International Conference on Systems Science. Tough work if you can get it. Fernanda in the house is a sign of a top notch event. eyeo sounds like a confab I should make it a point to attend before it gets overrun.


Air vs Pro

So the announcement that I’d been pining for came today, with Apple unveiling new products in both the MacBook Air and MacBook Pro lines. The new hotness of the top of the line 15” MacBook Pro, with Retina display, is pretty much my dream replacement for Ye Olde White MacBook. Heck, it can even be bumped up to 16 GB of RAM with 768 GB of SSD, for a small fee.

The 13” MacBook Airs deserve serious consideration though. First, while the Pro has slimmed down, the Airs are still 50% lighter. These days I’m extremely desirous of being leaner and meaner. The Airs can have their RAM upgraded to 8 GB. The kicker is a 13” MacBook Air kitted out with more RAM and 512 GB of SSD ($500, yikes!!) is still about $800 less than the equivalent 15” MacBook Pro. Obviously with the Pro you get more cores, more GHz, and more screen real estate, but $800 is 80% of the way to a Cinema Display for the desktop.

Interestingly, Apple claims the same battery life for both lines, roughly 7 hours, despite what must be a significant difference due to the Pro’s advanced display.

So serious top of the line mobile horsepower vs smaller, lighter, capable and significantly cheaper. Tough decision. I’ll wait for the in-depth nerd reviews and hold on until the 4th Macaversary.


Kindle Backlog

A nice side effect of installing the Amazon Kindle app on my iPad is that I was reminded of a number of e-books that I had bought. There was a total of nine, seven of which I hadn’t even started, plus two barely begun. Good works including Ready Player One, The Restoration Game, and You Are Not So Smart. Now I don’t have to spend a bunch of bucks on summer reading material.

I also went back and revisited a bunch of books I had bought from O’Reilly and uploaded as PDFs. O’Reilly also supports Amazon’s .mobi format and they let you redownload if needed. Gotta say that while PDFs look really nice, .mobi books shine in a big way. Good on O’Reilly for quality support as well.


LDA Intro

LDA stands for Latent Dirichlet Allocation, a machine learning approach to extracting hidden structure from document collections. I’ve been touting and applying LDA at work for a while now and think I have a pretty good handle on it. However, it’s always good to review the understanding. Edwin Chen has a nice, approachable introduction to LDA with a worked out example as the highlight. Thanks Edwin!


Easy Peasy EMR

Holy smokes! I just ran my first ever Elastic MapReduce (EMR) job flow a few minutes ago. Surprisingly, it didn’t crash or fail to complete. Ran a bit slow, which had me thinking it was gonna bomb. But nope, my little hashtag extraction script finished processing 100MB of data in about 7 minutes. Most of the time was spent shipping the 100MB up to the EMR Hadoop cluster once it got going.

Key was the usage of Yelp!’s mrjob Python package. mrjob exploits Hadoop’s streaming mechanism to fit Python into the Java based processing pipeline. What that costs in performance is more than made up for in flexibility and accessibility. At least for this big data hacking noob.

And I’m waiting to check the charges, but those will probably be on the order of pennies. Gotta love having your own personal on-demand, dirt cheap cluster.
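For the curious, here is a rough sketch of what the map and reduce steps of a hashtag extraction job might look like. To be clear, this is not the actual script and it does not use mrjob itself; it is plain Python that mimics the mapper/reducer shape a streaming job takes. The names mapper, reducer, and run_job are made up for illustration, and run_job just simulates the shuffle step locally.

```python
import re
from collections import defaultdict

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def mapper(line):
    """Emit (hashtag, 1) for every hashtag found in one line of input."""
    for tag in HASHTAG_RE.findall(line):
        yield tag.lower(), 1


def reducer(tag, counts):
    """Sum the counts collected for a single hashtag."""
    yield tag, sum(counts)


def run_job(lines):
    """Local stand-in for the shuffle: group mapper output by key, then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    sample = [
        "Running my first #EMR job with #python",
        "More #emr fun, no crashes yet",
    ]
    for tag, count in run_job(sample).items():
        print(tag, count)
```

On a real cluster the grouping in run_job is what Hadoop does between the two steps; only the mapper and reducer logic would carry over.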


TIL lxml.objectify

Today I Learned about Python’s lxml.objectify module, which makes navigating XML in-memory tree representations a lot easier. Shame on me, I’ve been using lxml forever.

Thanks to Mike Driscoll


Labeling Axes

Label Your Axes

Ha, ha, only serious, about one of my pet peeves.

Via fluff


0.5 Billion Rows

Typically I don’t find Hacker News discussion threads enlightening although I frequent the site. But this one on loading half a billion rows into MySQL was actually quite good. Mainly because there were a few links to other interesting technologies such as: Snowplow, Trecul, and crush-tools.

And the original link was useful as well even though I’m not a MySQL fan.


OK! OKC

Wow! Someone gave the Oklahoma City Thunder a defensive transplant in their victory over the San Antonio Spurs last night. They needed to hit some clutch shots to close out the win, but there were some periods of the game where the Spurs looked inept offensively. And the Spurs were the league’s leader in team scoring average!

A few folks have already, prematurely, anointed them as series victor, but it definitely feels like a changing of the guard in the Western Conference.


Further Intrigue

Link parkin’: Fredrik Håård highly recommends bpython as a Python REPL. Color me intrigued.

I’ve never quite cottoned to IPython, although the new HTML notebook is very sexy, because I’ve never gotten it to play well with emacs. Maybe bpython can be a better citizen.


Color Me Intrigued

PostgreSQL recently got some JSON object support baked in. It’s just a beginning though, with not much querying into JSON objects.

Jerry Sievert is attempting to replicate the features of MongoDB in Postgres. His first post just outlines how to implement the basic document insert and lookup features. If he can pull off the document query API of MongoDB, or even a reasonable subset, then things get interesting.

Still reserving judgement.


That’s No Toy

That’s no moon. It’s a space station.

Obi-Wan Kenobi

Earlier I called my 2012 birthday present a toy. Well after a week and a day I can calmly declare that the 3rd Generation iPad is definitely not a toy.

For me at least, it doesn’t feel like a desktop replacement, but there are definitely some great features:

  • There’s enough screen real estate that Web apps are comfortable, although you may have to do a bit of zooming to make buttons big enough.

  • Mr. Reader, an iPad app for Google Reader, is delightful. Reasonably priced, a bit more functional, and definitely fresher, Mr. Reader is kicking NetNewsWire to the curb on the iPad.

  • Reading books, either through Amazon’s Kindle App or Apple’s iBooks, is gorgeous. Perusing PDFs is enjoyable, although I have to give iBooks bonus points for preserving URLs, whereas Kindle doesn’t.

  • Speaking of Kindle, the “Send to Kindle” app is very handy.

  • The device is a great size for watching videos.

  • Bonus: I have the Verizon LTE edition, which allows you to use the iPad as a mobile hotspot for no extra charge.

I’m definitely starting to love this thing although DJay turned out to be a bit of a disappointment, largely owing to the fact that you need monitor speakers and a cable to handle pre-cueing. Sort of kills the “mobile” in mobile dj. And authoring posts will probably stay on the desktop, at least until I get to a higher level of performance with the keyboard. The iPad hasn’t found a creative niche for me, but it’s currently a great media consumption device. Of course that’s only after one week, plenty of room to grow.


Tab Colorization

Many of my favorite tools support multiple tabs, including iTerm2. On Firefox, there’s a plug-in, Colorful Tabs, that automatically colorizes the tabs making for nice visual distinction. Not sure it’s that big a deal cognitively, but at least it’s pleasant.

Mikko Ohtamaa has put together some Python code that implements tab colorization on ANSI-escape compatible terminal applications such as iTerm2:

OSX’s iTerm 2, and maybe some other terminal applications, support ANSI control sequence extensions which allow shell to set the color of the terminal tab.

Below is a Python script which

  • Randomizes a color based on the server host name. The same hostname always results to the same color.
  • The color is randomized in HSL color space, so that only the hue component varies and saturation and lightness are locked. This prevents the creation of ugly color combinations like black text on black tab background.

Very handy!
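To make the idea concrete, here is a minimal sketch of the approach described in the quote. It is my own illustration, not Ohtamaa’s script: the function names host_to_rgb and tab_color_sequence, the choice of MD5 for hashing, and the saturation and lightness defaults are all assumptions. The escape sequences are iTerm2’s proprietary tab color codes, so other terminals will likely ignore them.

```python
import colorsys
import hashlib
import socket
import sys


def host_to_rgb(hostname, saturation=0.5, lightness=0.6):
    """Map a hostname to a stable RGB triple; only the hue varies."""
    digest = hashlib.md5(hostname.encode("utf-8")).digest()
    # The first two bytes of the hash give a hue in [0, 1).
    hue = int.from_bytes(digest[:2], "big") / 65536.0
    # Note the argument order: colorsys uses HLS, not HSL.
    r, g, b = colorsys.hls_to_rgb(hue, lightness, saturation)
    return tuple(int(round(c * 255)) for c in (r, g, b))


def tab_color_sequence(hostname):
    """Build the iTerm2 escape sequences that set the tab color."""
    r, g, b = host_to_rgb(hostname)
    parts = []
    for channel, value in (("red", r), ("green", g), ("blue", b)):
        parts.append("\033]6;1;bg;%s;brightness;%d\a" % (channel, value))
    return "".join(parts)


if __name__ == "__main__":
    # Color the current tab based on the local machine's hostname.
    sys.stdout.write(tab_color_sequence(socket.gethostname()))
    sys.stdout.flush()
```

Hashing the hostname instead of seeding a random generator is just one way to get the "same host, same color" property.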


May Gone

What the hell?!? May’s already over? We’re into June, the sixth month of the year? Time is flying in 2012.


Prismatic GReading

As a long time ponderer of “The Daily Me” concept, Prismatic née Woven has been very attractive. I’ve always been curious about how automated, personalized systems might play out, and I enjoyed Bradford Cross’ blogging when he had the time. But when they first launched, they needed access to one’s Twitter account to make recommendations. That was a non-starter for privacy reasons and the fact that I wasn’t really following anyone then. I do most of my infotrapping in Google Reader.

Fast forward a bit and now Prismatic can connect to your GReader account. So I’m on the bandwagon with yet another personalized news service. Maybe this one will work out better than the last such service did.


Clipboard Decloaks

If it wasn’t for the fact that Gary Flake is leading the team, I’d think Clipboard was a derivative, me-too product. His Going Public post makes it seem really interesting though. I’ll have to check it out, especially if they make an API available (mentioned but not discussed by Flake).


Force Multiplying

Once upon a time, I called MapReduce and Sawzall “major force multipliers”. At work, I’m learning the hard way about the Sawzall part of that combination. It’s great to have a scalable distributed programming model and massive data storage engines, but data querying and manipulation are the secret sauce where the magic happens. Trying to do querying type stuff at the Java API level is “teh suck”.

According to Wikipedia, Sawzall never hit the open source big time, but at least Pig, Hive, and Cascalog came to be. Pinin’ for a good open source graph query language and runtime baked into the Hadoop ecosystem though.


TIL resile

Today I Learned the word resile:

Verb: resile

  1. To start back; to recoil; to recede from a purpose.  

  2. To spring back; rebound; resume the original form or position, as an elastic body.

As in “Twitter Resiles From API-Driven Site”.

Via Nat Torkington.


@bigdata @ #hbasecon

Ben Lorica dropped in on HBaseCon and had a few cogent observations. Sounded like the tribal gathering of a smart, growing community around a core component of the Hadoop ecosystem:

HBase and HDFS: Past, Present, and Future: In a conference centered around a piece of technology, you need an overview of new and upcoming features. HBase and HDFS committer Todd Lipcon gave a good survey centered around reliability, availability and durability. I particularly appreciated the summary table at the end of his talk (I’ll post that table once Todd’s slides become available).


Rising

Previously I had threatened a post on the prologue trailer for The Dark Knight Rises but never followed through. What with the promotional push now kicking up (cue irritating intersplices with NBA playoff footage), this is a good time to call shots on the themes of the movie.

My interpretation of Christopher Nolan’s Batman Begins is that of one man rising up against despair and corruption (and the arrogance of Ra’s al Ghul). The Dark Knight was one man fighting against a force of chaos. In both cases though, it was one man as savior.

In the prologue trailer, all about Bane, the central villain of The Dark Knight Rises, it’s clear that Bane has built a movement. One of his henchmen actually sacrifices himself for the cause at Bane’s behest. As he goes to his death, the henchman is clearly reveling in his martyrdom. Yeah, Bane will be a physical and mental badass, but where he’ll have one-upped Batman is in motivating a disenfranchised underclass. And that’s something a loner cannot defeat no matter how powerful a symbol.

So picking up the tail end of the comic The Dark Knight Returns, Bruce Wayne will learn that enduring change is brought by building sustainable collective action. In so doing, after being broken by Bane, he will rise up from the darkness, defeat his enemy, bring lasting peace to Gotham, and ride off into the night.

And the opening of The Dark Knight Rises will punk The Avengers. Can’t wait until July 20th.


5 Skills

I’ve been meaning to give Matthew Hurst a shout out as one of the original social media hackers. His posts over at Data Mining regarding d8taplex and other data hacking exploits definitely provide an interesting perspective.

As an apprentice mad data scientist, Hurst’s 5 Hidden Skills for Big Data Scientists resonates. To wit:

3. Invest in Interactive Analytics, not Reporting. When you construct reports about your data products, you are answering a fixed set of questions. This is useful for monitoring, but it doesn’t provide a way to get at the unknown unknowns. It is only through interactions with data (often called slicing and dicing) that pockets of interest (problems and opportunities) are discovered. Rich, interactive tools may be perceived as a low priority and never quite got to. Avoid this peril!


Intro to Data Science

Link parkin’: CS194-16: Introduction to Data Science. Openly available from my alma mater. Taught by Jeff Hammerbacher and Mike Franklin. Looks like a fairly complete collection of slides and videos is present, including some top notch guest lectures.

Autodidacts unite!


The New Toy

Coming to you live from the new toy my wife got me for my birthday, a new, 3rd generation, iPad. Not quite sure what I’ll be primarily doing with it, but there is one must have …

Algoriddim’s DJay


An Engineering Blogroll

Link parkin’: Nice passel of software engineering blogs collected by Rafe Colburn. Will have to add ’em all to the aggregator.


Scalzi’s Stuff

John Scalzi is one of my favorite authors. Recently visiting the DC metro area, he had a mental slip and lost track of his computer bag.

I’m glad to say the DC area came through, returning the bag, and full contents, to him. Good on ya’ DMV, there’s gotta be a karma bonus in there somewhere.


UDub CS On Fire

Back in 1988 when I was applying to graduate schools, one of my undergrad advisers suggested I check out the University of Washington. He called their Computer Science and Engineering department up and coming. Took his advice and sent in an application.

I got into 2 of the 3 places I applied, UDub and UC Berkeley, and visited both. I came away pretty impressed with UDub, especially because of some discussion time I had with Tom Anderson, who was “just” a doctoral student at the time. He gave me some great advice on my decision and career, some of which I managed to follow.

Cal was my eventual choice, primarily on rep and the fact that I had a number of friends on campus and in the area, and I’ll never regret that decision. From afar though, I’ve admired the UW CSE program as it’s closed in on the MIT/Berkeley/CMU/Stanford summit.

I haven’t kept up with graduate rankings since I exited my former life, but Jeff Heer and Daniela Rosner joining the UW faculty, not to mention attracting Carlos Guestrin, indicates their further push to reach the top. A key has been building productive relationships with Microsoft, Amazon, Google, and other locally based tech juggernauts.

Occasionally (but not often) makes me wonder “what if?”


Warden and Kelcey

Thanks to stumbling onto the Common Crawl blog, I’ve run across sites for Pete Warden and Mat Kelcey. Both look like interesting data hacking venues which update at a reasonable frequency. Particularly like Warden’s Five Short Links series. Subscribed. May even have to follow them on Twitter.


Twitter Platform Objects

Link parkin’: A field guide to Twitter Platform objects

Like any ecosystem, the Twitter platform has a variety of flora and fauna. Use this field guide to better understand the most frequently observed wild objects.

Don’t know how long this has been in place, but now I don’t have to guess about various details of the Twitter API. At least as long as the docs stay in synch with the code.


How Web Maps Work

How Web Maps Work: Does what it says on the tin. Nice overview.

Via Nelson Minar’s linkblog


Drogba

I didn’t give Chelsea much of a chance to beat Bayern Munich in the UEFA Champions League Final, but they somehow managed to pull off the victory. Oddly they’ve captured a funky double, also winning the FA Cup. Meanwhile, 6th place is the best The Blues could do in the Premiership.

Didier Drogba figured large, scoring an 88th minute goal to keep Chelsea alive, giving up a penalty kick that was eventually saved by Cech, and hitting the final and clinching PK. That’s a long way from the horrible injury Drogba suffered back in August.


Pandas Pre-Print

Wes McKinney’s Python for Data Analysis is in Early Release from O’Reilly. If I could carve out an extended period of time to get some initial experience with pandas I’d grab it. Might have to do so anyway.


Large-Scale In-Memory

Amund Tveit attended Accel’s Big Data Conference and came away with some interesting takeaways. This spurred him to do a mental exercise on what it would take to store a year’s worth of Twitter’s tweets in RAM:

Keeping 1 year worth of tweets (including metadata) and (a crude) index of them in-memory is costly, but not too bad. I.e. 1.36 Million USD to keep 1 years worth of tweets (124 billion tweets) for 1 year in an (distributed) in-memory hashtable (or the same amount of tweets stored in the same hashtable for one day costs approximately 3732 USD).

Q: So, is it time to reconsider using hard drives and SSDs and consider going for RAM instead

A: yes, at least consider it and combine with Hadoop. …

Obviously $1.36 million is a heap o’ kale, but there are probably businesses with that scale of data and a competitive need for that speed of processing. It’s extreme now, but 15 years ago buying terabytes off the Costco shelf was unimaginable to most people.

Other nuggets that struck me between the two posts include the fact that Twitter sees 340 million tweets per day and that realtime processing is a hot topic. I personally had the feeling that “realtime at scale” is the new frontier but this is a shred of confirming evidence.
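The quoted figures hang together, which a few lines of back-of-the-envelope Python can confirm. This is just a sanity check using the numbers quoted above (340 million tweets per day, 1.36 million USD per year); the 365-day year and the variable names are my own assumptions, and none of this reproduces Tveit’s actual cost model.

```python
# Figures taken from the posts quoted above.
TWEETS_PER_DAY = 340_000_000
COST_PER_YEAR_USD = 1_360_000

# Assumption: a plain 365-day year.
DAYS_PER_YEAR = 365

# A year of tweets at the quoted daily rate.
tweets_per_year = TWEETS_PER_DAY * DAYS_PER_YEAR

# Cost of keeping the full year of tweets in RAM for a single day.
cost_per_day = COST_PER_YEAR_USD / DAYS_PER_YEAR

# Yearly cost per million tweets held in memory.
cost_per_million = COST_PER_YEAR_USD / (tweets_per_year / 1_000_000)

if __name__ == "__main__":
    print("tweets per year: %d" % tweets_per_year)
    print("cost per day: %.2f USD" % cost_per_day)
    print("cost per million tweets per year: %.2f USD" % cost_per_million)
```

That works out to 124.1 billion tweets per year, matching the quoted 124 billion, and a bit over 3,700 USD per day, close to the quoted 3732 USD. The small gap is presumably rounding in the yearly figure.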

Don’t know about that “combine with Hadoop” comment though. The more I find out about Hadoop the less impressed I am.


5 Billion Pages

Ever since I stopped my personal Twitter data collection project, I’ve been mentally casting about for a new dataset to build my Mad Data Skillz ™ (Boyeeeee!). Obviously, just restarting the tweet inflow is an option, but something involving more scale with less work would be nice.

Enter Common Crawl, a non-profit making a large — 5 Billion Web page — crawl publicly and freely available on Amazon EC2. How juicy! A big dataset conveniently located within the premier, openly available, utility computing infrastructure in the world. Definitely has potential to put the Skillz to the test. Common Crawl even has a convenient series of blog posts instructing one on how to process their page repository.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.