Why HDFS

Charles Zedlewski makes an interesting analogy between HDFS, the Hadoop Distributed File System, and the Linux operating system. Stealing the punchline:

It’s rare when you get to see history repeat itself so completely as it is with HDFS. Today HDFS may not be the best filesystem for content addressable storage or nearline archive. But then 15 years ago who would have thought Linux would find its way into laptops, routers, mobile phones and airport kiosks?

Linux drew us the map. The smart money is already following it.

We’ll see how it plays out, but given the entrenched nature of HDFS he might be right. HDFS’ open source nature, and maybe more importantly community, means just about any good distributed file system idea can be quickly embraced and extended.

There is probably one area where HDFS could be radically updated or face displacement. Real-time streaming datasets don’t fit the HDFS model particularly well. Doesn’t mean someone smart can’t come along and fix it up.

Zedlewski also heavily invokes the nice support for Map/Reduce processing that HDFS provides. Map/Reduce is clearly successful, but these other processing demands may eventually lead to other programming models that fit less well with HDFS.

But I’m of a mind that Zedlewski is mostly right, and that HDFS is a nice solid foundation to build on going forward.


Building REST APIs with Python

In my Copious Spare Time ™, I’m thinking about how one can rapidly build lightweight, fast REST APIs using Python. I may revisit Django approaches again, but the heavy dependency on an RDBMS and a model-based Object-Relational Mapper (ORM) makes it feel like I’m fighting against the toolkit. The key issue is that to get performance for some aspects, I’m anticipating the need for other data stores such as a full-text index or a NoSQL repository. I’m thinking one of the Python web micro-frameworks, possibly Flask, might be the right fit.
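To make "lightweight, no ORM" concrete, here is a minimal sketch using only the standard library's WSGI conventions rather than Flask itself. The `NOTES` dict, the `/notes` routes, and the `call` helper are all made up for illustration; the dict merely stands in for whatever non-relational store ends up behind the API.

```python
import json

# Hypothetical in-memory store standing in for a NoSQL repository or full-text index.
NOTES = {"1": {"title": "Why HDFS"}}


def app(environ, start_response):
    """Tiny WSGI app: GET /notes lists everything, GET /notes/<id> returns one item."""
    parts = [p for p in environ.get("PATH_INFO", "").split("/") if p]
    is_get = environ.get("REQUEST_METHOD") == "GET"
    if not is_get or not parts or parts[0] != "notes" or len(parts) > 2:
        status, body = "404 Not Found", {"error": "not found"}
    elif len(parts) == 1:
        status, body = "200 OK", NOTES
    elif parts[1] in NOTES:
        status, body = "200 OK", NOTES[parts[1]]
    else:
        status, body = "404 Not Found", {"error": "no such note"}
    payload = json.dumps(body).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(payload)))])
    return [payload]


def call(path, method="GET"):
    """Invoke the app in-process (no server needed); returns (status, decoded JSON)."""
    seen = {}

    def start_response(status, headers):
        seen["status"] = status

    chunks = app({"REQUEST_METHOD": method, "PATH_INFO": path}, start_response)
    return seen["status"], json.loads(b"".join(chunks))
```

To poke at it from a browser, `wsgiref.simple_server.make_server("", 8000, app).serve_forever()` should be enough. A micro-framework would mostly replace the hand-rolled routing in `app`.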

Wow! It’s been a long time since I talked about Redis. It’s up to version 2.4.x now.


Mac OS X Tweetbot

Tapbots released an alpha version of Tweetbot for Mac OS X. Figured I’d give it a whirl since I use Tweetbot so much on the iPhone and iPad. Like the familiar look and haven’t run across any bugs so far. Here’s hoping the Tapbots have success with this product.


Django Project Start

Okay, I learned a lot from this post, Starting a Django Project the Right Way:

One of the things I wish I had known when starting my Django project for IllestRhyme was “How do I start a real Django project”. As in, one that’s actually going to be used and developed more, not the toy project from the (admittedly execellent) Django documentation.

Having just gone through this process again for my new site, I wanted to share the knowledge I’ve gained about how to properly start a project in Django.

In addition, the comments are well worth reading.

Via Nat Torkington


Roles of a Data Science Team

A Gnip Data Story, a.k.a. interview, with bitly’s Hilary Mason provides the most substantive detail I’ve heard about what her team actually does:

My team plays a few roles within the company. We handle the business analytics, which can be answering very simple questions like, “How many new URLs did we see yesterday?” to complex questions like, “How do we value a URL being clicked from platform X vs platform Y over time?”.

… In summary, my team is responsible for pushing the boundaries of where bitly can go. It’s fun.

There’s plenty more where that came from.

Good stuff. Just always been curious.


Don’t Tempt Me

It’s a sad statement of mega-corporation life that while the tumblr for Today’s Corporate Meeting Challenge is really funny, I am actually tempted to try and rise to the challenge.

Even sadder is that I’d be lucky to actually be called out in any of the numerous meetings I’m asked, nay required, to attend. Half of these would just go over people’s heads and they’d just forge on.

There should be a separate category for teleconferences. I’m undecided if these monstrosities should garner bonus points for laugh multiplication, or be considered the putting green. Seriously, you can always say the most profoundly stupid things in a telecon if you follow it up with “Sorry. Thought I was on mute.”


Python Modules of Unusual Usage

Getting a lot of mileage out of:

  • requests. Wasn’t quite ready to believe the hype, but it does make executing HTTP requests a lot more humane
  • ConfigObj. Makes parsing Windows INI files about as convenient as possible
  • argparse. Once you internalize the Tao of this module, combine it with cliff to make command-line Swiss army chainsaws.
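
For the argparse item, a minimal sketch of the git-style subcommand pattern using only the standard library (no cliff). The program name `tool` and the `fetch` and `report` subcommands are hypothetical, purely for illustration.

```python
import argparse


def build_parser():
    """Build a parser with git-style subcommands; all names here are made up."""
    parser = argparse.ArgumentParser(prog="tool")
    sub = parser.add_subparsers(dest="command", required=True)

    fetch = sub.add_parser("fetch", help="fetch a URL")
    fetch.add_argument("url")
    fetch.add_argument("--timeout", type=float, default=10.0)

    report = sub.add_parser("report", help="summarize results")
    report.add_argument("--verbose", action="store_true")
    return parser


def main(argv=None):
    """Parse argv (defaults to sys.argv[1:]) and dispatch on the subcommand."""
    args = build_parser().parse_args(argv)
    if args.command == "fetch":
        return "fetching {} (timeout {}s)".format(args.url, args.timeout)
    return "verbose report" if args.verbose else "report"
```

Each subparser carries its own options, which is most of what makes the "command shell framework" style pleasant to extend.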

Computational Politics

Choice quote from Nat Torkington:

If von Clauswitz were around today, he’d say drones are the computation of politics by other means.


+1 AtBat12

MLB’s AtBat 2012 is an iPad and iPhone app. I’ve been enjoying it on my iPad but that device is currently charging. So I fired up the iPhone version, hit the “Restore” button, logged into iTunes, and boom! Streaming ’Nats audio.

Nice work MLB!


Peet! Say It Ain’t So

As someone whose first good cup of coffee was brewed from Peet’s beans, I’m sad to see the Berkeley-based roaster absorbed into a massive corporate maw. They held out much longer than I thought. Fond memories of the eclectic customer stream going through the tiny ur-Peet’s at Shattuck and Rose.


Scary Formats

You know you’ve been hacking when you start to both hate and appreciate the intricacies of csv file field escaping because you’ve come to both hate and appreciate the intricacies of JSON’s UTF-8 encoding. Kids, don’t try to put a data format that’s hard to escape by itself inside another format that has, at best, loosely defined escaping mechanisms.

Two great tastes that taste horrible together.

But I did it all for the data, the data.
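
A small sketch of the two escaping layers at work, using Python's standard `csv` and `json` modules. The sample record is invented; the point is only that each layer rewrites the quotes the other one produced.

```python
import csv
import io
import json

# Invented record containing quotes, a comma, a newline, and a non-ASCII character.
record = {"user": "crossjam", "text": 'He said "hi",\nthen left \u2603'}

# Layer 1: JSON escapes quotes and the newline with backslashes
# (and, by default, writes non-ASCII characters as \uXXXX escapes).
cell = json.dumps(record)

# Layer 2: csv.writer wraps the field in quotes and doubles every embedded double quote.
buf = io.StringIO()
csv.writer(buf).writerow(["42", cell])
line = buf.getvalue()

# Reading undoes the layers in the opposite order: CSV first, then JSON.
row = next(csv.reader(io.StringIO(line)))
roundtrip = json.loads(row[1])
```

Printing `line` shows sequences like `\""`, which is exactly the kind of thing that breaks hand-rolled parsers on either side.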


AtBat 2012

In an effort to cut way back on the Couch Potato action, I’m trying for more listening than watching. The iPad makes for a good music media player, but sports programming is the critical factor. I can cut back on the sports “news” and mindless chatter through self-discipline, but live games are my Kryptonite. At least when I’m listening to a live broadcast I can multitask on things like writing code. TV just pins me slug-like on the couch.

So I’m giving MLB AtBat 12 a run (even as I post this!) as it allows you to listen to every Major League Baseball radio broadcast as a digital stream. Pair it with some Bluetooth speakers and you’ve got a modern day Rube Goldberg radio, complete with Retina Display!

So far the experience has been great, and the price right at $2.99 for my trial month. While MLB seems to have their act together, we’ll see how things fare during football, football, basketball, and hockey seasons. My impression is that the NFL, NCAA, EPL, NBA, and NHL aren’t quite at the same level of sophistication. But at least I can build up the anti-TV habit.


Just Works

I had to take my wife to a doctor’s appointment today. As is typical, there was a lot of hurry up and wait. I had both my work laptop and iPad with me. The iPad is a Verizon LTE model, which means they throw in the mobile hotspot capability for no extra charge. Hadn’t used it yet so I figured I’d give it a whirl to see how painful the process could be, just for future reference.

Turned on the hotspot on the iPad. Connected the laptop to the appropriate network name. And it just worked! I was even pleasantly surprised by the responsiveness of the network. Didn’t feel cellular at all.

Good to know, especially when routing around ridiculous hotel rates for Wi-Fi.


Teach Data Science

Link parkin’: Teach Data Science

This is the companion site to the electronic textbook, Introduction to Data Science, by Jeffrey Stanton. This book provides non-technical readers with a gentle introduction to essential concepts and activities of data science. For more technical readers, the book provides explanations and code for a range of interesting applications using the open source R language for statistical computing and graphics.


Prismatic, +1, -1

I’m getting a lot out of Prismatic, yet another in a venerable line of personalized “Daily Me” news services. I’m a sucker for these things but the technology landscape is literally strewn with their corporate wreckage. I would like to love it, but there are some issues:

  • Using short links in e-mail sent from the site
  • Putting links to Python documentation in my stream
  • No help, or way to get to my profile while reading my stream
  • Too much space allocated to “Share this story with your friends…”
  • Occasionally crashes Mobile Safari on my iPad

The biggest downside is that Prismatic is Yet Another Place To Read News (YAPTRN), and right now it’s last on my list of stops. We’ll see how long it can survive there.

The big upside is that I typically run across at least a few items of interest whenever I have a Prismatic session. It’s much better than predecessors of its ilk. The good bits:

  • Clever trick to heavily use facial images, social outlinks, and embedded Tweets. Feels more human and engaging.
  • Although completely opaque, the big images nicely break up all the text.
  • Inertial scrolling on the iPad fits the Prismatic style well, or they’ve really optimized for the device.

Makin’ Maps

This is part one of a five-part series about our recent explorations making choropleth maps using PostGIS, TileMill, Mapnik and Google Maps.

Turns out to be a total of six posts, but it’s still a handy, not too deep, dive into making maps. Forewarned, there’s a fair bit of command line tweakage and assumed familiarity with open source tech. (Python Inside!) Not for those used to a lot of desktop app handholding.

From The Chicago Tribune’s News Apps Team.


Leanin’

If I indeed upgrade my personal Apple machine, in celebration of my fourth Macaversary, I’m leaning towards a kitted-out 13” MacBook Air. You can do a lot of damage with an ultralight portable boasting a half terabyte of SSD. And it’s at a relatively affordable price.


Enduring Geotech

I was all set to get some deep insight from The Atlantic’s article “The Future of the Map Isn’t a Map at All—It’s Information”, but it turned out to be pretty shallow. Even the attendant video wasn’t much more than a promotional for some new Google tech.

But between the provocative title, and my noodling into GIS technologies, it got me thinking that geohacking is a great business for a tech-oriented person, especially with today’s advances. Interface applications, Web and native? Check! Massive data processing? Check! Real time data processing? Check! Mobile applications? Check! Relevance to problems that matter? Check! Open, gentle slope, avenues to learn and hack? Check!

Important enduring organizations and institutions, with big checkbooks, care about understanding and using geospatial data. And more technologies are making more of that data available, at lower cost, more regularly.

If I was still in my past life, I would urge every Computer Science student to take an intro Geospatial Information System course. Right up there with Compilers, Operating Systems, and Relational Database Systems. GIS techniques and issues are that enduring of a computational capability in milspeak.


cliff

In the past I’ve written my own Python command line processing module to emulate what I call command shell frameworks ala git, Mercurial, and Subversion. Sucked.

I tried the pyCLI module but it didn’t quite work for me.

After a few hitches, Doug Hellmann’s cliff module did the trick. Need a longer test drive, but so far it’s been highly useful. I don’t quite love the use of distribute hooks but I can live with it until I find a better solution. The baked in command REPL is a nice to have.

Using cliff has been a good way to paper over some fairly complex processing with a power user grade UI. Also quite easy to add new features with quick turnaround.


Common Crawl Contest

The Common Crawl folks put together a little video to better explain their purpose, aims, and goals. Very well executed and it also includes an announcement of their first hacking contest. Tempted to do some sideline hacking on that dataset just for the Big Data experience. Winning would just be serendipity.


Postgres.app

Link parkin’: Heroku’s nicely packaged PostgreSQL for Mac OS X, Postgres.app

Postgres.app is the easiest way to get started with PostgreSQL on the Mac. Open the app, and you have a PostgreSQL server ready and awaiting new connections. Close the app, and the server shuts down.

Postgres.app will be distributed through the Mac App Store, with a separate build containing the latest PostgreSQL beta available for direct download from the website.

I love the fact that PostGIS 2.0 is baked in. But I’m also a little wary of how well it supports the building and installation of extensions. I needz my plpythonu.


The Blatche Era

Andray Blatche’s stint as a Washington Wizard has ended. The Basketball Jones, along with some mildly amusing commenters, makes appropriate fun of the occasion:

As you can see, the Andray Blatche market is pretty dried up. There are even rumors that the Bucks’ original bid would have included an additional three pounds of mozzarella string cheese, but they pulled that because they didn’t think he was worth it. Tough break, but I’m sure the $23 million will cheer him up.

Looking back, I’m less disappointed with Blatche than with the last remaining piece of failure, Ernie Grunfeld. Andray was just a 29th pick, a straight-out-of-high-school project that hasn’t panned out. It was Grunfeld who made the decision to ridiculously overpay him.

And let’s be clear, I was pretty disappointed with Andray.


Udvar-Hazy Rises

The Dark Knight Rises opens this week, including a run at the Udvar-Hazy Center. Longtime followers will know that this makes me really happy.

Real IMAX FTW.


Common Crawl 2012

Mmmmmmm, fresh, hot data! With instructions to boot:

I am very happy to announce that Common Crawl has released 2012 crawl data as well as a number of significant enhancements to our example library and help pages.

Along with this release, we’ve published an Amazon Machine Image (AMI) to help both new and experienced users get up and running quickly. The AMI includes a copy of our Common Crawl User Library, our Common Crawl Example Library, and launch scripts to show users how to analyze the Common Crawl corpus using either a local Hadoop cluster or Amazon Elastic MapReduce.


Social Networks Unprofitable?

Good thought experiment by Derek Powazek: “What if Social Networks Just Aren’t Profitable?”

Here’s the short version: Every community-based site in the history of the web has essentially been a stab at creating a social network. Most of them fail as businesses, with the rare exception of small, lucky communities that become self-sufficient but not exactly prosperous. What if that’s just the way it is?

I’d say there’s a lot of evidence for the thesis and not much agin’ it. There’s also “profitable” and “PROFITABLE”. Maybe the profits in social networks just don’t scale to the level of publicly traded corporations. Doesn’t mean a nice living can’t be made at the Main Street or regional level.


Goodbye and Good Riddance

Friday, July 13th, glad to see you go. Seems like every single complex system (natural or man made) decided to spend the day frustrating me. From organizational BS, to proposal rejections, to IT failures (special honors to MS Outlook 2007), to traffic, to kids just doing what they do, frustration abounded.

Well, your worst has been done and no catastrophes occurred. Sayonara!


s3tools

Link parkin’: s3tools

S3cmd is a command line tool for uploading, retrieving and managing data in Amazon S3. It is best suited for power users who don’t fear command line. It is also ideal for scripts, automated backups triggered from cron, etc.

Just what the doctor ordered for the continuous Tweet collectin’, S3 storin’, time constrained home hacker. On Ubuntu, cloning the git repo, running a config, and accessing my S3 bucket works right out of the box.


eero

As someone who wrote a dissertation on programming language design, it’s rare when I see an announcement for a new language and go “Wow! That looks cool!” I just did that for the first time in a long time when I read about eero:

Eero is a fully binary- and header-compatible dialect of Objective-C, implemented with a modified version of the Apple-sponsored LLVM/clang open-source compiler. It features a streamlined syntax, Python-like indentation, and other features that improve readability and code safety. It is inspired by languages such as Smalltalk, Python, and Ruby.

“Eero” is pronounced [ˈe-rō], and is similar to the English word “aero”.

After a surface read, I actually thought programming for the Mac OS X and/or iOS system APIs might be fun.

Still a long slog to success and popularity, but a kid’s gotta start somewhere.


Exit GeoIQ

I’ve mentioned before that in my day job I tried to fire up some collaborations with GeoIQ (née FortiusOne). Never could get anything off the ground but enjoyed my interactions and visits to their Clarendon office space. Always admired their scrappiness from afar and the neogeographer community they built up around GeoCommons.

Yesterday GeoIQ announced their acquisition by Esri, the 900 pound gorilla of GIS systems. On the one hand, it’s a little sad to see the little local guys get gobbled up. On the other hand, I hope it created a reasonable exit for the folks I got to make a personal connection with. Probably wasn’t lifechanging but beats going bankrupt.

And I find it interesting that they’re going to establish a Research and Development center in the DC area. Between the rapid tech changes in massive data analytics, mobile development, and Web mapping, must be an interesting time for GIS folks. I sort of realized that DC was a bit of a geonerd center, but this is just another confirming datapoint. Makes complete sense what with the concentration of gov, mil, spook, sci, campaign, and NGO types in the DMV.

Good luck, Sean and crew!


PyCon 2013

Yeehaw! PyCon 2013 is officially on the docket. Back in sunny Santa Clara again, I’m mentally booking my return trip.

Without the distracting work related activities.

And dying from some virus.

I promise.


Installing Pandas

Recently I went through the process of installing pandas on Mac OS X and had a similar experience to Grig Gheorghiu:

I tried to install the pandas Python library a while ago using easy_install/pip and I hit some roadblocks when it came to installing all the dependencies. So I tried it again, but this time I tried to install most of the required packages from source. Here are my notes, hopefully they’ll be useful to somebody out there.

It wasn’t a truly heinous effort, but a lot less clean than I expected. Like Grig, HDF5 and PyTables were the worst, being the only ones I couldn’t pip my way through. However, I already had gfortran installed.

I’m really looking forward to putting pandas to the test, but this exercise makes using something like the Enthought Python Distribution really attractive.
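
For what it's worth, a quick way to see which pieces are still missing after an install attempt is to ask the import machinery directly. This is just a sketch; the names in `CANDIDATES` are my assumption about the relevant import names (`tables` is how PyTables is imported), not a complete dependency list.

```python
import importlib.util

# Assumed import names; adjust to whatever the install notes actually require.
CANDIDATES = ["numpy", "pandas", "tables"]


def missing_modules(names):
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]
```

Calling `missing_modules(CANDIDATES)` returns the subset that still needs attention, without actually importing anything heavy.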


The Human Division

Loved John Scalzi’s Old Man’s War trilogy, so his upcoming project, The Human Division, looks intriguing:

My next project from Tor is called The Human Division. It takes place in the “Old Man’s War” universe, after the events of The Last Colony and Zoe’s Tale. It is not, strictly speaking, a novel.


Serious Analysis

So far so good on the first two chapters of Philipp Janert’s Data Analysis with Open Source Tools. Actually better than good, it’s been great. A little statistics, a little graphics, a little math, and a little programming. All starting from an expectation that the reader is somewhat experienced and with a matching serious tone. Kernel density estimates were actually new to me. A number of O’Reilly books start off a bit breezy, but not this one. Well worth the money so far.
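
Since kernel density estimates were the new bit, here is a minimal pure-Python sketch of the idea, assuming a Gaussian kernel and a hand-picked bandwidth. This is my own illustration, not code from the book: each sample contributes a small bump, and the estimate at `x` is the average of the bumps.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x from the given samples."""
    n = len(samples)
    # Normalizing constant so the estimate integrates to 1.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per sample, centered on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density
```

A smaller bandwidth gives a spikier curve that hugs the samples; a larger one smooths them together. Picking it well is the hard part.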


Streaming Spark

Cool! Scalable streamed data processing on top of Hadoop-like infrastructure, via Discretized Streams (PDF)

The key idea behind D-Streams is to treat a streaming computation as a series of deterministic batch computations on small time intervals. For example, we might place the data received each second into a new interval, and run a MapReduce operation on each interval to compute a count. Similarly, we can perform a running count over several intervals by adding the new counts from each interval to the old result. Two immediate advantages of the D-Stream model are that consistency is well-defined (each record is processed atomically with the interval in which it arrives), and that the processing model is easy to unify with batch systems. In addition, as we shall show, we can use similar recovery mechanisms to batch systems, albeit at a much smaller timescale, to mitigate failures more efficiently than existing streaming systems, i.e., recover data faster at a lower cost

That’s how the Cal CS Division rolls.

Via Ben Lorica
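
The counting example from the quote can be sketched in a few lines of plain Python. This is only a toy illustration of the discretization idea, not Spark's API: the function names and the `(timestamp, key)` record shape are assumptions of mine, and intervals with no records are simply skipped.

```python
from collections import Counter, defaultdict


def discretize(records, interval=1.0):
    """Group (timestamp, key) records into batches, one per time interval."""
    batches = defaultdict(list)
    for timestamp, key in records:
        batches[int(timestamp // interval)].append(key)
    # Return batches in time order; empty intervals do not appear.
    return [batches[index] for index in sorted(batches)]


def running_counts(batches):
    """Yield the cumulative counts after each batch is processed."""
    total = Counter()
    for batch in batches:
        # Deterministic batch step: count this interval, add it to the old result.
        total = total + Counter(batch)
        yield dict(total)
```

Because each step depends only on the previous result and one batch, replaying a lost interval reproduces the same state, which is the recovery property the paper leans on.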


Return To Local?

As someone who’s pretty much hated retail shopping since he was old enough to be dragged to a mall by his mother, I’m probably a little too predisposed to Jeff Jordan’s analysis that e-commerce is killing physical retail outlets:

I believe we’re approaching a sea change in retail where physical retail is displaced by e-commerce in a multitude of categories. The argument at a high level:

  • Online retail is relentlessly taking share in many specialty retail categories, resulting in total dollars available to physical retailers stagnating or even declining. This is starting to put intense pressure on their top lines.
  • Physical retailers are very highly leveraged and often have narrow profit margins. Material declines in their top lines make them unprofitable and quickly bankrupt.
  • Online retail will benefit greatly from the elimination of their physical competition and their growth should accelerate.

Need to check out the contra-commentary, and the argument rests a little too much on the singularity of Amazon, but the premise feels sound to me.

If this pans out, my question is what happens to all the physical space and local talent that supports big box outlets and malls? Does Main Street make a comeback? What do you do with vast stretches of strip malls like Rockville Pike in Maryland? How about more open and green spaces as folks shift to even more knowledge work and personal services? More home-based businesses or co-working?

Living in the exurbs of Loudoun County, VA, I’m hard pressed to envision what the region will look like in a generation from a commerce perspective. Leesburg Premium Outlets better look out.

Via Jenn Webb @ O’Reilly Radar


Hangouts Fandom

Recently in my feedflow I’ve noticed quite a few plaudits for Google+ Hangouts, e.g. Lucas Gonze, who I’ve followed for quite a while, is a Hangouts fan. While I’ve enjoyed the television commercials, since I’m not a huge G+ user I haven’t really gotten into it. However, I’m intrigued that there might be something that shows a path to elimination of the execrable Microsoft LiveMeeting in the workplace.

The amount of expensive talent time wasted setting up collaboration with that hideous tool is stunning to me. Feels like Hangouts hits the sweet spot, while being Web native and thus cross platform, which is the other half of my gripe with LiveMeeting, being a Mac user.

Probably not in my career lifetime, but one can hope.


Python for Data Analysis

Got suckered by one of those O’Reilly 50% off daily deals on e-books and had to buy the Early Release of Python for Data Analysis. Looking forward to digging into some pandas on my iPad.

Couldn’t stop at just one though, and had to grab the 2nd edition of SQL and Relational Theory by C. J. Date along with Data Analysis with Open Source Tools. I was pleasantly surprised by the table of contents for the latter. A bit meatier than I anticipated.

If you’re reading this on July 3rd, 2012 you can still jump in on the deal. Clock’s ticking though. Midnight PDT is when it expires. Also of note: it only applies to a particular 25 e-books.


eGenix PyRun

eGenix PyRun looks like it might be useful someday

Our new eGenix PyRun™ combines a Python interpreter with an almost complete Python standard library into a single easy-to-use executable, that does not require a system wide installation and is fully relocatable.

PyRun’s executable only needs 12MB, but still supports most Python application and scripts - and it can be further compressed to 3-4MB using gzexe or upx.


Tree Style Tabs

I have some serious tab proliferation in both Chrome and Firefox. I’ll have to check out the Tree Style Tab Firefox plug-in.

This provides tree-style tab bar, like a folder tree of Windows Explorer. New tabs opened from links (or etc.) are automatically attached to the current tab. If you often use many many tabs, it will help your web browsing because you can understand relations of tabs.

Via Matt Ryall


Twelve Factors

Link parkin’: The Twelve-Factor App, a manifesto and methodology for building modern, scalable web applications. We’re a long way from good ol’ CGI.

Via Rafe Colburn

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.