
blekko izik

blekko unleashed a tablet search application. Might be worth checking out. Good to know blekko is still competing hard and creatively in the search space. Maybe their zag will be to make search a lot more usable everywhere instead of gobbling up every search in the world.

[embed]http://vimeo.com/56728379[/embed]

Friends of blekko!

We are very pleased to announce izik, our new tablet search app. We launched izik on Friday, and today it is the #3 free reference app in the Apple app store.

We believe that the move to the tablet from the desktop/laptop is an environmental shift in how people consume web content. We have developed a search product that addresses the following unique problems of tablet search:


New Bitly APIs

[embed]https://twitter.com/hmason/status/288722595606573056[/embed]

For whatever reason, today was chock full of link fodder. Some of it I’ll be saving for later in the week, but Hilary Mason’s announcement of new, real-time, streaming APIs tickles my fancy.

We just released a bunch of social data analysis APIs over at bitly. I’m really excited about this, as it’s offering developers the power to use social data in a way that hasn’t been available before. There are three types of endpoints and each one is awesome for a different reason.

Already scheming some data collection and analysis projects.


Wiz Thunder

[embed]https://twitter.com/DidTheWizWin/status/288471697844236288[/embed]

I was there in person. Beal for 2 to take the lead @ 0.3 seconds in the fourth. Didn’t realize how undermanned the Wiz were. It was a bit more exciting than the BCS National Championship Game.


Python Hadooping

In my day job, I’ve been using Yelp!’s mrjob framework to run a lot of Hadoop Map-Reduce jobs. Takes a lot of the Java pain away. So it was with quite a bit of interest that I dug into Uri Laserson’s A Guide to Python Frameworks for Hadoop:

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

His bottom line is that straight Hadoop Streaming with Python has the best performance. Meanwhile, mrjob is well maintained, active, and quite productive, but comes with a significant performance hit.
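For the unfamiliar, Hadoop Streaming jobs are just programs that read lines on stdin and write tab-separated key/value lines on stdout. A toy word-count sketch of that style, with the pipeline simulated locally (the sample lines and function names are mine, not from Laserson's post):

```python
from itertools import groupby

# Word count in the Hadoop Streaming style: Hadoop pipes raw text
# through the mapper, sorts the emitted key/value pairs, then pipes
# the sorted stream through the reducer. Sketched here as plain
# functions; in a real job each would be a small script reading stdin
# and writing tab-separated lines.

def map_line(line):
    """Mapper: emit (word, 1) for every word on the line."""
    for word in line.split():
        yield word.lower(), 1

def reduce_pairs(pairs):
    """Reducer: sum counts per word; assumes input sorted by key,
    which Hadoop's shuffle phase guarantees."""
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Simulate the map -> sort -> reduce pipeline locally.
lines = ["the quick brown fox", "the lazy dog"]
pairs = sorted(p for line in lines for p in map_line(line))
print(dict(reduce_pairs(pairs)))
```

The appeal is obvious: no Java, no framework, just two scripts and the shuffle doing the sorting for you.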

A small sample size, but worth reading to know all your options if you’re into mixing elephants and snakes.


A Decade of NFL Play-by-Play

This archive of NFL play-by-play data popped back up on Hacker News. More grist for the sports data hacking mill:

I’ve recently completed a project to compile publicly-available NFL play-by-play data. It took a while, but now it’s ready.

The resulting database comprises nearly all non-preseason games from the 2002 through [edit: 2012] seasons. I have not performed any analysis on the data, so what you’ll get are only the basics—time, down, distance, yard line, play description, and score. It’s almost exactly what I started with. I’ll leave any analysis up to you.


Radar Themes 2013

I get a lot of my good tech nuggets through the O’Reilly Radar blog. This year they’ve given a heads-up on their themes, and the process behind their selection, for 2013:

In our discussions we worked through about 35 potential themes and narrowed them down to 10 that we are actively working on in the first quarter of 2013. Here’s the list along with the Radar team member who is focused on it.

Topics like In-Memory Data Management, Consumer AI, and Future of Programming look really interesting, but I’m curious to see how Ben Lorica and Nat Torkington tie in.


Scalable Message Queueing

At work I started digging into ActiveMQ, mostly because I wanted to whale on some folks who I thought had an unhealthy devotion to some mediocre, unscalable middleware. Then I stumbled upon the ActiveMQ Apollo project:

ActiveMQ Apollo is a faster, more reliable, easier to maintain messaging broker built from the foundations of the original ActiveMQ. It accomplishes this using a radically different threading and message dispatching architecture. Like ActiveMQ, Apollo is a multi-protocol broker and supports STOMP, Openwire, MQTT, SSL, and WebSockets.

If Apollo’s benchmarks prove out in actual deployment, holy smokes!! And the 1.x release implies a certain amount of stability to boot. I would like to know how it scales with a high number of concurrent subscribers though.

And ActiveMQ isn’t all that mediocre although it has some clear scaling limitations.


Slow Data

Stephen Few is one of those useful contrarians that actually proposes serious alternatives in addition to eloquently railing against the status quo. I actually believe there are some interesting technical trends and advances (e.g. newly accessible distributed programming) underlying the buzzword compliant “Big Data” movement. But Few’s Slow Data Movement is worth considering:

I’d like to introduce a set of goals that should sit alongside the 3Vs to keep us on course as we struggle to enter the information age—an era that remains elusive. May I present to you the 3Ss: small, slow, and sure.


Free Data Science Books

Carl Anderson pulls together a nice list of links to quality textbooks online regarding information retrieval and machine learning:

I’ve been impressed in recent months by the number and quality of free datascience/machine learning books available online. I don’t mean free as in some guy paid for a PDF version of an O’Reilly book and then posted it online for others to use/steal, but I mean genuine published books with a free online version sanctioned by the publisher. That is, “the publisher has graciously agreed to allow a full, free version of my book to be available on this site.”

Here are a few in my collection:

Of course, be sure to help the authors out by picking up a dead trees version or two. Donate to some school in the developing world, or even the developed world, if you don’t really need the paper.


Where’s Mine?

@djmarkfarina urban coyotes could of been MJ8 :) — sick mix Mark! everyone should own. — http://t.co/BfZuHqPT #house http://t.co/UjenYV8p

Patiently awaiting my next Mark Farina mix CD. Sort of, grumble.


TIL pythonbrew

Today I learned of pythonbrew:

[embed]https://twitter.com/pypi/status/286494923291770882[/embed]

pythonbrew is a program to automate the building and installation of Python in the user’s $HOME

How did I, of all people, not know of pythonbrew?!


Scala Vice Python

Jean Yang brings an interesting perspective on why one might want to learn Scala:

Scala is also relatively easy to learn, especially compared to other strongly statically typed languages (OCaml; Haskell). I have previously said that Python is the most marketable language for beginners, but I am beginning to change my mind. Like Python, Scala has clean syntax, nice libraries, good online documentation, and lots of people in industry using it. Unlike Python, Scala also has a static type system that can prevent you from doing bad things (whereas in Python the bad things just happen when you run the program).

You could say I’m somewhat biased towards PhDs and MIT folks though, as they do tend to take their programming languages seriously.

Cribbed from the TypeSafe Blog.


@bigdata’s 2012 Toolset

[embed]https://twitter.com/bigdata/status/286161333005717504[/embed]

I’m slowly following in Ben Lorica’s big data footsteps. On his recommendation I’ve decided to jump into Scala, starting with Odersky et al.’s Programming in Scala. Like what I’m seeing so far. Functional programming, with a somewhat Pythonic sensibility, and performant exploitation of the JVM.


Was It Achieved?

Before I generate a set of resolutions for 2013, I thought it would be worthwhile to check in on the results of last year’s goals:

  • Improve Physical Health. Wash. Lost some weight, pretty close to target, but didn’t really incorporate exercise. Improved the diet, but failed to play any ultimate.
  • Maintain Financial Health. Achieved. Hit all targets.
  • Shine The Skillz. Wash. Did not get into any real programmer events, but work offered a lot more hacking opportunity than I anticipated.
  • Expand The Network. Fail. I joined the MIT DC Alumni Club, but haven’t had a chance to attend any events. Missed on everything else.
  • Get Rid of Stuff. Complete Fail.
  • Build A Tribe. I’ll call it a fail, although the jury’s still out, pending my performance reviews at work. Even without that feedback, I’m still disappointed I haven’t been able to energize a really high performing group of people, with a good sense of camaraderie.
  • Less Watch, More Do. Achieved. Took a sabbatical from college football to good effect. Also severely cut down on the channel surfing.

A couple of wins, a couple of washes, and three fails, so an overall “enh” from me. These were intended to be stretch goals so I can’t complain too much.

Given a few other things that happened in my life in 2012, I’ll happily bid goodbye to the annum, but hold back on the good riddance.


40 Posts Per

Interesting. Despite being a pretty consistent poster over the last 15 months or so, I’ve never had a 40 post month. New goal for 2013.


Boinging the Common Crawl

Here’s an interesting hack I really don’t have time to execute on. Take a BoingBoing data dump and dissect either references to BoingBoing pages or outlinks from the archive using the Common Crawl dataset. Might be some interesting intersections, or you could trace out the patterns of BoingBoing influence.

Tangential side project: build up a set of mrjob modules to work with either dataset. The BoingBoing stuff comes in one big file, so it might be useful for someone else to bust it up into smaller units and stuff them onto Amazon S3, or otherwise make them publicly available.


Wayne Enterprises Chronicles: Playoffs

DEFEAT!! After a great run at the end of the season, I got hit with one of those vexing fantasy defeats to flush me out of title contention. Appropriate on this final weekend of the NFL regular season that I document my collapse.

On the one hand, I got beat by a sizeable margin, 122.7 to 85.3. So it would appear my roster choices really wouldn’t have made a difference. On the other hand, consider this:

  • I listened to the fantasy “experts” and at the last minute substituted the red hot Danario Alexander instead of my usual Eric Decker. Alexander goose egg, Decker 23 fantasy points, me as GM -23.
  • On the same expert advice I played the St. Louis Rams DEF for a grand total of 1 point. I could have easily played one or two other units for 6-9 more points.
  • I stuck with Lawrence Tynes as my kicker, admittedly with no encouragement, for the only game where the Giants got shut out. That means Tynes got 0 points too.
  • Arian Foster came up short of his projection by 8 points.
  • I pooh poohed a roster move by my opponent benching Torrey Smith and playing Brandon Lloyd. +23 for him.
  • The opposition survived a 0.1 performance from his QB position.

So as I float into the fantasy off-season, I’ll console myself with the thought that my optimal lineup wouldn’t have won. But three self-inflicted zeros leaves a quite bitter taste. If I had added 30 points who knows what could have happened.


PgSQL Bulk Ingest of JSON

After months of sometimes painful experimentation here’s a technique others may find useful. This is for those who need to ingest millions of JSON objects, collected from the wild, stored as line ordered text files, into a PostgreSQL table.

  1. Transform your JSON lines into base64-encoded lines
  2. Start a transaction
  3. Create a temp table that deletes on commit
  4. COPY from your hex encoded file into the temp table
  5. Run an INSERT into your destination table that draws from a SELECT that base64 decodes the JSON column
  6. COMMIT, dropping the temp table
  7. Profit!!
  8. For bonus points, use the temp table to delete duplicate records or other cleanup
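The steps above can be sketched as follows. Step 1 is real code; steps 2-6 are shown as a SQL string for shape only, and the table name `tweets`, its `body` column, and the staging file path are hypothetical stand-ins:

```python
import base64
import json

def encode_lines(raw_json_lines):
    """Step 1: base64-encode each raw JSON line so COPY never sees
    embedded tabs, newlines, quotes, or backslashes."""
    return [base64.b64encode(line.encode("utf-8")).decode("ascii")
            for line in raw_json_lines]

# Steps 2-6, run as one transaction (e.g. via psycopg2). The COPY is
# commented out since the file path is illustrative.
INGEST_SQL = """
BEGIN;
CREATE TEMP TABLE staging (payload text) ON COMMIT DROP;
-- COPY staging FROM '/path/to/encoded.txt';
INSERT INTO tweets (body)
  SELECT convert_from(decode(payload, 'base64'), 'UTF8') FROM staging;
COMMIT;
"""

# The encoded form round-trips and stays inside base64's safe alphabet.
nasty = json.dumps({"text": "tabs, quotes, and \\ backslashes"})
encoded, = encode_lines([nasty])
assert not set(encoded) & set("\t\n\"\\")
assert base64.b64decode(encoded).decode("utf-8") == nasty
```

PostgreSQL’s `decode(payload, 'base64')` yields a bytea, and `convert_from(..., 'UTF8')` turns it back into text, so the decode happens entirely server side.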

The benefit of this approach is that it saves you from dealing with escaping and encoding horrors. Basically, JSON can contain any character so it’s almost impossible to safely outsmart stupid users and PgSQL strictness. Once you’re over a million records, you’ll get bit by one or the other.

Reliably applied to over 300 million Tweets from the Twitter streaming API.

Update. If you’re doing this with Python make sure to use psycopg2 since you can control the transaction isolation and it supports COPY.


For Budding Data Scientists

Hilary Mason has some advice:

I get quite a few e-mail messages from very smart people who are looking to get started in data science. Here’s what I usually tell them:

The best way to get started in data science is to DO data science!

First, data scientists do three fundamentally different things: math, code (and engineer systems), and communicate. Figure out which one of these you’re weakest at, and do a project that enhances your capabilities. Then figure out which one of these you’re best at, and pick a project which shows off your abilities.

Good input, although I’m still somewhat partial to Sean Taylor’s “Ask, Answer, Tell” for what data scientists do. I think coding is important, but while engineering systems can be scientific, the practice is much more pragmatic and presents quite different challenges from scientific exploration. Probably a minor quibble, but engineering is quite different from science.

Still, getting your hands dirty is better than noodling in the abstract on the sidelines.


Scala and IntelliJ

If I’m going to get serious about learning Scala, I’m going to need a serious development environment. My default would be to figure something out with Emacs, but it’s time to check out the modern kit.

Jeethu Rao documented getting IntelliJ IDEA set up for Scala:

The class used the Typesafe Scala IDE, which is built on Eclipse. I’ve been using JetBrains PyCharm for most of my python work and when they had IntelliJ 12 Ultimate Edition on sale at a 75% discount last week, I bought it. Last night, I spent a bit of time setting up IntelliJ to work with Scala on my Retina MBP. I’m posting this as a future reference.

Hope IDEA is a bit lighter weight than Eclipse, whose enormity has always confounded me.


DBPedia and SQLite

DBpedia has always felt like a great resource to me, but I’ve never taken the time to play with the data, being somewhat daunted by RDF and triplestores.

Jean-Louis Fuchs has done yeoman’s work in creating a repository of DBpedia infoboxes in SQLite and releasing his import script. May have to grab and experiment with.

When I learned about DBpedia, I wanted to have it installed locally. I read the tutorials on SPARQL and how to install DBpedia and thought: that has to be simpler. I kind of worship Simple and I didn’t want to learn yet another query language. So I hacked an SQLite import script for DBpedia types and infobox-properties in one evening. The script takes about 50 hours to read instance_types_en.nt and infobox_properties_en.nt. It creates a table per type and a table per property; after the tables are created everything gets an index, the whole database is analyzed and finally vacuumed.

An aside thought, wonder if there’s a straightforward way to take DBpedia data and stuff it into a full text indexing system like ElasticSearch? You’d give up the reasoning properties of a SPARQL based triplestore, but would have quick and dirty access to a significant semi-structured knowledge repository.
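The table-per-property layout Fuchs describes is easy to mimic with Python’s stdlib sqlite3 module. A toy sketch, with the property name and data invented for illustration:

```python
import sqlite3

# One table per infobox property, mapping subject -> value, as in
# Fuchs's import script. The property and row below are made up.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE prop_population (subject TEXT, value TEXT)")
cur.execute("INSERT INTO prop_population VALUES ('Berlin', '3500000')")

# Index after the bulk load, then ANALYZE, following the script's order.
cur.execute("CREATE INDEX idx_pop_subject ON prop_population (subject)")
cur.execute("ANALYZE")

cur.execute("SELECT value FROM prop_population WHERE subject = 'Berlin'")
print(cur.fetchone()[0])  # -> 3500000
```

Indexing after the load rather than before is what keeps the 50-hour import from being even longer, since SQLite doesn’t have to update the index on every insert.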


2 mrjob Thoughts

I’ve been putting mrjob to the task quite a bit recently. Two quick thoughts on a highly useful package.

First, Map/Reduce is a fairly handy data processing framework even if you don’t need a ton of scalability. Any complex UNIX text file processing involving transforms, filters, and aggregations might be more easily expressed as a Map/Reduce computation. mrjob makes that quite simple to do on a single machine at a high level.

Second, mrjob supports the bundling and uploading of custom code and data to support the job. I’ve mainly been using it to add Python extension modules, but there’s no reason one couldn’t include supplemental data in a binary form, like a Python pickle, or an SQLite DB. Very handy for map side augmentation or filtering presuming the data set isn’t too large.
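The map-side augmentation pattern looks roughly like this in plain Python (the lookup table, file name, and mapper are invented; mrjob’s role would just be shipping the pickle file alongside each task):

```python
import os
import pickle
import tempfile

# Sketch of map-side augmentation: bundle a small pickled lookup table
# with the job, load it once per task, and enrich records in the
# mapper. The airport-code data here is made up for illustration.

lookup = {"NYC": "New York", "SFO": "San Francisco"}
path = os.path.join(tempfile.mkdtemp(), "lookup.pkl")
with open(path, "wb") as fh:
    pickle.dump(lookup, fh)

def make_mapper(lookup_path):
    # Load the side data once, not once per record.
    with open(lookup_path, "rb") as fh:
        table = pickle.load(fh)
    def mapper(record):
        code, count = record
        yield table.get(code, code), count
    return mapper

mapper = make_mapper(path)
print(list(mapper(("NYC", 7))))  # -> [('New York', 7)]
```

The same shape works for filtering: have the mapper drop records whose key isn’t in the side table.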


Mr. Potter’s a Real Ass

Yeah, I knew he was a top-N villain from It’s A Wonderful Life but jeez, Potter knew Uncle Billy lost that money, knew George Bailey was covering for him, and still had to kick poor George while he was down. Jerk!

Greatest Christmas movie ever. If you haven’t had a shitty day like George Bailey, you’re not human. And hopefully you made it over the hump.

Cheers!!


D3.0

Be it resolved for 2013, learn and apply D3.js, the recent 3.0 release taking the JavaScript visualization library to 11:

In addition to all of the above, D3 includes a handful of bug fixes, performance improvements, and minor new features. A new d3.shuffle method provides the Fisher–Yates shuffle. The d3.format class now supports align and fill, and both d3.format and d3.time.format support POSIX localization at build time. The d3.layout.treemap now supports the “slice-and-dice” algorithm. Lastly, all of the D3 examples have been moved to bl.ocks.org and GitHub Gist for easier browsing & forking.


Original Content

As the year winds down I’ve been looking at some of my older MPR content. One noticeable difference is the amount of original content. Definitely has decreased over time. There’s been a discernible increase in embeds and pull quotes, probably correlating to increased posting from iOS devices.

Something to think about going into next year. On the one hand quoting and linking makes it easier to post consistently. On the other, I’d really rather not publish a link blog although I could be accused of doing that already.


Bamboo

Looking for a good time series data storage and analysis service. Maybe it’s bamboo: [embed]https://twitter.com/pypi/status/282975117741527040[/embed]

bamboo is an application that systematizes realtime data analysis. bamboo provides an interface for merging, aggregating and adding algebraic calculations to dynamic datasets. Clients can interact with bamboo through a REST web interface and through Python.

bamboo supports a simple querying language to build calculations (e.g. student teacher ratio) and aggregations (e.g. average number of students per district) from datasets. These are updated as new data is received.


UltraJSON

Need to check out UltraJSON at work:

UltraJSON is an ultra fast JSON encoder and decoder written in pure C with bindings for Python 2.5+ and 3.
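Since ujson mirrors the stdlib json module’s basic dumps/loads API, a guarded import makes it painless to try, falling back to stdlib json when it isn’t installed:

```python
# Try the fast C implementation, fall back to the stdlib. Only the
# common dumps/loads subset is assumed here.
try:
    import ujson as fast_json
except ImportError:
    import json as fast_json

doc = {"id": 42, "tags": ["a", "b"]}
blob = fast_json.dumps(doc)
assert fast_json.loads(blob) == doc
```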


The Next MF Mix CD

[embed]https://twitter.com/djmarkfarina/status/281905691092983808[/embed]

URBAN COYOTES - Mixed By Mark Farina - Digi-Pak CD

PRE-ORDER Your copy of Mark Farina’s latest CD featuring 15 tracks all from the Coyote Cuts imprint mixed by the man himself with all the fixins. Released as a 4 panel digi-pak with art by Aaron Miller and Josh Adams. CD’s will begin to ship around the last week of December.

Done!


Spark Deployment

Wow! I didn’t realize how easy it was to run a Spark/Mesos cluster on Amazon EC2:

The spark-ec2 script, located in Spark’s ec2 directory, allows you to launch, manage and shut down Spark clusters on Amazon EC2. It automatically sets up Mesos, Spark and HDFS on the cluster for you. This guide describes how to use spark-ec2 to launch clusters, how to run jobs on them, and how to shut them down. It assumes you’ve already signed up for an EC2 account on the Amazon Web Services site.

Might be fun to just spin one up and run against some Common Crawl data.


IPython Tutorials

Jean-Louis Fuchs has collected some essential IPython videos:

These are videos you have to see, if you don’t know ipython yet! I’ve worked as a programmer for more than 15 years and I thought I’ve seen it by now, but ipython got me as excited as I was when I dialed into a BBS first time… it bursts your horizon.


A Python Distutils Tip

Parkin’ this in case it helps someone else, including me in the future. Oddly difficult to find documentation on the feature, the closest being an aside on distutils config files.

When building a Python package that has a C extension you can build the extension separately, using the build_ext command. This allows you to control the include directories, link directories, and link libraries. Handy for those ornery libraries that you just can’t install in /usr/lib or thereabouts. I recently needed it for geos and Shapely.
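Per that aside on distutils config files, the same options can live in a setup.cfg next to setup.py, so you don’t have to retype them on every build. A hedged sketch, with the geos paths as examples rather than anything canonical:

```ini
# setup.cfg -- read automatically by `python setup.py build_ext`
[build_ext]
include_dirs = /opt/geos/include
library_dirs = /opt/geos/lib
libraries = geos_c
```

With this in place, a plain `python setup.py build_ext` followed by `python setup.py install` should pick up the custom locations.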


Tail Recursive Acronym

I don’t know if the games are any good, but I’ll give Michael Cook a +1 for a moby hack. Growing games is sort of cool.

But +11 for a tail-recursive acronym:

On Friday, however, Cook released a game called A Puzzling Present which shows off much of the hard work Cook’s put into Angelina (a tail-recursive acronym for “A Novel Game-Evolving Labrat I’ve Named ANGELINA”). Instead of a simple platformer with maps and obstacles, the latest games by Angelina now include new mechanics for the player (in this particular seasonal game, the character is Santa Claus) at each level.


ICWSM 2013

The call for papers for ICWSM, the International Conference on Weblogs and Social Media, 2013 is out. I’ve been a reviewer and attendee in the past and may actually be able to submit something this year. Yeah!!

But, according to Danah Boyd there might have been a bit of a botch:

In the end, I found that there was such extensive miscommunication that the gulf between what I had stated upfront as being key to me running the conference was miles away from what AAAI would find acceptable. Both of us felt as though we were contorting ourselves to make this work and it left both of us very bitter and unable to work with one another. I decided that I could not run a conference that wasn’t accessible to many of the core parts of my research community just to please other parts of the research community. I felt trapped and realized that it would be better to walk away than to let go of my principles.

Boooo!!


Wayne Enterprises Chronicles: Week 14

Victory!! Yours truly ends the season on a five game winning streak and takes the regular season title on point differential. Onwards to the Fantasy Playoffs!!

The matchup was unnecessarily tense in that I started RGIII (17.47 points) over Cam Newton (39.17 points). Given the weather, rainy, and opponent, Ravens, I was seriously leaning towards benching Griffin. Instead I went with the “don’t get cute” mantra. Similarly, I went with an oversold Greg Jennings instead of reliable Randall Cobb.

Thankfully, I made a great waiver wire pickup in Knowshon Moreno for 24 points, and no one else underperformed. Arian Foster collected just enough garbage time points to get me over the hump. A five point squeaker that had me sweating into the second half of Monday Night Football.

Feeling good about my playoff chances although my opponent has some scary explosive players in Josh Freeman (hot, good matchup), Torrey Smith (overdue), Calvin Johnson (he’s Megatron), and Adrian Peterson (’nuff said). I’m guessing I’ll need someone to go way over projection to pull this one out.


Verizon vs AT&T

Criminy, I know it’s the Verizon Center, but there’s no need to jam the AT&T cells!!

Or at least that’s what it feels like whenever I come to a Wizards game. All I want is to check some stats, lookup random trivia, and pen the occasional blog post. Is that too much to ask? #firstworldproblems


IPython Funding

[embed]https://twitter.com/IPythonDev/status/279392024866721794[/embed]

Cool! Hoping the IPython notebook gets even more useful and interesting.


PyPgDay

I like Python. I like PostgreSQL. I like PyCon. I should like PyPgDay.

In coordination with the Postgres community and PyCon US 2013, PyPgDay is occurring in Santa Clara on March 13 from 9am-6pm.

PyPgDay will be a full-day event with seven talks about PostgreSQL and Python, including talks by contributors to PostgreSQL, Django, PostGIS, and Python. Half the talks will help PostgreSQL DBAs, and the other half will focus on developing Python applications using Postgres features.


Crawley

Link parkin’: Crawley Project

Crawley is a Pythonic crawling/scraping framework intended to change the way you think about extracting data from the internet

It’s never actually materialized, but I’m still hoping that user grade, commodity cost, focused crawlers become a reality. Seems like the tech has caught up with the concept. Maybe something like Crawley plus Python’s copious machine learning and data mining toolkits could provide the foundation.

I believe not actually telling folks you’re exploiting focused crawling is one of the key tricks.


Good Luck Doug!

I’m not in the “everybody should code” cargo cult, but a bit of knowledgeable advocacy for one’s discipline is always welcome. Looks like Douglas Rushkoff is getting a chance to do some stumping for Computer Science education:

This week is Computer Science Education Week, which is being observed around the United States with events aimed at highlighting the promise — and paucity — of digital education. The climax of the festivities, for me anyway, will be the opportunity to address members of Congress and their staffers on Wednesday in Washington about the value of digital literacy. I’ve been an advocate of digital culture for the past 20 years, and this feels like the culmination of a lifetime of arguing.

Knock ’em dead!

Memo to self, read Program or Be Programmed over the holidays.


graph-tool

Link parkin’: graph-tool

graph-tool is an efficient python module for manipulation and statistical analysis of graphs (a.k.a. networks). Contrary to most other python modules with similar functionality, the core data structures and algorithms are implemented in C++, making extensive use of template metaprogramming, based heavily on the Boost Graph Library. This confers a level of performance which is comparable (both in memory usage and computation time) to that of a pure C/C++ library.

The Boost Graph Library gives me the willies though. I’ve never had much luck building or using it embedded within Python. Weird linking or dynamic library loading issues got me. At a 2.x release though, this module might be worth a test drive.

Via @hmason [embed]https://twitter.com/hmason/status/277458758127452160[/embed]

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.