Real-Time and Big Data

[embed]https://twitter.com/lintool/statuses/263699888842346496[/embed]

Preprint paper from the guys at Twitter: “Fast Data in the Era of Big Data: Twitter’s Real-Time Related Query Suggestion Architecture”

We present the architecture behind Twitter’s real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time “twist”: after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of “big data”.

Via @lintool


C’Ya Sandy

Bye bye Hurricane Sandy! You got our little development here in Leesburg for two hours of power outage, and our townhouse for a little roof leakage, but otherwise you weren’t too bad. You didn’t even get after my Linode up in the Newark datacenter.

I was highly impressed with your wind gusts though.


A Bad Way To Go Out

I know I’m a bit late to the party, but poor Ozzie Guillen got let go in what may be an MLB first. The Marlins posted his firing on Twitter: [embed]https://twitter.com/Marlins/status/260829731627347968[/embed]

Yikes! Amongst Chicagoans, there will always be a soft spot for Oz though, thanks to that 2005 World Series. But I had a feeling it would end badly, from the day he signed on.

P.S. More evidence that it’s all just media now. No need to use that anachronistic “social media”.


Basic Common Crawl Processing

Pavel Repin copiously documents his initial foray into processing the Common Crawl data set:

At my company, we are building infrastructure that enables us to perform computations involving large bodies of text data.

To get familiar with the tech involved, I started with a simple experiment: using Common Crawl metadata corpus, count crawled URLs grouped by top level domain (TLD).

It’s not a very exciting query. To be blunt, it’s a pretty boring one. But that’s the point: this is a new ground for me, so I start simple, and capture what I’ve learned.

This gist is a fully-fledged Git repo with all the code necessary to run this query, so if you want to play with the code yourself, go ahead clone this thing.

Via Pete Warden
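For a feel of what the query computes, here is a minimal single-machine sketch of "count URLs grouped by TLD". It is not Repin's code and it does not read the Common Crawl metadata format; the function names and sample URLs are made up for illustration, and "TLD" is simplified to the last dot-separated label of the host.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_level_domain(url):
    """Return the last dot-separated label of the URL's host, or None."""
    host = urlsplit(url).hostname  # None when the URL has no network location
    if not host:
        return None
    host = host.rstrip(".")
    if "." not in host:
        return None
    return host.rsplit(".", 1)[-1]


def count_by_tld(urls):
    """Count URLs grouped by top level domain, skipping unparseable ones."""
    counts = Counter()
    for url in urls:
        tld = top_level_domain(url)
        if tld is not None:
            counts[tld] += 1
    return counts


# Hypothetical sample input standing in for the crawled URL list
sample_urls = [
    "http://example.com/a",
    "https://www.example.org/b?x=1",
    "http://news.example.com:8080/c",
    "http://example.co.uk/d",
    "not a url",
]
tld_counts = count_by_tld(sample_urls)
```

Note the simplification: a bare IP address would be counted under its last octet, so a real job would want to filter those out first.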


RESTful or Restless?

In my REST API expeditions at work, I’ve been using Flask-Restless. Now, via Python Weekly, I find out about Flask-RESTful. Normally I’d just scan and move on, but RESTful is from folks at Twilio and may have a bit more polish. To wit:

While Flask provides easy access to request data (i.e. querystring or POST form encoded data), it’s still a pain to validate form data. Flask-RESTful has built-in support for request data validation using a library similar to argparse.

The only hitch I see is no examples of connecting with ORM based models, admittedly after only 10 minutes with the docs. Restless actually handles this use case pretty well.

Alternative approaches are always good to know about.

Also have to say thumbs up on Python Weekly. Once a week to my Inbox, an easy read, at least one good link.


jq JSON Processing

Link parkin’: jq: a lightweight and flexible command-line JSON processor.

jq is like sed for JSON data - you can use it to slice and filter and map and transform structured data with the same ease that sed, awk, grep and friends let you play with text


bitmapist

Link parkin’: bitmapist

[embed]https://twitter.com/pypi/statuses/261516455638597632[/embed]

Implements a powerful analytics library using Redis bitmaps.

This library makes it possible to implement real-time, highly scalable analytics that can answer following questions:

  • Has user 123 been online today? This week? This month?
  • Has user 123 performed action “X”?
  • How many users have been active this month? This hour?
  • How many unique users have performed action “X” this week?
  • How many % of users that were active last week are still active?
  • How many % of users that were active last month are still active this month?
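
To make the bitmap idea concrete, here is a small in-memory sketch. It is not bitmapist's API and it does not talk to Redis: plain Python integers stand in for Redis bitmaps, and the class name, method names, and key names are all made up for illustration. The comments note which Redis command each operation corresponds to.

```python
class BitmapEvents:
    """In-memory stand-in for Redis bitmaps: one integer bitset per event key."""

    def __init__(self):
        self._bitmaps = {}  # key -> int, where bit N set means user N did it

    def mark(self, key, user_id):
        """Set the bit at offset user_id (the job SETBIT does in Redis)."""
        self._bitmaps[key] = self._bitmaps.get(key, 0) | (1 << user_id)

    def contains(self, key, user_id):
        """Test a single bit (the job GETBIT does)."""
        return bool((self._bitmaps.get(key, 0) >> user_id) & 1)

    def count(self, key):
        """Count set bits, i.e. unique users (the job BITCOUNT does)."""
        return bin(self._bitmaps.get(key, 0)).count("1")

    def count_both(self, key_a, key_b):
        """Count users present in both bitmaps (BITOP AND, then BITCOUNT)."""
        both = self._bitmaps.get(key_a, 0) & self._bitmaps.get(key_b, 0)
        return bin(both).count("1")


events = BitmapEvents()
for uid in (1, 2, 123):
    events.mark("active:last-week", uid)
for uid in (2, 123, 500):
    events.mark("active:this-week", uid)

was_active = events.contains("active:this-week", 123)
retained = events.count_both("active:last-week", "active:this-week")
retention_pct = 100.0 * retained / events.count("active:last-week")
```

Each question in the list boils down to a single bit test, a bit count, or an AND of two bitmaps followed by a count, which is why the approach stays cheap as the user base grows.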

Wayne Enterprises Chronicles: Week 7

Victory!! It was a trouncing, with an 80+ point margin of victory. The other guy mailed it in, starting three players that were on bye this week. Then he got nothing out of his tight end, leading to a grand total of 31 points.

Not much to dig into, but I’m glad to break the losing streak.


Mac Tweetbot

The Tapbots finally released their official for-pay version of Tweetbot for the Mac. The $20 price, driven by some of Twitter’s token limiting practices, induced some sticker shock. However, I didn’t think long before pulling the trigger.

Lex Friedman has a pretty good review of Tweetbot explaining why it’s worth the price:

Of course, this is a review of the excellent new Tweetbot for Mac (Mac App Store link), not a review of Twitter’s business practices. But I bring up the latter here because one of the effects of Twitter’s new developer restrictions—specifically, the finite limit on how many users a given Twitter client can support—is that developer Tapbots is charging more for the new app than originally planned. Specifically, Tweetbot for Mac will cost you $20, at a time when many similar apps can be found for $10 or less. Which means that for many readers, the question isn’t just whether Tweetbot is good, but whether it’s worth the price.

My answer: Yes.

Like Friedman, I’ve sprung for Tweetbot on MacOS, iPad, and iPhone.


Stoked and Chapped

Forgive me Saint Steven for I have sinned. As a recently realized Apple fanboy I failed to tune in for today’s announcements. My employer sent me to occupy a lonely (but very shiny!) booth in the bowels of a second rate hotel with an exceedingly lackluster conference. But I should have found A Way!

The only in-depth coverage I have been able to read so far is Marco Arment’s assorted thoughts. They have me stoked:

Prior to this, the computer I recommended for nearly everyone was the 13” MacBook Air. But the new 13” Retina MacBook Pro is only about 0.6 pounds heavier, has much higher CPU, RAM, and storage options, and has the much nicer Retina screen. It commands a premium of about $500, which is significant, but you get much more for it.

That’s pretty much the machine I’ve been fiending for for the longest time. I’ll nitnoid about not being able to amp the DRAM to 16 GB, but that’s the only thing I could quibble about. However, between lack of funds, and dampened laptop usage due to life and other gear, now I may just hang on to Ye Olde MacBook indefinitely. Or at least there is clear provocation.

But I’m simultaneously chapped:

The timing of the update — just 6 months after the iPad 3, instead of the usual year — will anger a lot of iPad 3 owners. But the previous March releases of the iPad 2 and 3 were more problematic.

You mean The New Toy is already not the new hotness? Ya killin’ me.

At least I have my new Jesusphone 5 to keep me warm at night.


Mushroom Jazz Mixtapes

Just do as the man says.

[embed]https://twitter.com/djmarkfarina/statuses/260860717647921153[/embed]

[embed]https://twitter.com/OmRecords/statuses/260850819618979840[/embed]

Four classic Mushroom Jazz recordings, for free. What’s not to like?


Now With More Sparks

@bigdata continues to sing the praises of Spark.

[embed]https://twitter.com/bigdata/statuses/259710022173466625[/embed]

In an earlier post I listed a few reasons why I’ve come to embrace and use Spark. In particular I described why Spark is well-suited for many distributed Big Data Analytics tasks such as iterative computations and interactive queries, where it outperforms Hadoop. With version 0.6, Spark becomes even faster and easier to use. The release notes contain all the detailed changes, but as you’ll see from the highlights below, version 0.6 is a substantial release. Another good sign is the growth in number of contributors, with now over a third of the developers coming from outside the core team in Berkeley.


Wayne Enterprises Chronicles: Week 6

Defeat. Sigh. This three game losing streak is getting me down. There’s really not much analysis to be done this week. RGIII kept me in the game with a big day, but when four out of eight spots are subpar, pulling out a win is difficult. At least my guys took it into the Monday night game where it basically boiled down to Eric Decker on my side versus Antonio Gates for the opponent. I don’t know what the Broncos were thinking, but they weren’t thinking about guarding Antonio Gates.

Bye week hell this round, but I’m pretty decently set. And the opponent is in a tough spot, with three of his key players on the Falcons who are off this week.


Discogs Doubleplus Good

Partially because it wasn’t obvious to me where to find it, but mostly because it’s cool:

Download Discogs Data

Here you will find monthly dumps of Discogs Release, Artist, and Label data. The data is in XML format and formatted according to the API spec: http://www.discogs.com/developers/

This data is released under the Public Domain license: http://creativecommons.org/licenses/publicdomain/

And funny how we just have to rediscover some things.


Python API Building

Nice 4 step tutorial on building a RESTful API using Python, by K. P. Kaiser. The best tip was a pointer to elasticutils, an ElasticSearch library:

It’s always a good idea to see what your options are in a library. Initially, when I was building this integration I saw pyes, a very well written library, but the code to use it seemed a bit ugly for my tastes.

Luckily, after a bit more searching, I found elasticutils, which is, in my opinion, a much cleaner interface to the very simple elasticsearch server. It always pays to take a few minutes to read the introduction, and example code before deciding on a library. Elasticutils actually uses pyes under the covers.


FDWs FTW

Craig Kerstiens on putting Redis in Postgres

SQL is an expressive language, though people are often okay with accessing Mongo data through its own ORM. The real value is that you could actually query the data from within Postgres then join across your data stores, without having to do some ETL process to move data around.

From experience, here be dragons, but used judiciously Postgres’ Foreign Data Wrappers are a great feature.


Summarizing Streams

[embed]https://twitter.com/bigdata/statuses/258963194163380224[/embed]

From A Framework for Summarizing and Analyzing Twitter Feeds (PDF alert)

In this paper, we present a dynamic pattern driven approach to summarize data produced by Twitter feeds. We develop a novel approach to maintain an in-memory summary while retaining sufficient information to facilitate a range of user-specific and topic-specific temporal analytics. We empirically compare our approach with several state-of-the-art pattern summarization approaches along the axes of storage cost, query accuracy, query flexibility, and efficiency using real data from Twitter. We find that the proposed approach is not only scalable but also outperforms existing approaches by a large margin.

My quick half-ass summary:

  1. use frequent item set mining, to come up with a code book
  2. recode your content with the code book
  3. compress, which exploits the redundancy uncovered by the coding
  4. profit!!

But I need to read the paper more closely. And the real-time summarization and topic tracking aspects seem really cool.
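
Here is a toy sketch of steps 1 through 3 of my summary, to pin down what I mean. It is emphatically not the paper's algorithm: the "item sets" are limited to word pairs, the recoding is greedy, word order inside a tweet is thrown away, and every function name and sample tweet is made up for illustration. The general-purpose zlib compressor stands in for step 3.

```python
import zlib
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(transactions, min_support):
    """Step 1: build a code book from item pairs seen at least min_support times."""
    pair_counts = Counter()
    for items in transactions:
        pair_counts.update(combinations(sorted(set(items)), 2))
    frequent = [(pair, n) for pair, n in pair_counts.items() if n >= min_support]
    # Most frequent first; ties broken by the pair itself so the result is stable
    frequent.sort(key=lambda pn: (-pn[1], pn[0]))
    return {pair: "#%d" % i for i, (pair, _) in enumerate(frequent)}


def recode(items, code_book):
    """Step 2: greedily swap each code-book pair found in the items for its code."""
    remaining = set(items)
    out = []
    for pair, code in code_book.items():  # dict order is most frequent first
        if remaining.issuperset(pair):
            out.append(code)
            remaining.difference_update(pair)
    return out + sorted(remaining)


def compress(transactions):
    """Step 3: serialize the recoded data and hand it to zlib."""
    text = "\n".join(" ".join(t) for t in transactions)
    return zlib.compress(text.encode("utf-8"))


# Hypothetical tweets, already tokenized
tweets = [
    ["hurricane", "sandy", "power", "outage"],
    ["hurricane", "sandy", "newark"],
    ["hurricane", "sandy", "wind"],
    ["power", "outage", "leesburg"],
]
code_book = mine_frequent_pairs(tweets, min_support=2)
recoded = [recode(t, code_book) for t in tweets]
blob = compress(recoded)
```

With this input the code book ends up with two entries, one for ("hurricane", "sandy") and one for ("outage", "power"). A real version would need codes that cannot collide with actual tokens (a tweet containing the literal "#0" would break this one) and would mine larger item sets than pairs.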


Discogs oEmbed

Since I’ve been heavily using the WordPress embed feature for tweets, I had a thought that it would be nice if you could do the same thing for referencing musical releases in Discogs.com. But alas, they seem not to support oEmbed.

However Discogs does seem to have a nice client API, with a supporting Python module to boot. Seems like a Discogs oEmbed proxy server might be a nice self-contained hacking project.
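
A rough sketch of the response-building half of such a proxy, under stated assumptions. The oEmbed fields used here ("version", "type", "html", "width", "height") come from the oEmbed spec's "rich" response type. Everything else is hypothetical: the URL pattern is my guess at how release pages are addressed, the `release` dict is a stand-in for whatever the Discogs API actually returns (no API call is made), and the HTML snippet and fixed height are placeholders.

```python
import html
import json
import re

# Assumption: release page URLs carry the numeric id right after "/release/"
RELEASE_URL = re.compile(r"^https?://(?:www\.)?discogs\.com/(?:.*/)?release/(\d+)")


def release_id_from_url(url):
    """Pull the release id out of a page URL, or return None if it doesn't match."""
    match = RELEASE_URL.match(url)
    return int(match.group(1)) if match else None


def oembed_response(url, release, maxwidth=400):
    """Build an oEmbed 'rich' response for one release.

    `release` is a hypothetical dict with 'artist' and 'title' keys; a real
    proxy would fill it in from the Discogs API.
    """
    label = "%s - %s" % (release["artist"], release["title"])
    snippet = '<blockquote class="discogs-release"><a href="%s">%s</a></blockquote>' % (
        html.escape(url, quote=True),
        html.escape(label),
    )
    return {
        "version": "1.0",  # required on every oEmbed response
        "type": "rich",    # rich responses must also carry html, width, height
        "provider_name": "Discogs",
        "provider_url": "http://www.discogs.com/",
        "title": label,
        "html": snippet,
        "width": maxwidth,
        "height": 100,     # placeholder; depends on the real markup
    }


page = "http://www.discogs.com/Deee-Lite-World-Clique/release/12345"  # made-up id
release = {"artist": "Deee-Lite", "title": "World Clique"}
payload = json.dumps(oembed_response(page, release), indent=2)
```

The remaining work for a real proxy would be the HTTP endpoint that accepts the consumer's `url` and `maxwidth` query parameters, plus the actual metadata lookup.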


This Must Happen!

The Little Guy (TM) has been watching a lot of Yo Gabba Gabba! when it hit me.

Lady Miss Kier must do a dancey dance!! That would be AWESOOOOME!!

[embed]http://www.youtube.com/watch?v=Ibk4Diagkok[/embed]

If they can find a spot on the show for Metta World Peace, they can put a little groove in the heart.


Atomic News

[embed]https://twitter.com/mathewi/statuses/257899167073050625[/embed]

Circa seems pretty interesting even though it’s YAPTRN

Part of the thinking behind Circa comes from ideas that have been described by author and journalism professor Jeff Jarvis, as well as media-startup veteran David Cohn, who is also a founding partner of Circa and acts as its editor-in-chief. The main idea is that the traditional article or story format that newspapers and other news outlets have produced for so many years no longer fits with the way we produce or consume information now. The standard “inverted pyramid”-style article was designed for the days when people might only see one report about a news event, printed on dead trees and without links, so it had to include virtually everything.

The gravity of mobile is counteracting the inertia of print in terms of news distribution, leading to some interesting approaches. The Web in general provided the initial escape velocity, but couldn’t quite get news out of print’s orbit. The confluence of current economics and mobile adoption feels like it could provide enough acceleration to do the trick.

The key is that Circa has pushed back into the newsroom’s production process and if successful in a big way will be a sea change.

Feh. Even though I wrote it, that next to last paragraph was way too weasel wordy, wonky, and jargony.


Network Structure Taxonomies

Link parkin’: Taxonomies of networks from community structure [embed]https://twitter.com/bigdata/statuses/254974410363133953[/embed]

The study of networks has become a substantial interdisciplinary endeavor that encompasses myriad disciplines in the natural, social, and information sciences. Here we introduce a framework for constructing taxonomies of networks based on their structural similarities.

This paper may set a record for number and word count of author affiliations.


Nerds, Geeks, Dorks

Glyph Lefkowitz wrestles with some definitions:

So I always feel a twinge when I identify myself as a “geek”. I usually prefer to say that I am - or at least aspire to be - a nerd.

Have to say I’m quite sympathetic. Suppressing my inner dork is a daily, even hourly, challenge.


The Flask Stack

At work, I’ve been giving Flask a whirl, with a focus on building a REST based API emitting JSON responses. A much different approach from Django, much more amenable to building an API to my mind. Much less magic, but much lighter weight. And at least for me, a bit easier to understand how to generate RESTful responses.

The biggest part of the learning curve has been SQLAlchemy. Between the SQL Core for building DB queries, to the object relational mapper (with two styles of declaring schemas no less), to engines, to sessions, there’s a lot going on. Using PostGIS is exacerbating the issue as 1) the geo part needs extensions like GeoAlchemy, and 2) I take advantage of Postgres’ range types, which aren’t baked into SQLAlchemy and are best used with a non-standard operator (@>). So imagine coming to this Swiss Army chainsaw of object relational mapping system and you’re immediately into figuring how to extend the framework. Fun!

Some helpful Flask extensions: Flask-Restless and Flask-SQLAlchemy


Wayne Enterprises Chronicles: Week 5

TRAGEDY!! I went into the Monday night game trailing by 24 points and a skoosh. A tough get, but I had the overall #1 pick, Arian Foster going. 120 yards and 2 touchdowns was definitely doable.

152 on the ground. 1 touchdown. 1 reception for 16 yards. 23 fantasy points.

And I lose by a lousy half a point!!

Foster did his part. The actual killer was Stevan Ridley of the Patriots giving up a fourth quarter fumble that cost me 2 points.

Given that my quarterback, Robert Griffin III, got knocked out and only supplied 3 points, I probably shouldn’t have been in the game anyway. But Tony Gonzalez had another big day with 24 points. Meanwhile, my kicker and defense provided double digits as well.

Need to get more production out of my wide receivers, but dang did that one sting!


djay Unleashed?

If I read this right, a vexing problem I had with djay as virtual turntables has been solved:

There are actually two points here. You have the ability to support multichannel audio interfaces in djay, previously unavailable, and you also can route headphone output and main output separately – so even without a multichannel interface, you could use, say, both the headphone jack and an HDMI or USB out. That’s a big deal for DJs, because finally, you can pre-cue tracks, but also should mean apps with multichannel recording and playback, other tools with separate cueing, and, heck, even iPad-based surround rigs if you want them.

Will have to rekick the tires.


Another Coder’s Font

Source Code Pro. Currently making the rounds on the Intarwebs:

Source Code Pro is a set of OpenType fonts that have been designed to work well in user interface (UI) environments. In addition to a functional OpenType font, this open source project provides all of the source files that were used to build this OpenType font by using the AFDKO makeotf tool.

Via Mikko Ohtamaa, said post also having useful commentary and comments.


OSM & PGSQL

Michal Migurski drops knowledge on how to get OpenStreetMap data into PostgreSQL/PostGIS:

At first glance, OSM data and Postgres (specifically PostGIS) seem like a natural, easy fit for one another: OSM is vector data, PostGIS stores vector data. OSM has usernames and dates-modified, PostGIS has columns for storing those things in tables. OSM is a worldwide dataset, PostGIS has fast spatial indexes to get to the part you want. When you get to OSM’s free-form tags, though, the row/column model of Postgres stops making sense and you start to reach for linking tables or advanced features like hstore.


pandas 0.9.0

[embed]https://twitter.com/wesmckinn/statuses/255290426477641728[/embed]

Not much of a point-zero release fan, but I might make an exception for pandas.


Capability Challenge

Rafe Colburn has been on fire recently, but a recent post on capabilities really hit home for me:

China has deployed its own aircraft carrier. The vessel is a rehabilitated Ukrainian carrier that China purchased in 1998. Unfortunately, China does not have pilots who have practiced landing a plane on an aircraft carrier, nor do they have any planes that are capable of landing on an aircraft carrier.

What’s the point? That it’s a lot more difficult to develop a capability than it is to build something.

I work in an industry where “capability” is a bit of jargon and I work in an organization that likes to bandy the term about willy nilly. The industry is definitely cognizant of the meaning but there are quite a few people I work with who need to carry a printed version of this around.


Jesusphone 5

I ordered an iPhone 5 a couple of weeks ago, it arrived at my local AT&T this past Thursday, and I picked it up yesterday. I’ll admit it. I made a halfass attempt to wait in line on the first day. There was enough of a line to turn me away, even in the hinterlands of Northern Virginia. If I’d stuck it out, given I was going top shelf for a 64 GB model, I might have actually gotten one to gloat about. But I had other stuff to do.

So here’s my pure, initial impulse reaction based on less than 36 hours of ownership.

I’d have been real $?!& salty if I’d stood in line a couple of hours for this thing. Physically longer and lighter? Big deal. Ear pods still don’t stay in my ears. Yet another cable to carry, woohoo! And I sort of liked the iPhone 4’s backglass, which gave it a nice heft. Maybe you can be too thin.

But. But! BUT!! … The combination of the new A6 processor and 4G networking makes the iPhone 5 fast as hell. Stuff I used to twiddle my thumbs waiting for on the 4? Done!

And the more you use it, the more you notice.

The biggest hurdle is that I transitioned from an early 2000’s poor excuse for a “smart phone” to the iPhone 4, a quantum leap. So anything less could feel like falling short.

But I think the 5 might be a keeper.


Cogs Command Line Toolkit

[embed]https://twitter.com/pypi/statuses/254276578270400512[/embed]

I doubt if I’ll switch from cliff, but it’s always good to be aware of your options. If I could just figure out how to use cliff without distribute or a setuptools install I’d be set.


The Cloud Unit of Computation

ZeroVM is to virtualization what SQLite is to DBMS.

Diggin’ in the feed cratez, I ran across a piece by Ben Lorica: “How ZeroVM changes analytics in the cloud”. As Lorica points out, ZeroVM is more akin to the Java Virtual Machine than virtualization containers. However, there’s an interesting implication:

Converged Storage in the Cloud: The amount of time it takes to transfer data between two specialized clusters has led to storage systems with compute capabilities. A recent example is storage vendor CleverSafe including Hadoop MapReduce into its dispersed storage network. Users of Hadoop MapReduce who have played with cloud computing are familiar with this issue: performing big data analysis in the cloud usually means having to first transfer data from storage systems (S3) to compute resources (EC2). This means that if lowering latency is an issue, bandwidth and data size limits what you can do. In contrast (assuming cloud services providers install it) ZeroVM lets you perform computations on the storage cluster!

Anyone who’s done any significant Big Data or parallel computation has run up against the issue of moving data versus moving computation. A computation container that’s cheap to deploy but isolates like an entire PC could be pretty handy. Throw in modern deployment tools like Chef and Puppet with Amazon Web Service style APIs and things get really interesting. Any chance AWS itself could get commoditized from below?


Tagging Tweets

We’re pleased to announce a new release of the CMU ARK Twitter Part-of-Speech Tagger, version 0.3.

I’ve used the ARK Tagger and can vouch highly for it.


Wayne Enterprises Chronicles: Week 4

Failure. The Enterprises went down to defeat this past fantasy weekend. Run-DMC, Oakland’s Darren McFadden, was the key blow, only scoring 4.3 fantasy points. Arian Foster had a solid 15.9 points, but that’s not enough for a RB1. So my running backs basically did me in.

Not to mention the opponent had Tom Brady storm back for 35+ points in addition to great days for Victor Cruz and Marshawn Lynch.

The real killer is that I lost by about 22 points, but I easily had that much extra left on my bench. As soon as I taunt Cam Newton and sit him down, he goes (fantasy) off against Atlanta. Stevan Ridley joined in with Brady in abusing the Bills and would have been a nice replacement for McFadden.

But that’s fantasy football for you. Still in first, barely on point differential.


Hittin’ The Trifacta

[embed]https://twitter.com/joe_hellerstein/statuses/253872168100843520[/embed]

More details from Gigaom.

Although Prof. Hellerstein was after my graduate school time, Go Bears!, and all that.


Postgres Performance Tips

For many application developers their database is a black box. Data goes in, comes back out and in between there developers hope it’s a pretty short time span. Without becoming a DBA there’s a few pieces of data that most application developers can easily grok which will help them understand if their database is performing adequately. This post will provide some quick tips that allow you to determine whether your database performance is slowing down your app, and if so what you can do about it.

Craig Kerstiens provides some handy Postgres SQL that reveals how well the RDBMS is handling your queries. Optimization is left as an exercise for the reader.


Brendan O’Connor on Powerset

An interesting insider look, by Brendan O’Connor, at Powerset, a failed natural language search startup:

There’s a lot to say about Powerset, the short-lived natural language search company (2005-2008) where I worked after college. AI overhype, flying too close to the sun, the psychology of tech journalism and venture capitalism, etc. A year or two ago I wrote the following bit about Powerset’s technology in response to a question on Quora. I’m posting a revised version here.


Solr vs Elasticsearch

Haven’t had a chance to dig in to this comparison of elasticsearch and Solr, but I’m link parkin’ to keep an eye on the alternatives. Have to say I’m currently enjoying elasticsearch at work, but it’s still early in my experimentation.

A good Solr vs. ElasticSearch coverage is long overdue. We make good use of our own Search Analytics and pay attention to what people search for. Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch.

As the Apache Lucene 4.0 release approaches and with it Solr 4.0 release as well, we thought it would be beneficial to take a deeper look and compare the two leading open source search engines built on top of Lucene – Apache Solr and ElasticSearch. Because the topic is very wide and can go deep, we are publishing our research as a series of blog posts starting with this post, which provides the general overview of the functionality provided by both search engines.


Useful That

[embed]https://twitter.com/lintool/statuses/252868723436822528[/embed]

I have it on good reputation that this Domingos guy knows a thing or two about machine learning.

Ha, ha! Only serious.


Ruh-roh Apple?

I upgraded my lil’ ole MacBook to Mac OS X 10.7.5 (not worthy of Mountain Lion support) and now the poor thing has been crashing. And it borked my alpha of Tweetbot for Mac (fixed). May just be age, but I’m hoping there’s a pending OS update to fix some obscure issue.

I upgraded my iPhone 4 to iOS 6 and now I think I’m suffering the battery life issues many others have been seeing. Haven’t done a scientific investigation, and an uptick in audio streaming over 3G may be responsible, but still irritating.

Not really Apple’s fault but …, I went to the AT&T store a day after the iPhone 5 launch to put in my order. Okay, it’s going to take 3-4 weeks, but I’m a patient guy. However, it shouldn’t take me 30 minutes to get a hold of a sales rep and complete the pre-order. Yikes!

And then there’s that maps issue.

Stuff used to “just work”.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.