
Twitter Mining Recipes

Link parkin’: Matt Russell’s repo for his book “21 Recipes for Mining Twitter”

“This repository contains code for the 21 Recipes for Mining Twitter (O’Reilly, 2011.) As the name of the title suggests, it’s a short cookbook of recipes that’s designed to help you solve common problems when working with Twitter data. Some of the recipes are extracted from content presented in Mining the Social Web while others are completely new additions. In either case, they’re designed to be bite-sized and serve as the jetpack that you can strap onto that great Twitter mining idea you’ve been noodling on — whether it’s as simple as running some disposible scripts to crunch some numbers or as extensive as creating full-blown interactive web application.”


Pull Requesting

Rachel Nabors jumped into the deep end of the open source contributor pool and started making pull requests to fix issues. Of course in GitHub-land, it is not incumbent on the receiving end of the request to accept. Nabors didn’t have much luck. Seems like there are lots of prerequisites (check style guide, visit issue tracker, e-mail maintainers first) before you can realistically start firing fixes.

Part of the reason I’m becoming more interested in GitHub is that I’d someday like to be an open source contributor. But you’ve gotta know the tools. And clearly, as Nabors found out, the culture behind the commit log. Good to know.


Multiplexing

Rafe Colburn is just now getting into terminal multiplexing. All I can add is that I’ve been using GNU screen for a few years and it’s really been a boon to my development.

Usefully though, Colburn has a link detailing why tmux might be better than screen. The arguments are compelling. I might have to give the competition a test drive.


Python Text Munging

Link parkin’: csvfilter “a Python command-line tool for manipulating CSV data”

pyp “Pyp is a linux command line text manipulation tool similar to awk or sed, but which uses standard python string and list methods as well as custom functions evolved to generate fast results in an intense production environment.”

Pyp was even presented at PyCon 2012. Too bad it was when I was completely knocked from the flu and near fainting in a parking lot. Luckily the video is on YouTube.

I could have used these a few weeks and months ago, when I was going hot and heavy on the Tweet processing. Probably still useful for the occasional text munging task.


TIL About CartoDB

Thanks go to a blog post, admittedly potentially biased, comparing Google’s Fusion Tables with the open source CartoDB. While CartoDB is enticing for some work stuff, the GitHub repo looks like a big inhale with lots of sharp pointy bits. But cartodb.com provides a hosted version possibly worth kicking the tires on.


Whaddup Wigan?

So not only do The ’Letics disrupt the very top of the Premiership table by bumping off Manchester United, they then back it up with a road win, downing The Gunners at The Emirates. This now puts 3rd place in play, bringing Arsenal back within range of Tottenham and Newcastle United. Chelsea remains right behind both, but at 2 more points back they don’t look like contenders for third. Thanks Wigan Athletic for putting some excitement into the end of the Premier League campaign.

Now off to watch my DVR of Bayern Munich and Real Madrid in The Champions League. I really wish there was a good option for getting the Bundesliga on Verizon FiOS.


TIL About dajax

Today I learned about the dajaxproject: django+ajax. Good to know about given the baseline knowledge needed for client side, errr, Front-End developers.

Thanks Kevin Veroneau. Looking forward to the tutorial.


Spike Lee’s Joints

There’s an inspired programmer over at HBO who’s been scheduling a lot of Spike Lee’s oeuvre, especially some of the earlier studio films. Good to see lesser known works like Crooklyn, Clockers, and Jungle Fever get some airtime, along with the American classic, Do the Right Thing.


Check-Ins Dead?

Jon Mitchell of ReadWriteWeb makes a reasonable argument that the “check-in” as a location based concept is “dead”. Doesn’t look like consumers are adopting check-in applications at a rate that can lead to large scale commercial businesses. Fodder for thought, especially in contrast with Anil Dash’s effusive praise for Foursquare which is not all that stale.

I tend to concur with the latter part of the article that speculates that this failure is just part of an experimental, evolutionary process to find a model for continuous location sensing. What we’ve just learned is that continuous location announcement into our social circles isn’t particularly useful or attractive.

Now personalized capture and derived insights could be a winner. However, that’s a model hard to scale in a big way fast, which is what VCs these days are looking for.


Pure GeoIndexing

LazyWeb I beseech you.

I could use a pure geographic bounding box library or server that does for geoqueries what Sphinx does for full text search. Maybe a bit prickly, idiosyncratically extensible, horizontally scalable, and fast as hell by default. Throw in uniquely identified polygons, take queries for all intersections or all contains, return ids. That is all. No transformations or manipulations, just insert, delete, index and answer queries.
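To pin down the interface I’m asking for, here’s a minimal sketch. It assumes axis-aligned bounding boxes stand in for polygons, and a plain linear scan stands in for a real spatial index, so it covers the insert, delete, and query part but none of the fast or scalable part. The `BBox` and `BBoxIndex` names are hypothetical, not from any existing library.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box from (min_x, min_y) to (max_x, max_y)."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other: "BBox") -> bool:
        # Boxes overlap unless one lies entirely to one side of the other.
        return not (
            self.max_x < other.min_x or other.max_x < self.min_x
            or self.max_y < other.min_y or other.max_y < self.min_y
        )

    def contains(self, other: "BBox") -> bool:
        # True when `other` lies entirely inside this box.
        return (
            self.min_x <= other.min_x and other.max_x <= self.max_x
            and self.min_y <= other.min_y and other.max_y <= self.max_y
        )


class BBoxIndex:
    """Hypothetical id -> box store. Queries are linear scans, not a real index."""

    def __init__(self) -> None:
        self._boxes: Dict[str, BBox] = {}

    def insert(self, shape_id: str, box: BBox) -> None:
        self._boxes[shape_id] = box

    def delete(self, shape_id: str) -> None:
        self._boxes.pop(shape_id, None)

    def intersecting(self, query: BBox) -> List[str]:
        # Ids of every stored box that overlaps the query box.
        return sorted(i for i, b in self._boxes.items() if b.intersects(query))

    def contained_in(self, query: BBox) -> List[str]:
        # Ids of every stored box that lies entirely inside the query box.
        return sorted(i for i, b in self._boxes.items() if query.contains(b))


index = BBoxIndex()
index.insert("a", BBox(0, 0, 2, 2))
index.insert("b", BBox(1, 1, 5, 5))
index.insert("c", BBox(10, 10, 11, 11))

query = BBox(0, 0, 3, 3)
hits = index.intersecting(query)    # boxes that overlap the query
inside = index.contained_in(query)  # boxes entirely within the query
```

Queries return ids only, which is the whole point: no transformations, no geometry coming back. A real version would swap the dictionary scan for something like an R-tree.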

PostGIS is great, but the performance of the geographic indexing seems opaque and hard to optimize, at least to this simple soul.


Good Flow

For whatever reason, the quality of items in my various knowledge serendipity flows (a.k.a. webfeeds, although not exclusively RSS) has spiked recently. Obviously I want to save some items for further blogging, but a number of them are looking like this quality post on “A Year with MongoDB” found via Hacker News. Obviously this is one anecdotal experience, but it somewhat confirms my experience with MongoDB. Interesting to start, painful in practice.

On a side note, I’m looking at some of the data storage numbers the HNers are mentioning in the comments and feeling a sense of pride that a project at work has a dataset easily comparable.

Yeehah! I get to hang out with the cool kids.


Greg Still Geeking

I‘ve been a fan of Greg Linden for a loooong time. Heck, he even tipped me off to MapReduce back when it was just a moderately interesting distributed programming model and well before it was a hip ecosystem. Although now he‘s much more intermittent, I still enjoy his posts which are much more link dumps, such as this, these days.

I suspect he‘s adding more depth over at his G+ abode, but that‘s One Social Network Too Many © for me.


Peak Opacity

Jamais Cascio does his usual great job of mentally exploring the future, contradicting the notion that “data is the new oil”. His reasoning: data is massively increasing in quantity, especially in the face of corporate interests singularly focused on collecting as much personally identifying information as possible. Instead he projects that the ability to obscure inspection of oneself, a.k.a. opacity, is rapidly becoming scarce. Opacity is even more analogous to oil in that it can be hazardous, a pollutant, and difficult to extract.

Cascio proposes three potential opacity regimes emerging in concert and conflict:

  • Top down regulation: slow moving and hard to get right
  • Bottom up protections: individually powerful but clearly against the interest of large powerful concerns: corporations and governments
  • Emergent disruption through pollution: hard to stop, hard to reverse, and chock full of unintended consequences

An interesting thought experiment.


The EPL Plot Thickens

So just when I was about to check out on the English Premier League, what with ManU up by 8 points with 6 matches left, the Red Devils spit the bit. In an interesting mid-week set of matches, Wigan Athletic beat Manchester United, 1-0. Meanwhile, Manchester City rolled on West Bromwich, 4-0, putting The Citizens 5 points back with 5 to play. Plus, there’s still another leg of the Manchester Derby at the Etihad, where City could conceivably pick up 3 points. Definitely appointment viewing for that match. One more toestub by ManU and things get really interesting.

Arsenal also continued their incredible revival, squashing Wolves 3-0. The Gunners are close to locking up 3rd and Wolverhampton close to locking up relegation.


NexGen MacBook Pros

ArsTechnica, which is fairly reputable, reports that 15” MacBook Pros seem to be in short supply:

“15” MacBook Pros are starting to become scarce among popular resellers, suggesting an Ivy Bridge update could be coming as soon as the end of April. Users hoping for updated 13” and possibly 17” models will likely have to wait until at least June, however.”

Touting an April 29th announcement for new 15-inchers is getting my hopes up. However, I’m not sure “slimmer … sans optical drive” definitely means MacBook Air slim, which is what I’m really fiending for.

And 8GB Ram.

And 512GB SSD for a reasonable price.

A man can dream, can’t he?


Stupefyingly Bad

No, not the Washington Wizards. Yes, your 2011-2012 Charlotte Bobcats.

Tonight’s game against the Wizards was the first I’d seen of the Bobcats on an “extended basis”. In front of an empty Monday night house in Charlotte, the Wizards had a 30 point lead well into the fourth quarter. And it didn’t even feel that close.

There’s nothing I can really put my finger on in terms of why the Bobcats suck so bad. They don’t really have a bona-fide star other than Kemba Walker. Then again, the Wizards really only have John Wall.

But enough of those losers. Yes I know the Wizards aren’t really winning. The level of play has taken an uptick though. They’re not getting mocked on SportsCenter on a daily basis. Guys like Kevin Seraphin, James Singleton, and Cartier Martin are proving surprisingly serviceable, although on a good team they’d be on the end of the bench.

And at the end of the day, Jan Vesely is showing real signs he might pan out, which would be a coup for Ernie Grunfeld (still needs to go).


Loud Leadership

Ryan Tomayko’s take on his management style as Director of Engineering at GitHub is something to aspire to. Check it out and read it all. Choice quote for me:

“I actually don’t show people how to make decisions and ship product in any real direct way. There’s no How To Ship Product training class or anything like that. Instead, I just do work.”


Hooking GitHub

Tarek Ziadé notes a burgeoning trend based on distributed version control that I think is quite important:

“There’s a trend these days on Github-based online services. That is — point me your Github repo and I’ll do something with it everytime you push a change.”

DVCSs, among other things, fundamentally make completely explicit the act of creating a delta on a repository. The explicit commit presents all sorts of opportunities for automation over a code repository. As Tarek points out, it‘s not exactly a new trend, but seems to be accelerating with the popularity of services such as GitHub.
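Here’s a rough sketch of what the receiving end of one of those “do something every time you push” services might look like. It assumes the push notification arrives as JSON with `ref`, `repository.full_name`, and `commits` fields, which is my reading of GitHub’s push payload, so check the current docs before relying on it. The registry, the decorator, and the `example/demo` repo are all made up for illustration.

```python
import json
from typing import Callable, Dict, List

# Hypothetical registry: repository full name -> actions to run on each push.
Action = Callable[[dict], str]
ACTIONS: Dict[str, List[Action]] = {}


def on_push(repo: str) -> Callable[[Action], Action]:
    """Register a function to run whenever `repo` receives a push."""
    def register(action: Action) -> Action:
        ACTIONS.setdefault(repo, []).append(action)
        return action
    return register


def handle_push(body: str) -> List[str]:
    """Parse a push notification and run every action registered for its repo."""
    payload = json.loads(body)
    repo = payload["repository"]["full_name"]  # assumed payload field
    return [action(payload) for action in ACTIONS.get(repo, [])]


@on_push("example/demo")
def summarize(payload: dict) -> str:
    # "refs/heads/feature/x" -> "feature/x": keep everything after the prefix.
    branch = payload["ref"].split("/", 2)[-1]
    return f"{len(payload['commits'])} commit(s) pushed to {branch}"


# A hand-written stand-in for the body a hosting service would POST.
sample = json.dumps({
    "ref": "refs/heads/master",
    "repository": {"full_name": "example/demo"},
    "commits": [{"id": "abc123", "message": "Fix typo"}],
})
results = handle_push(sample)
```

The explicit commit is what makes this work: every push is a discrete event with a repo, a branch, and a list of deltas, so hanging a build, a test run, or a dashboard update off it is just another registered action.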

His point about dashboards is intriguing. I‘d extend it to whole ecosystems of repositories. Betcha GitHub has all sorts of interesting dashboards for their internal QA monitoring.


DJ Sneak : Fabric 62

Amazon’s e-mail notices finally brought me something useful. An alert that DJ Sneak’s Fabric 62 had been released. With digital download immediately available to boot.

Sneak occupies an odd place in my house music affections. I love, Love, LOVE, his cut Show Me the Way, off of The Polyester E.P. It may be one of my favorite anthems ever. On the other hand, I’ve never really fallen for any of his mix CDs. They’ve all been passable, and I got Fabric 62 just in case there’s a breakthrough, but none of them have ever been on continuous repeat for me. That’s been reserved for folks like Lil’ Louie Vega and Evol Intent.

After first listen, seems like Fabric 62 won’t hit heavy rotation. Lost my headphones and haven’t been able to give it a repeat, but we’ll give it a chance to breathe.


pandas PyCon Tutorial

Link parkin’: Video recording of Wes McKinney’s tutorial on pandas at PyCon 2012. Hard to get a better source than the project’s lead developer.

Serendipitously picked up the full catalog of NextDayVideo’s PyCon 2012 recordings.


A Top Notch Sports Week

Was talking with a colleague at work and noted that April’s first week, Monday to Monday, might be the best week in sports:

  • NCAA Men’s Basketball Championship: Okay Kentucky was impressively the best team in the land, but I’m still waiting for this one to be vacated.
  • NCAA Women’s Basketball Championship: I don’t care who you are, 40 games in a row is spectacular. But Baylor also beat two #1 seeds in the Final Four including Notre Dame for the second time in the season. They also beat perennial blue bloods Tennessee (twice also!) and UConn, along with last year’s national champion Texas A&M (twice also!). Definitely not a creampuff schedule.
  • Major League Baseball Opening Day: We’ll ignore whatever the hell that was over in Japan. Frankly, I’m not sure baseball has opened until the Cincinnati Red Stockings have played.
  • The Masters: A tradition unlike any other. Can’t say I’m a huge golf fan (yet) but Tiger got me hooked on watching, especially on Sunday.
  • NHL Regular Season Finale: At least here in Washington, given the Caps situation this year, the NHL playoffs have already started.

The only other week that’s comparable is Saturday to Saturday of the last week of March, which includes the first day of the Final Four, but leaves out the last day of The Masters.


Milan v Barca 2

DVRed the second leg of the AC Milan versus FC Barcelona UEFA Champions League quarterfinals match. Can’t say I was captivated but did see Messi’s brilliance. That dude can accelerate with the ball, like nobody’s business. Two PKs weren’t thrilling but at least Milan put in a little spice by scaring Barca with a leveling goal.

Like I said though, bet on Barca.


PostGIS 2.0

After well over 2 years of development, there’s a new release, 2.0, of PostGIS. The old graybeard general wisdom was that one never took an x.0 release seriously as there were bound to be bugs. Still, I’m mildly intrigued as at work I have multiple millions of geolocated objects stored in PostGIS. There are a few queries that could use any help they can get. Then again, I really need to sit down, do some analysis and benchmarking, and really understand the distribution of my data.

Still, maybe a squeaky new PostGIS can help in the performance arena. Hopefully, the query selectivity has been improved at least.


AdoptedArt

Speaking of the remix project, I finally figured out the perfect name:

The AdoptedArt Project

TA DAH!! Adopted being the antonym of abandoned.

I’ve even gone ahead and snagged the domain name AbandonedArt.org to eventually provide a Web home for the effort. Nothing to see there as of yet though. Move along.


Slowly Getting Git

I’ve been trying to use git, or more importantly absorb its ethos, off and on for a while now. It’s one thing to read about basic branching and merging in a book, and another to internalize an intuitive feel for how to put the facility to use.

Recently, between work and an initial start on the proposed AbandonedArt remix project, I’ve been getting a heavier dose of git usage. “Practice makes perfect,” and that’s definitely happening here. I’ve finally internalized that branches are coding excursions, you have to check out the branch you want to merge into, and then name the branch you want to merge in.
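For my own future reference, here’s that branch-and-merge sequence as a small script. It’s only a sketch: it assumes `git` is on the PATH, builds a throwaway repo in a temp directory, and drives git through `subprocess`. The branch names `main` and `excursion`, the file name, and the demo identity are all made up.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Throwaway identity so commits work without any global git configuration.
GIT = ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com"]


def git(repo: Path, *args: str) -> str:
    """Run one git command inside `repo` and return its standard output."""
    done = subprocess.run(GIT + list(args), cwd=repo, check=True,
                          capture_output=True, text=True)
    return done.stdout


def merge_demo() -> str:
    """Make a branch, commit on it, then merge it back into main."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        notes = repo / "notes.txt"
        git(repo, "init")
        # Name the first branch explicitly, whatever the local default is.
        git(repo, "symbolic-ref", "HEAD", "refs/heads/main")
        notes.write_text("start\n")
        git(repo, "add", "notes.txt")
        git(repo, "commit", "-m", "Initial commit")

        git(repo, "checkout", "-b", "excursion")  # the coding excursion
        notes.write_text("start\nidea from the excursion\n")
        git(repo, "commit", "-am", "Try an idea")

        git(repo, "checkout", "main")    # 1. check out the branch to merge INTO
        git(repo, "merge", "excursion")  # 2. name the branch to merge IN
        return notes.read_text()


if shutil.which("git"):
    merged = merge_demo()
else:
    merged = None  # git is not installed, so there is nothing to demonstrate
```

Since `main` never moved while the excursion was underway, the merge here is a simple fast-forward and there’s no conflict to resolve.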

Now if I could only get a sustainable working model of remote repositories. I‘m functional, but there are definitely useful bits I’ve missed and still get hung up on a jagged edge here or there.

And I think resolving merge conflicts is an area where most git coverage could use some extended attention. I’m guessing conflicts are supposed to be rare but they pop up enough, and are tricky enough, that more detail would be helpful to this git apprentice.


ProGit

Link parkin’: Scott Chacon’s ProGit. A handy reference on the Web for using git. Help him out and buy a copy.


Tab Killin’, Python Edition

Flushing out some of the many tabs I’ve collected in Chrome:


Twitter Utility

I’ve actually been on Twitter since 2007, but really haven’t had much use for it. I was connected to a few friends, but wasn’t tweeting much useful or seeing many useful tweets.

Recently I added a few data science folks, and began slowly expanding my follows. Now I’m getting a lot of interesting links, even hitting Tweetbot multiple times in a day. Props to Ben Lorica, @bigdata, Jimmy Lin, @lintool and Joe Hellerstein, @joe_hellerstein for bringing the good bits. Hilary Mason @hmason and Andy Hickl @andyhickl provide a little personality and I’ve got an emerging cluster of Python folks including David Beazley, @dabeaz, Wes McKinney, @wesmckinn, and Adam Klein, @atomklein.


Champions Dud

Well, I got all excited about the AC Milan v FC Barcelona Champions League match yesterday for nothing. A 0-0 draw makes for a dud. Even worse, the big players just missed plays in the final third. And I’ve never seen topflight defenders diddle around in the box so much. Maybe it’s a strategic approach I’m not sophisticated enough to get, but seemed like they were constantly playing with fire.

Hopefully the other half of the tie, next week at the Camp Nou, will be a lot better. Winner take all, bet on Barca.


Yet Another Project

I‘ve noodled around with turning the Retrosheet data into SQL for MySQL and PostgreSQL, but never did anything further. Now that I’ve dug into Tastypie a bit, it might be fun and easy to wrap such a database with a RESTful Web API. Desktop visualizations could be easily created, but even better you could do neat browser based renderings given the direction of new toolkits like D3.


Ivy Bridge MacBooks

Marco Arment, who stays way more up to date on this stuff than I do, is making some reasoned projections on changes to Apple’s MacBook lineup. The dream of the 15” Screen/8GB RAM/250GB SSD seems a bit more promising. I’m actually seeing more and more reports of the current line of MacBook Air processors being quite comfortable for computationally intensive activities, which is where I would have been willing to compromise.


bbfun and point total pool

Here’s another project, slightly more complicated than remixing AbandonedArt (although getting pyprocessing working on OS X Lion is a bit challenging). This is sparked by the fact that my long time March Madness bracket league collapsed under the weight of its own success. Last year the amount of prize money in the pool raised the ire of PayPal and the organizers barely managed to get the purse out of them. This year the organizers came to their senses, realized they have better things to do with their lives, and shut the whole thing down. I wasn’t too put out, since I never won any money anyway, and picking brackets is starting to feel stale.

But there’s still a little competitive juice flowing during March Madness.

Back in my late days of undergrad, early years of grad school, there were a couple of fun contests run on USENET, (yes USENET), around college basketball. The first was bbfun, which was essentially a confidence pool over the week of Big 10 men’s basketball games. You picked the winners at the beginning of the week, ranked them in order of confidence, and the scoring was weighted by your rankings.
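A quick sketch of how that scoring might work. I don’t have the original bbfun rules in front of me, so this assumes the simplest weighting: with n games, a correct most-confident pick is worth n points, the next is worth n-1, and so on down to 1. The function name and the sample games and results are made up.

```python
from typing import Dict, List, Tuple


def confidence_score(picks: List[Tuple[str, str]], results: Dict[str, str]) -> int:
    """Score picks listed from most to least confident.

    `picks` holds (game, picked winner) pairs, most confident first.
    `results` maps each game to the team that actually won.
    """
    total = 0
    weight = len(picks)  # the most confident pick carries the most weight
    for game, team in picks:
        if results.get(game) == team:
            total += weight
        weight -= 1
    return total


# Made-up week of three games, not real results.
picks = [
    ("Game 1", "Indiana"),     # most confident: worth 3
    ("Game 2", "Ohio State"),  # worth 2
    ("Game 3", "Illinois"),    # least confident: worth 1
]
results = {"Game 1": "Indiana", "Game 2": "Michigan", "Game 3": "Illinois"}
score = confidence_score(picks, results)  # 3 + 0 + 1
```

The fun is in the ordering: getting the big-weight picks right matters far more than nailing the toss-ups at the bottom of the list.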

The second game, Matthew Merzbacher’s point total pool, had you select 8 NCAA tournament teams. You collected contest points for each real-world point your teams scored throughout the tournament. Later on Merzbacher spiced it up by adding bonus points for having lower seeded teams in your slate. In addition to the serious attempts, which could actually involve a fair bit of analysis, plenty of people entered fun theme or joke slates.
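Here’s a sketch of the point total scoring. The eight-team slate and the point-for-point scoring come from the rules as I remember them. The bonus is my own stand-in, since I don’t recall Merzbacher’s actual formula: each team earns a flat bonus proportional to its seed number, so a 12 seed is worth more bonus than a 2 seed. All names and numbers below are hypothetical.

```python
from typing import Dict, List

SLATE_SIZE = 8  # each entry picks eight tournament teams


def slate_score(slate: List[str], points: Dict[str, int],
                seeds: Dict[str, int], bonus_per_seed: int = 0) -> int:
    """Total contest points for one slate.

    `points` maps a team to the real-world points it scored in the tournament.
    `seeds` maps a team to its seed (1 is best, 16 is worst).
    `bonus_per_seed` is an assumed bonus rule: seed number times this value.
    """
    if len(set(slate)) != SLATE_SIZE:
        raise ValueError(f"a slate needs {SLATE_SIZE} distinct teams")
    scored = sum(points.get(team, 0) for team in slate)
    bonus = sum(seeds[team] * bonus_per_seed for team in slate)
    return scored + bonus


# Made-up teams, point totals, and seeds.
teams = ["A", "B", "C", "D", "E", "F", "G", "H"]
points = dict(zip(teams, [300, 250, 200, 150, 100, 80, 70, 60]))
seeds = dict(zip(teams, [1, 2, 3, 4, 5, 8, 12, 16]))

plain = slate_score(teams, points, seeds)
spiced = slate_score(teams, points, seeds, bonus_per_seed=2)
```

The tension is the same one that made the game fun: top seeds play more games and pile up points, while long shots only pay off through the bonus.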

Both of these games seem eminently implementable using modern Web frameworks and toolkits. I actually wouldn’t be surprised if either or both had already been done but it would be a fun “reinventing the wheel” project. If executed smartly, probably wouldn’t be too taxing in terms of compute resources, and a small fee from a decent sized participant pool would cover your infrastructure costs. If one could navigate the purse versus gambling issue, prize money could even be incorporated.


DC vs Detroit

At the other end of the spectrum, I can’t believe I’m watching one of the worst NBA matchups this season. The Detroit Pistons against the hometown Washington Wizards. Both teams have lost at least 2 out of every 3 games this season. The Wizards have the second worst record in the league. The Pistons are in a bad bunch with the New Jersey Nets, the Toronto Raptors, and the Cleveland Cavaliers.

I know why the Wizards are awful. What I don’t understand is how the Pistons can be so bad. When you look at their roster, they’ve got two guys with championship rings in Tayshaun Prince and Ben Wallace. Ben Gordon is a proven scorer and has playoff experience. So there’s veteran experience. Greg Monroe and Brandon Knight look like promising young players. Will Bynum, Jason Maxiell, Rodney Stuckey, and Charlie Villanueva have proven NBA talent.

It’s a shame a once proud franchise has fallen so far with no relief in sight.

As for the Wizards, with Andray Blatche on the sideline for “conditioning”, that’s 4 out of 5 that need to be gone. Although the way Jordan Crawford’s shot selection is going, he might make the list soon. Dude, ease up on the pound-the-dribble-for-10-seconds-then-shoot possessions. The Wizards show flashes of quality but not well strung together and the team really doesn’t know how to finish. But it beats the crap they had on the floor before.


Champions League Quarters

So somehow Manchester United, Arsenal, and Manchester City are out of the UEFA Champions League and Chelsea is still alive in the quarterfinals. Something’s unjust in the world. Can’t say as I’m too excited about the Blues v Benfica tie, but might be worth watching given what’s at stake. Maybe Chelsea’s vets will get up off the deck and show some pride after AVB got summarily dismissed.

Now FC Barcelona versus AC Milan? That’s a battle of titans to be admired. Messi against Ibrahimovic is just the starting point. Might have to fire up the DVR for those two matches.


The Lantern Logo

The modern update of the Green Lantern logo has to be at the top of the list of superhero iconography. I’ve seen the logo in a number of pop culture appearances. I might argue that it’s surpassed the Batman and Superman logos for social currency. Makes for a damn nice t-shirt. And as my OS X logon image, the operating system automatically adds some nice shiny highlights that make it look even better.

Too bad it’s tied to one of the dopiest characters in the DC universe. I mean c’mon, a guy who creates randomly convenient physical manifestations sheerly by willpower? You’ve got to be joking. Oh yeah, doesn’t work against the color yellow.


OpenBastion Conferences

While somewhat opaque, The Open Bastion looks like it’s running an interesting series of Python conferences. DjangoCon US, Sep 3-8 in Washington, DC is a no-brainer for me, although I’m a bit tempted by Open Django, June 8-9, in Chicago.


Behrens’ PyCon

Shannon -jj Behrens is doing a great job summarizing a number of the PyCon talks.

I especially like his summary of Ned Batchelder’s Pragmatic Unicode talk. I agree with him that it was one of the best talks of the conference. Despite my limited attendance, we had quite a bit of overlap in session attendance. Maybe that’s an indicator I have good Python taste ;-/


Fiendin’

I’m actually really liking the notion of remixing AbandonedArt (AbandonedRemixed?) and taking it on as a project. But I had to catch myself prematurely optimizing and fiending for new hardware.

New project? Of course I need a new MacBook Air so I can work on it during all my idle moments. Like that worked out so well before.

Okay, it would be nice to have a new low end, or used laptop, just for this project. Ugh. Can’t get much for a measly $250. Besides, I have more important things to spend money on at the moment.

Hey! Let’s get an inexpensive desktop box just for the sake of this effort. Wait, we’re trying to cut back on stuff this year, not add more.

How about this. Get started using The Trusty Ole’ MacBook. IF performance really is an issue then deal with it. If not and the mythical 15”/8 GB RAM surfaces, reward yourself.

Right answer.


PyProcessing and Abandoned Art

Now this might be a feasible 100 Hours project for yours truly. Download the Processing sketches at AbandonedArt. Upload sketches to GitHub. Clone each sketch, see how well it comes out in pyprocessing, fix up as needed.

Can be worked on in small discrete chunks. Public visibility would be nice and maybe others would join. And it would provide good feedback for the pyprocessing project, maybe leading to contributions.

Processing logo retrieved from Wikimedia Commons, through a Creative Commons License

N.b. Adjusted publish date to real release date. WP just doesn’t handle dates on draft posts correctly to my mind


Circus Process Watcher

Link parkin’: “Circus is a program that will let you run and watch multiple processes.”

Might be a little bleeding edge for what I’m doing at work, but looks very attractive.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.