
UNIX split

Need to chop a 7 million line file into 1 million line chunks? There’s a UNIX command line utility for that: split.

Does what it says on the tin and does it quite well.
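At the shell, the job above is roughly `split -l 1000000 bigfile.txt chunk_`, where `bigfile.txt` and `chunk_` are placeholder names. For anyone stuck without the utility, here is a minimal Python sketch of the same idea. The function name, the output prefix, and the three-digit numbering are all my own assumptions (real split uses letter suffixes by default).

```python
from itertools import islice
from pathlib import Path


def split_by_lines(src, lines_per_chunk, prefix="chunk_"):
    """Write consecutive runs of `lines_per_chunk` lines from src into numbered files.

    Hypothetical helper: output files land next to src as prefix + 000, 001, ...
    """
    src = Path(src)
    written = []
    with src.open("rb") as infile:
        index = 0
        while True:
            # Pull at most one chunk's worth of lines; only one chunk is held in memory.
            block = list(islice(infile, lines_per_chunk))
            if not block:
                break
            out_path = src.with_name(f"{prefix}{index:03d}")
            with out_path.open("wb") as out:
                out.writelines(block)
            written.append(out_path)
            index += 1
    return written
```

Calling `split_by_lines("bigfile.txt", 1_000_000)` on a 7 million line file would leave seven chunk files beside the original.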

In the spirit of Dennis Ritchie. Godspeed kind sir.


Plexus Ranger Chronicles: Week 5

Sigh. The beat goes on. Lose, win big, lose, win big, lose. And of course Julio Jones adds to the injury streak, going down with, you guessed it, a bad hammy.

About the best that can be said for this past week is that I had my optimal lineup in. Of course my bench was fully parked with Must Start level guys on bye. Makes filling out the lineup card easy.

Did have a little hope, holding a lead going into Monday Night Football. But when the opponent has Matthew Stafford and Calvin Johnson… it was over mercifully quick.

Still 9 games left, and only a game out of the last playoff spot. A-Listers back in the lineup in a big way. Every streak starts with one win.


GeoNames

Link parkin’: GeoNames

The GeoNames geographical database covers all countries and contains over eight million placenames that are available for download free of charge.


Memo to Facebook

To: Mark Zuckerberg

From: C. Ross Jam

Re: “You Have Notifications Pending”

Zuck dude, all I know is your unsubscribe link better work. Because I skipped out on those activities on purpose. Remind me again and I may dump you wholesale.

That is all.


Violent Agreement

I’m in violent agreement with Jason Snell’s thoughts regarding Tim McCarver’s recent illness:

I wish Mr. McCarver a speedy recovery and a healthy, long life. I also wish him a lengthy and joyous retirement.

By which I mean, Tim, this broadcasting game. It’s too stressful. You have just had a “minor” heart procedure—but how minor can anything regarding your heart be?

…So Tim. We’re your pals. Quit this broadcasting racket. …

Might be the best thing I’ve seen on American McCarver yet.


Serendipity

Despite my little tantrum, turns out one round of my tweet harvesting experiment spanned the evening of Oct 1st to the afternoon of Oct 6th. Geofocused on New York City and San Francisco, there’s a good chance I caught a solid burst of interesting tweets.


Cube, Timeseries Browser Viz

Link parkin’: Square’s Cube project.

Cube is an open-source system for visualizing time series data, built on MongoDB, Node and D3. If you send Cube timestamped events (with optional structured data), you can easily build realtime visualizations of aggregate metrics for internal dashboards.


On Specialization

Specialization is for insects. —Robert A. Heinlein in Time Enough For Love.

I really enjoyed Ted Leung’s recap of the Surge 2011 conference. I don’t know squat about scalability matters, the purview of the DevOps community, but there was one quote from Leung’s recap that really struck me:

specialization is an industrial age notion and needs to be discounted in spaces where we operate at the boundary of the known versus unknown.

This was from Ben Fried, Google’s CIO, commenting on how his organization’s structure failed to solve a problem with a very complex system. Turns out no one could really understand the system from a global, end-to-end perspective, which was the key to solving the problem.

The quote really resonated, because on the day job I’ve been working on a team trying to strategically address a particularly thorny innovation challenge. Many of the technology components are known, and pairwise compositions well understood. But the unknown solution feels like it needs to emerge from a global, end-to-end perspective. If you agree with Fried, and I think I do, we’ll need some generalists to rise to the task.

And of course our company is chock full of stovepiped specialists.


Hagiography

Today, I declare myself The Most Unsentimental Man In The World ™. Think Different my friends.

Moment of silence for Rev. Fred Shuttlesworth though. Godspeed kind sir.


250K Tweets

About a week and a half ago I got possessed with the loony notion to collect 1 million tweets. I’ve worked with the Twitter streaming API in the past, so thought it would simply be a matter of keeping the connection up for an extended period of time.

I started with the simplest possible thing that might work, directly from the Twitter streaming API documentation:

curl -d @locations https://stream.twitter.com/1/statuses/filter.json -uAnyTwitterUser:Password

First couple of cracks didn’t quite work. As documented, the HTTP connection is subject to various disruptions, so I wrapped the curl invocation in a Python script that would re-execute as needed.
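I haven't posted the actual wrapper, so treat the following as a minimal sketch of the re-execute idea rather than the script that's running. The function name `keep_running`, the doubling backoff, and every default value are assumptions of mine. The `sleep` parameter is only there so the pause can be swapped out when testing.

```python
import subprocess
import time


def keep_running(cmd, max_runs=None, base_delay=1.0, max_delay=240.0,
                 reset_after=60.0, sleep=time.sleep):
    """Run cmd again every time it exits; return how many runs were made.

    Hypothetical helper. With max_runs=None it loops until interrupted.
    """
    delay = base_delay
    runs = 0
    while True:
        started = time.monotonic()
        subprocess.call(cmd)  # blocks until the child (e.g. curl) exits
        runs += 1
        if max_runs is not None and runs >= max_runs:
            return runs
        # A run that lasted a while suggests the connection was healthy,
        # so start the backoff over instead of continuing to grow it.
        if time.monotonic() - started > reset_after:
            delay = base_delay
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Example, using the command from the post (credentials are placeholders):
# keep_running(["curl", "-d", "@locations",
#               "https://stream.twitter.com/1/statuses/filter.json",
#               "-uAnyTwitterUser:Password"])
```

The child inherits stdout, so redirecting the wrapper's output to a file collects everything across restarts.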

The script has been running for about 5 days now. I’ve got about 647 MB of data totaling north of 250,000 tweets. I can’t tell if the script has needed to do a restart, but I’m impressed it’s lived this long.

And 250K tweets, with metadata, should make for some interesting analysis.


Parking Talk

A maxim to live by, from yours truly:

There’s no need to sit in a parked car, listening to sports talk radio. It’s just sports blather. They’ll make more.

Occasionally I have to catch myself on this one.


Plexus Ranger Chronicles: Week 4

All I have to say is, I’m glad Aaron Rodgers is on my fantasy football team. With the scoring rules in my league, he dropped 51 fantasy points, far and away the top scorer in the league.

Which was pretty handy, seeing as how the opposing fantasy QB was Cam Newton, who wasn’t too shabby at 30 points himself. Plus my opponent also played Greg Jennings for 19 points, Ray Rice for 20, and Darren McFadden for 14.

Besides Rodgers, the key support on my team came from WRs Santana Moss for 12 points and Julio Jones for 18, RB Fred Jackson for 18, and TE Jason Witten for 19. Between Rodgers and Witten, I cruised to an easy 26 point lead, although I had to wait out a potential Ray Rice explosion.

The league only has 8 teams in it, so there’s plenty of talent to go around. The key is getting one or two really outstanding performances, which is a combination of matchups and luck.

At this point I’m 2-2, fighting for the 4th playoff spot. Still plenty of games to go, but I have to put a little streak together. I think I’ve got a viable core, but I’m always scuffling on a spot or two. At least this week, I don’t think anyone got injured.

The NFL is heading into its stretch of bye weeks, which puts the bench and the waiver wire at a premium. I think I’m configured for success during these challenging weeks, but we’ll see.

Update: Title modified for the correct team in 2011. Yeesh! I’m slipping.

Update 2: Oops! Forgot about Rashard Mendenhall pulling a hammy, and possibly missing next week. Then again, he was due for a benching anyway.


@bigdata

Ben Lorica runs the @bigdata Twitter stream, which routinely provides good big data related nuggets. I don’t actually follow the stream because he occasionally summarizes the best tweets on his blog.


PostGIS

Speaking of PostgreSQL, it’s also worth mentioning PostGIS, which is a fabulous geospatial extension for PostgreSQL. Nelson Minar recommends the book PostGIS in Action to supplement the PostGIS documentation. As an example, PostGIS is improving in the area of doing nearest neighbor searches.

Alternatively, one can try SpatiaLite which embeds GIS capabilities in the self-contained SQLite RDBMS. However, I found SpatiaLite had a lot of moving parts and didn’t get a lot of mileage out of it.

Sideways observation: Pinboard doesn’t make individual item URLs obviously visible in its HTML. A little digging in an RSS feed reveals their existence. Turns out, on the person’s web page, the dates are the permalinks for the bookmarks. Useful for giving credit.


PL/Python

Been noodling around with PostgreSQL at work. Great relational database system. Glad I got a chance to get back to it. Gets even better with age.

For the longest time, I’d never really used stored procedures in an RDBMS because the languages to write the procedures were unappealing. However, PostgreSQL has embedded scripting languages, including Python, as procedural languages. Makes things a lot more fun.

One caveat: it was pretty tricky to convince the PostgreSQL build process on Mac OS X Lion not to use the system Python. That matters if you don’t have root privileges, since a Python of your own lets you install modules, which comes in handy. I’ll have to look up the mods and post them since it could save someone a little bit of time.


Plexus Rangers Chronicles: Week 3

Well the Fantasy Football gods giveth, and the Fantasy Football gods taketh away. The Rangers got beat by a little under 48 points. I was saved from the worst beating award by someone who got shellacked for a 60 point loss.

This is the third week in a row I’ve had a player incur an injury that put them out the next week. I lucked out in that Steven Jackson and Miles Austin had decent and great games. Kenny Britt wasn’t so lucky, done for the season with no production.

I’m patting myself on the back for picking up Buffalo’s Fred Jackson off the waiver wire. Only he and Aaron Rodgers scored in double digits for me. Meanwhile the opposing team had five 15+ point scorers, including Jermichael Finley going for 30. Note that it’s pretty difficult in fantasy football to gain ground when your quarterback is throwing TDs to the other guy’s tight end.

Biggest disappointment was Rashard Mendenhall, who couldn’t do anything against the Colts. The Pittsburgh running game is giving me the willies.

Anyhew, back to the waiver wire. Gotta scrounge another receiver.


Data Science Teams

Link parkin’: DJ Patil, who built LinkedIn’s data team from the ground up, expounds on Building Data Science Teams.


Minar’s Pinboard

I don’t follow too many linkblogs these days, but Nelson Minar established a Pinboard account to capture his URLs of note. He claims it’s better than his blog. Don’t know if that’s true, but in NetNewsWire I’ve been flagging a lot of links he’s been pointing to. Plenty of good, solid, diverse hacking content. Time to give the stream a little publicity.


Woody’s Worst

I’m sort of binary on Woody Allen films. If he’s in the film, I can’t stand it. If he’s not I generally enjoy the flick: Match Point, Cassandra’s Dream, Vicky Cristina Barcelona, Celebrity. The only exception has been Manhattan, but then that’s a very exceptional film. His typical nebbish take drives me batshit.

I made the mistake of watching Anything Else, even though I knew Allen had a role, principally because I like looking at Christina Ricci. It was doubly horrible because Jason Biggs filled the Allen lead role to accurate effect, meaning you had twice the nebbishness. Yeesh!

Glad someone else is in violent agreement, calling it Woody Allen’s worst film. Oddly enough, Tarantino apparently loved Anything Else. No accounting for taste.


I <3 HDNet Movies

Okay, for all I know they probably don’t have the biggest catalog, but HDNet Movies has made a big impression on me over the last couple of weeks. First it was just the delight of seeing Salma Hayek in After the Sunset. Then it was being reintroduced to The Rundown.

Did I mention the movies are uninterrupted (cf. the suckage of IFC)? Unmodified? Except for the occasional, tastefully displayed HDNet Movies logo. Letterboxed, to boot, so no pan and scan.

But the closer is that somebody in the programming department thought today would be a great day for a Quentin Tarantino double bill: Pulp Fiction and True Romance. Yeah, I know Tony Scott directed True Romance, but Tarantino wrote the script and Scott retained most of Tarantino’s crime, violence, and dialogue oeuvre. It’s essentially a Tarantino film.

Brilliant. Two of my favorite films of all time, and Pulp Fiction is in my top 3.

Set DVRs to stun.


Slow Lion

Grrrr. My trusty MacBook is feeling a bit pokey for the first time ever. I may have reached the limits of its resources what with a bazillion tabs open in Chrome, a few in Firefox, and a handful of other apps running. Plus, I’m using Spaces.

Maybe some tab clearance will help out, but whenever the next edition of MacBooks or MacBook Airs come out, I will seriously be lusting.


The Fantasy Philosophy

I realized this weekend, that whatever success I have in fantasy football is based upon the following principle. During the currently playing weekend, I start planning for the next weekend.

By halfway through the first game on a Sunday afternoon, I’m already scanning the waiver wire for promising mid-week pickups. In leagues where you can still make afternoon free agent claims, I’ve jumped on key players after early injuries.

A second minor philosophy is to consider the real world matchups, but not overthink my lineup. Don’t mess with players on bad teams (cf. Indianapolis Colts 2011). Studs always play.


Plexus Rangers Chronicles: Week 2

Didn’t manage to follow up on the week 1 quick hit, but the key to that week was subbing out about 20+ points that could have made the difference. Picked the wrong kicker. Picked the wrong DEF. Stuck with Rashard Mendenhall against the Ravens. Then dumped Ahmad Bradshaw, with a decent matchup, thinking I might be able to make a late afternoon waiver move. Oops! All free agents hit the waiver wire at the start of Sunday, so nobody can make moves during the day. Know your league’s rules son!

Week 2 was a much happier affair. I had some plush match ups, and actually made some good GM moves.

  • Figured Miles Austin was going to have a big day against SF. Didn’t know he was going to be the top WR for our league this week.

  • Made the brilliant move of picking up and starting Buffalo’s Fred Jackson. Finished in the top 5 and only 0.8 points out of the top RB scoring position.

  • Rob Bironas kicked his way back into form, and Houston’s defense did a decent job against Miami.

  • Stuck with regulars Aaron Rodgers, Jason Witten, Rashard Mendenhall, and Santana Moss. Everybody delivered double digit fantasy points, with Moss surprisingly totaling 14.

With everyone on my team scoring double digits, I cruised to a 40+ point victory. The only downside is that Austin looks like he’s going to be on the shelf for a few games. That’s okay though, there’s a couple of emerging WRs, with good matchups, I’ve got my eye on. Been working hard to make up for my draft day bye blunders. I’m feeling that waiver wire magic again, revamping my team week to week.


The Rundown, Guilty Pleasure

The Rundown popped up on HDNet this past week. A little over 8 years since its release, the movie has held up surprisingly well. Actually, it shouldn’t be all that surprising. The Rundown is a pretty basic action film, not overly enhanced with sci-fi techno elements, or CG special effects. Fights, double-crosses, chases, captures. Hero wins in the end.

Herewith some of the reasons I enjoyed watching it yet again:

  • Dwayne “The Rock” Johnson actually evinced a little acting depth.

  • The director, Peter Berg, tightly paced the flick with no fat. Comes in at a clean, efficient hour and a half. Plus there’s actually a distinctive visual style to the whole affair.

  • Christopher Walken doing a great, batshit exploitative warlord.

  • Rosario Dawson is, as always, easy on the eyes.

  • The film is actually funny in quite a few spots. Go away monkey!! Seann William Scott’s scruffy ass actually adds a solid comedic lift.

  • The kick ass opening sequence. Heck, all the fight sequences. Well done and not your run of the mill punch out.

You could do quite a bit worse than The Rundown on a slow, weekend afternoon.

Update. Forgot about the Arnie cameo.


RIP Stathead

Well that didn’t last too long. I just locked onto the Stathead flow back in June. While I found the signal to noise ratio wasn’t very high, I got a solid nugget or two.

Of course it takes a lot of human effort to filter the stats oriented portion of the sportsosphere. And the humans have just given up.


XNF, Excellent Night Football

As recently as 2005, the NFL’s primetime games were quite often blowout duds. Especially late in the season when matchups that looked exciting in July fizzled because one or both teams had disintegrated. This eventually led to the flex system which allowed the networks to schedule more attractive late season games into the primetime slots.

I don’t know what deal the league made with the devil for the 2011 season, but the night games for the season opening were amazing. To wit:

  • The last two Super Bowl champs, Green Bay and New Orleans, go toe-to-toe in a last play, goal line stand thriller. Rodgers and Brees come out firing on all cylinders. Plus there was a 108 yard kickoff return for a TD. That’s your season opener.

  • The Sunday night game showcases America’s Team, Dallas, versus the New York Jets. In New York. On the tenth anniversary of 9/11. Hometeam goes on a dramatic 17 point 4th quarter scoring spree to pull out the comeback victory. Said spree includes a blocked punt for a TD.

  • ESPN has a double header on Monday night, with a decidedly lukewarm tilt between New England and Miami to start off. Miami hangs in for about 3 quarters. The Fish were especially helped by a pick-6 off of Tom Brady, his first interception in 350+ attempts. Then the Pats explode to put Miami down, but along the way Wes Welker goes for a 99 yard TD reception and Brady totals 517 yards in the air.

  • The second ESPN game is the Raiders vs Denver, two mediocre teams that hate each other. Per usual there’s a scrum after just about every play. While outplaying the Broncos, the Raiders can’t pull away. A 90 yard punt return sparks the Broncos in the second half, but the Raiders manage to hold on for the win. Did I mention that Sebastian Janikowski kicked a league record tying 63 yard field goal?

No real duds, lots of riveting action, and even some historical performances. Way to kick off the season NFL.


Cyberduck S3 Multipart Uploading

Taking some “vacation” time this week, and what am I doing? Uploading multi-gigabyte files to Amazon S3. So I’m a nerd.

Turns out though that uploading files greater than 5GB requires multipart uploading, which breaks the file up into chunks for better throughput and reliability. Client support for this part of the S3 API is not obvious.
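To make the chunking concrete, here's a small Python sketch that only plans the parts and uploads nothing. It assumes the S3 limits as I understand them: every part except the last must be at least 5 MiB, and an upload can have at most 10,000 parts. The name `plan_parts` and the 64 MiB default are my own choices, not anything Cyberduck or the S3 API prescribes.

```python
MIB = 1024 ** 2
MIN_PART = 5 * MIB    # assumed S3 minimum for every part except the last
MAX_PARTS = 10_000    # assumed S3 ceiling on parts per upload


def plan_parts(file_size, part_size=64 * MIB):
    """Return (part_number, offset, length) tuples that cover file_size bytes.

    Hypothetical helper: it computes byte ranges only.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    part_size = max(part_size, MIN_PART)
    # Double the part size until the whole file fits within MAX_PARTS parts.
    while (file_size + part_size - 1) // part_size > MAX_PARTS:
        part_size *= 2
    parts = []
    offset = 0
    number = 1  # part numbers start at 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Each tuple is a byte range a client could send as one part, retrying just that range if the transfer hiccups, which is where the reliability win comes from.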

Hat tip to Joe Miller for cataloging a few S3 clients that do support multipart uploads, including Cyberduck. Open source, free as in beer, cross platform, and eminently usable, Cyberduck is the Swiss Army knife of file uploading. Just for this capability I will be making a donation shortly.


Ending A Drought

I can’t speak for every working parent, but The Job(™) and The Kid(™) have definitely killed my outbound movie going experience. At home, you’re bombarded by film choices for your in-residence studio: Premium Cable, Pay Per View, Netflix, iTunes, Amazon Instant Watch, not to mention good old Blu-Ray and CDs for the old timers.

For the classic movie theatre experience? Overpriced, blockbustered up, and inconvenient? Not so much. Until today in the theater I’d seen exactly 1 movie in the last 24 months: Toy Story 3. (Surprisingly good for the third of a trio).

Every now and then though, one has to partake of “cinema” just to remember what it was like. Since I was taking a day off I caught Rise of The Planet of The Apes (surprisingly good for a reboot). Yeah, I gouged myself even on a matinee ($8.50?! Don’t ask about the popcorn and soda I got), but I was the only one in the viewing at Leesburg’s spanking new Cobb 12 Theater, which is quite nicely appointed. Think full service restaurant, full bar, spacious accommodations attached to your friendly neighborhood multiplex. No great shakes for the rest of the country (Evanston, IL had the same setup when I lived there over 5 years ago), but relatively new to good old Loudoun County.

The movie was good, not great. Didn’t feel stupider having seen it. No commercial ads, reasonable number of trailers. Nobody talked, cried, texted, or took a call during the screening. The seats were comfortable and the space clean. To my eyes the projectionist didn’t screw up. A 5 minute drive from my humble abode. What’s not to like? Other than the prices.

I’ll probably never get back to my grad school heyday of seeing multiple double bill classics at the Berkeley Theater in a week (The Godfather, and The Godfather, Part II back-to-back, $7, FTW! RIP Berkeley). But there’s still a little life in The Big Screen business.


Plexus Rangers Chronicles: Week 1 Quickshot

Deeper analysis later in the week, but suffice it to say the Rangers went down already (no Monday night players left), and went down hard. Major disappointment from my running backs. While Steven Jackson managed to get in the end zone, he hurt himself along the way. Rashard Mendenhall just stunk.

Shot myself in the foot by reading the Yahoo projections and switching the Bears defense out and the Browns defense in.

And I gotta get more than 2 points out of my kicker.


EPL Quick Thoughts

Another Premiership Saturday has come and gone. After the first month we’re starting to get a little insight.

Manchester United is looking pretty good. Then again Chelsea started out with 6 straight wins last year. Not that the Blues completely collapsed, or that I’d bet against the Red Devils, but it’s a long season.

Chicharito is a bad dude.

Sheik Mansour is getting his money’s worth out of the Man City club. And Carlos Tevez just hit the pitch for his first match.

Maybe it’s just Fox Soccer Channel house style, but I don’t like how after halftime the announcers don’t reset who’s on the pitch.

Premier League games are a welcome replacement for the early NCAA talk shows. That’ll probably work well for the NFL yakfests as well. The matches are great because they move at a good pace, no commercial interruptions, and they finish within 2 hours.


Warehouse-Scale Computing

Link parkin’: Warehouse-Scale Computing: Entering the Teenage Decade. Comes highly recommended. Warning though, the material is a 50 minute, 1 GB QuickTime file, password protected by an ACM account. You can Flash stream it as well.

Seems like real-time at global scale is the new frontier.


Plexus Rangers Chronicles: Draft 2011

Of course on this initial night of the 2011 NFL season, it is most appropriate that I regale you with the results of my fantasy football draft. At this point in time (1:53 left in the Green Bay vs New Orleans game) it’s sort of hard to complain about the results. I got Aaron Rodgers (Cal product!) with my second pick and he’s currently racked up 25.77 fantasy points.

Still with the 6th position in a snake draft I was actually disappointed with my selections quite quickly. The problem is that I was really indecisive with my strategy. Conventional wisdom is that you focus on getting two top running backs, but that strategy never really works for me. I thought I was going to follow that strategy, then waffled, then didn’t do a good job with my wide receivers. It wasn’t so much that I didn’t get great receivers, it’s that they both have the same bye week! And did I mention both of my running backs have the same bye week? Yup you guessed it. The same week as my receivers. Stupid. Now I’ve got a big roster scramble coming up in week 5.

Draft details after the break.

Overall selection #’s in parens:

  1. (6) Rashard Mendenhall (Pit - RB)
  2. (11) Aaron Rodgers (GB - QB)
  3. (22) Steven Jackson (StL - RB)
  4. (27) Miles Austin (Dal - WR)
  5. (38) Jason Witten (Dal - TE)
  6. (43) Santana Moss (Was - WR)
  7. (54) Mark Ingram (NO - RB)
  8. (59) Ahmad Bradshaw (NYG - RB)
  9. (70) Julio Jones (Atl - WR)
  10. (75) Ben Roethlisberger (Pit - QB)
  11. (86) Rob Bironas (Ten - K)
  12. (91) Chicago (Chi - DEF)

According to the Yahoo! average draft, I might have overreached on Jason Witten, Santana Moss, Julio Jones, and Mark Ingram. Then again I got a little bit of a steal on Ahmad Bradshaw and Ben Roethlisberger. With an eight team league, there’s not much of a premium on QBs after Rodgers, Brady, and maybe Brees. But I’m hoping to parlay Big Ben into some talent with a trade later in the season. Somebody’s QB has to go down this year!

I’m guessing my team will go as Aaron Rodgers does. Running back might be a strength, might be a bust. If Tony Romo comes back big, Miles Austin could be a solid number one receiver, but my number two is going to be a pain all year.

And I can’t stress enough how stupid not managing the byes was. Lesson learned.


Commodity Focused Crawling?

Speaking of outsourced web crawling, I wonder if it would be possible to build a focused crawler on top of the 80legs infrastructure. A lot has changed since Filippo Menczer first introduced the concept. Sophisticated client side web programming, cloud computing, social media. Given today’s vast sprawl of the Web, there are a lot of tasks where high topical precision, completely forsaking recall, would be really useful.

As to implementation, you probably can’t get into the retrieval loop as tightly as a custom crawler, but release from the headaches of actual page fetching could free up thought cycles for creative foraging approaches. Outsourcing the hard parts would probably radically improve reliability and availability.

Another upside is that with a good front end you could scalably provide crawlers to end users. Wonder what dirt cheap personalized, adaptive, social spidering could enable.


Timing is Everything

Enjoyed Paul Lamere’s documenting how he processed a very large music dataset using Amazon S3 and Amazon Elastic MapReduce. Paul was hacking the Million Song Dataset, but the info seems highly relevant to my noodlings with the MemeTracker data. It’s pretty amazing anybody off the street can harness that much processing power for 10 bucks. And it finished for Lamere in 20 minutes.

One thing I’ve learned, munging a few tens of gigabytes, you start to time everything. Once things start taking longer than 10 minutes or so you want to 1) figure out if you’re doing something stupid, and 2) figure out how to make it faster.
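For the time-everything habit, a tiny context manager does the trick. This is just a sketch of one way to do it; the name `timed` and the `log` parameter are my own inventions, not something from Lamere's writeup.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log=print):
    """Report how long the enclosed block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        log(f"{label}: {elapsed:.2f}s")


# Example: wrap any munging step to see where the minutes go.
# with timed("count lines"):
#     total = sum(1 for _ in open("data.txt"))  # data.txt is a placeholder
```

Wrap each stage of a pipeline and the slow one, the one worth rethinking, tends to announce itself.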

Personally, I’ve got a short term goal to get my data up onto S3. Then I’ll be digging into MrJob, which I’ve run across previously.


Plexus Rangers Chronicles

Forewarned is forearmed. As in previous NFL seasons, I’ll be playing in an office fantasy football league, and providing weekly updates, if not more frequent. Even though I was a pretty poor chronicler, the 2010 edition of Doom Patrol at least made the playoffs last year. Lost my first game, but won the consolation for a third place finish.

This year, I’ll be taking it a bit less seriously, especially the league draft which is tomorrow. I’ve found I’ve been most successful when I really work the waiver wires well. Last year I had the overall #1 pick, which netted me Chris Johnson of the Titans. Of course Johnson had a subpar season, mainly because the Titans stank for half of the year, which had me scrambling a lot.

Two years ago, I managed to rack up two titles. Time to get back to the mountaintop.

P.S. Given Doom Patrol’s failure last year, the name got retired. This year it’s the Plexus Rangers in action.


Blogaversary and Macaversary 3

Well the dates slipped by early last month, without jostling a remembrance on my part. But it’s been over three years since I started this blog which was mainly driven by the fact that I had a new, bottom of the line, MacBook I needed to put to use.

Functionally, the computer’s been holding up pretty well. Of course I lust for one of the new PowerBooks or Airs, but day to day my trusty old MacBook easily supports what I need to get done. The physical elements are showing a bit of wear and tear. Battery doesn’t quite fit correctly. Parts of the body are chipped and dinged. Still a keeper though.

As for the blog, Ev Williams pointed out “You have to be a bit more dedicated to blog than to tweet or post on [Facebook] now and then.” Except I don’t really tweet or use Facebook, so I have no excuse. I’ve had some good runs in the past, but I’ll be trying to elevate the dedication over the next few months.


Not Mawking Awk

Recently I started munging some really large datasets for work and for fun. In just doing some basic statistical verification I learned a lesson that Brendan O’Connor documented: gawk is really useful, reasonably fast, and some versions blaze.

I’ve got to go and check out mawk, even if it is old and slightly busted.


Outsourcing Web Crawling

Once upon a time, web crawling was the province of manly men. Men who wrote hairy systems code that optimized DNS queries, tweaked operating system TCP kernel options, avoided robot traps, and honored complex, dynamic revisit schedules.

Nowadays, thanks to 80legs, you can fork over 99 bucks a month, set up 3 crawl jobs, repeating if you like, and get the contents of a million URLs. All from the comfort of your own home.

This is progress.


Life in a Meta City

Cities in Fact and Fiction, an interview with William Gibson, ran across my aggregator recently. Didn’t really notice the link to his essay Life in a Meta City, but that may get me to buy a copy of this month’s Scientific American.

Not to mention that I’m just generally interested in the science of dense urban centers and the whole September issue is devoted to cities.

I’m guessing he hasn’t even started on a post Zero History book, but I’m wondering what Gibson will come up with next.


TIL pip requirements

Today I Learned about pip requirement files. pip is a package installation tool for Python. It’s great when you combine it with virtualenv, so that you can easily build up complex Python installations with clean isolation from your base installation.

The kicker is pip freeze > req_file.txt and pip install -r req_file.txt, which will stash your installed packages and load them into a new environment. Makes it easy to kit out a new virtual environment with your favorite 3rd party modules. You only have to figure out the complete list of stuff you want once and then install it with one command line. And of course you can keep around variations of req_file.txt for different install types.
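If you'd rather drive the freeze and install steps from a script, here's a minimal Python sketch. The helper names are hypothetical, and it assumes pip is importable by whichever interpreter you point it at (it goes through `python -m pip` so the right environment gets used).

```python
import subprocess
import sys


def freeze_command(python=sys.executable):
    """Command list equivalent to `pip freeze` for the given interpreter."""
    return [python, "-m", "pip", "freeze"]


def install_command(req_file, python=sys.executable):
    """Command list equivalent to `pip install -r req_file`."""
    return [python, "-m", "pip", "install", "-r", str(req_file)]


def save_requirements(req_file, python=sys.executable):
    """Stash the interpreter's installed packages in req_file."""
    result = subprocess.run(freeze_command(python), check=True,
                            capture_output=True, text=True)
    with open(req_file, "w") as fh:
        fh.write(result.stdout)


def load_requirements(req_file, python=sys.executable):
    """Install everything listed in req_file into the interpreter's environment."""
    subprocess.run(install_command(req_file, python), check=True)
```

Point `python` at the interpreter inside a fresh virtualenv and `load_requirements("req_file.txt", python=...)` kits it out in one call.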

Bonus: You can use virtualenvwrapper post install hooks to automatically run pip on a newly created virtual environment. You are using virtualenvwrapper aren’t you?

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.