
Solr vs Elasticsearch

Haven’t had a chance to dig into this comparison of elasticsearch and Solr, but I’m link parkin’ to keep an eye on the alternatives. Have to say I’m currently enjoying elasticsearch at work, but it’s still early in my experimentation.

A good Solr vs. ElasticSearch coverage is long overdue. We make good use of our own Search Analytics and pay attention to what people search for. Not surprisingly, lots of people are wondering when to choose Solr and when ElasticSearch.

As the Apache Lucene 4.0 release approaches and with it Solr 4.0 release as well, we thought it would be beneficial to take a deeper look and compare the two leading open source search engines built on top of Lucene – Apache Solr and ElasticSearch. Because the topic is very wide and can go deep, we are publishing our research as a series of blog posts starting with this post, which provides the general overview of the functionality provided by both search engines.


Useful That

[embed]https://twitter.com/lintool/statuses/252868723436822528[/embed]

I have it on good authority that this Domingos guy knows a thing or two about machine learning.

Ha, ha! Only serious.


Ruh-roh Apple?

I upgraded my lil’ ole MacBook to Mac OS X 10.7.5 (not worthy of Mountain Lion support) and now the poor thing has been crashing. And it borked my alpha of Tweetbot for Mac (fixed). May just be age, but I’m hoping there’s a pending OS update to fix some obscure issue.

I upgraded my iPhone 4 to iOS 6 and now I think I’m suffering the battery life issues many others have been seeing. Haven’t done a scientific investigation, and an uptick in audio streaming over 3G may be responsible, but still irritating.

Not really Apple’s fault but …, I went to the AT&T store a day after the iPhone 5 launch to put in my order. Okay, it’s going to take 3-4 weeks, but I’m a patient guy. However, it shouldn’t take me 30 minutes to get a hold of a sales rep and complete the pre-order. Yikes!

And then there’s that maps issue.

Stuff used to “just work”.


More itertools

[embed]https://twitter.com/pypi/status/250302968228880384[/embed]

Here I’ve collected several routines I’ve reached for but not found. Since they are deceptively tricky to get right, I’ve wrapped them up into a library. We’ve also included implementations of the recipes from the itertools documentation. Enjoy! Any additions are welcome; just file a pull request.
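That “deceptively tricky to get right” bit is real. Even the classic pairwise recipe from the itertools documentation hinges on a subtle step, the one-element advance of the second iterator, that’s easy to botch if you hand-roll it with indexing:

```python
from itertools import tee

def pairwise(iterable):
    # Yield overlapping pairs: s -> (s0, s1), (s1, s2), ...
    # Recipe from the itertools docs; tee gives two independent
    # iterators, and advancing b once offsets them by one element.
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

print(list(pairwise([1, 2, 3, 4])))  # [(1, 2), (2, 3), (3, 4)]
```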


Manna

I finally gave up on the “coffee shop” options here in greater Leesburg, VA: three Starbucks, a combo cafe and sushi bar, and a retrofitted cobbler’s storefront. The latter’s somewhat charming, but doesn’t quite feed the need.

What’s a former illycaffè barista to do? At home, I could do a standard drip brewer, or a French press, or even a small espresso machine. But all of those are too inconvenient. When I’m at home and feel the urge, I don’t need top of the line java, just something to meet the craving.

So I bought a Keurig B70 Platinum K-Cup coffee brewer this past weekend. The Keurig scratches the itch when I get a few free moments at home to read, or hack, or just stargaze. Doing that in a public place with interesting foot traffic? That’ll have to wait until our next abode.


elasticsearch Cool?

I’ve had an eye on elasticsearch for a while now but didn’t really have a good reason to use the Lucene-based search engine. Until today that is, when at work I was looking at the use cases I was applying MongoDB towards. I’m not a hater, but MongoDB just wasn’t fitting the bill.

Time to give elasticsearch a shot. At least Luca Cava of orange11 thinks elasticsearch is cool:

First of all, what you’ll notice as soon as you start up is how easy elasticsearch is to use. You index your JSON documents, then you can make a query and retrieve them, with no configuration needed. One of the reasons is that it’s schema-less, which means it uses some nice defaults to index your data unless you specify your own mapping. For more precision it has an automatic type guessing mechanism which detects the type of the fields you are indexing, and it uses by default the Lucene StandardAnalyzer while indexing the string fields.

If you need something beyond the standard choices you can always define your own mapping, simply using the Put Mapping API. In fact every feature is exposed as a REST API.

Seems auspicious, but we’ll put the engine to the test.
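To get a feel for that Put Mapping API, here’s roughly what a mapping payload looks like. The index and field names below are made up for illustration, and the commented REST call assumes an elasticsearch instance at localhost:9200:

```python
import json

# Hypothetical index ("articles") and type ("article") names.
# The equivalent REST call would be something like:
#   curl -XPUT localhost:9200/articles/article/_mapping -d @mapping.json
mapping = {
    "article": {
        "properties": {
            "title": {"type": "string", "analyzer": "standard"},
            "published": {"type": "date"},
        }
    }
}

body = json.dumps(mapping)
print(body)
```

Everything else about the engine is driven through the same style of JSON-over-REST calls.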


Thirty For Twelve

Milestone. Twelve months straight of 30+ posts per month in this here venue. Sometimes it’s a grind, but routine is good. And I’ve stashed away a heck of a bunch of good information. Still need to get cranking on a project that I can narrate.


eGenix PyRun

eGenix PyRun looks like it might come in handy some day.

Our new eGenix PyRun™ combines a Python interpreter with an almost complete Python standard library into a single easy-to-use executable, that does not require a system wide installation and is fully relocatable.

eGenix PyRun’s executable only needs 11MB, but still supports most Python applications and scripts - and it can be further compressed to 3-4MB using gzexe or upx.

Compared to a regular Python installation of typically 100MB on disk, this makes eGenix PyRun ideal for applications and scripts that need to be distributed to many target machines, client installations or customers.


Ask, Answer, Tell, [Suggest?]

Sean J. Taylor interned with the LinkedIn Data Science team and had some cogent observations on what the work can really boil down to:

Most people who describe data science are actually describing what it takes to get the job. (e.g. Take statistics and machine learning courses. Learn such-and-such languages/packages. Hack.). Or they describe how the job is growing in importance. I’m going to describe the actual practice of data science.

The Data Science Loop:

  • Ask a good question.

  • Answer the question while economizing on resources.

  • Communicate your results.

  • (Sometimes) Make recommendations to engineers or managers.

The whole thing is worth a read, but most importantly, the closing graf.


Wayne Enterprises Chronicles: Week 3

Victory! Although Cam Newton had me scared on Thursday with a lousy eight points. That’s 3 and 0 for those counting at home. I’m leading the league at this moment, although as in all things fantasy, my luck could turn at any minute.

The big surprises this week were Darren McFadden with 18.5 points and Tony Gonzalez with 19.6 points. Great production out of RB2 and the TE position. McFadden scored more than Arian Foster (!) my, and the league’s, overall number one pick. Both my wide receivers were solid with double digit performances and even K Robbie Gould pitched in with 13 points.

I had a comfortable enough lead going into the Sunday night game that I could even leave the Ravens DEF in against the New England offense. Luckily they didn’t go negative, or I really would have been sweating. Serious consideration was given to benching the Ravens.

102 points without much production from a key position, QB, just shows the strength of this team. And Cam Newton might get benched for RGIII this weekend!


pyDAWG

Okay, tons of background knowledge is great, until you find yourself having to search 20Gb / 270+ million lines to find anything in that mountain of data. Full text indexing seems like the right choice, but again, who wants to deal with Solr/Lucene, ElasticSearch, or Sphinx just to get started?

Enter DAWG

This package provides DAWG-based dictionary-like read-only objects for Python (2.x and 3.x).

String data in a DAWG (Directed Acyclic Word Graph) may take 200x less memory than in a standard Python dict or list and the raw lookup speed is comparable. DAWG may be even faster than built-in dict for some operations. It also provides fast advanced methods like prefix search.

Based on dawgdic C++ library.

I’m going to give it a shot at work and see what happens. The big upside is fast prefix search (hopefully) over and above a key/value store’s fast key lookup. Only obvious downside I can see is constantly hearing Xzibit in my head when I read the docs for this module.
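For a sense of why fast prefix search is the draw, here’s a crude stand-in built on nothing but a sorted list and bisect. This is not the DAWG package’s API, and it burns far more memory, but it shows the operation I’m after:

```python
from bisect import bisect_left, bisect_right

def prefix_search(sorted_keys, prefix):
    # All keys sharing `prefix`, found with two binary searches:
    # the range [prefix, prefix + highest-codepoint-sentinel].
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_right(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

keys = sorted(["dawg", "data", "database", "datum", "dict"])
print(prefix_search(keys, "data"))  # ['data', 'database']
```

A DAWG gets you the same query shape while compressing shared prefixes and suffixes into one graph.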


What A Trip!

[embed]https://twitter.com/Dope_Den/status/249882120196079618[/embed]

Two decades ago, Chicago-native Mark Farina began experimenting at his DJ gigs, dropping slower, deeper tracks along the lines of disco classics, acid-jazz, hip hop, and downtempo. In Chicago, which is primarily known as a house music town, his selections were deemed more appropriate for home listening than for the dance-floor. Despite this, Farina continued to develop his sound, which soon after became coined as Mushroom Jazz. A few mixtapes followed, but his experiment-turned-endeavor finally got off the ground in his new home, San Francisco.

Farina juggled between his two residencies of Chicago and San Francisco until he made a permanent move to SF in 1994. He saw opportunities in the city with its vibrant music scene, which featured a slew of promoters catering to fans across a broad spectrum of music styles including hip hop, jazz, house, and reggae to Wicked-style breaks and techno. A couple years before he made his move, he teamed up with Patty Ryan-Smith to throw his own Mushroom Jazz club night.

I’ve done some dumb things in my life. Not going to a Mushroom Jazz event, when I was essentially a very gradual student at Berkeley, has to rank pretty high on the list.


Diggin’ On Whoosh

I’m starting to observe that when doing data exploration, right after summary statistics, keyword style searching is high up on the TODO list. Until you really need them though, pulling out the big boys like Solr/Lucene or Sphinx is sort of a pain. When you’re in iterative exploration mode, the tax of dealing with enterprise scalable software is substantial. YAGNI probably applies. However, if you’re of a Pythonic mind, then Whoosh is a nice, lightweight starter toolkit.

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.

I’ve been incorporating Whoosh into some data analysis on a tiny data set and it’s been a blast. So much so I’ll soon try it out on a bigger, but not massive, pile of bits. A nice feature of a pure Pythonic search library is that you can stash arbitrary Python data structures in the index. This really increases the utility of dealing with search results as opposed to having to go to another store to retrieve more complex non-indexed objects.
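For the curious, the core structure under a library like Whoosh is an inverted index mapping terms to documents. A toy stdlib sketch of the idea, nowhere near Whoosh’s actual API (no analyzers, scoring, or on-disk storage):

```python
from collections import defaultdict

# Minimal inverted index: term -> set of document ids.
docs = {
    1: "whoosh is a pure python search library",
    2: "solr and sphinx are heavier search engines",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(*terms):
    # Documents containing every query term (AND semantics).
    sets = [index[t] for t in terms]
    return set.intersection(*sets) if sets else set()

print(search("search"))            # {1, 2}
print(search("python", "search"))  # {1}
```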

Whoosh is also useful for embedding in Swiss Army command shells built using cliff.


python-omgeo

Link parkin’:

[embed]https://twitter.com/pypi/statuses/248871216629313536[/embed]


Background Knowledge

Diggin’ in the feed cratez, ran across From Words to Concepts and Back: Dictionaries for Linking Text, Entities and Ideas, announced by Google’s Valentin Spitkovsky and Peter Norvig:

Human language is both rich and ambiguous. When we hear or read words, we resolve meanings to mental representations, for example recognizing and linking names to the intended persons, locations or organizations. Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is also Google’s core competency, and important for many other tasks in information retrieval and natural language processing. We are happy to release a resource, spanning 7,560,141 concepts and 175,100,788 unique text strings, that we hope will help everyone working in these areas.

And the data is readily accessible, with a publication capturing the details.


The Data Kalashnikov

[embed]https://twitter.com/pudo/statuses/248473299741446144[/embed]

Word. Although I will say that today I managed to export two Excel worksheets to tab-separated-value text files encoded in UTF-16, and then safely turn them into CSVs using Python’s csvkit. No children were harmed.
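The conversion itself is simple enough to sketch with the stdlib csv module (csvkit wraps up this kind of thing much more robustly); the in-memory sample below stands in for the actual Excel exports:

```python
import csv
import io

# A UTF-16 encoded TSV, as Excel's "Unicode Text" export produces.
utf16_tsv = "name\tcity\nRoss\tLeesburg\n".encode("utf-16")

# Decode (the utf-16 codec consumes the BOM), parse as tab-separated.
with io.TextIOWrapper(io.BytesIO(utf16_tsv), encoding="utf-16") as src:
    rows = list(csv.reader(src, delimiter="\t"))

# Re-emit as plain comma-separated values.
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(rows)
print(out.getvalue())
```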


psycopg2 and copy_from

Yesterday I was developing some code to do bulk loading of data into PostgreSQL. The db’s COPY command is the best, if not exactly the most inviting, way to do this, modulo possible extensions, but those haven’t worked for me.

Unfortunately, COPY reads from files local to the server or over standard input. I was dreading having to jerry-rig something out of SSH when I revisited the psycopg2 module. Turns out psycopg2 has a copy_from method that does what you’d expect, local or remote DB. Python FTW yet again.
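A sketch of the pattern, with hypothetical table and connection details since there’s no server handy here; the psycopg2 calls are commented out so only the buffer prep actually runs:

```python
import io

# copy_from streams a file-like object from the client into the
# server's COPY, so the data never has to live on the DB host.
rows = [(1, "alpha"), (2, "beta")]
buf = io.StringIO("".join(f"{i}\t{s}\n" for i, s in rows))

# Hypothetical connection; copy_from defaults to tab-separated input.
# import psycopg2
# conn = psycopg2.connect("dbname=mydb host=db.example.com")
# cur = conn.cursor()
# cur.copy_from(buf, "my_table", columns=("id", "label"))
# conn.commit()

print(buf.getvalue())
```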

Probably should have been using psycopg2 from the get go.


Storm Turns 1

Even though I don’t have an opportunity to use it, Storm feels like goodness.

Storm was open-sourced exactly one year ago today. It’s been an action-packed year for Storm, to say the least. Here’s some of the exciting stuff that’s happened over the past year:

And it looks like momentum is building.


Wayne Enterprises Chronicles: Week 2

Victory! It’s great to wake up on Monday morning, know victory is in hand, and have an extra player to run up the score with. My 43-point margin of victory is the highest in the league this year.

I take back any rude thoughts or comments I might have had about Arian Foster (23 points), Hakeem Nicks (30 points), and Cam Newton (25 points). Tony Gonzalez and the Ravens DEF chipped in with double digits as well.

Marques Colston and Darren McFadden? Not so much. Run-DMC is troubling because he’s manifesting his usual inconsistency, but he’s my number two running back. At least he’s not injured … yet. I may have to work up a trade or an inspired waiver wire pick. Colston probably won’t even be on my week three roster. I know he’s in a high powered offense, but he’s not delivering at the moment.

Last week every team plays before the bye weeks kick in. Then things get really fun. I feel good about my team, but I need to shore up second positions at RB and WR.


TIL @pypi

Today I Learned that PyPI has a Twitter account, @pypi

[embed]https://twitter.com/pypi/statuses/247637227658674176[/embed]

It’s a bit of a firehose, but worth it to pick up packages like fastcluster.


Gibson, Wired, io9ed

William Gibson has been out and about a bit, and giving some interviews. First with io9:

[embed]http://youtube.com/watch?v=JEXIqKRdz9A[/embed]

And a three-parter from Wired, 1, 2, 3


XLHost

Earlier, I complained about the cost of purchasing hosted PostGIS. An alternative? Run your own damned installation. If you’ve got a relatively small amount of data, 10s of Gbs, you can straightforwardly do this on a Virtual Private Server. At 100s of Gbs, approaching Tbs, things get really expensive. VPSes just weren’t designed for gobs of virtualized storage, except maybe at Amazon, and even there you run into fun with EBS if you want durability.

There is hope with dedicated servers, which are generally for the pros. Still a bit expensive for us knowledgeable amateurs. But XLHost, out of Columbus, Ohio, seems to have a few discount products within reach for those somewhat serious about stashing a large pile of bits. I’m not going to claim it’s cheap, but an industrious hacker with some disposable income could make a go of it.


Silver, Signals, & Noise

From Nate Silver

An excerpt from my forthcoming book, “The Signal and the Noise,” was published this week in The New York Times Magazine. You can find an online version of the excerpt here.

The book takes a comprehensive look at prediction across 13 fields, ranging from sports betting to earthquake forecasting. Since 2009, I have been traveling the country to meet with experts and practitioners in each of these fields in an effort to uncover common bonds. The book asks an ambitious question: What makes predictions succeed or fail?

Pre-ordering tonight. I had a thought about an interesting sports related challenge on the way home today. There’s enough historical play-by-play data for Major League Baseball, and possibly the National Football League, and straightforward enough grading, that you could build predictive models and test them real-time. Would be interesting to see how quickly and accurately a system could predict a game’s next pitch or play and outcome.

Might be some financial opportunities there as well.

Has it already been 4 years since 538 stormed the world? Time flies.


Wayne Enterprises Chronicles: Week 1

Victory! Week one brought a solid performance for Wayne Enterprises in my office fantasy football league. There was some nervousness when Arian Foster reported a knee injury at the tail end of the week and was something of a game-time decision. Shades of CJ2K biting me in the rear two years ago. But Foster gave me a solid 20 points, with a surprising 18 from Darren McFadden.

My starting receivers, Hakeem Nicks and Marques Colston, were disappointing with less than 10 points each. Cam Newton came in well under his projection as well. In contrast, Tony Romo and Miles Austin had big games. Cam probably gets another week, but I’m sorely tempted to plug in Austin.

Whatever. My team won handily, although I had to wait until late Monday night to really feel comfortable.

Yahoo! has these automatically generated game recaps that are actually pretty fun to read.


Retinaed

Apropos today’s announcements.

Okay. I admit it. I have become a legitimate Apple fanboy.

The laptop I’m writing this post on? A 13” 2008 White Plastic MacBook.

The laptop I do my day job on? A kitted out 15” 2011 MacBook Pro provided at my request.

My wife’s 2010 Christmas gift? An iPad, 2nd generation, AT&T 3G.

The handheld I use to distract My Little Guy(™)? A fourth generation 64Gb iPod Touch.

The cellular phone where I take calls, read Twitter with, track my weight, and text on? A 2010 16Gb iPhone 4.

The tablet I carry around just about everywhere I go these days? An iPad, 3rd generation, 64Gb, Verizon LTE. Damn if it ain’t a sweet product.

I’ve seriously considered purchasing an Apple TV and an Airport Express.

Keep in mind that I’m a built-my-own-computer, written-my-own-compiler, written-assembly-code-with-some-effort, ported-X-Windows-to-OS/2, used-Linux-on-floppy-disks, Scheme-embedding sort of hacker. I’m not quite sure if this state of affairs should be distressing.

In any event, note the last two products of mine that I listed are Apple’s first Retina Display products. I have to say, using both of those continuously, then going back to say the White MacBook display for an extended period, really highlights the difference.

So much so that a Retina Display is a must have feature for the next round of devices I’m lusting after, the next iPhone and a power user’s MacBook Air.

Thus have I been Retinaed.


Rule of Verticality

“It’s a good day to be vertical.”

We forge on.

That is all.

Hat tip to Garret Vreeland


PyCon Proposal Anti-Patterns

Funny! [embed]https://twitter.com/ctitusbrown/statuses/239007004536995842[/embed]

Definitely digging this WordPress embedded tweet feature. Next up YouTube vids.


Map Building @ Google

Somewhat grandiose, yet at the same time lightweight, Alexis Madrigal’s How Google Builds Its Maps—and What It Means for the Future of Everything still has a nugget or two of insight:

I came away convinced that the geographic data Google has assembled is not likely to be matched by any other company. The secret to this success isn’t, as you might expect, Google’s facility with data, but rather its willingness to commit humans to combining and cleaning data about the physical world. Google’s map offerings build in the human intelligence on the front end, and that’s what allows its computers to tell you the best route from San Francisco to Boston.

A couple of things I took away. Maps and other assorted geospatial artifacts are a really important way of organizing the world’s information. People are still important to making excellent maps. There is a logical, non-visual underpinning to how maps work.


Slow Copyright Progress

An interesting, contrarian perspective from Lucas Gonze:

News stories about copyright often pop up in the technology press. The subtext is usually like: “here’s how deeply broken copyright is.”

Don’t believe the subtext. Copyright is not broken in any sense that matters.

Copyright is in the process of adapting to changes in information technology. The process of adapting will take a longish time. The process started in the 1980s, which led to passage of landmark legislation such as the DMCA in the 1990s. In the years 2000-2010 the DMCA safe harbors were tested and defined in detail, and internet sites went from flat defiance of the law to grudging compliance. That’s 25 years of progress. I imagine 25 more will complete the job.

The notion is also applicable to other domains that typically frustrate technologists. I’m looking at you, healthcare, even though you can make a good argument that health payment is broken in many senses that matter.


Green-Marl Graph DSL

Let’s try out this here newfangled tweet embedding in WordPress:

[embed]https://twitter.com/bigdata/statuses/243742637352419330[/embed]

Just the link on a line by itself isn’t quite working for me. However, the embed shortcode seems to do the trick.

This is fabulous!! Definitely handy for spicing up this space visually.

And a quote for old time’s sake:

The increasing importance of graph-data based applications is fueling the need for highly efficient and parallel implementations of graph analysis software. In this paper we describe Green-Marl, a domain-specific language (DSL) whose high level language constructs allow developers to describe their graph analysis algorithms intuitively, but expose the data-level parallelism inherent in the algorithms.

Released to the public from The New Toy (TM) via Verizon LTE. Yeah me!


Fast Python Data Structures

Fast as in CPU performance, not as in …, well you know:

Python provides great built-in types like dict, list, tuple and set; there are also array, collections, heapq modules in the standard library; this article is an overview of external lesser known packages with fast C/C++ based data structures usable from Python.
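Even before reaching for external C/C++ packages, the stdlib’s array module gives a taste of the win. On CPython, packed 64-bit ints beat a list of full int objects handily (a rough comparison; exact sizes vary by build):

```python
import sys
from array import array

n = 10_000
lst = list(range(n))
arr = array("q", range(n))  # signed 64-bit ints, stored contiguously

# A list holds pointers to individual int objects; count both.
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(i) for i in lst)
array_bytes = sys.getsizeof(arr)
print(array_bytes < list_bytes)  # True on CPython
```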


Kids Don’t Do This

A lot of my data hacking involves munging over piles of biggish files stashed in a standard UNIX filesystem. On any UNIXen worth its salt, the find and xargs utilities are your friends here. Use ’em.

But don’t make the same mistake that’s been biting me in the rear recently: embedding invocations of these tools within my programs. So tempting, especially in languages like Python that make shell invocation so easy, even though Python has the nice os.walk function. Works great until you move to another system and one or the other of find or xargs doesn’t work the way you assumed. Now you have to extract the calls and use the language’s built-in filesystem walking capabilities, or provide a command line option to specify the tool binaries if you build local custom versions.

Instead, design your program to read filenames from its command line args. Bonus points if you have an option to read from stdin. Then wrap find and xargs around your program. And if in a specific situation your args to either utility get too hairy, stuff the invocation into a shell script.
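The pattern is only a few lines of Python. The helper name and sample filenames here are mine, just for illustration:

```python
import io

def iter_filenames(argv, stdin):
    # Filenames from command-line args; fall back to one-per-line on
    # stdin, so both `find ... | xargs python script.py` and a plain
    # `find ... | python script.py` pipe work unchanged.
    if argv:
        return list(argv)
    return [line.rstrip("\n") for line in stdin if line.strip()]

# Simulated: `find /data -name '*.log' | python script.py`
piped = io.StringIO("a.log\nb.log\n")
print(iter_filenames([], piped))          # ['a.log', 'b.log']
print(iter_filenames(["c.log"], piped))   # ['c.log']
```

In a real script you’d pass `sys.argv[1:]` and `sys.stdin`, then loop over the result.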

You’ll thank me in the long run.


Hosted PostGIS Pricing

The one issue with going bigger and better with my Tweet collection is the actual storage, especially if I want to continuously run a spatial DB, like PostGIS. I was just poking around on the Intarwebs and the pricing seems prohibitive for a beginner spatial hacker with significant data.

CartoDB can get you started for $29 a month, but you only have a piddly 50Mb of storage.

According to Heroku, PostGIS is available as an extension to their hosted Postgres product. This includes a whopping 1 Tb of storage. However, the smallest package is $200 a month!

Man, I didn’t know hosting PostGIS was so difficult.


Fear of an OAuth Planet

For the final third of the year, I’m resurrecting my Twitter data collection project on a grander scale. More cities, more data, more processing, more analysis. The only major delta is that Twitter is seriously threatening to apply OAuth to all of its API endpoints. Since my little project isn’t really a Web application per se, the thought of having to do a 3-Legged OAuth handshake seemed daunting, especially for the little ’ole Streaming API.

Fortunately, Peter Hoffman teed up an easy authentication workaround by just generating the tokens for the consumer and app through Twitter. Then you simply stash them away and appropriately sign your initial connection request.

Meanwhile, Greg Roodt at The Mosh Pit, goes the full Monty, explaining how even with the complete 3-Legged handshake, authenticating streaming with OAuth isn’t really that bad. Bonus props for pointing to a GitHub repository of sample code that can be conveniently forked and extended.
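Even with pre-generated tokens, you still sign each request. Here’s the shape of that step: OAuth 1.0a’s HMAC-SHA1 over a signature base string. A simplified stdlib-only sketch; a real request also needs oauth_nonce, oauth_timestamp, and friends, and every credential value below is made up:

```python
import base64
import hashlib
import hmac
from urllib.parse import quote

def oauth_signature(method, url, params, consumer_secret, token_secret):
    # OAuth 1.0a HMAC-SHA1: sign "METHOD&url&sorted-params" with the
    # key "consumer_secret&token_secret" (all pieces percent-encoded).
    enc = lambda s: quote(s, safe="")
    param_str = "&".join(f"{enc(k)}={enc(v)}" for k, v in sorted(params.items()))
    base = "&".join([method.upper(), enc(url), enc(param_str)])
    key = f"{enc(consumer_secret)}&{enc(token_secret)}".encode()
    digest = hmac.new(key, base.encode(), hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

# All credentials here are hypothetical placeholders.
sig = oauth_signature(
    "POST",
    "https://stream.twitter.com/1/statuses/filter.json",
    {"track": "python", "oauth_consumer_key": "ckey", "oauth_token": "atoken"},
    "csecret",
    "tsecret",
)
print(len(sig))  # 28: a 20-byte SHA-1 digest, base64-encoded
```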


Wayne Enterprises Chronicles: Prologue

You either die a hero or you live long enough to see yourself become the villain. — Harvey Dent

I know no one cares about my fantasy team, but forewarned is forearmed. This year’s fantasy football office league edition has been named Wayne Enterprises in honor of the conclusion of The Dark Knight trilogy.

Besides, it makes for a catchy logo.

After the break, my draft and some thoughts on the team’s potential.

8 team league. Snake draft with random position assignments. Overall draft position in parens. I lucked into the number 1 pick for the 2nd time in three (four?) years. And of course, this is a year I did no preparation. The last time I drew first overall, though, ended in tears, with CJ2K not doing very well and dragging my team down with him.

  1. (1) Arian Foster RB
  2. (16) Cam Newton QB
  3. (17) Darren McFadden RB
  4. (32) Marques Colston WR
  5. (33) Hakeem Nicks WR
  6. (48) Miles Austin WR
  7. (49) Jason Witten TE
  8. (64) Steven Jackson RB
  9. (65) Tony Romo QB
  10. (80) Baltimore DEF
  11. (81) Dwayne Bowe WR
  12. (96) Robbie Gould K

I might have gone a little early with Cam Newton, but given I could get two top RBs in my first three picks, I thought he was worth the reach. My receivers are a boom or bust set. I expect two of the four to go down with injuries by week three. I had to dump Jason Witten already for Tony Gonzalez.

Relative to other teams in the league, I feel like I’m in pretty good shape. There’s one other, maybe two, teams that had similar quality drafts. Of course things will evolve rapidly as real games are played, but I’m looking forward to my chances this year.


New Media Hacking

A lie gets halfway around the world before the truth has a chance to get its pants on. — Winston Churchill

Jamais Cascio has been doing some informed thinking about the intersection of Twitter, botnets, misinformation, and systems hacking.

Two of my rules for constructing useful and interesting scenarios are to (a) think about what happens when seemingly disparate changes smash together, and (b) imagine how new developments might be misused. In both cases, the goal is to uncover something unexpected, but (upon reflection) disturbingly plausible. I’d like to lay out for you the chain of connections that lead me to believe that we’re on the verge of something big. …

A black hat hacker could, with ease, create a network of Twitter bots set to retweet each other on command, send @messages to important information hubs (a few of which would retweet stories further), and drive up the visibility of certain hashtags and keywords. Done with the right target and message, and at the right time, such a network could potentially trigger sudden swings in value of targeted shares. The drop in value need not last for long; trading systems that know the stories to be false could swiftly snap up the briefly-undervalued stock. Conversely, the attack could be done in a way to cripple a particular company or stock market, or even to distract journalists from another story.

Similarly, a Twitter bot network, retweeting/spreading misinformation, could potentially cause a media firestorm if the target was a politician. Even if the misinformation was corrected within the hour, the spread would be impossible to fully contain. Could something like this even swing an election?

I can assure you, as a heavily invested Twitter researcher, what Cascio describes is eminently plausible.


So Much For Posts

Thanks to Cult of Mac it looks like Posts isn’t what I’m really looking for in an iPad blogging app.

Also missing is Markdown support, John Gruber’s human-friendly markup language. This would certainly be a handy third pane editor in addition to HTML and Rich Text, but again any writers making heavy use of Markdown will likely be using another app to do it. It would, though, let writers add things like bulleted lists and similar.

Blogsy is in a similar boat, although I made the mistake of rushing out and buying it without due diligence. The current popular operating mode seems to be to use a nice text/Markdown editor like Writing Kit, generate the HTML in that app, then ship to an HTML oriented app like Posts.

Feels like a kludge. On the other hand, I’m pretty intrigued by Writing Kit.


Linden’s Links Redux

Greg Linden is back, with another pile of interesting links. A sampling:

Nice example of how better hardware in your database can be faster and cheaper than expanding your caching layer (link)

Google Research and their hybrid research model blends research and engineering (to maximize impact and avoid the problematic tech transfer from research) and keeps projects short (but still do long-term research by iterating). (link)

Why read research papers? “These papers often foreshadow where the rest of the world is going.” (link)


September Already?!

Wow! It’s already September. Roughly two thirds of 2012 has gone by. Where did it all go?

It’s been a topsy turvy year for me so far, mostly good with some noticeable dips. But with four months left, still seems like plenty of time to make 2012 top notch all the way around.


NFLgame

Uh oh! Andrew Gallant has put together two of my favorite tastes: fantasy football (really sports stats) and Python

As a programmer and a fantasy football addict, I am embarrassed by the means through which we must expend ourselves to get data in a machine readable form. This lack of open source software cripples the community with sub-standard tools, and most importantly, detracts from some really cool and fun things that could be done with easily available statistics. Many tools are either out-dated or broken, or if they work, they are closed source and often cost money.

Yesterday I started work on a new library package that I hope will start to improve this sorry state of affairs.

NFLGame is a Python package that provides convenient access to NFL statistics. This includes games that are currently being played, or games as far back as the 2009 season.

Looks like a definite install. The question is how much of my time it eventually sucks up. Thanks Andrew!

And I know nobody cares about my fantasy football team, but yup, I will be competing in my office league yet again and of course blogging about the season. Super! Thanks for asking!

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.