
Whereabouts

It occurs to me that there is a major event that I have not noted in this venue. About 18 months ago, I gave up my position at Lockheed Martin and moved to a much smaller company, Invincea. That web site is almost exclusively about the commercial product side of Invincea, but we also have a federal services division called Invincea Labs.

Labs is in exactly the same DoD Science and Technology research space I worked in at LM. We hustle Contracted Research and Development (CRAD) from various agencies looking for technical solutions to bleeding edge problems.

How it came about is an amusing story, but I literally ended at LM on a Friday afternoon and started across the street at Invincea the following Monday. The biggest change is that I went down three orders of magnitude in employee head count. Also, Labs has a pure focus on cybersecurity. No more worrying about expensive jet fighters and all that. Lean, mean, and a relaxed attitude have been a refreshing change of pace.

I’m part of the Cyber Analytics team, and Labs has a lot of open positions. We work on all sorts of bleeding edge projects, so shoot me an e-mail at brian.dennis@invincea.com if any of them seem to fit you.


It’s Been A Long Time…

…, I shouldn’t have left you,
Without a strong rhyme to step to.
Think of how many weak shows you slept through.
Time’s up. I’m sorry I kept you…

— Rakim

MacBook Pro 2015

It’s a new Macaversary around here.


Gibson’s “The Peripheral”

I finished reading William Gibson’s The Peripheral about a day ago. As an avowed Gibson fanboy, I’ve oddly got some pretty mixed feelings about the book.

First off, I did this weird thing of pre-ordering from Amazon, so I had the book on the first day of availability. October 28th, 2014. Six. Months. Ago. For whatever reason, I parked the hardcover and never got started. Maybe it was the dread of hauling the hefty tome around.

So taking advantage of some vacation time, I jumped in and devoured it promptly. Ticked all of my Gibson check boxes. But in the end I felt a bit unsatisfied.

Can’t quite put my finger on it, but I really didn’t get a rush. Since Cayce Pollard, I haven’t really felt like a Gibson protagonist has been in much peril. Just a matter of waiting to see how they get out of it. Yeah, Flynne got kidnapped and threatened, but everybody came out clean in the end.

And the Chinese server explanation I found lacking.

Maybe this one just needs to grow on me like the Blue Ant Trilogy.


EMR Spark

Muy bueno! Spark is now an official part of Amazon Elastic MapReduce.

I’m happy to announce that Amazon EMR now supports Apache Spark. Amazon EMR is a web service that makes it easy for you to process and analyze vast amounts of data using applications in the Hadoop ecosystem, including Hive, Pig, HBase, Presto, Impala, and others. We’re delighted to officially add Spark to this list. Although many customers have previously been installing Spark using custom scripts, you can now launch an Amazon EMR cluster with Spark directly from the Amazon EMR Console, CLI, or API.
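For flavor, here’s a minimal sketch of spinning one up from Python with boto3. The release label, instance sizes, and roles are my assumptions, not a tested recipe:

    # Hedged sketch: launch an EMR cluster with Spark via boto3.
    # Release label, instance types, counts, and roles are assumptions.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    response = emr.run_job_flow(
        Name="spark-sandbox",
        ReleaseLabel="emr-4.0.0",           # a release line with Spark built in
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])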


Shades of Redis

Good piece from Charles Leifer highlighting a few alternative takes on the Redis key/data-structure store:

Recently I’ve learned about a few new Redis-like databases: Rlite, Vedis and LedisDB. Each of these projects offers a slightly different take on the data-structure server you find in Redis, so I thought that I’d take some time and see how they worked. In this post I’ll share what I’ve learned, and also show you how to use these databases with Walrus, as I’ve added support for them in the latest 0.3.0 release.

I’m particularly intrigued by the embedded Rlite store. Seems like something useful for situations slightly less relational than what SQLite can service.
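Leifer’s walrus wrapper makes kicking the tires painless. A sketch of the rlite flavor, where the import path and constructor arguments are assumptions drawn from his post rather than something I’ve run:

    # Hedged sketch of Walrus over embedded rlite instead of a Redis server.
    # Module path and file name below are assumptions from Leifer's post.
    from walrus.tusks.rlite import WalrusLite

    db = WalrusLite("/tmp/walrus.db")       # rlite persists to a local file
    huey = db.Hash("user:1")                # familiar Redis-ish structures
    huey.update(name="Huey", species="cat")
    print(huey["name"])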


Deep Cassandra

Andrew Montalenti relates parse.ly’s experience with Cassandra. Lots of interesting tidbits, but the money graf is this:

A well-seasoned technologist friend of mine was not at all surprised when I walked him through some of these issues we had with Cassandra. He said, “You honestly expected that adopting a data store at your scale would not require you to learn all of its internals?” He has a point. After all, we didn’t adopt Elasticsearch until we really grokked Lucene.


Upserts, mmmmmm!

I don’t know how I missed Craig Kerstiens’ post on upserts in PostgreSQL 9.5, but I’m glad they’re here.

If you’ve followed anything I’ve written about Postgres, you know that I’m a fan. At the same time you know that there’s been one feature that so many other databases have, which Postgres lacks and it causes a huge amount of angst for not being in Postgres… Upsert. Well the day has come, it’s finally committed and will be available in Postgres 9.5.
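The new syntax is INSERT ... ON CONFLICT. A minimal sketch of what that looks like from Python via psycopg2; the page_hits table and connection string are hypothetical:

    # Hedged sketch: a PostgreSQL 9.5 upsert from Python.
    # Table, columns, and dbname are made up for illustration.
    import psycopg2

    conn = psycopg2.connect("dbname=blog")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO page_hits (url, hits) VALUES (%s, 1)
            ON CONFLICT (url) DO UPDATE SET hits = page_hits.hits + 1
            """,
            ("/archives.html",),
        )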


GearPump

Since my last extended run of blogging, I’ve really gotten into message system infrastructure and streaming data computation architectures. I may have to kick the tires on GearPump:

GearPump is a lightweight real-time big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. … Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

That seems like a lot of msgs/sec. Gotta see the specs on that cluster.



Customizing IPython for Pandas

Link parkin’

Chris Moffitt’s blog post doesn’t have a lot of detail, but the attendant IPython notebook looks chock full of goodness:

The combination of IPython + Jupyter + Pandas makes it easy to interact with and display your data. Not surprisingly, these tools are easy to customize and configure for your own needs. This article summarizes some of the most useful and interesting options.
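The flavor of customization involved, sketched with option names from the pandas docs and purely illustrative values:

    # Hedged sketch: common pandas display tweaks for an IPython session.
    import pandas as pd

    pd.set_option("display.max_rows", 120)      # show more rows before truncating
    pd.set_option("display.max_columns", 50)    # and more columns
    pd.set_option("display.float_format", "{:,.2f}".format)  # readable floats

    df = pd.DataFrame({"revenue": [1234.5678, 98765.4321]})
    print(df)   # renders using the options set above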


Ignite v Spark

I’m an unabashed Apache Spark fanboy, but it’s good intellectual hygiene to know about the technical alternatives. Apache Ignite is one that has slipped beneath my radar. Konstantin Boudnik contrasts Ignite and Spark:

Complimentary to my earlier post on Apache Ignite in-memory file-system and caching capabilities I would like to cover the main differentiation points of the Ignite and Spark. I see questions like this coming up repeatedly. It is easier to have them answered, so you don’t need to fish around the net for the answers.

Clearly not from a native English speaker, but definitely worth the read.


Get Famous With Spark

You too can win fame and fortune using Apache Spark! It helps to invent it and write up a great PhD dissertation.

Matei Zaharia won the 2014 Doctoral Dissertation Award for his innovative solution to tackling the surge in data processing workloads, and accommodating the speed and sophistication of complex multi-stage applications and more interactive ad-hoc queries. His work proposed a new architecture for cluster computing systems, achieving best-in-class performance in a variety of workloads while providing a simple programming model that lets users easily and efficiently combine them.

Go Bears!


Reconciling Streaming Jargon

I really enjoyed reading Martin Kleppmann’s treatise on varying communities and terminology related to stream processing:

Some people call it stream processing. Others call it Event Sourcing or CQRS. Some even call it Complex Event Processing. Sometimes, such self-important buzzwords are just smoke and mirrors, invented by companies who want to sell you stuff. But sometimes, they contain a kernel of wisdom which can really help us design better systems.

In this talk, we will go in search of the wisdom behind the buzzwords. We will discuss how event streams can help make your application more scalable, more reliable and more maintainable. Founded in the experience of building large-scale data systems at LinkedIn, and implemented in open source projects like Apache Kafka and Apache Samza, stream processing is finally coming of age.

On the day job, I’m on my third deployment of a message queueing system to support prototyping of stream processing algorithms. I’m really starting to appreciate the fundamental differences between the various approaches. I can also say there’s no “right way” to do it. Each use case has to be looked at individually, and there will definitely be some bespoke customization. Carefully define your correctness and performance guarantees and there’s a chance you’ll get it right.

Dispatches like Kleppmann’s, though, are helpful in understanding what the landscape looks like and where you’d like to be.


Sprung For Pinner

I’ve gotten back into consistently collecting links of interest and started heavily using the fine Pinboard product. Capturing links from iOS devices was a drag though, having to go through a JavaScript bookmarklet. Felt sort of convoluted. Too much friction. But my iPhone and iPad are the places where I run across most of the links I want to stash.

Enter Pinner, a $4.99 app. A little pricey, but it looks very nice and provides a clean experience for browsing Pinboard. Best of all it adds a custom share sheet so posting a link to Pinboard is just one click away.


At What COST?

Frank McSherry published a useful reminder that one must carefully calibrate the need to deploy “big data” solutions:

Lots of people struggle with the complexities of getting big data systems up and running, when they possibly shouldn’t be using the systems in the first place. The data sets above are certainly not small (billions of edges), but still run just fine on a laptop. Much faster than the distributed systems, at least.

Here are two helpful guidelines (for largely disjoint populations):

  1. If you are going to use a big data system for yourself, see if it is faster than your laptop.
  2. If you are going to build a big data system for others, see that it is faster than my laptop.

This brings back memories of the CMU work on GraphChi, where they processed graphs with billions of edges on a Mac Mini.

I’ll have to dig up Frank’s paper once it gets published.


Well That Explains It

I’ve been seeing some weird host naming issues on my Mac OS X machine for work. I thought it was an honest-to-gosh conflict with another machine, but it turns out there’s glitchiness in Apple’s latest DNS software:

Duplicate machine names. We use an old Mac named “nirrti” as a file- and iTunes server. In the pre-10.10 days, once in a blue moon nirrti would rename herself to “nirrti (2)”, presumably because it looked like another machine was already using the name “nirrti”. Under 10.10, this now happens a lot, sometimes getting all the way to nirrti (7). Changing back the computer name in the Sharing pane of the System Preferences usually doesn’t take. Apart from looking bad, this also makes opening network connections and playing iTunes content harder, as you need to connect to the right version of the name or nothing happens.

Good to know, but I wouldn’t go so far as to attempt the modifications described in the article. Seems like a recipe for later pain on further application and operating system upgrades.


GLA Podcast 49

I am all over that. Will definitely be checking it out during tomorrow’s commute.

Fiending for a Mushroom Jazz 8 release though. It’s been over 3 years since Mushroom Jazz 7 hit the street.


spark-kafka

You can also turn a Kafka topic into a Spark RDD:

Spark-kafka is a library that facilitates batch loading data from Kafka into Spark, and from Spark into Kafka.

This library does not provide a Kafka Input DStream for Spark Streaming. For that please take a look at the spark-streaming-kafka library that is part of Spark itself.

This could come in handy to pre-ingest some data to build up some history before connecting to a Kafka data stream using Spark Streaming.
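I won’t guess at spark-kafka’s own API here, but the Spark Streaming path its README points to looks roughly like this in recent PySpark releases; the ZooKeeper quorum, consumer group, and topic map are placeholders:

    # Hedged sketch: consuming a Kafka topic with Spark Streaming's
    # built-in receiver (the spark-streaming-kafka route noted above).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-demo")
    ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

    # ZooKeeper address, group id, and topic name are assumptions.
    stream = KafkaUtils.createStream(
        ssc, "zk-host:2181", "demo-group", {"events": 1}
    )
    stream.map(lambda kv: kv[1]).pprint()          # print message values

    ssc.start()
    ssc.awaitTermination()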


Elasticsearch and Spark

In the day job, I was casting about for ways to integrate Apache Spark with the open source search engine Elasticsearch. Basically, I had some megawads of JSON data which Elasticsearch happily inhales, but I needed a compute platform to work with the data. Spark is my weapon of choice.

Turns out there’s a really nice Elasticsearch Hadoop toolkit that includes making Spark RDDs out of Elasticsearch searches. I have to thank Sloan Ahrens for tipping me off with a nice clear explanation of putting the connector in action:

In this post we’re going to continue setting up some basic tools for doing data science. The ultimate goal is to be able to run machine learning classification algorithms against large data sets using Apache Spark™ and Elasticsearch clusters in the cloud.

… we will continue where we left off, by installing Spark on our previously-prepared VM, then doing some simple operations that illustrate reading data from an Elasticsearch index, doing some transformations on it, and writing the results to another Elasticsearch index.
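On the PySpark side, the connector rides the Hadoop InputFormat bridge, roughly as Ahrens demonstrates. A sketch, with the index name and addresses as placeholders and the connector JAR assumed to be on the Spark classpath:

    # Hedged sketch: read an Elasticsearch index into a Spark RDD via the
    # elasticsearch-hadoop connector. Index/type and hosts are hypothetical.
    es_conf = {
        "es.nodes": "localhost",
        "es.port": "9200",
        "es.resource": "megawads/docs",   # index/type to query
    }
    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf,
    )
    print(es_rdd.take(1))   # (doc_id, {field: value, ...}) pairs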


Into the Blockchain

I’m way late to the Bitcoin party, but I think the notion of applications built from blockchain concepts will be a Big Deal (™). Andreas Antonopoulos’ new book Mastering Bitcoin is getting me up to speed. Here’s a taste:

One way to think about the blockchain is like layers in a geological formation, or a glacier core sample. The surface layers may change with the seasons, or even be blown away before they have time to settle. But once you go a few inches deep, geological layers become more and more stable. By the time you look a few hundred feet down, you are looking at a snapshot of the past that has remained undisturbed for millennia or millions of years. In the blockchain, the most recent few blocks may be revised if there is a chain recalculation due to a fork. The top six blocks are like a few inches of topsoil. But once you go deeper into the blockchain, beyond six blocks, blocks are less and less likely to change. After 100 blocks back, there is so much stability that the “coinbase” transaction, the transaction containing newly mined bitcoins, can be spent. A few thousand blocks back (a month) and the blockchain is settled history. It will never change.

From what I’ve read so far, the book is a nice blend of high level overview and technical details, with code samples no less.


Yes, It Still Works

Yes this blog is still fully operational. As sole owner, proprietor, publisher, and author, I’m committing to more content in 2015. I guarantee it’s going to be a more interesting year in these here parts.


Spark, IPython, Kafka

A couple of good overviews from the fine folks at Cloudera.

First, Gwen Shapira & Jeff Holoman on “Apache Kafka for Beginners”

Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well known user, there are many companies successfully using this technology.

So now that the word is out, it seems the world wants to know: What does it do? Why does everyone want to use it? How is it better than existing solutions? Do the benefits justify replacing existing systems and infrastructure?

In this post, we’ll try to answer those questions. We’ll begin by briefly introducing Kafka, and then demonstrate some of Kafka’s unique features by walking through an example scenario. We’ll also cover some additional use cases and compare Kafka to existing solutions.

And Uri Laserson on “How-to: Use IPython Notebook with Apache Spark”

Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.
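One lightweight way to get that pairing, though not necessarily Laserson’s exact recipe, is the findspark package inside a notebook:

    # Hedged sketch: wiring PySpark into an IPython/Jupyter session.
    # Assumes SPARK_HOME is set and findspark is installed.
    import findspark
    findspark.init()          # put pyspark on sys.path

    from pyspark import SparkContext
    sc = SparkContext(appName="notebook-exploration")

    rdd = sc.parallelize(range(1000))
    print(rdd.sum())          # quick smoke test that the context works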


Ramez Naam’s “Nexus”

Generally, I dislike the technothriller genre (cf. Daemon), but I quite enjoyed “Nexus” by Ramez Naam. The technical and philosophical aspects of bio-hacking were well done. I wasn’t particularly fond of the technothriller clash of nation states, American exceptionalism, and military/intelligence complex sycophancy tropes, but I knew what I was getting into. At least there was some interesting cultural diversity and introspection in the mix.

I may actually pick up the sequel, “Crux”.


Blaze Expressions

“tl;dr Blaze abstracts tabular computation, providing uniform access to a variety of database technologies”

Haven’t gotten a chance to dig in yet, but Continuum Analytics’ new Blaze Expressions library is worthy of further inspection:

Occasionally we run across a dataset that is too big to fit in our computer’s memory. In this case NumPy and Pandas don’t fit our needs and we look to other tools to manage and analyze our data. Popular choices include databases like Postgres and MongoDB, out-of-disk storage systems like PyTables and BColz and the menagerie of tools on top of the Hadoop File System (Hadoop, Spark, Impala and derivatives.) Each of these systems has their own strengths and weaknesses and an experienced data analyst will choose the right tool for the problem at hand. Unfortunately learning how each system works and pushing data into the proper form often takes most of the data scientist’s time.

The startup costs of learning to munge and migrate data between new technologies often dominate biggish-data analytics.

Blaze strives to reduce this friction. Blaze provides a uniform interface to a variety of database technologies and abstractions for migrating data.

I especially like the notion of driving multiple different backend engines, such as in-memory (Pandas), SQL, NoSQL (MongoDB), and Big Data (Apache Spark), from the same tabular expressions.
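A sketch of the flavor, cribbed from their examples; the Data constructor and URI syntax are my assumptions about a fast-moving API:

    # Hedged sketch: one Blaze expression targeting different backends.
    # The iris.csv path and sqlite URI are hypothetical.
    from blaze import Data

    csv = Data("iris.csv")                      # small: Pandas under the hood
    db = Data("sqlite:///flowers.db::iris")     # bigger: SQL under the hood

    # The same expression runs against either backend.
    print(csv[csv.sepal_length > 5.0].species.distinct())
    print(db[db.sepal_length > 5.0].species.distinct())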


Apache Spark’s Unified Model

I’ve been a fan of Apache Spark (Go Bears!) for a while, despite not having a really good opportunity to put the toolkit to practical use. Last year I got to AMPCamp 3 and the first Spark Summit. At the latter event, the AMPLab started singing a new tune about the benefits of a unified model for big data processing, moving on from selling in-memory computing.

Cloudera’s Gwen Shapira posted a good case study of the upside:

But the biggest advantage Spark gave us in this case was Spark Streaming, which allowed us to re-use the same aggregates we wrote for our batch application on a real-time data stream. We didn’t need to re-implement the business logic, nor test and maintain a second code base. As a result, we could rapidly deploy a real-time component in the limited time left — and impress not just the users but also the developers and their management.
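That reuse story is concrete in the API: a function written against RDDs can be applied per micro-batch with transform(). A hedged sketch, where the paths, the SparkContext sc, and the pre-built dstream are assumed placeholders:

    # Hedged sketch of Spark's batch/streaming code reuse.
    def count_by_key(rdd):
        """Aggregation logic shared by the batch and streaming paths."""
        return rdd.map(lambda event: (event, 1)).reduceByKey(lambda a, b: a + b)

    # Batch: run over historical data.
    batch_counts = count_by_key(sc.textFile("hdfs:///events/2014/"))

    # Streaming: the very same function, applied to each micro-batch.
    stream_counts = dstream.transform(count_by_key)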


Items Of Note

A bit dated, but hopefully not completely useless:


Can’t Wait

Maybe I’ll have finished re-reading the Blue Ant trilogy by then.


Foxy 538

Welcome back Nate!

The breadth of our coverage will be much clearer at this new version of FiveThirtyEight, which is launching Monday under the auspices of ESPN. We’ve expanded our staff from two full-time journalists to 20 and counting. Few of them will focus on politics exclusively; instead, our coverage will span five major subject areas — politics, economics, science, life and sports.

What I like about this particular post (go read it all, seriously) is the level of humility Silver expresses. A lot of people can, and do, do the math and follow the predictive approaches he espouses. But putting it to the principled service of informing The Public, within the current dynamic of Internet social media, is innovative. Computer Assisted Reporting was just a precursor. As a recovering new media hack, I can appreciate all the roots of this iteration of his work.

Plus, I love this attitude:

It’s time for us to start making the news a little nerdier.


8 Years of S3

It feels like more than eight years though, maybe because I’ve been all over it since the beginning:

We launched Amazon S3 on March 14, 2006 with a press release and a simple blog post. We knew that the developer community was interested in and hungry for powerful, scalable, and useful web services and we were eager to see how they would respond.

Of course, I was dead wrong in my analysis. “S3 is not a gamechanger.” What was I thinking? Too much focus on the storage economics and not enough on the business model inflection point.


Python’s wheels

Packaging has always been a bit of a sore spot for Python modules. Maybe wheels are a step in the right direction. Armin Ronacher has written a nice overview of how to put wheels into actual useful practice:

Wheels currently seem to have more traction than eggs. The development is more active, PyPI started to add support for them and because all the tools start to work for them it seems to be the better solution. Eggs currently only work if you use easy_install instead of pip which seems to be something very few people still do.

So there you have it. Python on wheels. It’s there, it kinda works, and it’s probably worth your time.


Pandas, payrolls, and pay stubs

Brandon Rhodes penned a nice, light, practical introduction to Pandas while using “small” data:

I will admit it: I only thought to pull out Pandas when my Python script was nearly complete, because running print on a Pandas data frame would save me the trouble of formatting 12 rows of data by hand.

This post is a brief tour of the final script, written up as an IPython notebook and organized around five basic lessons that I learned about Pandas by applying it to this problem.
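In that spirit, a toy sketch of the kind of thing Pandas makes pleasant. The figures and column names are made up, not from Brandon’s script:

    # Hedged sketch: twelve months of made-up payroll rows, formatted for free.
    import pandas as pd

    months = pd.date_range("2014-01-01", periods=12, freq="MS")
    payroll = pd.DataFrame({"gross": [5200.0] * 12}, index=months)
    payroll["social_security"] = payroll["gross"] * 0.062   # illustrative rate
    payroll["take_home"] = payroll["gross"] - payroll["social_security"]
    print(payroll)   # the pretty-printing that justifies reaching for Pandas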


Missing Strata West…

Grumble. Not that I’ve ever been to a Strata Conference, but the Twitter feeds of @bigdata and @joe_hellerstein are taunting me.


Pandas 0.13.1

ICYMI, there’s a new Pandas out with a lot of goodies. Python + tabular data processing + high performance == yum!


Diggin’ On Avro

After some initial trepidation, I’m starting to enjoy working with Apache Avro. The schema language and options (avdl, avsc, avpr) are a bit obtuse, but the cross-language interop seems to work as advertised. Which is a good thing.
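For reference, the canonical Python round trip looks about like this; the user.avsc schema file and its fields are hypothetical, and note the parse call is spelled Parse in the Python 3 package:

    # Hedged sketch: write and read an Avro container file from Python,
    # following the pattern in the Avro getting-started docs.
    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    schema = avro.schema.parse(open("user.avsc", "rb").read())  # hypothetical schema

    with open("users.avro", "wb") as out:
        writer = DataFileWriter(out, DatumWriter(), schema)
        writer.append({"name": "Ada", "favorite_number": 7})
        writer.close()

    with open("users.avro", "rb") as f:
        reader = DataFileReader(f, DatumReader())
        for user in reader:
            print(user)   # same record, readable from any Avro language binding
        reader.close()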


Spark Summit 2014

This looks like it will be bad timing for me, but as an AMPCamp 2013 and Spark Summit 2013 attendee, I can vouch for the event quality:

We are proud to announce that the 2014 Spark Summit will be held in San Francisco on June 30 – July 2 at the Westin St. Francis. Tickets are on sale now and can be purchased here.

For 2014, the Spark Summit has grown to a 3-day event. We’ll have two days of keynotes and presentations followed by one day of hands-on training. Attendees of the summit can choose between a 2-day conference-only pass or a 3-day conference and training pass.

If you can’t/didn’t get to Strata West 2014, this will be your next best opportunity to get a deep dive into the Spark ecosystem.


Data Community DC

I don’t know if it’s the best or the biggest, but DC has one damn well-organized community of data enthusiasts:

Data Community DC (DC2) is an organization formed in mid-2012 to connect and promote the work of data professionals in the National Capital Region. We foster education, opportunity, and professional development through high-quality, community-driven events, content, resources, products and services. Our goal is to create a truly open and welcoming community of people who produce, consume, analyze, and work with data — data scientists, analysts, economists, programmers, researchers, and statisticians, regardless of industry, sector, or technology. As of January 2014, we are over 5,000 members strong, from diverse industries and a large variety of backgrounds.

But that’s what we do here in the DMV, build bureaucratic organizational structures. Ha, ha! Only serious.


Trifacta Launch

Glad to see Trifacta ship their first product. I had a bit of an insider seat on the Lockheed Martin collaboration. They’ve iterated like crazy since I saw a very primitive version in June. Good luck to Dr. Hellerstein and the team, and of course Go Bears!


Apache Spark 0.9.0

Good to see a release of Apache Spark with GraphX included, even if the graph package is only in alpha:

We are happy to announce the availability of Spark 0.9.0! Spark 0.9.0 is a major release and Spark’s largest release ever, with contributions from 83 developers. This release expands Spark’s standard libraries, introducing a new graph computation package (GraphX) and adding several new features to the machine learning and stream-processing packages. It also makes major improvements to the core engine, including external aggregations, a simplified H/A mode for long lived applications, and hardened YARN support.

Spark is an open source project on the move. Previously, in-memory distributed computation was the big selling point. Now it’s the unification of disparate computational models, cleanly embedded within the Hadoop ecosystem.


Yet Another MapReduce DSL

Apache Crunch has been around for a while, but the recent addition of support for Apache Spark and a Scala REPL caught my eye:

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.


AWS Tips

Link parkin’

AWS is one of the most popular cloud computing platforms. It provides everything from object storage (S3), elastically provisioned servers (EC2), databases as a service (RDS), payment processing (DevPay), virtualized networking (VPC and AWS Direct Connect), content delivery networks (CDN), monitoring (CloudWatch), queueing (SQS), and a whole lot more.

In this post I’ll be going over some tips, tricks, and general advice for getting started with Amazon Web Services (AWS). The majority of these are lessons we’ve learned in deploying and running our cloud SaaS product, JackDB, which runs entirely on AWS.


Linden Still Linkin’

Greg Linden’s got a new batch of interesting links. That’s worth coming out of posting hibernation (n.b. not retirement).

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.