I’ve been a fan of Apache Spark (Go Bears!) for a while, despite not having had a really good opportunity to put the toolkit to practical use. Last year I attended AMPCamp 3 and the first Spark Summit. At the latter event, the AMPLab started singing a new tune about the benefits of a unified model for big data processing, moving on from selling in-memory computing.
Cloudera’s Gwen Shapira posted a good case study of the upside:
But the biggest advantage Spark gave us in this case was Spark Streaming, which allowed us to re-use the same aggregates we wrote for our batch application on a real-time data stream. We didn’t need to re-implement the business logic, nor test and maintain a second code base. As a result, we could rapidly deploy a real-time component in the limited time left — and impress not just the users but also the developers and their management.
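To make that reuse concrete, here is a minimal sketch in plain Python. It is not the Spark API and not the code from the case study. It only mimics the shape of the idea: Spark Streaming treats a stream as a series of small batches, so an aggregate written once can serve both paths. The record shape and the names `total_by_key`, `run_batch`, and `run_streaming` are all made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record shape: (key, amount).
Event = Tuple[str, int]


def total_by_key(events: Iterable[Event]) -> Counter:
    """The business logic, written once: sum amounts per key."""
    totals: Counter = Counter()
    for key, amount in events:
        totals[key] += amount
    return totals


def run_batch(events: List[Event]) -> Counter:
    """Batch path: apply the aggregate to the whole data set at once."""
    return total_by_key(events)


def run_streaming(micro_batches: Iterable[List[Event]]) -> Counter:
    """Streaming path: apply the same aggregate to each micro-batch
    and fold the partial results into a running total."""
    running: Counter = Counter()
    for batch in micro_batches:
        running.update(total_by_key(batch))  # Counter.update adds counts
    return running


if __name__ == "__main__":
    events = [("clicks", 1), ("views", 2), ("clicks", 3)]
    print(run_batch(events))
    print(run_streaming([events[:2], events[2:]]))
```

Both paths call the same `total_by_key`, so there is only one piece of business logic to test and maintain. That is the property the quote is praising.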