Armbrust on Spark Structured Streaming

Posted on: Thu 26 May 2016

I enjoyed this O’Reilly Data Podcast conversation with Michael Armbrust regarding Apache Spark 2.0’s Structured Streaming:

With the release of Spark version 2.0, streaming starts becoming much more accessible to users. By adopting a continuous processing model (on an infinite table), the developers of Spark have enabled users of its SQL or DataFrame APIs to extend their analytic capabilities to unbounded streams.

Within the Spark community, Databricks Engineer, Michael Armbrust is well-known for having led the long-term project to move Spark’s interactive analytics engine from Shark to Spark SQL. (Full disclosure: I’m an advisor to Databricks.) Most recently he has turned his efforts to helping introduce a much simpler stream processing model to Spark Streaming (“structured streaming”).

You’ll need a login, but there’s also a deeper dive video from Armbrust and Tathagata Das going into more details of Structured Streaming.

At one point, Ben Lorica asked Armbrust about the dimensions upon which developers should evaluate streaming platforms. The obvious ones (delivery guarantees, latency, throughput) were brought up. I’d add a few more

expressiveness, how convenient is it to express common streaming computations and how possible is it to implement exquisite solutions
agility, the ease with which stream processing code can be correctly updated and re-deployed
monitoring, getting useful performance metrics and debugging information out of the system

Apache Spark Structured Streaming, Kafka Streams, Twitter Heron, Apache Flink. So much to choose from.