At the day job, I was casting about for ways to integrate Apache Spark with the open source search engine Elasticsearch. Basically, I have some megawads of JSON data that Elasticsearch happily inhales, but I needed a compute platform to work with the data. Spark is my weapon of choice.
Turns out there’s a really nice Elasticsearch Hadoop toolkit that includes support for building Spark RDDs from Elasticsearch queries. I have to thank Sloan Ahrens for tipping me off with a nice, clear explanation of putting the connector into action:
In this post we’re going to continue setting up some basic tools for doing data science. The ultimate goal is to be able to run machine learning classification algorithms against large data sets using Apache Spark™ and Elasticsearch clusters in the cloud.
… we will continue where we left off, by installing Spark on our previously-prepared VM, then doing some simple operations that illustrate reading data from an Elasticsearch index, doing some transformations on it, and writing the results to another Elasticsearch index.
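To make that read–transform–write loop concrete, here is a minimal PySpark sketch under some loud assumptions: the elasticsearch-hadoop jar is passed to Spark (e.g. via `spark-submit --jars`), an Elasticsearch node is listening on localhost:9200, and the index/type names (`source/docs`, `target/docs`) and the `text` field are all hypothetical placeholders, not anything from the tutorial:

```python
def add_word_count(doc):
    """Per-document transformation: add a word count of the 'text' field.

    Pure Python, so it can be tested without a Spark or Elasticsearch cluster.
    """
    doc = dict(doc)
    doc["word_count"] = len(doc.get("text", "").split())
    return doc


if __name__ == "__main__":
    # Requires pyspark plus the elasticsearch-hadoop jar on the classpath, e.g.:
    #   spark-submit --jars elasticsearch-hadoop-<version>.jar this_script.py
    from pyspark import SparkContext

    sc = SparkContext(appName="es-spark-sketch")

    read_conf = {
        "es.nodes": "localhost",
        "es.port": "9200",
        "es.resource": "source/docs",  # hypothetical index/type to read from
    }
    # Each RDD element is a (document id, document dict) pair.
    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=read_conf,
    )

    # A simple transformation on the document values.
    transformed = es_rdd.mapValues(add_word_count)

    write_conf = {
        "es.nodes": "localhost",
        "es.port": "9200",
        "es.resource": "target/docs",  # hypothetical index/type to write to
    }
    transformed.saveAsNewAPIHadoopFile(
        path="-",  # ignored; the connector writes to es.resource instead
        outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=write_conf,
    )
```

The transformation is deliberately factored into a plain function so the interesting logic stays testable outside the cluster; only the boilerplate Hadoop InputFormat/OutputFormat wiring touches Spark and Elasticsearch.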