Diving deeper into the Hadoop ecosystem, I’m starting to appreciate languages like Hive and Pig much more. The ability to extend each with user-defined functions (UDFs) lets them reach well beyond their built-in operators.
Here’s an old, but still neat, hack from Jacob Perkins, The Data Chef. He’s folding the capabilities of Mallet, an open source topic modeling toolkit, into the Pig language:
I’m going to use Apache Pig and Mallet, a Java-based machine learning and natural language processing library, to discover topics in the 20 newsgroups data set. This corpus is nice since each document already belongs to a newsgroup (a topic) and so it gives us a way of checking how well our topic discovery is doing. …
So it’s clear that we’re going to need a Java UDF to do the actual topic clustering. Right? This UDF will operate on a DataBag of documents and return a DataBag containing the discovered topics.
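To make the bag-in, bag-out shape concrete, here is a rough plain-Java sketch of my own. It is not Jacob's UDF, and it uses neither Pig's nor Mallet's classes: a `List<String>` stands in for the DataBag of documents, and a simple word-frequency count stands in for the real topic model. The names `TopicBagSketch` and `topTerms` are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a Pig UDF: takes a "bag" of documents and
// returns a "bag" of keywords. A real UDF would extend Pig's EvalFunc and
// hand the documents to Mallet; this sketch only counts words.
class TopicBagSketch {

    // Returns the k most frequent terms across all documents in the bag.
    // Ties are broken alphabetically so the output is deterministic.
    static List<String> topTerms(List<String> documents, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            for (String token : doc.toLowerCase().split("\\W+")) {
                if (!token.isEmpty()) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> bag = new ArrayList<>();
        bag.add("pig hive pig");
        bag.add("pig hive hadoop");
        System.out.println(topTerms(bag, 2)); // prints [pig, hive]
    }
}
```

The point is only the contract: one call receives the whole group of documents and returns a collection of results, which is what lets a Pig script pass a grouped relation to a single function.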
Extensibility For The Win.