
Reconciling Streaming Jargon

I really enjoyed reading Martin Kleppmann’s treatise on the various communities and the terminology surrounding stream processing:

Some people call it stream processing. Others call it Event Sourcing or CQRS. Some even call it Complex Event Processing. Sometimes, such self-important buzzwords are just smoke and mirrors, invented by companies who want to sell you stuff. But sometimes, they contain a kernel of wisdom which can really help us design better systems.

In this talk, we will go in search of the wisdom behind the buzzwords. We will discuss how event streams can help make your application more scalable, more reliable and more maintainable. Founded in the experience of building large-scale data systems at LinkedIn, and implemented in open source projects like Apache Kafka and Apache Samza, stream processing is finally coming of age.

At the day job, I’m on my third deployment of a message queueing system to support prototyping of stream processing algorithms, and I’m really starting to appreciate the fundamental differences between the various approaches. I can also say there’s no one “right way” to do it: each use case has to be looked at individually, and there will definitely be some bespoke customization. Carefully define your correctness and performance guarantees and there’s a chance you’ll get it right.

Dispatches like Kleppmann’s, though, are helpful for understanding what the landscape looks like and where you’d like to be.


Sprung For Pinner

I’ve gotten back into consistently collecting links of interest and started heavily using the fine Pinboard product. Capturing links from iOS devices was a drag though, having to go through a JavaScript bookmarklet. Felt sort of convoluted. Too much friction. But my iPhone and iPad are the places where I run across most of the links I want to stash.

Enter Pinner, a $4.99 app. A little pricey, but it looks very nice and provides a clean experience for browsing Pinboard. Best of all, it adds a custom share sheet, so posting a link to Pinboard is just one tap away.


At What COST?

Frank McSherry published a useful reminder that one must carefully calibrate the need to deploy “big data” solutions:

Lots of people struggle with the complexities of getting big data systems up and running, when they possibly shouldn’t be using the systems in the first place. The data sets above are certainly not small (billions of edges), but still run just fine on a laptop. Much faster than the distributed systems, at least.

Here are two helpful guidelines (for largely disjoint populations):

  1. If you are going to use a big data system for yourself, see if it is faster than your laptop.
  2. If you are going to build a big data system for others, see that it is faster than my laptop.

This brings back memories of the CMU work on GraphChi, where they processed graphs with billions of edges on a Mac Mini.

I’ll have to dig up Frank’s paper once it gets published.


Well That Explains It

I’ve been seeing some weird host naming issues on my Mac OS X machine for work. I thought it was an honest to gosh conflict with another machine, but it turns out there’s glitchiness in Apple’s latest DNS software:

Duplicate machine names. We use an old Mac named “nirrti” as a file- and iTunes server. In the pre-10.10 days, once in a blue moon nirrti would rename herself to “nirrti (2)”, presumably because it looked like another machine was already using the name “nirrti”. Under 10.10, this now happens a lot, sometimes getting all the way to nirrti (7). Changing back the computer name in the Sharing pane of the System Preferences usually doesn’t take. Apart from looking bad, this also makes opening network connections and playing iTunes content harder, as you need to connect to the right version of the name or nothing happens.

Good to know, but I wouldn’t go so far as to attempt the modifications described in the article. Seems like a recipe for later pain on further application and operating system upgrades.


GLA Podcast 49

I am all over that. Will definitely be checking it out during tomorrow’s commute.

Fiending for a Mushroom Jazz 8 release though. It’s been over 3 years since Mushroom Jazz 7 hit the street.


spark-kafka

You can also turn a Kafka topic into a Spark RDD:

Spark-kafka is a library that facilitates batch loading data from Kafka into Spark, and from Spark into Kafka.

This library does not provide a Kafka Input DStream for Spark Streaming. For that please take a look at the spark-streaming-kafka library that is part of Spark itself.

This could come in handy to pre-ingest some data to build up some history before connecting to a Kafka data stream using Spark Streaming.
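
For flavor, here’s a minimal sketch of that pre-ingest step. Rather than the spark-kafka library itself, it uses KafkaUtils.createRDD from Spark’s own spark-streaming-kafka integration, which covers the same batch-load use case; the broker address, topic name, and offsets are all hypothetical.

# Hedged sketch: batch-loading a slice of a Kafka topic into an RDD.
# Assumes the spark-streaming-kafka integration is on Spark's classpath;
# broker, topic, and offsets below are made up for illustration.
from pyspark import SparkContext
from pyspark.streaming.kafka import KafkaUtils, OffsetRange

sc = SparkContext("local[2]", "kafka-batch-sketch")

# topic, partition, fromOffset, untilOffset
offsets = [OffsetRange("events", 0, 0, 1000)]

rdd = KafkaUtils.createRDD(
    sc,
    {"metadata.broker.list": "localhost:9092"},
    offsets,
)

# Each element is a (key, message) pair; build up history from here.
print(rdd.count())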


Elasticsearch and Spark

In the day job, I was casting about for ways to integrate Apache Spark with the open source search engine Elasticsearch. Basically, I had some megawads of JSON data which Elasticsearch happily inhales, but I needed a compute platform to work with the data. Spark is my weapon of choice.

Turns out there’s a really nice Elasticsearch Hadoop toolkit that includes making Spark RDDs out of Elasticsearch searches. I have to thank Sloan Ahrens for tipping me off with a nice clear explanation of putting the connector in action:

In this post we’re going to continue setting up some basic tools for doing data science. The ultimate goal is to be able to run machine learning classification algorithms against large data sets using Apache Spark™ and Elasticsearch clusters in the cloud.

… we will continue where we left off, by installing Spark on our previously-prepared VM, then doing some simple operations that illustrate reading data from an Elasticsearch index, doing some transformations on it, and writing the results to another Elasticsearch index.
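
The core trick in that walkthrough is reading an index through the connector’s Hadoop InputFormat. A minimal sketch, assuming the elasticsearch-hadoop jar is on Spark’s classpath; the node address and index/type name are placeholders:

from pyspark import SparkContext

sc = SparkContext(appName="es-spark-sketch")

es_conf = {
    "es.nodes": "localhost",          # assumed local Elasticsearch node
    "es.port": "9200",
    "es.resource": "myindex/mytype",  # hypothetical index/type
}

# EsInputFormat yields (document id, document fields) pairs.
rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)

print(rdd.take(1))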


Into the Blockchain

I’m way late to the Bitcoin party, but I think the notion of applications built from blockchain concepts will be a Big Deal (™). Andreas Antonopoulos’ new book Mastering Bitcoin is getting me up to speed. Here’s a taste:

One way to think about the blockchain is like layers in a geological formation, or a glacier core sample. The surface layers may change with the seasons, or even be blown away before they have time to settle. But once you go a few inches deep, geological layers become more and more stable. By the time you look a few hundred feet down, you are looking at a snapshot of the past that has remained undisturbed for millennia or millions of years. In the blockchain, the most recent few blocks may be revised if there is a chain recalculation due to a fork. The top six blocks are like a few inches of topsoil. But once you go deeper into the blockchain, beyond six blocks, blocks are less and less likely to change. After 100 blocks back, there is so much stability that the “coinbase” transaction, the transaction containing newly mined bitcoins, can be spent. A few thousand blocks back (a month) and the blockchain is settled history. It will never change.

From what I’ve read so far, the book is a nice blend of high level overview and technical details, with code samples no less.
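
The Bitcoin whitepaper puts a rough number on that geological intuition: an attacker controlling a fraction q of the hash power catches up from z blocks behind with probability (q/p)^z, where p = 1 − q. A toy sketch of just that simple bound, ignoring the paper’s Poisson refinement:

# Toy sketch of Nakamoto's catch-up bound from the Bitcoin whitepaper:
# an attacker with hash power fraction q overtakes a block buried z deep
# with probability (q/p)^z (simplified; omits the Poisson correction).
def catch_up_probability(q, z):
    p = 1.0 - q
    return 1.0 if q >= p else (q / p) ** z

# "topsoil" vs. "settled history": 1 block, 6 blocks, 100 blocks
for z in (1, 6, 100):
    print(z, catch_up_probability(0.1, z))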


Yes, It Still Works

Yes this blog is still fully operational. As sole owner, proprietor, publisher, and author, I’m committing to more content in 2015. I guarantee it’s going to be a more interesting year in these here parts.


Spark, IPython, Kafka

A couple of good overviews from the fine folks at Cloudera:

First, Gwen Shapira & Jeff Holoman on “Apache Kafka for Beginners”:

Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well known user, there are many companies successfully using this technology.

So now that the word is out, it seems the world wants to know: What does it do? Why does everyone want to use it? How is it better than existing solutions? Do the benefits justify replacing existing systems and infrastructure?

In this post, we’ll try to answer those questions. We’ll begin by briefly introducing Kafka, and then demonstrate some of Kafka’s unique features by walking through an example scenario. We’ll also cover some additional use cases and compare Kafka to existing solutions.
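
At the API level, the produce/consume loop behind all that buzz is pleasantly small. A hedged sketch using the third-party kafka-python package (not from the Cloudera post; the broker address and topic name are assumptions):

# Hedged sketch of Kafka's core produce/consume loop via kafka-python;
# broker address and topic name are assumptions.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"page_view:/home")
producer.flush()

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # give up if the topic stays empty
)
for record in consumer:
    print(record.value)
    break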

And Uri Laserson on “How-to: Use IPython Notebook with Apache Spark”:

Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.
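
Laserson’s recipe is CDH-specific. A more generic sketch of the same idea, assuming the third-party findspark package and a local Spark install, looks like this inside a notebook:

# Hedged sketch: wiring PySpark into an IPython/Jupyter session with the
# third-party findspark package (assumes SPARK_HOME points at a Spark install).
import findspark

findspark.init()  # locates the Spark install and adds PySpark to sys.path

from pyspark import SparkContext

sc = SparkContext("local[*]", "notebook-sketch")
print(sc.parallelize(range(10)).sum())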


Ramez Naam’s “Nexus”

Generally I dislike the technothriller genre (cf. Daemon), but I enjoyed “Nexus” by Ramez Naam. The technical and philosophical aspects of bio-hacking were well done. I wasn’t particularly fond of the usual technothriller tropes: the clash of nation states, American exceptionalism, military/intelligence complex sycophancy. But I knew what I was getting into. At least there was some interesting cultural diversity and introspection in the mix.

I may actually pick up the sequel, “Crux”.


Blaze Expressions

“tl;dr Blaze abstracts tabular computation, providing uniform access to a variety of database technologies”

Haven’t gotten a chance to dig in yet, but Continuum Analytics’ new Blaze Expressions library is worthy of further inspection:

Occasionally we run across a dataset that is too big to fit in our computer’s memory. In this case NumPy and Pandas don’t fit our needs and we look to other tools to manage and analyze our data. Popular choices include databases like Postgres and MongoDB, out-of-disk storage systems like PyTables and BColz and the menagerie of tools on top of the Hadoop File System (Hadoop, Spark, Impala and derivatives.) Each of these systems has their own strengths and weaknesses and an experienced data analyst will choose the right tool for the problem at hand. Unfortunately learning how each system works and pushing data into the proper form often takes most of the data scientist’s time.

The startup costs of learning to munge and migrate data between new technologies often dominate biggish-data analytics.

Blaze strives to reduce this friction. Blaze provides a uniform interface to a variety of database technologies and abstractions for migrating data.

I especially like the notion of exploiting multiple different frameworks such as in-memory (Pandas), SQL, NoSQL (MongoDB), and Big Data (Apache Spark) for tabular backend engines.
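
As a taste of that uniformity, here’s a hedged sketch assuming the Data constructor and by() aggregation from the Blaze releases of that era; the CSV file and its column names are hypothetical:

# Hedged sketch of Blaze's uniform interface: the same expression works
# whether the URI below points at a CSV, a SQL database, or MongoDB.
from blaze import Data, by

iris = Data("iris.csv")  # hypothetical local file; could be a database URI
print(by(iris.species, shortest=iris.petal_length.min()))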


Apache Spark’s Unified Model

I’ve been a fan of Apache Spark (Go Bears!) for a while, despite not having had a really good opportunity to put the toolkit to practical use. Last year I got to AMPCamp 3 and the first Spark Summit. At the latter event, the AMPLab started singing a new tune about the benefits of a unified model for big data processing, moving on from selling in-memory computing.

Cloudera’s Gwen Shapira posted a good case study of the upside:

But the biggest advantage Spark gave us in this case was Spark Streaming, which allowed us to re-use the same aggregates we wrote for our batch application on a real-time data stream. We didn’t need to re-implement the business logic, nor test and maintain a second code base. As a result, we could rapidly deploy a real-time component in the limited time left — and impress not just the users but also the developers and their management.
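
The mechanics of that reuse are worth seeing. A minimal sketch: write the aggregation once as a function of an RDD, then apply it per micro-batch with DStream.transform. The socket source and five-second batch interval are assumptions, not anything from Shapira’s case study:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def count_by_key(rdd):
    # Shared business logic: count occurrences of each event type.
    return rdd.map(lambda event: (event, 1)).reduceByKey(lambda a, b: a + b)

sc = SparkContext("local[2]", "unified-sketch")

# Batch path: the aggregate applied to a static RDD.
print(count_by_key(sc.parallelize(["click", "view", "click"])).collect())

# Streaming path: the very same function, applied to each micro-batch.
ssc = StreamingContext(sc, 5)
events = ssc.socketTextStream("localhost", 9999)  # assumed socket source
events.transform(count_by_key).pprint()
# ssc.start(); ssc.awaitTermination()  # uncomment to actually run the stream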


Items Of Note

A bit dated, but hopefully not completely useless:


Can’t Wait

Maybe I’ll have finished re-reading the Blue Ant trilogy by then.


Foxy 538

Welcome back Nate!

The breadth of our coverage will be much clearer at this new version of FiveThirtyEight, which is launching Monday under the auspices of ESPN. We’ve expanded our staff from two full-time journalists to 20 and counting. Few of them will focus on politics exclusively; instead, our coverage will span five major subject areas — politics, economics, science, life and sports.

What I like about this particular post (go read it all, seriously) is the level of humility Silver expresses. A lot of people can, and do, do the math and follow the predictive approaches he espouses. But putting them to the principled service of informing The Public, within the current dynamic of Internet social media, is innovative. Computer Assisted Reporting was just a precursor. As a recovering new media hack I can appreciate all the roots of this iteration of his work.

Plus, I love this attitude:

It’s time for us to start making the news a little nerdier.


8 Years of S3

It feels like more than eight years though, maybe because I’ve been all over it since the beginning:

We launched Amazon S3 on March 14, 2006 with a press release and a simple blog post. We knew that the developer community was interested in and hungry for powerful, scalable, and useful web services and we were eager to see how they would respond.

Of course, I was dead wrong in my analysis. “S3 is not a gamechanger.” What was I thinking? Too much focus on the storage economics and not enough on the business model inflection point.


Python’s wheels

Packaging has always been a bit of a sore spot for Python modules. Maybe wheels are a step in the right direction. Armin Ronacher has written a nice overview of how to put wheels into actual useful practice:

Wheels currently seem to have more traction than eggs. The development is more active, PyPI started to add support for them and because all the tools start to work for them it seems to be the better solution. Eggs currently only work if you use easy_install instead of pip which seems to be something very few people still do.

So there you have it. Python on wheels. It’s there, it kinda works, and it’s probably worth your time.
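
For the minimal case, here’s a sketch of a setup.py that will happily produce a wheel once the wheel package is installed; the package name and layout are hypothetical:

# Minimal sketch of a wheel-buildable setup.py; name and layout are
# hypothetical. Build with: python setup.py bdist_wheel (needs `wheel`).
from setuptools import find_packages, setup

setup(
    name="example-pkg",
    version="0.1.0",
    packages=find_packages(),
)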


Pandas, payrolls, and pay stubs

Brandon Rhodes penned a nice, light, practical introduction to Pandas while using “small” data:

I will admit it: I only thought to pull out Pandas when my Python script was nearly complete, because running print on a Pandas data frame would save me the trouble of formatting 12 rows of data by hand.

This post is a brief tour of the final script, written up as an IPython notebook and organized around five basic lessons that I learned about Pandas by applying it to this problem.
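
And it really is that convenient. A tiny sketch, with made-up pay-stub numbers rather than Rhodes’ data, of the formatting a data frame gives you for free:

# Tiny sketch: made-up pay-stub numbers, printed for free by Pandas.
import pandas as pd

stubs = pd.DataFrame(
    {"gross": [2000.0, 2000.0, 2100.0], "tax": [412.5, 412.5, 433.1]},
    index=pd.to_datetime(["2014-01-15", "2014-01-31", "2014-02-14"]),
)
stubs["net"] = stubs["gross"] - stubs["tax"]
print(stubs)
print(stubs.sum())  # year-to-date totals in one line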


Missing Strata West…

Grumble. Not that I’ve ever been to a Strata Conference, but the Twitter feeds of @bigdata and @joe_hellerstein are taunting me.


Pandas 0.13.1

ICYMI, there’s a new Pandas out with a lot of goodies. Python + tabular data processing + high performance == yum!


Diggin’ On Avro

After some initial trepidation, I’m starting to enjoy working with Apache Avro. The schema language and options (avdl, avsc, avpr) are a bit obtuse, but the cross-language interop seems to work as advertised. Which is a good thing.
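
The round trip that makes the interop work is compact in Python. A hedged sketch using the avro package; note that avro.schema.parse is the Python 2-era spelling, which the Python 3 port capitalizes as Parse:

# Hedged sketch of an Avro round trip with the `avro` package.
# (The Python 3 port spells schema parsing as avro.schema.Parse.)
import io
import json

import avro.schema
from avro.io import BinaryDecoder, BinaryEncoder, DatumReader, DatumWriter

schema = avro.schema.parse(json.dumps({
    "type": "record",
    "name": "Ping",
    "fields": [{"name": "msg", "type": "string"}],
}))

# Serialize a record to bytes...
buf = io.BytesIO()
DatumWriter(schema).write({"msg": "hello"}, BinaryEncoder(buf))

# ...and read it back, as any Avro-speaking language could.
buf.seek(0)
print(DatumReader(schema, schema).read(BinaryDecoder(buf)))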


Spark Summit 2014

This looks like it will be bad timing for me, but as an AMPCamp 2013 and Spark Summit 2013 attendee, I can vouch for the event quality:

We are proud to announce that the 2014 Spark Summit will be held in San Francisco on June 30 – July 2 at the Westin St. Francis. Tickets are on sale now and can be purchased here.

For 2014, the Spark Summit has grown to a 3-day event. We’ll have two days of keynotes and presentations followed by one day of hands-on training. Attendees of the summit can choose between a 2-day conference-only pass or a 3-day conference and training pass.

If you can’t/didn’t get to Strata West 2014, this will be your next best opportunity to get a deep dive into the Spark ecosystem.


Data Community DC

I don’t know if it’s the best or the biggest, but DC has one damn well organized community of data enthusiasts:

Data Community DC (DC2) is an organization formed in mid-2012 to connect and promote the work of data professionals in the National Capital Region. We foster education, opportunity, and professional development through high-quality, community-driven events, content, resources, products and services. Our goal is to create a truly open and welcoming community of people who produce, consume, analyze, and work with data — data scientists, analysts, economists, programmers, researchers, and statisticians, regardless of industry, sector, or technology. As of January 2014, we are over 5,000 members strong from diverse industries and from a large variety of backgrounds.

But that’s what we do here in the DMV, build bureaucratic organizational structures. Ha, ha! Only serious.


Trifacta Launch

Glad to see Trifacta ship their first product. I had a bit of an insider seat on the Lockheed Martin collaboration. They’ve iterated like crazy since I saw a very primitive version in June. Good luck to Dr. Hellerstein and the team, and of course Go Bears!


Apache Spark 0.9.0

Good to see a release of Apache Spark with GraphX included, even if the graph package is only in alpha:

We are happy to announce the availability of Spark 0.9.0! Spark 0.9.0 is a major release and Spark’s largest release ever, with contributions from 83 developers. This release expands Spark’s standard libraries, introducing a new graph computation package (GraphX) and adding several new features to the machine learning and stream-processing packages. It also makes major improvements to the core engine, including external aggregations, a simplified H/A mode for long lived applications, and hardened YARN support.

Spark is an open source project on the move. Previously, in-memory distributed computation was the big selling point. Now it’s unification of disparate computational models cleanly embedded within the Hadoop ecosystem.


Yet Another MapReduce DSL

Apache Crunch has been around for a while, but the recent addition for support of Apache Spark and a Scala REPL caught my eye:

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.


AWS Tips

Link parkin’

AWS is one of the most popular cloud computing platforms. It provides everything from object storage (S3), elastically provisioned servers (EC2), databases as a service (RDS), payment processing (DevPay), virtualized networking (VPC and AWS Direct Connect), content delivery networks (CDN), monitoring (CloudWatch), queueing (SQS), and a whole lot more.

In this post I’ll be going over some tips, tricks, and general advice for getting started with Amazon Web Services (AWS). The majority of these are lessons we’ve learned in deploying and running our cloud SaaS product, JackDB, which runs entirely on AWS.


Linden Still Linkin’

Greg Linden’s got a new batch of interesting links. That’s worth coming out of posting hibernation (n.b. not retirement).


So Far, So Good

[iTerm screenshot: 591 days of uptime]

In the past few days I’ve:

  1. Transitioned my feedreading experience post-GReader
  2. Upgraded my WordPress installation and a couple of plug-ins
  3. Gone from Ubuntu 11.10 (pictured above, 591 days uptime wow!) to Ubuntu 12.10

Everything so far has been pretty painless, other than one lingering bug in feedbin that only seems to affect the feed for Tim Bray’s Ongoing. Unfortunately, this is one of my favorite feeds. Seems a bit suspect that the issue lingers as Bray and his feed have been around for like ever, and a good feed library should process his, of all people’s, correctly. But I’ll chalk it up to feedbin’s growing pains.

And ReadKit is passable, but I wouldn’t exactly call it … zippy … on Ye Olde MacBook.


The Final Call

Give or take a few due to potential timezone adjustments, in 6 hours Google Reader will go dark. Once again, shout out to all GReader staff past and present for delivering a ton of value for nothing out of my pocket. No heapings of scorn from this quarter. Execs made a business decision and I wasn’t exactly a paying customer. It was a good run while it lasted. Special kudos to Mihai Parparita for whipping together the eminently useful readerisdead toolkit on short notice. Somehow it successfully slurped down multiple gigabytes of Reader data for me across multiple accounts.

Moving onward! I’ve decided to go with feedbin.me since it’s approved for use with Mr. Reader. I realized that despite my affection for NetNewsWire, I now do the vast majority of my feedreading on my iPad, either on the couch or interstitially. So tilting towards my favorite reader there means the least dislocation. Meanwhile, Marco Arment somewhat put a stake in the prospects of NetNewsWire. To compensate on the desktop, I’m adding ReadKit to the mix.

However, like dangerousmeta, I waited until the last minute to make up my mind. I’m reserving the right to radically change my mind as I see fit.


And We’re Back

A month or so off felt pretty good. Lots of great stuff out there in the RSSsphere, despite the coming apocalypse. This choice nugget from Cal Newport on building a great career really hit home:

The courage culture paints a tempting picture of how people end up with remarkable lives. It tells a story where you’re the main character, fighting evil forces, and ultimately triumphing after a brief but intense battle.

The reality is decidedly less exciting. Remarkable careers require that you become remarkably good. This takes time. But not necessarily a string of defiant rejections of some mysterious status quo.

As for feed reading, I’ve stashed my feed list, and may just go Brent Simmons and live with the next NetNewsWire.


Slowing Down

This post is of a pair with 540. And due to Hokey Smokes, there’s even a little more spice. I finally get a link and then I’m going to put on the brakes!

Since today is my birthday, I try and reflect on things I can readily change up to stay out of unhealthy ruts or just to keep myself fresh. 540 days in a row is more than enough to prove that I can keep a posting streak alive. The conjunction of birthday, nice round number, and national holiday seems more than auspicious timing to give up that streak. Plus, I’ve been at this blogging thing off and on for well over 10 years. (Remember when it was all about “social software”?)

Even though I disagree a bit with the whole post, Greg Linden recently captured a bit of where I’m at:

I find my blogging here to be too useful to me to stop doing it. I have also embraced microblogging in its many forms. Yet I am left wondering if there is something we are all missing, something shorter than blogging and longer than tweets and different than both, that would encourage thoughtful, useful, relevant mass communication.

We are still far from ideal. A few years ago, it used to be that millions of blog and press articles flew past, some of which might pile up in an RSS reader, a few of which might get read. Now, millions of tweets, thousands of Facebook posts, and millions of articles fly past, some of which might be seen in an app, a few of which might get read. Attention is random; being seen is luck of the draw. We are far from ideal.

I don’t think blogging is dead. I’m not sure blogging was always about journalism. And I personally haven’t embraced microblogging, although Twitter makes for a great link stream. But blogging is too useful, and fun!, for me to stop cold. I will however, be slowing down a bit. Might be a couple of posts a week but probably no less than once per. I will, however, feel no obligation to any given frequency. So if you’ve been using this blog as your daily hit of excitement, I thank you for your attention, but encourage you to add another source or two as a replacement.

And even though the posting streak wasn’t particularly onerous in terms of time, I’ll be trying to turn the same habits of mind to side projects involving coding and data analysis. This also means my content should trend toward more technical topics, but we’ll see. With this year’s #3 pick in the NBA draft, the Washington Wizards’ luck is looking up, so they might be even more interesting to talk about in 2013-14. I also have this half-assed idea to do a series of 10 REM posts, reminiscing on 10 years of blogging, by trawling through the archives of Mass Programming Resistance, New Media Hack, and out into the wider web.

Be seeing you!

P.S. Feels like Greg has another start-up thread within him even though there’s already a clear direction for Geeky Ventures!


540

Check this out:

Enthought Python Distribution -- www.enthought.com
Version: 7.3-2 (32-bit)

Python 2.7.3 |EPD 7.3-2 (32-bit)| (default, Apr 12 2012, 11:28:34) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "credits", "demo" or "enthought" for more information.
>>> import datetime
>>> datetime.date(2013,05,26) - datetime.date(2011,12,3)
datetime.timedelta(540)
>>>

That’s my way of saying I’ve posted for 540 days straight. Also on the order of 599 out of 600. Yeah me!

In fall of 2011, just on a lark and as an experiment in behavior modification, I set a goal of posting for 365 days straight. I slipped up, got back on the horse and never looked back. Mission accomplished.

Got a few other things to say, so more after the break:

It’s fairly amazing some of the things I’ve managed to push past:

  • Holidays and holiday travel
  • “Vacations”
  • Crunch times at work
  • Work travel
  • Long days of work + kid + social
  • Personal illness
  • Oral surgery
  • When my wife was in the emergency room after a car accident
  • When I was in the emergency room after passing out from dehydration
  • Days when I thought I had nothing to say
  • Days when I really didn’t want to say anything
  • Days when the world had more important things going on

I learned a lot on this road. Every day I tried to push out some content at least of interest to me, and maybe to someone else. If you’re not in it for the money or the fame, that’s the best you can do. Getting worked up about some fictitious “audience” doesn’t do you any good.

A wide, but not overwhelming, variety of source feeds is necessary. Particular tools don’t make much difference, although posting from the iPhone came in handy a few times. I occasionally did the prepared in advance post thing, but 90% of them were fresh baked that day. Queueing ’em up always seemed a bit like cheating.

One final, interesting observation about a time oriented goal like this one, is that you can’t make it go any slower or faster. No way to speed it up and get it over with quicker. No way to put it on pause just because you’ve got an issue in your life. It’s all about grinding out the clock. But in a twist, you start to crave the daily achievement. You wake up thinking about what today’s post should be. And you get a bit fidgety late in the evening if you haven’t yet met deadline. It becomes an addiction, but in this case a good one.


Hokey Smokes!

Yowsa! I actually got a link-out from dangerousmeta! I’m showing my blogging age here, but I’ve noted in the past my admiration for the site. Meanwhile, I don’t do any audience tracking or visit analytics at all for MPR. Pretty much have no idea who’s actually reading this stuff, if anyone. So it’s one of those old time, early ’oughts (yup, I go back that far) thrills to see the site title pop up in another feed.


Essential Formulae

Evan Miller’s statistical material for programmers might come in handy:

As my modest contribution to developer-kind, I’ve collected together the statistical formulas that I find to be most useful; this page presents them all in one place, a sort of statistical cheat-sheet for the practicing programmer.

Most of these formulas can be found in Wikipedia, but others are buried in journal articles or in professors’ web pages. They are all classical (not Bayesian), and to motivate them I have added concise commentary. I’ve also added links and references, so that even if you’re unfamiliar with the underlying concepts, you can go out and learn more. Wearing a red cape is optional.
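
In that spirit, here’s one classical formula of the sort Miller collects, coded up: the lower bound of the Wilson score interval for a Bernoulli proportion. The z value below assumes a 95% confidence level:

# Wilson score interval lower bound for a Bernoulli proportion -- a
# classical formula of the kind on Miller's cheat sheet. z=1.96 ~ 95%.
import math

def wilson_lower_bound(successes, trials, z=1.96):
    if trials == 0:
        return 0.0
    p = successes / trials
    center = p + z * z / (2 * trials)
    spread = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center - spread) / (1 + z * z / trials)

print(wilson_lower_bound(85, 100))  # e.g. 85 upvotes out of 100 ratings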


Deep Into Partitions

Network partitions, that is, and their implications for some common, popular, open source datastores. Kyle Kingsbury has cooked up “Call Me Maybe”:

This article is part of Jepsen, a series on network partitions. We’re going to learn about distributed consensus, discuss the CAP theorem’s implications, and demonstrate how different databases behave under partition.

In-depth technical content on the Web. Who knew! You have been warned.


CommaFeed As Backup Plan

I was all set to put CommaFeed on the list of potential GReader replacements after seeing a mention coming across the MetaFilter feed. Then I started reading the MeFi comments and this one from Rhaomi really hit home:

It’s not just the interface and UI, which is pretty easy to clone. It’s the staggering infrastructure that powers it — the sophisticated search crawlers scouring the web and delivering near-real-time updates, the industrial-scale server farms that store untold petabytes of searchable text and images relevant to you (much of it from long-vanished sources), the ubiquitous Google name that makes the service a popular platform for innumerable third-party apps, scripts, and extensions.

It’s possible to code up something that looks and feels a lot like Reader in three months, with the same view types and shortcuts. But to replicate its core functionality — fast updates, archive search, stability, universal access, wide interoperability — takes Google-scale engineering I doubt anybody short of Microsoft/Yahoo can emulate. It was very nearly a public service, and it’s going to be frustrating trying to downsize expectations for such a core web service to what a startup — even a subscription-backed one — can accomplish.

Not to mention the current CommaFeed landing page annoyingly doesn’t have any type of “About” page, just a force funnel to registration. Hey, I like to at least be sweet talked a little before wasting a password!


NewsBlur Recco

Rafe Colburn is bullish on NewsBlur as a replacement for Google Reader, especially after the recent redesign:

The upside of NewsBlur has been that it works really well at its core purpose, fetching feeds and displaying them for the user. It also has solid native mobile clients, enabling you to keep read status in sync across devices.

That’s a good enough endorsement for me. With the clock ticking on the GReader shutdown, I’ll give NewsBlur the first crack at filling the void for me.


Cal Berkeley GraphX

C’mon Bears, cut it out. It’s getting embarrassing how much Spark-related output there has been recently. In a good way!

From social networks to targeted advertising, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Unfortunately, directly applying existing data-parallel tools to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new graph-parallel systems (e.g. Pregel, PowerGraph) which are designed to efficiently execute graph algorithms. Unfortunately, these new graph-parallel systems do not address the challenges of graph construction and transformation which are often just as problematic as the subsequent computation. Furthermore, existing graph-parallel systems provide limited fault-tolerance and support for interactive data mining.

We introduce GraphX, which combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the Spark data-parallel framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data-structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.

Need to drill in to see how GraphX stacks up against the current spate of “big data” graph toolkits, especially GraphLab. Ben Lorica reports that GraphX is oriented more toward programmer productivity than raw performance:

GraphX is a new, fault-tolerant, framework that runs within Spark. Its core data structure is an immutable graph (Resilient Distributed Graph – or RDG), and GraphX programs are a sequence of transformations on RDG’s (with each transformation yielding a new RDG). Transformations on RDG’s can affect nodes, edges, or both (depending on the state of neighboring edges and nodes). GraphX greatly enhances productivity by simplifying a range of tasks (graph loading, construction, transformation, and computations). But it does so at the expense of performance: early prototype algorithms written in GraphX were slower than those written in GraphLab/PowerGraph.

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.