
Whereabouts

It occurs to me that there is a major event that I have not noted in this venue. About 18 months ago, I gave up my position at Lockheed Martin and moved to a much smaller company, Invincea. That web site is almost exclusively about the commercial product side of Invincea, but we also have a federal services division called Invincea Labs.

Labs is in exactly the same DoD Science and Technology research space I worked in at LM. We hustle Contracted Research and Development (CRAD) from various agencies looking for technical solutions to bleeding edge problems.

How it came about is an amusing story, but I literally ended at LM on a Friday afternoon and started across the street at Invincea the following Monday. The biggest change is that I went down three orders of magnitude in employee head count. Also, Labs has a pure focus on cybersecurity. No more worrying about expensive jet fighters and all that. Lean, mean, and a relaxed attitude have been a refreshing change of pace.

I’m part of the Cyber Analytics team, and Labs has a lot of open positions. We work on all sorts of bleeding edge projects, so shoot me an e-mail at brian.dennis@invincea.com if any of them seem to fit you.


It’s Been A Long Time…

…, I shouldn’t have left you,
Without a strong rhyme to step to.
Think of how many weak shows you slept through.
Time’s up. I’m sorry I kept you…

— Rakim

MacBook Pro 2015

It’s a new Macaversary around here.


Gibson’s “The Peripheral”

I finished reading William Gibson’s The Peripheral about a day ago. As an avowed Gibson fanboy, I’ve oddly got some pretty mixed feelings about the book.

First off, I did this weird thing of pre-ordering from Amazon, so I had the book on the first day of availability. October 28th, 2014. Six. Months. Ago. For whatever reason, I parked the hardcover and never got started. Maybe it was the dread of hauling the hefty tome around.

So taking advantage of some vacation time, I jumped in and devoured it promptly. Ticked all of my Gibson check boxes. But in the end I felt a bit unsatisfied.

Can’t quite put my finger on it, but I really didn’t get a rush. Since Cayce Pollard, I haven’t really felt like a Gibson protagonist has been in much peril. Just a matter of waiting to see how they get out of it. Yeah, Flynne got kidnapped and threatened, but everybody came out clean in the end.

And the Chinese server explanation I found lacking.

Maybe this one just needs to grow on me like the Blue Ant Trilogy.


EMR Spark

Muy bueno! Spark is now an official part of Amazon Elastic MapReduce.

I’m happy to announce that Amazon EMR now supports Apache Spark. Amazon EMR is a web service that makes it easy for you to process and analyze vast amounts of data using applications in the Hadoop ecosystem, including Hive, Pig, HBase, Presto, Impala, and others. We’re delighted to officially add Spark to this list. Although many customers have previously been installing Spark using custom scripts, you can now launch an Amazon EMR cluster with Spark directly from the Amazon EMR Console, CLI, or API.
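For flavor, here’s a minimal sketch of spinning one up from Python with boto3. The release label, instance sizes, and roles are my assumptions, not a tested recipe:

    # Hedged sketch: launch an EMR cluster with Spark via boto3.
    # Release label, instance types, counts, and roles are assumptions.
    import boto3

    emr = boto3.client("emr", region_name="us-east-1")
    response = emr.run_job_flow(
        Name="spark-sandbox",
        ReleaseLabel="emr-4.0.0",           # a release line with Spark built in
        Applications=[{"Name": "Spark"}],
        Instances={
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    print(response["JobFlowId"])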


Shades of Redis

Good piece from Charles Leifer highlighting a few alternative takes on the Redis key/data-structure store:

Recently I’ve learned about a few new Redis-like databases: Rlite, Vedis and LedisDB. Each of these projects offers a slightly different take on the data-structure server you find in Redis, so I thought that I’d take some time and see how they worked. In this post I’ll share what I’ve learned, and also show you how to use these databases with Walrus, as I’ve added support for them in the latest 0.3.0 release.

I’m particularly intrigued by the embedded Rlite store. Seems like something useful for situations slightly less relational than what SQLite can service.
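Leifer’s walrus wrapper makes kicking the tires painless. A sketch of the rlite flavor, where the import path and constructor arguments are assumptions drawn from his post rather than something I’ve run:

    # Hedged sketch of Walrus over embedded rlite instead of a Redis server.
    # Module path and file name below are assumptions from Leifer's post.
    from walrus.tusks.rlite import WalrusLite

    db = WalrusLite("/tmp/walrus.db")       # rlite persists to a local file
    huey = db.Hash("user:1")                # familiar Redis-ish structures
    huey.update(name="Huey", species="cat")
    print(huey["name"])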


Deep Cassandra

Andrew Montalenti relates parse.ly’s experience with Cassandra. Lots of interesting tidbits, but the money graf is this:

A well-seasoned technologist friend of mine was not at all surprised when I walked him through some of these issues we had with Cassandra. He said, “You honestly expected that adopting a data store at your scale would not require you to learn all of its internals?” He has a point. After all, we didn’t adopt Elasticsearch until we really grokked Lucene.


Upserts, mmmmmm!

I don’t know how I missed Craig Kerstiens’ post on upserts in PostgreSQL 9.5, but I’m glad they’re here.

If you’ve followed anything I’ve written about Postgres, you know that I’m a fan. At the same time you know that there’s been one feature that so many other databases have, which Postgres lacks and it causes a huge amount of angst for not being in Postgres… Upsert. Well the day has come, it’s finally committed and will be available in Postgres 9.5.
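The new syntax is INSERT ... ON CONFLICT. A minimal sketch of what that looks like from Python via psycopg2; the page_hits table and connection string are hypothetical:

    # Hedged sketch: a PostgreSQL 9.5 upsert from Python.
    # Table, columns, and dbname are made up for illustration.
    import psycopg2

    conn = psycopg2.connect("dbname=blog")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO page_hits (url, hits) VALUES (%s, 1)
            ON CONFLICT (url) DO UPDATE SET hits = page_hits.hits + 1
            """,
            ("/archives.html",),
        )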


GearPump

Since my last extended run of blogging, I’ve really gotten into message system infrastructure and streaming data computation architectures. I may have to kick the tires on GearPump:

GearPump is a lightweight real-time big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. … Per initial benchmarks we are able to process 11 million messages/second (100 bytes per message) with a 17ms latency on a 4-node cluster.

That seems like a lot of msgs/sec. Gotta see the specs on that cluster.



Customizing IPython for Pandas

Link parkin’

Chris Moffitt’s blog post doesn’t have a lot of detail, but the attendant IPython notebook looks chock full of goodness:

The combination of IPython + Jupyter + Pandas makes it easy to interact with and display your data. Not surprisingly, these tools are easy to customize and configure for your own needs. This article summarizes some of the most useful and interesting options.
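The flavor of customization involved, sketched with option names from the pandas docs and purely illustrative values:

    # Hedged sketch: common pandas display tweaks for an IPython session.
    import pandas as pd

    pd.set_option("display.max_rows", 120)      # show more rows before truncating
    pd.set_option("display.max_columns", 50)    # and more columns
    pd.set_option("display.float_format", "{:,.2f}".format)  # readable floats

    df = pd.DataFrame({"revenue": [1234.5678, 98765.4321]})
    print(df)   # renders using the options set above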


Ignite v Spark

I’m an unabashed Apache Spark fanboy, but it’s good intellectual hygiene to know about the technical alternatives. Apache Ignite is one that has slipped beneath my radar. Konstantin Boudnik contrasts Ignite and Spark:

Complimentary to my earlier post on Apache Ignite in-memory file-system and caching capabilities I would like to cover the main differentiation points of the Ignite and Spark. I see questions like this coming up repeatedly. It is easier to have them answered, so you don’t need to fish around the net for the answers.

Clearly not from a native English speaker, but definitely worth the read.


Get Famous With Spark

You too can win fame and fortune using Apache Spark! It helps to invent it and write up a great PhD dissertation.

Matei Zaharia won the 2014 Doctoral Dissertation Award for his innovative solution to tackling the surge in data processing workloads, and accommodating the speed and sophistication of complex multi-stage applications and more interactive ad-hoc queries. His work proposed a new architecture for cluster computing systems, achieving best-in-class performance in a variety of workloads while providing a simple programming model that lets users easily and efficiently combine them.

Go Bears!


Reconciling Streaming Jargon

I really enjoyed reading Martin Kleppmann’s treatise on varying communities and terminology related to stream processing:

Some people call it stream processing. Others call it Event Sourcing or CQRS. Some even call it Complex Event Processing. Sometimes, such self-important buzzwords are just smoke and mirrors, invented by companies who want to sell you stuff. But sometimes, they contain a kernel of wisdom which can really help us design better systems.

In this talk, we will go in search of the wisdom behind the buzzwords. We will discuss how event streams can help make your application more scalable, more reliable and more maintainable. Founded in the experience of building large-scale data systems at LinkedIn, and implemented in open source projects like Apache Kafka and Apache Samza, stream processing is finally coming of age.

On the day job, I’m on my third deployment of a message queueing system to support prototyping of stream processing algorithms. I’m really starting to appreciate the fundamental differences between the various approaches. I can also say there’s no “right way” to do it. Each use case has to be looked at individually, and there will definitely be some bespoke customization. Carefully define your correctness and performance guarantees and there’s a chance you’ll get it right.

Dispatches like Kleppmann’s, though, are helpful in understanding what the landscape looks like and where you’d like to be.


Sprung For Pinner

I’ve gotten back into consistently collecting links of interest and started heavily using the fine Pinboard product. Capturing links from iOS devices was a drag though, having to go through a JavaScript bookmarklet. Felt sort of convoluted. Too much friction. But my iPhone and iPad are the places where I run across most of the links I want to stash.

Enter Pinner, a $4.99 app. A little pricey, but it looks very nice and provides a clean experience for browsing Pinboard. Best of all it adds a custom share sheet so posting a link to Pinboard is just one click away.


At What COST?

Frank McSherry published a useful reminder that one must carefully calibrate the need to deploy “big data” solutions:

Lots of people struggle with the complexities of getting big data systems up and running, when they possibly shouldn’t be using the systems in the first place. The data sets above are certainly not small (billions of edges), but still run just fine on a laptop. Much faster than the distributed systems, at least.

Here are two helpful guidelines (for largely disjoint populations):

  1. If you are going to use a big data system for yourself, see if it is faster than your laptop.
  2. If you are going to build a big data system for others, see that it is faster than my laptop.

This brings back memories of the CMU work on GraphChi, where they processed graphs with billions of edges on a Mac Mini.

I’ll have to dig up Frank’s paper once it gets published.


Well That Explains It

I’ve been seeing some weird host naming issues on my Mac OS X machine for work. I thought it was an honest-to-gosh conflict with another machine, but it turns out there’s glitchiness in Apple’s latest DNS software:

Duplicate machine names. We use an old Mac named “nirrti” as a file- and iTunes server. In the pre-10.10 days, once in a blue moon nirrti would rename herself to “nirrti (2)”, presumably because it looked like another machine was already using the name “nirrti”. Under 10.10, this now happens a lot, sometimes getting all the way to nirrti (7). Changing back the computer name in the Sharing pane of the System Preferences usually doesn’t take. Apart from looking bad, this also makes opening network connections and playing iTunes content harder, as you need to connect to the right version of the name or nothing happens.

Good to know, but I wouldn’t go so far as to attempt the modifications described in the article. Seems like a recipe for later pain on further application and operating system upgrades.


GLA Podcast 49

I am all over that. Will definitely be checking it out during tomorrow’s commute.

Fiending for a Mushroom Jazz 8 release though. It’s been over 3 years since Mushroom Jazz 7 hit the street.


spark-kafka

You can also turn a Kafka topic into a Spark RDD:

Spark-kafka is a library that facilitates batch loading data from Kafka into Spark, and from Spark into Kafka.

This library does not provide a Kafka Input DStream for Spark Streaming. For that please take a look at the spark-streaming-kafka library that is part of Spark itself.

This could come in handy to pre-ingest some data to build up some history before connecting to a Kafka data stream using Spark Streaming.
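I won’t guess at spark-kafka’s own API here, but the Spark Streaming path its README points to looks roughly like this in recent PySpark releases; the ZooKeeper quorum, consumer group, and topic map are placeholders:

    # Hedged sketch: consuming a Kafka topic with Spark Streaming's
    # built-in receiver (the spark-streaming-kafka route noted above).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-demo")
    ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

    # ZooKeeper address, group id, and topic name are assumptions.
    stream = KafkaUtils.createStream(
        ssc, "zk-host:2181", "demo-group", {"events": 1}
    )
    stream.map(lambda kv: kv[1]).pprint()          # print message values

    ssc.start()
    ssc.awaitTermination()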


Elasticsearch and Spark

In the day job, I was casting about for ways to integrate Apache Spark with the open source search engine Elasticsearch. Basically, I had some megawads of JSON data which Elasticsearch happily inhales, but I needed a compute platform to work with the data. Spark is my weapon of choice.

Turns out there’s a really nice Elasticsearch Hadoop toolkit that includes making Spark RDDs out of Elasticsearch searches. I have to thank Sloan Ahrens for tipping me off with a nice clear explanation of putting the connector in action:

In this post we’re going to continue setting up some basic tools for doing data science. The ultimate goal is to be able to run machine learning classification algorithms against large data sets using Apache Spark™ and Elasticsearch clusters in the cloud.

… we will continue where we left off, by installing Spark on our previously-prepared VM, then doing some simple operations that illustrate reading data from an Elasticsearch index, doing some transformations on it, and writing the results to another Elasticsearch index.
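On the PySpark side, the connector rides the Hadoop InputFormat bridge, roughly as Ahrens demonstrates. A sketch, with the index name and addresses as placeholders and the connector JAR assumed to be on the Spark classpath:

    # Hedged sketch: read an Elasticsearch index into a Spark RDD via the
    # elasticsearch-hadoop connector. Index/type and hosts are hypothetical.
    es_conf = {
        "es.nodes": "localhost",
        "es.port": "9200",
        "es.resource": "megawads/docs",   # index/type to query
    }
    es_rdd = sc.newAPIHadoopRDD(
        inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
        keyClass="org.apache.hadoop.io.NullWritable",
        valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
        conf=es_conf,
    )
    print(es_rdd.take(1))   # (doc_id, {field: value, ...}) pairs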


Into the Blockchain

I’m way late to the Bitcoin party, but I think the notion of applications built from blockchain concepts will be a Big Deal (™). Andreas Antonopoulos’ new book Mastering Bitcoin is getting me up to speed. Here’s a taste:

One way to think about the blockchain is like layers in a geological formation, or a glacier core sample. The surface layers may change with the seasons, or even be blown away before they have time to settle. But once you go a few inches deep, geological layers become more and more stable. By the time you look a few hundred feet down, you are looking at a snapshot of the past that has remained undisturbed for millennia or millions of years. In the blockchain, the most recent few blocks may be revised if there is a chain recalculation due to a fork. The top six blocks are like a few inches of topsoil. But once you go deeper into the blockchain, beyond six blocks, blocks are less and less likely to change. After 100 blocks back, there is so much stability that the “coinbase” transaction, the transaction containing newly mined bitcoins, can be spent. A few thousand blocks back (a month) and the blockchain is settled history. It will never change.

From what I’ve read so far, the book is a nice blend of high level overview and technical details, with code samples no less.


Yes, It Still Works

Yes this blog is still fully operational. As sole owner, proprietor, publisher, and author, I’m committing to more content in 2015. I guarantee it’s going to be a more interesting year in these here parts.


Spark, IPython, Kafka

A couple of good overviews from the fine folks at Cloudera.

First, Gwen Shapira & Jeff Holoman on “Apache Kafka for Beginners”

Apache Kafka is creating a lot of buzz these days. While LinkedIn, where Kafka was founded, is the most well known user, there are many companies successfully using this technology.

So now that the word is out, it seems the world wants to know: What does it do? Why does everyone want to use it? How is it better than existing solutions? Do the benefits justify replacing existing systems and infrastructure?

In this post, we’ll try to answer those questions. We’ll begin by briefly introducing Kafka, and then demonstrate some of Kafka’s unique features by walking through an example scenario. We’ll also cover some additional use cases and compare Kafka to existing solutions.

And Uri Laserson on “How-to: Use IPython Notebook with Apache Spark”

Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.
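One lightweight way to get that pairing, though not necessarily Laserson’s exact recipe, is the findspark package inside a notebook:

    # Hedged sketch: wiring PySpark into an IPython/Jupyter session.
    # Assumes SPARK_HOME is set and findspark is installed.
    import findspark
    findspark.init()          # put pyspark on sys.path

    from pyspark import SparkContext
    sc = SparkContext(appName="notebook-exploration")

    rdd = sc.parallelize(range(1000))
    print(rdd.sum())          # quick smoke test that the context works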


Ramez Naam’s “Nexus”

Generally, I dislike the technothriller genre (cf. Daemon), but I quite enjoyed “Nexus” by Ramez Naam. The technical and philosophical aspects of bio-hacking were well done. I wasn’t particularly fond of the technothriller clash of nation states, American exceptionalism, and military/intelligence complex sycophancy tropes, but I knew what I was getting into. At least there was some interesting cultural diversity and introspection in the mix.

I may actually pick up the sequel, “Crux”.


Blaze Expressions

“tl;dr Blaze abstracts tabular computation, providing uniform access to a variety of database technologies”

Haven’t gotten a chance to dig in yet, but Continuum Analytics’ new Blaze Expressions library is worthy of further inspection:

Occasionally we run across a dataset that is too big to fit in our computer’s memory. In this case NumPy and Pandas don’t fit our needs and we look to other tools to manage and analyze our data. Popular choices include databases like Postgres and MongoDB, out-of-disk storage systems like PyTables and BColz and the menagerie of tools on top of the Hadoop File System (Hadoop, Spark, Impala and derivatives.) Each of these systems has their own strengths and weaknesses and an experienced data analyst will choose the right tool for the problem at hand. Unfortunately learning how each system works and pushing data into the proper form often takes most of the data scientist’s time.

The startup costs of learning to munge and migrate data between new technologies often dominate biggish-data analytics.

Blaze strives to reduce this friction. Blaze provides a uniform interface to a variety of database technologies and abstractions for migrating data.

I especially like the notion of driving multiple different backend engines, such as in-memory (Pandas), SQL, NoSQL (MongoDB), and Big Data (Apache Spark), from the same tabular expressions.
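A sketch of the flavor, cribbed from their examples; the Data constructor and URI syntax are my assumptions about a fast-moving API:

    # Hedged sketch: one Blaze expression targeting different backends.
    # The iris.csv path and sqlite URI are hypothetical.
    from blaze import Data

    csv = Data("iris.csv")                      # small: Pandas under the hood
    db = Data("sqlite:///flowers.db::iris")     # bigger: SQL under the hood

    # The same expression runs against either backend.
    print(csv[csv.sepal_length > 5.0].species.distinct())
    print(db[db.sepal_length > 5.0].species.distinct())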


Apache Spark’s Unified Model

I’ve been a fan of Apache Spark (Go Bears!) for a while, despite not having a really good opportunity to put the toolkit to practical use. Last year I got to AMPCamp 3 and the first Spark Summit. At the latter event, the AMPLab started singing a new tune about the benefits of a unified model for big data processing, moving on from selling in-memory computing.

Cloudera’s Gwen Shapira posted a good case study of the upside:

But the biggest advantage Spark gave us in this case was Spark Streaming, which allowed us to re-use the same aggregates we wrote for our batch application on a real-time data stream. We didn’t need to re-implement the business logic, nor test and maintain a second code base. As a result, we could rapidly deploy a real-time component in the limited time left — and impress not just the users but also the developers and their management.
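That reuse story is concrete in the API: a function written against RDDs can be applied per micro-batch with transform(). A hedged sketch, where the paths, the SparkContext sc, and the pre-built dstream are assumed placeholders:

    # Hedged sketch of Spark's batch/streaming code reuse.
    def count_by_key(rdd):
        """Aggregation logic shared by the batch and streaming paths."""
        return rdd.map(lambda event: (event, 1)).reduceByKey(lambda a, b: a + b)

    # Batch: run over historical data.
    batch_counts = count_by_key(sc.textFile("hdfs:///events/2014/"))

    # Streaming: the very same function, applied to each micro-batch.
    stream_counts = dstream.transform(count_by_key)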


Items Of Note

A bit dated, but hopefully not completely useless:


Can’t Wait

Maybe I’ll have finished re-reading the Blue Ant trilogy by then.


Foxy 538

Welcome back Nate!

The breadth of our coverage will be much clearer at this new version of FiveThirtyEight, which is launching Monday under the auspices of ESPN. We’ve expanded our staff from two full-time journalists to 20 and counting. Few of them will focus on politics exclusively; instead, our coverage will span five major subject areas — politics, economics, science, life and sports.

What I like about this particular post (go read it all, seriously) is the level of humility Silver expresses. A lot of people can, and do, do the math and follow the predictive approaches he espouses. But putting it to the principled service of informing The Public, within the current dynamic of Internet social media, is innovative. Computer Assisted Reporting was just a precursor. As a recovering new media hack, I can appreciate all the roots of this iteration of his work.

Plus, I love this attitude:

It’s time for us to start making the news a little nerdier.


8 Years of S3

It feels like more than eight years though, maybe because I’ve been all over it since the beginning:

We launched Amazon S3 on March 14, 2006 with a press release and a simple blog post. We knew that the developer community was interested in and hungry for powerful, scalable, and useful web services and we were eager to see how they would respond.

Of course, I was dead wrong in my analysis. “S3 is not a gamechanger.” What was I thinking? Too much focus on the storage economics and not enough on the business model inflection point.


Python’s wheels

Packaging has always been a bit of a sore spot for Python modules. Maybe wheels are a step in the right direction. Armin Ronacher has written a nice overview of how to put wheels into actual useful practice:

Wheels currently seem to have more traction than eggs. The development is more active, PyPI started to add support for them and because all the tools start to work for them it seems to be the better solution. Eggs currently only work if you use easy_install instead of pip which seems to be something very few people still do.

So there you have it. Python on wheels. It’s there, it kinda works, and it’s probably worth your time.


Pandas, payrolls, and pay stubs

Brandon Rhodes penned a nice, light, practical introduction to Pandas while using “small” data:

I will admit it: I only thought to pull out Pandas when my Python script was nearly complete, because running print on a Pandas data frame would save me the trouble of formatting 12 rows of data by hand.

This post is a brief tour of the final script, written up as an IPython notebook and organized around five basic lessons that I learned about Pandas by applying it to this problem.
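In that spirit, a toy sketch of the kind of thing Pandas makes pleasant. The figures and column names are made up, not from Brandon’s script:

    # Hedged sketch: twelve months of made-up payroll rows, formatted for free.
    import pandas as pd

    months = pd.date_range("2014-01-01", periods=12, freq="MS")
    payroll = pd.DataFrame({"gross": [5200.0] * 12}, index=months)
    payroll["social_security"] = payroll["gross"] * 0.062   # illustrative rate
    payroll["take_home"] = payroll["gross"] - payroll["social_security"]
    print(payroll)   # the pretty-printing that justifies reaching for Pandas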


Missing Strata West…

Grumble. Not that I’ve ever been to a Strata Conference, but the Twitter feeds of @bigdata and @joe_hellerstein are taunting me.


Pandas 0.13.1

ICYMI, there’s a new Pandas out with a lot of goodies. Python + tabular data processing + high performance == yum!


Diggin’ On Avro

After some initial trepidation, I’m starting to enjoy working with Apache Avro. The schema language and options (avdl, avsc, avpr) are a bit obtuse, but the cross-language interop seems to work as advertised. Which is a good thing.
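For reference, the canonical Python round trip looks about like this; the user.avsc schema file and its fields are hypothetical, and note the parse call is spelled Parse in the Python 3 package:

    # Hedged sketch: write and read an Avro container file from Python,
    # following the pattern in the Avro getting-started docs.
    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    schema = avro.schema.parse(open("user.avsc", "rb").read())  # hypothetical schema

    with open("users.avro", "wb") as out:
        writer = DataFileWriter(out, DatumWriter(), schema)
        writer.append({"name": "Ada", "favorite_number": 7})
        writer.close()

    with open("users.avro", "rb") as f:
        reader = DataFileReader(f, DatumReader())
        for user in reader:
            print(user)   # same record, readable from any Avro language binding
        reader.close()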


Spark Summit 2014

This looks like it will be bad timing for me, but as an AMPCamp 2013 and Spark Summit 2013 attendee, I can vouch for the event quality:

We are proud to announce that the 2014 Spark Summit will be held in San Francisco on June 30 – July 2 at the Westin St. Francis. Tickets are on sale now and can be purchased here.

For 2014, the Spark Summit has grown to a 3-day event. We’ll have two days of keynotes and presentations followed by one day of hands-on training. Attendees of the summit can choose between a 2-day conference-only pass or a 3-day conference and training pass.

If you can’t/didn’t get to Strata West 2014, this will be your next best opportunity to get a deep dive into the Spark ecosystem.


Data Community DC

I don’t know if it’s the best or the biggest, but DC has one damn well-organized community of data enthusiasts:

Data Community DC (DC2) is an organization formed in mid-2012 to connect and promote the work of data professionals in the National Capital Region. We foster education, opportunity, and professional development through high-quality, community-driven events, content, resources, products and services. Our goal is to create a truly open and welcoming community of people who produce, consume, analyze, and work with data — data scientists, analysts, economists, programmers, researchers, and statisticians, regardless of industry, sector, or technology. As of January 2014, we are over 5,000 members strong, from diverse industries and a large variety of backgrounds.

But that’s what we do here in the DMV, build bureaucratic organizational structures. Ha, ha! Only serious.


Trifacta Launch

Glad to see Trifacta ship their first product. I had a bit of an insider seat on the Lockheed Martin collaboration. They’ve iterated like crazy since I saw a very primitive version in June. Good luck to Dr. Hellerstein and the team, and of course Go Bears!


Apache Spark 0.9.0

Good to see a release of Apache Spark with GraphX included, even if the graph package is only in alpha:

We are happy to announce the availability of Spark 0.9.0! Spark 0.9.0 is a major release and Spark’s largest release ever, with contributions from 83 developers. This release expands Spark’s standard libraries, introducing a new graph computation package (GraphX) and adding several new features to the machine learning and stream-processing packages. It also makes major improvements to the core engine, including external aggregations, a simplified H/A mode for long lived applications, and hardened YARN support.

Spark is an open source project on the move. Previously, in-memory distributed computation was the big selling point. Now it’s the unification of disparate computational models, cleanly embedded within the Hadoop ecosystem.


Yet Another MapReduce DSL

Apache Crunch has been around for a while, but the recent addition of support for Apache Spark and a Scala REPL caught my eye:

Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.


AWS Tips

Link parkin’

AWS is one of the most popular cloud computing platforms. It provides everything from object storage (S3), elastically provisioned servers (EC2), databases as a service (RDS), payment processing (DevPay), virtualized networking (VPC and AWS Direct Connect), content delivery networks (CDN), monitoring (CloudWatch), queueing (SQS), and a whole lot more.

In this post I’ll be going over some tips, tricks, and general advice for getting started with Amazon Web Services (AWS). The majority of these are lessons we’ve learned in deploying and running our cloud SaaS product, JackDB, which runs entirely on AWS.


Linden Still Linkin’

Greg Linden’s got a new batch of interesting links. That’s worth coming out of posting hibernation (n.b. not retirement).

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.