Link parkin’: TimescaleDB
An open-source time-series database optimized for fast ingest and complex queries. Looks, feels, speaks like Postgres.
First, watch the Alien: Covenant Prologue trailers online, such as “The Crossing”
and “The Last Supper”.
Now you’ve seen the best parts of the entire Alien: Covenant production. None of which appear in the theatrical release.
Use your money to buy a ticket for Guardians of the Galaxy v2.
You’re welcome!
Brendan Gregg is a performance analysis superstar and I’ve already told you eBPF is wicked cool. I’m not going to the Velocity conference but I’ll check out the talk once it shows up on Safari.
What is eBPF and why is it useful?
eBPF is a weird Linux kernel technology that powers low-overhead custom analysis tools, which can be run in production to find performance wins that no other tool can. With it, we can pull out millions of new metrics from the kernel and applications, and explore running software like never before. It’s a superpower. It’ll benefit many people on Linux as they’ll add a toolkit of new analysis tools, or use new plugins for deep monitoring. That’s what I’ll show in my Velocity talk: new tools you can use.
There are four other good questions to go along with the above.
I bought a ticket for the Data Intelligence conference:
The 2017 Data Intelligence conference, which will take place in McLean, Virginia, is the first machine learning gathering for the community using and developing machine learning and data intelligence. It is produced and underwritten by NumFOCUS, the 501(c)(3) nonprofit that supports and promotes world-class, innovative, open source scientific computing. Through the Data Intelligence Conference, NumFOCUS advances its mission of growing the international community of open source developers.
To be honest, the event description is a bit buzzword-laden for my taste. I think this conference is a substitute for last year’s PyData conference in DC. This one is local though, mostly over a weekend, and the entrance fee was the right price. Maybe I’m missing them, but DC-based technology events that are grassroots and outside of the Federal space are hard to find. So I’m really looking forward to the conference.
Given how hiccupy (sic) this blog has been, I doubt there’s anyone reading who might also be in attendance, but give me a shout if you do actually exist.
Still in early release, Data Science on the Google Cloud Platform might be a good read:
Valliappa (Lak) Lakshmanan, Technical Lead for Data & ML Professional Services at Google Cloud, is the author of the upcoming O’Reilly Media book “Data Science on the Google Cloud Platform” (now in Early Release). In the following Q&A, Lak describes his reasons for writing this book, its intended readers, what readers will learn and how to think about the practice of data science on Google Cloud Platform (GCP)-based architecture.
The pull quote is from a post about the book over at the Google Cloud Platform blog.
Knocked off Liu Cixin’s The Three Body Problem this weekend, an extremely entertaining tale of initial alien encounter. While the overall plot and literary execution are outstanding, the key factor is that this is a translation from a popular Chinese work. In general, the shift from Western norms is bracing and in particular, the Communist Revolution in China is woven throughout the story to devastating effect. The overall reverence for science, apparent in the text and Liu’s afterword, is also refreshing.
The tale also invites serious consideration of humanity, inhumanity, and the fate of man on Earth.
My only nit is that at the end, the aliens are heavily anthropomorphized, which didn’t work for me. But I acknowledge the wry symmetry that Liu invoked by doing so.
Apologies for the copyediting twitches.
Derrick Harris used to be a GigaOM reporter on the big data beat. When GigaOM went under he moved on to doing media for Mesosphere.
Looks like Harris launched out on his own again in January, doing a combo of blogging, newsletter, and podcasting. All of it can be found at architecht.io. The interviewee lineup on the podcast looks especially good with some high profile names like Eric Brewer, Mike Olson, Jay Kreps, and Julia Austin.
Graham Cormode, who is probably one of the most knowledgeable people in the world on the topic, has written an insanely good article on data sketching:
The aim of this article has been to introduce a selection of recent techniques that provide approximate answers to some general questions that often occur in data analysis and manipulation. In all cases, simple alternative approaches can provide exact answers, at the expense of keeping complete information. The examples shown here have illustrated, however, that in many cases the approximate approach can be faster and more space efficient. The use of these methods is growing. Bloom filters are sometimes said to be one of the core technologies that “big data experts” must know. At the very least, it is important to be aware of sketching techniques to test claims that solving a problem a certain way is the only option. Often, fast approximate sketch-based techniques can provide a different tradeoff.
I say “insanely good” because there is some seriously hairy math behind these techniques. Yet Cormode makes the principles easily accessible to a general, admittedly already technically inclined, audience. Speaking as a former instructor, I’d say this is an article you could give to a bunch of upperclassmen and then spend two good lectures working through details and implications. No mean feat. Plus, these types of data structures are increasingly important to know about.
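To make the tradeoff in the quote concrete, here is a minimal Bloom filter sketch in Python. It is not taken from Cormode’s article; the class name, the bit-array size, and the salted SHA-256 hashing scheme are all my own assumptions, chosen for readability rather than speed or space efficiency.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per "bit" keeps the code readable; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive num_hashes positions by salting a single hash function (an assumption,
        # not the only way to do it).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # Present only if every position for this item has been set.
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
for url in ("pinboard.in", "discogs.com", "kafka.apache.org"):
    seen.add(url)

print("discogs.com" in seen)  # added items are always reported as present
print("example.org" in seen)  # very likely absent, but a false positive is possible
```

The filter never stores the items themselves, which is where the space savings come from, and also why it can only answer “definitely not seen” or “probably seen”.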
As a very early del.icio.us fanboy, and current Pinboard customer, Maciej Ceglowski’s chutzpah impresses me.
Pinboard has acquired Delicious. Here’s what you need to know:
If you’re a Pinboard user, nothing will change. Sad!
If you’re a Delicious user, you will have to find another place to save your bookmarks. The site will stay online, but on June 15, I will put Delicious into read-only mode. You won’t be able to save new bookmarks after that date, or use the API.
Not sure if I’m more surprised that del.icio.us is still live or that it went for so cheap.
Link parkin’:
Simit, A language for computing on sparse systems.
Simit is a new programming language that makes it easy to compute on sparse systems using linear algebra. Simit programs are typically shorter than Matlab programs yet are competitive with hand-optimized code and also run on GPUs.
With Simit you build a graph that describes your sparse system (e.g. a spring system, a mesh or the world wide web). You then compute on the system in two ways: locally or globally. Local computations apply update functions to each vertex or edge of the graph that update local state based on the vertex or the edge and its endpoints. This part of the language is similar to what you find in graph processing frameworks such as GraphLab and its descendants.
Occupy the Cloud: Distributed Computing for the 99%
Distributed computing remains inaccessible to a large number of users, in spite of many open source platforms and extensive commercial offerings. While distributed computation frameworks have moved beyond a simple map-reduce model, many users are still left to struggle with complex cluster management and configuration tools, even for running simple embarrassingly parallel jobs. We argue that stateless functions represent a viable platform for these users, eliminating cluster management overhead, fulfilling the promise of elasticity. Furthermore, using our prototype implementation, PyWren, we show that this model is general enough to implement a number of distributed computing models, such as BSP, efficiently. Extrapolating from recent trends in network bandwidth and the advent of disaggregated storage, we suggest that stateless functions are a natural fit for data processing in future computing environments.
Actually, PyWren seems like yet another top notch UC Berkeley CS research project. Go Bears!
Even though Prismatic never really worked for me and never caught on in general, I like Bradford Cross’s musings on Twitter. His latest venture recently surfaced, and his was the interview on the first episode of The Architecht podcast I listened to. This reminded me of some thoughts he had on AI startups earlier this year:
With AI in a full-fledged mania, 2017 will be the year of reckoning. Pure hype trends will reveal themselves to have no fundamentals behind them. Paradoxically, 2017 will also be the year of breakout successes from a handful of vertically-oriented AI startups solving full-stack industry problems that require subject matter expertise, unique data, and a product that uses AI to deliver its core value proposition.
Seems like production AI is more of a Formula One endeavor than a stock car one. Cross might be right, but it would be interesting if an ML equivalent of Hadoop emerged. Interesting, but low probability.
I’ve been poking around for a while on the Interwebs looking for accessible streaming data sources besides the oversubscribed Twitter feeds. Today I stumbled across Satori, with an initial description of the service from their blog:
Why? Because the world of open data needs to change. Right now there is a trove of open, public data available all over the world. But instead of being able to realize its potential, that data is at-rest on a variety of disparate websites across the internet.
By coalescing the world’s open data into streaming live data and making it available for free, we’ll be able to see new solutions to big problems and ideas that haven’t even been thought of yet.
With Satori, any developer with a computer, anywhere in the world can create a free account and have unlimited access to live open data to build the live data apps.
Right now I’d be nervous building anything serious on such a new service, lest they run out of money and abruptly shut down the service. For throwaway noodling and proofs-of-concept it looks like Satori provides something valuable. If nothing else, a convenient feed of Wikipedia edits would be interesting to experiment with.
OEmbed Link rot on URL: https://twitter.com/pacoid/status/850480933534785536
Diggin’ through some old Twitter faves and found that @pacoid is doing online courses covering Natural Language Processing:
Keep in mind, these courses are the opposite of MOOCs. We realized how the industry had swung too far in the wrong direction with Ed Tech, how VC-backed tech startups had taken seriously detrimental short-cuts to attempt scale in learning, how current trends in “education” at scale opposed our ethos and experience at O’Reilly. Our origin story as a company was about peer teaching, with Tim and Dale active at Unix user group meetings. We’ve always been about peer teaching — that’s one reason I was eager to lead this program, calling back to my teaching fellowship many years ago at Stanford, where I’d helped establish a popular peer teaching program there.
If you’re already an O’Reilly Safari subscriber it looks like a great, quick (couple of hours), intro. Unfortunately, all the upcoming sessions are already full with a waitlist! Hopefully, Paco can sneak in a few more sessions this year.
I’ve really gotten into podcasts recently, where once I was just not a fan. They make a great alternative to broadcast radio if, like me, you’re stuck in transit a lot. Thanks to Wes Felter’s blog, the Packet Pushers Podcast came across my radar. Definitely deep, meaty, technical stuff and recommended. Now I feel like I actually understand what NFV means.
Be it resolved that I will complete reading 50 books in the remainder of this calendar year.
To begin this journey, I knocked off Iain M. Banks’ The Player of Games yesterday. I had been slogging through the first half of the book, but then the latter half really moved for me. Previously, I was not a big fan of Banks’ Consider Phlebas. I can recommend The Player of Games, although it too can be a tad disturbing at times.
Older, wiser, definitely grayer. F’in readers! The last decade has been a bit of a roller coaster.
A few regrets, but extremely thankful I’ve managed to last this long and for the many gifts I’ve been blessed with.
“To infinity and beyond,” although it feels like everyone who’s reached this point is on a month-to-month lease.
Quentin Monnet is making it easy to dive into eBPF.
So instead, here is what we will do. After all, I spent some time reading and learning about BPF, and while doing so, I gathered a fair amount of material about BPF: introductions, documentation, but also tutorials or examples. There is a lot to read, but in order to read it, one has to find it first. Therefore, as an attempt to help people who wish to learn and use BPF, the present article introduces a list of resources. These are various kinds of readings, that hopefully will help you dive into the mechanics of this kernel bytecode.
A little virtual machine running inside the kernel is such a wicked concept.
I’m in the market for a new desktop monitor. 4K resolution seems to have hit a reasonable price and quality point. The Wirecutter has a top notch list of suggestions.
And @amplab project is officially over - it was great to see the staff, students, co-directors mike franklin & ion stoica pic.twitter.com/CawDMbWLKB
— Ben Lorica 罗瑞卡 (@bigdata) November 19, 2016
End of Project has arrived for UC Berkeley’s AMPLab. Spark and the other varied projects of the group hit my radar back in July of 2012. The project was run according to Prof. Dave Patterson’s guidance for collaborative research centers. I think it’s fair to say AMPLab was a success.
Looking forward to what comes next and Go Bears!
Duly noting a number of things I’m thankful for this year:
Eventually I’ll get into k8s enough that this trick will be useful to know:
In my opinion, the standard kube-ui is pretty spartan. It doesn’t really give me a good overview of what is going on in my cluster.
Weave Scope is an open source tool that helps you monitor and visualize your cluster. It is currently very beta, but I think it has a lot of potential!
Running it is also super easy.
Plus I really like what the Weave folks have been up to.
Thanks Sandeep!
I’ve already read a bunch of links on this reading list, but there are some new nuggets. I also like Lars Albertsson’s take on the process of building data intensive systems:
This is a curated recommended read and watch list for scalable data processing. It is primarily aimed towards software architects and developers, but there is material also for people in leadership position as well as data scientists, in particular in the first section. The content has been chosen with a bias towards material that conveys a good understanding of the field as a whole and is relevant for building practical applications.
Mark Litwintschik has been doing yeoman’s work with the New York City Taxi & Limousine Commission data. Over a series of blog posts he’s taken this one dataset and processed it with a number of data management and “big data” technologies, including purchasing an Amazon Redshift cluster:
Over the past few months I’ve been benchmarking a dataset of 1.1 billion taxi journeys made in New York City over a six year period on a number of data stores and cloud services. Among these has been AWS Redshift. Up until now I’ve been using their free-tier and single-compute node clusters. While these make Data Warehousing very inexpensive they aren’t going to be representative of the incredible query speeds which can be achieved with Redshift.
In this post I’ll be looking at how fast a 6-node ds2.8xlarge Redshift Cluster can query over a billion records from the dataset I’ve put together.
Litwintschik is admirable in how well he documents the steps he takes to actually run queries against such a large dataset. Just getting such data into the right place to work with is challenging. There are lots of places you can trip up doing this stuff and his work can save others a lot of trouble.
Something along these lines is what I’m aspiring to do with Fun With Discogs Data.
Mushroom Jazz Eight has launched! https://t.co/OremgYvlB9 @PledgeMusic #downtempo #chillout #triphop
— Mark Farina (@djmarkfarina) June 3, 2016
Heck yeah, you know I ordered up some of Mark Farina’s forthcoming Mushroom Jazz 8. Now it’s just a matter of what and when I add some other items.
Really been enjoying the weekly Data Machina e-mail newsletter put out by @ds_ldn a.k.a. Data Science London. It’s actually fairly dense, but usefully eclectic, focused on things relevant to “data science”, broadly construed. I always find at least a handful of links that are out and out great.
This may become a recurring aspect of the blog, thus the abbreviation and numbering. I’ve managed to catch up and download the entirety of discogs.com data dump archives to my personal laptop. As of this writing, it’s about 153 GB of mostly compressed XML data, at varying levels of quality going all the way back to 2008.
I don’t really have much of a plan other than to explore an interesting longitudinal data set. One thing I’m hoping to do is come up with a modernish set of tools to process the data, including normalizing and transforming to other formats. The other goal is to push it all up into Google Cloud Platform and see what working with data in that environment is like. Also, planning to make code and generated data open, since Discogs provides it under an extremely liberal license.
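As a rough idea of what a “modernish” processing step could look like, here is a minimal sketch that streams a gzip-compressed XML file and emits one JSON line per record, using only the Python standard library. The element and attribute names (`release`, `id`, `title`, `year`) and the sample records are placeholders I made up for illustration; the real dumps should be checked against their actual schema.

```python
import gzip
import io
import json
import xml.etree.ElementTree as ET

# Tiny made-up sample standing in for a real (much larger) compressed dump.
SAMPLE = b"""<releases>
  <release id="1"><title>Example Release A</title><year>2001</year></release>
  <release id="2"><title>Example Release B</title><year>2002</year></release>
</releases>"""


def iter_releases(fileobj):
    """Yield one dict per <release> element without building the whole tree up front."""
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "release":
            yield {
                "id": elem.get("id"),
                "title": elem.findtext("title"),
                "year": elem.findtext("year"),
            }
            elem.clear()  # discard this record's children and text once it is emitted


compressed = io.BytesIO(gzip.compress(SAMPLE))
with gzip.open(compressed, "rb") as dump:
    rows = list(iter_releases(dump))

for row in rows:
    print(json.dumps(row))  # newline-delimited JSON, easy to load elsewhere
```

For a real file, the `io.BytesIO` stand-in would be replaced by a path passed to `gzip.open`, and the rows would be written out as they are produced rather than collected into a list.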
Nice little video overview of service discovery in microservice architectures and how Consul can fill that role. “Why is service discovery important? (And what is Consul?)”. Fair notice, it’s a teaser for an O’Reilly video training course on microservices.
I recently worked on a proposal that heavily incorporated the notion of unikernels. Even so, I’m not really sure I could have explained what they are, even to someone else technically proficient.
Enter the Google Cloud Platform Podcast. Listening to Pivotal’s John Feminella I finally heard a clear, clean explanation. Check it out for yourself, but the notion of an automatically constructed, application specific, machine image that can run on a hypervisor nails it for me.
They’re still extremely bleeding edge, but it looks like unikernel based approaches will have a place in the microservices oriented future.
P.S. I just started listening to the GCP Podcast, but I’m encouraged by how informative these first couple of episodes have been.
I enjoyed this O’Reilly Data Podcast conversation with Michael Armbrust regarding Apache Spark 2.0’s Structured Streaming:
With the release of Spark version 2.0, streaming starts becoming much more accessible to users. By adopting a continuous processing model (on an infinite table), the developers of Spark have enabled users of its SQL or DataFrame APIs to extend their analytic capabilities to unbounded streams.
Within the Spark community, Databricks Engineer, Michael Armbrust is well-known for having led the long-term project to move Spark’s interactive analytics engine from Shark to Spark SQL. (Full disclosure: I’m an advisor to Databricks.) Most recently he has turned his efforts to helping introduce a much simpler stream processing model to Spark Streaming (“structured streaming”).
You’ll need a login, but there’s also a deeper dive video from Armbrust and Tathagata Das going into more details of Structured Streaming.
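The “infinite table” idea from the quote can be sketched in a few lines of plain Python. To be clear, this is not Spark code and uses none of Spark’s APIs; the function name and the word-count query are my own assumptions, meant only to show a result being updated incrementally as new rows arrive.

```python
from collections import Counter


def running_counts(batches):
    """Treat each batch as new rows appended to an unbounded table and yield
    the updated result of a `GROUP BY word, COUNT(*)` style query."""
    counts = Counter()
    for batch in batches:
        counts.update(batch)  # incremental: only the newly arrived rows are processed
        yield dict(counts)    # snapshot of the result table after this batch


stream = [["spark", "kafka"], ["spark"], ["flink", "spark"]]
for result in running_counts(stream):
    print(result)
```

The query is written once, as if over a static table, and the engine’s job is to keep its answer current as the table grows.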
At one point, Ben Lorica asked Armbrust about the dimensions upon which developers should evaluate streaming platforms. The obvious ones (delivery guarantees, latency, throughput) were brought up. I’d add a few more
Apache Spark Structured Streaming, Kafka Streams, Twitter Heron, Apache Flink. So much to choose from.
Like I said, a crowded space.
Last year we announced the introduction of our new distributed stream computation system, Heron. Today we are excited to announce that we are open sourcing Heron under the permissive Apache v2.0 license. Heron is a proven, production-ready, real-time stream processing engine, which has been powering all of Twitter’s real-time analytics for over two years. Prior to Heron, we used Apache Storm, which we open sourced in 2011. Heron features a wide array of architectural improvements and is backward compatible with the Storm ecosystem for seamless adoption.
As I said, I use (and like) Kafka quite a bit. The new release of Kafka 0.10.0, as covered by Confluent’s Neha Narkhede, has a number of interesting features:
I am very excited to announce the availability of the 0.10 release of Apache Kafka and the 3.0 release of the Confluent Platform. This release marks the availability of Kafka Streams, a simple solution to stream processing and Confluent Control Center, the first comprehensive management and monitoring system for Apache Kafka. Around 112 contributors provided bug fixes, improvements, and new features such that in total 413 JIRA issues and 13 KIPs were resolved.
Kafka Streams becomes official, but the timestamped messages will turn out to be very handy.
K8S stands for Kubernetes, which is a container orchestration platform from Google. Translation? Kubernetes is a system for running distributed code with high availability at scale. Looks like there’s a nice bite-sized Udacity course on Kubernetes serving as an introduction.
This course is designed to teach you about managing application containers, using Kubernetes. We’ve built this course in partnership with experts such as Kelsey Hightower and Carter Morgan from Google and Netflix’s former Cloud Architect, Adrian Cockcroft (current Technology Fellow at Battery Ventures), who provide critical learning throughout the course.
Mastering highly resilient and scalable infrastructure management is very important, because the modern expectation is that your favorite sites will be up 24/7, and that they will roll out new features frequently and without disruption of the service. Achieving this requires tools that allow you to ensure speed of development, infrastructure stability and ability to scale. Students with backgrounds in Operations or Development who are interested in managing container based infrastructure with Kubernetes are recommended to enroll!
Might have to carve out some time for this one.
Apache Storm is approaching 5 years as an interesting, useful, important, vibrant open source project. P. Taylor Goetz, one of the project leads, is doing an overview of Storm in concert with the 1.0 release.
In this series of blog posts, we will provide an in-depth look at select features introduced with the release of Apache Storm (Storm) 1.0. To kick off the series, we’ll take a look at how Storm has evolved over the years from its beginnings as an open source project, up to the 1.0 milestone release.
The space of computing over and developing against streaming data has grown crowded in the past year or two. Storm is one of those technologies that’s good to know about, as it provides a useful baseline of features to discuss and has significant “burn in”. And it’s still getting better!
These days, I do a lot of work with Apache Kafka. Kafka implements partitioned, replicated, append only logs. If you squint enough, those logs can look like a messaging system. Turns out Kafka is pretty good for a lot of distributed system and “big data” use cases.
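For anyone who hasn’t squinted at it yet, here is a toy, in-memory sketch of the partitioned, append-only log idea. It is not Kafka’s client API: the class and method names are hypothetical, replication is not modeled at all, and CRC32 is just a stand-in for whatever partitioner a real deployment uses.

```python
import zlib


class ToyLog:
    """In-memory stand-in for a topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # The same key always lands in the same partition, so per-key order is preserved.
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        # Readers keep track of their own offsets; reading never removes records.
        return self.partitions[partition][offset:]


log = ToyLog()
partition, first_offset = log.append("user-42", "login")
log.append("user-42", "click")
print(log.read(partition, first_offset))
```

Because records are never removed by reading, two consumers can each replay the same partition from whatever offset they like, which is the part that makes a log look like a messaging system.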
I couldn’t make it to the inaugural Kafka Summit, but the organizers have made video recordings and slides available for all of the presentations. Well done!
On July 22, 2016, Farina returns with the 8th installment of one of electronic music’s longest running compilations, Mushroom Jazz, celebrating the 25th anniversary of the series.
There is nothing else that can be added to this announcement. You know I’m all over this one.
You know you’ve been around awhile when you start observing “acronym recycling”.
SMAQ: Storage, MAp/reduce, Query
SMACK: Spark, Mesos, Akka, Cassandra, Kafka
When I first heard the term in Ben Lorica’s O’Reilly Data Show podcast episode with Evan Chan, I did a double take. Trawling the interwebs a bit, it looks like there might be some there there. MeetUps. Slides. Talks. Conferences, sort of. Even manifestos!
Not exactly one-to-one, but definitely square in the same ecosystem. The 2010 Radar article is still surprisingly relevant. And if you think of the trends in “Big Data” over the last 5+ years, SMACK is basically an evolution of SMAQ, refined for the rise of Spark as a compute engine, and updated for the emergence of streaming, unbounded data processing.
SMACK HARD is a little too cute by half though, if you ask me.
I’ve been on a bit of a spending binge on products from the Fabric London store. When in the mood for some “new to me” music, I trawl their back catalog. To be honest, the purchases have been a combination of buying second market CDs or digital downloads from the Amazon Music Store. Hopefully, this plug will send some purchases their way. But over the course of the past month, I’ve collected more than six different titles.
The current leader of the pack is Evil Nine’s entry into the FabricLive list. The end-to-end DJ mix is a nice journey into breaks territory. Opens quite well, dips a little, then really picks up steam near the tail end. Highlights are Technologic, All I Wanna Do Is Break Some Hearts, and Nowhere Girl. Not to mention an inspired outro with The Clash’s London Calling.
The musical style is most reminiscent of stuff from Fatboy Slim, The Chemical Brothers, and an incredible one shot effort, The Dirtchamber Sessions, by The Prodigy. Give Evil Nine a whirl if you fancy that particular flavor.
Link parkin’:
Large-scale time-series data shows up across a variety of domains. In this post, I’ll introduce Spark-TS, a library developed by Cloudera’s Data Science team (and in use by customers) that enables analysis of data sets comprising millions of time series, each with millions of measurements. Spark-TS runs atop Apache Spark, and exposes Scala and Python APIs.
Deployed by Cloudera with real customers, according to them. Sorely needed. Appreciate the Python modules, which I hope aren’t too far behind the Scala API.
A while ago, I started a project called AdoptedArt, where I attempted to transliterate Matt Pearson’s work at AbandonedArt.org into Python. Back then there were two impediments. One, there really weren’t any graphical toolkits that were a solid equivalent of Processing. I cobbled something out of pyprocessing but it wasn’t very satisfying. Not to mention the project wasn’t particularly active. Two, my lil’ ole White MacBook really didn’t have enough horsepower to compensate for the Python performance penalty.
AdoptedArt fell by the wayside, but just for giggles, over Thanksgiving I went on a lark to see if it could be resurrected. Now I have two things on my side. One, the new MacBook Pro is easily an order of magnitude faster thanks to processing speedups, multiple cores, GPU acceleration, and a big old SSD. Two, NodeBox for OpenGL emerged, adding image manipulation capabilities and hardware acceleration to the NodeBox vector drawing API. Moore’s Law FTW! Plus, the install was painless using Continuum’s Anaconda, even though there were some C-based extensions to be built from source.
Bottom line, it only took me a little bit of work to adapt my adoption of AbandonedArt’s first Processing sketch, Spirograph, into NodeBoxGL. And it ran smooth as silk, with the MacBook barely breaking a sweat.
I’ve got high hopes to revive this project as a creative endeavor and a complete diversion from work stuff. We’ll see how it goes!
© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.