
2022 Books Completed, Part 4

It’s been a minute, or two, since we’ve done this. From May to August, 15 titles were closed out. That’ll take this year’s total to 25; 40 finished is conceivable.

Lots to get to. Let’s get stuck in.

read more ...


The Peripheral on Prime Video

Like many of his other novels, William Gibson’s The Peripheral grew on me after first reading. In late October, Amazon Prime will be airing a TV series adaptation. The trailer looks good, but then again the trailer always looks good.

read more ...


Streak Busted

Yesterday I busted my 434-day music listening streak, dangit! 😢

I hung my listening habit off of my exercise habit, but had a little change-up yesterday. I went for a walk in the morning instead of the evening and rushed out the door to stay on the daily schedule. Then I never got around to picking out a tracklist and firing it up.

Ah well. We go again.

Said the Manchester City fan, heh.


The Million Playlist Dataset

Link parkin’: Spotify Million Playlist Dataset Challenge

The Spotify Million Playlist Dataset Challenge consists of a dataset and evaluation to enable research in music recommendations. It is a continuation of the RecSys Challenge 2018, which ran from January to July 2018. The dataset contains 1,000,000 playlists, including playlist titles and track titles, created by users on the Spotify platform between January 2010 and October 2017. The evaluation task is automatic playlist continuation: given a seed playlist title and/or initial set of tracks in a playlist, to predict the subsequent tracks in that playlist. This is an open-ended challenge intended to encourage research in music recommendations, and no prizes will be awarded (other than bragging rights).

Mmmmm… data. Unfortunately, terms of service limit usage to participation in the challenge contest. So no data munging and redistributing.


Podcasting 2.0

TIL: Podcast Index

The Podcast Index is here to preserve, protect and extend the open, independent podcasting ecosystem.

We do this by enabling developers to have access to an open, categorized index that will always be available for free, for any use.

Has an API. Interestingly, the entire podcast database can be downloaded (guessing the full content of enclosures isn’t included). Mmmmmm… data.

The team behind the Podcast Index is leading a Podcasting 2.0 “movement” to add some features to RSS to improve the podcast experience for publishers and listeners.

Podcasting 2.0 is a set of forward looking ideas combined with the technology to realize them. It’s a vision for what the podcast listener experience can and should be. That experience has stagnated for over a decade, with almost all of the improvements coming in isolated sections of the ecosystem. There hasn’t been a single, unified vision from the podcasting community acting together with one voice. So, we’ve ended up with fragments of innovation across the podcasting landscape with no central driving goal in mind. Podcasting 2.0 is the expression of what that goal could be.

Bonus project: Sucky pulls down an RSS feed and all of its enclosures.

Hat tip to a great Changelog podcast episode, natch, discussing RSS with Ben Ubois, the creator of one of my favorite services, Feedbin.


Python CLI Authentication

My default starting point for building a new piece of software is to create a command line interface (CLI) app. I’ve even got my own personalized Python cookiecutter template to generate them quickly. Beyond supporting some personal preferences, like options for logging, the autogenerated tool immediately integrates well with the wider UNIX ecosystem, thanks to the click argument parsing toolkit.

Buuuut, I’ve never gotten a good handle on how to integrate this approach with the modern Web API OAuth+API token methodology. OAuth gives me a headache 😆.

I’m still digesting this article from the folks at notia.ai, entitled “Building an authenticated Python CLI,” but on first read it seems really well done.

When building out the Notia client, we found a real lack of resources around building a persistently authenticated Python library.

To address this, we are going to be building an interactive, authenticated Python CLI that uses the Twitter API to fetch the top Machine Learning tweets of the week! You can see the final result in the video demo above - or you can skip to the final code here.

Building this CLI will let us explore concepts like authenticating a local device between uses, accepting CLI arguments with Click, and displaying our data interactively with Rich.

I’ll definitely be taking their advice in building some future X-to-sqlite applications.
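In the meantime, here’s a minimal sketch of the central trick, caching an API token on disk so the CLI only prompts once. The paths and names are hypothetical placeholders, not Notia’s actual implementation:

import pathlib
import click

# Hypothetical location for the cached secret
TOKEN_PATH = pathlib.Path.home() / ".config" / "mycli" / "token"

def load_token() -> str:
    """Return the saved API token, prompting and caching it on first run."""
    if TOKEN_PATH.exists():
        return TOKEN_PATH.read_text().strip()
    token = click.prompt("Paste your API token", hide_input=True)
    TOKEN_PATH.parent.mkdir(parents=True, exist_ok=True)
    TOKEN_PATH.write_text(token)
    TOKEN_PATH.chmod(0o600)  # keep the secret readable only by the owner
    return token

@click.command()
def main():
    token = load_token()
    click.echo(f"Authenticated with token ending in {token[-4:]}")

if __name__ == "__main__":
    main()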

Due credit to Simon Willison for providing the upstream basis of my cookiecutter template.


stfnal, Word of the Day

Chanced across a completely new word to me: stfnal

stfnal (comparative more stfnal, superlative most stfnal)

Of or pertaining to scientifiction or science fiction.

Via a Cory Doctorow review of A Half-Built Garden, which I am running out and buying tomorrow in hardcover.


Xonsh, history, sqlite

xonsh has been growing on me as an interactive shell. One area I haven’t delved into much is the history capabilities. I guess I shouldn’t be too surprised that sqlite makes an appearance:

Xonsh has a second built-in history backend powered by sqlite (other than the JSON version mentioned all above in this tutorial). It shares the same functionality as the JSON version in most ways, except it currently doesn’t support the history diff action and does not store the output of commands, as the json-backend does. E.g. xonsh.history[-1].out will always be None.

The Sqlite history backend can provide a speed advantage in loading history into a just-started xonsh session. The JSON history backend may need to read potentially thousands of json files and the sqlite backend only reads one. Note that this does not affect startup time, but the amount of time before all history is available for searching.

Combine with sqlite’s full-text search capabilities for even more entertainment.
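Flipping it on is a single line in ~/.xonshrc, $XONSH_HISTORY_BACKEND = 'sqlite'. And here’s a hedged sketch of bolting FTS5 onto the history database; the file path and table name are assumptions about the backend’s defaults, so check $XONSH_DATA_DIR on your setup:

import sqlite3
from pathlib import Path

# Assumed default location of the sqlite history backend's database
db_path = Path("~/.local/share/xonsh/xonsh-history.sqlite").expanduser()
con = sqlite3.connect(db_path)
# Build a throwaway FTS5 index over the command text, then search it
con.executescript("""
    CREATE VIRTUAL TABLE IF NOT EXISTS hist_fts USING fts5(inp);
    DELETE FROM hist_fts;
    INSERT INTO hist_fts (inp) SELECT inp FROM xonsh_history;
""")
for (cmd,) in con.execute("SELECT inp FROM hist_fts WHERE hist_fts MATCH ?",
                          ("ffmpeg",)):
    print(cmd)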


Welcome Back Data Machina!

Data Machina is a weekly link newsletter on AI/ML topics broadly construed. It’s distinctive in the variety of areas it touches on (e.g. specific language sections for Python, Scala, Lisp/Clojure, R, plus segments on datasets and distributed systems), along with the sheer number of links. I was enjoying the content way back when it started out on TinyLetter, but it went on hiatus and thence behind a paywall, so effectively disappeared from my radar.

Why subscribe to Data Machina?

Data Machina brings you a highly curated selection of the best in Machine Learning, AI, Data Science, and Data Engineering every week, 52 weeks per year.

Loaded with useful, unique, and interesting content, Data Machina is read by thousands of AI/ML professionals and researchers around the world.

Data Machina is published in a minimalistic, easy-to-read format, with pure, simple text, and structured in clearly marked sections so you can scan them quickly without being disturbed by ads, banners, icons, images or other annoying stuff.

Now it’s back in a free version and looking as good as ever. Glad to make your acquaintance again!

Also, yet another plug for feedreading email newsletters. The latest versions of Data Machina quietly popped up in my feeds unannounced. No muss, no fuss. Just back to reading the high quality linkfest in its best habitat for me.


Mopidy Mystery

I was hoping to fool around with Mopidy as an audio playback engine because it’s written in Python and, according to the documentation, supports the MPD protocol. When I went to install it using homebrew on my MacBook Air, the latest version had problems with its plugins, and I discovered there was already an outstanding issue on GitHub. Unfortunately, a solution didn’t look promising, but at least I chimed in with my interest.

So off I go, working on other things and forgetting about the problem. Lo and behold, another user reported the real source of the issue and a convenient fix. With an export GST_PLUGIN_PATH=/opt/homebrew/lib/gstreamer-1.0/, my Mopidy server now works perfectly and can play back audio on my MacBook. Score one for just registering interest on GitHub!!

Mystery solved. Onwards to implementing my own database driven, dynamically created playlists.
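To give a flavor of where this is headed, a hedged sketch of driving Mopidy over the MPD protocol with the python-mpd2 library; the track URI is a placeholder standing in for a database query result:

from mpd import MPDClient  # python-mpd2

client = MPDClient()
client.connect("localhost", 6600)  # Mopidy-MPD listens on MPD's default port
client.clear()
# Pretend this URI list came out of a database-driven playlist query
for uri in ["local:track:fabric01-01.m4a"]:
    client.add(uri)
client.play()
client.close()
client.disconnect()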


Steampipe

TIL Steampipe. From the intro announcement:

Steampipe, a new open source project from Turbot, enables cloud pros (e.g. software developers, operations engineers and security teams) to query their favorite cloud services with SQL. It has quickly become one of our favorite tools in-house and we hope it finds a way into your tool box as well.

The heart of Steampipe is an intuitive command line interface (CLI) that solves the challenges encountered when asking questions of cloud resources and services. Traditional tools and custom scripts that provide visibility into these services are cumbersome, inconsistent across providers and painful to maintain. Steampipe provides a consistent, explorable and interactive approach across IaaS, PaaS and SaaS services.
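For flavor, the kind of one-liner this enables, assuming the AWS plugin is installed; table and column names are taken from the plugin’s docs:

steampipe query "select name, region from aws_s3_bucket"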

Via an O’Reilly Radar Post by Jon Udell. Glad to see him back!!


pgloader

TIL pgloader

pgloader has two modes of operation. It can either load data from files, such as CSV or Fixed-File Format; or migrate a whole database to PostgreSQL.

pgloader supports several RDBMS solutions as a migration source, and fetches information from the catalog tables over a connection to then create an equivalent schema in PostgreSQL. This means that you can migrate to PostgreSQL in a single command-line!
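That single command looks something like this; the connection strings are placeholders:

pgloader mysql://user:pass@localhost/appdb postgresql://localhost/appdb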

Via a Twilio blog post linked from the PyCoders Weekly newsletter, Issue 533


Actually Mutagen

Actually, mutagen is probably the right tool for the MP4 metadata job at hand, especially with the EasyMP4 class available.

Mutagen is a Python module to handle audio metadata. It supports ASF, FLAC, MP4, Monkey’s Audio, MP3, Musepack, Ogg Opus, Ogg FLAC, Ogg Speex, Ogg Theora, Ogg Vorbis, True Audio, WavPack, OptimFROG, and AIFF audio files. All versions of ID3v2 are supported, and all standard ID3v2.4 frames are parsed. It can read Xing headers to accurately calculate the bitrate and length of MP3s. ID3 and APEv2 tags can be edited regardless of audio format. It can also manipulate Ogg streams on an individual packet/page level.
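A quick sketch of what EasyMP4’s dict-style tagging looks like; the file name and tag values are placeholders:

from mutagen.easymp4 import EasyMP4

audio = EasyMP4("fabric01-01.m4a")  # placeholder file name
audio["artist"] = "Craig Richards"
audio["album"] = "Fabric 01"
audio["title"] = "Intro"
audio.save()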


mp4v2

Link parkin’: mp4v2

A C/C++ library to create, modify and read MP4 files

This is the new MP4v2 project, a fork of the abandoned MP4v2 library project now archived at Google Code.

Seems a little more convenient, vice ffmpeg, for working with .m4a files as converted or ripped by Apple’s Music.app. While primarily a library, there are a few cli tools such as mp4info and mp4tags.


sqlite-comprehend

Per usual Simon Willison has pushed out yet another impressive SQLite oriented tool: sqlite-comprehend

I built a new tool this week: sqlite-comprehend, which passes text from a SQLite database through the AWS Comprehend entity extraction service and stores the returned entities.

My attention was caught by multiple aspects:

  • The usage of many pieces of his toolkit but especially db-to-sqlite to grab data out of PostgreSQL, since I have some interesting data in guess what … PostgreSQL (see the sketch after this list)
  • Outsourcing entity extraction to AWS Comprehend
  • The application of SQLite’s full text search capabilities
  • And of course Simon’s way of writing this all up, which I aspire to emulate. I’m getting there with potential content.
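On that first point, the db-to-sqlite invocation is pleasingly short; the connection string and database names here are placeholders:

db-to-sqlite "postgresql://localhost/discogs" discogs.db --all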

Bottom line, I think it’s eminently possible to take my Discogs tables and Fabric views, export them into a single SQLite / Datasette instance, and have an easily searchable Discogs artifact that’s simple to distribute as one SQLite file on a CDN.

Don’t know if anyone else would use it, but it’s an itch I’d like to scratch.


TIL Bytewax

TIL about Bytewax

Bytewax is an open source Python framework for building highly scalable dataflows in a streaming or batch context.

read more ...


Discogs View Cleanup

This week I learned about PostgreSQL’s conditional expressions in general and the COALESCE expression in particular. A big part of the grunginess of my Discogs Postgres views is dealing with the data’s usage of alternative name variations, or anvs, in the fabric_track_artists view, which are quite often NULL. This propagates into a crappy ad hoc value for track_artists via abuse of concat_ws. I’ve got a pretty good feeling that can be handled more elegantly with a COALESCE.
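Something along these lines is what I have in mind; the column names beyond anv are assumptions about my own schema:

-- Prefer the name variation when present, else fall back to the primary name
SELECT coalesce(ta.anv, a.name) AS track_artist
FROM fabric_track_artists ta
JOIN artists a ON a.id = ta.artist_id;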

A couple of other things that need investigating:

  1. The regexps for fabric vs fabriclive should be collapsed into one
  2. Rename the fabric_live column to a more general fabric_series and compute it from the title column
  3. Reexamine the UNION statements to see if they can be handled by a more appropriate join

Lots of redundancy that can be cleaned up.


NetNewsWire SQLite Schema

Just for giggles, following up on my pondering regarding the SQLite schema within NetNewsWire, I poked around in the DB and pulled the schemas:

read more ...


Feedbin to sqlite

Thinking out loud. With a nice library around the Feedbin API it wouldn’t be too hard to grab the data and stuff it into SQLite. Alternatively, a Feedbin account could be registered with NetNewsWire and then the underlying SQLite DB inspected.

The former seems more elegant while the latter is radically pragmatic.
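A minimal sketch of the former, assuming the requests and sqlite-utils packages and Feedbin’s documented v2 entries endpoint; the credentials are placeholders:

import requests
import sqlite_utils

resp = requests.get(
    "https://api.feedbin.com/v2/entries.json",
    auth=("user@example.com", "password"),  # placeholder credentials
)
resp.raise_for_status()
db = sqlite_utils.Database("feedbin.db")
# Upsert entries keyed on Feedbin's entry id
db["entries"].insert_all(resp.json(), pk="id", replace=True)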

If for nothing else, poking around in the NetNewsWire SQLite DB probably illustrates a highly performant SQL data model and schemas for RSS data and feed management. Or even better, just read the source code.


zsv, and others

Link parkin’, since I have a largish collection of largish csv files I’m interested in processing: zsv

Preliminary performance results compare favorably vs other CSV utilities (xsv, tsv-utils, csvkit, mlr (miller) etc). Below were results on a pre-M1 macOS MBA; on most platforms zsvlib was 2x faster, though in some cases the advantage was smaller (e.g. 15-25%)


X to SQLite

Link parkin’. A couple of publicly available modules for taking personal data and stuffing it into an SQLite database, congruent with Simon Willison’s Dogsheep initiative

Dogsheep is a collection of tools for personal analytics using SQLite and Datasette.


TIL ffprobe

I’ve been working on automating metadata additions to my Fabric collection using information from Discogs. I was poking around for cli ways, especially via ffmpeg, to add the info to a music file and chanced across a really useful gist.

A quick guide on how to read/write/modify ID3 metadata tags for audio / media files using ffmpeg.

At the bottom of the gist is a mention of ffprobe. Much more appropriate for the task at hand, especially since it can generate output in JSON.

ffprobe gathers information from multimedia streams and prints it in human- and machine-readable fashion.

… ffprobe output is designed to be easily parsable by a textual filter, and consists of one or more sections of a form defined by the selected writer, which is specified by the print_format option.

Although to be fair, ffprobe doesn’t seem to be able to write metadata to a file.
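Reading, though, is a one-liner; the file name is a placeholder:

ffprobe -v quiet -print_format json -show_format -show_streams fabric01-01.m4a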

Bonus! ffprobe tips


HealthKit Data Hacking

Don’t know why, but I was reminded that Simon Willison had a neat utility to process data exported from Apple HealthKit, specifically from an Apple Watch: healthkit-to-sqlite

Convert an Apple Healthkit export zip to a SQLite database

Includes instructions for exporting from your watch. Worked surprisingly well.
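Usage is about as simple as it gets, something like:

healthkit-to-sqlite export.zip healthkit.db
datasette healthkit.db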

The neat part is that you can then use Datasette and the Datasette cluster map plugin to visualize outdoor workouts on a map. And of course there’s always good old exploratory data analysis using sqlite and Pandas.


Why xonsh?

I was dorking around last night and came up with the following in about 15 minutes of work in a xonsh command line session.

from pathlib import Path
# !() captures the command's output; iterate over it line by line
for fname in !(ls *.wav):
    p = Path(fname.strip())
    # @() splices Python values into a subprocess-mode command
    ffmpeg -i @(p) -c:a libfdk_aac -b:a 128k @(f"{p.stem}.m4a")

🎉 💥 💥 🎉

read more ...


PostgreSQL Timestamps

At the day job, I got sucked into trying to understand two PostgreSQL data types, timestamp and timestamptz. Thought I knew what I was doing, then read the docs and came away even more confused. Luckily, the folks at Cybertec had a pretty recent blog post on just this topic Time Zone Management in PostgreSQL.

Next to character encoding, time zones are among the least-loved topics in computing. In addition, PostgreSQL’s implementation of timestamp with time zone is somewhat surprising. So I thought it might be worth to write up an introduction to time zone management and recommendations for its practical use.

The punchline …

Even though it is easy to get confused with time zones, you can steer clear of most problems if you use timestamp with timezone everywhere, stick with IANA time zone names and make sure to set the TimeZone parameter to the time zone on the client side. Then PostgreSQL will do all the heavy lifting for you.

But really, read the whole thing. There’s a lot of nuance and the proper handling of timezones in Postgres is definitely not obvious. I may actually circle back and illustrate what dragged me into this tarpit and how I currently understand things.
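For a taste of the surprise, a minimal illustration assuming a session time zone of US Central: the plain timestamp cast silently discards the UTC offset, while timestamptz honors it.

SET TimeZone = 'America/Chicago';
SELECT '2022-08-15 12:00:00+00'::timestamptz;  -- 2022-08-15 07:00:00-05
SELECT '2022-08-15 12:00:00+00'::timestamp;    -- 2022-08-15 12:00:00, offset dropped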


Fly.io blog love

I think I discovered their blog during my last writing hiatus, so time to give ’em some love. The fine folks at fly.io have been doing excellent technical blogging for a few years now. Thomas Ptacek’s stuff, like this one on process isolation, is the pinnacle, but there’s good material from many quarters. For example, here are some quality posts on Firecracker


musicapp cli

It’s a small start, but my musicapp command line utility, for working with XML out of Apple’s Music.app, can be used in a one-liner with sqlite-utils:

read more ...


pg_timetable

Link parkin’: pg_timetable

pg_timetable is an advanced job scheduler for PostgreSQL, offering many advantages over traditional schedulers such as cron and others. It is completely database driven and provides a couple of advanced concepts.

In any serious development effort, I’m likely to have PostgreSQL in the stack so might as well take advantage of it for scheduled tasks too. One less piece of kit to worry about.
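Hedging heavily, since I’ve only skimmed the docs: recent versions appear to ship a timetable.add_job helper taking a cron schedule, something like:

-- Job name, cron schedule, and command are all placeholders
SELECT timetable.add_job('nightly-vacuum', '0 3 * * *', 'VACUUM ANALYZE');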

Plus: Pavlo Golub’s series of blog posts on pg_timetable. Pavlo is the creator of pg_timetable and part of CYBERTEC’s PostgreSQL Professional Services.


sqlite-utils

Link parkin’: sqlite-utils.

CLI tool and Python utility functions for manipulating SQLite databases

This library and command-line utility helps create SQLite databases from an existing collection of data.

Can’t believe I haven’t stashed Simon Willison’s insanely useful toolkit on this here blog. Makes it dead easy to do stuff with sqlite databases from the command line and from within Python. For example

If you have data as JSON, you can use sqlite-utils insert tablename to insert it into a database. The table will be created with the correct (automatically detected) columns if it does not already exist.
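Per that documented pattern, piping JSON straight in from the shell; the data here is a made-up example:

echo '[{"id": 1, "artist": "Craig Richards", "title": "Fabric 01"}]' | \
  sqlite-utils insert fabric.db mixes - --pk=id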


LEADS - The League of Embeddable, Alternative DataStores

PostgreSQL is totally awesome. But sometimes it’s more useful to have pure file(s) storage and query for your data. Herewith a collection of data storage engines that somewhat cover the space of more well-known engines:

read more ...


Working With Git and Pip

Previously I mentioned libpytunes and went to kick the tires. I thought it was published on PyPI but it turns out it wasn’t. So here I am going pip install libpytunes and wondering why a subsequent import libpytunes fails.

I’ve always known you can do pip install from a git repository, but a while back Adam Johnson wrote up some of the details. There are plenty of other good overviews out there (e.g. Simon Willison’s); this one just caught my eye recently.

Now pip install git+https://github.com/liamks/libpytunes actually installs the module and my import statement works as expected. Bonus, you can put git+https://github.com/liamks/libpytunes into requirements.txt and setup.py files as well, to achieve similar results.

Unfortunately the liamks version got hit by a trivial API change in plistlib in Python 3.9, so there was still breakage on my end, but Anirudh Acharya has a forked repo with the necessary one liner fix. Of course I used pip install git..., and now my Music.app experiments are proceeding apace.


Music Library Exporter

Link parkin’: Music Library Exporter

Music Library Exporter allows you to export your library and playlists from the native macOS Music app.

The library is exported in an XML format, and is compatible with other applications, services, and tools that rely on the Music (previously iTunes) XML library format.

🎉 🎊 🥳 BONUS!! CLI SUPPORT 🥳 🎊 🎉

Aside from the main Music Library Exporter application, this project also includes a command-line program called library-generator.

Now licking my chops for some serious Music.app automation, although I’m a little nervous about compatibility. Will give it a test drive and report back.


GitHub Archive

Can’t believe I’ve never posted about GH Archive

Open-source developers all over the world are working on millions of projects: writing code & documentation, fixing & submitting bugs, and so forth. GH Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

There’s a solid 10+ years of freely available GitHub spewed JSON to practice data spelunking, system benchmarking, and query hacking against.
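Grabbing an hour of history is just an HTTP GET; the URL pattern comes from the GH Archive homepage, one gzipped, newline-delimited JSON file per UTC hour:

import gzip
import json
import urllib.request

url = "https://data.gharchive.org/2015-01-01-15.json.gz"  # 3pm UTC, Jan 1 2015
with urllib.request.urlopen(url) as resp:
    with gzip.open(resp, mode="rt") as lines:
        for line in lines:
            event = json.loads(line)
            print(event["type"], event["repo"]["name"])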

Update: The CNCF DevStats project puts that data to interesting use through application of actual CNCF projects. So meta!

This is a toolset to visualize GitHub archives using HA Postgres databases and Grafana dashboards. Everything is open source so that it can be used by other CNCF and non-CNCF open source projects. The only requirement is that project must be hosted on a public GitHub repository/repositories. Project is deployed using Equinix bare metal Kubernetes nodes and deployed using a Helm chart. It uses many more CNCF projects under the hood.

At least Google couldn’t find any such post on my site. Maybe that’s a hint to implement some real local search. 🤣


PyOPML

Link parkin’: PyOPML

Welcome! This documentation is about PyOPML, a Python package meant to read, manipulate and write OPML 2.0 files.
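A hedged sketch of reading an export; the OpmlDocument class and outline attributes are my assumption of the package’s API, so double-check the docs:

from opml import OpmlDocument  # import name assumed

document = OpmlDocument.load("overcast.opml")
for outline in document.outlines:
    print(outline.text)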

Stashed in light of previously mentioning the OPML that the Overcast app generates.


YASnippet

I’ve been looking for a way to jump start my blog post authoring process via text expansion or templating within Emacs. Just haven’t had the time to go digging. Luckily a potential solution came to me: YASnippet

YASnippet is a template system for Emacs. It allows you to type an abbreviation and automatically expand it into function templates. … The snippet syntax is inspired from TextMate’s syntax, you can even import most TextMate templates to YASnippet.

Source code and info from EmacsWiki.
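For instance, a hypothetical snippet stubbing out a Pelican post header, using YASnippet’s snippet-file format with tab stops and inline elisp:

# -*- mode: snippet -*-
# name: pelican post header
# key: post
# --
Title: ${1:Post Title}
Date: `(format-time-string "%Y-%m-%d %H:%M")`
Category: ${2:misc}
Tags: ${3:}
$0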

YouTube demo embed after this break …

read more ...


Hacking iTerm2

Even though I’ve expounded on the coolness of xonsh, I haven’t put it to use in full anger yet. I’m thinking it might best be leveraged as an interactive shell workspace, sort of like Jupyter Notebooks but without the Web browser bits and much more of a CLI.

But I’ve been thinking about how to make launching a new space as cheap and mindless as possible. Enter scripting the iTerm2 terminal emulator using its Python API. From an example of scripting iTerm2 straight from another command line:

This script demonstrates two concepts:

  1. Launching iTerm2 using PyObjC and running the script only after it is launched.

  2. Creating a window that runs a command.

So if the command is “kick off xonsh with some args,” either in an existing or newly created virtualenv, it becomes almost trivial to fire off new interactive spaces.
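A minimal sketch with the iterm2 package, skipping the PyObjC launch dance from the example; the xonsh path is a placeholder:

import iterm2

async def main(connection):
    # Open a fresh window whose session runs xonsh instead of the default shell
    await iterm2.Window.async_create(
        connection, command="/usr/local/bin/xonsh"
    )

iterm2.run_until_complete(main)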


20 Years of Simon Willison

Just had to acknowledge Simon Willison’s 20th anniversary of blogging. I have a very loose tie in that I used to have an appointment at the Medill School of Journalism. At the time, the concept of Content Management Systems wasn’t ingrained in media circles. Working with a colleague, we introduced a rudimentary platform that was used as part of every MS student’s stint in Medill’s downtown Chicago newsroom.

So of course when I first heard of Django and how it was born in a newsroom, I had to have Adrian Holovaty come up to Evanston and give a talk. For a while, Adrian and I lived relatively close by in Chicago and would bump into each other at ChiPy meetups. Thus the extremely loose tie to Simon.

I really enjoy and admire Simon’s current stream of work, especially on Datasette. Despite the volume, his blog is one of a handful that I look forward to with anticipation for new content. That guy can crank out some code, but also has good taste in problems, and will go down a few layers into the technology. Here’s to many more posts to come!


Messin’ With Music.app Data

I wanted to start liberating my OS X Music.app data, noticing that you can “Export… > iTunes Library” to spit out XML to the file system. Next stop was parsing the XML. Hang on, there’s gotta already be a Python module(s) for that, right?

itunesLibrary was the first port of call. It inhaled my 18 MB XML library file, but the object interface didn’t click with me. It was sort of Pythonic-dictionary-like, but not quite.

Turns out the exported XML is just an Apple property list (plist) file and there’s a plist parsing library in the Python standard library. libpytunes is a thin wrapper around plistlib. I need to give it a longer test drive, but it seems a bit more complete than itunesLibrary.
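The standard library makes the raw parse almost trivial; the top-level “Tracks” dict and its key names come from the iTunes XML format, and the file name is a placeholder:

import plistlib

with open("Library.xml", "rb") as f:
    library = plistlib.load(f)
# "Tracks" maps track IDs to per-track metadata dicts
for track in library["Tracks"].values():
    print(track.get("Artist"), "-", track.get("Name"))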

Beyond intermittently doing a manual slog through menus, I was hoping there might be a way to automate this via Apple’s scripting machinery. Doug’s Applescripts is the go-to in this space, especially for things related to the OS X Music app. Apparently, Doug is not sanguine about leveraging the XML format or Apple’s replacement:

So anyway, the XML has finally gone away, effectively, since it is no longer automatically exported.

I’ve been trying to incorporate the ITLibrary framework into my projects whenever I can, especially for apps that need to display lots of tracks or playlists (like Media Folder Files Not Added).

But ITLibrary was apparently last updated for macOS 10.13. And now that iTunes has been split out into the media apps, its usefulness over the XML file has not been improved.

(And please. Don’t let me hear anyone suggest some groovy way of exporting the XML automatically. Forget about the XML. Unless it’s for backups or something.)

And mind, this was in October 2019. I suspect much hasn’t changed since.

But for me, the XML is enough in the short term. If I ever get working daily snapshots I’ll be happy. The real fun will begin when transmogrifying the data into something ingestable into an SQL engine and thence marrying with Discogs Data.

Apropos Elle Driver, “You know I’ve always liked that word … ’sanguine’ … so rarely have an opportunity to use it in a sentence.”


LF AI & Data

Link parkin’: LF AI & Data Foundation.

The LF AI & Data Foundation supports open source projects within artificial intelligence and the data space.

This overview deck provides a lot more detail.

Didn’t realize that the Linux Foundation had an Artificial Intelligence and Data thrust, with a bushel of projects under its umbrella. A few of them are having a pretty big industry impact.

Discovered this via poking around at the Milvus repository:

Milvus is an open-source vector database built to power embedding similarity search and AI applications. Milvus makes unstructured data search more accessible, and provides a consistent user experience regardless of the deployment environment.

Milvus 2.0 is a cloud-native vector database with storage and computation separated by design. All components in this refactored version of Milvus are stateless to enhance elasticity and flexibility. For more architecture details, see Milvus Architecture Overview.


HYTRADBOI?

This looks like an interesting set of presentations: “Have You Tried Rubbing A Database On It?”

HYTRADBOI was a day of lightning talks about turning a data-centric lens onto familiar problems to yield strange new solutions (and maybe exciting new problems). Talks ranged from wild ideas and unlikely experiments to cutting-edge research and production war stories.

Being loosely affiliated with the Berkeley Computer Systems Mafia (TM), the general theme of “let’s talk the intersection of data management and everything else” is near and dear to my heart. Gotta give a shout out to CIDR as an analog, but strictly from the academic R&D perspective. And definitely without the radical format experiment!

Really useful background information and conference postmortem:

Background

Fundamentally, every computer system is about storing, moving and transforming data. The line between operating system, database and programming language is somewhat arbitrary - a product of specific problems, available hardware and historical accident.

But today the problems and the hardware have changed dramatically, and as a result we’re starting to see people experimenting with redrawing the lines.

Postmortem

A lot of people expressed a surprising amount of support for the conference. I think there is a lot of pent up demand for a database conference that isn’t just SaaS ads. That support meant that people were very willing to help promote the conference and were forgiving of the many technical issues. Many people bought tickets knowing that they wouldn’t be able to attend, because they wanted something like this to exist.

Via Simon Willison
