DBOS Stateful Workflows

Posted on: Thu 26 December 2024

DBOS (the company) has come across my radar via two vectors. Ben Lorica had a newsletter update and podcast interview with the founders. Meanwhile the open source DBOS framework was mentioned on a recent episode of Python Bytes.

This could be a useful platform for my purposes, although I’m a bit hesitant since it’s new and not “boring”. However, it builds on top of PostgreSQL and is the result of a well grounded research effort out of MIT and Stanford.

The interview was a bit “salesy” but not overbearing. Also, Lorica tended to steer the discussion in the direction of AI and agentic use cases.

From the documentation introduction

DBOS is a serverless platform for building reliable backend applications. Add lightweight annotations to your app to durably execute it, making it resilient to any failure. Then, deploy your app to the cloud with a single command.
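To get a feel for what "durably execute" could mean, here is a toy sketch of the general idea: checkpoint each step's result so a re-run after a crash skips work that already finished. This is not the DBOS API. `CheckpointStore`, `durable_step`, and `fetch_order` are hypothetical names made up for illustration, and an in-memory SQLite database stands in for PostgreSQL.

```python
import functools
import json
import sqlite3


class CheckpointStore:
    """Persists step results so a restarted workflow can skip finished steps."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "workflow_id TEXT, step_name TEXT, result TEXT, "
            "PRIMARY KEY (workflow_id, step_name))"
        )

    def load(self, workflow_id, step_name):
        row = self.conn.execute(
            "SELECT result FROM steps WHERE workflow_id = ? AND step_name = ?",
            (workflow_id, step_name),
        ).fetchone()
        # Return a (found, value) pair so a stored None is not mistaken for a miss.
        return (True, json.loads(row[0])) if row else (False, None)

    def save(self, workflow_id, step_name, result):
        self.conn.execute(
            "INSERT OR REPLACE INTO steps VALUES (?, ?, ?)",
            (workflow_id, step_name, json.dumps(result)),
        )
        self.conn.commit()


def durable_step(store):
    """Hypothetical decorator: run the step once per workflow id, then replay."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(workflow_id, *args, **kwargs):
            found, saved = store.load(workflow_id, fn.__name__)
            if found:
                return saved  # replay the recorded result instead of re-running
            result = fn(*args, **kwargs)
            store.save(workflow_id, fn.__name__, result)
            return result

        return wrapper

    return decorator


store = CheckpointStore()
calls = []


@durable_step(store)
def fetch_order(order_id):
    calls.append(order_id)  # side effect we only want to happen once
    return {"order_id": order_id, "total": 42}


first = fetch_order("wf-1", 7)
second = fetch_order("wf-1", 7)  # simulated re-run after a crash
```

The second call returns the recorded result without touching `calls` again. That is the property that makes retries safe in this toy version.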

As a data engineer, the framework is probably worth a test drive to compare against Prefect and/or build your own systems.

uv scripts

Posted on: Wed 25 December 2024

Trey Hunner, a seasoned Python trainer and developer, has been experimenting with uv for a number of tasks including short script development while also replacing pipx and pyenv, two tools I routinely use. He hasn’t completely embraced a substitution but uv does provide an intriguing upside:

The biggest win that I’ve experienced from uv so far is the ability to run an executable script and have any necessary dependencies install automagically.

This doesn’t mean that I never make Python package out of my Python scripts anymore… but I do so much more rarely. I used to create a Python package out of a script as soon as it required third-party dependencies. Now my “do I really need to turn this into a proper package” bar is set much higher.

This relies on a Python packaging standard extension mechanism for embedding metadata in a script.
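The mechanism is inline script metadata (PEP 723): a specially delimited comment block near the top of the file. Below is a small sketch of what such a block looks like and how a tool could pull it out, using the regular expression from the PEP's reference implementation. The sample script and its `requests` dependency are made up for illustration, and `extract_script_metadata` is a hypothetical helper name.

```python
import re

# Regular expression from the PEP 723 reference implementation.
PEP723_BLOCK = (
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s"
    r"(?P<content>(^#(| .*)$\s)+)^# ///$"
)

# A made-up script with an inline metadata block.
SAMPLE_SCRIPT = '''\
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.11"
# dependencies = [
#     "requests",
# ]
# ///
import requests
print(requests.get("https://example.com").status_code)
'''


def extract_script_metadata(source):
    """Return the TOML text embedded in the script's metadata block, or None."""
    match = re.search(PEP723_BLOCK, source)
    if match is None or match.group("type") != "script":
        return None
    lines = match.group("content").splitlines(keepends=True)
    # Strip the leading "# " (or a bare "#") from each comment line.
    return "".join(
        line[2:] if line.startswith("# ") else line[1:] for line in lines
    )


metadata_toml = extract_script_metadata(SAMPLE_SCRIPT)
```

The extracted text is plain TOML, so on Python 3.11+ it could be handed to `tomllib.loads` to get the dependency list. A runner like uv reads the same block to decide what to install before executing the script.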

Pushing this to the limit, would it be possible to declare xonsh as a dependency and then write the rest of the script in that shell language?

Via PythonBytes episode 415

LaunchBar Actions Editor

Posted on: Tue 24 December 2024

Did a little more thinking about LaunchBar Actions and drifted into considering how they could combine with Simon Willison’s llm framework. Simon mostly demonstrates llm as a powerful CLI tool for interacting with modern AI models. The CLI is porcelain around extremely sophisticated Python module plumbing. LaunchBar could make a nice alternative porcelain.

So how does one go about creating LaunchBar Actions? The code examples I’ve seen are pretty straightforward. I could likely reverse engineer the directory structure and code interface even if documentation didn’t exist. However, the fine folks at Objective Development added a LaunchBar Action Editor in version 6.4 of the tool. Well what do you know!? 😲

And of course there is documentation for authoring actions.

MarkItDown

Posted on: Tue 24 December 2024

Link parkin’: MarkItDown

MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc). It supports:

PDF

PowerPoint

Word

Excel

Images (EXIF metadata and OCR)

Audio (EXIF metadata and speech transcription)

HTML

Text-based formats (CSV, JSON, XML)

ZIP files (iterates over contents)

This should definitely come in handy.

Via Daring Fireball

Fastmail Hardware

Posted on: Mon 23 December 2024

Rob Mueller, CTO, blogged a piece that caught my eye about how Fastmail, a long-standing hosted email service, builds on its own hardware as opposed to public cloud infrastructure. Part of the reason is that Fastmail predates the availability of the major public cloud providers. Even so, they’ve never seen a reason to move despite the persistent cloud evangelism that has swept the tech industry.

A big part is the storage pricing of the blob services, S3 and its competitive spinoffs. This interests me because I have a pile of Discogs data that I’ve been hoarding. Pushing it all into the cloud feels appealing but I don’t want to break the bank.

Mueller breaks down the surprising costs of storage at Fastmail scale. Here’s the money graf on the choice:

It’s interesting seeing the spread of prices here. Some also have a bunch of weird edge cases as well. e.g. “The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes require an additional 32 KB of data per object”. Given the large retrieval time and extra overhead per-object, you’d probably want to store small incremental backups in regular S3, then when you’ve gathered enough, build a biggish object to push down to Glacier. This adds implementation complexity.
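A bit of back-of-the-envelope arithmetic shows why that batching advice matters. The sketch below models only the 32 KB per-object overhead from the quote. The backup count and sizes are invented for illustration, and no prices are assumed.

```python
GLACIER_OVERHEAD_KB = 32  # per-object overhead cited in the quote above


def billed_kb(object_kb, object_count):
    """Total billed storage in KB when every object carries the fixed overhead."""
    return object_count * (object_kb + GLACIER_OVERHEAD_KB)


# Hypothetical workload: 10,000 incremental backups of 16 KB each.
small_count, small_kb = 10_000, 16
raw_kb = small_count * small_kb

unbatched = billed_kb(small_kb, small_count)  # each tiny object pays 32 KB extra
batched = billed_kb(raw_kb, 1)                # one big object pays it once
```

With these made-up numbers, the unbatched layout bills 480,000 KB for 160,000 KB of data (three times the raw size), while the batched layout bills 160,032 KB.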

Ultimately though, if you know your use case intimately, have a decent handle on the cost envelope, and can employ enough operational expertise, doing it yourself is an advantage.

Bonus, TIL about Wasabi.

Launchbar Actions for Kagi

Posted on: Mon 23 December 2024

A few months back, I ponied up for a high end subscription to the Kagi search engine. Meanwhile, the LaunchBar app is embedded in my fingertips. I was poking around for ways to plug Kagi into LaunchBar (using Kagi search ’natch). Found two decent looking plugins:

Bash, by Andrew Sardone
JavaScript, by Florian Eckerstorfer

These look simple enough that I should spend some time to learn how to write LaunchBar actions myself. If there isn’t an analogous Kagi action in Python there should be.

Embeddings with llm-gguf

Posted on: Sun 22 December 2024

I find the ability to create multi-dimensional embedding vectors from deep learning models quite fascinating. There’s an obvious application pattern in Retrieval Augmented Generation (RAG) with current LLMs. However, useful embedding models come in a much wider range of scales and capabilities than general language models. In principle, it’s quite possible to train custom embedding models at a reasonable cost in terms of compute hardware, data scale, and time.

Last month, Simon Willison updated his llm-gguf plugin to support GGUF models built specifically for creating embeddings.

The LLM docs have extensive coverage of things you can then do with this model, like embedding every row in a CSV file / file in a directory / record in a SQLite database table and running similarity and semantic search against them.
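The similarity search part boils down to comparing vectors. Here is a minimal sketch using cosine similarity over hand-made three-dimensional vectors. Real embeddings would come from a model and have hundreds of dimensions, and the post names and numbers below are invented purely for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, corpus, top_k=2):
    """Return the keys of the top_k corpus vectors closest to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in corpus.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:top_k]]


# Toy "embeddings"; a real pipeline would compute these with an embedding model.
CORPUS = {
    "post-about-sqlite": [0.9, 0.1, 0.0],
    "post-about-podcasts": [0.1, 0.9, 0.2],
    "post-about-postgres": [0.8, 0.2, 0.1],
}
QUERY = [1.0, 0.0, 0.0]

nearest = most_similar(QUERY, CORPUS)
```

A brute-force scan like this is fine for a personal pile of content. Vector databases earn their keep at much larger scales.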

This could come in handy since I have a few piles of content lying around where using embeddings to supplement search and retrieval would be an interesting experiment.

I Stand Corrected

Posted on: Sat 21 December 2024

Previously, I thought The Atlantic’s “Space Telescope Advent Calendar” only collected images from 2024. My daily source investigation of today’s selection led to this discovery on Flickr, uploaded in 2017. Turns out anything is fair game.

We regret the error.

DataChain

Posted on: Fri 20 December 2024

Link parkin’: DataChain

DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured data like images, audio, videos, text and PDFs. It integrates with external storage (e.g. S3, GCP, Azure, HuggingFace) to process data efficiently without data duplication and manages metadata in an internal database for easy and efficient querying.

Playwright, Python, Requests-HTML

Posted on: Fri 20 December 2024

I was trying to lift and shift a web scraping script from my laptop to a Linux VM. The script uses Requests-HTML and works fine OMM. On the VM, not so much. I just couldn’t get the right Ubuntu dependencies installed so that a headless Chrome browser integrated correctly.

Enter Requests-HTML-Playwright which uses playwright-python to wrap the playwright library. Should do the trick right? Wrong. The HTMLSession from the requests_html_playwright module always failed.

HOWEVER!, playwright-python apparently installs the correct Ubuntu packages to make Requests-HTML operate properly. Go figure.

Victory?

Jeff Barr Wraps

Posted on: Thu 19 December 2024

Way back in the day, I linked out to Jeff Barr, AWS’s first blogger. He’s been at it forever on behalf of the public cloud provider, but I was following even before that. In the early days of RSS, I was into what he was doing with syndic8.com.

Now Barr is moving on after 20 years. A commendable run. I always admired his writing as extremely competent, technical, accessible, human, and gracious.

It has been a privilege to be able to “live in the future” and to get to learn and write about so many of our innovations over the last two decades: message queuing, storage, on-demand computing, serverless, and quantum computing to name just a few and to leave many others out. It has also been a privilege to be able to meet and to hear from so many of you that have faithfully read and (hopefully) learned from my content over the years. I treasure those interactions and your kind words, and I keep both in mind when I write.

While Barr was blogging about S3 in its early days, I had a few things to say about the service.

Just a few thoughts on Amazon’s new web service S3, which is cheap, reliable, plentiful Internet based storage.

S3 is not a game changer.

That didn’t age well.

Bravo AWS for avoiding bitrot on this 18 year old link.

Festivitas

Posted on: Wed 18 December 2024

Apropos of the season, I plunked down a few Euros for Festivitas. Does what it says on the tin:

Festivitas brings the holiday spirit to your Mac with festive lights for the dock and menu bar 🎄

Now the only question is how long after New Year’s are you allowed to keep your lights up?

Via Daring Fireball

Automating the Advent Calendar

Posted on: Tue 17 December 2024

Quoting myself, regarding The Atlantic’s “2024 Space Telescope Advent Calendar”.

I’m taking matters into my own hands, hunting the sources down, and stashing them away. I’ll figure out how to get them displayed in a gallery just because.

And I do mean my own hands, because I see no good way to automate this process.

Maybe this is a job for an LLM? Extract the credits, telescope, and summarize The Atlantic text. Run a search against Flickr (where I’ve found all of the images). Generate code for a photo gallery. Done.

Worth a shot if I have Copious Spare Holiday Time (TM).

A New CPU, Follow Up

Posted on: Mon 16 December 2024

Previously I made some assertions regarding the Changelog team reconfiguring their podcast business. At the tail end of last week, they released a “Changelog & Friends: Kaizen!” pod episode that talked a lot about that topic. Confirmed were a couple of my suspicions.

I was dead on target about the move to YouTube. The Kaizen! episode had a lot of nuance. The audience had been asking for it and they felt they were leaving a bit on the table, especially relative to other creators in the developer content space. Gerhard also teased out the various different ways folks receive such content, from quick hitter TikTok types to immersive deep divers. Also the ability to see the participants’ facial reactions and other physical gestures came up as a bonus. Not really my thing, but I could see all the props and gimmicks from the cable sports debate shows and YouTube channels coming to bear.

Hit the mark as well on the production weight. Adam and Jerod were definitely missing the slack needed to birth new concepts and do more hacking. Sounds like this will free them up a bit.

The jury’s still out about my pod network claim. Adam did mention being able to give brands, especially with more nascent companies, additional options with the ability to connect an increased number of podcasts. My search/indexing bit is still hanging out there. I should do some research to see if this product doesn’t exist already, but loosely coupled, vertical network oriented, podcast search feels like it should be a thing. Market opportunity?

Personally, YouTube videos as the primary upsell still isn’t for me. I use pods as background material while multitasking on other things. That’s just about impossible to do with video content. Jerod mentioned in a comment that they’ll have to evolve Changelog++. Let’s see what comes of it.

flyte

Posted on: Mon 16 December 2024

Link parkin’: flyte

The infinitely scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

From the flyte docs:

Flyte is an open-source, Kubernetes-native workflow orchestrator implemented in Go. It enables highly concurrent, scalable and reproducible workflows for data processing, machine learning and analytics.

Flyte provides first-class support for Python and has a community-driven Java and Scala SDK. Data Scientists and ML Engineers in the industry use Flyte to create:

Data pipelines for processing petabyte-scale data.

Analytics workflows for business and finance use cases.

Machine learning pipelines for logistics, image processing, and cancer diagnostics.

Missed posting yesterday, so this is a bonus makeup. I can at least get a post a day on average.

Lance and LanceDB

Posted on: Sat 14 December 2024

Link parkin’: Lance is the file format and LanceDB is the query engine on top:

Lance is a columnar data format that is easy and fast to version, query and train on. It’s designed to be used with images, videos, 3D point clouds, audio and of course tabular data. It supports any POSIX file systems, and cloud storage like AWS S3 and Google Cloud Storage.

…

LanceDB is an open-source vector database for AI that’s designed to store, manage, query and retrieve embeddings on large-scale multi-modal data. The core of LanceDB is written in Rust 🦀 and is built on top of Lance, an open-source columnar data format designed for performant ML workloads and fast random access.

Both the database and the underlying data format are designed from the ground up to be easy-to-use, scalable and cost-effective.

Been popping up in my feeds recently.

Airshow as a Foundation?

Posted on: Fri 13 December 2024

I mentioned Airshow as part of a foundation for retrocast. It’s a podcast player from Feedbin.com. The app can use Feedbin as a sync mechanism. In principle, the podcast feeds of Airshow could be a special part of the Feedbin API, unifying RSS feeds and podcasts within one service. Would be convenient for development purposes.

Unfortunately, it’s not looking particularly positive for a distinct Airshow segment of the Feedbin API. There are probably a few ways to kludge it though, possibly by doing a podcast search on the feed source URL and eventually tagging podcasts with a specific label. Might be a nudge to investigate a hackable open source feed aggregator.

Alas, the Airshow app is driving me up a wall with a hyper sensitive touch interface on iOS.

retrocast

Posted on: Thu 12 December 2024

Thought exercise in action here.

I’ve been peripherally aware of uv for a bit. Now I’m ready to dive in deeper. But the buzz on uv has been building for a bit. What’s the aggregate knowledge in RSS feeds and podcast subscriptions on uv?

If only I had a personal tool that could retrospectively search my personal content to answer the request “Build a precis on uv for me”.

The ground is fertile for a tool I’m calling retrocast that would fit that bill. It would be a personal app hooking into all your feeds and constantly indexing the content. Layer in some text search, semantic search, knowledge graphs, and LLMs to craft products, not just 10 blue links.

Some components for inclusion and inspiration

The Feedbin API using Feedbin’s backend as a feed database
Airshow and Overcast as podcast players that make their data accessible
Local embedded dbs and search engines such as SQLite, Kuzu, and meilisearch
Local LLMs via ollama and llamafile
NotebookLM and Kagi Assistant for UX ideas
Simon Willison’s llm to glue it all together
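As a first taste of the text search layer, here is a minimal sketch that indexes a few feed entries with SQLite's FTS5 extension and queries them. It assumes the local SQLite build includes FTS5. The table layout, function names, and sample entries are all made up for illustration.

```python
import sqlite3


def build_index(entries):
    """Create an in-memory FTS5 index over (title, body, source) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE entries USING fts5(title, body, source)")
    conn.executemany(
        "INSERT INTO entries (title, body, source) VALUES (?, ?, ?)", entries
    )
    return conn


def search(conn, query, limit=10):
    """Full-text search, best matches first (FTS5's built-in rank column)."""
    return conn.execute(
        "SELECT title, source FROM entries WHERE entries MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()


# Invented sample content standing in for RSS items and podcast show notes.
SAMPLE_ENTRIES = [
    ("Switching from pyenv to uv", "Notes on replacing pyenv with uv.", "blog"),
    ("Episode 415", "Discussion of uv scripts and inline metadata.", "podcast"),
    ("Limbo", "A rewrite of SQLite in Rust.", "blog"),
]

index = build_index(SAMPLE_ENTRIES)
hits = search(index, "uv")
```

Keyword hits like these would only be the first stage. Semantic search and an LLM summarizer would sit on top to turn the matches into an actual precis.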

Limbo Rewrites SQLite

Posted on: Wed 11 December 2024

Limbo is attempting to deliver a (100%?) SQLite-compatible db engine, implemented in Rust. As a big time SQLite fan, I have to say innerestin:

2 years ago, we forked SQLite. We were huge fans of the embedded nature of SQLite, but longed for a more open model of development. libSQL was born as an Open Contribution project, and we invited the community to build it with us.

As a result, libSQL is an astounding success. With over 12k Github stars, 85 contributors, and features like native replication and vector search, libSQL is the engine that powers the Turso platform.

Today we are announcing a more ambitious experiment: what could we achieve, if we were to completely rewrite SQLite in a memory-safe language (Rust)? With the Limbo project, now available at github.com/tursodatabase/limbo, we are now trying to answer that question.

Via Simon Willison

A New CPU

Posted on: Tue 10 December 2024

I subscribe to a lot of podcast feeds, but don’t listen to that many. And I’m definitely pretty selective about which ones I support financially.

One of those is Changelog++, which I’ve supported from very early on in its existence. Adam and Jerod had a broad potpourri of feeds covering a wide range of topics and the Changelog News newsletter is entertaining.

Had.

So it is with a raised eyebrow that I note the recent announcement that Adam and Jerod are shutting down their production of secondary pods such as Practical AI, Ship It!, Go Time, etc. Apparently all of them will live on in some new form. Oddly, the core Changelog podcasts have become less attractive (personality driven, big time commitment), while the standout, must listen for me is Practical AI (fortyish minutes typically, technical area of recent interest).

The announcement feels pretty transparent about the reasons why. Even so I’m going to speculate on some of the drivers.

A significant factor is “pivoting to YouTube”, which I’ve seen happen on a few other podcasts I follow. Instead of pods being strictly audio, and closer to radio, the core is a YouTube channel with the audio episodes in RSS secondary. Yet another case of it’s where the audience, attention, and money seems to be, to my chagrin.

Cf. How YouTube Ate Podcasting

Another aspect is that media production is a grind. Editing is time consuming and not energizing from what I’ve heard from folks in the trenches. I’m sure the Changelog team was running out of gas doing production for others, on top of their shows, on top of their family commitments. Adding to the dilemma are the effects of outsourced production services and the incoming impact of AI on production.

Lastly, the ability to leverage network effects allows for some separation without abandonment, a.k.a. the Changelog Podcast Universe:

CPU.fm (Changelog Podcast Universe) is our new home for spin-offs and collaborations. It’s a burgeoning network of developer pods with a big vision, but we’re just getting started.

…

More friends: Through CPU.fm, we’ll define an index of developer pods worth your attention. Expect cross-posting and collaboration with our extended pod friends network.

Emphasis mine on index. I wouldn’t be surprised if that also meant some whizbang podcast search, taking advantage of the Changelog archives, with a heavy slather of AI all over.

Obviously as a paying customer I’ll be interested to see how things work out and if Changelog++ patrons get any kind of enticement for continuing support. Personally, I’m not really interested in dishing out strictly for core Changelog content, especially if all the upside is on YouTube, but we’ll see. In any event, I wish the guys the best.

The uv hotness

Posted on: Mon 09 December 2024

uv is a multi-tool for managing various aspects of Python packaging and deployment. From the docs:

An extremely fast Python package and project manager, written in Rust.

Recently, based upon my gestalt reading of the Python media ecosystem (blogs, newsletters, podcasts, and developer docs), uv is on an extreme upward trend of attention if not adoption. From everything I’ve heard it feels like the right tooling for this particular moment, so time to get on board. Stashing a few links for reading up:

Will Kahn-Greene on “Switching from pyenv to uv”. Pertinent since I’m currently a heavy user of pyenv.

Basic nuts and bolts around uv virtualenvs in uv: An In-Depth Guide to Python’s Fast and Ambitious New Package Manager

A comprehensive guide on why and how to start using uv—the package manager (and much more) that’s taken the Python world by storm.

Hynek Schlawack on Docker and uv. Bonus videos: is it the future?, it is the future!

Will probably update as more good links get discovered.

wallabag, read-it-later

Posted on: Sun 08 December 2024

Link parkin’: wallabag

From the docs

wallabag is a read-it-later application: it saves a web page by keeping content only. Elements like navigation or ads are deleted.

I was into omnivore for some of my read-it-later and newsletter destination needs. But the team behind it got acquired and decided to shut down, a bit abruptly in fact. I was able to export all my content pretty straightforwardly, but it was annoying. wallabag is open source but there’s a well recommended hosted service, wallabag.it.

PIGSTY

Posted on: Sat 07 December 2024

Link parkin’: PIGSTY a batteries included distribution of PostgreSQL, extensions, and adjacent systems.

“PostgreSQL In Great STYle”: Postgres, Infras, Graphics, Service, Toolbox, it’s all Yours.

—— Battery-Included, Local-First PostgreSQL Distribution as an Open-Source RDS Alternative

I’m usually a little wary of things that have a lot of moving parts, but this might be a reasonable approach to going above and beyond basic Ubuntu package management to deploy PostgreSQL for my home lab.

2024 Space Telescope Advent

Posted on: Fri 06 December 2024

The Atlantic has a nice December feature where they select a space telescope image (Hubble or Webb) each December day from the past year. They’re all collected in a 2024 Space Telescope Advent Calendar gallery. The images are always stunning but I have one peeve.

Since they’re all US government media, they’re all essentially in the public domain (at most liberal Creative Commons). Would it kill The Atlantic to provide a link back to the original source? Shame on you!

I’m taking matters into my own hands, hunting the sources down, and stashing them away. I’ll figure out how to get them displayed in a gallery just because.

And I do mean my own hands, because I see no good way to automate this process.

AWS S3 versus Cloudflare R2

Posted on: Thu 05 December 2024

A comparison of object storage systems on AWS and Cloudflare, by Sylvain Kerkour. Let’s go straight to the punchline.

Honestly, I see only a few reasons to use S3 today: if ~40ms of latency really matters, if you already have a mature AWS-only architecture and inter-region traffic fees are not making a dent in your profits, or if it’s really hard to bring another vendor into your infrastructure for organizational reasons. That’s all.

Maybe 90%+ of projects and organizations will be better served by Cloudflare R2.

Trust but verify on that one. However, it’s nice that there are multiple object storage options. In addition to R2, Backblaze runs B2, Akamai/Linode has its own object storage, ditto Digital Ocean Spaces, Vultr object storage, etc.

I suspect that the offerings of the smaller players are best thought of as internal S3 substitutes for building within those environments, rather than globally facing S3 endpoints. YMMV.

NuExtract 1.5

Posted on: Wed 04 December 2024

Link parkin’: NuExtract 1.5

We introduce NuExtract 1.5, the new version of our foundation model for structured extraction. NuExtract 1.5 is multilingual, can handle arbitrarily long documents, and outperforms GPT-4o in English while being 500 times smaller. As usual, we release it under MIT license.

…

Before diving into the details of what’s new, let’s discuss what this is all about. NuExtract is a family of small open-source models that do only one thing: they extract information from documents and return a structured output (JSON). It turns out that, because they only do this one thing, they are very good at it.

Models built by NuMind, NuMind on Hugging Face

Via Simon Willison

Aryn Unstructured Analytics

Posted on: Tue 03 December 2024

Aryn popped across my radar due to an interview with Matt Welsh (Go Bears) in the Data Exchange Podcast.

In this episode, we explore how AI is revolutionizing programming and software development. We discuss AI-assisted tools like GitHub Copilot and ChatGPT, the rise of natural language programming, and AI’s expanding role in code review and operating systems. We also address challenges in trust and verification of AI-generated code, advanced data extraction techniques, and innovative frameworks for unstructured analytics.

I’ve become somewhat interested in using domain specific coding copilots and assistants to tackle various data challenges. Welsh’s perspective on the topic is quite useful.

sycamore seems to be at the center of what they’re building. Here’s a recent pre-print on their work.

The Design of an LLM-powered Unstructured Analytics System (blog post)

LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.

docling

Posted on: Mon 02 December 2024

Link parkin’ docling:

Docling parses documents and exports them to the desired format with ease and speed.

GitHub repo and technical report from IBM:

This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.

Prompted via a mention about spacy-layout

NotebookLM and Steven Berlin Johnson

Posted on: Sun 01 December 2024

Simon Willison recently linked to a post about NotebookLM by Steven Johnson. I’ve started to get interested in NotebookLM but was even more curious about Steven Johnson since I had lost track of a favorite author Steven Berlin Johnson.

Turns out they were one and the same. Even better, Johnson has been deeply involved in the creation and development of NotebookLM.

Milvus Lite

Posted on: Sun 02 June 2024

Link parkin’: Milvus Lite

Introducing Milvus Lite: the Lightweight Version of Milvus

Milvus is an open-source vector database purpose-built to index, store, and query embedding vectors generated by deep neural networks and other machine learning (ML) models at billions of scales. It has become a popular choice for many companies, researchers, and developers who must perform similarity searches on large-scale datasets.

However, some users may find the full version of Milvus too heavy or complex. To address this problem, Bin Ji, one of the most active contributors in the Milvus community, built Milvus Lite, a lightweight version of Milvus.

…

Introducing Milvus Lite: Start Building a GenAI Application in Seconds

Milvus Lite supports all the basic operations available in Milvus, such as creating collections and inserting, searching, and deleting vectors. It will soon support advanced features like hybrid search. Milvus Lite loads data into memory for efficient searches and persists it as an SQLite file.

GitHub repo

Running Local LLMs

Posted on: Sat 01 June 2024

Link parkin’: 50+ Open-Source Options for Running LLMs Locally

Vince Lam put together a comprehensive resource for running LLMs on your own hardware:

There are many open-source tools for hosting open weights LLMs locally for inference, from the command line (CLI) tools to full GUI desktop applications. Here, I’ll outline some popular options and provide my own recommendations. I have split this post into the following sections:

All-in-one desktop solutions for accessibility

LLM inference via the CLI and backend API servers

Front-end UIs for connecting to LLM backends

GitHub Repo, helpful Google Sheet

Twenty Plus Years …

Posted on: Mon 19 February 2024

Hola Peeps!

It’s been a moment. The archives record my last post was late September, 2023. Yikes! Suffice to say quite a bit happened after I mentioned going on sabbatical (TL;DR elderly dad got ill and needed a lot of assistance getting back on his feet; an extended, elaborate, distracting recruiting process for an intriguing employment position fell through), but I survived. Can’t say I achieved what I set out to, but life hits you with a curve ball every now and then.

The last month or so I’ve really started to get back into my own personal technical diversions. Any longtime followers will have a good idea what that entails. Of course, I’ll have to write about them, although fair warning, I’ll also be messing about quite a bit with experimental publishing to other sites I own. More to come.

To the title of this particular post. Between Mass Programming Resistance and New Media Hack I’ve pushed out at least a few blogs every year since January 18, 2003. Employing my overeducated abilities, this one will extend the streak to 22 years in a row. 😲 Definitely bursty, but it’s some sort of an accomplishment. Many other bloggers, prominent or not, have fallen by the wayside. Just says I’m something of an Internet old-timer. If they only knew.

Having been through a few social media eras, and not really into TheSocials (tm) of this moment, I plan to keep plugging away at this here site, passing along nuggets and commenting on various and sundry that catches my eye. Definitely good therapy.

To 20 more years. Forza!

beanstalkd

Posted on: Mon 25 September 2023

TIL beanstalkd

Beanstalk is a simple, fast work queue.

Its interface is generic, but was originally designed for reducing the latency of page views in high-volume web applications by running time-consuming tasks asynchronously.
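The pattern itself is easy to sketch: the request path drops a job on a queue and returns right away, while a worker does the slow part later. The example below is an in-process stand-in built on Python's standard `queue` and `threading` modules. It does not speak the beanstalkd protocol, and the job payloads and the uppercase "work" are made up for illustration.

```python
import queue
import threading


def run_worker(jobs, results):
    """Pull jobs until a None sentinel arrives, recording each finished job."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: no more work
            jobs.task_done()
            break
        results.append(job.upper())  # stand-in for a time-consuming task
        jobs.task_done()


jobs = queue.Queue()
results = []
worker = threading.Thread(target=run_worker, args=(jobs, results))
worker.start()

# The "web request" side: enqueue and move on without waiting for the work.
for payload in ["resize image", "send email"]:
    jobs.put(payload)

jobs.put(None)  # tell the worker to stop
worker.join()
```

A real work queue like beanstalkd puts the queue in a separate server process, so producers and workers can live on different machines and jobs do not vanish when the web process restarts.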

Apparently it’s been around forever. And I called myself a messaging nerd. Shame.

My Emerging Knowledge Toolchain

Posted on: Fri 22 September 2023

Due to research demands of my self-imposed sabbatical, I wound up stacking up a whole bunch of browser tabs across multiple computers, windows, browsers, etc. It was a bit appalling.

The breadth of the types was outstanding as well. Long form web articles. GitHub repos of interest. Academic papers from arXiv. Academic papers from other places. PDFs from all across the web. Email newsletters stuck in multiple Gmail accounts. And plenty of plain old web pages for one project of note or another.

Not to mention the bits and bobs trapped in my RSS feed reader stars and my podcast app.

I finally got some time this week to sit down and begin dealing with this knowledge sprawl. There had been some prior investigation on tools worth picking up and growing into. Here’s where I am at the moment:

Omnivore, for stashing long web articles to read later
Zotero, for academic reference and paper management
Zotfile, to shift papers from Zotero into a Dropbox folder
Pinboard, for parking links of interest
Goodnotes, an app recommended by a colleague for annotating PDFs
Apple’s “Books” app for reading technical books

So far so good. All of these apps bridge macOS and iOS which is great since I’m deeply wedded to the Apple ecosystem and I especially want to put my iPad to good use. They also seem to have reasonably decent browser apps. Omnivore and Zotero have solid cross-browser extensions that make it easy to shovel content in directly from a web page being viewed. For Pinboard, I just do it the old-fashioned way, with a JavaScript bookmarklet. I guess I shouldn’t be surprised that saving bookmarks on Pinboard has advanced a bit. Might be time to try yet another browser extension.

Early doors, but mopping up those open tabs is going to plan. Next I have to make all that capture productive. Leaning towards pulling Obsidian into the mix since it has nice integration with Omnivore. Then I’m contemplating some linkblog experimentation via various APIs in combination with Pelican’s static site generation capabilities. Fun times.

Some jq Stuff

Posted on: Sun 17 September 2023

jq is one of my favorite command line tools and there’s a bit of news. There’s a jq 1.7 finally:

After a five year hiatus we’re back with a GitHub organization, with new admins and new maintainers who have brought a great deal of energy to make a long-awaited and long-needed new release. We’re very grateful for all the new owners, admins, and maintainers. Special thanks go to Owen Ou (@owenthereal) for pushing to set up a new GitHub organization for jq, Stephen Dolan (@stedolan) for transferring the jq repository to the new organization, @itchyny for doing a great deal of work to get the release done, Mattias Wadman (@wader) and Emanuele Torre (@emanuele6) for many PRs and code reviews. Many others also contributed PRs, issues, and code reviews as well, and you can find their contributions in the Git log and on the closed issues and PRs page.

Also, is there anything Postgres can’t do? Enter pgJQ as a supplement to Postgres’ built-in jsonb support.

The pgJQ extension embeds the standard jq compiler and brings the much loved jq lang to Postgres.

It adds a jqprog data type to express jq programs and a jq(jsonb, jqprog) function to execute them on jsonb objects. It works seamlessly with standard jsonb functions, operators, and jsonpath.

Very much feels like alpha software, but could still be a useful addition to one’s toolbox.

That One Discogs Release

Posted on: Sat 12 August 2023

Any significantly large, human-created dataset is going to get some weird entries. In the Discogs Data Dumps, there’s a release (careful, following that link might blow up your browser) with a title that’s a whole bunch of Unicode characters and the word “Unicode”. The title is a little under 628K (yes, six hundred twenty eight thousand) octets.

Does it matter at all? Well, if you stuff that record into a PostgreSQL database and then build an index on the title column, you’ll get a sad trombone.

I’m thinking of hacking my personal fork of discogs-xml2db to have an option for limiting field size.
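A minimal sketch of what such a limit could look like, assuming the cap is expressed in UTF-8 bytes since that is what the database ends up storing. The helper name and the 2,000 byte figure are hypothetical and not part of discogs-xml2db.

```python
def truncate_utf8(value: str, max_bytes: int) -> str:
    """Cut value to at most max_bytes of UTF-8 without splitting a character."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # Slicing bytes can leave a partial multi-byte sequence at the end;
    # errors="ignore" drops that dangling fragment instead of raising.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Hypothetical cap, chosen only for illustration.
MAX_TITLE_BYTES = 2000


def clean_title(title: str) -> str:
    """Apply the illustrative cap to a release title before loading it."""
    return truncate_utf8(title, MAX_TITLE_BYTES)
```

Counting bytes rather than characters matters here because a title made mostly of multi-byte Unicode characters can be several times larger on disk than its character count suggests.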

MusicBrainz Database

Posted on: Thu 10 August 2023

Link parkin’: MusicBrainz Database

The MusicBrainz Database is built on the PostgreSQL relational database engine and contains all of MusicBrainz’ music metadata. This data includes information about artists, release groups, releases, recordings, works, and labels, as well as the many relationships between them. The database also contains a full history of all the changes that the MusicBrainz community has made to the data.

Mmmmmmm, data.

What Are Embeddings?

Posted on: Wed 09 August 2023

Vicki Boykis put together a free primer (delivered in PDF) on vector space embeddings.

Peter Norvig urges us to teach ourselves programming in ten years. In this spirit, after several years of working with embeddings, foundational data structures in deep learning models, I realized it’s not trivial to have a good conceptual model of them. Moreover, when I did want to learn more, there was no good, general text I could refer to as a starting point. Everything was either too deep and academic or too shallow and content from vendors in the space selling their solution. So I started a project to understand the fundamental building blocks of machine learning and natural language processing, particularly as they relate to recommendation systems today. The results of this project are the PDF on this site, which is aimed at a generalist audience and not trying to sell you anything except the idea that vectors are cool. I’ve also been working on Viberary to implement these ideas in practice.

The post also points to plenty of follow-on educational material. I’m particularly intrigued by the supplemental content related to deep learning and recommender systems.

Building Your Own Chatbot

Posted on: Sat 05 August 2023

Mozilla ran an interesting, meticulously documented, time-bounded effort to create an in-house, open source, large language model based chatbot.

With this goal in mind, a small team within Mozilla’s innovation group recently undertook a hackathon at our headquarters in San Francisco. Our objective: build a Mozilla internal chatbot prototype, one that’s…

Completely self-contained, running entirely on Mozilla’s cloud infrastructure, without any dependence on third-party APIs or services.

Built with free, open source large language models and tooling.

Imbued with Mozilla’s beliefs, from trustworthy AI to the principles espoused by the Mozilla Manifesto.

Heads up. I’m on sabbatical, attempting to do a bit of education on deep learning, generative models, and AI broadly. Case studies like these are really useful.

Despite the hype train on generative AI, there’s promise, … and peril. Expect to see quite a bit more of this in my feed.

llm and Llama 2

Posted on: Fri 04 August 2023

The prolific Simon Willison has put together llm, a Python library and CLI for messing around with AI:

A CLI utility and Python library for interacting with Large Language Models, including OpenAI, PaLM and local models installed on your own machine.

Eminently convenient is a plugin mechanism that supports usage of local open source models.

I followed the directions on the tin and was able to run the latest Llama 2 model on my M2 MacBook within about 15 minutes. Most of that time was spent waiting for downloads.

However, the model does run as slowly as advertised, taking about 20 seconds to respond to a prompt. Still, it’s nice to not be beholden to our Big Tech overlords for ungoverned experimentation. OpenAI API access is definitely not cheap!