
TalkPython Deep Dives

Michael Kennedy has added Episode Deep Dives to Talk Python.

Have you ever listened to a podcast episode and later wished there was a quick, detailed way to revisit the best points—without scrubbing through the entire recording again? That’s exactly why we launched Episode Deep Dives at Talk Python. This feature provides a rich, structured walkthrough of each podcast episode, giving you a snapshot of all the essential details and takeaways in one convenient place. Whether you’re curious about a specific tool our guest mentioned or you want to recall the main ideas days after listening, Episode Deep Dives makes it easy to dive right back into the heart of the conversation.

The quick look is really good. Kennedy doesn’t go into depth on how they’re created (“Each one of these deep dives does take some time and work to generate.”), but I suspect there are some LLM machinations in there, supplementing a bit of manual human labor. Even so, dives already exist for over a year’s worth of episodes.

I also listened to the embedded “conversation”. My curiosity was piqued by the participants, whose identities weren’t provided. It was pretty entertaining and provocative for about 10 minutes, and then it got repetitive. None of the speakers ever mentioned names, and at a few points I was confused about whether there were two or three distinct voices. I’m pretty sure it’s an AI-generated audio overview, à la NotebookLM. Whatever it was, it raised a couple of interesting notions around automatic content generation and community building. It needed some editing, though.

Looking forward to taking advantage of these deep dives, a behind-the-scenes Talk Python episode, and the proliferation of the feature to other podcasts.


What. Is. Happening?

It’s the first day of 2025. If there’s anyone paying attention, this blog has come back to life once again. Why?

Previously I hinted at some personal challenges that were seemingly surmounted, freeing up emotional and cognitive energy. That was a false dawn. Lots of other issues cropped up in work and life. Nothing cataclysmic but definitely draining. No need to detail them here. Some are resolved and some are ongoing.

I need an outlet though to provide some release. I’m also aiming to generate some supplemental material for the overall technical portfolio. Why?

The answer, as with everything in technology in 2024, is LLMs. In my day job, I work on MLOps for data-intensive workloads in the medical informatics space. (I’m easy to find on LinkedIn if you need details.) Being in tech long term, I’m also just generally interested.

I’m of the “promise and peril” persuasion in relation to LLMs. Skeptical of much of the perilous hype but hopeful regarding some of the technology promise. Simon Willison has a great recap of lessons learned about LLMs in 2024 and possibly says it best (apologies for the heavy quoting):

A drum I’ve been banging for a while is that LLMs are power-user tools—they’re chainsaws disguised as kitchen knives. They look deceptively simple to use—how hard can it be to type messages to a chatbot?—but in reality you need a huge depth of both understanding and experience to make the most of them and avoid their many pitfalls.

If anything, this problem got worse in 2024.

We’ve built computer systems you can talk to in human language, that will answer your questions and usually get them right! … depending on the question, and how you ask it, and whether it’s accurately reflected in the undocumented and secret training set.

There’s a flipside to this too: a lot of better informed people have sworn off LLMs entirely because they can’t see how anyone could benefit from a tool with so many flaws. The key skill in getting the most out of LLMs is learning to work with tech that is both inherently unreliable and incredibly powerful at the same time. This is a decidedly non-obvious skill to acquire!

There is so much space for helpful education content here, but we need to do a lot better than outsourcing it all to AI grifters with bombastic Twitter threads.

I think telling people that this whole field is environmentally catastrophic plagiarism machines that constantly make things up is doing those people a disservice, no matter how much truth that represents. There is genuine value to be had here, but getting to that value is unintuitive and needs guidance.

Those of us who understand this stuff have a duty to help everyone else figure it out.

So I’m back to, among other things, dig into this LLM stuff with purpose and intent.


le Carré, Smiley, and The Circus

Instead of the common comprehensive look at the past year, I’m going to close with a focused observation on one specific thing I did this year.

Read a lot of spy novels.



HuggingChat

Link parkin’: HuggingChat UI

A chat interface using open source models, e.g. OpenAssistant or Llama. It is a SvelteKit app and it powers the HuggingChat app on hf.co/chat.

From the docs:

Open source chat interface with support for tools, web search, multimodal and many API providers. The app uses MongoDB and SvelteKit behind the scenes. Try the live version of the app called HuggingChat on hf.co/chat or setup your own instance.

via Jose Antonio Lanz at Decrypt

If you haven’t noticed a theme yet, I’ve become interested in frontend user interfaces for LLM applications.


LibreChat

Link parkin’: LibreChat

LibreChat is the ultimate open-source app for all your AI conversations, fully customizable and compatible with any AI provider — all in one sleek interface

All the gloss on the home page is a bit much for me, but the real proof is in grabbing the source code and test driving.


Acorn 8

Announced a few days back by Gus Mueller:

Acorn 8 has been released!

This is a major update of Acorn, and is currently on a time-limited sale for $19.99. It’s still a one time purchase to use as long as you’d like, and as usual, the full release notes are available. I want to highlight some of my favorite things below.

I don’t do much image editing, but when I do, I use Acorn. Do Gus a solid and buy a copy.


OpenwebUI

Link parkin’: Open WebUI, a well done, open source, frontend to LLM APIs. From the Open WebUI docs:

Open WebUI is an extensible, feature-rich, and user-friendly self-hosted AI interface designed to operate entirely offline. It supports various LLM runners, including Ollama and OpenAI-compatible APIs.

Interestingly, the project has an academic-looking paper attached on building a platform for LLM evaluation and auditing.

Via a positive overview from Simon Willison


Podcast Aggregation Self-Hosting

Link parkin’: a few self-hostable podcast frameworks:

PinePods

PinePods is a Rust based app that can sync podcasts for individual accounts that relies on a central database with a web frontend and apps available on multiple platforms

Podgrab

Podgrab is a self-hosted podcast manager which automatically downloads latest podcast episodes. It is a light-weight application built using Go.

PodFetch

PodFetch is a simple, lightweight, fast, and efficient podcast downloader for hosting your own podcasts.


DBOS Stateful Workflows

DBOS (the company) has come across my radar via two vectors. Ben Lorica had a newsletter update and podcast interview with the founders. Meanwhile the open source DBOS framework was mentioned on a recent episode of Python Bytes.

This could be a useful platform for my purposes, although I’m a bit hesitant since it’s new and not “boring”. However, it builds on top of PostgreSQL and is the result of a well grounded research effort out of MIT and Stanford.

The interview was a bit “salesy” but not overbearing. Also, Lorica tended to steer the discussion in the direction of AI and agentic use cases.

From the documentation introduction

DBOS is a serverless platform for building reliable backend applications. Add lightweight annotations to your app to durably execute it, making it resilient to any failure. Then, deploy your app to the cloud with a single command.

As a data engineer, I figure the framework is worth a test drive to compare against Prefect and/or building your own systems.
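DBOS’s actual framework uses Python decorators backed by PostgreSQL; purely to illustrate the durable-execution idea it sells, here’s a toy checkpointing sketch of my own invention (not DBOS’s API): each step’s result is persisted, so a re-run after a crash skips the steps that already completed.

```python
import sqlite3

# Toy illustration of "durable execution" (not DBOS's real API): checkpoint
# each step's result so a restarted workflow resumes instead of redoing work.
class DurableWorkflow:
    def __init__(self, db_path: str = ":memory:"):
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS steps (name TEXT PRIMARY KEY, result TEXT)"
        )

    def step(self, name: str, fn):
        """Run fn once; later invocations return the checkpointed result."""
        row = self.db.execute(
            "SELECT result FROM steps WHERE name = ?", (name,)
        ).fetchone()
        if row is not None:
            return row[0]  # replay from checkpoint, skip re-execution
        result = fn()
        self.db.execute("INSERT INTO steps VALUES (?, ?)", (name, result))
        self.db.commit()
        return result

wf = DurableWorkflow()
calls = []

def fetch():
    calls.append("fetch")
    return "payload"

wf.step("fetch", fetch)  # executes fetch()
wf.step("fetch", fetch)  # replays the checkpoint; fetch() is not called again
print(len(calls))
```

The real system does this transactionally in Postgres, which is exactly why the “boring technology” foundation is reassuring.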


uv scripts

Trey Hunner, a seasoned Python trainer and developer, has been experimenting with uv for a number of tasks, including short script development, while also replacing pipx and pyenv, two tools I routinely use. He hasn’t completely embraced the substitution, but uv does provide an intriguing upside:

The biggest win that I’ve experienced from uv so far is the ability to run an executable script and have any necessary dependencies install automagically.

This doesn’t mean that I never make Python package out of my Python scripts anymore… but I do so much more rarely. I used to create a Python package out of a script as soon as it required third-party dependencies. Now my “do I really need to turn this into a proper package” bar is set much higher.

This relies on a Python packaging standard extension mechanism for embedding metadata in a script.

Pushing this to the limit, would it be possible to declare xonsh as a dependency and then write the rest of the script in that shell language?

Via PythonBytes episode 415


LaunchBar Actions Editor

Did a little more thinking about LaunchBar Actions and drifted into considering how they could combine with Simon Willison’s llm framework. Simon mostly demonstrates llm as a powerful CLI tool for interacting with modern AI models. The CLI is porcelain around extremely sophisticated Python module plumbing. LaunchBar could make a nice alternative porcelain.

So how does one go about creating LaunchBar Actions? The code examples I’ve seen are pretty straightforward. I could likely reverse engineer the directory structure and code interface even if documentation didn’t exist. However, the fine folks at Objective Development added a LaunchBar Action Editor in version 6.4 of the tool. Well what do you know!? 😲

And of course there is documentation for authoring actions.


MarkItDown

Link parkin’: MarkItDown

MarkItDown is a utility for converting various files to Markdown (e.g., for indexing, text analysis, etc). It supports:

  • PDF
  • PowerPoint
  • Word
  • Excel
  • Images (EXIF metadata and OCR)
  • Audio (EXIF metadata and speech transcription)
  • HTML
  • Text-based formats (CSV, JSON, XML)
  • ZIP files (iterates over contents)

This should definitely come in handy.

Via Daring Fireball


Fastmail Hardware

Rob Mueller, Fastmail’s CTO, blogged a piece that caught my eye about how Fastmail, a long-standing hosted email service, builds on its own hardware as opposed to public cloud infrastructure. Part of the reason is that Fastmail predates the availability of the major public cloud providers. Even so, they’ve never seen a reason to move, despite the persistent cloud evangelism that has swept the tech industry.

A big part is the storage pricing of the blob services, S3 and its competitive spinoffs. This interests me because I have a pile of Discogs data that I’ve been hoarding. Pushing it all into the cloud feels appealing but I don’t want to break the bank.

Mueller breaks down the surprising costs of storage at Fastmail scale. Here’s the money graf on the choice:

It’s interesting seeing the spread of prices here. Some also have a bunch of weird edge cases as well. e.g. “The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes require an additional 32 KB of data per object”. Given the large retrieval time and extra overhead per-object, you’d probably want to store small incremental backups in regular S3, then when you’ve gathered enough, build a biggish object to push down to Glacier. This adds implementation complexity.
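Some back-of-the-envelope arithmetic makes the 32 KB-per-object point concrete. The fraction of billed bytes that is pure overhead depends entirely on object size, which is why Mueller suggests batching small backups before pushing to Glacier (object sizes here are illustrative, not Fastmail’s actual numbers):

```python
# Per-object overhead billed by the S3 Glacier archival storage classes.
OVERHEAD_PER_OBJECT = 32 * 1024  # bytes

def overhead_fraction(object_size_bytes: int) -> float:
    """Share of billed bytes that is pure per-object overhead."""
    return OVERHEAD_PER_OBJECT / (object_size_bytes + OVERHEAD_PER_OBJECT)

# A 64 KB incremental backup: a full third of what you pay for is overhead.
small = overhead_fraction(64 * 1024)

# Batch increments into a 1 GB object first: overhead becomes negligible.
big = overhead_fraction(1024 ** 3)

print(f"{small:.0%} vs {big:.4%}")
```

Hence the implementation complexity: you end up staging small objects in regular S3 and compacting them into biggish archives before they go cold.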

Ultimately though, if you know your use case intimately, have a decent handle on the cost envelope, and can employ enough operational expertise, doing it yourself is an advantage.

Bonus, TIL about Wasabi.


Launchbar Actions for Kagi

A few months back, I ponied up for a high end subscription to the Kagi search engine. Meanwhile, the LaunchBar app is embedded in my fingertips. I was poking around for ways to plug Kagi into LaunchBar (using Kagi search ’natch). Found two decent looking plugins:

These look simple enough that I should spend some time to learn how to write LaunchBar actions myself. If there isn’t an analogous Kagi action in Python there should be.


Embeddings with llm-gguf

I find the ability to create multi-dimensional embedding vectors from deep learning models quite fascinating. There’s an obvious application pattern in Retrieval Augmented Generation (RAG) with current LLMs. However, useful embedding models come in a much wider range of scales and capabilities than general language models. In principle, it’s quite possible to train custom embedding models at a reasonable cost in terms of compute hardware, data scale, and time.

Last month, Simon Willison updated his llm-gguf plugin to support creating embeddings from GGUF models specifically for embeddings.

The LLM docs have extensive coverage of things you can then do with this model, like embedding every row in a CSV file / file in a directory / record in a SQLite database table and running similarity and semantic search against them.

This could come in handy since I have a few piles of content laying around where using embeddings to supplement search and retrieval would be an interesting experiment.
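At bottom, the similarity search the LLM docs describe boils down to comparing vectors, usually by cosine similarity. A toy sketch — real vectors would come from an embedding model and have hundreds of dimensions; these 3-d ones and the document titles are made up for illustration:

```python
import math

# Toy semantic search: rank documents by cosine similarity to a query vector.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

docs = {
    "jazz vinyl reissue": [0.9, 0.1, 0.0],
    "postgres tuning":    [0.0, 0.2, 0.9],
    "bebop discography":  [0.7, 0.5, 0.1],
}
query = [0.85, 0.2, 0.05]  # pretend embedding of "jazz records"

# Most similar documents first.
ranked = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
print(ranked)
```

Tools like llm embed-multi plus a SQLite table of vectors give you exactly this loop at scale, without writing the plumbing yourself.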


I Stand Corrected

Previously, I thought The Atlantic’s “Space Telescope Advent Calendar” only collected images from 2024. My daily source investigation of today’s selection led to this discovery on Flickr, uploaded in 2017. Turns out anything is fair game.

We regret the error.


DataChain

Link parkin’: DataChain

DataChain is a Python-based AI-data warehouse for transforming and analyzing unstructured data like images, audio, videos, text and PDFs. It integrates with external storage (e.g. S3, GCP, Azure, HuggingFace) to process data efficiently without data duplication and manages metadata in an internal database for easy and efficient querying.


Playwright, Python, Requests-HTML

I was trying to lift and shift a web scraping script from my laptop to a Linux VM. The script uses Requests-HTML and works fine on my machine. On the VM, not so much. I just couldn’t get the right Ubuntu dependencies installed so that a headless Chrome browser would integrate correctly.

Enter Requests-HTML-Playwright, which uses playwright-python to wrap the Playwright library. Should do the trick, right? Wrong. The HTMLSession from the requests_html_playwright module always failed.

HOWEVER! Installing playwright-python apparently pulls in the correct Ubuntu packages to make plain Requests-HTML operate properly. Go figure.

Victory?


Jeff Barr Wraps

Way back in the day, I linked out to Jeff Barr, AWS’s first blogger. He’s been at it forever on behalf of the public cloud provider, but I was following even before that. In the early days of RSS, I was into what he was doing with syndic8.com.

Now Barr is moving on after 20 years. A commendable run. I always admired his writing as extremely competent, technical, accessible, human, and gracious.

It has been a privilege to be able to “live in the future” and to get to learn and write about so many of our innovations over the last two decades: message queuing, storage, on-demand computing, serverless, and quantum computing to name just a few and to leave many others out. It has also been a privilege to be able to meet and to hear from so many of you that have faithfully read and (hopefully) learned from my content over the years. I treasure those interactions and your kind words, and I keep both in mind when I write.

While Barr was blogging about S3 in its early days, I had a few things to say about the service.

Just a few thoughts on Amazon’s new web service S3, which is cheap, reliable, plentiful Internet based storage.

S3 is not a game changer.

That didn’t age well.

Bravo AWS for avoiding bitrot on this 18 year old link.


Festivitas

Apropos of the season, I plunked down a few Euros for Festivitas. Does what it says on the tin:

Festivitas brings the holiday spirit to your Mac with festive lights for the dock and menu bar 🎄

Now the only question is how long after New Year’s are you allowed to keep your lights up?

Via Daring Fireball


Automating the Advent Calendar

Quoting myself, regarding The Atlantic’s “2024 Space Telescope Advent Calendar”.

I’m taking matters into my own hands, hunting the sources down, and stashing them away. I’ll figure out how to get them displayed in a gallery just because.

And I do mean my own hands, because I see no good way to automate this process.

Maybe this is a job for an LLM? Extract the credits, telescope, and summarize The Atlantic text. Run a search against Flickr (where I’ve found all of the images). Generate code for a photo gallery. Done.

Worth a shot if I have Copious Spare Holiday Time (TM).
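The last step of that pipeline, at least, is mechanical. A stdlib sketch of the gallery generation, with hypothetical placeholder records standing in for whatever the extraction steps would produce (titles, telescope credits, and source URLs here are invented, not actual calendar entries):

```python
from html import escape

# Sketch of the "generate code for a photo gallery" step. The image records
# are hypothetical placeholders, not actual Advent Calendar data.
def render_gallery(records: list[dict]) -> str:
    figures = "\n".join(
        f'<figure><img src="{escape(r["url"])}" alt="{escape(r["title"])}">'
        f'<figcaption>{escape(r["title"])} ({escape(r["telescope"])})</figcaption>'
        "</figure>"
        for r in records
    )
    return f'<main class="gallery">\n{figures}\n</main>'

images = [
    {"title": "Pillars of Creation", "telescope": "Webb",
     "url": "https://example.org/pillars.jpg"},
    {"title": "Sombrero Galaxy", "telescope": "Hubble",
     "url": "https://example.org/sombrero.jpg"},
]
gallery_html = render_gallery(images)
print(gallery_html.count("<figure>"))
```

The hard parts remain the extraction and the Flickr hunt, which is where an LLM would earn its keep.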


A New CPU, Follow Up

Previously I made some assertions regarding the Changelog team reconfiguring their podcast business. At the tail end of last week, they released a “Changelog & Friends: Kaizen!” pod episode that talked a lot about that topic. Confirmed were a couple of my suspicions.

I was dead on target about the move to YouTube. The Kaizen! episode had a lot of nuance. The audience had been asking for it, and the team felt they were leaving a bit on the table, especially relative to other creators in the developer content space. Gerhard also teased out the various ways folks receive such content, from quick-hitter TikTok types to immersive deep divers. The ability to see the participants’ facial reactions and other physical gestures came up as a bonus too. Not really my thing, but I could see all the props and gimmicks from cable sports debate shows and YouTube channels coming to bear.

Hit the mark as well on the production weight. Adam and Jerod were definitely missing the slack needed to birth new concepts and do more hacking. Sounds like this will free them up a bit.

The jury’s still out on my pod network claim. Adam did mention being able to give brands, especially more nascent companies, additional options by connecting an increased number of podcasts. My search/indexing bit is still hanging out there. I should do some research to see if this product doesn’t already exist, but loosely coupled, vertical-network-oriented podcast search feels like it should be a thing. Market opportunity?

Personally, YouTube videos as the primary upsell still isn’t for me. I use pods as background material while multitasking on other things. That’s just about impossible to do with video content. Jerod mentioned in a comment that they’ll have to evolve Changelog++. Let’s see what comes of it.


flyte

Link parkin’: flyte

The infinitely scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

From the flyte docs:

Flyte is an open-source, Kubernetes-native workflow orchestrator implemented in Go. It enables highly concurrent, scalable and reproducible workflows for data processing, machine learning and analytics.

Flyte provides first-class support for Python and has a community-driven Java and Scala SDK. Data Scientists and ML Engineers in the industry use Flyte to create:

  • Data pipelines for processing petabyte-scale data.
  • Analytics workflows for business and finance use cases.
  • Machine learning pipelines for logistics, image processing, and cancer diagnostics.

Missed posting yesterday, so this is a bonus makeup. I can at least get a post a day on average.


Lance and LanceDB

Link parkin’: Lance is the file format and LanceDB is the query engine on top:

Lance is a columnar data format that is easy and fast to version, query and train on. It’s designed to be used with images, videos, 3D point clouds, audio and of course tabular data. It supports any POSIX file systems, and cloud storage like AWS S3 and Google Cloud Storage.

LanceDB is an open-source vector database for AI that’s designed to store, manage, query and retrieve embeddings on large-scale multi-modal data. The core of LanceDB is written in Rust 🦀 and is built on top of Lance, an open-source columnar data format designed for performant ML workloads and fast random access.

Both the database and the underlying data format are designed from the ground up to be easy-to-use, scalable and cost-effective.

Been popping up in my feeds recently.


Airshow as a Foundation?

I mentioned Airshow as part of a foundation for retrocast. It’s a podcast player from Feedbin.com. The app can use Feedbin as a sync mechanism. In principle, Airshow’s podcast feeds could be a special part of the Feedbin API, unifying RSS feeds and podcasts within one service. That would be convenient for development purposes.

Unfortunately, it’s not looking particularly positive for a distinct Airshow segment of the Feedbin API. There are probably a few ways to kludge it though, possibly by doing a podcast search on the feed source URL and eventually tagging podcasts with a specific label. Might be a nudge to investigate a hackable open source feed aggregator.

Alas, the Airshow app is driving me up a wall with a hyper sensitive touch interface on iOS.


retrocast

Thought exercise in action here.

I’ve been peripherally aware of uv for a while, and the buzz on it has been building. Now I’m ready to dive in deeper. What’s the aggregate knowledge on uv sitting in my RSS feeds and podcast subscriptions?

If only I had a personal tool that could retrospectively search my personal content to answer the request “Build a precis on uv for me”.

The ground is fertile for a tool I’m calling retrocast that would fit that bill. It would be a personal app hooking into all your feeds and constantly indexing the content. Layer in some text search, semantic search, knowledge graphs, and LLMs to craft products, not just 10 blue links.
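The text-search layer, at least, is nearly free. A minimal sketch of retrocast’s index using SQLite’s FTS5 extension (compiled into most stock CPython builds; the feed items here are invented stand-ins):

```python
import sqlite3

# Minimal sketch of retrocast's text-search layer: index feed items in an
# SQLite FTS5 full-text table, then query it. Items below are made up.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE items USING fts5(title, body, source)")
db.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("The uv hotness", "uv is a fast Python package manager", "blog"),
        ("Episode 415", "discussing uv scripts and inline metadata", "podcast"),
        ("Acorn 8 released", "image editor update for macOS", "blog"),
    ],
)

# Full-text query, best matches first by FTS5's built-in bm25 ranking.
hits = db.execute(
    "SELECT title FROM items WHERE items MATCH ? ORDER BY rank", ("uv",)
).fetchall()
print([t for (t,) in hits])
```

Semantic search, knowledge graphs, and LLM summarization would layer on top of a base like this; the precis on uv is then a retrieval query plus a generation pass.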

Some components for inclusion and inspiration


Limbo Rewrites SQLite

Limbo is attempting to deliver a (100%?) SQLite-compatible db engine, implemented in Rust. As a big time SQLite fan, I have to say: innerestin’.

2 years ago, we forked SQLite. We were huge fans of the embedded nature of SQLite, but longed for a more open model of development. libSQL was born as an Open Contribution project, and we invited the community to build it with us.

As a result, libSQL is an astounding success. With over 12k Github stars, 85 contributors, and features like native replication and vector search, libSQL is the engine that powers the Turso platform.

Today we are announcing a more ambitious experiment: what could we achieve, if we were to completely rewrite SQLite in a memory-safe language (Rust)? With the Limbo project, now available at github.com/tursodatabase/limbo, we are now trying to answer that question.

Via Simon Willison


A New CPU

I subscribe to a lot of podcast feeds, but don’t listen to that many. And I’m definitely pretty selective about which ones I support financially.

One of those is Changelog++, which I’ve supported from very early on in its existence. Adam and Jerod had a broad potpourri of feeds covering a wide range of topics, and the Changelog News newsletter is entertaining.

Had.

So it is with a raised eyebrow that I note the recent announcement that Adam and Jerod are shutting down their production of secondary pods such as Practical AI, Ship It!, Go Time, etc. Apparently all of them will live on in some new form. Oddly, the core Changelog podcasts have become less attractive (personality driven, big time commitment), while the standout, must listen for me is Practical AI (fortyish minutes typically, technical area of recent interest).

The announcement feels pretty transparent about the reasons why. Even so I’m going to speculate on some of the drivers.

A significant factor is “pivoting to YouTube”, which I’ve seen happen with a few other podcasts I follow. Instead of pods being strictly audio, and closer to radio, the core is a YouTube channel, with the audio episodes in RSS secondary. Yet another case of that being where the audience, attention, and money seem to be, to my chagrin.

Cf. How YouTube Ate Podcasting

Another aspect is that media production is a grind. Editing is time consuming and not energizing from what I’ve heard from folks in the trenches. I’m sure the Changelog team was running out of gas doing production for others, on top of their shows, on top of their family commitments. Adding to the dilemma are the effects of outsourced production services and the incoming impact of AI on production.

Lastly, the ability to leverage network effects allows for some separation without abandonment, a.k.a. the Changelog Podcast Universe:

CPU.fm (Changelog Podcast Universe) is our new home for spin-offs and collaborations. It’s a burgeoning network of developer pods with a big vision, but we’re just getting started.

More friends: Through CPU.fm, we’ll define an index of developer pods worth your attention. Expect cross-posting and collaboration with our extended pod friends network.

Emphasis mine on index. I wouldn’t be surprised if that also meant some whizbang podcast search, taking advantage of the Changelog archives, with a heavy slather of AI all over.

Obviously as a paying customer I’ll be interested to see how things work out and if Changelog++ patrons get any kind of enticement for continuing support. Personally, I’m not really interested in dishing out strictly for core Changelog content, especially if all the upside is on YouTube, but we’ll see. In any event, I wish the guys the best.


The uv hotness

uv is a multi-tool for managing various aspects of Python packaging and deployment. From the docs:

An extremely fast Python package and project manager, written in Rust.

Recently, based upon my gestalt reading of the Python media ecosystem (blogs, newsletters, podcasts, and developer docs), uv is on an extreme upward trend of attention if not adoption. From everything I’ve heard it feels like the right tooling for this particular moment, so time to get on board. Stashing a few links for reading up:

Will Kahn-Greene on “Switching from pyenv to uv”. Pertinent since I’m currently a heavy user of pyenv.

Basic nuts and bolts around uv virtualenvs in uv: An In-Depth Guide to Python’s Fast and Ambitious New Package Manager

A comprehensive guide on why and how to start using uv—the package manager (and much more) that’s taken the Python world by storm.

Hynek Schlawack on Docker and uv. Bonus videos: is it the future? and it is the future!

Will probably update as more good links get discovered.
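For my own orientation, the pyenv/pipx territory uv covers maps onto subcommands like these (a quick sketch from my reading so far; consult the uv docs for current flags):

```shell
# pyenv territory: install and pin interpreter versions
uv python install 3.12
uv python pin 3.12

# pipx territory: install CLI tools into isolated environments
uv tool install ruff

# project/virtualenv workflow
uv venv
uv pip install requests

# run a script, resolving its inline dependencies on the fly
uv run script.py
```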


wallabag, read-it-later

Link parkin’: wallabag

From the docs

wallabag is a read-it-later application: it saves a web page by keeping content only. Elements like navigation or ads are deleted.

I was into Omnivore for some of my read-it-later and newsletter destination needs. But the team behind it got acquired and decided to shut down, a bit abruptly in fact. I was able to export all my content pretty straightforwardly, but it was annoying. wallabag is open source, but there’s a well recommended hosted service, wallabag.it.


PIGSTY

Link parkin’: PIGSTY a batteries included distribution of PostgreSQL, extensions, and adjacent systems.

“PostgreSQL In Great STYle”: Postgres, Infras, Graphics, Service, Toolbox, it’s all Yours.

—— Battery-Included, Local-First PostgreSQL Distribution as an Open-Source RDS Alternative

I’m usually a little wary of things that have a lot of moving parts, but this might be a reasonable approach to going above and beyond basic Ubuntu package management to deploy PostgreSQL for my home lab.


2024 Space Telescope Advent

The Atlantic has a nice December feature where they select a space telescope image (Hubble or Webb) each December day from the past year. They’re all collected in a 2024 Space Telescope Advent Calendar gallery. The images are always stunning but I have one peeve.

Since they’re all US government media, they’re all essentially in the public domain (at most liberal Creative Commons). Would it kill The Atlantic to provide a link back to the original source? Shame on you!

I’m taking matters into my own hands, hunting the sources down, and stashing them away. I’ll figure out how to get them displayed in a gallery just because.

And I do mean my own hands, because I see no good way to automate this process.


AWS S3 versus Cloudflare R2

A comparison of object storage systems on AWS and Cloudflare, by Sylvain Kerkour. Let’s go straight to the punchline.

Honestly, I see only a few reasons to use S3 today: if ~40ms of latency really matters, if you already have a mature AWS-only architecture and inter-region traffic fees are not making a dent in your profits, or if it’s really hard to bring another vendor into your infrastructure for organizational reasons. That’s all.

Maybe 90%+ of projects and organizations will be better served by Cloudflare R2.

Trust but verify on that one. However, it’s nice that there are multiple object storage options. In addition to R2, Backblaze runs B2, Akamai/Linode has its own object storage, ditto Digital Ocean Spaces, Vultr object storage, etc.

I suspect that the offerings of the smaller players are best thought of as internal S3 substitutes for building within those environments, rather than globally facing S3 endpoints. YMMV.


NuExtract 1.5

Link parkin’: NuExtract 1.5

We introduce NuExtract 1.5, the new version of our foundation model for structured extraction. NuExtract 1.5 is multilingual, can handle arbitrarily long documents, and outperforms GPT-4o in English while being 500 times smaller. As usual, we release it under MIT license.

Before diving into the details of what’s new, let’s discuss what this is all about. NuExtract is a family of small open-source models that do only one thing: they extract information from documents and return a structured output (JSON). It turns out that, because they only do this one thing, they are very good at it.

Models built by NuMind, NuMind on Hugging Face

Via Simon Willison


Aryn Unstructured Analytics

Aryn popped across my radar due to an interview with Matt Welsh (Go Bears) in the Data Exchange Podcast.

In this episode, we explore how AI is revolutionizing programming and software development. We discuss AI-assisted tools like GitHub Copilot and ChatGPT, the rise of natural language programming, and AI’s expanding role in code review and operating systems. We also address challenges in trust and verification of AI-generated code, advanced data extraction techniques, and innovative frameworks for unstructured analytics.

I’ve become somewhat interested in using domain specific coding copilots and assistants to tackle various data challenges. Welsh’s perspective on the topic is quite useful.

sycamore seems to be at the center of what they’re building. Here’s a recent pre-print on their work.

The Design of an LLM-powered Unstructured Analytics System (blog post)

LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users can specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents using LLMs. At the core of Aryn is Sycamore, a declarative document processing engine, built using Ray, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn also comprises Luna, a query planner that translates natural language queries to Sycamore scripts, and the Aryn Partitioner, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. Using Aryn, we demonstrate a real world use case for analyzing accident reports from the National Transportation Safety Board (NTSB), and discuss some of the major challenges we encountered in deploying Aryn in the wild.


docling

Link parkin’: docling

Docling parses documents and exports them to the desired format with ease and speed.

GitHub repo and technical report from IBM:

This technical report introduces Docling, an easy to use, self-contained, MIT-licensed open-source package for PDF document conversion. It is powered by state-of-the-art specialized AI models for layout analysis (DocLayNet) and table structure recognition (TableFormer), and runs efficiently on commodity hardware in a small resource budget. The code interface allows for easy extensibility and addition of new features and models.
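Getting started looks pleasantly minimal. A sketch of the setup (assuming a recent Python environment; check the repo for current CLI flags and output options):

```shell
# Install the package, then point the CLI at a PDF;
# by default it produces a converted document (e.g. Markdown).
pip install docling
docling path/to/mydoc.pdf
```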

Prompted by a mention of spacy-layout.


NotebookLM and Steven Berlin Johnson

Simon Willison recently linked to a post about NotebookLM by Steven Johnson. I’ve started to get interested in NotebookLM, but was even more curious about the author, since I had lost track of a favorite of mine, Steven Berlin Johnson.

Turns out they were one and the same. Even better, Johnson has been deeply involved in the creation and development of NotebookLM.


Milvus Lite

Link parkin’: Milvus Lite

Introducing Milvus Lite: the Lightweight Version of Milvus

Milvus is an open-source vector database purpose-built to index, store, and query embedding vectors generated by deep neural networks and other machine learning (ML) models at billions of scales. It has become a popular choice for many companies, researchers, and developers who must perform similarity searches on large-scale datasets.

However, some users may find the full version of Milvus too heavy or complex. To address this problem, Bin Ji, one of the most active contributors in the Milvus community, built Milvus Lite, a lightweight version of Milvus.

Introducing Milvus Lite: Start Building a GenAI Application in Seconds

Milvus Lite supports all the basic operations available in Milvus, such as creating collections and inserting, searching, and deleting vectors. It will soon support advanced features like hybrid search. Milvus Lite loads data into memory for efficient searches and persists it as an SQLite file.
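That combination — in-memory search with persistence to a single SQLite file — is a nice embedded pattern. As a self-contained toy illustrating it (this is emphatically not the pymilvus API, just a stdlib sketch with brute-force cosine similarity standing in for a real vector index):

```python
import json
import math
import sqlite3

# Toy embedded vector store: vectors persist in a SQLite table,
# searches run in memory. Illustrates the pattern Milvus Lite's
# description evokes; NOT the pymilvus API or a real ANN index.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class ToyVectorStore:
    def __init__(self, path=":memory:"):
        # Pass a filename instead of ":memory:" for file persistence.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS vecs (id TEXT PRIMARY KEY, vec TEXT)"
        )

    def insert(self, doc_id, vec):
        self.db.execute(
            "INSERT OR REPLACE INTO vecs VALUES (?, ?)",
            (doc_id, json.dumps(vec)),
        )
        self.db.commit()

    def search(self, query, top_k=3):
        rows = self.db.execute("SELECT id, vec FROM vecs").fetchall()
        scored = [(doc_id, cosine(query, json.loads(v))) for doc_id, v in rows]
        return sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]

store = ToyVectorStore()
store.insert("cat", [1.0, 0.0, 0.0])
store.insert("dog", [0.9, 0.1, 0.0])
store.insert("car", [0.0, 0.0, 1.0])
hits = store.search([1.0, 0.0, 0.0], top_k=2)
```

The real thing obviously does far more (billion-scale indexes, hybrid search), but the embedded, single-file ergonomics are the draw here.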

GitHub repo


Running Local LLMs

Link parkin’: 50+ Open-Source Options for Running LLMs Locally

Vince Lam put together a comprehensive resource for running LLMs on your own hardware:

There are many open-source tools for hosting open weights LLMs locally for inference, from the command line (CLI) tools to full GUI desktop applications. Here, I’ll outline some popular options and provide my own recommendations. I have split this post into the following sections:

  1. All-in-one desktop solutions for accessibility
  2. LLM inference via the CLI and backend API servers
  3. Front-end UIs for connecting to LLM backends

GitHub Repo, helpful Google Sheet
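One handy property of many tools in that list (llama.cpp’s server, Ollama, LM Studio, etc.) is that they expose an OpenAI-compatible `/v1/chat/completions` endpoint, so a single stdlib-only client covers a lot of them. A minimal sketch — the base URL and model name below are placeholders for whatever your local server uses:

```python
import json
import urllib.request

# Build a chat-completion request for a locally hosted,
# OpenAI-compatible server. The URL and model name are
# placeholders; adjust to your own setup.

def build_chat_request(prompt,
                       base_url="http://localhost:8080/v1",
                       model="local-model"):
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Summarize this post in one sentence.")
# Against a running server you'd then do:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```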


Twenty Plus Years …

Hola Peeps!

It’s been a moment. The archives record my last post as late September 2023. Yikes! Suffice it to say, quite a bit happened after I mentioned going on sabbatical (TL;DR: my elderly dad got ill and needed a lot of assistance getting back on his feet; an extended, elaborate, distracting recruiting process for an intriguing employment position fell through), but I survived. Can’t say I achieved what I set out to, but life hits you with a curveball every now and then.

The last month or so I’ve really started to get back into my own personal technical diversions. Any longtime followers will have a good idea what that entails. Of course, I’ll have to write about them, although fair warning, I’ll also be messing about quite a bit with experimental publishing to other sites I own. More to come.

To the title of this particular post. Between Mass Programming Resistance and New Media Hack, I’ve pushed out at least a few posts every year since January 18, 2003. Employing my overeducated abilities, this one will extend the streak to 22 years in a row. 😲 Definitely bursty, but it’s some sort of an accomplishment. Many other bloggers, prominent or not, have fallen by the wayside. Just says I’m something of an Internet old-timer. If they only knew.

Having been through a few social media eras, and not really into TheSocials (tm) of this moment, I plan to keep plugging away at this here site, passing along nuggets and commenting on various and sundry that catches my eye. Definitely good therapy.

To 20 more years. Forza!

© 2008-2024 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.