That One Discogs Release

Posted on: Sat 12 August 2023

Any sufficiently large, human-created dataset is going to get some weird entries. In the Discogs Data Dumps, there’s a release (careful, following that link might blow up your browser) with a title that’s a whole bunch of Unicode characters and the word “Unicode”. The title is a little under 628K (yes, six hundred twenty-eight thousand) octets.

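Does it matter at all? Well, if you stuff that record into a PostgreSQL database and then build an index on the title column, you’ll get a sad trombone: a B-tree index entry can’t exceed roughly a third of a page (about 2.7 kB with the default 8 kB pages), so the index build fails.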

I’m thinking of hacking my personal fork of discogs-xml2db to have an option for limiting field size.
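
Here’s a rough sketch of what that field limiting could look like. To be clear, this isn’t anything that exists in discogs-xml2db today; the function name `truncate_utf8` and the 2000-octet limit are made up for illustration. The idea is to cap a string at a number of UTF-8 octets without cutting a multi-byte character in half.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return text cut down to at most max_bytes UTF-8 octets.

    Hypothetical helper, not part of discogs-xml2db.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Slicing the bytes may leave an incomplete multi-byte sequence at the
    # end; errors="ignore" drops that dangling fragment when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # Assumed limit, chosen to stay under the B-tree index entry cap.
    title = "Unicode " + "\u2603" * 1000  # stand-in for an oversized title
    short = truncate_utf8(title, 2000)
    print(len(title.encode("utf-8")), len(short.encode("utf-8")))
```

This only guarantees whole code points, so a combining mark can still get separated from its base character. For a title that’s already a pile of random Unicode, that’s probably fine.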