
That One Discogs Release

Any sufficiently large, human-created dataset is going to get some weird entries. In the Discogs Data Dumps, there’s a release (careful, following that link might blow up your browser) whose title is a whole bunch of Unicode characters plus the word “Unicode”. The title weighs in at a little under 628K (yes, six hundred twenty-eight thousand) octets.

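Does it matter at all? Well, if you stuff that record into a PostgreSQL database and then build an index on the title column, you’ll get a sad trombone: the index build errors out, because a B-tree index entry has to fit in roughly a third of a (default 8 kB) page, and this title is far too big for that.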

I’m thinking of hacking my personal fork of discogs-xml2db to have an option for limiting field size.
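For what it’s worth, here’s a rough sketch of what such an option might boil down to. This is not code from discogs-xml2db; the function name `truncate_utf8` and the `max_bytes` parameter are made up for illustration, and the 2000-octet limit in the usage line is just an assumed value that sits comfortably under the B-tree entry limit. The idea is to cap a field at a number of UTF-8 octets without slicing a character in half.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper for illustration; not part of discogs-xml2db.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # Cut at the octet limit. If the cut lands in the middle of a multi-byte
    # character, errors="ignore" drops the incomplete trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Assumed limit of 2000 octets; pick whatever suits your index.
short_title = truncate_utf8("\u2603" * 1000, 2000)
```

Measuring in octets rather than characters is deliberate: the index limit is about bytes on the page, and a title stuffed with multi-byte characters can be far longer in octets than its character count suggests.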

© C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.