Any sufficiently large, human-created dataset is going to get some weird entries. In the Discogs Data Dumps, there’s a release (careful, following that link might blow up your browser) with a title that’s a whole bunch of Unicode characters and the word “Unicode”. The title is a little under 628K (yes, six hundred twenty-eight thousand) octets.
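Does it matter at all? Well, if you stuff that record into a PostgreSQL database and then build an index on the title column, you’ll get a sad trombone: assuming the default B-tree index type, the index build fails, because a single index entry has to fit in roughly a third of an 8 kB page and this title is a couple of hundred times too big for that.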
I’m thinking of hacking my personal fork of discogs-xml2db to have an option for limiting field size.
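Here’s a minimal sketch of what such an option might do under the hood. The function name `truncate_utf8` and the 2048-byte limit are my own placeholders, not anything that exists in discogs-xml2db. The one subtlety is that the limit is in octets, not characters, so the cut has to avoid landing in the middle of a multi-byte UTF-8 sequence.

```python
def truncate_utf8(value: str, max_bytes: int) -> str:
    """Cut value to at most max_bytes of UTF-8 without splitting a character.

    Hypothetical helper for illustration; not part of discogs-xml2db.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")

    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        # Already short enough, return it untouched.
        return value

    # Slicing the bytes may leave an incomplete multi-byte sequence at the
    # end; errors="ignore" drops those dangling bytes when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # Stand-in for the monster title: 1000 snowmen at 3 bytes each.
    title = "\u2603" * 1000
    short = truncate_utf8(title, 2048)  # 2048 is an arbitrary example limit
    print(len(title.encode("utf-8")), "->", len(short.encode("utf-8")))
```

Whatever limit you pick would need to stay comfortably below what the index can hold, and truncating does mean two different monster titles could end up identical in the database.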