After months of sometimes painful experimentation, here’s a technique others may find useful. It’s for anyone who needs to ingest millions of JSON objects, collected from the wild and stored as line-delimited text files, into a PostgreSQL table.
- Transform your JSON lines into base64-encoded lines (see the sketch after this list)
- Start a transaction
- Create a temp table that deletes on commit
- COPY from your base64-encoded file into the temp table
- Run an INSERT into your destination table that draws from a SELECT that base64 decodes the JSON column
- COMMIT, dropping the temp table
- Profit!!
- For bonus points, use the temp table to delete duplicate records or other cleanup
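The encoding step is trivial in Python. Here’s a minimal sketch; the file names are just placeholders:

```python
import base64

# Minimal sketch of the encoding step. "tweets.json" and "tweets.b64"
# are placeholder file names.
with open("tweets.json", "rb") as src, open("tweets.b64", "wb") as dst:
    for line in src:
        # Base64 output contains no tabs, newlines, or backslashes,
        # so COPY has nothing it needs to escape.
        dst.write(base64.b64encode(line.rstrip(b"\n")) + b"\n")
```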
The benefit of this approach is that it saves you from dealing with escaping and encoding horrors. Basically, JSON can contain any character, so it’s almost impossible to safely outsmart both stupid users and PostgreSQL strictness. Because base64 output is limited to a small, safe alphabet, COPY never sees a character it has to escape. Once you’re over a million records, you’ll get bitten by one or the other.
Reliably applied to over 300 million Tweets from the Twitter streaming API.
Update. If you’re doing this with Python, make sure to use psycopg2, since it lets you control transaction isolation and supports COPY.
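For reference, here’s a minimal psycopg2 sketch of the transactional part. The connection string, the staging and destination table names, and the jsonb column are assumptions for illustration, not the exact setup described above:

```python
import psycopg2

# Assumed schema for the example: a destination table tweets(doc jsonb)
# and the base64-encoded file produced by the encoding step.
conn = psycopg2.connect("dbname=tweets")
try:
    with conn:  # one transaction; commits on success, rolls back on error
        with conn.cursor() as cur:
            # The temp table vanishes when the transaction commits.
            cur.execute("CREATE TEMP TABLE staging (raw text) ON COMMIT DROP")
            with open("tweets.b64") as f:
                cur.copy_expert("COPY staging (raw) FROM STDIN", f)
            # Decode the base64 column back into JSON on the way into the
            # destination table; dedupe or other cleanup could go here too.
            cur.execute("""
                INSERT INTO tweets (doc)
                SELECT convert_from(decode(raw, 'base64'), 'UTF8')::jsonb
                FROM staging
            """)
finally:
    conn.close()
```

The `ON COMMIT DROP` clause handles the temp-table cleanup, and the `with conn:` block wraps the COPY and INSERT in a single transaction.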