The next question I have about the Discogs Data: what’s the total
amount to download? A first step is updating my URL-gathering
script to grab the `Content-Length` header from an HTTP probe of
each URL and to start generating CSV-compatible output.
Below is the tweakage. The script has moved from straight Python to a
xonsh script. The most interesting piece is line 6, where I shell out
to httpie to make a HEAD request against the URL. The `@(...)` splices
the value of the `furl` variable into the command line from the outer
Python context, while `!(...)` captures the output of the inner
subcommand for processing in the outer Python context.

```xonsh
urls.sort()
print(f"{len(urls)} in total", file=sys.stderr)
for i, url in enumerate(urls):
    clength = -1
    furl = "http:" + url[1]
    for line in !(http --headers HEAD @(furl)):
        if line.strip().startswith("Content-Length"):
            h, v = line.split()
            clength = int(v)
    print(f"{url[0]}, http:{url[1]}, {clength}")
```

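
For comparison, here is a rough plain-Python sketch of what the `!(...)` line in the script above is doing: run a command, capture its output lines, and pick out `Content-Length`. The helper names `capture_lines` and `content_length` are made up for illustration, and the child process that prints canned headers is only a stand-in for the real `http --headers HEAD` call so the sketch runs without a network. Unlike the script, this version matches the header name case-insensitively, which is my own assumption about what would be safer.

```python
import subprocess
import sys


def capture_lines(argv):
    """Run a command and return its stdout as a list of lines.

    Rough stand-in for xonsh's !(...) capture; the argv items play the
    role of values spliced in with @(...).
    """
    result = subprocess.run(argv, capture_output=True, text=True, check=False)
    return result.stdout.splitlines()


def content_length(header_lines):
    """Return the Content-Length value from header lines, or -1 if absent."""
    for line in header_lines:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "content-length":
            return int(value.strip())
    return -1


if __name__ == "__main__":
    # Hypothetical stand-in for `http --headers HEAD <url>`: a child Python
    # process that prints canned headers.
    fake_headers = (
        "HTTP/1.1 200 OK\n"
        "Content-Type: application/gzip\n"
        "Content-Length: 12345\n"
    )
    argv = [sys.executable, "-c", f"print({fake_headers!r}, end='')"]
    print(content_length(capture_lines(argv)))
```

The xonsh version is obviously shorter, which is rather the point, but pulling the parsing into its own function makes it easy to check without hitting the network.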
I still need to pull in the `csv` module for proper output generation. I
also want to add the HTTP headers as a column of JSON data, which will
definitely need correct CSV escaping. Once that’s all in place, it
should be possible to pipe the output into Simon Willison’s
sqlite-utils to create an SQLite DB, then run a query to compute the
total storage.
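
A minimal sketch of that plan, using only the standard library: `csv.writer` handles the quoting needed for a JSON headers column, and the total comes from a `SUM` query. Note that this swaps in Python's built-in `sqlite3` module in place of sqlite-utils, and that the rows, column names, and helper names (`to_csv`, `total_bytes`) are invented for illustration, not real Discogs data.

```python
import csv
import io
import json
import sqlite3

# Made-up sample rows: (name, url, content_length, headers).
rows = [
    ("dump_a", "http://example.com/a.xml.gz", 1000,
     {"Content-Type": "application/gzip", "ETag": '"abc,123"'}),
    ("dump_b", "http://example.com/b.xml.gz", 2500,
     {"Content-Type": "application/gzip"}),
]


def to_csv(rows):
    """Serialize rows to CSV text, with the headers dict as a JSON column."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["name", "url", "content_length", "headers"])
    for name, url, clength, headers in rows:
        # csv.writer quotes the JSON string, so embedded commas and
        # double quotes survive a round trip.
        writer.writerow([name, url, clength, json.dumps(headers)])
    return buf.getvalue()


def total_bytes(csv_text):
    """Load the CSV text into an in-memory SQLite table and sum the sizes."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE files "
        "(name TEXT, url TEXT, content_length INTEGER, headers TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    db.executemany(
        "INSERT INTO files VALUES (?, ?, ?, ?)",
        [(r["name"], r["url"], int(r["content_length"]), r["headers"])
         for r in reader],
    )
    (total,) = db.execute("SELECT SUM(content_length) FROM files").fetchone()
    db.close()
    return total


if __name__ == "__main__":
    print(total_bytes(to_csv(rows)))
```

The sample `ETag` value deliberately contains both a comma and double quotes, which is exactly the kind of content that breaks hand-rolled `print`-based CSV.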
This is why I love xonsh. I had to do a little documentation
reading to get the right sigils (`!`, `@`) in the right places, but
once done I can still actually comprehend what the script is doing. If
I keep working at it, this should become mental muscle memory and make
writing such scripts completely natural. I might be better off learning
bash in depth, but this fits my brain way cleaner.
Having questions to ask of the data seems to be a good driver for me to actually write code. Boy, do I have plenty of questions for this data.