I wanted to know how many distinct data files (checksums and
compressed XML data) were referenced from
discogs.data.com. This is prelude to
trying to do a, extremely polite, crawl of all the files for some
longitudinal analysis. So I threw together a little script and learned
a few things.
The answer turned out be 782.
The script is below. Definitely lots of ways it could be better. The
key lesson was finding the Requests-HTML library to help deal
with the embedded JavaScript that the Discogs team uses to render the
list of files. N.b. sleep=1
from line 13 is necessary to get the
HTML rendering properly.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25 | import sys
from requests_html import HTMLSession
if __name__ == "__main__":
session = HTMLSession()
urls = []
for year in range(8, 23):
url = f"https://discogs-data-dumps.s3.us-west-2.amazonaws.com/index.html?prefix=data/20{year:02}"
resp = session.get(url)
resp.html.render(sleep=1)
urls.extend(
[
(2000 + year, l)
for l in resp.html.links
if l.endswith(".xml.gz") or l.endswith("_CHECKSUM.txt")
]
)
urls.sort()
print(f"{len(urls)} in total", file=sys.stderr)
for i, url in enumerate(urls):
print(f"{url[0]}, http:{url[1]}")
|
And here’s a CLI run:
| (discogsdata) crossjam@gabrielhounds:~/repos/discogsdata$ ./discogs_data_urls.xsh | wc -l
/Users/crossjam/.local/pipx/venvs/xonsh/lib/python3.10/site-packages/requests_html.py:727: DeprecationWarning: There is no current event loop
self.loop = asyncio.get_event_loop()
/Users/crossjam/.local/pipx/venvs/xonsh/lib/python3.10/site-packages/websockets/legacy/client.py:488: DeprecationWarning: remove loop argument
warnings.warn("remove loop argument", DeprecationWarning)
/Users/crossjam/.local/pipx/venvs/xonsh/lib/python3.10/site-packages/websockets/legacy/protocol.py:206: DeprecationWarning: remove loop argument
warnings.warn("remove loop argument", DeprecationWarning)
782 in total
782
|
Small victories.