
782 Discogs Data Files

I wanted to know how many distinct data files (checksums and compressed XML data) were referenced from data.discogs.com. This is a prelude to attempting an extremely polite crawl of all the files for some longitudinal analysis. So I threw together a little script and learned a few things.

The answer turned out to be 782.


The script is below. There are definitely lots of ways it could be better. The key lesson was finding the Requests-HTML library, which helps deal with the embedded JavaScript that the Discogs team uses to render the list of files. N.b. the sleep=1 argument to resp.html.render() (line 13 of the script) is necessary to get the HTML to render properly.

import sys

from requests_html import HTMLSession

if __name__ == "__main__":
    session = HTMLSession()

    urls = []  # (year, link) pairs

    for year in range(8, 23):  # 2008 through 2022
        url = f"https://discogs-data-dumps.s3.us-west-2.amazonaws.com/index.html?prefix=data/20{year:02}"
        resp = session.get(url)
        resp.html.render(sleep=1)  # pause so the JavaScript-built file list renders
        urls.extend(
            (2000 + year, link)
            for link in resp.html.links
            if link.endswith((".xml.gz", "_CHECKSUM.txt"))
        )

    urls.sort()
    # The count goes to stderr so that stdout carries only the list of URLs.
    print(f"{len(urls)} in total", file=sys.stderr)
    for year, link in urls:
        print(f"{year}, http:{link}")
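The suffix filter in the script can be tried on its own, without rendering any pages. Here's a minimal sketch, assuming a hand-written sample of (year, link) pairs. The host and file names are made up for illustration, and keep and tally_by_year are hypothetical helpers, not part of the script above.

```python
from collections import Counter

# Same two suffixes the script looks for.
SUFFIXES = (".xml.gz", "_CHECKSUM.txt")


def keep(link: str) -> bool:
    """True for compressed XML dumps and checksum files."""
    return link.endswith(SUFFIXES)  # str.endswith accepts a tuple of suffixes


def tally_by_year(pairs):
    """Count the kept links per year from (year, link) pairs."""
    return Counter(year for year, link in pairs if keep(link))


# Made-up sample; real links would come from resp.html.links.
sample = [
    (2008, "//example.invalid/data/2008/sample_artists.xml.gz"),
    (2008, "//example.invalid/data/2008/index.html"),
    (2022, "//example.invalid/data/2022/sample_CHECKSUM.txt"),
    (2022, "//example.invalid/data/2022/sample_releases.xml.gz"),
]

counts = tally_by_year(sample)
for year in sorted(counts):
    print(year, counts[year])
```

A per-year tally like this might be a handy sanity check before crawling anything, since a year with a count of zero probably means the page had not finished rendering.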

And here’s a CLI run:

(discogsdata) crossjam@gabrielhounds:~/repos/discogsdata$ ./discogs_data_urls.xsh | wc -l
/Users/crossjam/.local/pipx/venvs/xonsh/lib/python3.10/site-packages/requests_html.py:727:  DeprecationWarning: There is no current event loop
  self.loop = asyncio.get_event_loop()
/Users/crossjam/.local/pipx/venvs/xonsh/lib/python3.10/site-packages/websockets/legacy/client.py:488: DeprecationWarning: remove loop argument
  warnings.warn("remove loop argument", DeprecationWarning)
/Users/crossjam/.local/pipx/venvs/xonsh/lib/python3.10/site-packages/websockets/legacy/protocol.py:206: DeprecationWarning: remove loop argument
  warnings.warn("remove loop argument", DeprecationWarning)
782 in total
  782
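The list on stdout is what a later crawl would consume. Here's a rough sketch of how a polite pass over it might look, not something I've run against the real files. The names parse_line and polite_fetch, the five second delay, and the one-directory-per-year layout are all assumptions for illustration.

```python
import time
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def parse_line(line):
    """Split one 'year, url' line of script output into (year, url)."""
    year, url = line.strip().split(", ", 1)
    return int(year), url


def polite_fetch(lines, dest, delay=5.0, fetch=urlretrieve, sleep=time.sleep):
    """Download each listed URL once, pausing `delay` seconds after each one.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without touching the network.
    """
    dest = Path(dest)
    fetched = []
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        year, url = parse_line(line)
        target = dest / str(year) / Path(urlparse(url).path).name
        if target.exists():
            continue  # already on disk, e.g. from an earlier run
        target.parent.mkdir(parents=True, exist_ok=True)
        fetch(url, target)
        fetched.append(target)
        sleep(delay)  # one request at a time, with a gap in between
    return fetched
```

Skipping files that already exist means an interrupted run can be restarted without asking the server for the same bytes twice.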

Small victories.

© 2008 ‐ 2023 C. Ross Jam. Built using Pelican. Theme based upon Giulio Fidente’s original svbhack, and slightly modified by crossjam.