Regular dump of PyPI database #1478
Comments
+1
I made an implementation of my own that uses asyncio. Because PyPI limits the number of requests, the implementation uses 5 concurrent requests, so it takes about an hour to build a new repo. https://github.com/FRidh/make-pypi-dump/tree/master The unpacked JSON is about 1.2 GB and the git repo something like 1.8 GB when unpacked. Compressed, it is about 300 MB. Cloning the repo is an order of magnitude faster than downloading a GitHub-generated tarball 😄
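Concretely, the approach described above amounts to bounding concurrency with a semaphore while fetching each project's JSON document. This is not FRidh's actual code, just a minimal sketch of the same idea; it assumes aiohttp is installed and that `https://pypi.org/pypi/<name>/json` is the endpoint being scraped:

```python
# Not FRidh's actual code: a minimal sketch of the same idea. Assumes aiohttp
# is installed and that https://pypi.org/pypi/<name>/json is the endpoint used.
import asyncio
import json

import aiohttp

CONCURRENCY = 5  # mirrors the 5-concurrent-request cap mentioned above


async def fetch_one(session, semaphore, name):
    """Fetch the JSON metadata for one project, holding a semaphore slot."""
    url = f"https://pypi.org/pypi/{name}/json"
    async with semaphore:
        async with session.get(url) as resp:
            if resp.status != 200:
                return name, None
            return name, await resp.json()


async def fetch_all(names):
    """Fetch metadata for all given projects with bounded concurrency."""
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        pairs = await asyncio.gather(
            *(fetch_one(session, semaphore, name) for name in names)
        )
    return dict(pairs)


if __name__ == "__main__":
    data = asyncio.run(fetch_all(["requests", "numpy"]))
    print(json.dumps({name: bool(meta) for name, meta in data.items()}, indent=2))
```

The semaphore is what keeps the crawl polite: only five requests are in flight at any moment, which is why a full crawl of all projects takes on the order of an hour.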
I was puzzled by your comment, but now I see that besides providing a tool, you put a full copy of the JSON files in that second GitHub repo.
@FRidh Thanks for your request! I'm marking this as somewhat low priority, but only because I would want to see more people saying "yes I want this" before prioritizing it above features that several users want -- or, I'd like to see some additional people saying "yes, if this feature existed, I'd be able to substantially change a current approach I use that causes significant load to PyPI, and thus reduce load on PyPI".
@brainwane Considering this is a rather useful feature for those doing integration (read: distributions), it will affect many more people ;-)
@FRidh That's great! And I'm open to changing the prioritization of this issue as I see more about the ratio of people commenting, +1'ing, and speaking up on distutils-sig about this versus other issues. Thanks for the update.
In IRC today, one of our developers wanted more packages on their local dev environment, and @ewdurbin said,
To me, this issue #1478 is about people wanting a regular dump of the genuine PyPI artifact store and/or database, and the thing we discussed today in IRC is about creating a better developer experience by creating/improving a smaller purpose-made DB that developers use in the Warehouse developer environment. So I'd suggest we have two different issues, unless I'm misunderstanding you, Ernest?
From #347:
A cron/celery task that dumps the JSON metadata for every package would definitely save money over making parallel requests for every package's JSON metadata. Such a dump could be downloaded with rsync or zsync or git or bup or just HTTP. Where should we host a regular Warehouse db dump?
Maybe:
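To make the idea concrete, here is a rough sketch of what such a dump job could boil down to. It is written as a plain script against the public Simple (PEP 691 JSON form) and JSON APIs rather than as a Celery task inside Warehouse, and the output format (a gzipped JSON Lines file, one project per line) is an assumption for illustration, not anything PyPI publishes:

```python
# Hypothetical dump job, sketched against the public APIs rather than a
# Celery task inside Warehouse. The output format (gzipped JSON Lines) is an
# assumption for illustration, not an existing PyPI artifact.
import gzip
import json

import requests

SIMPLE_INDEX = "https://pypi.org/simple/"
JSON_API = "https://pypi.org/pypi/{name}/json"


def list_projects():
    """Return all project names via the PEP 691 JSON form of the Simple API."""
    resp = requests.get(
        SIMPLE_INDEX, headers={"Accept": "application/vnd.pypi.simple.v1+json"}
    )
    resp.raise_for_status()
    return [project["name"] for project in resp.json()["projects"]]


def dump_all(path="pypi-metadata.jsonl.gz"):
    """Write one gzipped JSON line per project with its JSON-API metadata."""
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for name in list_projects():
            resp = requests.get(JSON_API.format(name=name))
            if resp.status_code != 200:  # deleted or otherwise unavailable
                continue
            out.write(json.dumps({"name": name, "info": resp.json()["info"]}) + "\n")


if __name__ == "__main__":
    dump_all()
```

A JSON Lines layout keeps the file streamable, so consumers never need to load the whole dump into memory, and rsync/zsync can still transfer it incrementally.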
https://github.com/nathforge/pypi-data is broken, as the S3 file is no longer available. Any plans on providing regular dumps?
This is still low priority. For folks in this issue: would something like #7403 solve this for you?
FYI, looks like FRidh/pypi-dump was taken down due to an incorrect DMCA notice from Nielsen and can probably be countered: https://github.com/github/dmca/blob/87e0e7bb43def2a20bbe3bbfe4e2a3cc4228ca02/2018/2018-09-20-Nielsen.md
There is a different scraper for this now: https://github.com/DavHau/pypi-crawlers. But it also takes quite a while to download everything: https://github.com/DavHau/nix-pypi-fetcher
Are the backup scripts that PyPA uses published anywhere? Maybe they could be modified to upload some tables to S3 or so? We would basically need all the models from this file: https://github.com/pypa/warehouse/blob/master/warehouse/packaging/models.py Could you publish an estimate of how big the project/releases tables are in your production database? How is your database system organized? Is it mirrored/sharded?
Quick update here: #7403 is nearly complete (just backfilling historical metadata), so depending on what you want to do with the data, that may be ready soon.
What's the use case for this? There seem to be a couple of different use cases in this issue. The OP was trying to reduce the load on the API, which is not a concern for us; we would prefer you use the API in that case. We have backups, but they do not separate sensitive user data from publicly accessible data.
Total size of the DB is ~22 GB. The size of the data you want is hard to estimate due to a) not knowing exactly what you want and b) the data being spread across multiple related tables, but I would estimate that it would be 50-75% of the total database size.
The use case is the same as the OP's. We want to bulk-import all Python packages, in all versions with source hashes and all dependencies, to build alternative packaging tools based on Nix. If you don't mind the load, we can also parallelize requests further with IPv6 addresses to get around the 5-concurrent-requests limit.
There is no such limit on the JSON API; what makes you think this limit exists? We are talking about the JSON API, correct?
@FRidh said in #1478 (comment)
but maybe this is not true anymore?
Sorry, I missed that. It's definitely not true anymore. It's possible that it was true at the time (we switched to an entirely new implementation of PyPI in April 2018), but I don't recall.
There is, by the way, another use case: repology/repology-updater#278. A related issue here is #347.
Is there still any potential to serve up the full 22 GB database of data for PyPI? I've mirrored all packages with bandersnatch, and I assume that with the proper metadata from prod, an offline warehouse instance could fully emulate the actual PyPI instance. I could be wrong on that, but that is my use case.
@rkbennett I don't think we can accommodate publishing full dumps of the PyPI database. Keeping it up to date and managing diffs and schema changes would be a significant overhead that is likely to result in a bad experience for consumers of the database and additional work for the limited capacity of admins, moderators, and maintainers. Additionally, the warehouse codebase is not intended to be run for any purpose other than powering pypi.org, and that use case is not supported. @di should be able to provide some context on the new BigQuery table in our public dataset that includes release information, though. This may be of use to mirroring use cases.
You might be better served by using https://github.com/devpi/devpi instead.
@ewdurbin It seems to me that this issue has been closed with reasoning not really related to the original request. What was originally requested, if I'm not wrong, was just a dump of the package metadata, the same thing that is available through the JSON API. @rkbennett, however, appears to have been talking about a database dump as in an SQL dump, and that was reasonably rejected; but that is not what the original issue asked for and does not justify closing it. Unlike an SQL dump, a metadata dump does not need any of "managing keeping it up to date, diffs, and schema changes" or "additional work for limited capacity of admins, moderators, and maintainers" - it presumably just needs something as simple as a cron job which dumps the same JSON metadata the JSON API would return, but for all packages at once, saves it into a large JSON file, and probably compresses it.

I second the original request for a metadata dump, as I need all package metadata for https://repology.org, a F/OSS service which reports outdated package versions in a lot of package repositories to their maintainers, tracks updates, security issues, and so on. Adding data from PyPI would instantly provide latest-version information for all Python modules shipped as native packages on a lot of systems/distros, which is a huge benefit to their users and maintainers.

Without the dump, I, as well as any other consumer of package metadata, would have to set up a service which regularly requests info on all PyPI packages from the API. Not only does this duplicate effort for every data consumer, it also adds extra load on the API, and it is still unreliable. Using Google BigQuery is not acceptable either, for reasons of privacy, dependency on a proprietary vendor, failure to register an account due to various reasons invented by Google, and the inability to provide an application/service which works with PyPI metadata out of the box, without the need to obtain and fill in any credentials.
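For what it's worth, consuming such a dump on the repology side would be trivial. The sketch below assumes the same hypothetical gzipped JSON Lines layout as the earlier sketch in this thread (one project per line with its JSON-API `info` block); nothing like this file is published by PyPI today:

```python
# Reader for the hypothetical gzipped JSON Lines dump sketched earlier in this
# thread; nothing like this file is published by PyPI today.
import gzip
import json


def latest_versions(path="pypi-metadata.jsonl.gz"):
    """Yield (project name, latest version) pairs from the dump file."""
    with gzip.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            record = json.loads(line)
            info = record.get("info") or {}
            yield record["name"], info.get("version")


if __name__ == "__main__":
    for name, version in latest_versions():
        print(f"{name}\t{version}")
```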
Similar: #8802.
@FRidh Sorry that our existing datasets won't work for you. Like @ewdurbin mentioned, we don't have enough resources at this time to provide an alternative.
Our API is heavily cached and handles a significant amount of traffic. It's unlikely that we'd notice any additional load from someone using our APIs for this purpose. In addition, our mirroring support is specifically designed for this use case. If these APIs are somehow unreliable at the moment, I'd suggest you file a bug report so we can get that resolved.
Would it be possible to have regular dumps of all the metadata that is now available via the APIs?
To automate the updating of Python packages in our distribution (Nix/NixOS) I'm currently using the PyPI APIs to obtain the metadata. To reduce the load on the API and make the updating faster, I determine which packages changed since last time and only retrieve the data for those packages. This works but seems rather fragile. Having a daily dump would be very useful.
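One way to determine which packages changed since the last run is PyPI's XML-RPC changelog API. The sketch below illustrates that approach only; it is not necessarily what the Nix tooling does, the `last_serial.txt` state file is invented for the example, and the XML-RPC API is discouraged for heavy use:

```python
# One possible way to find "what changed since last time": PyPI's XML-RPC
# changelog API. Not necessarily what the Nix tooling does; last_serial.txt
# is an invented state file, and the XML-RPC API is discouraged for heavy use.
import pathlib
import xmlrpc.client

SERIAL_FILE = pathlib.Path("last_serial.txt")  # hypothetical local state


def changed_projects():
    """Return project names changed since the serial recorded on the last run."""
    client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
    if SERIAL_FILE.exists():
        since = int(SERIAL_FILE.read_text())
        # Each changelog event is (name, version, timestamp, action, serial).
        events = client.changelog_since_serial(since)
        names = {event[0] for event in events}
        latest = max((event[4] for event in events), default=since)
    else:
        # First run: record the current serial and fetch everything separately.
        names = set()
        latest = client.changelog_last_serial()
    SERIAL_FILE.write_text(str(latest))
    return names


if __name__ == "__main__":
    print(sorted(changed_projects()))
```

The fragility mentioned above comes from having to persist the serial between runs and from missing events if a run fails partway; a published daily dump would remove that bookkeeping entirely.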