
Regular dump of PyPI database #1478

Closed
FRidh opened this issue Oct 16, 2016 · 25 comments
Labels: developer experience, feature request

Comments

@FRidh

FRidh commented Oct 16, 2016

Would it be possible to have regular dumps of all metadata that is now available via the APIs?

To automate the updating of Python packages in our distribution (Nix/NixOS), I'm currently using the PyPI APIs to obtain the metadata. To reduce the load on the API and make the updating faster, I determine which packages changed since last time and only retrieve the data for those packages. This works but seems rather fragile. Having a daily dump would be very useful.
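
(Editorial note: a minimal sketch of the kind of incremental polling described above, assuming PyPI's XML-RPC changelog endpoints are used to find which projects changed since the last run; where the serial is stored and how the per-project JSON is then refetched are placeholders.)

```python
import xmlrpc.client

# PyPI's XML-RPC endpoint; changelog_since_serial/changelog_last_serial report
# which projects changed since a known serial number.
client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")

def changed_projects(since_serial: int) -> tuple[set[str], int]:
    """Return project names changed since `since_serial` and the new serial to store."""
    events = client.changelog_since_serial(since_serial)
    # Each event is a (name, version, timestamp, action, serial) tuple.
    names = {name for name, _version, _timestamp, _action, _serial in events}
    return names, client.changelog_last_serial()

# names, serial = changed_projects(saved_serial)
# ...then refetch only those projects via https://pypi.org/pypi/<name>/json
# and persist `serial` for the next run.
```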

@nealmcb

nealmcb commented Apr 13, 2017

+1
Note that for now, https://github.com/nathforge/pypi-data can download all the JSON metadata from PyPI and keep it updated. Though note nathforge/pypi-data#2.

@FRidh
Author

FRidh commented Jun 4, 2017

I made an implementation of my own that uses asyncio. Because PyPI limits the number of requests, the implementation uses 5 concurrent requests, so it takes about an hour to build a new repo.

https://github.com/FRidh/make-pypi-dump/tree/master
https://github.com/FRidh/pypi-dump

The unpacked JSON is about 1.2 GB, and the git repo is something like 1.8 GB when unpacked. Compressed, it is about 300 MB. Cloning the repo is an order of magnitude faster than downloading a GitHub-generated tarball 😄
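
(Editorial note: for illustration, a minimal sketch of the approach described in this comment, assuming aiohttp as the HTTP client; the package list and output handling are placeholders, not the actual make-pypi-dump code.)

```python
import asyncio
import json

import aiohttp  # assumed HTTP client; the original implementation may differ

async def dump_all(names: list[str], path: str) -> None:
    """Fetch the JSON metadata of every named project and write one JSON file."""
    sem = asyncio.Semaphore(5)  # the 5-concurrent-request cap mentioned above

    async def fetch_json(session: aiohttp.ClientSession, name: str) -> dict:
        # Bound concurrency with the semaphore so at most 5 requests are in flight.
        async with sem:
            async with session.get(f"https://pypi.org/pypi/{name}/json") as resp:
                resp.raise_for_status()
                return await resp.json()

    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_json(session, n) for n in names))
    with open(path, "w") as fh:
        json.dump({r["info"]["name"]: r for r in results}, fh)

# asyncio.run(dump_all(["requests", "numpy"], "pypi-dump.json"))
```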

@nealmcb

nealmcb commented Jun 4, 2017

I was puzzled by your comment, but now I see that besides providing a tool, you put a full copy of the JSON files in that second GitHub repo, pypi-dump. And it looks like if we clone the second repo once, we can use your Python program in the first repo, make-pypi-dump, to grab just the updates made since the state recorded in the second repo's state.json file.
Thanks!
Though if that's all correct, I'd suggest adding a .py extension to the make-pypi-dump file, and cross-referencing the make-pypi-dump and pypi-dump repos in each other's README.md.

@brainwane
Contributor

@FRidh Thanks for your request!

I'm marking this as somewhat low priority, but only because I would want to see more people saying "yes, I want this" before prioritizing it above features that several users want -- or some additional people saying "yes, if this feature existed, I'd be able to substantially change a current approach of mine that causes significant load on PyPI, and thus reduce load on PyPI".

@FRidh
Author

FRidh commented Dec 7, 2017

@brainwane Considering this is a rather useful feature for those doing integration (read: distributions), it will affect many more people ;-)

@brainwane
Contributor

@FRidh That's great! And I'm open to changing the prioritization of this issue as I see more about the ratio of people commenting, +1'ing, and speaking up on distutils-sig about this versus other issues. Thanks for the update.

brainwane added the developer experience label on Feb 27, 2018
@brainwane
Contributor

In IRC today, one of our developers wanted more packages in their local dev environment, and @ewdurbin said,

We have some tasks to improve the local dev dump for situations just like this; something like a user named user with a mix of packages and a user named admin with admin privileges, rather than having to hunt/peck or upload your own stuff.

To me, this issue #1478 is about people wanting a regular dump of the genuine PyPI artifact store and/or database, and the thing we discussed today in IRC is about creating a better developer experience by creating/improving a smaller purpose-made DB that developers use in the Warehouse developer environment. So I'd suggest we have two different issues, unless I'm misunderstanding you, Ernest?

@westurner

westurner commented Mar 14, 2018

From #347:

Zsync is like rsync over HTTP
http://zsync.moria.org.uk/

zsync provides transfers that are nearly as efficient as rsync -z or cvsup, without the need to run a special server application. All that is needed is an HTTP/1.1-compliant web server.

[...]

Single meta-file — zsync downloads are offered by building a .zsync file, which contains the meta-data needed by zsync. This file contains the precalculated checksums for the rsync algorithm; it is generated on the server, once, and is then used by any number of downloaders.

A cron/celery task that dumps the JSON metadata for every package would definitely save money compared to making parallel requests for each package's JSON.

Such a dump could be downloaded with rsync or zsync or git or bup or just HTTP.

Where should we host a regular Warehouse db dump?
Where should we host a regular Warehouse JSON dump?

Maybe:

  • Generate .zsync delta checksums with a fallback to the full dump?
    • That'd save a lot of bandwidth and cache RAM.
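
(Editorial note: a minimal sketch of such a periodic dump task, assuming Celery with a beat schedule and the requests library; the broker URL, output path, schedule, and the helper that lists all project names are placeholders, not anything PyPI actually runs.)

```python
import gzip
import json

import requests
from celery import Celery
from celery.schedules import crontab

app = Celery("pypi_dump", broker="redis://localhost:6379/0")  # placeholder broker

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # Produce the dump once a day; schedule and path are assumptions.
    sender.add_periodic_task(
        crontab(hour=3, minute=0),
        dump_metadata.s("/srv/dumps/pypi.json.gz"),
    )

@app.task
def dump_metadata(path):
    """Write the JSON metadata of every project into one gzip-compressed file."""
    names = list_all_project_names()  # hypothetical helper returning all project names
    dump = {}
    for name in names:
        resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=30)
        if resp.ok:
            dump[name] = resp.json()
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(dump, fh)
```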

@ChillarAnand

https://github.com/nathforge/pypi-data is broken, as the S3 file is no longer available.
https://github.com/FRidh/pypi-dump has been taken down.

Any plans on providing regular dumps?

@di
Member

di commented Feb 26, 2020

This is still low priority.

For folks in this issue: would something like #7403 solve this for you?

@di
Member

di commented Feb 26, 2020

FYI, looks like FRidh/pypi-dump was taken down due to an incorrect DMCA notice from Nielsen and can probably be countered: https://github.com/github/dmca/blob/87e0e7bb43def2a20bbe3bbfe4e2a3cc4228ca02/2018/2018-09-20-Nielsen.md

@Mic92

Mic92 commented Jul 27, 2020

There is a different scraper for this now: https://github.com/DavHau/pypi-crawlers. But it also takes quite a while to download everything: https://github.com/DavHau/nix-pypi-fetcher

@Mic92

Mic92 commented Jul 27, 2020

Are there published backup scripts that PyPA uses? Maybe they could be modified to upload some tables to S3 or so. We would basically need all the models from this file: https://github.com/pypa/warehouse/blob/master/warehouse/packaging/models.py Could you publish an estimate of how big the project/release tables are in your production database? How is your database system organized? Is it mirrored/sharded?

@di
Member

di commented Jul 27, 2020

Quick update here: #7403 is nearly complete (just backfilling historical metadata) so depending on what you want to do with the data, that may be ready soon.

Are there published backup scripts that PyPA uses? Maybe they could be modified to upload some tables to S3 or so?

What's the use case for this? There seem to be a couple of different use cases in this issue. The OP was trying to reduce the load on the API, which is not a concern for us; we would prefer you use the API in that case.

We have backups but they do not separate sensitive user data from publicly-accessible data.

Could you publish an estimate of how big the project/release tables are in your production database?

Total size of the DB is ~22GB. The size of the data you want is hard to estimate due to a) not knowing exactly what you want and b) the data being spread across multiple related tables, but I would estimate that it would be 50-75% of the total database size.

@Mic92

Mic92 commented Jul 27, 2020


The use case is the same as the OP's. We want to bulk-import all Python packages, with all versions/source hashes and all dependencies, to build alternative packaging tools based on Nix. If you don't mind about the load, we can also parallelize requests further with IPv6 addresses to get around the 5-concurrent-requests limit.

@di
Member

di commented Jul 27, 2020

If you don't mind about the load, we can also parallelize requests further with IPv6 addresses to get around the 5-concurrent-requests limit.

There is no such limit on the JSON API; what is making you think this limit exists? We are talking about the JSON API, correct?

@Mic92

Mic92 commented Jul 27, 2020

@FRidh said in #1478 (comment)

I made an implementation of my own that uses asyncio. Because PyPI limits the number of requests, the implementation uses 5 concurrent requests, so it takes about an hour to build a new repo.

FRidh/make-pypi-dump@master
FRidh/pypi-dump

The unpacked JSON is about 1.2 GB, and the git repo is something like 1.8 GB when unpacked. Compressed, it is about 300 MB. Cloning the repo is an order of magnitude faster than downloading a GitHub-generated tarball 😄

but maybe this is not true anymore?

@di
Member

di commented Jul 27, 2020

Sorry, I missed that. It's definitely not true anymore. It's possible that it was true at the time (we switched to an entirely new implementation of PyPI in April 2018), but I don't recall.

@FRidh
Author

FRidh commented Jul 28, 2020

By the way, there is another use case: repology/repology-updater#278. A related issue here is #347.

@rkbennett

Is there still a possibility of serving up the full 22 GB database of data for PyPI? I've mirrored all packages with bandersnatch, and I assume that with the proper metadata from prod, an offline Warehouse instance could fully emulate the actual PyPI instance. I could be wrong on that, but that is my use case.

@ewdurbin
Member

@rkbennett I don't think we can accommodate publishing full dumps of the PyPI database. Keeping it up to date and managing diffs and schema changes is significant overhead that is likely to result in a bad experience for consumers of the database and additional work for the limited capacity of admins, moderators, and maintainers.

Additionally, the warehouse codebase is not intended to be run for any purpose other than powering pypi.org and that use case is not supported.

@di should be able to provide some context on the new BigQuery table in our public dataset that includes release information, though. This may be of use to mirroring use cases.

@pradyunsg
Contributor

an offline Warehouse instance could fully emulate the actual PyPI instance. I could be wrong on that, but that is my use case.

You might be better served by using https://github.com/devpi/devpi instead.

@AMDmi3

AMDmi3 commented Nov 9, 2020

@ewdurbin It seems to me that this issue has been closed with reasoning not really related to the original request.

What was originally requested, if I'm not wrong, was just a dump of package metadata, the same data that is available through the JSON API. @rkbennett, however, seems to have been talking about a database dump as in an SQL dump, and that was reasonably rejected; however, it is not really related to the original issue and does not justify closing it.

Unlike an SQL dump, a metadata dump does not need any of "keeping it up to date and managing diffs and schema changes" or "additional work for the limited capacity of admins, moderators, and maintainers" - it presumably needs something as simple as a cron job which dumps the same JSON metadata the JSON API would return, but for all packages at once, saves it into one large JSON file, and probably compresses it.

I second the original request for a metadata dump, as I need metadata for all packages for https://repology.org, a F/OSS service which reports outdated package versions in many package repositories to their maintainers, tracks updates, security issues, and so on. Adding data from PyPI would instantly provide latest-version information for all Python modules provided as native packages on many systems/distros, which is a huge benefit to their users and maintainers.

Without the dump, I, as well as any other consumer of package metadata, would have to set up a service which regularly requests info on all PyPI packages from the API. Not only does this duplicate effort for every data consumer, it also adds extra load on the API, and is still unreliable. Using Google BigQuery is not acceptable either, for reasons of privacy, dependency on a proprietary vendor, failure to register an account for various reasons invented by Google, and the inability to provide an application/service which works with PyPI metadata out of the box, without the need to obtain and fill in any credentials.
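
(Editorial note: if a dump like that existed, a consumer could process it entirely offline. A sketch, assuming the gzip-compressed single-file format suggested above, i.e. one JSON object mapping project names to the same structure the JSON API returns; the format and filename are assumptions, not something PyPI publishes.)

```python
import gzip
import json

def latest_versions(dump_path: str) -> dict[str, str]:
    """Map each project name to the version the dump reports as its latest release."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as fh:
        dump = json.load(fh)
    # Each value is assumed to mirror https://pypi.org/pypi/<name>/json,
    # where data["info"]["version"] is the latest release.
    return {name: data["info"]["version"] for name, data in dump.items()}

# versions = latest_versions("pypi.json.gz")
# print(versions.get("requests"))
```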

@FRidh
Author

FRidh commented Nov 9, 2020

Similar #8802.

@di
Member

di commented Nov 9, 2020

@FRidh Sorry that our existing datasets won't work for you. Like @ewdurbin mentioned, we don't have enough resources at this time to provide an alternative.

Not only does this duplicate effort for every data consumer, it also adds extra load on the API, and is still unreliable.

Our API is heavily cached and handles a significant amount of traffic. It's unlikely that we'd notice any additional load from someone using our APIs for this purpose. In addition, our mirroring support is specifically designed for this use case.

If these APIs are somehow unreliable at the moment, I'd suggest you file a bug report so we can get that resolved.
