
Regular dump of PyPI database #1478

Closed
FRidh opened this issue Oct 16, 2016 · 25 comments
Labels: developer experience, feature request

Comments

@FRidh

FRidh commented Oct 16, 2016

Would it be possible to have regular dumps of all metadata that is now available via the APIs?

To automate the updating of Python packages in our distribution (Nix/NixOS), I'm currently using the PyPI APIs to obtain the metadata. To reduce the load on the API and make the updating faster, I determine which packages changed since last time and only retrieve the data for those packages. This works but seems rather fragile. Having a daily dump would be very useful.
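
(Editorial note: a minimal sketch of the kind of incremental polling described above, assuming PyPI's XML-RPC changelog endpoints are used to find which projects changed since the last run; where the serial is stored and how the per-project JSON is then refetched are placeholders.)

```python
import xmlrpc.client

# PyPI's XML-RPC endpoint; changelog_since_serial/changelog_last_serial report
# which projects changed since a known serial number.
client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")

def changed_projects(since_serial: int) -> tuple[set[str], int]:
    """Return project names changed since `since_serial` and the new serial to store."""
    events = client.changelog_since_serial(since_serial)
    # Each event is a (name, version, timestamp, action, serial) tuple.
    names = {name for name, _version, _timestamp, _action, _serial in events}
    return names, client.changelog_last_serial()

# names, serial = changed_projects(saved_serial)
# ...then refetch only those projects via https://pypi.org/pypi/<name>/json
# and persist `serial` for the next run.
```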

@nealmcb

nealmcb commented Apr 13, 2017

+1
Note that for now, https://github.com/nathforge/pypi-data can download all the JSON metadata from PyPI and keep it updated. Though note nathforge/pypi-data#2.

@FRidh
Author

FRidh commented Jun 4, 2017

I made an implementation of my own that uses asyncio. Because PyPI limits the number of requests, the implementation uses 5 concurrent requests, so it takes about an hour to build a new repo.

https://github.com/FRidh/make-pypi-dump/tree/master
https://github.com/FRidh/pypi-dump

The unpacked JSON is about 1.2 GB, and the git repo is something like 1.8 GB when unpacked. Compressed, it is about 300 MB. Cloning the repo is an order of magnitude faster than downloading a GitHub-generated tarball 😄
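
(Editorial note: for illustration, a minimal sketch of the approach described in this comment, assuming aiohttp as the HTTP client; the package list and output handling are placeholders, not the actual make-pypi-dump code.)

```python
import asyncio
import json

import aiohttp  # assumed HTTP client; the original implementation may differ

async def dump_all(names: list[str], path: str) -> None:
    """Fetch the JSON metadata of every named project and write one JSON file."""
    sem = asyncio.Semaphore(5)  # the 5-concurrent-request cap mentioned above

    async def fetch_json(session: aiohttp.ClientSession, name: str) -> dict:
        # Bound concurrency with the semaphore so at most 5 requests are in flight.
        async with sem:
            async with session.get(f"https://pypi.org/pypi/{name}/json") as resp:
                resp.raise_for_status()
                return await resp.json()

    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_json(session, n) for n in names))
    with open(path, "w") as fh:
        json.dump({r["info"]["name"]: r for r in results}, fh)

# asyncio.run(dump_all(["requests", "numpy"], "pypi-dump.json"))
```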

@nealmcb

nealmcb commented Jun 4, 2017

I was puzzled by your comment, but now I see that besides providing a tool, you put a full copy of the JSON files in that second GitHub repo, pypi-dump. And it looks like if we clone the second repo once, we can use your Python program in the first repo, make-pypi-dump, to grab just the updates made since the state recorded in the second repo's state.json file.
Thanks!
Though if that's all correct, I'd suggest adding a .py extension to the make-pypi-dump file, and cross-referencing the make-pypi-dump and pypi-dump repos in each other's README.md.

@brainwane
Contributor

@FRidh Thanks for your request!

I'm marking this as somewhat low priority, but only because I would want to see more people saying "yes, I want this" before prioritizing it above features that several users want -- or some additional people saying "yes, if this feature existed, I'd be able to substantially change a current approach of mine that causes significant load on PyPI, and thus reduce load on PyPI".

@FRidh
Author

FRidh commented Dec 7, 2017

@brainwane Considering this is a rather useful feature for those doing integration (read: distributions), it will affect many more people ;-)

@brainwane
Contributor

@FRidh That's great! And I'm open to changing the prioritization of this issue as I see more about the ratio of people commenting, +1'ing, and speaking up on distutils-sig about this versus other issues. Thanks for the update.

brainwane added the developer experience label on Feb 27, 2018
@brainwane
Contributor

In IRC today, one of our developers wanted more packages in their local dev environment, and @ewdurbin said,

We have some tasks to improve the local dev dump for situations just like this; something like a user named user with a mix of packages and a user named admin with admin privileges, rather than having to hunt/peck or upload your own stuff.

To me, this issue #1478 is about people wanting a regular dump of the genuine PyPI artifact store and/or database, and the thing we discussed today in IRC is about creating a better developer experience by creating/improving a smaller purpose-made DB that developers use in the Warehouse developer environment. So I'd suggest we have two different issues, unless I'm misunderstanding you, Ernest?

@westurner

westurner commented Mar 14, 2018

From #347:

Zsync is like rsync over HTTP
http://zsync.moria.org.uk/

zsync provides transfers that are nearly as efficient as rsync -z or cvsup, without the need to run a special server application. All that is needed is an HTTP/1.1-compliant web server.

[...]

Single meta-file — zsync downloads are offered by building a .zsync file, which contains the meta-data needed by zsync. This file contains the precalculated checksums for the rsync algorithm; it is generated on the server, once, and is then used by any number of downloaders.

A cron/celery task that dumps the JSON metadata for every package would definitely save money compared to making parallel requests for each package's JSON.

Such a dump could be downloaded with rsync or zsync or git or bup or just HTTP.

Where should we host a regular Warehouse db dump?
Where should we host a regular Warehouse JSON dump?

Maybe:

  • Generate .zsync delta checksums with a fallback to the full dump?
    • That'd save a lot of bandwidth and cache RAM.
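
(Editorial note: a minimal sketch of such a periodic dump task, assuming Celery with a beat schedule and the requests library; the broker URL, output path, schedule, and the helper that lists all project names are placeholders, not anything PyPI actually runs.)

```python
import gzip
import json

import requests
from celery import Celery
from celery.schedules import crontab

app = Celery("pypi_dump", broker="redis://localhost:6379/0")  # placeholder broker

@app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
    # Produce the dump once a day; schedule and path are assumptions.
    sender.add_periodic_task(
        crontab(hour=3, minute=0),
        dump_metadata.s("/srv/dumps/pypi.json.gz"),
    )

@app.task
def dump_metadata(path):
    """Write the JSON metadata of every project into one gzip-compressed file."""
    names = list_all_project_names()  # hypothetical helper returning all project names
    dump = {}
    for name in names:
        resp = requests.get(f"https://pypi.org/pypi/{name}/json", timeout=30)
        if resp.ok:
            dump[name] = resp.json()
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        json.dump(dump, fh)
```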

@ChillarAnand

https://github.com/nathforge/pypi-data is broken, as the S3 file is no longer available.
https://github.com/FRidh/pypi-dump has been taken down.

Any plans on providing regular dumps?

@di
Member

di commented Feb 26, 2020

This is still low priority.

For folks in this issue: would something like #7403 solve this for you?

@di
Member

di commented Feb 26, 2020

FYI, looks like FRidh/pypi-dump was taken down due to an incorrect DMCA notice from Nielsen and can probably be countered: https://github.com/github/dmca/blob/87e0e7bb43def2a20bbe3bbfe4e2a3cc4228ca02/2018/2018-09-20-Nielsen.md

@Mic92

Mic92 commented Jul 27, 2020

There is a different scraper for this now: https://github.com/DavHau/pypi-crawlers. But it also takes quite a while to download everything: https://github.com/DavHau/nix-pypi-fetcher

@Mic92

Mic92 commented Jul 27, 2020

Are there published backup scripts that PyPA uses? Maybe they could be modified to upload some tables to S3 or so. We would basically need all the models from this file: https://github.com/pypa/warehouse/blob/master/warehouse/packaging/models.py Could you publish an estimate of how big the project/release tables are in your production database? How is your database system organized? Is it mirrored/sharded?

@di
Member

di commented Jul 27, 2020

Quick update here: #7403 is nearly complete (just backfilling historical metadata) so depending on what you want to do with the data, that may be ready soon.

Are there published backup scripts that PyPA uses? Maybe they could be modified to upload some tables to S3 or so?

What's the use case for this? There seem to be a couple of different use cases in this issue. The OP was trying to reduce the load on the API, which is not a concern for us; we would prefer you use the API in that case.

We have backups but they do not separate sensitive user data from publicly-accessible data.

Could you publish an estimate of how big the project/release tables are in your production database?

Total size of the DB is ~22GB. The size of the data you want is hard to estimate due to a) not knowing exactly what you want and b) the data being spread across multiple related tables, but I would estimate that it would be 50-75% of the total database size.

@Mic92

Mic92 commented Jul 27, 2020


The use case is the same as the OP's. We want to bulk-import all Python packages, with all versions/source hashes and all dependencies, to build alternative packaging tools based on Nix. If you don't mind about the load, we can also parallelize requests further with IPv6 addresses to get around the 5-concurrent-requests limit.

@di
Member

di commented Jul 27, 2020

If you don't mind about the load, we can also parallelize requests further with IPv6 addresses to get around the 5-concurrent-requests limit.

There is no such limit on the JSON API; what is making you think this limit exists? We are talking about the JSON API, correct?

@Mic92

Mic92 commented Jul 27, 2020

@FRidh said in #1478 (comment)

I made an implementation of my own that uses asyncio. Because PyPI limits the number of requests, the implementation uses 5 concurrent requests, so it takes about an hour to build a new repo.

FRidh/make-pypi-dump@master
FRidh/pypi-dump

The unpacked JSON is about 1.2 GB, and the git repo is something like 1.8 GB when unpacked. Compressed, it is about 300 MB. Cloning the repo is an order of magnitude faster than downloading a GitHub-generated tarball 😄

but maybe this is not true anymore?

@di
Member

di commented Jul 27, 2020

Sorry, I missed that. It's definitely not true anymore. It's possible that it was true at the time (we switched to an entirely new implementation of PyPI in April 2018), but I don't recall.

@FRidh
Author

FRidh commented Jul 28, 2020

By the way, there is another use case: repology/repology-updater#278. A related issue here is #347.

@rkbennett

Is there still a possibility of serving up the full 22 GB database of data for PyPI? I've mirrored all packages with bandersnatch, and I assume that with the proper metadata from prod, an offline Warehouse instance could fully emulate the actual PyPI instance. I could be wrong on that, but that is my use case.

@ewdurbin
Member

@rkbennett I don't think we can accommodate publishing full dumps of the PyPI database. Keeping it up to date and managing diffs and schema changes is significant overhead that is likely to result in a bad experience for consumers of the database and additional work for the limited capacity of admins, moderators, and maintainers.

Additionally, the warehouse codebase is not intended to be run for any purpose other than powering pypi.org and that use case is not supported.

@di should be able to provide some context on the new BigQuery table in our public dataset that includes release information, though. This may be of use to mirroring use cases.

@pradyunsg
Contributor

an offline Warehouse instance could fully emulate the actual PyPI instance. I could be wrong on that, but that is my use case.

You might be better served by using https://github.com/devpi/devpi instead.

@AMDmi3

AMDmi3 commented Nov 9, 2020

@ewdurbin It seems to me that this issue has been closed with reasoning not really related to the original request.

What was originally requested, if I'm not wrong, was just a dump of package metadata, the same data that is available through the JSON API. @rkbennett, however, seems to have been talking about a database dump as in an SQL dump, and that was reasonably rejected; however, it is not really related to the original issue and does not justify closing it.

Unlike an SQL dump, a metadata dump does not need any of "keeping it up to date and managing diffs and schema changes" or "additional work for the limited capacity of admins, moderators, and maintainers" - it presumably needs something as simple as a cron job which dumps the same JSON metadata the JSON API would return, but for all packages at once, saves it into one large JSON file, and probably compresses it.

I second the original request for a metadata dump, as I need metadata for all packages for https://repology.org, a F/OSS service which reports outdated package versions in many package repositories to their maintainers, tracks updates, security issues, and so on. Adding data from PyPI would instantly provide latest-version information for all Python modules provided as native packages on many systems/distros, which is a huge benefit to their users and maintainers.

Without the dump, I, as well as any other consumer of package metadata, would have to set up a service which regularly requests info on all PyPI packages from the API. Not only does this duplicate effort for every data consumer, it also adds extra load on the API, and is still unreliable. Using Google BigQuery is not acceptable either, for reasons of privacy, dependency on a proprietary vendor, failure to register an account for various reasons invented by Google, and the inability to provide an application/service which works with PyPI metadata out of the box, without the need to obtain and fill in any credentials.
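
(Editorial note: if a dump like that existed, a consumer could process it entirely offline. A sketch, assuming the gzip-compressed single-file format suggested above, i.e. one JSON object mapping project names to the same structure the JSON API returns; the format and filename are assumptions, not something PyPI publishes.)

```python
import gzip
import json

def latest_versions(dump_path: str) -> dict[str, str]:
    """Map each project name to the version the dump reports as its latest release."""
    with gzip.open(dump_path, "rt", encoding="utf-8") as fh:
        dump = json.load(fh)
    # Each value is assumed to mirror https://pypi.org/pypi/<name>/json,
    # where data["info"]["version"] is the latest release.
    return {name: data["info"]["version"] for name, data in dump.items()}

# versions = latest_versions("pypi.json.gz")
# print(versions.get("requests"))
```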

@FRidh
Author

FRidh commented Nov 9, 2020

Similar #8802.

@di
Member

di commented Nov 9, 2020

@FRidh Sorry that our existing datasets won't work for you. Like @ewdurbin mentioned, we don't have enough resources at this time to provide an alternative.

Not only does this duplicate effort for every data consumer, it also adds extra load on the API, and is still unreliable.

Our API is heavily cached and handles a significant amount of traffic. It's unlikely that we'd notice any additional load from someone using our APIs for this purpose. In addition, our mirroring support is specifically designed for this use case.

If these APIs are somehow unreliable at the moment, I'd suggest you file a bug report so we can get that resolved.
