Distributing large binary payloads as separate downloads #7852

Open
achimnol opened this issue Apr 25, 2020 · 2 comments

achimnol commented Apr 25, 2020

What's the problem this feature will solve?

Many people, myself included, are requesting size-limit increases for the wheel packages they upload to PyPI. In particular, distributing whole OS-specific prebuilt binaries and GPU code binaries often takes hundreds of megabytes.

I think a way to distribute large binary payloads separately, similar to Git LFS, would both reduce network traffic and ease PyPI maintenance.
#474 is also related to the idea.

Describe the solution you'd like

This is a rough idea; there are probably many edge cases to work out.

  • Extend the wheel package format so that:
    • MANIFEST.in can associate specific files and file patterns with paths prefixed with external resource identifiers.
      e.g., assets/mydata.bin -> mybinary/mydata.bin
    • setup.py or setup.cfg can define external resource identifiers as a mapping from slug names to URL prefixes.
      e.g., mybinary -> https://mys3bucket.s3.amazonaws.com/mypackage
  • Extend setuptools so that it:
    • uses the external resource identifiers to compose the actual resource URLs and downloads the payload during the installation process.
    • lets users override the external resource identifiers with arbitrary URLs or local paths via environment variables, for offline distribution.
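
To make the proposal above a bit more concrete, here is a minimal sketch of how the slug-to-URL mapping and the environment-variable override could fit together. Nothing here exists in setuptools or pip today: the two dictionaries stand in for what MANIFEST.in and setup.cfg might declare, and the `PY_EXTERNAL_RESOURCE_<SLUG>` variable name and both function names are assumptions made up for illustration.

```python
import os

# Hypothetical mapping that setup.py / setup.cfg might declare: slug -> URL prefix.
EXTERNAL_RESOURCES = {"mybinary": "https://mys3bucket.s3.amazonaws.com/mypackage"}

# Hypothetical mapping derived from MANIFEST.in: packaged path -> "slug/relative/path".
EXTERNAL_FILES = {"assets/mydata.bin": "mybinary/mydata.bin"}


def resolve_prefix(slug, prefixes, environ=None):
    """Return the URL prefix (or local path) for a slug, honouring an env override."""
    environ = os.environ if environ is None else environ
    # Assumed naming scheme for the override variable; nothing standardizes this.
    override = environ.get("PY_EXTERNAL_RESOURCE_" + slug.upper())
    return override if override else prefixes[slug]


def resolve_location(packaged_path, files, prefixes, environ=None):
    """Compose the full download location for one packaged file."""
    slug, _, relative = files[packaged_path].partition("/")
    prefix = resolve_prefix(slug, prefixes, environ)
    return prefix.rstrip("/") + "/" + relative
```

With no override set, `assets/mydata.bin` resolves to a URL under the S3 prefix; with the (assumed) variable pointing at a local directory, the same file resolves to a path in that directory, which is the offline case.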

Additional context
I'd like to avoid overloading PyPI and network resources, but still want to have a seamless way to distribute large binaries with Python's standard packaging mechanism.

The disadvantage of this approach is that wheels are no longer self-contained, and versioning of external resources may be broken by package maintainers' mistakes. (Maybe PyPI could provide a fallback repository for external resources, since it already hosts large packages on request.)
We could mitigate human error by enforcing specific rules for naming the external resource directories, such as requiring them to have the same names as the wheel files, and by using checksums. Moreover, we could extend wheel and twine to automatically split out files that exceed a certain size limit and to upload them, using user-provided credentials, to specific locations (e.g., S3), with PyPI as a fallback.
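
As a rough illustration of the checksum idea, the installer could compare a downloaded payload against a digest recorded at build time. This is only a sketch: where the expected digest would be stored is left open, SHA-256 is an assumed choice, and the function names are made up.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large payloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_payload(path, expected_sha256):
    """Raise ValueError if the file at path does not match the recorded digest."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"checksum mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return path
```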

I just want to give an idea and see what people think.
For example, considering the significant technical effort needed to implement and maintain the above idea, it might be more feasible to just allow larger uploads to PyPI.
There may be a past discussion on the same topic; if this is a duplicate, please forgive me and point me to that thread.

achimnol (author) commented

After searching around, I think a potential alternative is to use a private PyPI deployment (maybe using pypiserver or warehouse) and have my packages refer to packages hosted there.
However, the issue with this approach is that dependency_links is deprecated and users need to specify --extra-index-url explicitly, which breaks my expectation of seamlessness.

Another potential alternative is to provide a custom post-install script, but I just realized that the wheel format officially doesn't support pre- or post-install scripts, and I have also migrated to setup.cfg to keep the setup metadata as metadata instead of code.


kousu commented Apr 4, 2021

I have the same problem and I like that you want to solve this entirely within pip if possible.

nltk solved this with

>>> import nltk
>>> nltk.download()

and I've seen (but am currently forgetting) lots of neural network projects that have written their own .download().
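
For reference, these project-specific downloaders usually boil down to "fetch on first use, then reuse the cached file". A minimal sketch of that pattern, using only the standard library; `fetch_once` and its arguments are hypothetical and not taken from nltk or any other project:

```python
import os
import shutil
import urllib.request


def fetch_once(url, cache_dir, filename):
    """Download url into cache_dir/filename unless it is already there."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, filename)
    if not os.path.exists(target):
        partial = target + ".part"
        with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
            shutil.copyfileobj(response, out)
        # Rename only after the download finished, so a partial file is never used.
        os.replace(partial, target)
    return target
```

It works, but it happens at import or run time, completely outside of what pip knows about.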

pytorch has (pytorch/pytorch#26340 (comment)) put up this page: https://download.pytorch.org/whl/torch_stable.html (but they also distribute via pypi; they have 800MB behemoths on there)

and in my own current project we solved this by hacking together our own package manager parallel to pip:

https://github.com/neuropoly/spinalcordtoolbox/blob/c09a14a2e12317edebda03cad01c8aa4cf64aa31/install_sct#L659-L661

I don't like any of these because they break the packaging system.


Unfortunately, I don't think pypa is going to want to change anything. They want pypi to be standalone. They said pypa/pip#5898 (comment) :

Dependency links broke this expectation, and brought along with it a rash of issues where "pip install " would depend on an unknownable set of servers outside of just their configured locations. This caused a number of problems, most obviously in locked down environments where accessing a service required getting a hole punched in a firewall, but also would render people's attempts to mirror PyPI moot because these external links wouldn't get mirrored.

Basically, a user should be in charge of where they download files from, it should not be something under someone else's control. Anything that takes that control away from end users is not going to come back and if the URL form of PEP 508 works on PyPI, that's a bug that needs fixed. It should not work for any type of distribution uploaded to PyPI.

Their position doesn't make a lot of sense to me because you can still pick a specific server if you make your users install from source, e.g.:

pip install "scikit-image @ https://github.com/scikit-image/scikit-image/archive/refs/tags/v0.18.1.tar.gz"

(this is not good if your package actually has source to build, so skimage isn't a great demo here, but if you're just full of data files this should be alright)

You already mentioned another workaround of just putting up your own repo (like pytorch) (perhaps using https://github.com/chriskuehl/dumb-pypi?). It's not totally seamless but you can improve on that by:

  1. Distributing a short ./install.sh that just reads

    #!/bin/sh
    PIP_FIND_LINKS=https://example.com/your-repo.html pip install .
    
  2. Setting your repo site-wide with:

    cat <<EOF | sudo tee -a /etc/pip.conf
    [install]
    find-links = https://example.com/your-repo.html
    EOF
    

    (the supercomputer CC uses this technique to provide their customized builds)

    (there are a few other locations where you can put this:

    $ pip config debug
    env_var:
    env:
    global:
      /etc/xdg/pip/pip.conf, exists: False
      /etc/pip.conf, exists: False
    site:
      /usr/pip.conf, exists: False
    user:
      /home/kousu/.pip/pip.conf, exists: False
      /home/kousu/.config/pip/pip.conf, exists: False
    

    )

  3. Make an install.txt file like this:

    -f https://example.com/your-repo.html
    .
    

    then pip install -r install.txt will do the right thing.

But with all three, nothing can depend on your package, because the -f setting doesn't get written into the .whl. They are okay if you're just trying to distribute internal data to servers you control, but not good for public publishing.
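
For completeness: the page that find-links points at can be a plain HTML file of links to your archives, which pip parses for .whl and .tar.gz files. A minimal sketch that renders such a page; the base URL and wheel filename in the usage note are placeholders, and the function name is made up:

```python
import html
from urllib.parse import quote


def render_find_links_page(base_url, filenames):
    """Render a bare HTML page with one link per distribution file."""
    lines = ["<!DOCTYPE html>", "<html><body>"]
    for name in sorted(filenames):
        # Percent-encode the filename so characters like "+" survive in the URL.
        href = html.escape(base_url.rstrip("/") + "/" + quote(name), quote=True)
        lines.append(f'<a href="{href}">{html.escape(name)}</a><br>')
    lines.append("</body></html>")
    return "\n".join(lines) + "\n"
```

For example, `render_find_links_page("https://example.com/wheels", ["mypackage-1.0-py3-none-any.whl"])` returns a page you could upload as your-repo.html.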


None of these workarounds are very good. It seems like either:

  • we need to break up our datasets (sometimes impossible?) into chunks (good ol' .rar1, .rar2, .rar3, ...)
  • pypi needs to remove the size limit (which is obviously bad for lots of reasons)
  • pip needs to regain something like dependency_links; it kept the feature in its requirements.txt parsing, so I don't really see what the hesitancy is.
