Mockup partial wheel download during resolvelib resolution #8442
Conversation
You'll understand that the last 8 kB will usually get you both the zip manifest and the metadata. It's probably a good optimization to avoid extra round trips by downloading a reasonable amount of data on the first request. The last few bytes of the file contain a pointer to the start of the zip manifest, and the manifest tells you the location of each file in the archive. So a robust implementation needs to open the partial download with zipfile.ZipFile and be prepared to download additional ranges if necessary. That is why it is useful to have a seekable file-like object that works over HTTP. If hashing is required, you would need to delay it until after the entire archive was downloaded, and be prepared to roll back the entire transaction. It seems like the hashing would need to happen after the resolve step, when you have decided to install the whole wheel, anyway?
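A minimal sketch of such a seekable file-like object, backed by HTTP range requests and usable with zipfile.ZipFile. The class name HTTPRangeFile is made up for illustration; it assumes the server honours the Range header and reports Content-Length on a HEAD request, and it is not pip's actual implementation:

```python
import io
import zipfile

import requests  # assumed available; pip vendors its own copy


class HTTPRangeFile:
    """Read-only, seekable file-like object that fetches byte ranges
    over HTTP on demand (hypothetical helper, for illustration)."""

    def __init__(self, url):
        self.url = url
        self.session = requests.Session()
        head = self.session.head(url, allow_redirects=True)
        head.raise_for_status()
        # Assumes the server reports the total size up front.
        self.size = int(head.headers["Content-Length"])
        self.pos = 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:  # io.SEEK_END -- how ZipFile finds the central directory
            self.pos = self.size + offset
        return self.pos

    def read(self, size=-1):
        if size < 0:
            size = self.size - self.pos
        if size <= 0 or self.pos >= self.size:
            return b""
        end = min(self.pos + size, self.size) - 1
        resp = self.session.get(
            self.url, headers={"Range": "bytes=%d-%d" % (self.pos, end)}
        )
        resp.raise_for_status()
        data = resp.content
        self.pos += len(data)
        return data


# Usage sketch: ZipFile seeks to the end for the central directory,
# then each zf.read() fetches only the ranges it needs. Every read is
# a round trip, hence the appeal of grabbing ~8 kB up front.
#
#   with zipfile.ZipFile(HTTPRangeFile(wheel_url)) as zf:
#       name = next(n for n in zf.namelist() if n.endswith("METADATA"))
#       metadata = zf.read(name)
```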
@dholth has some great ideas above!!
I recall you mentioned this here too: #7819 (comment). I think it might be really good for me to focus on canonicalizing that partial-download logic; if I were to extract that process, it would likely be something we could drop directly into this PR. @McSinyx, what do you think about that separation of concerns? Thanks so much for reaching out and creating this PR by the way!!!
I agree, @dholth, although more experiments must be carried out to determine the reasonable amount of data. E.g. 32 kB for GPL distributions can cost me a second or two (Vietnamese ISPs hate PyPI; I rarely have a megabit connection to the server), so 8 kB might be more reasonable, depending on the popularity of packages.
Somehow, reading the previous discussion, I had planned to prepend a constant-size chunk to the existing files 😄 If we feed ZipFile a seekable HTTP file of sufficient size (so that it's recognized as a ZIP file), would it magically work (for the metadata requests made through pip's pkg_resources wrapper over the wheel), or do we still need to determine the size ourselves? I couldn't come to a conclusion after trying to read the ZIP spec and CPython's implementation. I'll try to do the homework harder, but it'd be nice if you could tell me directly if you already know.
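For reference, CPython's zipfile discovers the archive size itself via seek()/tell(), so a seekable object does not need to advertise its size; what matters is that seeking relative to the end works. A simplified sketch of roughly what zipfile._EndRecData does when locating the end-of-central-directory record (not the exact CPython source):

```python
def find_eocd_offset(fp):
    """Locate the End of Central Directory record of a ZIP file on any
    seekable, readable file-like object (simplified illustration)."""
    fp.seek(0, 2)             # seek to the end ...
    filesize = fp.tell()      # ... to learn the total size
    # The fixed EOCD record is 22 bytes, possibly preceded by an
    # archive comment of up to 64 KiB, so scan at most that much tail.
    max_scan = min(filesize, 22 + 65536)
    fp.seek(filesize - max_scan)
    data = fp.read(max_scan)
    pos = data.rfind(b"PK\x05\x06")  # EOCD signature
    if pos < 0:
        raise ValueError("not a ZIP file")
    return filesize - max_scan + pos
```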
Currently not (any longer? I haven't looked into the history)--it's checked right after download--and I think it's by design, to ensure the pinned candidates will satisfy the requirements. As said by @pradyunsg,
although there's another use case where packages also try to have strict dependencies with hashes provided. I think we can figure this out later, since it's likely we can just download all wheels with checksums in parallel.
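As a rough illustration of that eager path, a sketch of downloading a wheel while hashing it in one pass (the helper name, and the idea that each link carries an expected sha256, are assumptions for the example, not pip's internal API):

```python
import hashlib

import requests


def download_and_verify(url, expected_sha256, dest_path):
    """Stream a wheel to disk, hashing as we go; raise on mismatch."""
    hasher = hashlib.sha256()
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                hasher.update(chunk)
                f.write(chunk)
    if hasher.hexdigest() != expected_sha256:
        raise ValueError("hash mismatch for %s" % url)
    return dest_path
```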
@cosmicexplorer, I see you're scraping metadata manually. About this PR in particular: I didn't intend it to be merged (nor did I intend for it not to be merged; I'm just trying to see if the way we're going is good enough), so please don't pay too much attention to the path taken here as you're finalizing yours. Thank you both for your prompt feedback!
Thank you so much for directing me to the canonical method here! I will use that for when the wheel file is already downloaded locally! I'm not sure if you already understood that part and I misinterpreted your message :)
Before reading this message, I spent a small amount of time making it into its own library, and that process has led me to agree with you! See https://github.com/cosmicexplorer/httpfile -- it's just the three source files. So, I'll be abandoning https://github.com/cosmicexplorer/httpfile and creating a branch of pip with that httpfile code inside.
No, it needs to happen prior to further resolution. Otherwise, it'd be possible for an attacker to craft a file that results in pip downloading from an arbitrary URL (à la PEP 508 direct references), with dependencies crafted carefully to reject that specific candidate later, so that the end result is the same as it would be otherwise, except pip hit an extra URL and possibly executed code from there. That completely defeats the purpose of hash checking, which is to verify that the assets we're using are only the ones that match the hashes.
I would suggest disabling partial download (i.e. always eagerly downloading the whole artifact) when hash-checking mode is enabled. This would be much easier to implement anyway.
I agree! I was trying to think of alternatives but couldn't figure anything out! However, if both hash checking and partial downloading are desired, we could consider incorporating two separate checksums into the link for a wheel: one for the wheel's entire contents, and the other for the contents of the METADATA file (which is all we care about anyway in the case of partial downloading). A sketch of the idea follows below.
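Purely illustrative: what an index could compute and publish per wheel under that scheme. The metadata-sha256 key is hypothetical, not an existing PyPI feature:

```python
import hashlib
import zipfile


def wheel_digests(wheel_path):
    """Compute a digest of the whole wheel plus one of just METADATA."""
    with open(wheel_path, "rb") as f:
        whole = hashlib.sha256(f.read()).hexdigest()
    with zipfile.ZipFile(wheel_path) as zf:
        name = next(
            n for n in zf.namelist() if n.endswith(".dist-info/METADATA")
        )
        metadata = hashlib.sha256(zf.read(name)).hexdigest()
    return {"sha256": whole, "metadata-sha256": metadata}
```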
I agree. This would likely require some standardisation work to define how things would work though, since it is quite non-trivial for an index to provide such information compared to the file hash. Another thing to go into the "we're going to get there eventually" category.
I was going to finish the last message with "but that sounds like something to leave for follow-up work"! It sounds like we are on the same page! Thank you for clarifying!
If I understand @dholth correctly, then we can have a magical file-like object reading from range requests' results that we can feed to ZipFile, which can then be fed to
Note that #8448 intentionally went out of its way to mock out some of the structure that is defined in this PR with the |
Ah, ok, this was a misunderstanding on my part! Sorry about that! (cc @dholth): The changes in #8448 do read from HTTP range requests, but they do not create a "magical file-like object" -- instead, the code in #8448 only allows you to download a single file at a time from a remote zip archive (but downloading that file is extremely fast; a rough sketch of this single-member style of fetch is shown after this comment). It doesn't quite achieve what we're thinking about here -- but it's close. The other thing that #8448 tries to work around is not fulfilling the contract of the
I will make the following changes to #8448:
|
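For context, here is a hedged sketch of the single-member style of fetch described above: grab the tail of the archive, let ZipFile parse the central directory from a zero-padded buffer, then range-request only the one member. This is illustrative only, not the actual #8448 code; it assumes the tail window covers the whole central directory and that the server supports Range requests:

```python
import io
import zipfile

import requests


def read_remote_member(url, member_suffix, tail_size=8192):
    session = requests.Session()
    head = session.head(url, allow_redirects=True)
    head.raise_for_status()
    size = int(head.headers["Content-Length"])
    # Fetch the tail, then zero-pad the unfetched front so that the
    # central directory offsets still point at the right places.
    start = max(0, size - tail_size)
    tail = session.get(url, headers={"Range": "bytes=%d-" % start}).content
    buf = bytearray(b"\0" * start + tail)
    with zipfile.ZipFile(io.BytesIO(bytes(buf))) as zf:
        info = next(
            i for i in zf.infolist()
            if i.filename.endswith(member_suffix)
        )
    # Range-request the member's local header plus compressed payload
    # (the 1024-byte slack for the header's name/extra fields is a
    # crude over-estimate for illustration) and splice it in place.
    lo = info.header_offset
    hi = min(size, lo + 30 + 1024 + info.compress_size) - 1
    chunk = session.get(
        url, headers={"Range": "bytes=%d-%d" % (lo, hi)}
    ).content
    buf[lo:lo + len(chunk)] = chunk
    with zipfile.ZipFile(io.BytesIO(bytes(buf))) as zf:
        return zf.read(info.filename)
```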
I am yet to take a closer look at both this PR and @cosmicexplorer's #8448 (yay, easy-to-remember numbers!). My view on this entire situation (right now) is that we seem to want to get something working w.r.t. the partial-download-based resolution logic. FWIW, I think even if we end up finishing/merging these PRs now, we'll wait until after the new resolver rollout before making it easier/possible for end users to enable that. I'm 100% on board with having in-development functionality hidden behind the --unstable-feature flag. That's the mechanism for having code for in-development stuff in pip.
That's extremely helpful!!! I think I can then keep #8448 as is, and then just focus on making it download the wheels at the end of the run to keep the contract of
This has served as a point for discussion and has gained me certain understandings. I guess it's time to let it rest in peace.
This is open to facilitate discussions relevant to GH-7819. I believe that a dry run could be done by injecting print and exit after the resolution is complete, so I'll focus on the matter of real installation: after resolution, the undownloaded wheels can be downloaded in parallel, which warrants a higher number (> 5 vs. < 5, I guess) of packages being downloaded at the same time.

For a local test run, I used oslo-utils==1.4.0 for pure Python with heavy backtracking, and axuy for something depending on extension modules (numpy in this case).

cc @pradyunsg, @cosmicexplorer, @ofek and @dholth specifically for opinions on the approach.
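A rough, self-contained sketch of that parallel-download step using a thread pool. Here, pinned is an illustrative list of (url, destination path) pairs produced by the resolver, and the worker count of 5 is only a nod to the numbers above, not a measured value:

```python
from concurrent.futures import ThreadPoolExecutor

import requests


def fetch(url, dest):
    """Stream one wheel to disk."""
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as f:
            for chunk in resp.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
    return dest


def download_all(pinned, max_workers=5):
    """Download every still-missing wheel concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch, url, dest) for url, dest in pinned]
        return [f.result() for f in futures]
```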