Splitting out PackageFinder logic #5800
Heh, you went right for a meaty one huh? 😜 So conceptually,
So there's a lot to unfold here, and I think that this class does entirely too much stuff within a single interface, so my suggestion for the first bite of code here, would be to extract out the bit of code that actually fetches a list of files from PyPI. Effectively focusing on step 1 in my control graph up there, before going any further. It just so happens, I started on this awhile back and started some proof of concept pull requests, which are pypa/packaging#87, pypa/packaging#88, and dstufft/packaging#1. I would suggest reading through the threads and implementation there, particularly pypa/packaging#87. When I was writing that, I had a few basic ideas in mind:
I want to stress that the APIs I wrote in those PRs are not set in stone in any way; they were rough proofs of concept from when I previously started looking at doing exactly this. I would start by looking over those PRs and figuring out whether you feel strongly about any of the possible approaches I outlined, or whether there is another approach that you think is better. I would go through all of the discussion and try to answer any questions that were left open, as well as getting a PR set up against pip that replaces the parts of That doesn't get you all of the way to replacing Of course, if there's another section that you're more interested in, you could start somewhere else, but a PEP 503 interface seems like the best first step: the PEP is already written, there's some proof of concept code available already, and we know it's something we're going to need regardless of other internal changes inside of pip. Thoughts? |
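For readers unfamiliar with PEP 503, here is a minimal sketch of the one piece of that interface that is fully pinned down by the PEP: project name normalization, and where a project's page lives under a repository root. The helper names `normalize_name` and `project_page_url` are made up for illustration and are not taken from any of the PRs mentioned above.

```python
import re


def normalize_name(name: str) -> str:
    """Normalize a project name following the rule given in PEP 503:
    runs of '-', '_' and '.' collapse to a single '-', then lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def project_page_url(index_url: str, project: str) -> str:
    """Hypothetical helper: build the project page URL under a
    simple-repository root. The trailing slash follows PEP 503."""
    return f"{index_url.rstrip('/')}/{normalize_name(project)}/"


print(project_page_url("https://pypi.org/simple", "Django_REST.framework"))
```

Fetching and parsing the page at that URL is deliberately left out; this only covers the naming rule that any PEP 503 client would need.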
Wow that is a lot :p I’ll probably reply each piece separately (it’s late here), but some fragmented thoughts:
|
Nobody ever accused me of being terse!
I'm not married to any of the specific interfaces in any of the PRs; they were proofs of concept, so if you feel like picking one of them up and running with it, you should feel free to adjust the API. I was mostly trying to figure out what the shape of the API could possibly be, not a final actual API.
I dunno, the html5lib folks have always been pretty good at working with us, and it's only.. 0.5MB of space so I don't think it's super worth trying to optimize it.
Yea I think so. My implementation doesn't even handle step 2, because in my mind the resolver would handle that so it only took the package's name as a
Yea, the
One of my comments in the PR (or maybe it was a code comment) was asking basically the same thing :) |
@uranusjr You only listed two things? |
@cjerdonek Oops :p The third one is |
We should skip dependency links support here: I expect them to be removed pretty soon, and they are not backed by a standard anyway. Other than that, the two functions as described in the OP sound reasonable to me. It's the exact same change that I wanted to make within pip itself eventually. |
After some thought, I think a flat list of packages is the better design. It is easier to filter, and actually easier to deal with, since there is no guarantee any given version has a compatible link for a given environment. It would be cumbersome if the user needs to deal with “empty” versions. Also, is there a concrete (formal?) definition of how a find-link target (HTML page or files in a directory) should work? pip seems to contain only a very minimal description, and I don’t have much experience to know whether there are “features” unknown to me. Maybe I should write one first if none exists. |
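To make the “flat list” argument concrete, here is a rough sketch. The `Candidate` class, its fields, and the filter function are all hypothetical and only illustrate the shape being discussed; they are not the API of pip or of any existing library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    # Hypothetical record: one entry per downloadable file, not per version.
    project: str
    version: str
    filename: str


def filter_candidates(
    candidates: Iterable[Candidate],
    is_compatible: Callable[[Candidate], bool],
) -> List[Candidate]:
    """Keep only the candidates the caller considers usable."""
    return [c for c in candidates if is_compatible(c)]


candidates = [
    Candidate("example", "1.0", "example-1.0-py3-none-any.whl"),
    Candidate("example", "1.0", "example-1.0.tar.gz"),
    Candidate("example", "2.0", "example-2.0-cp37-cp37m-win_amd64.whl"),
]


def is_pure_wheel(candidate: Candidate) -> bool:
    # Stand-in for a real compatibility check (tags, markers, and so on).
    return candidate.filename.endswith("-none-any.whl")


usable = filter_candidates(candidates, is_pure_wheel)
print([c.filename for c in usable])
```

With a flat list, version 2.0 simply drops out of the result. With a nested version-to-files mapping, the same filter would leave a 2.0 key pointing at an empty list, which is the “empty” version case mentioned above.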
It's basically as simple as it sounds there -- it either looks for links in HTML file passed or checks the directory passed for the archives. These get treated the same way as items that would come from an index. Internally, a combined list is created of the "file locations" and "url locations" from the index + find-link, before actually looking at contents of those. |
What I want to clarify is how many assumptions I can make from it. For an HTML file, can I assume the link’s text is the filename? Would the target be named the same, or would I need to potentially rename it after download? What does “local” mean, do Windows UNC paths count? There are a few hidden possibilities that could need clarifying if the reader is not already well-versed in Python packaging. |
My understanding is that the files should be compliant with PEP 503's format.
I'm not sure about that. |
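As a sketch of what “compliant with PEP 503's format” could mean in practice, the snippet below pulls out each anchor's `href` and text using only the standard library. The class and function names are made up here, and this is not how pip parses pages (pip uses html5lib, per the discussion above); it only shows the two pieces of information the questions above are about.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href are ignored (self._href stays None).
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def collect_links(html: str) -> List[Tuple[str, str]]:
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.links


page = '<a href="files/example-1.0.tar.gz#sha256=abc">example-1.0.tar.gz</a><br/>'
print(collect_links(page))
```

Whether the anchor text can always be trusted as the filename is exactly the open question in this thread; the sketch just keeps both values so the caller can decide.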
I tried to write up some concrete rules.
A “find-link source” is a value passed to pip’s
[1]: pip seems to use requests, and supports
The following sections define how each type of find-link source should behave.
Local directory
If a find-link source points to a local directory, the directory may contain Python packages. Each package should be a valid Python distribution file with an appropriate file name. Files with invalid names are ignored. Distribution files from multiple projects may be put in the same directory.
Archive file
If a file specified by a find-link source has an extension that looks like a Python package, i.e. .zip, .whl, .tar.gz, .tgz, or .tar, the file is treated as a Python package. Such a file should be a valid Python distribution file with an appropriate file name. Files with invalid names are ignored.
HTML file
An HTML file specified by a find-link source must be valid HTML 5, following the same rules as the individual project pages outlined in the Simple Repository API specification, PEP 503. The anchor tags are parsed with the same rules. The HTML file does not need to be served under a PEP 503-compatible URL. Distribution files from multiple projects may be listed in the same HTML file. The text of the anchor tag is used to identify which project a distribution file belongs to. |
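The three source types above could be told apart roughly as follows. This is only a sketch of the rules as written in this comment: the function name is hypothetical, the extension list is copied from the text, and treating everything that is neither a directory nor an archive as HTML is an assumption of the sketch, not something pip is known to do.

```python
import os

# Extensions listed in the "Archive file" rule above.
ARCHIVE_EXTENSIONS = (".zip", ".whl", ".tar.gz", ".tgz", ".tar")


def classify_find_link_source(source: str) -> str:
    """Return 'directory', 'archive', or 'html' for a find-link source."""
    if os.path.isdir(source):
        return "directory"
    # Ignore a URL fragment such as '#sha256=...' when checking the extension.
    path = source.split("#", 1)[0]
    if path.lower().endswith(ARCHIVE_EXTENSIONS):
        return "archive"
    # Assumption: anything else is expected to be an HTML page.
    return "html"


print(classify_find_link_source("https://example.com/pkgs/example-1.0.tar.gz"))
print(classify_find_link_source("https://example.com/links.html"))
```

Questions such as UNC paths or URL schemes are left out on purpose, since they are still unresolved in the discussion.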
What does pip's code do -- or is this FIXME from pip's code base? |
I added the FIXME to remind myself to check. I did some tracing, and it seems pip passes it directly to requests (except [1]: pip extends requests’s schemes with |
Further digging into pip’s implementation, I found that if pip gets a directory in |
I don't think that's called from the branch handling find-links. They merely get processed through a |
I’ve started a new repo with some POC implementation: https://github.com/uranusjr/packaging-repositories There are still a lot of things to be sorted out, but I am very interested in any feedback on the (intentionally very minimal) API, especially whether this would really work as sans-I/O. I really lack experience dealing with this kind of design. |
I feel this is preferable. Having the resolver be able to keep track of which package versions it trimmed due to version specifiers is potentially helpful information. In case we're going to end up filtering the candidates based on just the version specifiers at the end, it doesn't have any extra cost functionally while keeping the option for using the extra information in the resolver open in the future. |
I am working on extracting things from PackageFinder and HTMLPage right now, and happened to find a bug :p
pip/src/pip/_internal/index.py Lines 680 to 706 in dcc3c16
This yields an incorrect result if
>>> egg_info_matches('pip-18.0', None)
'-18.0'
>>> egg_info_matches('pip-18.0', 'pip')
'18.0'
The bug is never hit because pip never passes Nothing meaningful, I just want to make a note about this discovery. There seems to be quite some dead code in |
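For anyone who wants to reproduce the two results above without checking out that commit, here is a small stand-in. It is reconstructed from the reported outputs and is not a copy of pip's function: the regular expression, the two-argument signature, and both function names are assumptions of this sketch.

```python
import re

# Assumed pattern: "<name>-<version>", case-insensitive.
_EGG_INFO_RE = re.compile(r"([a-z0-9_.]+)-([a-z0-9_.!+-]+)", re.I)


def egg_info_matches_sketch(egg_info, search_name):
    """Stand-in that reproduces the behaviour reported above."""
    match = _EGG_INFO_RE.search(egg_info)
    if not match:
        return None
    full = match.group(0)
    if search_name is None:
        # Slicing from the dash keeps the dash itself in the result.
        return full[full.index("-"):]
    look_for = search_name.lower() + "-"
    if full.lower().replace("_", "-").startswith(look_for):
        return full[len(look_for):]
    return None


def version_after_first_dash(full):
    """One possible fix for the None branch: skip past the dash."""
    return full[full.index("-") + 1:]


print(egg_info_matches_sketch("pip-18.0", None))   # '-18.0'
print(egg_info_matches_sketch("pip-18.0", "pip"))  # '18.0'
print(version_after_first_dash("pip-18.0"))        # '18.0'
```

Splitting at the first dash is still ambiguous for project names that themselves contain a dash, so this only addresses the stray leading dash, not name detection in general.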
Thanks, @uranusjr! Can you file an issue each time you notice something like that? For the general last thing you mentioned, it can even be called something like “index.py contains dead code” with a description of what you noticed. Do you have any interest in fixing issues you find? Lastly, when you say “extract,” will you be doing that within the pip code base, like I suggested in my last email to the distutils SIG? That would ensure that your final version is a reference implementation empirically compatible / working with pip. Then getting pip to use a library would simply be a matter of copying that code to the independent library. |
Yes, I intend to do this on the pip code base. My plan is something like
I’m still in a very early stage of step 1, taking notes and trying to decide how things should be split. Once I have an idea about how to split things, I’ll open a WIP pull request so the work is more public and easier to view. I would likely include cleanups of that dead code, but will probably wait until I open the WIP PR before opening issues for them, to avoid being distracted by them. But I am keeping notes about them, and they will be identified. |
One suggestion I have on your proposal is that rather than aiming for / thinking of it as a single WIP pull request, I think it would be a lot more productive to approach it as a series of smaller stand-alone pull requests, each of which makes an improvement. You can file PRs as you come across things rather than having to wait. That way it will be easier for people to review, more incremental, etc. And the code base will become easier to understand as you go. Also, I imagine many of the changes you have in mind would be good in any case, even if the end result isn't yet finalized in your mind. |
@cjerdonek Agreed. Also, smaller PRs stand less chance of clashing with other work. I'm particularly conscious of this because at the moment the PEP 517 work is a large-ish PR that is likely to cause (and suffer) merge conflicts. It's quite distracting having that additional pressure on top of the general need to get the work completed, so small, self-contained PRs are definitely my preference and recommendation. |
I see. As mentioned, I intend to keep the interface of PackageFinder intact, so my changes would be pretty much self-contained. I will keep this in mind and take extra care to make sure things don’t leak, and try to make things into smaller pieces if I could. Thanks for the advice :) |
The |
I have been working a lot on improving the structure of that module, making it more testable, adding new features (yanked files, preferring hash matches, etc), decoupling things, etc, FYI. |
I think we're basically done w.r.t. decoupling / refactoring this module. |
What's the problem this feature will solve?
This is a continuation of the discussion on distutils-sig; Donald suggested I should open an issue here to outline the intention. I’ll try to summarise previous discussions, and provide some extra context on why this is desired.
The idea came out of Pipenv’s need to build a resolver independent from pip. Pipenv tries to generate a platform-independent “lock” (dependency graph) for the user-specified requirements. This means that it cannot use pip’s high-level CLI to find dependencies directly (because it filters out packages not matching markers). On the other hand, pip provides a variety of ways to “find” packages, including index servers, find-links, dependency links, VCS, etc. Since Pipenv is expected to be highly compatible with pip, it is likely a good idea for it to have an implementation matching pip’s.
Describe the solution you'd like
The minimal requirement would be to document (standardise) pip’s current behaviour, and create an implementation (maybe largely copied from pip) based on that behaviour. The behaviour would be described as a PEP (or a part of one), and implemented in a standalone project. Whether pip will adopt the implementation to replace its current one is not relevant; it can freely decide either way.
Conceptually, I am splitting the current PackageFinder into two layers:
I plan to first focus on the first part, and work on the second after the first part is settled.
The finder would contain three public functions:
For the requirement specification, packaging.requirements.Requirement would be used as the interface. An additional class would be introduced for the installation candidate, modelled after pip._internal.models.candidate.InstallationCandidate.
Behaviourally, the finder would contain all (or almost all) functionalities in PackageFinder. One particular exception would be add_dependency_links. There is a particular FIXME that could be cleaned up, but would it be worth it? #4187 mentions it is still planned to be removed in pip 19, and maybe it should be omitted from the standardised behaviour altogether.
I am most definitely sure there are a lot of caveats that need addressing (“standardised”). Please raise any relevant topics.
A list of resolutions, updated so I (and others reading) don’t need to go through the whole thread: