Concurrent pypi lookups with --update-all #101

Closed
peterbe opened this issue Dec 10, 2018 · 6 comments

Comments

Owner

peterbe commented Dec 10, 2018

If a requirements file has 10 packages, you have to do 10 pypi.org lookups, all in serial. When you use --update-all --interactive, the delay between each line is annoying.

Owner Author

peterbe commented Dec 10, 2018

I did a rough hack to get this to work and I'm just jotting down some notes.
I have a requirements file with 53 packages listed.
I ran this:

time python hashin.py --dry-run --update-all --include-prereleases -r ~/songsearch/requirements.txt

The whole thing took 1.92s.
Also, I put a little timer around each r = urlopen(url) call, dumped the timings to stdout and parsed the output. If you sum ALL the times it took to download, it comes to 16.6 seconds.
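
To illustrate the gap between wall-clock time and the summed download times, here is a minimal sketch. It is not the hashin code: fake_download is a hypothetical stand-in that sleeps instead of calling urlopen, and the 0.05 second delay is an arbitrary assumption.

```python
import concurrent.futures
import time


def fake_download(name, delay=0.05):
    """Hypothetical stand-in for one PyPI lookup; sleeps instead of doing HTTP."""
    t0 = time.perf_counter()
    time.sleep(delay)
    # Return the name together with how long this one "download" took.
    return name, time.perf_counter() - t0


def timed_run(names):
    """Run all lookups in a thread pool; return (wall_clock, summed_durations)."""
    t0 = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(names)) as executor:
        durations = [duration for _, duration in executor.map(fake_download, names)]
    return time.perf_counter() - t0, sum(durations)


if __name__ == "__main__":
    wall, total = timed_run(["pkg%d" % i for i in range(10)])
    print("wall clock: %.2fs, sum of downloads: %.2fs" % (wall, total))
```

Because the waits overlap, the wall-clock figure should land near the duration of a single lookup, while the sum grows with the number of packages.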

The only bad thing so far is that there's an awkward little delay on the terminal whilst all this downloading is happening. You think nothing's happening, like it's stuck. The verbose flag helps a little but that's not on by default. If you use --interactive it could be a nice place to inform the user about this.
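
One possible way to show that something is happening, sketched under assumptions: prefetch_with_progress and its fetch argument are hypothetical names, and nothing like this is confirmed to be in hashin. It rewrites a single status line as each lookup completes.

```python
import concurrent.futures
import sys


def prefetch_with_progress(names, fetch, out=sys.stderr):
    """Fetch all names concurrently, reporting progress as lookups finish.

    `fetch` is a hypothetical callable that takes a package name and
    returns its data.
    """
    memory = {}
    total = len(names)
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = {executor.submit(fetch, name): name for name in names}
        completed = concurrent.futures.as_completed(futures)
        for done, future in enumerate(completed, 1):
            memory[futures[future]] = future.result()
            # "\r" returns to the start of the line so the counter updates in place.
            out.write("\rLooked up %d of %d packages" % (done, total))
            out.flush()
    out.write("\n")
    return memory
```

Writing to stderr keeps the progress line out of any output that gets piped or parsed.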

Owner Author

peterbe commented Dec 10, 2018

Perhaps I'm over-worrying about the nothing-happens-till-all-is-downloaded delay. I just tried another file and the WHOLE thing took just 2 seconds. That requirements file had 79 packages listed and it took a total of 2 seconds to do 79 HTTP requests plus all the post-processing.

Owner Author

peterbe commented Dec 10, 2018

@mythmon @di What do you think about this? I haven't finished the work, but the results above look promising: ~2 seconds to check 53 to 79 packages for updates. The core of it is this:

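import concurrent.futures

# _explode_package_spec, Requirement and get_package_data are defined
# or imported elsewhere in hashin.py.
def pre_download_packages(memory, specs, verbose=False):
    # Map each future to the package name it is downloading.
    futures = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        for spec in specs:
            package, _, _ = _explode_package_spec(spec)
            req = Requirement(package)
            futures[
                executor.submit(get_package_data, req.name, verbose=verbose)
            ] = req.name
        # Store each result as soon as its download finishes;
        # result() re-raises any exception from the worker thread.
        for future in concurrent.futures.as_completed(futures):
            content = future.result()
            memory[futures[future]] = content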

It basically populates a dict with the downloaded content so when it starts analyzing one package at a time, the download part of that can be skipped.

By doing all the downloads first, it makes sure the atomicity and the predictability of the interactive prompt stay intact.
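
A minimal sketch of how the later one-package-at-a-time step could consult that dict before going to the network. The names lookup and fetch are hypothetical, not hashin's actual functions.

```python
def lookup(name, memory, fetch):
    """Return data for `name`, skipping the download if it was prefetched.

    `memory` is the dict populated up front; `fetch` is a hypothetical
    callable that performs the real network lookup.
    """
    if name in memory:
        return memory[name]
    data = fetch(name)
    memory[name] = data
    return data


if __name__ == "__main__":
    calls = []

    def fetch(name):
        calls.append(name)
        return {"name": name}

    memory = {"requests": {"name": "requests"}}
    lookup("requests", memory, fetch)  # already in memory, no fetch
    lookup("six", memory, fetch)       # not prefetched, falls back to fetch
    print(calls)
```

Falling back to fetch on a miss means the same code path works whether or not the prefetch step ran.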

Contributor

mythmon commented Dec 10, 2018

I think the idea of prefetching the needed requests in interactive mode makes sense. I have very little experience with the new asyncio parts of Python, but the code in your latest comment seems fine to me.

Owner Author

peterbe commented Dec 10, 2018

> the new asyncio parts of Python

I certainly have experience with it but saying I get it is like saying I get Linux.

The code I've got is not asyncio at all. Just good old regular threading.

I made it so that if you're on Python 2.7 you get the backport from pypi for it. Untested.
And I also made it so you can deliberately avoid this threading stuff if you know you really can't use it. E.g. hashin --update-all --synchronous
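
A rough sketch of how such a switch could be wired up. Only the --update-all and --synchronous flag names come from this thread; build_parser, fetch_all and fetch are hypothetical and this is not hashin's actual argument handling.

```python
import argparse
import concurrent.futures


def fetch_all(names, fetch, synchronous=False):
    """Return {name: data}, fetching serially or in a thread pool."""
    if synchronous:
        # Plain serial loop: no threads involved at all.
        return {name: fetch(name) for name in names}
    memory = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = {executor.submit(fetch, name): name for name in names}
        for future in concurrent.futures.as_completed(futures):
            memory[futures[future]] = future.result()
    return memory


def build_parser():
    """Hypothetical parser exposing just the two flags discussed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--update-all", action="store_true")
    parser.add_argument("--synchronous", action="store_true")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--update-all", "--synchronous"])
    print(fetch_all(["a", "b"], str.upper, synchronous=args.synchronous))
```

Both branches return the same mapping, so the rest of the program does not need to know which one ran.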

Owner Author

peterbe commented Dec 10, 2018

What I like about this is that it works in Python 2.7 and Python 3 without any third-party libraries (except the backport for 2.7), and it's simple. It only does the download piece, which is the only thing that can be significantly sped up because it's bound by network IO.

I tested the error handling by messing with the spelling of a line in a requirements file (e.g. requestsXXX=2.20.1) and it immediately raised a nice exception and cleaned up the other threads.
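
A small sketch of why the exception surfaces like that, assuming a pattern similar to the one above. PackageError and fake_lookup are hypothetical stand-ins, not hashin names. Calling result() on a future re-raises whatever the worker thread raised, and leaving the with block waits for threads that have already started before the exception continues upward.

```python
import concurrent.futures


class PackageError(Exception):
    """Hypothetical error for a package that cannot be found."""


def fake_lookup(name):
    """Stand-in for a PyPI lookup; fails for misspelled names ending in XXX."""
    if name.endswith("XXX"):
        raise PackageError("no such package: %s" % name)
    return {"name": name}


def prefetch(names):
    memory = {}
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = {executor.submit(fake_lookup, name): name for name in names}
        for future in concurrent.futures.as_completed(futures):
            # result() re-raises the worker's exception in the calling thread.
            memory[futures[future]] = future.result()
    return memory


if __name__ == "__main__":
    try:
        prefetch(["requests", "requestsXXX"])
    except PackageError as exc:
        print("failed:", exc)
```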

A caveat is of course that the whole work is now basically at the mercy of the slowest download since we wait for ALL downloads to complete. Also, since it's threads there is a small chance that you saturate your network but since the individual network calls are tiny I'm not sure that's even a problem.

peterbe pushed a commit that referenced this issue Dec 13, 2018
* Concurrent pypi lookups with --update-all

Fixes #101

* exception for python 3.4
3 participants
@peterbe @mythmon and others