consume packed wheel cache in zipapp creation #2175

cosmicexplorer · 2023-07-19T20:52:11Z

Closes #2158.

Problem

#1675 has another repro.

Generating a --compress --layout zipapp (the default) regularly exhibits pathologically slow performance (46s zip / 1m1s total) on many real-world inputs (especially tensorflow). This occurs even when all dists are already downloaded from pypi:

; time pex -vvvvvvvvv --layout zipapp --resolver-version=pip-2020-resolver 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3' -o test.pex
# ...
# Saving PEX file to test.pex
# pex: Zipping PEX file.: 45659.1ms
# 57.18s user 3.02s system 97% cpu 1:01.57 total
; ls -lAvFh test.pex
# -rwxrwxr-x 1 cosmicexplorer cosmicexplorer 617M Jul 26 09:17 test.pex*
; du -b test.pex
# 616754775       test.pex

Alternatives

#1705 introduced --no-compress, which drastically improves the zip performance (2.2s zip => 20.6x speedup / 19.3s total = 2.4x speedup):

; time pex -vvvvvvvvv --layout zipapp --no-compress --resolver-version=pip-2020-resolver 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3' -o test.pex
# ...
# Saving PEX file to test.pex
# pex: Zipping PEX file.: 2209.7ms
# pex  14.11s user 3.06s system 89% cpu 19.285 total

but as #1686 describes, it also introduces a significant tradeoff in file size (1.6G => 2.5x blowup):

; ls -lAvFh test.pex
# -rwxrwxr-x 1 cosmicexplorer cosmicexplorer 1.6G Jul 26 09:20 test.pex*
; du -b test.pex
# 1541737605      test.pex

The cache entries are also many times larger:

; ls $(find ~/.pex/packed_wheels -name '*tensorflow_gpu*whl')
# -rw-rw-r-- 1 cosmicexplorer cosmicexplorer 468M Aug  1 09:10 /home/cosmicexplorer/.pex/packed_wheels/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
# -rw-rw-r-- 1 cosmicexplorer cosmicexplorer 1.2G Aug  1 09:16 /home/cosmicexplorer/.pex/packed_wheels/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/un-compressed/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl

Solution

Introduce pex.ziputils.MergeableZipFile to incorporate the contents of other zip files without decompressing them.
- Introduce pex.common.copy_file_range() to copy over a limited range of the contents of another file handle.
Introduce the --cache-dists option to toggle the use (and/or creation) of the packed wheel caches for .bootstrap/ and .deps/ subdirectories.
Consume the existing --layout packed caches for --layout zipapp when --cache-dists is provided.

cosmicexplorer · 2023-07-19T21:12:01Z

Reading @jsirois's thoughts at #2158 (comment) again:

Since this would only be used by the Pex CLI (build-time), a rust implementation is actually how I'd approach this. It's the right language for the job, the problem really has nothing to do with Pex (it has much wider applicability), and there are great tools for turning a crate into an easily consumable Python platform-specific distribution.

I'll take a look at pyoxy and friends to convert the crate into a python distribution. I'll leave this as a draft until then, and I'll re-ping reviewers at that time.

cosmicexplorer · 2023-07-19T22:34:28Z

Ah, wonderful, I've just found the docs for pyo3 https://docs.rs/pyo3/latest/pyo3/.

jsirois · 2023-07-20T16:34:33Z

Ah, wonderful, I've just found the docs for pyo3 https://docs.rs/pyo3/latest/pyo3/.

Great. Yeah @cosmicexplorer that's exactly what I meant. I've been away climbing here and haven't taken a look yet, but if this all looks good, consuming as a native wheel was the idea.

cosmicexplorer · 2023-07-23T08:32:22Z

Ok, just had a lot of fun converting this into a pyo3 package and the result runs even faster! This branch isn't passing because I've added some hacks to get it running, but I will be working to remove those!

add change to do parallel zipping only, no crawling modify cli arg format for medusa-zip tool update cli arg format fix non-exhaustive CopyMode usage [BROKEN] add first run of complete medusa zip with cli arg! the resulting zip cannot be zipimported yet.... medusa zipping works great now, let's revert .zip() changes bump medusa options bump more medusa options use the merged medusa command lines now manage a cache of parallel intermediate zip generation jobs! small fix much closer to mergeable now working much more complex control flow between the medusa-zip cli move medusa zip to medusa.py medusa works for packed apps now too works for everything, but kind of crazy close stdin after writing to the child process factor out a ridiculous amount of boilerplate add back the non-medusa impl for packed layouts implement a "normal" version which uses atomic directories revert unintentional whitespace changes separate the serial and parallel pex creations remove the attempts at parallelism add --medusa-path remove unused code make the medusa hook work when not provided add back a tracer revert some changes that make things harder to read revert some changes i shouldn't need make medusa work with the medusa-zip package and not subprocesses! update after adding defaults in medusa-zip python package remove -f arg for resolving medusa-zip [BROKEN] possibly obsolete! fix cli arg add stitched layout create stitch copymode no initial stitch impl add merge_zip_file() method move packed wheel caching into common methods initial impl of merging zips with entry renaming make MergeableZipFile into a subclass of ZipFile fix header offset calculation tmp fix mypy remove unused imports fix buffering for file handles in py2 Revert "tmp" This reverts commit 8ad12de71a455c3918434cd81dcaf319fa43d9b4.

cosmicexplorer · 2023-08-04T15:15:25Z

Oops! Didn't realize that was the current behavior! However, regarding the following:

Zips are tail oriented; so having main.py in the tail is desirable.

Would it be useful to put __main__.py at the literal very end, then?

jsirois · 2023-08-04T15:17:32Z

If I'm hearing you right, it sounds like you're advocating for something a little different: the ordering of:

I'm advocating for don't muck with essentially unrelated stuff in this PR. Please keep the status quo.
That status quo, right or wrong, is this:

$ pex -V
2.1.140
$ touch foo.py bar.py
$ pex cowsay --module foo --module bar -o cowsay.pex
$ zipinfo -1 cowsay.pex
.bootstrap/
.bootstrap/pex/
...
.bootstrap/pex/version.py
.bootstrap/pex/ziputils.py
.deps/
.deps/cowsay-5.0-py2.py3-none-any.whl/
...
.deps/cowsay-5.0-py2.py3-none-any.whl/cowsay-5.0.dist-info/entry_points.txt
.deps/cowsay-5.0-py2.py3-none-any.whl/cowsay-5.0.dist-info/top_level.txt
PEX-INFO
__main__.py
__pex__/
__pex__/__init__.py
bar.py
foo.py

Your contributions are very welcome, but I'm having flashbacks of the wild sprawling nature these things tended to have and that part is very unwelcome as a reviewer, especially a plodding, single threaded reviewer like me. Please keep the exposed products of your work as focused as possible regardless of how sprawling your internal work process may be.

cosmicexplorer · 2023-08-04T15:20:37Z

Please see #2175 (comment), where you'll find I had a simple misunderstanding. I understand you are not at liberty to review in depth right now.

jsirois

Thanks Danny. This looks good to me mod some details.

pex/common.py

pex/pex_builder.py

jsirois · 2023-08-04T15:30:55Z

pex/ziputils.py

+            copy_file_range(source_zf.fp, self.fp, zinfo.compress_size)
+
+            # (4) Hack the synthesized ZipInfo onto the destination zip's infos.
+            self.filelist.append(zinfo)


So the hack tally is self.{fp,filelist,NameToInfo} afaict. Techincally "public", but certainly not documented as such. CI proves this is OK across CPython 2.7, 3.5, 3.7, 3.11 and 3.12 as well as PyPy 2.7 and 3.9. CI will continue to be extensive; so I'm not too worried in the medium term, but it does seem like this should all be behind a flag given the combination of the hackyness and the new cache burden (which you thankfully eliminated for --layout packed users like Pants, but still exists for folks who never use(d) that option). You seemed to be on board with a feature flag ealier - does that still seem reasonable?

Ideally the original tack of a library that handled the zipping, possibly including parallelization in addition to the ability here of merging would take the place of all this to both eliminate the hack factor and speed things up in the cold cache case not to mention make these features available outside Pex and outside Python.

Perhaps no flag is just fine. I'm a bit conflicted here. I still do not see the real use case for this feature, I'd personally use --layout packed for both iteration and distribution (via rsync over ssh), but Pex has never treated new caches as opt-in. Its cache behavior is currently uniformly untunable.

You seemed to be on board with a feature flag ealier - does that still seem reasonable?

Yes, absolutely! Pip still has the zip file magic I proposed to them behind --use-feature=fast-deps for similar reasons, although that's had a lot more people contribute to it since then.

the new cache burden (which is you thankfully eliminated for --layout packed users like Pants, but still exists for folks who never use(d) that option).

Yes: for this reason alone I think it should be behind a feature flag.

Ideally the original tack of a library that handled the zipping, possibly including parallelization in addition to the ability here of merging would take the place of all this to both eliminate the hack factor and speed things up in the cold cache case not to mention make these features available outside Pex and outside Python.

Well yes, that definitely hasn't evaporated; in fact, it's quite useful, and indeed takes advantage of rust's structured parallelism, and it's a CLI, a rust library, and a pyo3 python library: https://github.com/cosmicexplorer/medusa-zip. I also got the hang of using github actions and cibuildwheels in the meantime and it's now published to crates.io ({,lib,py}medusa-zip) and pypi (medusa-zip) upon every tagged release. I've been continuing to speak with @jonjohnsonjr regarding the similarity of this work to Chainguard's use case efficiently merging tarballs and/or containers (see chainguard-dev/apko#781) and considering whether we can share any code. libarchive exists, but its goal is compatibility with the widest range of formats, not performance, and it seems plausible that a general framework for efficiently merging the contents of archives might have applicability to a wide range of build tools and package managers.

outside Python.

I might propose this ZipFile#merge_archive() method to python-ideas too, although it seems possibly too specialized to be in the stdlib.

Yeah, that seems too specialized. FWICT they have been on a somewhat lazy warpath to do less in the stdlib, not more.

jsirois · 2023-08-04T15:35:11Z

Would it be useful to put main.py at the literal very end, then?

Yes, but __pex__/__init__.py also has the same role when the PEX is used as a sys.path entry; i.e.: PYTHONPATH=my.pex python -c 'from __pex__ import cowsay; ...' (this is a newish feature you may not be aware of). That said, I think some PR later should probably make the tail order exactly:

__pex__/
__pex__/__init__.py
__main__.py

The __main__.py entrypoint case will probably always be more commonly used than the __pex__ magic import hook.

jsirois · 2023-08-04T15:48:03Z

Just noticed this:

Ok. The docs are in a sorry state to be sure, but that can be remedied.

That wasn't my intention -- please don't assume antagonism.

I actually mean it. The docs are in a sorry state. They are horrible. I have not spent any time on them in the ~5 years I've been working as the primary Pex maintainer; meanwhile I have spent time adding a lot of features for Pants; so the docs are severely out of date. I do plan to circle back though. I've vetted out a whole system over at https://github.com/a-scie/lift based on Sphinx that I will be applying here along with fully re-worked content.

cosmicexplorer · 2023-08-04T20:42:46Z

That said, I think some PR later should probably make the tail order exactly:

On it! #2209

cosmicexplorer · 2023-08-09T00:51:03Z

@jsirois you already approved this, but I assumed you'd want to review again after I introduced the --cache-dists option to gate the use of the packed wheel cache (--cache-dists is default off for zipapps, but default on for packed, in order to maintain the current behavior). I have also significantly shorted the commit message, although in this case I've kept the Problem/Solution structure as I feel that describing the tradeoff vs --no-compress in the Alternatives section was worth specifically calling out.

jsirois

I'll look more closely here in an hour or so (this is from phone); then I'm AFK until August 18th.

jsirois · 2023-08-09T15:42:20Z

pex/pex_builder.py

@@ -664,17 +672,34 @@ def build(
            else:
                shutil.rmtree(tmp_pex, True)

+        if not compress:
+            if cache_dists is True:
+                pex_warnings.warn(


So, say there is some legitimate use case, aka the user knows better. You've now pissed them off by nagging them. It's much cleaner if you just fail fast if that makes sense to do. I'm not commenting on whether or not there is a legitimate use case, my intent is just to point out the cost of combinatorial feature explosion and the nanny vs. document vs. trust the user tradeoffs.

jsirois · 2023-08-09T15:46:37Z

Re the commit description, I'm still not a fan of the formula - you can say the exact same thing without the damn template. But that's just my grumpy dislike of the 5 paragraph essay. More importantly you just whiff the actual alternative, which is the packed layout. I think you should address why that doesn't work.

jsirois

I think adding the uncompressed option to the existing packed layout is a mistake. Seems good otherwise.

jsirois · 2023-08-09T17:28:32Z

pex/bin/pex.py

-            "PEX's bootstrap and dependency zip files. Does nothing for loose layout PEXes."
+            "PEX's bootstrap and dependency zip files. "
+            "Uncompressed PEX files are much faster to create from an empty cache, but are no "
+            "faster after the cache has been populated, and uncompressed cache entries will "


Reading a second time there is just no reason to cache uncompressed at all. I definitely vote for fail fast. It seems compress=False, cached_dists=True should not be allowed at all. Further, there is no reason for the current packed layout behavior to be toggleable. If you're building a packed layout, the zips should be compressed zips since the whole point is cache friendliness. If there is no call to add a degree of freedom, don't add it because, at least in Pex, you can never take it back.

jsirois · 2023-08-09T17:30:55Z

pex/common.py

            zinfo.compress_type = zipfile.ZIP_STORED
            data = b""
        else:
            zinfo.file_size = st.st_size
-            zinfo.compress_type = zipfile.ZIP_DEFLATED
+            # File contents may be compressed or decompressed. Decompressed is significantly faster


This seems like a gratuitous comment. You're selling here or else speaking to an end user. Clearly this is not the place to speak to the user and the coder doesn't need to be sold to, if a function takes a param ... it takes a param. If its a public function maybe sell in the API doc.

jsirois

As I said, I'll be offline ~presently. Hopefully this approve over-rides the prior request changes. I would like to see the change, but I trust you to do the right thing.

Thanks for putting up with my opinions here. You've started a contribution spike and I'm grateful for that, but I'm pretty open about not viewing all contribution as a good thing by default. As such I'm just trying to set up my expectations / the current Pex code base opinions clearly to help make further contribution more straightforward for us both. My time on this project is limited as I'm sure yours is too; so making things efficient seems to benefit us both.

jsirois · 2023-12-13T05:10:35Z

@cosmicexplorer although this PR is approved, another thing to weigh is #2298 which provides even greater speedups with 0 hacks by skipping zipping entirely with a bonus of skipping unzipping (pre-installing) as well.

cosmicexplorer · 2024-06-13T14:07:12Z

Hey @jsirois, thanks so much for your thoughtful review. I'll be rebasing and responding to your comments, although I understand that pex has changed a bit in the meantime. I think this is a useful change and I have the time to see it through.

cosmicexplorer · 2024-06-13T14:09:33Z

Ah, I'll read through #2298 before doing that. I know you've been pushing back on adding this PR's complexity to pex for a while, and I've since found that some of the tricks here are probably better suited for the rust zip crate. I'll read that change and re-evaluate this now. Thanks.

cosmicexplorer · 2024-06-13T14:16:29Z

Ok, it seems like this change will still provide a speedup in zipapp creation, without incurring the (very slight = O(100ms)) cold-start performance slowdown of --no-pre-install-wheels. I personally think it's still quite useful to get a build performance improvement "for free" with --layout zipapp, since I believe that's still the default and rather convenient. If you agree, I'll rebase and follow up with this until it's acceptable.

cosmicexplorer · 2024-06-13T14:28:15Z

Ok, I totally buy this comment: #2175 (comment). I think I'm willing to close this and move the zip file magic to my work on the zip crate. I was really surprised/impressed that we were able to do this, but --no-pre-install-wheels seems like the right way to do this. Thanks for talking me through this: I really appreciate your patience and I always learn from your expertise.

This was referenced Jul 19, 2023

research area: parallel zip creation #2158

Open

Perf tracking chainguard-dev/apko#781

Open

cosmicexplorer requested review from jsirois and Eric-Arellano and removed request for jsirois and Eric-Arellano July 19, 2023 21:08

cosmicexplorer mentioned this pull request Jul 24, 2023

Large amount of time spent in "zipping" when building a pex from a --pex-repository #1675

Closed

cosmicexplorer force-pushed the medusa-zip branch 13 times, most recently from e6ea967 to fb4bbae Compare July 26, 2023 10:57

cosmicexplorer marked this pull request as ready for review July 26, 2023 10:58

cosmicexplorer changed the title ~~Medusa zip~~ zip -FF merge optimization for --compress --layout zipapp Jul 26, 2023

cosmicexplorer force-pushed the medusa-zip branch 2 times, most recently from 954476e to 4d04953 Compare July 26, 2023 11:39

cosmicexplorer added 2 commits August 4, 2023 10:58

rewrite some of the cli help to reflect these changes

feff2c8

jsirois approved these changes Aug 4, 2023

View reviewed changes

put main and src at the end of the zip to match current behavior

e587fd3

cosmicexplorer force-pushed the medusa-zip branch from a3d3415 to e587fd3 Compare August 4, 2023 17:50

respond to review comments

2131a26

cosmicexplorer force-pushed the medusa-zip branch from a223a94 to 738c007 Compare August 4, 2023 20:32

cosmicexplorer mentioned this pull request Aug 4, 2023

colocate __main__.py with the central directory record of the zipapp #2209

Open

cosmicexplorer force-pushed the medusa-zip branch from 738c007 to 713d0ed Compare August 4, 2023 21:07

introduce --{,no-}cache-dists flag to gate use of the packed cache

31592ec

cosmicexplorer force-pushed the medusa-zip branch from 713d0ed to 31592ec Compare August 5, 2023 02:35

jsirois reviewed Aug 9, 2023

View reviewed changes

jsirois requested changes Aug 9, 2023

View reviewed changes

jsirois approved these changes Aug 9, 2023

View reviewed changes

cosmicexplorer mentioned this pull request Jun 13, 2024

feat: prototype async API, with demonstrable perf improvements via benchmark zip-rs/zip2#73

Closed

cosmicexplorer closed this Jun 13, 2024

cosmicexplorer mentioned this pull request Jun 13, 2024

parallel operations workstream zip-rs/zip2#193

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

consume packed wheel cache in zipapp creation #2175

consume packed wheel cache in zipapp creation #2175

cosmicexplorer commented Jul 19, 2023 •

edited

Loading

cosmicexplorer commented Jul 19, 2023 •

edited

Loading

cosmicexplorer commented Jul 19, 2023

jsirois commented Jul 20, 2023

cosmicexplorer commented Jul 23, 2023

cosmicexplorer commented Aug 4, 2023

jsirois commented Aug 4, 2023

cosmicexplorer commented Aug 4, 2023 •

edited

Loading

jsirois left a comment

jsirois Aug 4, 2023 •

edited

Loading

jsirois Aug 4, 2023

cosmicexplorer Aug 4, 2023

cosmicexplorer Aug 4, 2023

jsirois Aug 4, 2023

jsirois commented Aug 4, 2023

jsirois commented Aug 4, 2023 •

edited

Loading

cosmicexplorer commented Aug 4, 2023

cosmicexplorer commented Aug 9, 2023

jsirois left a comment

jsirois Aug 9, 2023

jsirois commented Aug 9, 2023

jsirois left a comment

jsirois Aug 9, 2023 •

edited

Loading

jsirois Aug 9, 2023

jsirois left a comment

jsirois commented Dec 13, 2023

cosmicexplorer commented Jun 13, 2024

cosmicexplorer commented Jun 13, 2024

cosmicexplorer commented Jun 13, 2024

cosmicexplorer commented Jun 13, 2024

consume packed wheel cache in zipapp creation #2175

consume packed wheel cache in zipapp creation #2175

Conversation

cosmicexplorer commented Jul 19, 2023 • edited Loading

Problem

Alternatives

Solution

cosmicexplorer commented Jul 19, 2023 • edited Loading

cosmicexplorer commented Jul 19, 2023

jsirois commented Jul 20, 2023

cosmicexplorer commented Jul 23, 2023

cosmicexplorer commented Aug 4, 2023

jsirois commented Aug 4, 2023

cosmicexplorer commented Aug 4, 2023 • edited Loading

jsirois left a comment

Choose a reason for hiding this comment

jsirois Aug 4, 2023 • edited Loading

Choose a reason for hiding this comment

jsirois Aug 4, 2023

Choose a reason for hiding this comment

cosmicexplorer Aug 4, 2023

Choose a reason for hiding this comment

cosmicexplorer Aug 4, 2023

Choose a reason for hiding this comment

jsirois Aug 4, 2023

Choose a reason for hiding this comment

jsirois commented Aug 4, 2023

jsirois commented Aug 4, 2023 • edited Loading

cosmicexplorer commented Aug 4, 2023

cosmicexplorer commented Aug 9, 2023

jsirois left a comment

Choose a reason for hiding this comment

jsirois Aug 9, 2023

Choose a reason for hiding this comment

jsirois commented Aug 9, 2023

jsirois left a comment

Choose a reason for hiding this comment

jsirois Aug 9, 2023 • edited Loading

Choose a reason for hiding this comment

jsirois Aug 9, 2023

Choose a reason for hiding this comment

jsirois left a comment

Choose a reason for hiding this comment

jsirois commented Dec 13, 2023

cosmicexplorer commented Jun 13, 2024

cosmicexplorer commented Jun 13, 2024

cosmicexplorer commented Jun 13, 2024

cosmicexplorer commented Jun 13, 2024

cosmicexplorer commented Jul 19, 2023 •

edited

Loading

cosmicexplorer commented Jul 19, 2023 •

edited

Loading

cosmicexplorer commented Aug 4, 2023 •

edited

Loading

jsirois Aug 4, 2023 •

edited

Loading

jsirois commented Aug 4, 2023 •

edited

Loading

jsirois Aug 9, 2023 •

edited

Loading