pep440: some minor refactoring, mostly around error types #780

BurntSushi · 2024-01-04T14:56:57Z

This PR does a bit of refactoring to the pep440 crate, and in
particular around the erorr types. This PR is meant to be a precursor
to another PR that does some surgery (both in parsing and in Version
representation) that benefits somewhat from this refactoring.

As usual, please review commit-by-commit.

crates/pep440-rs/src/version.rs

crates/pep440-rs/src/version_specifier.rs

I'll like expand this comment a little bit in subsequent commits, but I felt like this was useful to put in the beginning of this patch series to "set the stage." That is, this data will motivate some of the optimizations we'll do for version parsing, version representation and comparisons.

This moves LocalSegment and PreRelease after Version, and also keeps their impl blocks grouped together. No code is changed here.

It turns out that the current parser permits non-ASCII whitespace in places. We can be sneaky here to cause the existing implementation to produce incorrect offsets. The core issue here is that `unicode-width` is used to compute codepoint offsets, but its actual purpose is to compute the *visual width* of a codepoint that has been rendered. Some codepoints use more than 1 unit of visual width. While this establishes a mismatch between the implementation and the documented behavior of `Pep440Error`, we will also address this by switching to byte offsets instead of codepoint offsets. (Codepoint offsets are almost never what you want.)

We still use unicode-width, but only when rendering the error message. We also add the error message itself to the Display impl.

This error type was only be used when one attempted to parse a string as a `VersionSpecifiers`. We'll want to introduce more structured error types for parsing `Version` and `VersionSpecifier` as well, so renaming the error type helps make room for that. We also make the error type opaque. Nothing (in puffin at least) seemed to need its internals. We can always add accessor methods in the future if something else needs it. It's overall rare to need to expose the entire internal representation of an error type.

Since it's specifier to version specifier parsing, it makes sense to have it live with the corresponding type. Putting it in lib.rs I think makes it seem like an error type that applies to other things as well.

This smoothes out the error handling to using something a bit more structured. I did this because I intend to do the same for Version (eventually), and it seems good to be consistent. It also lets us nest errors a bit more easily and scrutinize the different error classes at a glance.

This makes the top-level error type small, just as the others we added in the previous commit are. This was specifically flagged by Clippy.

Prior to the refactoring in previous commits, tests were generally written against the rendered error message instead of the structured message. This was in part because many of the error types themselves were just a `String`, but also to explicitly test what an end user actually sees. That's valuable. While we keep most of the tests as rewritten to target the new structured error representation, we add some new tests that captures the value of testing the messages than humans will actually see.

This PR builds on #780 by making both version parsing faster, and perhaps more importantly, making version comparisons much faster. Overall, these changes result in a considerable improvement for the `boto3.in` workload. Here's the status quo: ``` $ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in Resolved 31 packages in 34.56s real 34.579 user 34.004 sys 0.413 maxmem 2867 MB faults 0 ``` And now with this PR: ``` $ time puffin pip-compile --no-build --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/requirements/boto3.in Resolved 31 packages in 9.20s real 9.218 user 8.919 sys 0.165 maxmem 463 MB faults 0 ``` This particular workload gets stuck in pubgrub doing resolution, and thus benefits mightily from a faster `Version::cmp` routine. With that said, this change does also help a fair bit with "normal" runs: ``` $ hyperfine -w10 \ "puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in" \ "puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in" Benchmark 1: puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in Time (mean ± σ): 337.5 ms ± 3.9 ms [User: 310.5 ms, System: 73.2 ms] Range (min … max): 333.6 ms … 343.4 ms 10 runs Benchmark 2: puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in Time (mean ± σ): 189.8 ms ± 3.0 ms [User: 168.1 ms, System: 78.4 ms] Range (min … max): 185.0 ms … 196.2 ms 15 runs Summary puffin-cmparc pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in ran 1.78 ± 0.03 times faster than puffin-base pip-compile --cache-dir ~/astral/tmp/cache/ -o /dev/null ./scripts/benchmarks/requirements.in ``` There is perhaps some future work here (detailed in the commit messages), but I suspect it would be more fruitful to explore ways of making resolution itself and/or deserialization faster. Fixes #373, Closes #396

BurntSushi requested review from charliermarsh and konstin and removed request for charliermarsh January 4, 2024 14:57

konstin approved these changes Jan 4, 2024

View reviewed changes

BurntSushi force-pushed the ag/version-refactor branch from 276df67 to 9d982fa Compare January 4, 2024 16:44

BurntSushi added 9 commits January 4, 2024 12:04

pep440: use public methods in PyVersion wrapper

1615a83

pep440: move around move code

b0fbe40

This moves LocalSegment and PreRelease after Version, and also keeps their impl blocks grouped together. No code is changed here.

pep440: switch error to use byte offsets

33450a8

We still use unicode-width, but only when rendering the error message. We also add the error message itself to the Display impl.

pep440: move VersionSpecifiersError to version_specifiers module

872865b

Since it's specifier to version specifier parsing, it makes sense to have it live with the corresponding type. Putting it in lib.rs I think makes it seem like an error type that applies to other things as well.

pep440: add more boxing

0fbf72d

This makes the top-level error type small, just as the others we added in the previous commit are. This was specifically flagged by Clippy.

BurntSushi force-pushed the ag/version-refactor branch from 9d982fa to f619012 Compare January 4, 2024 17:06

BurntSushi force-pushed the ag/version-refactor branch from f619012 to 63da3d1 Compare January 4, 2024 17:19

BurntSushi merged commit d7c9b15 into main Jan 4, 2024

BurntSushi deleted the ag/version-refactor branch January 4, 2024 17:28

BurntSushi mentioned this pull request Jan 4, 2024

pep440: rewrite the parser and make version comparisons cheaper #789

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pep440: some minor refactoring, mostly around error types #780

pep440: some minor refactoring, mostly around error types #780

BurntSushi commented Jan 4, 2024

pep440: some minor refactoring, mostly around error types #780

pep440: some minor refactoring, mostly around error types #780

Conversation

BurntSushi commented Jan 4, 2024