revert fallback memchr optimization #153
Merged
Conversation
Unbelievably, some of the steps in the main CI configuration were using the hard-coded `cargo` instead of `${{ env.CARGO }}`. The latter interpolates to `cross` for targets requiring cross compilation. And since all big endian targets are only tested via cross compilation, our CI was not testing big endian at all (beyond testing that compilation succeeded). This led to a [performance optimization] [breaking big endian] targets.

[performance optimization]: #151
[breaking big endian]: #152
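To make the misconfiguration concrete, here is a hedged sketch of the kind of workflow step involved; the step names and flags are hypothetical, not taken from this repository's actual workflow, while `env.CARGO` is the variable the commit refers to:

```yaml
# Hypothetical GitHub Actions steps (not this repository's actual
# workflow). A setup step is assumed to have set CARGO=cross for
# targets that require cross compilation.

# Broken: always invokes the host `cargo`, so cross-compiled targets
# build but never actually run their tests.
- name: Run tests (hard-coded)
  run: cargo test --verbose

# Fixed: interpolates to `cross` when the target requires it.
- name: Run tests
  run: ${{ env.CARGO }} test --verbose
```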
This was reported along with the fact that memchr 2.7.3 broke big endian targets. While I at first thought the test suite had somehow missed the regression, and thus that this test was novel in some interesting way, it turned out to be just a CI configuration failure. The existing tests catch the regression easily (dozens of test failures). The problem, as mentioned in the previous commit, was that tests in CI weren't using `cross` when cross compilation was enabled. So while this test is seemingly redundant, I still think it's good form to add it.

Ref #152
It turns out that the previous regression test wasn't failing for me on the `powerpc64-unknown-linux-gnu` target. But quickcheck found a case that was, so I added a second regression test specifically for that case.
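A hedged sketch of what such a test can look like, using the crate's public API; the haystack here is illustrative, not the exact case quickcheck found:

```rust
// Hypothetical regression test sketch. Placing the needle past the
// first word-sized chunk exercises the `usize`-chunk scanning in the
// generic fallback, which is the code path that broke on big endian.
#[test]
fn regression_big_endian_chunk_scan() {
    let haystack = b"aaaaaaaaaaaaaaaaz"; // 17 bytes, 'z' at index 16
    assert_eq!(memchr::memchr(b'z', haystack), Some(16));
    assert_eq!(memchr::memrchr(b'z', haystack), Some(16));
}
```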
This PR is on crates.io.
stepancheg added a commit to stepancheg/memchr that referenced this pull request on Jun 14, 2024:
Resubmit of PR BurntSushi#151

That PR was reverted because it broke the big endian implementation and CI did not catch it (see the revert PR BurntSushi#153 for details). Andrew, thank you for the new test cases, which made it easier to fix the issue. The fix is:

```
--- a/src/arch/all/memchr.rs
+++ b/src/arch/all/memchr.rs
@@ -1019,7 +1019,7 @@ fn find_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         lowest_zero_byte(x)
     } else {
-        highest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
     }
 }
@@ -1028,7 +1028,7 @@ fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         highest_zero_byte(x)
     } else {
-        lowest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
     }
 }
```

Original description:

The current generic ("all") implementation checks whether a chunk (`usize`) contains a zero byte, and if it does, iterates over the bytes of the chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find the index without loops.

Context: we use `memchr`, but many of our strings are short. Currently, the SIMD-optimized `memchr` processes bytes one by one when the string is shorter than a SIMD register. I suspect it can be made faster if we take a `usize`-sized chunk that does not fit into a SIMD register and process it with such a utility, similarly to how the AVX2 implementation falls back to SSE2. So I looked at the generic implementation hoping to reuse it in the SIMD-optimized version, but there was none. So here it is.
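For readers who want the underlying arithmetic spelled out: the generic implementation turns matching bytes into zero bytes (by XORing the splatted needle into each chunk) and then locates a zero byte with SWAR bit tricks. The sketch below is my own illustration of that technique, not the crate's exact code (the constant and function bodies here are made up); it uses a carry-free zero-byte mask, which is what makes both the forward and reverse directions exact, and shows where the endianness index flip comes from:

```rust
// 0x7F in every byte, e.g. 0x7F7F_7F7F_7F7F_7F7F on a 64-bit target.
const LO7: usize = (usize::MAX / 0xFF) * 0x7F;

/// Returns a word with 0x80 in every byte of `x` that was zero, and
/// 0x00 elsewhere. `(x & LO7) + LO7` cannot carry across byte
/// boundaries (each per-byte sum is at most 0xFE), so the mask has no
/// false positives and can be read from either end of the word.
fn zero_byte_mask(x: usize) -> usize {
    !(((x & LO7) + LO7) | x | LO7)
}

/// Memory-order index of the first zero byte in the chunk, if any.
fn find_zero_in_chunk(x: usize) -> Option<usize> {
    let mask = zero_byte_mask(x);
    if mask == 0 {
        return None;
    }
    Some(if cfg!(target_endian = "little") {
        // Little endian: the first byte in memory is the least
        // significant byte of the word.
        mask.trailing_zeros() as usize / 8
    } else {
        // Big endian: the first byte in memory is the most significant
        // byte, hence the `USIZE_BYTES - 1 - ...` flip in the fix above.
        mask.leading_zeros() as usize / 8
    })
}
```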
Resubmitted as #154.
It turns out that #151 broke big endian targets and CI didn't catch it because of a misconfiguration. (Namely, `cross` wasn't being used to run tests even when it was being used to do builds.)

I spent about 15 minutes trying to fix this myself, but it quickly became clear I'd need to spend more time writing out the bit arithmetic to get it right.
cc @stepancheg - I'd be happy to take a new PR that fixes the optimization for big endian targets.
Fixes #152
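As a closing note on why the bit arithmetic is easy to get wrong: the same bytes of memory land in opposite ends of a word depending on endianness. A standalone illustration (assuming a 64-bit target for `u64`; none of this is code from the PR):

```rust
// The same 8 bytes of memory viewed as one word: byte 0 in memory is
// the least significant byte on little endian but the most significant
// byte on big endian, so any "byte index inside a word" computed with
// bit tricks must be flipped on one of the two.
fn main() {
    let bytes = [1u8, 2, 3, 4, 5, 6, 7, 8];
    let word = u64::from_ne_bytes(bytes);
    if cfg!(target_endian = "little") {
        assert_eq!(word & 0xFF, 1); // memory byte 0 is the low byte
    } else {
        assert_eq!(word >> 56, 1); // memory byte 0 is the high byte
    }
}
```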