revert fallback memchr optimization #153
Merged
Conversation
Unbelievably, some of the steps in the main CI configuration were using the hard-coded `cargo` instead of `${{ env.CARGO }}`. The latter interpolates to `cross` for targets requiring cross compilation. And since all big endian targets are only tested via cross compilation, our CI was not testing big endian at all (beyond testing that compilation succeeded). This led to a [performance optimization] [breaking big endian] targets.

[performance optimization]: #151
[breaking big endian]: #152
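To make the misconfiguration concrete, here is a hedged sketch of the kind of workflow step involved; the step names and flags are hypothetical, not taken from this repository's actual workflow, while `env.CARGO` is the variable the commit refers to:

```yaml
# Hypothetical GitHub Actions steps (not this repository's actual
# workflow). A setup step is assumed to have set CARGO=cross for
# targets that require cross compilation.

# Broken: always invokes the host `cargo`, so cross-compiled targets
# build but never actually run their tests.
- name: Run tests (hard-coded)
  run: cargo test --verbose

# Fixed: interpolates to `cross` when the target requires it.
- name: Run tests
  run: ${{ env.CARGO }} test --verbose
```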
This was reported along with the fact that memchr 2.7.3 broke big endian targets. While I at first thought the test suite had somehow missed the regression, and thus that this test was novel in some interesting way, it turned out to be just a CI configuration failure. The existing tests catch the regression easily (dozens of test failures). The problem, as mentioned in the previous commit, was that tests in CI weren't using `cross` when cross compilation was enabled. So while this test is seemingly redundant, I still think it's good form to add it.

Ref #152
It turns out that the previous regression test wasn't failing for me on the `powerpc64-unknown-linux-gnu` target. But quickcheck found a case that was, so I added a second regression test specifically for that case.
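A hedged sketch of what such a test can look like, using the crate's public API; the haystack here is illustrative, not the exact case quickcheck found:

```rust
// Hypothetical regression test sketch. Placing the needle past the
// first word-sized chunk exercises the `usize`-chunk scanning in the
// generic fallback, which is the code path that broke on big endian.
#[test]
fn regression_big_endian_chunk_scan() {
    let haystack = b"aaaaaaaaaaaaaaaaz"; // 17 bytes, 'z' at index 16
    assert_eq!(memchr::memchr(b'z', haystack), Some(16));
    assert_eq!(memchr::memrchr(b'z', haystack), Some(16));
}
```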
This PR is on crates.io.
stepancheg added a commit to stepancheg/memchr that referenced this pull request on Jun 14, 2024:
Resubmit of PR BurntSushi#151

That PR was reverted because it broke the big endian implementation and CI did not catch it (see the revert PR BurntSushi#153 for details). Andrew, thank you for the new test cases, which made it easier to fix the issue. The fix is:

```
--- a/src/arch/all/memchr.rs
+++ b/src/arch/all/memchr.rs
@@ -1019,7 +1019,7 @@ fn find_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         lowest_zero_byte(x)
     } else {
-        highest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
     }
 }
@@ -1028,7 +1028,7 @@ fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         highest_zero_byte(x)
     } else {
-        lowest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
     }
 }
```

Original description:

The current generic ("all") implementation checks whether a chunk (`usize`) contains a zero byte, and if it does, iterates over the bytes of the chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find the index without loops.

Context: we use `memchr`, but many of our strings are short. Currently, the SIMD-optimized `memchr` processes bytes one by one when the string is shorter than a SIMD register. I suspect it can be made faster if we take a `usize`-sized chunk that does not fit into a SIMD register and process it with such a utility, similarly to how the AVX2 implementation falls back to SSE2. So I looked at the generic implementation hoping to reuse it in the SIMD-optimized version, but there was none. So here it is.
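For readers who want the underlying arithmetic spelled out: the generic implementation turns matching bytes into zero bytes (by XORing the splatted needle into each chunk) and then locates a zero byte with SWAR bit tricks. The sketch below is my own illustration of that technique, not the crate's exact code (the constant and function bodies here are made up); it uses a carry-free zero-byte mask, which is what makes both the forward and reverse directions exact, and shows where the endianness index flip comes from:

```rust
// 0x7F in every byte, e.g. 0x7F7F_7F7F_7F7F_7F7F on a 64-bit target.
const LO7: usize = (usize::MAX / 0xFF) * 0x7F;

/// Returns a word with 0x80 in every byte of `x` that was zero, and
/// 0x00 elsewhere. `(x & LO7) + LO7` cannot carry across byte
/// boundaries (each per-byte sum is at most 0xFE), so the mask has no
/// false positives and can be read from either end of the word.
fn zero_byte_mask(x: usize) -> usize {
    !(((x & LO7) + LO7) | x | LO7)
}

/// Memory-order index of the first zero byte in the chunk, if any.
fn find_zero_in_chunk(x: usize) -> Option<usize> {
    let mask = zero_byte_mask(x);
    if mask == 0 {
        return None;
    }
    Some(if cfg!(target_endian = "little") {
        // Little endian: the first byte in memory is the least
        // significant byte of the word.
        mask.trailing_zeros() as usize / 8
    } else {
        // Big endian: the first byte in memory is the most significant
        // byte, hence the `USIZE_BYTES - 1 - ...` flip in the fix above.
        mask.leading_zeros() as usize / 8
    })
}
```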
Resubmitted as #154.
It turns out that #151 broke big endian targets and CI didn't catch it because of a misconfiguration. (Namely, `cross` wasn't being used to run tests even when it was being used to do builds.)

I spent about 15 minutes trying to fix this myself, but it quickly became clear I'd need to spend more time writing out the bit arithmetic to get it right.
cc @stepancheg - I'd be happy to take a new PR that fixes the optimization for big endian targets.
Fixes #152
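As a closing note on why the bit arithmetic is easy to get wrong: the same bytes of memory land in opposite ends of a word depending on endianness. A standalone illustration (assuming a 64-bit target for `u64`; none of this is code from the PR):

```rust
// The same 8 bytes of memory viewed as one word: byte 0 in memory is
// the least significant byte on little endian but the most significant
// byte on big endian, so any "byte index inside a word" computed with
// bit tricks must be flipped on one of the two.
fn main() {
    let bytes = [1u8, 2, 3, 4, 5, 6, 7, 8];
    let word = u64::from_ne_bytes(bytes);
    if cfg!(target_endian = "little") {
        assert_eq!(word & 0xFF, 1); // memory byte 0 is the low byte
    } else {
        assert_eq!(word >> 56, 1); // memory byte 0 is the high byte
    }
}
```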