
revert fallback memchr optimization #153

Merged: 5 commits merged into master from ag/fix-big-endian on Jun 14, 2024
Conversation

BurntSushi (Owner)

It turns out that #151 broke big endian targets and CI didn't catch it
because of a misconfiguration. (Namely, cross wasn't being used to
run tests even when it was being used to do builds.)

I spent about 15 minutes trying to fix this myself, but it quickly
became clear I'd need to spend more time writing out the bit arithmetic
to get it right.

cc @stepancheg - I'd be happy to take a new PR that fixes the
optimization for big endian targets.

Fixes #152

Unbelievably, some of the steps in the main CI configuration were using
the hard-coded `cargo` instead of `${{ env.CARGO }}`. The latter will
interpolate to `cross` for targets requiring cross compilation. And
since all big endian targets are only tested via cross compilation, our
CI was not testing big endian at all (beyond testing that compilation
succeeded).
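
In a GitHub Actions workflow, the pattern being described looks roughly like this (an illustrative sketch; the step names and matrix setup are assumptions, not the repository's actual configuration):

```
env:
  # Default to plain cargo; overridden to `cross` when the matrix
  # target needs cross compilation.
  CARGO: cargo

jobs:
  test:
    steps:
      - name: Switch to cross for cross-compiled targets
        if: matrix.target != ''
        run: echo "CARGO=cross" >> $GITHUB_ENV
      - name: Run tests
        # The bug: writing the hard-coded `cargo test` here instead of
        # `${{ env.CARGO }} test` means cross is never used to *run*
        # tests, only to build them.
        run: ${{ env.CARGO }} test --target ${{ matrix.target }}
```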

This led to a [performance optimization] [breaking big endian] targets.

[performance optimization]: #151
[breaking big endian]: #152

This was reported along with the fact that memchr 2.7.3 broke big endian
targets. While I at first thought the test suite just somehow missed the
regression and thus this test was somehow novel in some interesting way,
it turned out it was just a CI configuration failure. The existing tests
catch the regression easily (dozens of test failures). The problem, as
mentioned in the previous commit, was that tests in CI weren't using
`cross` when cross compilation was enabled.

So while this test is seemingly redundant, I still think it's good form
to add it.

Ref #152

It turns out that the previous regression test wasn't failing for me on
the `powerpc64-unknown-linux-gnu` target. But quickcheck found a case
that was, so I added a second regression test specifically for that
case.

This reverts #151 because it broke big endian targets. CI didn't catch
it because of a misconfiguration that resulted in tests being skipped
for any target that required cross compilation.

This reverts commit 345fab7.

Fixes #152
BurntSushi (Owner, Author)

It looks like the buggy CI configuration was introduced here (which is also when I introduced the s390x and powerpc64 targets): #129

I'm pretty sure I was running the big endian tests locally as part of working on #129, which is why we never had any surprising test failures.

@BurntSushi BurntSushi merged commit b0af902 into master Jun 14, 2024
17 checks passed
@BurntSushi BurntSushi deleted the ag/fix-big-endian branch June 14, 2024 00:17
BurntSushi (Owner, Author)

This PR is on crates.io in memchr 2.7.4.

stepancheg added a commit to stepancheg/memchr that referenced this pull request Jun 14, 2024
Resubmit of PR BurntSushi#151

That PR was reverted because it broke the big endian implementation and
CI did not catch it (see the revert PR BurntSushi#153 for details).

Andrew, thank you for the new test cases, which made it easier to fix
the issue.

The fix is:

```
--- a/src/arch/all/memchr.rs
+++ b/src/arch/all/memchr.rs
@@ -1019,7 +1019,7 @@ fn find_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         lowest_zero_byte(x)
     } else {
-        highest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
     }
 }

@@ -1028,7 +1028,7 @@ fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         highest_zero_byte(x)
     } else {
-        lowest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
     }
 }
```
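
A standalone sketch of the endianness bookkeeping behind this one-liner (illustrative names, not the crate's internals): the zero-byte helpers report an index counted from the least significant end of the `usize`, while memchr needs a byte offset in memory.

```
const N: usize = core::mem::size_of::<usize>();

/// Translate a value-byte index (counted from the least significant
/// end) back to a memory offset on a big endian target, as in the fix.
fn value_index_to_memory_offset_be(i: usize) -> usize {
    N - 1 - i
}

fn main() {
    // Bytes as laid out in memory, with a zero at memory offset 2.
    let mut bytes = [0xFFu8; N];
    bytes[2] = 0x00;

    // Little endian load: memory offset i is value-byte i, so no
    // translation is needed.
    let le = usize::from_le_bytes(bytes);
    assert_eq!((le >> (2 * 8)) & 0xFF, 0);

    // Big endian load: the zero now sits at value-byte N - 1 - 2, so
    // the value-byte index must be translated back to a memory offset.
    let be = usize::from_be_bytes(bytes);
    let value_index = N - 1 - 2;
    assert_eq!((be >> (value_index * 8)) & 0xFF, 0);
    assert_eq!(value_index_to_memory_offset_be(value_index), 2);
}
```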

Original description:

The current generic ("all") implementation checks whether a chunk
(`usize`) contains a zero byte, and if it does, iterates over the bytes
of the chunk to find the index of the zero byte. Instead, we can use a
few more bit operations to find the index without loops.
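
The loop-free test referred to here is presumably the classic SWAR ("SIMD within a register") zero-byte trick; a self-contained sketch (the function name and exact formula are illustrative, not necessarily the crate's code):

```
const LO: usize = usize::MAX / 0xFF; // 0x0101...01: low bit of each byte
const HI: usize = LO << 7;           // 0x8080...80: high bit of each byte

/// Index, counted from the least significant end, of the lowest zero
/// byte in `x`, without looping over the bytes.
fn lowest_zero_byte(x: usize) -> Option<usize> {
    // A byte's high bit is set in `mask` if that byte of `x` is zero;
    // borrow propagation can create false positives, but only *above*
    // the lowest zero byte, so `trailing_zeros` still finds it exactly.
    let mask = x.wrapping_sub(LO) & !x & HI;
    if mask == 0 {
        None
    } else {
        Some(mask.trailing_zeros() as usize / 8)
    }
}

fn main() {
    // 0x1100_2244 has its lowest zero byte at value-byte index 2.
    assert_eq!(lowest_zero_byte(0x1100_2244), Some(2));
    assert_eq!(lowest_zero_byte(usize::MAX), None);
    assert_eq!(lowest_zero_byte(0), Some(0));
}
```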

Context: we use `memchr`, but many of our strings are short. Currently,
the SIMD-optimized `memchr` processes bytes one by one when the string
is shorter than a SIMD register. I suspect it can be made faster by
processing, `usize` bytes at a time, a chunk that does not fit into a
SIMD register, similarly to how the AVX2 implementation falls back to
SSE2. So I looked for a generic implementation to reuse in the
SIMD-optimized version, but there was none. So here it is.
stepancheg added two further commits to stepancheg/memchr that referenced this pull request Jun 14, 2024, with the same message.
stepancheg (Contributor)

Resubmitted as #154.

Linked issue: v2.7.3 breaks big endian targets