all: improve perf of memchr fallback (v2) #154

Open · wants to merge 1 commit into master
Conversation

stepancheg (Contributor)
Resubmit of PR #151.

That PR was reverted because it broke the big-endian implementation and CI did not catch it (see the revert PR #153 for details).

Andrew, thank you for the new test cases, which made it easy to fix the issue.

The fix is:

```
--- a/src/arch/all/memchr.rs
+++ b/src/arch/all/memchr.rs
@@ -1019,7 +1019,7 @@ fn find_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         lowest_zero_byte(x)
     } else {
-        highest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
     }
 }

@@ -1028,7 +1028,7 @@ fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         highest_zero_byte(x)
     } else {
-        lowest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
     }
 }
```
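To see why the mirroring is needed: a forward search wants the zero byte at the smallest *memory* index. On little endian, memory byte 0 of the chunk is the least significant byte, so an index counted from the least significant byte is already the memory index; on big endian, memory byte 0 is the *most* significant byte, so the bit-derived index must be flipped. A minimal sketch of the whole mechanism (the helper names come from the PR, but these bodies are a generic SWAR illustration, not necessarily the crate's exact code):

```rust
const USIZE_BYTES: usize = core::mem::size_of::<usize>();

// High bit of each byte of the result is set iff that byte of `x` is zero.
// `(x & 0x7f..7f) + 0x7f..7f` sets a byte's high bit iff its low 7 bits are
// nonzero (no cross-byte carry is possible); OR-ing in `x` also covers bytes
// whose own high bit is set. The complement then flags exactly the zero bytes.
fn zero_byte_mask(x: usize) -> usize {
    const LO7: usize = usize::MAX / 255 * 0x7f; // 0x7f repeated in every byte
    !(((x & LO7) + LO7) | x | LO7)
}

// Index of the lowest zero byte, counting from the least significant byte.
fn lowest_zero_byte(x: usize) -> Option<usize> {
    let m = zero_byte_mask(x);
    if m == 0 { None } else { Some(m.trailing_zeros() as usize / 8) }
}

// Index of the highest zero byte, counting from the least significant byte.
fn highest_zero_byte(x: usize) -> Option<usize> {
    let m = zero_byte_mask(x);
    if m == 0 {
        None
    } else {
        Some((usize::BITS as usize - 1 - m.leading_zeros() as usize) / 8)
    }
}

// On big endian, the zero byte at the smallest memory index is the *highest*
// byte counted from the least significant end, and its memory index is the
// mirror image -- exactly the correction the PR applies.
fn find_zero_in_chunk(x: usize) -> Option<usize> {
    if cfg!(target_endian = "little") {
        lowest_zero_byte(x)
    } else {
        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
    }
}

fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
    if cfg!(target_endian = "little") {
        highest_zero_byte(x)
    } else {
        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
    }
}
```

With the chunk loaded via `usize::from_ne_bytes`, `find_zero_in_chunk` returns the memory-order index on either endianness, which is what the reverted version got wrong on big endian.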

Original description:

The current generic ("all") implementation checks whether a chunk (`usize`) contains a zero byte, and if it does, iterates over the bytes of the chunk to find the index of the zero byte. Instead, we can use a few more bit operations to find that index without a loop.

Context: we use `memchr`, but many of our strings are short. Currently, the SIMD-optimized `memchr` processes bytes one by one when the string is shorter than a SIMD register. I suspect it can be made faster if we take a `usize`-sized chunk that does not fit into a SIMD register and process it with such a utility, similar to how the AVX2 implementation falls back to SSE2. So I looked for a generic implementation to reuse in the SIMD-optimized version, but there was none. So here it is.
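Concretely, such a utility turns a byte search into a zero-byte search: XOR the chunk with the needle broadcast into every byte, then locate the resulting zero byte with bit operations instead of a per-byte loop. A rough sketch of a `usize`-chunk forward search in this style (`memchr_fallback` is an illustrative name, not the crate's API, and the mask trick here is the standard exact SWAR formulation, not necessarily the PR's exact code):

```rust
const USIZE_BYTES: usize = core::mem::size_of::<usize>();

// High bit of each byte of the result is set iff that byte of `x` is zero.
fn zero_byte_mask(x: usize) -> usize {
    const LO7: usize = usize::MAX / 255 * 0x7f; // 0x7f in every byte
    !(((x & LO7) + LO7) | x | LO7)
}

// Forward search for `needle`, one usize chunk at a time, finishing the
// short tail byte-wise.
fn memchr_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    // `needle` copied into every byte of a usize.
    let broadcast = usize::MAX / 255 * needle as usize;
    let mut chunks = haystack.chunks_exact(USIZE_BYTES);
    let mut offset = 0;
    for chunk in &mut chunks {
        let mut buf = [0u8; USIZE_BYTES];
        buf.copy_from_slice(chunk);
        let x = usize::from_ne_bytes(buf);
        // A byte of `x ^ broadcast` is zero exactly where `x` equals `needle`.
        let m = zero_byte_mask(x ^ broadcast);
        if m != 0 {
            // Map the bit position back to a memory index (endian-aware):
            // lowest set byte on little endian, highest on big endian.
            let idx = if cfg!(target_endian = "little") {
                m.trailing_zeros() as usize / 8
            } else {
                m.leading_zeros() as usize / 8
            };
            return Some(offset + idx);
        }
        offset += USIZE_BYTES;
    }
    chunks
        .remainder()
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}
```

The inner loop does a handful of ALU operations per `usize` of input instead of one branch per byte, which is the speedup the description is after for strings shorter than a SIMD register.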
