all: improve perf of memchr fallback (v2)
Resubmit of PR BurntSushi#151. That PR was reverted because it broke the big-endian implementation and CI did not catch it (see the revert PR BurntSushi#153 for details). Andrew, thank you for the new test cases, which made it easy to fix the issue. The fix is:

```diff
--- a/src/arch/all/memchr.rs
+++ b/src/arch/all/memchr.rs
@@ -1019,7 +1019,7 @@ fn find_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         lowest_zero_byte(x)
     } else {
-        highest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
     }
 }

@@ -1028,7 +1028,7 @@ fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
     if cfg!(target_endian = "little") {
         highest_zero_byte(x)
     } else {
-        lowest_zero_byte(x)
+        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
     }
 }
```

Original description:

The current generic ("all") implementation checks whether a chunk (`usize`) contains a zero byte, and if it does, iterates over the bytes of the chunk to find the index of that zero byte. Instead, we can use a few more bit operations to find the index without a loop.

Context: we use `memchr`, but many of our strings are short. Currently the SIMD-optimized `memchr` processes bytes one by one when the string is shorter than a SIMD register. I suspect it can be made faster if we process the tail that does not fit into a SIMD register in `usize`-sized chunks with such a utility, similarly to how the AVX2 implementation falls back to SSE2. So I looked for a generic implementation to reuse in the SIMD-optimized version, but there was none. Here it is.
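The idea can be sketched as follows. The function names `lowest_zero_byte`, `highest_zero_byte`, `find_zero_in_chunk`, and `rfind_zero_in_chunk` match the diff above, but the constants and helper structure here are an illustrative reconstruction of the standard SWAR zero-byte trick, not a copy of the crate's code. The endianness conversion in the last two functions mirrors the fix in the diff.

```rust
const USIZE_BYTES: usize = core::mem::size_of::<usize>();
const LO: usize = usize::MAX / 255; // 0x0101...01
const HI: usize = LO << 7;          // 0x8080...80

/// Returns a mask with 0x80 in exactly those bytes of `x` that are zero.
/// This is the carry-free variant of the SWAR zero-byte test, so no byte
/// is falsely flagged (additions never carry across byte boundaries).
fn zero_byte_mask(x: usize) -> usize {
    !(((x & !HI).wrapping_add(!HI)) | x | !HI)
}

/// Index, in integer significance order, of the least significant zero byte.
fn lowest_zero_byte(x: usize) -> Option<usize> {
    let mask = zero_byte_mask(x);
    if mask == 0 { None } else { Some(mask.trailing_zeros() as usize / 8) }
}

/// Index, in integer significance order, of the most significant zero byte.
fn highest_zero_byte(x: usize) -> Option<usize> {
    let mask = zero_byte_mask(x);
    if mask == 0 {
        None
    } else {
        Some(USIZE_BYTES - 1 - mask.leading_zeros() as usize / 8)
    }
}

/// Memory-order index of the first zero byte in a chunk loaded with a
/// native-endian read: on little endian the least significant byte is the
/// first byte in memory; on big endian the order is reversed.
fn find_zero_in_chunk(x: usize) -> Option<usize> {
    if cfg!(target_endian = "little") {
        lowest_zero_byte(x)
    } else {
        Some(USIZE_BYTES - 1 - highest_zero_byte(x)?)
    }
}

/// Memory-order index of the last zero byte in a chunk.
fn rfind_zero_in_chunk(x: usize) -> Option<usize> {
    if cfg!(target_endian = "little") {
        highest_zero_byte(x)
    } else {
        Some(USIZE_BYTES - 1 - lowest_zero_byte(x)?)
    }
}

fn main() {
    // Zero bytes at memory offsets 1 and USIZE_BYTES - 1.
    let mut bytes = [0xFFu8; USIZE_BYTES];
    bytes[1] = 0;
    bytes[USIZE_BYTES - 1] = 0;
    let x = usize::from_ne_bytes(bytes);
    assert_eq!(find_zero_in_chunk(x), Some(1));
    assert_eq!(rfind_zero_in_chunk(x), Some(USIZE_BYTES - 1));
    assert_eq!(find_zero_in_chunk(usize::MAX), None);
}
```

In a `memchr`-style search, the chunk is first XORed with the needle byte splatted across a `usize`, so a zero byte in the result marks a match; `find_zero_in_chunk` then yields the match offset within the chunk without a per-byte loop.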