all: improve perf of memchr fallback (v2) #154
Resubmit of PR #151.

That PR was reverted because it broke the big endian implementation and CI did not catch it (see the revert PR #153 for details).

Andrew, thank you for the new test cases, which made it easy to fix the issue.

The fix is:
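The endianness pitfall here is that the first byte in memory is the least significant byte of a `usize` on little endian but the most significant one on big endian, so the two cases must extract the index from opposite ends of the word. Below is a minimal sketch of such an endian-aware utility; it is illustrative only, not the exact diff from this PR, and it uses the exact zero-byte test from Hacker's Delight (section 6-1) so that both ends of the word can be trusted:

```rust
/// Index (in memory order) of the first zero byte in `x`, if any.
/// A minimal sketch of the idea, not the exact code from this PR.
fn find_zero_in_chunk(x: usize) -> Option<usize> {
    // 0x7F7F...7F: every byte is 0x7F.
    const LO7: usize = usize::MAX / 255 * 0x7F;
    // Exact zero-byte test (Hacker's Delight, 6-1): the high bit of
    // byte i in `y` is set iff byte i of `x` is zero. Unlike the
    // cheaper `x.wrapping_sub(lo) & !x & hi` test, it has no false
    // positives in bytes above a zero byte.
    let y = !(((x & LO7) + LO7) | x | LO7);
    if y == 0 {
        return None;
    }
    if cfg!(target_endian = "little") {
        // First byte in memory is the least significant one.
        Some((y.trailing_zeros() / 8) as usize)
    } else {
        // Big endian: first byte in memory is the most significant one.
        Some((y.leading_zeros() / 8) as usize)
    }
}

fn main() {
    // Demo assumes a 64-bit target so that `usize` is 8 bytes.
    let chunk = usize::from_ne_bytes(*b"xxxAxxxx");
    // XOR with the needle splatted into every byte turns matching
    // bytes into zero bytes.
    let splat = usize::MAX / 255 * (b'A' as usize);
    assert_eq!(find_zero_in_chunk(chunk ^ splat), Some(3));
}
```

The exact test costs a couple more operations than the classic `x.wrapping_sub(lo) & !x & hi` check, but the classic check can flag non-zero bytes above a true zero byte, which is the kind of false positive only the big endian path has to worry about.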
Original description:
Current generic ("all") implementation checks that a chunk (
usize
) contains a zero byte, and if it is, iterates over bytes of this chunk to find the index of zero byte. Instead, we can use more bit operations to find the index without loops.Context: we use
memchr
, but many of our strings are short. Currently SIMD-optimizedmemchr
processes bytes one by one when the string length is shorter than SIMD register. I suspect it can be made faster if we takeusize
bytes a chunk which does not fit into SIMD register and process it with such utility, similarly to how AVX2 implementation falls back to SSE2. So I looked at generic implementation to reuse it in SIMD-optimized version, but there were none. So here is it.
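For concreteness, here is a rough sketch of what a `usize`-at-a-time fallback search looks like, reusing the `find_zero_in_chunk` sketch above. `memchr_fallback` is a hypothetical name for illustration, not the crate's API, and the real implementation reads chunks through raw pointers with unaligned loads rather than slice indexing:

```rust
/// Hypothetical chunked fallback search built on the endian-aware
/// `find_zero_in_chunk` utility sketched above.
fn memchr_fallback(needle: u8, haystack: &[u8]) -> Option<usize> {
    const CHUNK: usize = std::mem::size_of::<usize>();
    // Splat the needle into every byte of a word.
    let splat = usize::MAX / 255 * (needle as usize);
    let mut offset = 0;
    while offset + CHUNK <= haystack.len() {
        let chunk = usize::from_ne_bytes(
            haystack[offset..offset + CHUNK].try_into().unwrap(),
        );
        // XOR turns bytes equal to `needle` into zero bytes.
        if let Some(i) = find_zero_in_chunk(chunk ^ splat) {
            return Some(offset + i);
        }
        offset += CHUNK;
    }
    // Remainder shorter than one chunk: plain byte loop.
    haystack[offset..]
        .iter()
        .position(|&b| b == needle)
        .map(|i| offset + i)
}
```

XORing the chunk with the splatted needle reduces "find the needle byte" to "find a zero byte", which is why a single zero-byte utility is enough and why it is attractive to reuse from the SIMD code paths for short tails.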