Optimize String#valid_encoding? #12145
Conversation
Unrolling looks good to me. It may have a negative impact in some cases, but the magnitude of the negative impact is dwarfed by the positive impact (tens of nanoseconds vs. hundreds of microseconds).
Perhaps a way to optimize for the early-failing case would be to switch the positions of the two loops and process the individual bytes first, before the unrolled loop. For a valid encoding there should be no difference, because it needs to iterate through everything anyway. If there's an invalid encoding at the start, it would cause an early fail. However, I'm also wondering if it would make a significant difference to move the return statement out of the individual byte loop. That would be optimizing for the likely case of a valid encoding.
Doing the byte loop before the unrolled loop does not matter, because if the first invalid byte occurs within the first unrolled chunk, the method still has to visit the rest of that chunk either way. Moving out the `return` is worth trying. What I wonder most is whether unrolling still gives a performance boost on AArch64.
I'm not sure about that, when accounting for the conditional probability: when there's an invalid byte somewhere in the slice (including in some of the last bytes), there's a higher probability of having more invalid bytes somewhere else (including in some of the first bytes). Purely theoretical thinking though, and I don't think it's worth digging deeper into this to maybe squeeze out a little bit more for some use cases.
Definitely! 👍
I can try that.
Sorry for the delay. I finally got around to running the benchmark on a Raspberry Pi 4 Model B Rev 1.4.

$ uname -a
Linux 2977d8bfd403 5.15.32-v8+ #1538 SMP PREEMPT Thu Mar 31 19:40:39 BST 2022 aarch64 Linux
$ cat /etc/os-release | grep PRETTY_NAME
PRETTY_NAME="Alpine Linux v3.16"
@HertzDevil The implementation still needs to be integrated into
Crystal 1.0.0 cannot handle 🤔
This PR implements one of the algorithms from #11873.
The loop can be made even faster by unrolling, which is not done here yet. Here is a benchmark comparing 4 different implementations of `String#valid_encoding?`:

- `old`: The existing `Char::Reader`-based loop.
- `old inlined`: As above, but with this modification applied to `Char::Reader#next_char`.
- `new`: This PR.
- `new unrolled`: Adds the following unrolling code, where `UNROLL = 64` for the benchmark (see the sketch after this list). The `return` statement must be outside the inner loop, otherwise the unrolling would become only marginally faster than this PR.
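The PR's actual unrolling snippet is not reproduced here; the following is only a rough sketch of the loop shape under discussion. `UNROLL`, `utf8_step`, `UTF8_ACCEPT`, and `UTF8_ERROR` are hypothetical placeholders for the real per-byte DFA code. The point is the structure: the inner loop only accumulates an error flag, and the `return` happens once per chunk, outside the inner loop.

```crystal
# Rough sketch only: utf8_step, UTF8_ACCEPT and UTF8_ERROR are hypothetical
# placeholders standing in for the PR's real per-byte DFA transition.
UNROLL = 64

def valid_encoding_unrolled?(bytes : Bytes) : Bool
  state = UTF8_ACCEPT
  i = 0

  # Fast path: whole UNROLL-byte chunks.
  while i + UNROLL <= bytes.size
    ok = true
    UNROLL.times do |j|
      state = utf8_step(state, bytes.unsafe_fetch(i + j))
      ok &= state != UTF8_ERROR
    end
    # The early return lives outside the inner loop, keeping the inner
    # loop branch-light so the compiler can unroll it aggressively.
    return false unless ok
    i += UNROLL
  end

  # Tail: remaining bytes one at a time, failing as early as possible.
  while i < bytes.size
    state = utf8_step(state, bytes.unsafe_fetch(i))
    return false if state == UTF8_ERROR
    i += 1
  end

  state == UTF8_ACCEPT
end
```

With `UNROLL = 64` the chunk loop dominates for long inputs, while the tail loop keeps the early exit for short inputs, which is the trade-off discussed in the conversation above.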
The results are:

With `--mcpu=native` this becomes:

As explained in the linked issue, this improvement relies on the `shrx` instruction from the BMI2 extension. The `Char::Reader` implementations are barely affected.
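For context on why `shrx` shows up here (this describes the general shift-DFA technique, not the PR's exact code): shift-based UTF-8 DFAs usually encode the current state as a bit offset and perform each transition with one table load plus one variable-count shift, and BMI2's `shrx` makes that shift a cheap, flag-free instruction. A minimal sketch of that transition shape, with made-up table contents:

```crystal
# Sketch of the shift-DFA transition shape; TRANSITIONS would pack the
# next-state for every current state into one 64-bit word per byte value.
# The zero-filled table here is a placeholder, NOT real transition data.
TRANSITIONS = StaticArray(UInt64, 256).new(0_u64)

def dfa_step(state : UInt64, byte : UInt8) : UInt64
  # One load and one variable shift per byte; with BMI2 enabled
  # (e.g. via --mcpu=native) the shift compiles down to `shrx`.
  TRANSITIONS.unsafe_fetch(byte.to_i) >> (state & 63)
end
```

The `Char::Reader`-based implementations have no such variable-shift inner loop, which is presumably why they are barely affected by `--mcpu=native`.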
Note that if the first byte of a sufficiently long string is already invalid, the inner loop must still consume the rest of the `UNROLL`-byte chunk before the method returns. This is what happens if the benchmark had `s = "\x80" * N` instead:

Because of this I am not sure if we should unroll the loop here.
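For reference, numbers along these lines could be gathered with a small `Benchmark.ips` harness like the sketch below. The value of `N`, the input strings, and the report labels are illustrative assumptions, not the exact script behind the figures quoted in this PR.

```crystal
require "benchmark"

# Hypothetical harness; N and the inputs are illustrative assumptions.
N = 1 << 20

valid_ascii  = "a" * N       # best case: every byte is valid
invalid_head = "\x80" * N    # worst case for unrolling: the first byte is already invalid

Benchmark.ips do |x|
  x.report("valid input")        { valid_ascii.valid_encoding? }
  x.report("invalid first byte") { invalid_head.valid_encoding? }
end
```

Built with `--release` (and optionally `--mcpu=native` to let the compiler emit BMI2 instructions on CPUs that support them), the two reports contrast the fully valid case with the early-failure case discussed above.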