Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize String#valid_encoding? #12145

Merged

Conversation

HertzDevil
Copy link
Contributor

This PR implements one of the algorithms from #11873.

The loop can be made even faster by unrolling, which is not done here yet. Here is a benchmark comparing 4 different implementations of String#valid_encoding?:

  • old: The existing Char::Reader-based loop.
  • old inlined: As above, but with this modification applied to Char::Reader#next_char.
  • new: This PR.
  • new unrolled: Adds the following unrolling code, where UNROLL = 64 for the benchmark:
    while s + UNROLL <= e
      {% for i in 0...UNROLL %}
        state = table[s[{{ i }}]].unsafe_shr(state & 0x3F)
      {% end %}
      return false if state & 0x3F == 6
      s += UNROLL
    end
    The return statement must be outside the inner loop, otherwise the unrolling would become only marginally faster than this PR.

The results are:

         old 936.05  (  1.07ms) (± 2.37%)  0.0B/op   2.93× slower
 old inlined   1.57k (638.75µs) (± 3.85%)  0.0B/op   1.75× slower
         new   1.88k (532.38µs) (± 7.53%)  0.0B/op   1.46× slower
new unrolled   2.75k (364.02µs) (± 3.42%)  0.0B/op        fastest

With --mcpu=native this becomes:

         old 931.93  (  1.07ms) (± 3.02%)  0.0B/op   5.19× slower
 old inlined   1.57k (635.39µs) (± 3.11%)  0.0B/op   3.07× slower
         new   2.18k (458.16µs) (± 6.94%)  0.0B/op   2.22× slower
new unrolled   4.84k (206.70µs) (± 3.54%)  0.0B/op        fastest

As explained in the linked issue, this improvement relies on the shrx instruction from the BMI2 extension. The Char::Reader implementations are barely affected.

Note that if the first byte of a sufficiently long string is already invalid, the inner loop must still consume the rest of the UNROLL-byte chunk before the method returns. This is what happens if the benchmark had s = "\x80" * N instead:

         old  78.94M ( 12.67ns) (± 6.46%)  0.0B/op   3.56× slower
 old inlined 280.90M (  3.56ns) (± 5.88%)  0.0B/op        fastest
         new 275.24M (  3.63ns) (± 8.38%)  0.0B/op   1.02× slower
new unrolled  20.22M ( 49.45ns) (± 4.52%)  0.0B/op  13.89× slower

Because of this I am not sure if we should unroll the loop here.

@straight-shoota
Copy link
Member

Unrolling looks good to me. It may have a negative impact on some cases. But the magnitude of the negative impact is dwarfed by the positive impact (10s of nanoseconds vs 100s of microseconds).
And the typically expected case is a valid encoding. Finding an invalid byte which would allow to abort early is unexpected.

@straight-shoota
Copy link
Member

Perhaps a way to optimize for the early failing case would be to switch positions of the two loops and process the individual bytes first, before the unrolled loop. For a valid encoding there should be no difference because it needs to iterate through everything anyways. If there's an invalid encoding at the start, it would cause an early fail.

However, I'm also wondering if it would make a significant difference to move the return statement out of the individual byte loop. This would be optimizing for the likely case of a valid encoding.

@HertzDevil
Copy link
Contributor Author

Doing the byte loop before the unrolled loop does not matter, because if the first invalid byte occurs within the first unrolled chunk, then the method now visits bytes.size % UNROLL + UNROLL bytes instead of UNROLL bytes; the average iteration count would ultimately be the same.

Moving out the return in the byte loop did little for valid strings, but now the method must also iterate through all bytes for invalid strings shorter than UNROLL bytes. So I think the benchmark's version is good enough.

What I wonder most is whether unrolling still gives a performance boost on AArch64.

@straight-shoota
Copy link
Member

the average iteration count would ultimately be the same.

I'm not sure about that when accounting for the conditional probability that when there's an invalid byte somewhere in the slice (including in some of the last bytes), there's a higher probability of having more invalid bytes somewhere else (including in some of the first bytes). Purely theoretical thinking though and I don't think it's worth it digging deeper into this to maybe squeeze out a little bit more for use cases.

So I think the benchmark's version is good enough.

Definitely! 👍 ☺️ Thank you very much for that.

What I wonder most is whether unrolling still gives a performance boost on AArch64.

I can try that.

@straight-shoota
Copy link
Member

straight-shoota commented Jul 27, 2022

Sorry for the delay. I finally got around running the benchmark on a Raspberry Pi 4 Model B Rev 1.4.

$ uname -a
Linux 2977d8bfd403 5.15.32-v8+ #1538 SMP PREEMPT Thu Mar 31 19:40:39 BST 2022 aarch64 Linux
$ cat /etc/os-release | grep PRETTY_NAME
PRETTY_NAME="Alpine Linux v3.16"
         old 387.73  (  2.58ms) (± 0.87%)  0.0B/op   4.58× slower
 old inlined 713.18  (  1.40ms) (± 0.21%)  0.0B/op   2.49× slower
         new   1.18k (849.85µs) (± 0.55%)  0.0B/op   1.51× slower
new unrolled   1.78k (562.88µs) (± 0.53%)  0.0B/op        fastest

--mcpu=native doesn't seem to have any effect on any of the implementations:

         old 398.40  (  2.51ms) (± 0.19%)  0.0B/op   4.49× slower
 old inlined 688.86  (  1.45ms) (± 0.47%)  0.0B/op   2.60× slower
         new   1.18k (848.07µs) (± 0.27%)  0.0B/op   1.52× slower
new unrolled   1.79k (559.40µs) (± 0.09%)  0.0B/op        fastest

@straight-shoota
Copy link
Member

@HertzDevil The implementation still needs to be integrated into valid_encoding?, right?

@HertzDevil
Copy link
Contributor Author

Crystal 1.0.0 cannot handle UInt64s inside the dfa_state macro until #10713.

🤔

@straight-shoota straight-shoota added this to the 1.6.0 milestone Aug 16, 2022
@straight-shoota straight-shoota merged commit d48b9f3 into crystal-lang:master Aug 18, 2022
@HertzDevil HertzDevil deleted the perf/string-valid-encoding branch August 18, 2022 12:51
@HertzDevil HertzDevil mentioned this pull request May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants