Optimize String#valid_encoding? #12145
Conversation
Unrolling looks good to me. It may have a negative impact in some cases, but the magnitude of the negative impact is dwarfed by the positive impact (tens of nanoseconds vs. hundreds of microseconds).
Perhaps a way to optimize for the early-failing case would be to switch the positions of the two loops and process the individual bytes first, before the unrolled loop. For a valid encoding there should be no difference, because it needs to iterate through everything anyway. If there's an invalid encoding at the start, it would cause an early fail. However, I'm also wondering if it would make a significant difference to move the return statement out of the individual byte loop. That would be optimizing for the likely case of a valid encoding.
Doing the byte loop before the unrolled loop does not matter, because if the first invalid byte occurs within the first unrolled chunk, the method still has to visit the rest of that chunk either way. Moving out the `return` is worth trying. What I wonder most is whether unrolling still gives a performance boost on AArch64.
I'm not sure about that, when accounting for the conditional probability: when there's an invalid byte somewhere in the slice (including in some of the last bytes), there's a higher probability of having more invalid bytes somewhere else (including in some of the first bytes). Purely theoretical thinking though, and I don't think it's worth digging deeper into this to maybe squeeze out a little bit more for some use cases.
Definitely! 👍
I can try that.
Sorry for the delay. I finally got around to running the benchmark on a Raspberry Pi 4 Model B Rev 1.4.

$ uname -a
Linux 2977d8bfd403 5.15.32-v8+ #1538 SMP PREEMPT Thu Mar 31 19:40:39 BST 2022 aarch64 Linux
$ cat /etc/os-release | grep PRETTY_NAME
PRETTY_NAME="Alpine Linux v3.16"
@HertzDevil The implementation still needs to be integrated into
Crystal 1.0.0 cannot handle 🤔
This PR implements one of the algorithms from #11873.
The loop can be made even faster by unrolling, which is not done here yet. Here is a benchmark comparing 4 different implementations of `String#valid_encoding?`:

- `old`: The existing `Char::Reader`-based loop.
- `old inlined`: As above, but with this modification applied to `Char::Reader#next_char`.
- `new`: This PR.
- `new unrolled`: Adds the following unrolling code, where `UNROLL = 64` for the benchmark (see the sketch after this list). The `return` statement must be outside the inner loop, otherwise the unrolling would become only marginally faster than this PR.
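The PR's actual unrolling snippet is not reproduced here; the following is only a rough sketch of the loop shape under discussion. `UNROLL`, `utf8_step`, `UTF8_ACCEPT`, and `UTF8_ERROR` are hypothetical placeholders for the real per-byte DFA code. The point is the structure: the inner loop only accumulates an error flag, and the `return` happens once per chunk, outside the inner loop.

```crystal
# Rough sketch only: utf8_step, UTF8_ACCEPT and UTF8_ERROR are hypothetical
# placeholders standing in for the PR's real per-byte DFA transition.
UNROLL = 64

def valid_encoding_unrolled?(bytes : Bytes) : Bool
  state = UTF8_ACCEPT
  i = 0

  # Fast path: whole UNROLL-byte chunks.
  while i + UNROLL <= bytes.size
    ok = true
    UNROLL.times do |j|
      state = utf8_step(state, bytes.unsafe_fetch(i + j))
      ok &= state != UTF8_ERROR
    end
    # The early return lives outside the inner loop, keeping the inner
    # loop branch-light so the compiler can unroll it aggressively.
    return false unless ok
    i += UNROLL
  end

  # Tail: remaining bytes one at a time, failing as early as possible.
  while i < bytes.size
    state = utf8_step(state, bytes.unsafe_fetch(i))
    return false if state == UTF8_ERROR
    i += 1
  end

  state == UTF8_ACCEPT
end
```

With `UNROLL = 64` the chunk loop dominates for long inputs, while the tail loop keeps the early exit for short inputs, which is the trade-off discussed in the conversation above.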
The results are:

With `--mcpu=native` this becomes:

As explained in the linked issue, this improvement relies on the `shrx` instruction from the BMI2 extension. The `Char::Reader` implementations are barely affected.
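For context on why `shrx` shows up here (this describes the general shift-DFA technique, not the PR's exact code): shift-based UTF-8 DFAs usually encode the current state as a bit offset and perform each transition with one table load plus one variable-count shift, and BMI2's `shrx` makes that shift a cheap, flag-free instruction. A minimal sketch of that transition shape, with made-up table contents:

```crystal
# Sketch of the shift-DFA transition shape; TRANSITIONS would pack the
# next-state for every current state into one 64-bit word per byte value.
# The zero-filled table here is a placeholder, NOT real transition data.
TRANSITIONS = StaticArray(UInt64, 256).new(0_u64)

def dfa_step(state : UInt64, byte : UInt8) : UInt64
  # One load and one variable shift per byte; with BMI2 enabled
  # (e.g. via --mcpu=native) the shift compiles down to `shrx`.
  TRANSITIONS.unsafe_fetch(byte.to_i) >> (state & 63)
end
```

The `Char::Reader`-based implementations have no such variable-shift inner loop, which is presumably why they are barely affected by `--mcpu=native`.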
Note that if the first byte of a sufficiently long string is already invalid, the inner loop must still consume the rest of the `UNROLL`-byte chunk before the method returns. This is what happens if the benchmark had `s = "\x80" * N` instead:

Because of this I am not sure if we should unroll the loop here.
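For reference, numbers along these lines could be gathered with a small `Benchmark.ips` harness like the sketch below. The value of `N`, the input strings, and the report labels are illustrative assumptions, not the exact script behind the figures quoted in this PR.

```crystal
require "benchmark"

# Hypothetical harness; N and the inputs are illustrative assumptions.
N = 1 << 20

valid_ascii  = "a" * N       # best case: every byte is valid
invalid_head = "\x80" * N    # worst case for unrolling: the first byte is already invalid

Benchmark.ips do |x|
  x.report("valid input")        { valid_ascii.valid_encoding? }
  x.report("invalid first byte") { invalid_head.valid_encoding? }
end
```

Built with `--release` (and optionally `--mcpu=native` to let the compiler emit BMI2 instructions on CPUs that support them), the two reports contrast the fully valid case with the early-failure case discussed above.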