
Optimize UTF-8 decoding #11873

Open
straight-shoota opened this issue Mar 5, 2022 · 10 comments

@straight-shoota
Member

Several places in stdlib deal with UTF-8 decoding:

  • String.char_bytesize_at
  • Char::Reader#decode_char_at
  • IO#read_char_with_bytesize

These are three different implementations of essentially the same algorithm. The differences stem from their respective domains and are probably insignificant for the UTF-8 decoding algorithm itself.

The algorithm is not very performant, though. It is my understanding that there are far more efficient algorithms for UTF-8 decoding.
One example is Björn Höhrmann's DFA-based decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa. I have not yet searched for other options. We can consider stand-alone implementations like that one, as well as look at what other text libraries do (including the stdlib implementations of other languages).

Considering that UTF-8 decoding plays a major part in string operations, this would certainly have a positive effect on many Crystal programs.

@asterite
Member

asterite commented Mar 5, 2022

I think the case of IO is slightly different: unless we have at least 4 bytes of buffered data, the algorithm has to go byte by byte. But if we do have 4 bytes in the peek buffer, something more optimal than what we have right now could be done.

Then, yes, I think we should have a place with an algorithm that takes a slice of bytes and returns the UTF-8 codepoint and how many bytes it consumed, or an error otherwise.
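
A minimal sketch of what such an entry point might look like (the module and method names here are made up for illustration, and the 3- and 4-byte branches are elided):

module UTF8 # hypothetical namespace, not an existing stdlib module
  # Returns {codepoint, bytes consumed}, or nil if the slice does not start
  # with a valid UTF-8 sequence.
  def self.decode(bytes : Slice(UInt8)) : {UInt32, Int32}?
    return nil if bytes.empty?
    first = bytes[0]
    if first < 0x80
      {first.to_u32, 1}
    elsif first.in?(0xC2..0xDF)
      return nil unless bytes.size >= 2 && (bytes[1] & 0xC0) == 0x80
      {((first.to_u32 & 0x1F) << 6) | (bytes[1].to_u32 & 0x3F), 2}
    else
      nil # 3- and 4-byte sequences elided for brevity
    end
  end
end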

@asterite
Member

asterite commented Mar 5, 2022

Also, to use that efficient algorithm you linked, we have to have a way to put that table into ROM, but there's no way we can do that right now (we need const [...])

@straight-shoota
Member Author

unless we have at least 4 bytes of buffered data the algorithm has to go byte per byte.

The example algorithm I referenced consumes a single byte at a time, and its status indicates whether there was an error, a sequence was completed, or more bytes need to be consumed.
This is necessary when reading bytes from a stream, but also for bounds checking when a string (with potentially invalid UTF-8 encoding) is not null-terminated.
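
To illustrate that interface, here is a hand-rolled byte-at-a-time sketch (not the linked DFA itself, and it skips the overlong and surrogate checks the real table handles):

class UTF8Decoder # hypothetical helper, not the linked implementation
  getter codepoint = 0_u32
  @remaining = 0

  # Feeds one byte; returns :done, :more, or :error.
  # Note: unlike the linked DFA, this does not reject overlong encodings
  # or surrogates.
  def feed(byte : UInt8) : Symbol
    if @remaining == 0
      case byte
      when 0x00..0x7F then @codepoint = byte.to_u32; return :done
      when 0xC2..0xDF then @codepoint = byte.to_u32 & 0x1F; @remaining = 1
      when 0xE0..0xEF then @codepoint = byte.to_u32 & 0x0F; @remaining = 2
      when 0xF0..0xF4 then @codepoint = byte.to_u32 & 0x07; @remaining = 3
      else                 return :error
      end
      :more
    elsif byte & 0xC0 == 0x80
      @codepoint = (@codepoint << 6) | (byte.to_u32 & 0x3F)
      (@remaining -= 1) == 0 ? :done : :more
    else
      :error
    end
  end
end

For example, feeding the two bytes of "é".to_slice returns :more and then :done, leaving codepoint at 0xE9.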

@yxhuvud
Contributor

yxhuvud commented Mar 5, 2022

Couldn't the lookup tables be put into a class variable on String or something? Or does that have too much overhead?

It is my understanding there are far more efficient algorithms for UTF-8 decoding.

I looked into this a while ago, and the strategy that is currently used (at least for String UTF-8 validation) is actually supposed to be faster than the strategy based on a single lookup table, which seems to be the one linked in the description, or at least possible to make faster. If I remember correctly, the reason was that the CPU can parallelize the execution of the instructions better, even if it does execute a few more instructions. But it was a while ago that I read about it, so I may misremember.

However, there are some modern approaches which are supposed to improve things a bit, by using two tables instead of one. Recommended references:
https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
https://twitter.com/pervognsen/status/1410488394535243778 (which then points to https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 for actual code)

@yxhuvud
Contributor

yxhuvud commented Mar 6, 2022

Hmm. I tried profiling the following program:

long_string = "a" * (2**30)
10.times { p long_string.valid_encoding? }

This takes 24 seconds.
With the following monkeypatch (compiled with -release --mcpu znver2), it reduces to 5.4s:

struct Char::Reader
  @[AlwaysInline]
  def next_char : Char
    @pos &+= @current_char_width
    if @pos > @string.bytesize
      raise IndexError.new
    end

    decode_current_char
  end
end

The bulk of the change is due to the @[AlwaysInline] line, but the &+= also gives a second or two. The --mcpu flag is also worth seconds; I suppose it enables some vector instructions.

That is a quite ridiculous difference. Unfortunately, I have no idea what actually happens code-generation-wise, as the @[AlwaysInline] seems to make the DWARF row number information simply disappear, in a way I haven't seen before with regular compiler-initiated inlining. I don't know whether it is due to the annotation or due to the optimizations that happen.

@HertzDevil
Contributor

HertzDevil commented May 1, 2022

https://twitter.com/pervognsen/status/1410488394535243778 (which then points to https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 for actual code)

Here is a rather straightforward Crystal port of this new String#valid_encoding?, without the loop unrolling:

class String
  VALID_ENCODING_TABLE = begin
    x = Array(UInt64).new(256)

    # the same DFA transition table, with state 1 hidden:
    #
    #              accepted
    #              | 1 continuation byte left
    #              | | 2 continuation bytes left
    #              | | | E0-?? ??; disallow overlong encodings up to U+07FF
    #              | | | | ED-?? ??; disallow surrogate pairs
    #              | | | | | F0-?? ?? ??; disallow overlong encodings up to U+FFFF
    #              | | | | | | 3 continuation bytes left
    #              | | | | | | | F4-?? ?? ??; disallow codepoints above U+10FFFF
    #              v v v v v v v v
    #
    #            | 0 2 3 4 5 6 7 8
    # -----------+----------------
    # 0x00..0x7F | 0 _ _ _ _ _ _ _
    # 0x80..0x8F | _ 0 2 _ 2 _ 3 3
    # 0x90..0x9F | _ 0 2 _ 2 3 3 _
    # 0xA0..0xBF | _ 0 2 2 _ 3 3 _
    # 0xC2..0xDF | 2 _ _ _ _ _ _ _
    # 0xE0..0xE0 | 4 _ _ _ _ _ _ _
    # 0xE1..0xEC | 3 _ _ _ _ _ _ _
    # 0xED..0xED | 5 _ _ _ _ _ _ _
    # 0xEE..0xEF | 3 _ _ _ _ _ _ _
    # 0xF0..0xF0 | 6 _ _ _ _ _ _ _
    # 0xF1..0xF3 | 7 _ _ _ _ _ _ _
    # 0xF4..0xF4 | 8 _ _ _ _ _ _ _

    {% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3)); {% end %}
    {% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1)); {% end %}
    {% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1)); {% end %}
    {% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}

    x
  end

  private def self.put(array, x)
    array << x
  end

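  # packs nine 6-bit next-state entries into a single UInt64; each entry stores
  # the target state premultiplied by 6, so it can be used directly as the
  # shift amount when processing the next byte (see #valid_encoding2? below)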
  private macro dfa_state(*transitions)
    {% x = 0_u64 %}
    {% for tr, i in transitions %}
      {% x |= (1_u64 << (i * 6)) * tr * 6 %}
    {% end %}
    {{ x }}
  end

  def valid_encoding2? : Bool
    state = 0_u64
    table = VALID_ENCODING_TABLE.to_unsafe
    s = to_unsafe
    e = s + bytesize

    while s < e
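      # the current state (already a multiple of 6) selects one of the nine
      # 6-bit fields of this byte's table entry; 6 is the error state, 0 accept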
      state = table[s.value].unsafe_shr(state & 0x3F)
      return false if state & 0x3F == 6
      s += 1
    end

    state & 0x3F == 0
  end
end

Here is a global fun that does the same thing, which makes it easy to locate in the generated assembly:

@[NoInline]
fun foo_valid_encoding(str : Void*) : Bool
  str = str.as(String*)
  state = 0_u64
  table = String::VALID_ENCODING_TABLE.to_unsafe
  s = str.value.to_unsafe
  e = s + str.value.bytesize

  while s < e
    state = table[s.value].unsafe_shr(state & 0x3F)
    return false if state & 0x3F == 6
    s += 1
  end

  state & 0x3F == 0
end

Its generated assembly under --mcpu=native is:

foo_valid_encoding:
	movq	(%rdi), %rax
	movslq	4(%rax), %rdx
	testq	%rdx, %rdx
	jle	.LBB299_1
	movq	"String::VALID_ENCODING_TABLE"(%rip), %rcx
	movq	16(%rcx), %rcx
	addq	%rax, %rdx
	addq	$12, %rdx
	addq	$12, %rax
	xorl	%esi, %esi
.LBB299_4:
	movzbl	(%rax), %edi
	shrxq	%rsi, (%rcx,%rdi,8), %rsi
	andl	$63, %esi
	cmpq	$6, %rsi
	je	.LBB299_5
	incq	%rax
	cmpq	%rdx, %rax
	jb	.LBB299_4
	testq	%rsi, %rsi
	sete	%al
	retq
.LBB299_1:
	movb	$1, %al
	retq
.LBB299_5:
	xorl	%eax, %eax
	retq

It appears to be really fast, since it reproduces the shrxq %rsi, (%rcx,%rdi,8), %rsi line. It should not matter whether the DFA is placed in read-only memory or not. In reality, some kind of recovery mechanism is also needed to perform correct backtracking in case of errors.

The unrolling only makes sense in the context of #valid_encoding?. If we do anything else within the loop its performance improvements will most likely diminish.

I don't think the SIMD example is of much relevance because it is still restricted to #valid_encoding? and the compiler offers no SIMD support at the moment.

@HertzDevil
Contributor

And here is an optimized version of #each_char(&), borrowing ideas from both DFA examples:

class String
  VALID_ENCODING_TABLE = begin
    x = Array(UInt64).new(256)

    # each transition is also associated with a mask that indicates which bits
    # of the current byte are used to form a UTF-8 codepoint; this is possible
    # because every legal UTF-8 byte uniquely determines whether it is leading
    # or trailing, and in the former case, the number of trailing bytes;
    # thus only bits 54 and 55 remain unused in the DFA representation
    # (see the redefined dfa_state helper below, which stores the mask)

    {% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x7F)); {% end %}
    {% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3, mask: 0x3F)); {% end %}
    {% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}
    {% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}

    x
  end
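
  # `put` is reused from the previous comment; `dfa_state` is assumed to be
  # redefined along these lines so the mask lands in the top 8 bits, where
  # #each_char2 reads it back with unsafe_shr(56) (this helper is not shown
  # in the original comment):
  private macro dfa_state(*transitions, mask)
    {% x = (1_u64 << 56) * mask %}
    {% for tr, i in transitions %}
      {% x |= (1_u64 << (i * 6)) * tr * 6 %}
    {% end %}
    {{ x }}
  end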

  def each_char2(& : Char ->) : Nil
    state = 0_u64
    table = VALID_ENCODING_TABLE.to_unsafe
    ch = 0_u32
    s = to_unsafe
    e = s + bytesize
    prev = s

    while prev < e
      ch = ch.unsafe_shl(6) | (s.value & table[s.value].unsafe_shr(56))
      state = table[s.value].unsafe_shr(state & 0x3F)
      case state & 0x3F
      when 0
        # accept: a complete character was decoded
        s += 1
        yield ch.unsafe_chr
        prev = s
        ch = 0_u32
      when 6
        # error: emit U+FFFD, then restart one byte past the last accepted position
        yield Char::REPLACEMENT
        state = 0_u64
        prev += 1
        ch = 0_u32
        s = prev
      else
        # mid-sequence: keep consuming continuation bytes
        s += 1
      end
    end
  end
end

From my own tests it is only about twice as fast as the existing implementation. The new bottleneck might be the computation of ch.

@yxhuvud
Contributor

yxhuvud commented May 2, 2022

How are you benchmarking? In my silly benchmarks I get 100 times worse performance with each_char2 compared to the original.

valid_encoding? performs really well though: roughly as fast as the @[AlwaysInline] version I posted above, but a lot more stable in performance.
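
For reference, a minimal Benchmark.ips harness for this kind of comparison could look like the following (the input file name is a placeholder, and this assumes the each_char2 monkeypatch above is loaded):

require "benchmark"

# placeholder input; any sufficiently large UTF-8 file will do
text = File.read("sample_utf8.txt")

Benchmark.ips do |x|
  x.report("each_char")  { count = 0; text.each_char { count += 1 }; count }
  x.report("each_char2") { count = 0; text.each_char2 { count += 1 }; count }
end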

@kostya
Contributor

kostya commented May 2, 2022

It would be interesting to see a comparison with https://github.com/simdutf/simdutf.

@kostya
Contributor

kostya commented May 3, 2022

I ran it on files from the simdutf benchmark. valid_encoding2? is on average 4.78 times faster than valid_encoding?, and simdutf::validate_utf8 is on average 28.38 times faster than valid_encoding?.
