
Optimize UTF-8 decoding #11873

Open
straight-shoota opened this issue Mar 5, 2022 · 10 comments

@straight-shoota
Member

Several places in stdlib deal with UTF-8 decoding:

  • String.char_bytesize_at
  • Char::Reader#decode_char_at
  • IO#read_char_with_bytesize

These are three different implementations of essentially the same algorithm. The differences stem from their respective domains and are probably insignificant for the UTF-8 decoding algorithm itself.

The algorithm is not very performant, though. It is my understanding that there are far more efficient algorithms for UTF-8 decoding.
One example is Björn Höhrmann's DFA-based decoder: http://bjoern.hoehrmann.de/utf-8/decoder/dfa. I have not yet searched for other options. We can consider stand-alone implementations like that one, as well as look at what other text libraries do (including the stdlib implementations of other languages).

Considering that UTF-8 decoding plays a major part in string operations, this would certainly have a positive effect on many Crystal programs.

@asterite
Member

asterite commented Mar 5, 2022

I think the case of IO is slightly different: unless we have at least 4 bytes of buffered data, the algorithm has to go byte by byte. But if we do have 4 bytes in the peek buffer, something more optimal than what we have right now could be done.

Then, yes, I think we should have a place with an algorithm that takes a slice of bytes and returns the UTF-8 codepoint and how many bytes it consumed, or an error otherwise.
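
A minimal sketch of what such an entry point might look like (the module and method names here are made up for illustration, and the 3- and 4-byte branches are elided):

module UTF8 # hypothetical namespace, not an existing stdlib module
  # Returns {codepoint, bytes consumed}, or nil if the slice does not start
  # with a valid UTF-8 sequence.
  def self.decode(bytes : Slice(UInt8)) : {UInt32, Int32}?
    return nil if bytes.empty?
    first = bytes[0]
    if first < 0x80
      {first.to_u32, 1}
    elsif first.in?(0xC2..0xDF)
      return nil unless bytes.size >= 2 && (bytes[1] & 0xC0) == 0x80
      {((first.to_u32 & 0x1F) << 6) | (bytes[1].to_u32 & 0x3F), 2}
    else
      nil # 3- and 4-byte sequences elided for brevity
    end
  end
end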

@asterite
Member

asterite commented Mar 5, 2022

Also, to use that efficient algorithm you linked, we have to have a way to put that table into ROM, but there's no way we can do that right now (we need const [...])

@straight-shoota
Member Author

unless we have at least 4 bytes of buffered data the algorithm has to go byte per byte.

The example algorithm I referenced consumes a single byte at a time, and its status indicates whether there was an error, a sequence was completed, or more bytes need to be consumed.
This is necessary when reading bytes from a stream, but also for bounds checking when a string (with potentially invalid UTF-8 encoding) is not null-terminated.
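
To illustrate that interface, here is a hand-rolled byte-at-a-time sketch (not the linked DFA itself, and it skips the overlong and surrogate checks the real table handles):

class UTF8Decoder # hypothetical helper, not the linked implementation
  getter codepoint = 0_u32
  @remaining = 0

  # Feeds one byte; returns :done, :more, or :error.
  # Note: unlike the linked DFA, this does not reject overlong encodings
  # or surrogates.
  def feed(byte : UInt8) : Symbol
    if @remaining == 0
      case byte
      when 0x00..0x7F then @codepoint = byte.to_u32; return :done
      when 0xC2..0xDF then @codepoint = byte.to_u32 & 0x1F; @remaining = 1
      when 0xE0..0xEF then @codepoint = byte.to_u32 & 0x0F; @remaining = 2
      when 0xF0..0xF4 then @codepoint = byte.to_u32 & 0x07; @remaining = 3
      else                 return :error
      end
      :more
    elsif byte & 0xC0 == 0x80
      @codepoint = (@codepoint << 6) | (byte.to_u32 & 0x3F)
      (@remaining -= 1) == 0 ? :done : :more
    else
      :error
    end
  end
end

For example, feeding the two bytes of "é".to_slice returns :more and then :done, leaving codepoint at 0xE9.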

@yxhuvud
Contributor

yxhuvud commented Mar 5, 2022

Couldn't the lookup tables be put into a class variable on String or something? Or does that have too much overhead?

It is my understanding there are far more efficient algorithms for UTF-8 decoding.

I looked into this a while ago, and the strategy that is currently used (at least for String UTF-8 validation) is actually supposed to be faster than the strategy based on a single lookup table, which seems to be the one linked in the description, or at least possible to make faster. If I remember correctly, the reason was that the CPU can parallelize the execution of the instructions better, even if it does execute a few more instructions. But it was a while ago that I read about it, so I may misremember.

However, there are some modern approaches which are supposed to improve things a bit, by using two tables instead of one. Recommended references:
https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/
https://twitter.com/pervognsen/status/1410488394535243778 (which then points to https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 for actual code)

@yxhuvud
Contributor

yxhuvud commented Mar 6, 2022

Hmm. I tried profiling the following program:

long_string = "a" * (2**30)
10.times { p long_string.valid_encoding? }

This takes 24 seconds.
With the following monkeypatch (compiled with -release --mcpu znver2), it reduces to 5.4s:

struct Char::Reader
  @[AlwaysInline]
  def next_char : Char
    @pos &+= @current_char_width
    if @pos > @string.bytesize
      raise IndexError.new
    end

    decode_current_char
  end
end

The bulk of the change is due to the @[AlwaysInline] line, but the &+= also gives a second or two. The --mcpu flag is also worth seconds; I suppose it enables some vector instructions.

That is a quite ridiculous difference. Unfortunately, I have no idea what actually happens code-generation-wise, as the @[AlwaysInline] seems to make the DWARF row number information simply disappear, in a way I haven't seen before with regular compiler-initiated inlining. I don't know whether it is due to the annotation or due to the optimizations that happen.

@HertzDevil
Contributor

HertzDevil commented May 1, 2022

https://twitter.com/pervognsen/status/1410488394535243778 (which then points to https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725 for actual code)

Here is a rather straightforward Crystal port of this new String#valid_encoding?, without the loop unrolling:

class String
  VALID_ENCODING_TABLE = begin
    x = Array(UInt64).new(256)

    # the same DFA transition table, with state 1 hidden:
    #
    #              accepted
    #              | 1 continuation byte left
    #              | | 2 continuation bytes left
    #              | | | E0-?? ??; disallow overlong encodings up to U+07FF
    #              | | | | ED-?? ??; disallow surrogate pairs
    #              | | | | | F0-?? ?? ??; disallow overlong encodings up to U+FFFF
    #              | | | | | | 3 continuation bytes left
    #              | | | | | | | F4-?? ?? ??; disallow codepoints above U+10FFFF
    #              v v v v v v v v
    #
    #            | 0 2 3 4 5 6 7 8
    # -----------+----------------
    # 0x00..0x7F | 0 _ _ _ _ _ _ _
    # 0x80..0x8F | _ 0 2 _ 2 _ 3 3
    # 0x90..0x9F | _ 0 2 _ 2 3 3 _
    # 0xA0..0xBF | _ 0 2 2 _ 3 3 _
    # 0xC2..0xDF | 2 _ _ _ _ _ _ _
    # 0xE0..0xE0 | 4 _ _ _ _ _ _ _
    # 0xE1..0xEC | 3 _ _ _ _ _ _ _
    # 0xED..0xED | 5 _ _ _ _ _ _ _
    # 0xEE..0xEF | 3 _ _ _ _ _ _ _
    # 0xF0..0xF0 | 6 _ _ _ _ _ _ _
    # 0xF1..0xF3 | 7 _ _ _ _ _ _ _
    # 0xF4..0xF4 | 8 _ _ _ _ _ _ _

    {% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3)); {% end %}
    {% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1)); {% end %}
    {% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1)); {% end %}
    {% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
    {% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}

    x
  end

  private def self.put(array, x)
    array << x
  end

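  # packs nine 6-bit next-state entries into a single UInt64; each entry stores
  # the target state premultiplied by 6, so it can be used directly as the
  # shift amount when processing the next byte (see #valid_encoding2? below)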
  private macro dfa_state(*transitions)
    {% x = 0_u64 %}
    {% for tr, i in transitions %}
      {% x |= (1_u64 << (i * 6)) * tr * 6 %}
    {% end %}
    {{ x }}
  end

  def valid_encoding2? : Bool
    state = 0_u64
    table = VALID_ENCODING_TABLE.to_unsafe
    s = to_unsafe
    e = s + bytesize

    while s < e
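      # the current state (already a multiple of 6) selects one of the nine
      # 6-bit fields of this byte's table entry; 6 is the error state, 0 accept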
      state = table[s.value].unsafe_shr(state & 0x3F)
      return false if state & 0x3F == 6
      s += 1
    end

    state & 0x3F == 0
  end
end

Here is a global fun that does the same thing, which makes it easy to locate in the generated assembly:

@[NoInline]
fun foo_valid_encoding(str : Void*) : Bool
  str = str.as(String*)
  state = 0_u64
  table = String::VALID_ENCODING_TABLE.to_unsafe
  s = str.value.to_unsafe
  e = s + str.value.bytesize

  while s < e
    state = table[s.value].unsafe_shr(state & 0x3F)
    return false if state & 0x3F == 6
    s += 1
  end

  state & 0x3F == 0
end

Its generated assembly under --mcpu=native is:

foo_valid_encoding:
	movq	(%rdi), %rax
	movslq	4(%rax), %rdx
	testq	%rdx, %rdx
	jle	.LBB299_1
	movq	"String::VALID_ENCODING_TABLE"(%rip), %rcx
	movq	16(%rcx), %rcx
	addq	%rax, %rdx
	addq	$12, %rdx
	addq	$12, %rax
	xorl	%esi, %esi
.LBB299_4:
	movzbl	(%rax), %edi
	shrxq	%rsi, (%rcx,%rdi,8), %rsi
	andl	$63, %esi
	cmpq	$6, %rsi
	je	.LBB299_5
	incq	%rax
	cmpq	%rdx, %rax
	jb	.LBB299_4
	testq	%rsi, %rsi
	sete	%al
	retq
.LBB299_1:
	movb	$1, %al
	retq
.LBB299_5:
	xorl	%eax, %eax
	retq

It appears to be really fast, since it reproduces the shrxq %rsi, (%rcx,%rdi,8), %rsi line. It should not matter whether the DFA is placed in read-only memory or not. In reality, some kind of recovery mechanism is also needed to perform correct backtracking in case of errors.

The unrolling only makes sense in the context of #valid_encoding?. If we do anything else within the loop its performance improvements will most likely diminish.

I don't think the SIMD example is of much relevance because it is still restricted to #valid_encoding? and the compiler offers no SIMD support at the moment.

@HertzDevil
Contributor

And here is an optimized version of #each_char(&), borrowing ideas from both DFA examples:

class String
  VALID_ENCODING_TABLE = begin
    x = Array(UInt64).new(256)

    # each transition is also associated with a mask that indicates which bits
    # of the current byte are used to form a UTF-8 codepoint; this is possible
    # because every legal UTF-8 byte uniquely determines whether it is leading
    # or trailing, and in the former case, the number of trailing bytes;
    # thus only bits 54 and 55 remain unused in the DFA representation
    # (see the redefined dfa_state helper below, which stores the mask)

    {% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x7F)); {% end %}
    {% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3, mask: 0x3F)); {% end %}
    {% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}
    {% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x3F)); {% end %}
    {% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
    {% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
    {% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}

    x
  end
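
  # `put` is reused from the previous comment; `dfa_state` is assumed to be
  # redefined along these lines so the mask lands in the top 8 bits, where
  # #each_char2 reads it back with unsafe_shr(56) (this helper is not shown
  # in the original comment):
  private macro dfa_state(*transitions, mask)
    {% x = (1_u64 << 56) * mask %}
    {% for tr, i in transitions %}
      {% x |= (1_u64 << (i * 6)) * tr * 6 %}
    {% end %}
    {{ x }}
  end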

  def each_char2(& : Char ->) : Nil
    state = 0_u64
    table = VALID_ENCODING_TABLE.to_unsafe
    ch = 0_u32
    s = to_unsafe
    e = s + bytesize
    prev = s

    while prev < e
      ch = ch.unsafe_shl(6) | (s.value & table[s.value].unsafe_shr(56))
      state = table[s.value].unsafe_shr(state & 0x3F)
      case state & 0x3F
      when 0
        # accept: a complete character was decoded
        s += 1
        yield ch.unsafe_chr
        prev = s
        ch = 0_u32
      when 6
        # error: emit U+FFFD, then restart one byte past the last accepted position
        yield Char::REPLACEMENT
        state = 0_u64
        prev += 1
        ch = 0_u32
        s = prev
      else
        # mid-sequence: keep consuming continuation bytes
        s += 1
      end
    end
  end
end

From my own tests it is only about twice as fast as the existing implementation. The new bottleneck might be the computation of ch.

@yxhuvud
Contributor

yxhuvud commented May 2, 2022

How are you benchmarking? In my silly benchmarks I get 100 times worse performance with each_char2 compared to the original.

valid_encoding? performs really well though: roughly as fast as the @[AlwaysInline] version I posted above, but a lot more stable in performance.
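
For reference, a minimal Benchmark.ips harness for this kind of comparison could look like the following (the input file name is a placeholder, and this assumes the each_char2 monkeypatch above is loaded):

require "benchmark"

# placeholder input; any sufficiently large UTF-8 file will do
text = File.read("sample_utf8.txt")

Benchmark.ips do |x|
  x.report("each_char")  { count = 0; text.each_char { count += 1 }; count }
  x.report("each_char2") { count = 0; text.each_char2 { count += 1 }; count }
end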

@kostya
Contributor

kostya commented May 2, 2022

It would be interesting to see a comparison with https://github.com/simdutf/simdutf.

@kostya
Contributor

kostya commented May 3, 2022

I ran it on files from the simdutf benchmark. valid_encoding2? is on average 4.78 times faster than valid_encoding?, and simdutf::validate_utf8 is on average 28.38 times faster than valid_encoding?.
