Optimize UTF-8 decoding #11873
I think the case of … Then, yes, I think we should have a place with an algorithm that takes a slice of bytes and returns the UTF-8 codepoint and how many bytes it consumed, or an error otherwise.
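A minimal sketch of what such a helper could look like, with an illustrative name (`decode_utf8_first` is not an existing or proposed stdlib method):

```crystal
# Decode the first codepoint of `bytes`, returning {codepoint, bytes consumed},
# or nil if the input is malformed.
def decode_utf8_first(bytes : Bytes) : {UInt32, Int32}?
  return nil if bytes.empty?
  b0 = bytes[0]

  # Sequence length, payload bits of the leading byte, and the smallest
  # codepoint that length may legally encode (used to reject overlong forms).
  cp, size, min =
    case b0
    when 0x00..0x7F then return {b0.to_u32, 1}
    when 0xC2..0xDF then {b0.to_u32 & 0x1F_u32, 2, 0x80_u32}
    when 0xE0..0xEF then {b0.to_u32 & 0x0F_u32, 3, 0x800_u32}
    when 0xF0..0xF4 then {b0.to_u32 & 0x07_u32, 4, 0x10000_u32}
    else                 return nil # stray continuation byte, 0xC0/0xC1, 0xF5..0xFF
    end
  return nil if bytes.size < size

  # Accumulate the continuation bytes, which must all be in 0x80..0xBF.
  (1...size).each do |i|
    b = bytes[i]
    return nil unless 0x80 <= b && b <= 0xBF
    cp = (cp << 6) | (b.to_u32 & 0x3F_u32)
  end

  # Reject overlong encodings, UTF-16 surrogates and values above U+10FFFF.
  return nil if cp < min || cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF)
  {cp, size}
end

decode_utf8_first("é!".to_slice) # => {233, 2}, i.e. U+00E9, 2 bytes consumed
```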
Also, to use that efficient algorithm you linked, we have to have a way to put that table into ROM, but there's no way we can do that right now (we need …).
The example algorithm I referenced consumes a single byte, and the status indicates whether there was an error, a sequence was completed, or more bytes need to be consumed.
Couldn't the lookup tables be put into a class variable on String or something? Or does that have too much overhead?
I looked into this a while ago, and the strategy that is currently used (at least for string UTF-8 validation) is actually supposed to be faster than the strategy based on a single lookup table, which seems to be the one linked in the description. Or at least possible to make faster. If I remember correctly, the reason was that the CPU can parallelize the execution of the instructions better, even if it does execute a few more of them. But it was a while ago that I read about it, so I may misremember. However, there are some modern approaches which are supposed to improve things a bit by using two tables instead of one. Recommended references: …
Hmm. I tried profiling the following program:
This takes 24 seconds.
The bulk of the change is due to the … That is a quite ridiculous difference. Unfortunately I have no idea what actually happens code-generation-wise, as the …
Here is a rather straightforward Crystal port of this:

```crystal
class String
VALID_ENCODING_TABLE = begin
x = Array(UInt64).new(256)
# the same DFA transition table, with state 1 hidden:
#
# accepted
# | 1 continuation byte left
# | | 2 continuation bytes left
# | | | E0-?? ??; disallow overlong encodings up to U+07FF
# | | | | ED-?? ??; disallow surrogate pairs
# | | | | | F0-?? ?? ??; disallow overlong encodings up to U+FFFF
# | | | | | | 3 continuation bytes left
# | | | | | | | F4-?? ?? ??; disallow codepoints above U+10FFFF
# v v v v v v v v
#
# | 0 2 3 4 5 6 7 8
# -----------+----------------
# 0x00..0x7F | 0 _ _ _ _ _ _ _
# 0x80..0x8F | _ 0 2 _ 2 _ 3 3
# 0x90..0x9F | _ 0 2 _ 2 3 3 _
# 0xA0..0xBF | _ 0 2 2 _ 3 3 _
# 0xC2..0xDF | 2 _ _ _ _ _ _ _
# 0xE0..0xE0 | 4 _ _ _ _ _ _ _
# 0xE1..0xEC | 3 _ _ _ _ _ _ _
# 0xED..0xED | 5 _ _ _ _ _ _ _
# 0xEE..0xEF | 3 _ _ _ _ _ _ _
# 0xF0..0xF0 | 6 _ _ _ _ _ _ _
# 0xF1..0xF3 | 7 _ _ _ _ _ _ _
# 0xF4..0xF4 | 8 _ _ _ _ _ _ _
{% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3)); {% end %}
{% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1)); {% end %}
{% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1)); {% end %}
{% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
{% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1)); {% end %}
x
end
private def self.put(array, x)
array << x
end
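
# `dfa_state` packs nine 6-bit fields into a single UInt64: field i holds
# (next state * 6) for current state i, so `table[byte].unsafe_shr(state)`
# below yields the next state already scaled to a shift amount.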
private macro dfa_state(*transitions)
{% x = 0_u64 %}
{% for tr, i in transitions %}
{% x |= (1_u64 << (i * 6)) * tr * 6 %}
{% end %}
{{ x }}
end
def valid_encoding2? : Bool
state = 0_u64
table = VALID_ENCODING_TABLE.to_unsafe
s = to_unsafe
e = s + bytesize
while s < e
state = table[s.value].unsafe_shr(state & 0x3F)
return false if state & 0x3F == 6
s += 1
end
state & 0x3F == 0
end
end
```

Here is a global `fun` that does the same thing:

```crystal
@[NoInline]
fun foo_valid_encoding(str : Void*) : Bool
str = str.as(String*)
state = 0_u64
table = String::VALID_ENCODING_TABLE.to_unsafe
s = str.value.to_unsafe
e = s + str.value.bytesize
while s < e
state = table[s.value].unsafe_shr(state & 0x3F)
return false if state & 0x3F == 6
s += 1
end
state & 0x3F == 0
end
```

Its generated assembly under … is:

```asm
foo_valid_encoding:
movq (%rdi), %rax
movslq 4(%rax), %rdx
testq %rdx, %rdx
jle .LBB299_1
movq "String::VALID_ENCODING_TABLE"(%rip), %rcx
movq 16(%rcx), %rcx
addq %rax, %rdx
addq $12, %rdx
addq $12, %rax
xorl %esi, %esi
.LBB299_4:
movzbl (%rax), %edi
shrxq %rsi, (%rcx,%rdi,8), %rsi
andl $63, %esi
cmpq $6, %rsi
je .LBB299_5
incq %rax
cmpq %rdx, %rax
jb .LBB299_4
testq %rsi, %rsi
sete %al
retq
.LBB299_1:
movb $1, %al
retq
.LBB299_5:
xorl %eax, %eax
retq
```

It appears to be really fast, since it reproduces the … The unrolling only makes sense in the context of … I don't think the SIMD example is of much relevance, because it is still restricted to …
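To make the shift-based transition concrete, here is a small trace over the two UTF-8 bytes of "é" (0xC3, 0xA9), assuming the `String` reopening above is compiled in:

```crystal
state = 0_u64
"é".to_slice.each do |byte|
  state = String::VALID_ENCODING_TABLE[byte].unsafe_shr(state & 0x3F)
  puts "0x#{byte.to_s(16)} -> state field #{state & 0x3F}" # 12 (one continuation byte left), then 0 (accepted)
end
puts((state & 0x3F) == 0) # => true: the string ends in the accepting state
```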
And here is an optimized version of `String#each_char`:

```crystal
class String
VALID_ENCODING_TABLE = begin
x = Array(UInt64).new(256)
# each transition is also associated with a mask that indicates which bits
# of the current byte are used to form an UTF-8 codepoint; this is possible
# because every legal UTF-8 byte uniquely determines whether itself is
# leading or trailing, and in the former case, the number of trailing bytes
# thus only bits 54 and 55 remain unused in the DFA representation
{% for ch in 0x00..0x7F %} put(x, dfa_state(0, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x7F)); {% end %}
{% for ch in 0x80..0x8F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 1, 3, 3, mask: 0x3F)); {% end %}
{% for ch in 0x90..0x9F %} put(x, dfa_state(1, 1, 0, 2, 1, 2, 3, 3, 1, mask: 0x3F)); {% end %}
{% for ch in 0xA0..0xBF %} put(x, dfa_state(1, 1, 0, 2, 2, 1, 3, 3, 1, mask: 0x3F)); {% end %}
{% for ch in 0xC0..0xC1 %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}
{% for ch in 0xC2..0xDF %} put(x, dfa_state(2, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x3F)); {% end %}
{% for ch in 0xE0..0xE0 %} put(x, dfa_state(4, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
{% for ch in 0xE1..0xEC %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
{% for ch in 0xED..0xED %} put(x, dfa_state(5, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
{% for ch in 0xEE..0xEF %} put(x, dfa_state(3, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x1F)); {% end %}
{% for ch in 0xF0..0xF0 %} put(x, dfa_state(6, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
{% for ch in 0xF1..0xF3 %} put(x, dfa_state(7, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
{% for ch in 0xF4..0xF4 %} put(x, dfa_state(8, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x0F)); {% end %}
{% for ch in 0xF5..0xFF %} put(x, dfa_state(1, 1, 1, 1, 1, 1, 1, 1, 1, mask: 0x00)); {% end %}
x
end
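
# Assumed helpers, not shown here: `put` is the same as in the previous
# snippet, and this `dfa_state` variant additionally packs the continuation-bit
# mask into the top byte, so that `table[byte].unsafe_shr(56)` recovers it at
# run time. This is an inferred sketch, not the author's exact definition.
private macro dfa_state(*transitions, mask)
{% x = 0_u64 %}
{% x |= (1_u64 << 56) * mask %}
{% for tr, i in transitions %}
{% x |= (1_u64 << (i * 6)) * tr * 6 %}
{% end %}
{{ x }}
end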
def each_char2(& : Char ->) : Nil
state = 0_u64
table = VALID_ENCODING_TABLE.to_unsafe
ch = 0_u32
s = to_unsafe
e = s + bytesize
prev = s
while prev < e
ch = ch.unsafe_shl(6) | (s.value & table[s.value].unsafe_shr(56))
state = table[s.value].unsafe_shr(state & 0x3F)
case state & 0x3F
when 0
s += 1
yield ch.unsafe_chr
prev = s
ch = 0_u32
when 6
yield Char::REPLACEMENT
state = 0_u64
prev += 1
ch = 0_u32
s = prev
else
s += 1
end
end
end
end
```

From my own tests it is only about twice as fast as the existing implementation. The new bottleneck might be the computation of …
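For illustration, a tiny driver of the kind one might use to exercise it, assuming the `String` reopening above is compiled in; an invalid byte comes out as `Char::REPLACEMENT`:

```crystal
bytes = Bytes[0x68, 0xC3, 0xA9, 0xFF, 0x21] # "h", "é", one invalid byte, "!"
String.new(bytes).each_char2 { |ch| print ch } # prints: hé�!
puts
```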
How are you benchmarking? In my silly benchmarks I get 100 times worse performance with …
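For context, a minimal `Benchmark.ips` harness of the kind such a comparison might use; the corpus path is a placeholder, and `valid_encoding2?` is the patch from the earlier comment:

```crystal
require "benchmark"

# Placeholder corpus; the thread uses files from the simdutf benchmark suite.
text = File.read("corpus.txt")

Benchmark.ips do |x|
  x.report("String#valid_encoding? (stdlib)") { text.valid_encoding? }
  x.report("String#valid_encoding2? (DFA)") { text.valid_encoding2? }
end
```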
It would be interesting to see a comparison with https://github.com/simdutf/simdutf.
I ran it on files from the simdutf benchmark.
Several places in stdlib deal with UTF-8 decoding:
String.char_bytesize_at
Char::Reader#decode_char_at
IO#read_char_with_bytesize
These are three different implementations of pretty much the same algorithm. The differences are attributable to their respective domains and are probably insignificant for the UTF-8 decoding algorithm itself.
The algorithm is not very performant, though. It is my understanding there are far more efficient algorithms for UTF-8 decoding.
One example of such an algorithm is this one: http://bjoern.hoehrmann.de/utf-8/decoder/dfa. I have not yet searched for other options. We can consider stand-alone implementations like the one mentioned, as well as look at what other text libraries do (including the stdlib implementations of other languages).
Considering that UTF-8 decoding plays a major part in string operations, this would certainly have a positive effect on many Crystal programs.