-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false
#52408
Comments
I would definitely expect that |
For folks who happen to have the unicode standard memorized, this failure occurs on 4-codeunit strings only if they follow the bit pattern
and always if they follow the bit pattern
|
Looks like the culprit is that The logic of doing this in a single in-place pass over the string looks tricky, unfortunately. Note that the combining_class(uc::Integer) =
uc ≤ 0x10ffff ? unsafe_load(ccall(:utf8proc_get_property, Ptr{UInt16}, (UInt32,), uc), 2) : 0x0000
combining_class(c::AbstractChar) = ismalformed(c) ? 0x0000 : combining_class(UInt32(c)) |
Here is a better fix. When it encounters any decomposed sequence that ends with a combining character, it keeps reading more characters into our decomposition buffer. Then it sorts the combining chars. In the common case where there are only a few consecutive combining characters in a row, this should be reasonably efficient. using Base: ismalformed
using Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK, utf8proc_decompose, UTF8PROC_COMPAT
combining_class(uc::Integer) =
0x00000301 ≤ uc ≤ 0x10ffff ? unsafe_load(ccall(:utf8proc_get_property, Ptr{UInt16}, (UInt32,), uc), 2) : 0x0000
combining_class(c::AbstractChar) = ismalformed(c) ? 0x0000 : combining_class(UInt32(c))
import Unicode: _decompose_char!
function _decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, offset::Integer, options::Integer)
ret = GC.@preserve dest @ccall utf8proc_decompose_char(codepoint::UInt32, pointer(dest, 1+offset)::Ptr{UInt32}, (length(dest)-offset)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
ret < 0 && utf8proc_error(ret)
return ret
end
function isequal_normalized2(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false, chartransform=identity)
function decompose_next_chars!(state, d, cc, options, s)
local n
offset = 0
@inbounds while true
# read a char and decompose it to d
c = chartransform(UInt32(state[1]))
state = iterate(s, state[2])
while true
n = _decompose_char!(c, d, offset, options) + offset
if n > length(d)
resize!(d, n)
resize!(cc, n)
continue
end
break
end
# decomposed chars must be sorted in ascending order of combining class,
# which means we need to keep fetching chars until we get to non-combining
for j = 1+offset:n
cc[j] = combining_class(d[j])
end
(iszero(cc[n]) || isnothing(state)) && break # non-combining
offset = n
end
# sort by combining class
if n < 128 # almost always true
for j1 = 2:n # insertion sort
for j2 = j1:-1:2
cc[j2-1] ≤ cc[j2] && continue
d[j2-1], d[j2] = d[j2], d[j2-1]
cc[j2-1], cc[j2] = cc[j2], cc[j2-1]
end
end
else # avoid n^2 complexity in crazy large-n case
d[:] = d[sortperm(cc)]
end
# split return statement to help type inference:
return state === nothing ? (1, n, nothing) : (1, n, state)
end
options = UTF8PROC_DECOMPOSE
casefold && (options |= UTF8PROC_CASEFOLD)
stripmark && (options |= UTF8PROC_STRIPMARK)
i1,i2 = iterate(s1),iterate(s2)
d1,d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
cc1,cc2 = Vector{UInt16}(undef, 4), Vector{UInt16}(undef, 4) # combining classes
n1 = n2 = 0 # lengths of codepoint buffers
j1 = j2 = 1 # indices in d1, d2
while true
if j1 > n1
i1 === nothing && return i2 === nothing && j2 > n2
j1, n1, i1 = decompose_next_chars!(i1, d1, cc1, options, s1)
end
if j2 > n2
i2 === nothing && return false
j2, n2, i2 = decompose_next_chars!(i2, d2, cc2, options, s2)
end
d1[j1] == d2[j2] || return false
j1 += 1; j2 += 1
end
end |
Performance is better than the naive algorithm but worse than before: using Random, BenchmarkTools
s = randstring(10^2);
@btime Unicode.isequal_normalized($s, $s);
@btime isequal_normalized2($s, $s);
@btime Unicode.normalize($s) == Unicode.normalize($s); gives
|
Here's a slightly faster version in the common case of no combining characters (it avoids caching the combining classes): using Base: ismalformed
using Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK, utf8proc_decompose, UTF8PROC_COMPAT
combining_class(uc::Integer) =
0x00000301 ≤ uc ≤ 0x10ffff ? unsafe_load(ccall(:utf8proc_get_property, Ptr{UInt16}, (UInt32,), uc), 2) : 0x0000
combining_class(c::AbstractChar) = ismalformed(c) ? 0x0000 : combining_class(UInt32(c))
import Unicode: _decompose_char!
function _decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, offset::Integer, options::Integer)
ret = GC.@preserve dest @ccall utf8proc_decompose_char(codepoint::UInt32, pointer(dest, 1+offset)::Ptr{UInt32}, (length(dest)-offset)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
ret < 0 && utf8proc_error(ret)
return ret
end
function isequal_normalized(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false, chartransform=identity)
function decompose_next_chars!(state, d, options, s)
local n
offset = 0
@inbounds while true
# read a char and decompose it to d
c = chartransform(UInt32(state[1]))
state = iterate(s, state[2])
while true
n = _decompose_char!(c, d, offset, options) + offset
if n > length(d)
resize!(d, n)
continue
end
break
end
# decomposed chars must be sorted in ascending order of combining class,
# which means we need to keep fetching chars until we get to non-combining
(iszero(combining_class(d[n])) || isnothing(state)) && break # non-combining
offset = n
end
# sort by combining class
if n < 32 # almost always true
for j1 = 2:n # insertion sort
cc = combining_class(d[j1])
iszero(cc) && continue # don't re-order non-combiners
for j2 = j1:-1:2
cc1 = combining_class(d[j2-1])
cc1 ≤ cc && break
d[j2-1], d[j2] = d[j2], d[j2-1]
end
end
else # avoid n^2 complexity in crazy large-n case
j = 1
@views while j < n
j₀ = j + something(findnext(iszero ∘ combining_class, d[j+1:n], 1), n+1-j)
sort!(d[j:j₀-1], by=combining_class)
j = j₀
end
end
# split return statement to help type inference:
return state === nothing ? (1, n, nothing) : (1, n, state)
end
options = UTF8PROC_DECOMPOSE
casefold && (options |= UTF8PROC_CASEFOLD)
stripmark && (options |= UTF8PROC_STRIPMARK)
i1,i2 = iterate(s1),iterate(s2)
d1,d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
n1 = n2 = 0 # lengths of codepoint buffers
j1 = j2 = 1 # indices in d1, d2
while true
if j1 > n1
i1 === nothing && return i2 === nothing && j2 > n2
j1, n1, i1 = decompose_next_chars!(i1, d1, options, s1)
end
if j2 > n2
i2 === nothing && return false
j2, n2, i2 = decompose_next_chars!(i2, d2, options, s2)
end
d1[j1] == d2[j2] || return false
j1 += 1; j2 += 1
end
end which when benchmarked with using Random, BenchmarkTools
s = randstring(10^2);
@btime Unicode.isequal_normalized($s, $s);
@btime isequal_normalized2($s, $s);
@btime Unicode.normalize($s) == Unicode.normalize($s); gives
I'm not sure if there is any other low-hanging fruit for optimization — if anyone else wants to take a crack at it, please go ahead. |
I'd say go ahead and make a PR. While I'd love to get this even more efficient, we can certainly make a bugfix that isn't quite as fast as the existing buggy implementation. |
Will do. Am writing some more extensive tests,
Found another bug for julia> using Unicode: normalize
julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c"
"J̴ְֻֿ່ི𝅭̕͜ͅ"
julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true)
false
julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true)
false Either Unicode case folding is fundamentally weirder than I thought or this is a bug in utf8proc. In either case, this would affect the naive algorithm equally so it's unrelated to this issue. Update: Python 3 agrees with Julia/utf8proc here: JuliaStrings/utf8proc#257 |
Fixes #52408. (Note that this function was added in Julia 1.8, in #42493.) In the future it would be good to further optimize this function by adding a fast path for the common case of strings that are mostly ASCII characters. Perhaps simply skip ahead to the first byte that doesn't match before we begin doing decomposition etcetera.
Fixes #52408. (Note that this function was added in Julia 1.8, in #42493.) In the future it would be good to further optimize this function by adding a fast path for the common case of strings that are mostly ASCII characters. Perhaps simply skip ahead to the first byte that doesn't match before we begin doing decomposition etcetera. (cherry picked from commit 3b250c7)
Unicode.isequal_normalized(x, Unicode.normalize(x))
should probably return true in all cases.Here's an example where it doesn't
Thanks to Neel Smith on Zulip for finding this
The text was updated successfully, but these errors were encountered: