`isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false` #52408

LilithHafner · 2023-12-05T19:26:15Z

Unicode.isequal_normalized(x, Unicode.normalize(x)) should probably return true in all cases.

Here's an example where it doesn't

julia> using Unicode

julia> str = String([0xd6, 0xbc, 0xd6, 0xb0])
"ְּ"

julia> Unicode.isequal_normalized(str, Unicode.normalize(str))
false

Thanks to Neel Smith on Zulip for finding this

The text was updated successfully, but these errors were encountered:

StefanKarpinski · 2023-12-05T21:44:05Z

I would definitely expect that isequal_normalized(s, t; kws...) produces the same answer as normalize(s; kws...) == normalize(t; kws...) but computes the answer more efficiently. The simplest fix would be to change the definition to that; it could be optimized again later on.

LilithHafner · 2023-12-06T00:57:29Z

For folks who happen to have the unicode standard memorized, this failure occurs on 4-codeunit strings only if they follow the bit pattern

10xxxxxx 110xxxxx 10xxxxxx 110xxxxx

and always if they follow the bit pattern

10110xxx 11010110 100xxxxx 11001100

stevengj · 2023-12-07T04:38:43Z

Looks like the culprit is that isequal_normalized is missing this step from the canonical decomposition, which re-orders the combining marks: https://github.com/JuliaStrings/utf8proc/blob/a9c6332ad1f8e1c987aae5d4a4b632b2c52d5782/utf8proc.c#L588-L606

The logic of doing this in a single in-place pass over the string looks tricky, unfortunately.

Note that the combining_class property can be retrieved in Julia via:

combining_class(uc::Integer) =
    uc ≤ 0x10ffff ? unsafe_load(ccall(:utf8proc_get_property, Ptr{UInt16}, (UInt32,), uc), 2) : 0x0000
combining_class(c::AbstractChar) = ismalformed(c) ? 0x0000 : combining_class(UInt32(c))

stevengj · 2023-12-07T05:46:50Z

Here is a better fix. When it encounters any decomposed sequence that ends with a combining character, it keeps reading more characters into our decomposition buffer. Then it sorts the combining chars. In the common case where there are only a few consecutive combining characters in a row, this should be reasonably efficient.

using Base: ismalformed
using Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK, utf8proc_decompose, UTF8PROC_COMPAT

combining_class(uc::Integer) =
    0x00000301 ≤ uc ≤ 0x10ffff ? unsafe_load(ccall(:utf8proc_get_property, Ptr{UInt16}, (UInt32,), uc), 2) : 0x0000
combining_class(c::AbstractChar) = ismalformed(c) ? 0x0000 : combining_class(UInt32(c))
    
import Unicode: _decompose_char!

function _decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, offset::Integer, options::Integer)
    ret = GC.@preserve dest @ccall utf8proc_decompose_char(codepoint::UInt32, pointer(dest, 1+offset)::Ptr{UInt32}, (length(dest)-offset)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
    ret < 0 && utf8proc_error(ret)
    return ret
end

function isequal_normalized2(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false, chartransform=identity)
    function decompose_next_chars!(state, d, cc, options, s)
        local n
        offset = 0
        @inbounds while true
            # read a char and decompose it to d
            c = chartransform(UInt32(state[1]))
            state = iterate(s, state[2])
            while true
                n = _decompose_char!(c, d, offset, options) + offset
                if n > length(d)
                    resize!(d, n)
                    resize!(cc, n)
                    continue
                end
                break
            end
            
            # decomposed chars must be sorted in ascending order of combining class,
            # which means we need to keep fetching chars until we get to non-combining
            for j = 1+offset:n
                cc[j] = combining_class(d[j])
            end
            (iszero(cc[n]) || isnothing(state)) && break # non-combining 
            offset = n
        end
        
        # sort by combining class
        if n < 128 # almost always true 
            for j1 = 2:n # insertion sort
                for j2 = j1:-1:2
                    cc[j2-1] ≤ cc[j2] && continue
                    d[j2-1], d[j2] = d[j2], d[j2-1]
                    cc[j2-1], cc[j2] = cc[j2], cc[j2-1]
                end
            end
        else # avoid n^2 complexity in crazy large-n case
            d[:] = d[sortperm(cc)]
        end
        
        # split return statement to help type inference:
        return state === nothing ? (1, n, nothing) : (1, n, state)
    end
    options = UTF8PROC_DECOMPOSE
    casefold && (options |= UTF8PROC_CASEFOLD)
    stripmark && (options |= UTF8PROC_STRIPMARK)
    i1,i2 = iterate(s1),iterate(s2)
    d1,d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
    cc1,cc2 = Vector{UInt16}(undef, 4), Vector{UInt16}(undef, 4) # combining classes
    n1 = n2 = 0 # lengths of codepoint buffers
    j1 = j2 = 1 # indices in d1, d2
    while true
        if j1 > n1
            i1 === nothing && return i2 === nothing && j2 > n2
            j1, n1, i1 = decompose_next_chars!(i1, d1, cc1, options, s1)
        end
        if j2 > n2
            i2 === nothing && return false
            j2, n2, i2 = decompose_next_chars!(i2, d2, cc2, options, s2)
        end
        d1[j1] == d2[j2] || return false
        j1 += 1; j2 += 1
    end
end

stevengj · 2023-12-07T05:50:24Z

~~The above code has a lot of allocations, so maybe there is a type instability. But it should be a good starting point.~~ Fixed the type instability.

~~Looks like there's some other bugs in the above code, though — it is always returning true. I'll look at it more tomorrow.~~ Fixed that bug.

Performance is better than the naive algorithm but worse than before:

using Random, BenchmarkTools
s = randstring(10^2);
@btime Unicode.isequal_normalized($s, $s);
@btime isequal_normalized2($s, $s);
@btime Unicode.normalize($s) == Unicode.normalize($s);

gives

  2.096 μs (2 allocations: 160 bytes)
  3.969 μs (4 allocations: 288 bytes)
  5.460 μs (4 allocations: 1.00 KiB)

stevengj · 2023-12-07T06:19:38Z

Here's a slightly faster version in the common case of no combining characters (it avoids caching the combining classes):

using Base: ismalformed
using Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK, utf8proc_decompose, UTF8PROC_COMPAT

combining_class(uc::Integer) =
    0x00000301 ≤ uc ≤ 0x10ffff ? unsafe_load(ccall(:utf8proc_get_property, Ptr{UInt16}, (UInt32,), uc), 2) : 0x0000
combining_class(c::AbstractChar) = ismalformed(c) ? 0x0000 : combining_class(UInt32(c))
    
import Unicode: _decompose_char!

function _decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, offset::Integer, options::Integer)
    ret = GC.@preserve dest @ccall utf8proc_decompose_char(codepoint::UInt32, pointer(dest, 1+offset)::Ptr{UInt32}, (length(dest)-offset)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
    ret < 0 && utf8proc_error(ret)
    return ret
end

function isequal_normalized(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false, chartransform=identity)
    function decompose_next_chars!(state, d, options, s)
        local n
        offset = 0
        @inbounds while true
            # read a char and decompose it to d
            c = chartransform(UInt32(state[1]))
            state = iterate(s, state[2])
            while true
                n = _decompose_char!(c, d, offset, options) + offset
                if n > length(d)
                    resize!(d, n)
                    continue
                end
                break
            end
            
            # decomposed chars must be sorted in ascending order of combining class,
            # which means we need to keep fetching chars until we get to non-combining
            (iszero(combining_class(d[n])) || isnothing(state)) && break # non-combining 
            offset = n
        end
            
        # sort by combining class
        if n < 32 # almost always true 
            for j1 = 2:n # insertion sort
                cc = combining_class(d[j1])
                iszero(cc) && continue # don't re-order non-combiners
                for j2 = j1:-1:2
                    cc1 = combining_class(d[j2-1])
                    cc1 ≤ cc && break
                    d[j2-1], d[j2] = d[j2], d[j2-1]
                end
            end
        else # avoid n^2 complexity in crazy large-n case
            j = 1
            @views while j < n
                j₀ = j + something(findnext(iszero ∘ combining_class, d[j+1:n], 1), n+1-j)
                sort!(d[j:j₀-1], by=combining_class)
                j = j₀
            end
        end
        
        # split return statement to help type inference:
        return state === nothing ? (1, n, nothing) : (1, n, state)
    end
    options = UTF8PROC_DECOMPOSE
    casefold && (options |= UTF8PROC_CASEFOLD)
    stripmark && (options |= UTF8PROC_STRIPMARK)
    i1,i2 = iterate(s1),iterate(s2)
    d1,d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
    n1 = n2 = 0 # lengths of codepoint buffers
    j1 = j2 = 1 # indices in d1, d2
    while true
        if j1 > n1
            i1 === nothing && return i2 === nothing && j2 > n2
            j1, n1, i1 = decompose_next_chars!(i1, d1, options, s1)
        end
        if j2 > n2
            i2 === nothing && return false
            j2, n2, i2 = decompose_next_chars!(i2, d2, options, s2)
        end
        d1[j1] == d2[j2] || return false
        j1 += 1; j2 += 1
    end
end

which when benchmarked with

using Random, BenchmarkTools
s = randstring(10^2);
@btime Unicode.isequal_normalized($s, $s);
@btime isequal_normalized2($s, $s);
@btime Unicode.normalize($s) == Unicode.normalize($s);

gives

  2.094 μs (2 allocations: 160 bytes)
  3.573 μs (2 allocations: 160 bytes)
  5.312 μs (4 allocations: 1.00 KiB)

I'm not sure if there is any other low-hanging fruit for optimization — if anyone else wants to take a crack at it, please go ahead.

LilithHafner · 2023-12-07T14:07:10Z

I'd say go ahead and make a PR. While I'd love to get this even more efficient, we can certainly make a bugfix that isn't quite as fast as the existing buggy implementation.

stevengj · 2023-12-07T15:32:57Z

Will do.

Am writing some more extensive tests, ~~which caught a case where the above code fails: s = "j\u5ae\u5bf\u5b2\u5b4".~~ Fixed.

~~Another bug for s = c\u5c1\u5b2\ue48\uf74\u5ae\u5b8\uc56\u35d\u345\u93cN\u59a\u301 where isequal_normalized(s, normalize(s)) returns false. This one should definitely work.~~ Fixed.

Found another bug for s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c":
isequal_normalized(lowercase(s), normalize(s), casefold=true) returns false. Actually, that example is consistent with the naive algorithm: normalize(lowercase(s), casefold=true) == normalize(normalize(s), casefold=true) also returns false. ~~I guess because lowercase(s) doing something inequivalent to Unicode casefolding?~~ Something is really weird with this example:

julia> using Unicode: normalize

julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c"
"J̴ְֻֿ່ི𝅭̕͜ͅ"

julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true)
false

julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true)
false

Either Unicode case folding is fundamentally weirder than I thought or this is a bug in utf8proc. In either case, this would affect the naive algorithm equally so it's unrelated to this issue. Update: Python 3 agrees with Julia/utf8proc here: JuliaStrings/utf8proc#257

Fixes #52408. (Note that this function was added in Julia 1.8, in #42493.) In the future it would be good to further optimize this function by adding a fast path for the common case of strings that are mostly ASCII characters. Perhaps simply skip ahead to the first byte that doesn't match before we begin doing decomposition etcetera.

Fixes #52408. (Note that this function was added in Julia 1.8, in #42493.) In the future it would be good to further optimize this function by adding a fast path for the common case of strings that are mostly ASCII characters. Perhaps simply skip ahead to the first byte that doesn't match before we begin doing decomposition etcetera. (cherry picked from commit 3b250c7)

LilithHafner added bug Indicates an unexpected problem or unintended behavior unicode Related to unicode characters and encodings labels Dec 5, 2023

This was referenced Dec 8, 2023

fix isequal_normalized for combining-char reordering #52447

Merged

normalization does not commute with case-folding? JuliaStrings/utf8proc#257

Closed

StefanKarpinski closed this as completed in #52447 Dec 19, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false` #52408

`isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false` #52408

LilithHafner commented Dec 5, 2023 •

edited

Loading

StefanKarpinski commented Dec 5, 2023

LilithHafner commented Dec 6, 2023

stevengj commented Dec 7, 2023 •

edited

Loading

stevengj commented Dec 7, 2023 •

edited

Loading

stevengj commented Dec 7, 2023 •

edited

Loading

stevengj commented Dec 7, 2023 •

edited

Loading

LilithHafner commented Dec 7, 2023

stevengj commented Dec 7, 2023 •

edited

Loading

isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false #52408

isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false #52408

Comments

LilithHafner commented Dec 5, 2023 • edited Loading

StefanKarpinski commented Dec 5, 2023

LilithHafner commented Dec 6, 2023

stevengj commented Dec 7, 2023 • edited Loading

stevengj commented Dec 7, 2023 • edited Loading

stevengj commented Dec 7, 2023 • edited Loading

stevengj commented Dec 7, 2023 • edited Loading

LilithHafner commented Dec 7, 2023

stevengj commented Dec 7, 2023 • edited Loading

`isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false` #52408

`isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false` #52408

LilithHafner commented Dec 5, 2023 •

edited

Loading

stevengj commented Dec 7, 2023 •

edited

Loading

stevengj commented Dec 7, 2023 •

edited

Loading

stevengj commented Dec 7, 2023 •

edited

Loading

stevengj commented Dec 7, 2023 •

edited

Loading

stevengj commented Dec 7, 2023 •

edited

Loading