Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false #52408

Closed
LilithHafner opened this issue Dec 5, 2023 · 8 comments · Fixed by #52447
Closed

isequal_normalized("בְּ", Unicode.normalize("בְּ")) == false #52408

LilithHafner opened this issue Dec 5, 2023 · 8 comments · Fixed by #52447
Labels
bug Indicates an unexpected problem or unintended behavior unicode Related to unicode characters and encodings

Comments

@LilithHafner
Copy link
Member

LilithHafner commented Dec 5, 2023

Unicode.isequal_normalized(x, Unicode.normalize(x)) should probably return true in all cases.

Here's an example where it doesn't

julia> using Unicode

julia> str = String([0xd6, 0xbc, 0xd6, 0xb0])
"ְּ"

julia> Unicode.isequal_normalized(str, Unicode.normalize(str))
false

Thanks to Neel Smith on Zulip for finding this

@LilithHafner LilithHafner added bug Indicates an unexpected problem or unintended behavior unicode Related to unicode characters and encodings labels Dec 5, 2023
@StefanKarpinski
Copy link
Member

I would definitely expect that isequal_normalized(s, t; kws...) produces the same answer as normalize(s; kws...) == normalize(t; kws...) but computes the answer more efficiently. The simplest fix would be to change the definition to that; it could be optimized again later on.

@LilithHafner
Copy link
Member Author

For folks who happen to have the unicode standard memorized, this failure occurs on 4-codeunit strings only if they follow the bit pattern

10xxxxxx 110xxxxx 10xxxxxx 110xxxxx

and always if they follow the bit pattern

10110xxx 11010110 100xxxxx 11001100

@stevengj
Copy link
Member

stevengj commented Dec 7, 2023

Looks like the culprit is that isequal_normalized is missing this step from the canonical decomposition, which re-orders the combining marks: https://github.com/JuliaStrings/utf8proc/blob/a9c6332ad1f8e1c987aae5d4a4b632b2c52d5782/utf8proc.c#L588-L606

The logic of doing this in a single in-place pass over the string looks tricky, unfortunately.

Note that the combining_class property can be retrieved in Julia via:

combining_class(uc::Integer) =
    uc  0x10ffff ? unsafe_load(ccall(:utf8proc_get_property, Ptr{UInt16}, (UInt32,), uc), 2) : 0x0000
combining_class(c::AbstractChar) = ismalformed(c) ? 0x0000 : combining_class(UInt32(c))

@stevengj
Copy link
Member

stevengj commented Dec 7, 2023

Here is a better fix. When it encounters any decomposed sequence that ends with a combining character, it keeps reading more characters into our decomposition buffer. Then it sorts the combining chars. In the common case where there are only a few consecutive combining characters in a row, this should be reasonably efficient.

using Base: ismalformed
using Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK, utf8proc_decompose, UTF8PROC_COMPAT

combining_class(uc::Integer) =
    0x00000301  uc  0x10ffff ? unsafe_load(ccall(:utf8proc_get_property, Ptr{UInt16}, (UInt32,), uc), 2) : 0x0000
combining_class(c::AbstractChar) = ismalformed(c) ? 0x0000 : combining_class(UInt32(c))
    
import Unicode: _decompose_char!

function _decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, offset::Integer, options::Integer)
    ret = GC.@preserve dest @ccall utf8proc_decompose_char(codepoint::UInt32, pointer(dest, 1+offset)::Ptr{UInt32}, (length(dest)-offset)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
    ret < 0 && utf8proc_error(ret)
    return ret
end

function isequal_normalized2(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false, chartransform=identity)
    function decompose_next_chars!(state, d, cc, options, s)
        local n
        offset = 0
        @inbounds while true
            # read a char and decompose it to d
            c = chartransform(UInt32(state[1]))
            state = iterate(s, state[2])
            while true
                n = _decompose_char!(c, d, offset, options) + offset
                if n > length(d)
                    resize!(d, n)
                    resize!(cc, n)
                    continue
                end
                break
            end
            
            # decomposed chars must be sorted in ascending order of combining class,
            # which means we need to keep fetching chars until we get to non-combining
            for j = 1+offset:n
                cc[j] = combining_class(d[j])
            end
            (iszero(cc[n]) || isnothing(state)) && break # non-combining 
            offset = n
        end
        
        # sort by combining class
        if n < 128 # almost always true 
            for j1 = 2:n # insertion sort
                for j2 = j1:-1:2
                    cc[j2-1]  cc[j2] && continue
                    d[j2-1], d[j2] = d[j2], d[j2-1]
                    cc[j2-1], cc[j2] = cc[j2], cc[j2-1]
                end
            end
        else # avoid n^2 complexity in crazy large-n case
            d[:] = d[sortperm(cc)]
        end
        
        # split return statement to help type inference:
        return state === nothing ? (1, n, nothing) : (1, n, state)
    end
    options = UTF8PROC_DECOMPOSE
    casefold && (options |= UTF8PROC_CASEFOLD)
    stripmark && (options |= UTF8PROC_STRIPMARK)
    i1,i2 = iterate(s1),iterate(s2)
    d1,d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
    cc1,cc2 = Vector{UInt16}(undef, 4), Vector{UInt16}(undef, 4) # combining classes
    n1 = n2 = 0 # lengths of codepoint buffers
    j1 = j2 = 1 # indices in d1, d2
    while true
        if j1 > n1
            i1 === nothing && return i2 === nothing && j2 > n2
            j1, n1, i1 = decompose_next_chars!(i1, d1, cc1, options, s1)
        end
        if j2 > n2
            i2 === nothing && return false
            j2, n2, i2 = decompose_next_chars!(i2, d2, cc2, options, s2)
        end
        d1[j1] == d2[j2] || return false
        j1 += 1; j2 += 1
    end
end

@stevengj
Copy link
Member

stevengj commented Dec 7, 2023

The above code has a lot of allocations, so maybe there is a type instability. But it should be a good starting point. Fixed the type instability.

Looks like there's some other bugs in the above code, though — it is always returning true. I'll look at it more tomorrow. Fixed that bug.

Performance is better than the naive algorithm but worse than before:

using Random, BenchmarkTools
s = randstring(10^2);
@btime Unicode.isequal_normalized($s, $s);
@btime isequal_normalized2($s, $s);
@btime Unicode.normalize($s) == Unicode.normalize($s);

gives

  2.096 μs (2 allocations: 160 bytes)
  3.969 μs (4 allocations: 288 bytes)
  5.460 μs (4 allocations: 1.00 KiB)

@stevengj
Copy link
Member

stevengj commented Dec 7, 2023

Here's a slightly faster version in the common case of no combining characters (it avoids caching the combining classes):

using Base: ismalformed
using Base.Unicode: utf8proc_error, UTF8PROC_DECOMPOSE, UTF8PROC_CASEFOLD, UTF8PROC_STRIPMARK, utf8proc_decompose, UTF8PROC_COMPAT

combining_class(uc::Integer) =
    0x00000301  uc  0x10ffff ? unsafe_load(ccall(:utf8proc_get_property, Ptr{UInt16}, (UInt32,), uc), 2) : 0x0000
combining_class(c::AbstractChar) = ismalformed(c) ? 0x0000 : combining_class(UInt32(c))
    
import Unicode: _decompose_char!

function _decompose_char!(codepoint::Union{Integer,Char}, dest::Vector{UInt32}, offset::Integer, options::Integer)
    ret = GC.@preserve dest @ccall utf8proc_decompose_char(codepoint::UInt32, pointer(dest, 1+offset)::Ptr{UInt32}, (length(dest)-offset)::Int, options::Cint, C_NULL::Ptr{Cint})::Int
    ret < 0 && utf8proc_error(ret)
    return ret
end

function isequal_normalized(s1::AbstractString, s2::AbstractString; casefold::Bool=false, stripmark::Bool=false, chartransform=identity)
    function decompose_next_chars!(state, d, options, s)
        local n
        offset = 0
        @inbounds while true
            # read a char and decompose it to d
            c = chartransform(UInt32(state[1]))
            state = iterate(s, state[2])
            while true
                n = _decompose_char!(c, d, offset, options) + offset
                if n > length(d)
                    resize!(d, n)
                    continue
                end
                break
            end
            
            # decomposed chars must be sorted in ascending order of combining class,
            # which means we need to keep fetching chars until we get to non-combining
            (iszero(combining_class(d[n])) || isnothing(state)) && break # non-combining 
            offset = n
        end
            
        # sort by combining class
        if n < 32 # almost always true 
            for j1 = 2:n # insertion sort
                cc = combining_class(d[j1])
                iszero(cc) && continue # don't re-order non-combiners
                for j2 = j1:-1:2
                    cc1 = combining_class(d[j2-1])
                    cc1  cc && break
                    d[j2-1], d[j2] = d[j2], d[j2-1]
                end
            end
        else # avoid n^2 complexity in crazy large-n case
            j = 1
            @views while j < n
                j₀ = j + something(findnext(iszero  combining_class, d[j+1:n], 1), n+1-j)
                sort!(d[j:j₀-1], by=combining_class)
                j = j₀
            end
        end
        
        # split return statement to help type inference:
        return state === nothing ? (1, n, nothing) : (1, n, state)
    end
    options = UTF8PROC_DECOMPOSE
    casefold && (options |= UTF8PROC_CASEFOLD)
    stripmark && (options |= UTF8PROC_STRIPMARK)
    i1,i2 = iterate(s1),iterate(s2)
    d1,d2 = Vector{UInt32}(undef, 4), Vector{UInt32}(undef, 4) # codepoint buffers
    n1 = n2 = 0 # lengths of codepoint buffers
    j1 = j2 = 1 # indices in d1, d2
    while true
        if j1 > n1
            i1 === nothing && return i2 === nothing && j2 > n2
            j1, n1, i1 = decompose_next_chars!(i1, d1, options, s1)
        end
        if j2 > n2
            i2 === nothing && return false
            j2, n2, i2 = decompose_next_chars!(i2, d2, options, s2)
        end
        d1[j1] == d2[j2] || return false
        j1 += 1; j2 += 1
    end
end

which when benchmarked with

using Random, BenchmarkTools
s = randstring(10^2);
@btime Unicode.isequal_normalized($s, $s);
@btime isequal_normalized2($s, $s);
@btime Unicode.normalize($s) == Unicode.normalize($s);

gives

  2.094 μs (2 allocations: 160 bytes)
  3.573 μs (2 allocations: 160 bytes)
  5.312 μs (4 allocations: 1.00 KiB)

I'm not sure if there is any other low-hanging fruit for optimization — if anyone else wants to take a crack at it, please go ahead.

@LilithHafner
Copy link
Member Author

I'd say go ahead and make a PR. While I'd love to get this even more efficient, we can certainly make a bugfix that isn't quite as fast as the existing buggy implementation.

@stevengj
Copy link
Member

stevengj commented Dec 7, 2023

Will do.

Am writing some more extensive tests, which caught a case where the above code fails: s = "j\u5ae\u5bf\u5b2\u5b4". Fixed.

Another bug for s = c\u5c1\u5b2\ue48\uf74\u5ae\u5b8\uc56\u35d\u345\u93cN\u59a\u301 where isequal_normalized(s, normalize(s)) returns false. This one should definitely work. Fixed.

Found another bug for s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c":
isequal_normalized(lowercase(s), normalize(s), casefold=true) returns false. Actually, that example is consistent with the naive algorithm: normalize(lowercase(s), casefold=true) == normalize(normalize(s), casefold=true) also returns false. I guess because lowercase(s) doing something inequivalent to Unicode casefolding? Something is really weird with this example:

julia> using Unicode: normalize

julia> s = "J\uf72\uec8\u345\u315\u5bf\u5bb\U1d16d\u5b0\u334\u35c"
"J̴ְֻֿ່ི𝅭̕͜ͅ"

julia> normalize(s, casefold=true) == normalize(normalize(s), casefold=true)
false

julia> normalize(normalize(s, casefold=true)) == normalize(normalize(s), casefold=true)
false

Either Unicode case folding is fundamentally weirder than I thought or this is a bug in utf8proc. In either case, this would affect the naive algorithm equally so it's unrelated to this issue. Update: Python 3 agrees with Julia/utf8proc here: JuliaStrings/utf8proc#257

StefanKarpinski pushed a commit that referenced this issue Dec 19, 2023
Fixes #52408.

(Note that this function was added in Julia 1.8, in #42493.)

In the future it would be good to further optimize this function by
adding a fast path for the common case of strings that are mostly ASCII
characters. Perhaps simply skip ahead to the first byte that doesn't
match before we begin doing decomposition etcetera.
KristofferC pushed a commit that referenced this issue Dec 23, 2023
Fixes #52408.

(Note that this function was added in Julia 1.8, in #42493.)

In the future it would be good to further optimize this function by
adding a fast path for the common case of strings that are mostly ASCII
characters. Perhaps simply skip ahead to the first byte that doesn't
match before we begin doing decomposition etcetera.

(cherry picked from commit 3b250c7)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants