Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support unicode fuzzy search #18

Open
lewispham opened this issue Jan 17, 2018 · 2 comments
Open

Support unicode fuzzy search #18

lewispham opened this issue Jan 17, 2018 · 2 comments

Comments

@lewispham
Copy link

I've just took a look at the source code and I saw str.charCodeAt is used instead of str.codePointAt. So fuzzy searching unicode characters (multibytes characters for example) is probably not supported by this algorithm.

@meltuhamy
Copy link

@pasztorpisti
Copy link

pasztorpisti commented May 27, 2019

It would become significantly slower with codePointAt while in it's current form (with charCodeAt) it'll work in most cases but can report false positives when the text contains UTF-16 surrogate pairs to encode unicode characters/codepoints larger than 0xFFFF.

Example that reports false positive:

needle = "𠀨";
haystack = "𠀧𡀨";

// returns true
fuzzysearch(needle, haystack)

Explanation of the above code:

The haystack string consists of the 0x20027 and 0x21028 unicode codepoints while needle contains only the 0x20028 codepoint. These 3 codepoints in UTF-16 are encoded to the following 16-bit values (surrogate pairs, because all 3 codepoints are larger than 0xFFFF):

codepoint 0x20027 == utf16: uint16[ 0xD840, 0xDC27 ]
codepoint 0x20028 == utf16: uint16[ 0xD840, 0xDC28 ]
codepoint 0x21028 == utf16: uint16[ 0xD844, 0xDC28 ]

haystack == utf16: uint16[ 0xD840, 0xDC27, 0xD844, 0xDC28 ]
needle == utf16: uint16[ 0xD840, 0xDC28 ] 

How to fix this: (without sacrificing performance)

In UTF16 when a unicode codepoint (an actual unicode character) is encoded using two uint16 values (utf16 surrogate pair) then the first uint16 is always in the inclusive range [0xD800-0xDBFF] (high surrogate) while the second uint16 (in case of valid utf-16) is always in the inclusive range [0xDC00-0DFFF] (low surrogate). These ranges are reserved for surrogate pairs and none of the values in those ranges are used alone as unicode characters/codepoints.

If a charcode in needle is in the range [0xD800-0xDBFF] then this charcode and the charcode that follows it have to be treated as as a single unicode character so they have to be matched together as a pair and there can't be a gap between the two matching charcodes in haystack either.

EDIT: A fixed version would look like this:

function fuzzysearch (needle, haystack) {
    var hlen = haystack.length;
    var nlen = needle.length;
    if (nlen > hlen) {
        return false;
    }
    if (nlen === hlen) {
        return needle === haystack;
    }
    outer: for (var i = 0, j = 0; i < nlen; i++) {
        var nch = needle.charCodeAt(i);

        // handling a utf-16 surrogate pair
        if (nch >= 0xD800 && nch <= 0xDBFF) {
            if (++i >= nlen || j >= hlen) {
                return false
            }
            var nch2 = needle.charCodeAt(i);
            var hch = haystack.charCodeAt(j++);
            while (j < hlen) {
                var hch2 = haystack.charCodeAt(j++);
                if (hch === nch && hch2 === nch2) {
                    continue outer;
                }
                hch = hch2;
            }
            return false
        }

        // no utf-16 surrogate pair
        while (j < hlen) {
            if (haystack.charCodeAt(j++) === nch) {
                continue outer;
            }
        }
        return false;
    }
    return true;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants