Support unicode fuzzy search #18

lewispham · 2018-01-17T05:39:04Z

I've just took a look at the source code and I saw str.charCodeAt is used instead of str.codePointAt. So fuzzy searching unicode characters (multibytes characters for example) is probably not supported by this algorithm.

The text was updated successfully, but these errors were encountered:

meltuhamy · 2018-06-12T11:32:43Z

seems to work https://runkit.com/meltuhamy/5b1faf06fba94100126a5bbb

pasztorpisti · 2019-05-27T02:57:31Z

It would become significantly slower with codePointAt while in it's current form (with charCodeAt) it'll work in most cases but can report false positives when the text contains UTF-16 surrogate pairs to encode unicode characters/codepoints larger than 0xFFFF.

Example that reports false positive:

needle = "𠀨";
haystack = "𠀧𡀨";

// returns true
fuzzysearch(needle, haystack)

Explanation of the above code:

The haystack string consists of the 0x20027 and 0x21028 unicode codepoints while needle contains only the 0x20028 codepoint. These 3 codepoints in UTF-16 are encoded to the following 16-bit values (surrogate pairs, because all 3 codepoints are larger than 0xFFFF):

codepoint 0x20027 == utf16: uint16[ 0xD840, 0xDC27 ]
codepoint 0x20028 == utf16: uint16[ 0xD840, 0xDC28 ]
codepoint 0x21028 == utf16: uint16[ 0xD844, 0xDC28 ]

haystack == utf16: uint16[ 0xD840, 0xDC27, 0xD844, 0xDC28 ]
needle == utf16: uint16[ 0xD840, 0xDC28 ]

How to fix this: (without sacrificing performance)

In UTF16 when a unicode codepoint (an actual unicode character) is encoded using two uint16 values (utf16 surrogate pair) then the first uint16 is always in the inclusive range [0xD800-0xDBFF] (high surrogate) while the second uint16 (in case of valid utf-16) is always in the inclusive range [0xDC00-0DFFF] (low surrogate). These ranges are reserved for surrogate pairs and none of the values in those ranges are used alone as unicode characters/codepoints.

If a charcode in needle is in the range [0xD800-0xDBFF] then this charcode and the charcode that follows it have to be treated as as a single unicode character so they have to be matched together as a pair and there can't be a gap between the two matching charcodes in haystack either.

EDIT: A fixed version would look like this:

function fuzzysearch (needle, haystack) {
    var hlen = haystack.length;
    var nlen = needle.length;
    if (nlen > hlen) {
        return false;
    }
    if (nlen === hlen) {
        return needle === haystack;
    }
    outer: for (var i = 0, j = 0; i < nlen; i++) {
        var nch = needle.charCodeAt(i);

        // handling a utf-16 surrogate pair
        if (nch >= 0xD800 && nch <= 0xDBFF) {
            if (++i >= nlen || j >= hlen) {
                return false
            }
            var nch2 = needle.charCodeAt(i);
            var hch = haystack.charCodeAt(j++);
            while (j < hlen) {
                var hch2 = haystack.charCodeAt(j++);
                if (hch === nch && hch2 === nch2) {
                    continue outer;
                }
                hch = hch2;
            }
            return false
        }

        // no utf-16 surrogate pair
        while (j < hlen) {
            if (haystack.charCodeAt(j++) === nch) {
                continue outer;
            }
        }
        return false;
    }
    return true;
}

ajitid mentioned this issue Aug 3, 2021

Opt for charCodeAt instead of codePointAt ajitid/fzf-for-js#45

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support unicode fuzzy search #18

Support unicode fuzzy search #18

lewispham commented Jan 17, 2018

meltuhamy commented Jun 12, 2018

pasztorpisti commented May 27, 2019 •

edited

Loading

Support unicode fuzzy search #18

Support unicode fuzzy search #18

Comments

lewispham commented Jan 17, 2018

meltuhamy commented Jun 12, 2018

pasztorpisti commented May 27, 2019 • edited Loading

pasztorpisti commented May 27, 2019 •

edited

Loading