Use LCS to Rank Matches #20

walles · 2015-07-06T13:31:31Z

I suggest using LCS to rank matches:
https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Solution_for_two_sequences

Matches with lower number of substrings would come before matches with higher number of substrings.

This would be a lot more generic than the substring() based approach suggested in some PRs.

If you want to play with this, here's an online thing demonstrating the concept (I didn't write this, just found it):
http://lcs-demo.sourceforge.net/

Start by upping the "Max Size" a bit, then type in two strings and press "Execute LCS Lengths" to see how it works.

One real-world use case where this would come in handy is when searching for "git push", note how it's third place even though "git" and "push" are exact matches. Single-word matches would be caught by this as well.

plrenaudin · 2015-07-06T13:55:45Z

Interesting idea. Found the algo for CoffeeScript: http://rosettacode.org/wiki/Longest_common_subsequence#CoffeeScript

Will try to implement it in my PR

jeancroy · 2015-07-06T16:01:21Z

LCS alone will not help for the order of item in the screenshot.

We have LCS("git push","git push") = "git push"
But also LCS("git push","git plus hunk") = "git push"

What you need is to normalize matches against the length of both string. I myself have a project using LCS and use something like this:

m = Length of LCS(a,b)
lena = a.length
lenb = b.length
score = m*(m/lena+ m/lenb)

Note that you can probably use current algorithm to calculate m too. This is because current algorithm looks like a greedy approximation of LCS, with special score depending of the nature of the character.
By greedy i mean score("hzello","helloz") will match the z because it's in both sub-string but will miss the ello part doing so. LCS will skip the z to match ello

If you want to see behavior of above scoring among other things, I have a demo page

plrenaudin · 2015-07-06T20:49:48Z

Tried the LCS, it's way too slow.

Filtering 66672 entries for 'index' took 6557ms for 6168 results
Matching 66672 entries for 'index' took 237ms for 6168 results

Instead of

Filtering 66672 entries for 'index' took 267ms for 6168 results
Matching 66672 entries for 'index' took 218ms for 6168 results

Maybe I'm missing something...

I also added a test in my PR for the use case above.
The issue is with using 2 words: "push" retrieve the correct result (see test) but "git push" doesn't 😞

walles · 2015-07-07T05:17:29Z

Well spotted @jeancroy, I see now that LCS won't handle my use case.

Your scoring function can be greatly simplified into "prefer shorter hits" if used to compare already-filtered hits. Since the hits are already filtered, we actually already have the LCS, and:

m = the length of the LCS = constant between all options
lena = the length of the LCS = m = constant between all options
lenb = the only thing varying = the length of the candidate string

So if we replace lena with m, we get...
score = m * (m/m + m/lenb) = m * (1 + m / lenb) = m + m^2/lenb
... and since m is the same between all candidates its value won't affect comparisons and we can just as well replace with a constant of our choosing (I choose 1)...
score = 1 * (1 + 1^2/lenb) = 1 + 1/lenb
... and since adding the same constant to all values we compare doesn't change anything we can remove that too, giving us...
score = 1 / lenb
... or in other words prefer shorter hits.

Which I think sounds just fine actually :)

walles · 2015-07-07T05:35:23Z

@plrenaudin, thanks for trying this and benchmarking it!

After the above discussion however I'm not sure this is actually the way to go any more, and I'm not convinced about pure substring matches either.

Real-world example, searching for install using cmd-shift-p gives me:

A substring match would put "Find And Replace: Select All" at the bottom (good), but it would also consider "uninstall", "install" and "installed" as equally good (bad).

So I'm starting to think that the way to go about this is to:

Collect some real world use cases to support. I suggest "git push" and "install". Some people seem to want "cccn" to match "CamelCasedClassName.java" as well (I don't care personally) or "gaa" to match "Git Plus: Add All". Implementor's choice really :).
Think of how a human would sort them.
Write specs based on that. Make sure the specs contain the full list of hits for each use case.
Implement something that makes all the new specs pass.

That's probably better than starting with a solution like I suggested here initially.

plrenaudin · 2015-07-07T08:12:14Z

You're right, People would also want to score starting substring with a higher score...

I'd like to have similar suggestion as IntelliJ or Sublime text but it would probably require to rewrite the whole scoring system.

We really need improvement on this because for projects with lots of files, the pertinence is just so bad that it's rare to have what we search for in first result.

jeancroy · 2015-07-07T14:20:46Z

@walles the part you simplified out is actually important for accuracy should you use LCS.
match("uni","university") should be better than match("unicorn","university")

Now I because your scoring function have a no typo policy I'd agree with your simplification.

jeancroy · 2015-07-07T21:03:50Z

I have put something together that I believe account for some of the issues mentioned.

Use special character to split query / subject into tokens. Match only token against token.
- This will prevent to match across tokens push will not fully match plus husk. Same for directory, SomeDir will not match Awsome/David_R
Give bonus to prefix, applicable for each token.
- install will match prefix for second token of git install but will not match prefix of git uninstall
If you are still interested in using LCS, I have attached an implementation that should be faster that what @plrenaudin tested (With also the token stuff mentioned above).

https://gist.github.com/jeancroy/a6a7f59db14de85b2b97

plrenaudin · 2015-07-07T22:19:46Z

@jeancroy Thanks for the idea!
I like the first point (match the tokens instead of the entire query).
Just tried to change the actual score function by your Gist but lots of tests are failling and benchamrks result is:

Filtering 66672 entries for 'index' took 1322ms for 2040 results
Matching 66672 entries for 'index' took 250ms for 2040 results
Results count changed! 2040 instead of 6168

It's missing some results so even if it worked, it would still be quite slow compared to the actual implementation.

But there is definitely a potential improvement in the tokenizing of the query / subject.

jeancroy · 2015-07-07T22:40:20Z

@plrenaudin can I play with your implementation of that benchmark? It should not miss an "index" so i'm curious to debug that, also query is not meant to be re-computed each time.

Also open question ... Why do this project score special character higher ? I mean any character of query not in subject squash the score to 0. Isn't not being squashed to 0 a good enough outcome ? Also every result not being 0 will have every char including every special char and their bonus, so it doesn't help the sort very much.

plrenaudin · 2015-07-07T22:49:17Z

@jeancroy My "implementation" is just to change the score function by your gist, nothing more nothing less. This is where the magic should happen and the score put the relevant results in front of the rest.

I'm sure you could improve thist but it looks like the algo has much more complexity than the one in place so having a slower benchmark is not that surprising. I personally wouldn't mind having a slower fuzzy algo for more pertinence though.

jeancroy · 2015-07-08T03:01:19Z

@plrenaudin sorry I did not meant you have changed something in a wrong way. Just I never played with coffeescript / node only before today so I didn't know where to start. I corrected something in my code.

Filtering 66672 entries for 'index' took 877ms for 2040 results
Matching 66672 entries for 'index' took 307ms for 2040 results
Results count changed! 2040 instead of 6168

I'm also OK with matches that are missed. I believe any token implementation (idea 1) will miss them.

index vs
./node_modules/npm/node_modules/path-is-inside/LICENSE.txt

and some more

plrenaudin · 2015-07-08T08:07:35Z

@jeancroy You didn't change anything in a wrong way, it just requires more changes to match the expected results of tests.
I'm eager to see it fully compliant with the tests.
When you look at the specs, most of them make sense so it is a good way to see if your algo is compatible. it just misses the tests about the "git push" query situation. (and maybe also a more complete test about CamelCase query)

It bothers me to see that for such an important feature there is so few people concerned in improving it.

thomasjo · 2015-07-08T08:26:00Z

@plrenaudin This issue has many "lurkers", I assure you. But I guess many, like myself, are very busy with other stuff. Wish I had time to invest in improving this...

thomasjo · 2015-07-08T08:26:53Z

I applaud the effort put into this so far by the way. Love it when the community (even if only a handful of people) come together like this ❤️

lee-dohm · 2015-07-08T15:08:27Z

I just got off of a work bender ... pushing for release after my vacation time the last couple weeks. I'm still trying to catch up on all the work you all are doing. One thing I'd like to note though ...

I'd like to see more testing of "fuzzy" matches, just to make sure that all bases are covered, such as:

This is the way a lot of people search for things, not when they don't know what they're looking for ... but when they do. As another example, when I need to set the grammar of a file by hand, I use my set-syntax package which allows the Command Palette to mimic how I did things in Sublime:

plrenaudin · 2015-07-08T15:36:27Z

The question then is: should ssrb match "Set Syntax: Ruby" or a class called "SsrbController" for example?
I think at some point there will have to be some controversy about which result should score better than the other.

lee-dohm · 2015-07-08T16:00:56Z

Definitely there will be. I just didn't want things getting completely optimized for whole or partial words when scattered single-character matches are also an important use case.

plrenaudin · 2015-07-08T16:02:52Z

Indeed, I completely agree with the use case as for the one with CamelCaseFiles

jeancroy · 2015-07-08T20:48:13Z

Ok a new day, a new algorithm :)

This one is based on the idea of splitting the speed/accuracy trade off in two.

Quickly reject non-match
Take the time it takes to do an accurate scoring on good candidates.

For 1, i simplified current algorithm will all score = 1. Output is match yes/no.. This mean new algorithm report same positive result than original.

For 2, i used a local sequence alignment algorithm with afine gap penalty (Smith Waterman / Gotoh). I also used the measure of similarity between characters to implement lower/upper case bonus, CamelCase and This is an acronym

What this mean in layman term

a. We get a lot of knob to play with.
b. Algorithm will try to find the combination of matches that give the best score.
c. I'm currently using that algorithm to highlight matches on another project, we we decide to use this as a scorer, I know it's relatively easy to make the matcher.

@plrenaudin

I've downloaded your updated filter.spec. Current proposal pass all test but 4 "base path" ones. It's also faster on benchmark than last proposal mostly due to quick exit.

https://gist.github.com/jeancroy/a37b701659658f2529bf

plrenaudin · 2015-07-08T21:08:20Z

Nicely done @jeancroy
Glad we are moving in a good direction!
Can you test the ordering of the results?
Something like:
https://gist.github.com/plrenaudin/c8dace4edeb5ea58c949

jeancroy · 2015-07-08T22:54:57Z

added test, it pass.
0: "install"
1: "Application: Install Update"
2: "Settings View: Uninstall Packages"
3: "Find And Replace: Selet All"

jeancroy · 2015-07-09T00:12:44Z

Ok this version pass the whole suite.
https://gist.github.com/jeancroy/a37b701659658f2529bf

Anyone has any other test case ?

Once we have cleared some other test/ideas I might do a PR.

walles · 2015-07-09T07:56:49Z

@jeancroy, I'd like to have "Settings View: View Installed Themes" added to the "install" test.

plrenaudin · 2015-07-09T08:03:05Z

@jeancroy congrats! I think it covers already a good bunch of them now.

jeancroy · 2015-07-09T12:56:52Z

@walles it now rank like so, make sens?

0: "install"
1: "Application: Install Update"
2: "Settings View: Uninstall Packages"
3: "Settings View: View Installed Themes"
4: "Find And Replace: Selet All"

I believe the rule that come into play is both "rank start of match higher" and "prefer smaller subject", there will be some point for "installed" because the proper letter "i" follow a token separator (space)

plrenaudin · 2015-07-09T13:01:38Z

That's a question about exact case substring score vs start of the word score.
Personally I would prefer a better score for the former one but it's already better that what we currently have.
Would it be possible to have some benchmark to compare?

jeancroy · 2015-07-09T13:04:51Z

note that because the install test is made like so
result = filter(candidates, 'install');

Uninstall have the exact case but not Installed

jeancroy · 2015-07-09T13:06:46Z

Filtering 66672 entries for 'index' took 658ms for 6168 results
Matching 66672 entries for 'index' took 253ms for 6168 results

plrenaudin · 2015-07-09T13:08:17Z

Yes, what I mean is the "exact case substring" score should not beat the "ignore case substring at start of the word" score. But that's totally personal, I'm already fine with the result 😃

Great stuff for the benchmark!

walles · 2015-07-09T14:49:48Z

@jeancroy I think your last list seems splendid!

Also I think that a PR wouldn't need to solve everything, so as long as the "installed" test is in the PR I think it's a good (new) starting point for making more refinements from.

PR!

jeancroy · 2015-07-09T15:01:54Z

What do I do with credit from https://github.com/joshaven/string_score/ ?
I'm not using any of this anymore. Basically the whole scorer.coffee is rewritten

plrenaudin · 2015-07-09T15:03:31Z

I assume it's safe to remove it as for the line (https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12) if tests passes afterward

jeancroy · 2015-07-11T14:38:20Z

@lee-dohm I considered your example. They are particular in the sens the "take first concurrence of the matching char after previous match" algorithm magically align with word boundary. By magically I mean, I'm pretty sure someone with experience can craft a query that match, but Sublime, Chrome dev tool and others software setup a higher expectation of user-friendliness .

Consider this example

match 'acon' in 'application':
<a>ppli<c>ati<o><n>

Scattered letter is the best we can do

match 'acon' in 'application_control':
<a>pplication_<c><o><n>trol

Here we have to undo 3 greedy matches to get intended result.
And it's in that case - where greedy doesn't align with bonus - that previous algorithm was giving non intuitive results.

For that I provide two mechanism:

initialism to boost score of <c>
gap open penalty:

let | be a match, # be a gap opening , - be a (in match) gap extension, . be a haystack gap

a#---c#--on
|    |   ||
application

so we have 2 opening 5 extension

a#----------con....
|           |||    
application_control

so we have 10 extension but only 1 opening

So basically:

Gap Open -> We prefer grouped characters to scatered
Gap Extend + Local Match -> we prefer the more concentrated match (less gap in the middle of the match)

So for me that is the main contribution of PR #22, to get extra information about the match structure. Once we have that information it's easier to fine-tune for particular cases.

But that information have a computing power cost, so we provide quick exit for low quality matches (same==0 rule as before) as well as trivial matches (for now this is exact substring and 100% camel case initialism)

To that I'll add that @plrenaudin wanted to nail the exact substring case and provided plenty of real life test case to the spec, his contribution became the base for the "exact substring" part. In addition to exact substring, CamelCase got a special handler to ensure we nail them.

Also @walles was very useful in asking the right questions, scouring issues for test cases, and providing a patch for a substantial speedup.

Lastly how does this relate to LCS ?
If we set gap penalty to 0 and match weight to 1, we fall back to LCS, so it's really a more flexible generalization.

jeancroy · 2015-07-11T14:53:34Z

I'll also add that for the purpose of CamelCase detection, case invariant character count as lowecase
" " == " ".toLowerCase(). So PR #22 should handle @lee-dohm screenshot without any problem

winstliu added the enhancement label Jul 7, 2015

jeancroy mentioned this issue Jul 9, 2015

Rewrite scoring algorithm to support run of consecutive character, fix acronyms and add optimal selection of character. #22

Closed

benogle mentioned this issue Jul 20, 2015

The Quality Initiative atom/atom#7995

Closed

mnquintana mentioned this issue Sep 8, 2015

Question: Is this package already used in Atom? #26

Closed

jeancroy mentioned this issue Nov 3, 2015

Can this be enabled for the command-palette? jeancroy/fuzz-aldrin-plus#5

Closed

peppy mentioned this issue Nov 3, 2015

Add support for alternate scoring (fuzzaldrin) atom/command-palette#59

Closed

Use LCS to Rank Matches #20

Use LCS to Rank Matches #20

Comments

walles commented Jul 6, 2015

plrenaudin commented Jul 6, 2015

jeancroy commented Jul 6, 2015

plrenaudin commented Jul 6, 2015

walles commented Jul 7, 2015

walles commented Jul 7, 2015

plrenaudin commented Jul 7, 2015

jeancroy commented Jul 7, 2015

jeancroy commented Jul 7, 2015

plrenaudin commented Jul 7, 2015

jeancroy commented Jul 7, 2015

plrenaudin commented Jul 7, 2015

jeancroy commented Jul 8, 2015

plrenaudin commented Jul 8, 2015

thomasjo commented Jul 8, 2015

thomasjo commented Jul 8, 2015

lee-dohm commented Jul 8, 2015

plrenaudin commented Jul 8, 2015

lee-dohm commented Jul 8, 2015

plrenaudin commented Jul 8, 2015

jeancroy commented Jul 8, 2015

plrenaudin commented Jul 8, 2015

jeancroy commented Jul 8, 2015

jeancroy commented Jul 9, 2015

walles commented Jul 9, 2015

plrenaudin commented Jul 9, 2015

jeancroy commented Jul 9, 2015

plrenaudin commented Jul 9, 2015

jeancroy commented Jul 9, 2015

jeancroy commented Jul 9, 2015

plrenaudin commented Jul 9, 2015

walles commented Jul 9, 2015

jeancroy commented Jul 9, 2015

plrenaudin commented Jul 9, 2015

jeancroy commented Jul 11, 2015

jeancroy commented Jul 11, 2015