Skip to content
This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Use LCS to Rank Matches #20

Open
walles opened this issue Jul 6, 2015 · 35 comments
Open

Use LCS to Rank Matches #20

walles opened this issue Jul 6, 2015 · 35 comments

Comments

@walles
Copy link

walles commented Jul 6, 2015

I suggest using LCS to rank matches:
https://en.wikipedia.org/wiki/Longest_common_subsequence_problem#Solution_for_two_sequences

Matches with lower number of substrings would come before matches with higher number of substrings.

This would be a lot more generic than the substring() based approach suggested in some PRs.

If you want to play with this, here's an online thing demonstrating the concept (I didn't write this, just found it):
http://lcs-demo.sourceforge.net/

Start by upping the "Max Size" a bit, then type in two strings and press "Execute LCS Lengths" to see how it works.

One real-world use case where this would come in handy is when searching for "git push", note how it's third place even though "git" and "push" are exact matches. Single-word matches would be caught by this as well.
gitpush

@plrenaudin
Copy link

Interesting idea. Found the algo for CoffeeScript: http://rosettacode.org/wiki/Longest_common_subsequence#CoffeeScript

Will try to implement it in my PR

@jeancroy
Copy link

jeancroy commented Jul 6, 2015

LCS alone will not help for the order of item in the screenshot.

We have LCS("git push","git push") = "git push"
But also LCS("git push","git plus hunk") = "git push"

What you need is to normalize matches against the length of both string. I myself have a project using LCS and use something like this:

m = Length of LCS(a,b)
lena = a.length
lenb = b.length
score = m*(m/lena+ m/lenb)

Note that you can probably use current algorithm to calculate m too. This is because current algorithm looks like a greedy approximation of LCS, with special score depending of the nature of the character.
By greedy i mean score("hzello","helloz") will match the z because it's in both sub-string but will miss the ello part doing so. LCS will skip the z to match ello

If you want to see behavior of above scoring among other things, I have a demo page

@plrenaudin
Copy link

Tried the LCS, it's way too slow.

Filtering 66672 entries for 'index' took 6557ms for 6168 results
Matching 66672 entries for 'index' took 237ms for 6168 results

Instead of

Filtering 66672 entries for 'index' took 267ms for 6168 results
Matching 66672 entries for 'index' took 218ms for 6168 results

Maybe I'm missing something...

I also added a test in my PR for the use case above.
The issue is with using 2 words: "push" retrieve the correct result (see test) but "git push" doesn't 😞

@walles
Copy link
Author

walles commented Jul 7, 2015

Well spotted @jeancroy, I see now that LCS won't handle my use case.

Your scoring function can be greatly simplified into "prefer shorter hits" if used to compare already-filtered hits. Since the hits are already filtered, we actually already have the LCS, and:

  • m = the length of the LCS = constant between all options
  • lena = the length of the LCS = m = constant between all options
  • lenb = the only thing varying = the length of the candidate string

So if we replace lena with m, we get...
score = m * (m/m + m/lenb) = m * (1 + m / lenb) = m + m^2/lenb
... and since m is the same between all candidates its value won't affect comparisons and we can just as well replace with a constant of our choosing (I choose 1)...
score = 1 * (1 + 1^2/lenb) = 1 + 1/lenb
... and since adding the same constant to all values we compare doesn't change anything we can remove that too, giving us...
score = 1 / lenb
... or in other words prefer shorter hits.

Which I think sounds just fine actually :)

@walles
Copy link
Author

walles commented Jul 7, 2015

@plrenaudin, thanks for trying this and benchmarking it!

After the above discussion however I'm not sure this is actually the way to go any more, and I'm not convinced about pure substring matches either.

Real-world example, searching for install using cmd-shift-p gives me:
findinstall

A substring match would put "Find And Replace: Select All" at the bottom (good), but it would also consider "uninstall", "install" and "installed" as equally good (bad).

So I'm starting to think that the way to go about this is to:

  • Collect some real world use cases to support. I suggest "git push" and "install". Some people seem to want "cccn" to match "CamelCasedClassName.java" as well (I don't care personally) or "gaa" to match "Git Plus: Add All". Implementor's choice really :).
  • Think of how a human would sort them.
  • Write specs based on that. Make sure the specs contain the full list of hits for each use case.
  • Implement something that makes all the new specs pass.

That's probably better than starting with a solution like I suggested here initially.

@plrenaudin
Copy link

You're right, People would also want to score starting substring with a higher score...

I'd like to have similar suggestion as IntelliJ or Sublime text but it would probably require to rewrite the whole scoring system.

We really need improvement on this because for projects with lots of files, the pertinence is just so bad that it's rare to have what we search for in first result.

@jeancroy
Copy link

jeancroy commented Jul 7, 2015

@walles the part you simplified out is actually important for accuracy should you use LCS.
match("uni","university") should be better than match("unicorn","university")

Now I because your scoring function have a no typo policy I'd agree with your simplification.

@jeancroy
Copy link

jeancroy commented Jul 7, 2015

I have put something together that I believe account for some of the issues mentioned.

  1. Use special character to split query / subject into tokens. Match only token against token.
    • This will prevent to match across tokens push will not fully match plus husk. Same for directory, SomeDir will not match Awsome/David_R
  2. Give bonus to prefix, applicable for each token.
    • install will match prefix for second token of git install but will not match prefix of git uninstall
  3. If you are still interested in using LCS, I have attached an implementation that should be faster that what @plrenaudin tested (With also the token stuff mentioned above).

https://gist.github.com/jeancroy/a6a7f59db14de85b2b97

@plrenaudin
Copy link

@jeancroy Thanks for the idea!
I like the first point (match the tokens instead of the entire query).
Just tried to change the actual score function by your Gist but lots of tests are failling and benchamrks result is:

Filtering 66672 entries for 'index' took 1322ms for 2040 results
Matching 66672 entries for 'index' took 250ms for 2040 results
Results count changed! 2040 instead of 6168

It's missing some results so even if it worked, it would still be quite slow compared to the actual implementation.

But there is definitely a potential improvement in the tokenizing of the query / subject.

@jeancroy
Copy link

jeancroy commented Jul 7, 2015

@plrenaudin can I play with your implementation of that benchmark? It should not miss an "index" so i'm curious to debug that, also query is not meant to be re-computed each time.


Also open question ... Why do this project score special character higher ? I mean any character of query not in subject squash the score to 0. Isn't not being squashed to 0 a good enough outcome ? Also every result not being 0 will have every char including every special char and their bonus, so it doesn't help the sort very much.

@plrenaudin
Copy link

@jeancroy My "implementation" is just to change the score function by your gist, nothing more nothing less. This is where the magic should happen and the score put the relevant results in front of the rest.

I'm sure you could improve thist but it looks like the algo has much more complexity than the one in place so having a slower benchmark is not that surprising. I personally wouldn't mind having a slower fuzzy algo for more pertinence though.

@jeancroy
Copy link

jeancroy commented Jul 8, 2015

@plrenaudin sorry I did not meant you have changed something in a wrong way. Just I never played with coffeescript / node only before today so I didn't know where to start. I corrected something in my code.

Filtering 66672 entries for 'index' took 877ms for 2040 results
Matching 66672 entries for 'index' took 307ms for 2040 results
Results count changed! 2040 instead of 6168

I'm also OK with matches that are missed. I believe any token implementation (idea 1) will miss them.

index vs
./node_modules/npm/node_modules/path-is-inside/LICENSE.txt

and some more

@plrenaudin
Copy link

@jeancroy You didn't change anything in a wrong way, it just requires more changes to match the expected results of tests.
I'm eager to see it fully compliant with the tests.
When you look at the specs, most of them make sense so it is a good way to see if your algo is compatible. it just misses the tests about the "git push" query situation. (and maybe also a more complete test about CamelCase query)

It bothers me to see that for such an important feature there is so few people concerned in improving it.

@thomasjo
Copy link

thomasjo commented Jul 8, 2015

@plrenaudin This issue has many "lurkers", I assure you. But I guess many, like myself, are very busy with other stuff. Wish I had time to invest in improving this...

@thomasjo
Copy link

thomasjo commented Jul 8, 2015

I applaud the effort put into this so far by the way. Love it when the community (even if only a handful of people) come together like this ❤️

@lee-dohm
Copy link

lee-dohm commented Jul 8, 2015

I just got off of a work bender ... pushing for release after my vacation time the last couple weeks. I'm still trying to catch up on all the work you all are doing. One thing I'd like to note though ...

I'd like to see more testing of "fuzzy" matches, just to make sure that all bases are covered, such as:

screen shot 2015-07-08 at 8 05 36 am

This is the way a lot of people search for things, not when they don't know what they're looking for ... but when they do. As another example, when I need to set the grammar of a file by hand, I use my set-syntax package which allows the Command Palette to mimic how I did things in Sublime:

screen shot 2015-07-08 at 8 06 38 am

@plrenaudin
Copy link

The question then is: should ssrb match "Set Syntax: Ruby" or a class called "SsrbController" for example?
I think at some point there will have to be some controversy about which result should score better than the other.

@lee-dohm
Copy link

lee-dohm commented Jul 8, 2015

Definitely there will be. I just didn't want things getting completely optimized for whole or partial words when scattered single-character matches are also an important use case.

@plrenaudin
Copy link

Indeed, I completely agree with the use case as for the one with CamelCaseFiles

@jeancroy
Copy link

jeancroy commented Jul 8, 2015

Ok a new day, a new algorithm :)

This one is based on the idea of splitting the speed/accuracy trade off in two.

  1. Quickly reject non-match
  2. Take the time it takes to do an accurate scoring on good candidates.

For 1, i simplified current algorithm will all score = 1. Output is match yes/no.. This mean new algorithm report same positive result than original.

For 2, i used a local sequence alignment algorithm with afine gap penalty (Smith Waterman / Gotoh). I also used the measure of similarity between characters to implement lower/upper case bonus, CamelCase and This is an acronym

What this mean in layman term

a. We get a lot of knob to play with.
b. Algorithm will try to find the combination of matches that give the best score.
c. I'm currently using that algorithm to highlight matches on another project, we we decide to use this as a scorer, I know it's relatively easy to make the matcher.

@plrenaudin

I've downloaded your updated filter.spec. Current proposal pass all test but 4 "base path" ones. It's also faster on benchmark than last proposal mostly due to quick exit.

https://gist.github.com/jeancroy/a37b701659658f2529bf

@plrenaudin
Copy link

Nicely done @jeancroy
Glad we are moving in a good direction!
Can you test the ordering of the results?
Something like:
https://gist.github.com/plrenaudin/c8dace4edeb5ea58c949

@jeancroy
Copy link

jeancroy commented Jul 8, 2015

added test, it pass.
0: "install"
1: "Application: Install Update"
2: "Settings View: Uninstall Packages"
3: "Find And Replace: Selet All"

@jeancroy
Copy link

jeancroy commented Jul 9, 2015

Ok this version pass the whole suite.
https://gist.github.com/jeancroy/a37b701659658f2529bf

Anyone has any other test case ?

Once we have cleared some other test/ideas I might do a PR.

@walles
Copy link
Author

walles commented Jul 9, 2015

@jeancroy, I'd like to have "Settings View: View Installed Themes" added to the "install" test.

@plrenaudin
Copy link

@jeancroy congrats! I think it covers already a good bunch of them now.

@jeancroy
Copy link

jeancroy commented Jul 9, 2015

@walles it now rank like so, make sens?

0: "install"
1: "Application: Install Update"
2: "Settings View: Uninstall Packages"
3: "Settings View: View Installed Themes"
4: "Find And Replace: Selet All"

I believe the rule that come into play is both "rank start of match higher" and "prefer smaller subject", there will be some point for "installed" because the proper letter "i" follow a token separator (space)

@plrenaudin
Copy link

That's a question about exact case substring score vs start of the word score.
Personally I would prefer a better score for the former one but it's already better that what we currently have.
Would it be possible to have some benchmark to compare?

@jeancroy
Copy link

jeancroy commented Jul 9, 2015

note that because the install test is made like so
result = filter(candidates, 'install');

Uninstall have the exact case but not Installed

@jeancroy
Copy link

jeancroy commented Jul 9, 2015

Filtering 66672 entries for 'index' took 658ms for 6168 results
Matching 66672 entries for 'index' took 253ms for 6168 results

@plrenaudin
Copy link

Yes, what I mean is the "exact case substring" score should not beat the "ignore case substring at start of the word" score. But that's totally personal, I'm already fine with the result 😃

Great stuff for the benchmark!

@walles
Copy link
Author

walles commented Jul 9, 2015

@jeancroy I think your last list seems splendid!

Also I think that a PR wouldn't need to solve everything, so as long as the "installed" test is in the PR I think it's a good (new) starting point for making more refinements from.

PR!

@jeancroy
Copy link

jeancroy commented Jul 9, 2015

What do I do with credit from https://github.com/joshaven/string_score/ ?
I'm not using any of this anymore. Basically the whole scorer.coffee is rewritten

@plrenaudin
Copy link

I assume it's safe to remove it as for the line (https://github.com/atom/fuzzaldrin/blob/master/src/fuzzaldrin.coffee#L12) if tests passes afterward

@jeancroy
Copy link

@lee-dohm I considered your example. They are particular in the sens the "take first concurrence of the matching char after previous match" algorithm magically align with word boundary. By magically I mean, I'm pretty sure someone with experience can craft a query that match, but Sublime, Chrome dev tool and others software setup a higher expectation of user-friendliness .

Consider this example

match 'acon' in 'application':
<a>ppli<c>ati<o><n>

Scattered letter is the best we can do

match 'acon' in 'application_control':
<a>pplication_<c><o><n>trol

Here we have to undo 3 greedy matches to get intended result.
And it's in that case - where greedy doesn't align with bonus - that previous algorithm was giving non intuitive results.

For that I provide two mechanism:

  • initialism to boost score of <c>
  • gap open penalty:

let | be a match, # be a gap opening , - be a (in match) gap extension, . be a haystack gap

a#---c#--on
|    |   ||
application

so we have 2 opening 5 extension

a#----------con....
|           |||    
application_control

so we have 10 extension but only 1 opening

So basically:

  • Gap Open -> We prefer grouped characters to scatered
  • Gap Extend + Local Match -> we prefer the more concentrated match (less gap in the middle of the match)

So for me that is the main contribution of PR #22, to get extra information about the match structure. Once we have that information it's easier to fine-tune for particular cases.

But that information have a computing power cost, so we provide quick exit for low quality matches (same==0 rule as before) as well as trivial matches (for now this is exact substring and 100% camel case initialism)


To that I'll add that @plrenaudin wanted to nail the exact substring case and provided plenty of real life test case to the spec, his contribution became the base for the "exact substring" part. In addition to exact substring, CamelCase got a special handler to ensure we nail them.

Also @walles was very useful in asking the right questions, scouring issues for test cases, and providing a patch for a substantial speedup.


Lastly how does this relate to LCS ?
If we set gap penalty to 0 and match weight to 1, we fall back to LCS, so it's really a more flexible generalization.

@jeancroy
Copy link

I'll also add that for the purpose of CamelCase detection, case invariant character count as lowecase
" " == " ".toLowerCase(). So PR #22 should handle @lee-dohm screenshot without any problem

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants