Scoring algorithm needs improvement #30
I'm open to the idea, but it depends on the implementation complexity and performance. I actually tried this once, and it worked, but it was a huge drag on performance. I may have just done a bad job. To get high match quality, Selecta finds all matching substrings. The process is: find each occurrence of the first match character, then from each of those find the closest sequence of the rest of the match. This is cheap, because it's just a series of string `index` calls. This issue's proposed change complicates that: it's no longer a bunch of simple `index` calls. Each scoring feature also makes the others more difficult to implement, and several scoring enhancements have been proposed.
I suspect that at most one of those will go in, just because the scoring code will end up being hundreds of lines otherwise. Assuming that you've been using Selecta as-is for the last month, do you still miss this feature? In my own use, I write a lot of queries like "libfor"; did you end up doing that, or something else? |
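(For concreteness, here's a minimal sketch of that "series of `index` calls" matching, inferred from the description above rather than taken from Selecta's source; the names and structure are mine.)

```ruby
# Classic scoring sketch: for each occurrence of the first query character,
# greedily walk the remaining characters with String#index and record the
# length of the substring that was spanned. The score is the minimum length.
def match_length(choice, query_chars)
  first, *rest = query_chars
  lengths = []
  start = -1
  while (start = choice.index(first, start + 1))
    last = start
    found = rest.all? { |char| last = choice.index(char, last + 1) }
    lengths << last - start + 1 if found
  end
  lengths.min # nil when the choice doesn't match at all
end

match_length("f-o-o-foo", "foo".chars)          # => 3
match_length("app/models/user.rb", "amu".chars) # => 12
```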
Maybe it doesn't have to be that complicated. What about matching the entire input on word boundaries rather than every character? Example: I'm using Selecta to switch projects and I want to switch to my "intranet" project. After typing "in", Selecta shows several other projects above "intranet".
I know why it is scoring this way, but it seems counter-intuitive. |
@sos4nt That would prevent some common usage patterns, like typing "usspec" for "user_spec". I agree that "intranet" should show up first in your example, but I think it needs to be by scoring higher, not by filtering. |
That's what I meant: score it higher when the input occurs at the beginning of a word (at a boundary). |
Ahh, yeah, that's @supermarin's suggestion too. :) |
Have you considered not scoring when there are thousands of matches? I wrote probe, a pure VimL fuzzy file finder for vim. Finding matches is cheap because I can just do a regex match against each candidate. For probe's purposes I considered it essential to give scoring bonuses to characters after word boundaries, relegating me to iterating over the characters in the matches. I've been very happy with probe's scoring behaviour: it gives a scoring advantage when characters occur first or last in the match, or immediately follow another query character or a path separator (a word boundary, in Selecta's case). Finally, one thing that I miss in Selecta that I use frequently in probe is matching ordered substrings when there are spaces in the query, which gives a lot more selectivity without much more typing. |
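(A tiny sketch of that space-separated idea as I understand it — not probe's actual implementation: each space-separated term must appear as a literal substring, in order.)

```ruby
# Returns true when every space-separated term of the query occurs in the
# line, in the given order, as a plain (case-insensitive) substring.
def ordered_substring_match?(line, query)
  terms = query.split(" ").map { |term| Regexp.escape(term) }
  regex = Regexp.new(terms.join(".*"), Regexp::IGNORECASE)
  !(line =~ regex).nil?
end

ordered_substring_match?("app/models/user_spec.rb", "model spec") # => true
ordered_substring_match?("spec/models/user.rb", "model spec")     # => false
```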
@torbiak, punting when there are many matches is tempting. However, that wouldn't be very good for large sets of files. For example, I sometimes use Selecta in my ~/proj directory, which has 78k files in it spanning dozens of projects. It'd be nice to go into there, type "amu" into Selecta, and have all of the `app/models/user.rb` files show up at the top. |
Over in #66, @airblade asked what algorithm I'd like. For background: Selecta currently finds the shortest substring that contains the query characters in order, and the length of that substring is the score. For example, if you type the query "foo", then "f-o-o-foo" will score 3, not 5. If we naively found the first matching substring, we'd get "f-o-o". Here's a simplified version of the code:

```ruby
def compute_match_length(string, chars)
  first_char, *rest = chars
  first_indexes = find_char_in_string(string, first_char)
  first_indexes.map do |first_index|
    last_index = find_end_of_match(string, rest, first_index)
    last_index - first_index + 1
  end.min
end
```

This is theoretically O(n^2), n being the size of the input. However, it's quite fast in practice because the first character of the query will show up at most a few times.

This thread began when @supermarin proposed that characters at word boundaries should be privileged, so that e.g. "amu" will score very highly on "app/models/user.rb". That seemed like a win to me, so I started implementing it. My scoring strategy was: a character at a word boundary scores as if it were adjacent to the previous character. For example, "amu" scores 3 on "app/models/user.rb", rather than 12, which is the actual length of the matching substring "app/models/u".

The problem is that we can't simply start at each occurrence of the first letter any more. Imagine matching the query "abc" against "a/bxc/c". If we naively find the characters in order, we'll match the substring "a/bxc", scoring 4 (the "/" between "a" and "b" doesn't contribute to the score because "b" is at a word boundary). But the ideal substring match is actually the full string, "a/bxc/c", which scores 3: 1 for the "a", 1 for the "b" at a word boundary, and 1 for the "c" at a word boundary. To do this right, we have to recurse twice on every single matching character: once on the next occurrence of the character alone, and again on each following occurrence of the character at a word boundary.

It may seem like finding the best possible score isn't important. But when I query for "amu" in my ~/proj directory, with dozens of projects and 79k files, I want all of the "app/models/user.rb"s at the top. Many of those paths will have other "amu" sequences in them, so we really do need to find the best substring.

I've gone through at least a dozen potential optimizations: pre-filtering the list, caching intermediate results, etc. Some of them help, but I've never gotten it to be fast enough while still finding the optimal match. Most "optimizations" that I tried required doing character-by-character analysis and/or a lot of small allocations, both of which are dog-slow in Ruby. I suspect that a dynamic programming solution with global caching of intermediate results would work very well in other languages, but I didn't even pursue it because it would involve both per-score allocations and a lot of character-by-character matching.

If you want to give it a try, there's a decent set of microbenchmarks to help you. Clone readygo, then record the current performance so you have a baseline to compare against.

I've mostly resigned myself to rewriting Selecta in a faster, static language, probably Rust, to make this fast enough. However, there may still be a clever solution in Ruby that I haven't seen, and if someone can find it then I'll be happy to avoid the rewrite. An algorithm redesign will probably be localized to the Score class, which is only 62 lines long right now. |
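(To make the "recurse twice on every single matching character" point concrete, here's a rough recursive sketch of that boundary-aware search. It's my own illustration, not the scoring_redesign code, and the boundary character set is assumed. It reproduces the scores described above, and its branching on every boundary occurrence is exactly the combinatorial cost being discussed.)

```ruby
BOUNDARY_CHARS = %r{[/_\-. ]}

def boundary?(string, index)
  index.zero? || string[index - 1] =~ BOUNDARY_CHARS
end

# Cheapest way to place `chars` at or after `position`, given that the
# previously matched character sat at position - 1. Returns nil on no match.
def best_tail(string, chars, position)
  return 0 if chars.empty?

  char, *rest = chars
  nearest = string.index(char, position)
  return nil if nearest.nil?

  candidates = []

  # Candidate 1: the nearest occurrence, costing the gap it adds
  # (or just 1 if it happens to sit at a word boundary).
  cost = boundary?(string, nearest) ? 1 : nearest - position + 1
  tail = best_tail(string, rest, nearest + 1)
  candidates << cost + tail if tail

  # Candidates 2..n: every later occurrence that sits at a word boundary.
  i = nearest
  while (i = string.index(char, i + 1))
    next unless boundary?(string, i)
    tail = best_tail(string, rest, i + 1)
    candidates << 1 + tail if tail
  end

  candidates.min
end

def boundary_score(string, query)
  return 0 if query.empty?
  chars = query.chars
  scores = []
  i = -1
  while (i = string.index(chars.first, i + 1))
    tail = best_tail(string, chars.drop(1), i + 1)
    scores << 1 + tail if tail
  end
  scores.min
end

boundary_score("app/models/user.rb", "amu") # => 3
boundary_score("a/bxc/c", "abc")            # => 3 (the naive nearest match scores 4)
```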
Thanks for the write-up.
|
Total armchair engineering here, but it seems like character-by-character analysis is the way to go (unless there is a clever way to apply regular expressions). Since you don't want to have tons of allocations, maybe you could match against the numeric character value. Numbers shouldn't be allocations. |
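(A small sketch of that suggestion, with made-up method names: convert the choice and query to integer codepoints once, up front, so the hot loop compares Integers instead of allocating one-character Strings.)

```ruby
def first_char_positions(choice, query)
  choice_codes = choice.downcase.codepoints  # one Array of Integers per choice
  first_code   = query.downcase.codepoints.first

  positions = []
  choice_codes.each_with_index do |code, i|
    positions << i if code == first_code     # pure Integer comparison, no new Strings
  end
  positions
end

first_char_positions("app/models/user.rb", "amu") # => [0]
```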
Aaron, I hadn't thought of converting everything to character numbers ahead of time! Now I feel dumb. I'll think about how to combine that with a dynamic programming solution to avoid repeated matching on thousands of substrings. Thanks! |
@garybernhardt np! This is a really interesting problem, thanks for writing it up. I'll think about it more too. |
Great, let me know if you come up with anything! |
I haven't put too much thought into it yet, but for a dynamic programming solution, would you need to remember for each char a list of tuples (substr length, min points)? It seems like that plus @tenderlove's suggestion of ints (brilliant) would work out. You could allocate everything ahead of time too if you needed to because you know the max size of your cache would be 2 * str length * substr length. |
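(Here's a rough sketch of that pre-allocation idea, sized from the two lengths up front. It stores one integer per choice-char/query-char cell plus a small per-column array rather than the exact 2 × str length × substr length layout suggested above, and it computes only the classic no-boundary score. Names and layout are mine.)

```ruby
# starts[i * m + j] = largest start s such that query[0..j] matches inside
# choice[s..i] with query[j] matched exactly at choice[i]; -1 means no match.
def classic_score_preallocated(choice, query)
  c, q = choice.codepoints, query.codepoints
  n, m = c.length, q.length
  starts    = Array.new(n * m, -1)  # single up-front allocation
  best_prev = Array.new(m, -1)      # best start per query index over rows < i
  best = nil

  n.times do |i|
    m.times do |j|
      next unless c[i] == q[j]
      start = j.zero? ? i : best_prev[j - 1]
      next if start == -1
      starts[i * m + j] = start
      best = [best, i - start + 1].compact.min if j == m - 1
    end
    m.times { |j| best_prev[j] = [best_prev[j], starts[i * m + j]].max }
  end

  best # nil when the query doesn't match
end

classic_score_preallocated("f-o-o-foo", "foo") # => 3
```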
Light Table uses an ugly but effective solution.
It seems to work fairly well even though the code is a mess and could be heavily improved. Tonight I might try cleaning it up and running it on your benchmarks. |
@garybernhardt with the string |
@aaronjensen that would score 2; the algorithm just sees "a[word boundary]c" |
One other approach would be to convert to a DAG and shortest path it. |
Hmm, on second thought the fact that it's multiple starting points and multiple destinations may make that tricky/wrong. Nevermind. |
Would be a lot easier to get motivated with a set of inputs / outputs — something like a table of query/choice pairs and their expected scores.
|
@JoshCheek not sure I'm getting this right, but why would |
There are specs in the repo for the current scoring; see also the scoring_redesign branch. Edit: see clarification from Gary below! |
Naive attempt at dynamic programming led to way too many object allocations, because I had to track an `[index, score, distance]` array for every partial match:

```ruby
def compute_match_length2(choice, query_chars)
  query_length = query_chars.length
  query_last_index = query_length - 1
  # cache :: [[index, score, distance]]
  cache = []
  choice.each_char.each_with_index do |choice_char, choice_index|
    cache.each do |data|
      index, score, distance = data
      if index == query_last_index
        return score if score == query_length
        next
      end
      if choice_char == query_chars[index + 1]
        data[0] += 1
        data[1] += distance + 1
        data[2] = 0
      else
        data[2] += 1
      end
    end
    if choice_char == query_chars[0]
      cache << [0, 1, 0]
    end
  end
  cache.select { |index, _, _| index == query_last_index }
       .map { |_, score, _| score }.min
end
```
|
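(For reference, hand-tracing the sketch above gives back the classic scores quoted earlier in the thread, since it has no boundary bonus yet; these are my traced numbers, not output from any branch.)

```ruby
compute_match_length2("f-o-o-foo", "foo".chars)          # => 3
compute_match_length2("app/models/user.rb", "amu".chars) # => 12, not 3: no boundary bonus
```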
@mcwumbly, That pending spec was for other behavior that was discussed at one point. I just removed it to avoid confusion. You're right about the scoring_redesign branch's specs, though; those are useful. Specifically, this bit, which isn't on master. (I changed it slightly here; there are pending specs on that branch and some of them aren't very clear.)

```ruby
describe "at word boundaries" do
  it "doesn't score characters before a match at a word boundary" do
    score("fooxbar", "foobar").should == 7
    score("foo-x-bar", "foobar").should == 6
  end

  it "finds optimal non-boundary matches when boundary matches are present" do
    # The "xay" matches in both cases because it's shorter than "xaa-aay"
    # even considering the latter's boundary bonus.
    score("xayz/x-yaaz", "xyz").should == 4
    score("x-yaaz/xayz", "xyz").should == 4
  end

  it "finds optimal boundary matches when non-boundary matches are present" do
    score("x-yaz/xaaaayz", "xyz").should == 4
    score("xaaaayz/x-yaz", "xyz").should == 4
  end
end
```

Off the top of my head, here are some more assertions that might trigger corner cases depending on the implementation:

```ruby
score("x-x-y-y-z-z", "xyz").should == 3
score("xayazxayz", "xyz").should == 4
score("x/yaaaz/yz", "xyz").should == 3
```

That last assertion will fail even on the (old, outdated) scoring_redesign branch. That shows that even that gnarly code isn't a general solution, because it only finds the nearest occurrence of the next character, with and without a boundary. The truly general solution has to consider (1) the closest occurrence of the character, whether that's at a boundary or not, and (2) all following boundary occurrences of the character.

Finally, here's a corrected version of Josh's table. All of this is covered by the existing specs plus those above, so there's no need to test these cases; they might be useful for understanding the behavior, though.
|
Ok, switched to using |
I've been experimenting with a fork. I doubt the following is going to be a very popular idea, as it would be a bit of a left turn for this codebase to take, but I've enjoyed thinking about it this evening: I wonder if the search/scoring could be split out from the terminal interface into its own command. A default command would keep things the way they are today, and splitting things up also has the added bonus of making it easy to switch the search algorithm to a different, more performant language, while not losing the ease with which the interface is written in Ruby.

[1]: A brief summary is that it never searches across path parts (maybe like the "word boundaries" @garybernhardt was talking about above?) unless a slash (or a space, for ease of typing) is specified. |
@garybernhardt - this passes the specs on
|
(to run the specs, just to |
Convincing rspec to use ruby2.1 was fun... |
I'm probably just going to play with writing it in rust rather than beat my head against ruby. Sorry :) |
I'd love to read a version of the algorithm in Rust if you write it! I've tried to start learning Rust twice and given up in frustration both times. |
Yeah, it's definitely not optimised for ease of use, but I have another project that is starting to want control over memory layout so I'm more incentivised to push through it. |
Yeah, Selecta has basically been that force for me. I assume that I'll eventually push hard enough on it to get over this initial hump. |
Ok, I think I just ran into the same problem you did where |
Or maybe I was just editing the wrong file. Embarrassing. |
This is eerily on par with the js version.
|
It was pretty pleasant so far, apart from the repeat.take.collect nonsense. The compiler errors were mostly very helpful and often suggested the correct fix. Having benchmarking and testing tools built in is pretty nice. |
I would like to see a flymake mode that can give me type/lifetime info in the editor. |
With some help from reddit it now handles unicode correctly and is maybe 30-40% faster than js. There were some gotchas on the way though and not having a debugger or repl was pretty frustrating. On the other hand, callgrind is a lot better than the chrome profiler, which seems to get confused by its own jit a lot. |
I also took a stab at a dynamic programming based solution in JS. Here is the matching function implementation, and there is also a proof-of-concept hacky program (similar to selecta) to play around with. Performance is comparable to selecta (slower under node 0.10, about the same in iojs master), which means that it should be really fast when written in Rust (and it will probably handle Unicode much better; this version is only case-insensitive in ASCII). The main goal was to make a prototype where the scoring function is really easy to tweak and experiment with. Currently it favors matching characters after separators (or at the beginning), as well as multiple consecutive characters, but it's quite easy to use very different scoring schemes. It's based on a gain function that takes the query, the line, and their two index positions, and returns a score that determines how well they match at that position. |
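(To make that design concrete, here's a small Ruby sketch of a gain-function scorer. It's my reading of the description above, not spion's JS implementation; the separator set and bonus values are invented, and higher is better here, unlike Selecta's length-based score.)

```ruby
SEPARATORS = ["/", "-", "_", ".", " "].freeze

def gain(line, i, consecutive)
  return 3 if i.zero? || SEPARATORS.include?(line[i - 1]) # boundary / start-of-line bonus
  return 2 if consecutive                                 # consecutive-run bonus
  1
end

def gain_score(line, query)
  return 0 if query.empty?

  prev = nil
  query.each_char.with_index do |qc, j|
    curr = Array.new(line.length)
    line.each_char.with_index do |lc, i|
      next unless lc.downcase == qc.downcase # ASCII-ish case folding, as noted above
      if j.zero?
        curr[i] = gain(line, i, false)
      else
        candidates = []
        best_before = prev[0...i].compact.max                           # query[j-1] matched before i
        candidates << best_before + gain(line, i, false) if best_before
        candidates << prev[i - 1] + gain(line, i, true) if i > 0 && prev[i - 1]
        curr[i] = candidates.max unless candidates.empty?
      end
    end
    prev = curr
  end

  prev.compact.max # nil when the query doesn't match at all
end

gain_score("app/models/user.rb", "amu") # => 9 (three boundary matches)
```

Because the gain is isolated in one function, trying a very different scheme (for example, a progressive bonus for longer runs) only means changing `gain`.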
@spion How do I run that?
|
Oops, forgot to add argument checking and usage. Fixed. That's the non-interactive, one-shot version. |
@spion The results don't seem right:
|
That's because of how the gain function currently scores those two results. If I adjust the gain, the ordering changes. Originally I thought about a progressive gain the more letters in a sequence are matched, but that may be a bit slow to check. Edit: I just pushed a version that tweaks this. |
Since a Rust rewrite has come up a few times in this thread, I'll leave a link to my Rust rewrite of Selecta, called Heatseeker: https://github.com/rschmitt/heatseeker. It's a fully functional Selecta clone that is tested and working on OS X, Linux, and Windows. It currently only implements the classic Selecta scoring algorithm (I haven't yet looked at the scoring_redesign_4 branch). |
The scoring_redesign_4 branch that @rschmitt mentioned is pushed to GitHub. It (1) is faster than master, (2) consumes multiple input characters at a time, which makes it feel much faster overall, (3) has an algorithm that privileges word boundaries and sequential runs of matching characters, and (4) highlights the matching substring in the UI. I'd love to hear what people think about it, especially the scoring algorithm changes. |
Does consuming multiple input characters at a time make it possible to support <ESC> as a way to exit Selecta without making a selection? (If not, I would suggest Ctrl-G for that functionality, by analogy with emacs.) |
The speed of scoring_redesign_4 is awesome, by the way. I've already implemented some of the same optimizations in Heatseeker, but I'm starting to feel like I'm making a faster ballpoint pen. I'll experiment with the new scoring algorithm later this week; I'd like to see what it considers a "word boundary," and how it works with different projects in different languages. |
@rschmitt I don't think that this change will make ^G any easier or harder. Selecta used to support ESC as a way to exit with status 0 (as opposed to ^C, which exits with status 1). I backed that out because it got confused by escape codes, like arrow keys. It wouldn't be terribly hard to re-add it with another keystroke, but I'm skeptical of its utility. When I think about use cases for Selecta exiting with status 0 but printing nothing, they always end up involving a downstream conditional to check whether the output was empty. If that conditional exists, it may as well be a check for $? instead, as I see it. |
My understanding is that it was hard to distinguish <ESC> (as an actual keypress) from escape codes, because when you got an ESC you would have to wait insert-arbitrary-duration-here in order to disambiguate one from the other. But if you're now consuming multiple input characters at a time, does that make it any easier? |
Right, distinguishing ESC from control codes was the problem. But we'd really need an escape code parser to do that well. An ESC keypress and the start of an escape sequence look the same until more bytes arrive (or don't). |
In scoring_redesign_4, if I wanted to boost the weighting given to boundary characters, or sequential runs of matching characters, how would I do it? As far as I understand the current algorithm, lower scores are better and a boundary or sequential character adds 1 to the score instead of the subsequence length. Must scores be integers or could one add, say, 0.5 on a boundary character to double its impact? |
I've pulled master and the scoring_redesign_4 branch and compared them throughout the day. |
I've been working on a port of It handles |
The new algorithm is now merged to master. If you try it, please let me know what you think over in #80. |
Beginning of words should get ranked before the ones containing a letter. For example, typing `f` shows:

```
Gemfile
some_other_file
lib/xcpretty/formatter.rb
```

formatter.rb should appear on top of the list. If it means anything, this is how TextMate's CMD+T behaves.