Scoring algorithm needs improvement #30

Closed
supermarin opened this issue Jan 19, 2014 · 103 comments

@supermarin

Matches at the beginning of a word should rank above matches that merely contain the letter.
For example:

  • searching with just the character f:
    • Given there's Gemfile, some_other_file, lib/xcpretty/formatter.rb
    • formatter.rb should appear at the top of the list

If it means anything, this is how TextMate's CMD+T behaves.

@garybernhardt
Owner

I'm open to the idea, but it depends on the implementation complexity and performance. I actually tried this once, and it worked, but it was a huge drag on performance. I may have just done a bad job.

To get high match quality, Selecta finds all matching substrings. The process is: find each occurrence of the first match character, then from each of those find the closest sequence of the rest of the match. This is cheap, because it's a series of String#index calls.
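
A minimal sketch of that inner step (illustrative only; the real code differs a bit), chasing each remaining query character forward with String#index:

    # Starting just past the first character's position, find the nearest
    # occurrence of each remaining query character in order. Returns the
    # index of the last matched character, or nil if the rest doesn't match.
    def find_end_of_match(string, rest_chars, first_index)
      last_index = first_index
      rest_chars.each do |char|
        last_index = string.index(char, last_index + 1)
        return nil if last_index.nil?
      end
      last_index
    end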

This issue's proposed change complicates this: it's no longer a bunch of String#index calls, or a simple "take the shortest match" process. It seems like you have to score every substring of the candidate that matches the input, even if it's super long. For example, if you search for "ff", you can't just take a "ffoo" match, because "some/dir/foo/folder" might score higher due to the word boundary bonus. This is what made my implementation slow, but I may have missed some obvious optimizations.

Each scoring feature also makes others more difficult to implement. Here are the main scoring enhancements that have been proposed:

  1. "Abc" only matches if "A" begins a word (there's an open PR for this)
  2. Beginning of word scores higher (this issue)
  3. "ab cd" does two non-overlapping searches, "ab" and "cd", but they don't have to be in order, so e.g. "card alibi" matches (I've been curious about this but haven't implemented it)

I suspect that at most one of those will go in, just because the scoring code will end up being hundreds of lines otherwise.

Assuming that you've been using Selecta as-is for the last month, do you still miss this feature? In my own use, I write a lot of queries like "libfor"; did you end up doing that, or something else?

@sos4nt
Contributor

sos4nt commented Mar 10, 2014

Maybe it doesn't have to be that complicated. What about matching the entire input on word boundaries rather than on every character?

Example: I'm using Selecta to switch projects and I want to switch to my "intranet" project. After typing "in" Selecta shows:

> in
bowling
sinatra
intranet
...

I know why it is scoring this way, but it seems counter-intuitive.

@garybernhardt
Owner

@sos4nt That would prevent some common usage patterns, like typing "usspec" for "user_spec". I agree that "intranet" should show up first in your example, but I think it needs to be by scoring higher, not by filtering.

@sos4nt
Contributor

sos4nt commented Mar 10, 2014

That's what I meant: score it higher when the input occurs at the beginning of a word (boundary).

@garybernhardt
Owner

Ahh, yeah, that's @supermarin's suggestion too. :)

@torbiak

torbiak commented Nov 12, 2014

Have you considered not scoring when there are thousands of matches? I wrote probe, a pure VimL fuzzy file finder for Vim. Finding matches is cheap because I can just do a :g/<pattern>/d to filter out non-matches, but scoring in VimL is really slow, so I only start scoring once I'm down to 400 matches. I figure if there are thousands of matches, it's unlikely the scoring means much anyway. It's definitely a compromise, but it's made probe much more enjoyable to use. I use a red background on the match counter to draw the user's attention to the large number of matches and remind them that they haven't been scored yet.
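
In Ruby terms, the compromise looks roughly like this (a sketch; matches? and score stand in for real implementations, and probe itself is VimL):

    SCORING_THRESHOLD = 400  # probe only starts scoring below roughly this count

    def rank(candidates, query)
      matches = candidates.select { |c| matches?(c, query) }  # cheap filter
      return matches if matches.size > SCORING_THRESHOLD      # punt: too many to score meaningfully
      matches.sort_by { |c| score(c, query) }                 # expensive scoring pass
    end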

For probe's purposes I considered it essential to give scoring bonuses to characters after word boundaries, which forces me to iterate over the characters in the matches. I've been very happy with probe's scoring behaviour, which gives a scoring advantage when a character occurs first or last in the match, or immediately follows another query character or a path separator (a word boundary, in selecta's case).

Finally, one thing that I miss in selecta that I use frequently in probe is matching ordered substrings when there are spaces in the query, which gives a lot more selectivity without much more typing. For example, loc man bash would be a very selective query in /usr for a bash manpage at /usr/local/share/man/man1/bash.1. For filepaths I prefer this to the non-overlapping searches you mentioned above.
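
A minimal Ruby sketch of that space-separated, in-order matching (illustrative only; probe itself is written in VimL):

    # Each space-separated query part must appear in order, left to right.
    def ordered_parts_match?(path, query)
      pos = 0
      query.split.all? do |part|
        i = path.index(part, pos)
        i && (pos = i + part.length)
      end
    end

    ordered_parts_match?("/usr/local/share/man/man1/bash.1", "loc man bash")  # => true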

@garybernhardt
Owner

@torbiak, punting when there are many matches is tempting. However, it wouldn't be very good for large sets of files. For example, I sometimes use Selecta in my ~/proj directory, which has 78k files in it spanning dozens of projects. It'd be nice to go into there, type "amu" into Selecta, and have all of the app/models/users matches at the top. That doesn't currently happen; with the algorithm that @supermarin proposes it would, but only if Selecta considers the full input.

@garybernhardt
Owner

Over in #66, @airblade asked what algorithm I'd like. For background: Selecta currently finds the shortest substring that contains the query characters in order, and the length of that substring is the score. For example, if you type the query "foo", then "f-o-o-foo" will score 3, not 5. If we naively found the first matching substring, we'd get "f-o-o".

Here's a simplified version of the code:

def compute_match_length(string, chars)
  first_char, *rest = chars

  # Every occurrence of the first query character is a candidate start.
  first_indexes = find_char_in_string(string, first_char)

  # From each start, find where the rest of the query first completes,
  # then keep the length of the shortest matching substring.
  first_indexes.map do |first_index|
    last_index = find_end_of_match(string, rest, first_index)
    last_index && last_index - first_index + 1
  end.compact.min
end

This is theoretically O(n^2), n being the size of the input. However, it's quite fast in practice because the first character of the query will show up at most a few times.

This thread began when @supermarin proposed that characters at word boundaries should be privileged, so that e.g. "amu" will score very highly on "app/models/user.rb". That seemed like a win to me, so I started implementing it. My scoring strategy was: a character at a word boundary scores as if it were adjacent to the previous character. For example, "amu" scores 3 on "app/models/user.rb", rather than 12, which is the actual length of the matching substring "app/models/u".

The problem is that we can't simply start at each occurrence of the first letter any more. Imagine matching the query "abc" against "a/bxc/c". If we naively find the characters in order, we'll match the substring "a/bxc", scoring 4 (the "/" between "a" and "b" doesn't contribute to the score because "b" is at a word boundary). But the ideal substring match is actually the full string, "a/bxc/c", which scores 3: 1 for the "a", 1 for the "b" at a word boundary, 1 for "c" at a word boundary.

To do this right, we have to recurse twice on every single matching character: once on the next occurrence of the character alone, and again for the next occurrence of /\b#{char}/. This will consider every single matching substring, which can be thousands for long input strings like log file input. It's extremely slow, both in theory and practice.
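
A hedged sketch of that double recursion (illustrative Ruby, not Selecta's actual code): at each query character, branch on the nearest plain occurrence and on the nearest boundary occurrence, and keep the cheapest result.

    # The first query character always costs 1, wherever it occurs.
    def score_query(string, chars)
      first, *rest = chars
      scores = []
      i = -1
      while (i = string.index(first, i + 1))
        scores << best_score(string, rest, i + 1, 1)
      end
      scores.compact.min
    end

    # A plain occurrence costs its distance from the previous match; a
    # word-boundary occurrence costs 1, as if adjacent. Exponential in
    # the worst case, since every character forks the search.
    def best_score(string, chars, from, score)
      return score if chars.empty?
      char, *rest = chars
      options = []
      if (i = string.index(char, from))
        options << best_score(string, rest, i + 1, score + (i - from) + 1)
      end
      if (m = /\b#{Regexp.escape(char)}/.match(string, from))
        options << best_score(string, rest, m.begin(0) + 1, score + 1)
      end
      options.compact.min
    end

For example, score_query("a/bxc/c", %w[a b c]) returns 3, finding the full-string match described above.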

It may seem like finding the best possible score isn't important. But when I query for "amu" in my ~/proj directory, with dozens of projects and 79k files, I want all of the "app/model/user.rb"s at the top. Many of those paths will have other "amu" sequences in them, so we really do need to find the best substring.

I've gone through at least a dozen potential optimizations: pre-filtering the list, caching intermediate results, etc. Some of them help, but I've never gotten it to be fast enough while still finding the optimal match. Most "optimizations" that I tried required doing character-by-character analysis and/or a lot of small allocations, both of which are dog-slow in Ruby. I suspect that a dynamic programming solution with global caching of intermediate results would work very well in other languages, but I didn't even pursue it because it would involve both per-score allocations and a lot of character-by-character matching.

If you want to give it a try, there's a decent set of microbenchmarks to help you. Clone readygo, then record the current performance with ruby ~/path/to/readygo.rb benchmark.rb --record. Then, make your changes and compare with ruby ~/proj/readygo/readygo.rb benchmark.rb --compare. The "X"s in readygo's graphs are the minimum runtimes for the benchmarks; the lines to the right of the Xs indicate the longest attempt recorded, to visualize variance. Those lines will be short if you got a good recording and comparison. You'll probably find that it's easy to speed up some, or even most, of the benchmarks, but then one or two of them will become pathologically slow. The "matching broken up" and "overlapping matches" benchmarks tend to be especially volatile. You should also do subjective tests on large inputs, since the microbenchmarks don't catch everything (a ~100k-line Rails log file tends to work well).

I've mostly resigned myself to rewriting Selecta in a faster, static language, probably Rust, to make this fast enough. However, there may still be a clever solution in Ruby that I haven't seen, and if someone can find it then I'll be happy to avoid the rewrite. An algorithm redesign will probably be localized to the Score class, which is only 62 lines long right now.

@marcomorain
Contributor

Thanks for the write-up.


@tenderlove

Most "optimizations" that I tried required doing character-by-character analysis and/or a lot of small allocations, both of which are dog-slow in Ruby.

Total armchair engineering here, but it seems like character-by-character analysis is the way to go (unless there is a clever way to apply regular expressions). Since you don't want tons of allocations, maybe you could match against the numeric character values. Comparing numbers shouldn't allocate.
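
For instance (just a sketch of the idea, not Selecta code), comparing Integer codepoints avoids allocating a one-character String per comparison:

    query_codepoints = "amu".codepoints  # => [97, 109, 117]

    "app/models/user.rb".each_codepoint.with_index do |cp, i|
      # cp is an Integer; comparing Integers doesn't allocate, unlike the
      # one-character Strings that each_char would yield.
      puts "first query char found at index #{i}" if cp == query_codepoints.first
    end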

@garybernhardt
Owner

Aaron, I hadn't thought of converting everything to character numbers ahead of time! Now I feel dumb. I'll think about how to combine that with a dynamic programming solution to avoid repeated matching on thousands of substrings. Thanks!

@tenderlove

@garybernhardt np! This is a really interesting problem, thanks for writing it up. I'll think about it more too.

@garybernhardt
Owner

Great, let me know if you come up with anything!

@aaronjensen

I haven't put too much thought into it yet, but for a dynamic programming solution, would you need to remember for each char a list of tuples (substr length, min points)? It seems like that plus @tenderlove's suggestion of ints (brilliant) would work out. You could allocate everything ahead of time too if you needed to because you know the max size of your cache would be 2 * str length * substr length.

@jamii

jamii commented Jan 6, 2015

Light Table uses an ugly but effective solution.

  1. Throw away anything that doesn't contain any of the characters in the search.
  2. Keep anything that has an exact match for the search and throw that up on screen already, since it must have the highest score.
  3. Do the dynamic programming solution for everything that's left.

It seems to work fairly well even though the code is a mess and could be heavily improved. Tonight I might try cleaning it up and running it on your benchmarks.
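
That staged approach might look roughly like this in Ruby (a sketch; expensive_score is a stand-in, and Light Table itself is ClojureScript):

    def rank(choices, query)
      chars = query.chars
      # 1. Cheap rejection: drop anything missing a query character entirely.
      possible = choices.select { |c| chars.all? { |ch| c.include?(ch) } }
      # 2. Exact substring matches must score best, so surface them immediately.
      exact, remaining = possible.partition { |c| c.include?(query) }
      # 3. Run the expensive scorer only on what's left.
      exact + remaining.sort_by { |c| expensive_score(c, query) }
    end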

@aaronjensen

@garybernhardt with the string a/b/c, would ac score a 2 or a 3?

@garybernhardt
Owner

@aaronjensen that would score 2; the algorithm just sees "a[word boundary]c"

@aaronjensen

One other approach would be to convert to a DAG and shortest path it.

@aaronjensen

Hmm, on second thought the fact that it's multiple starting points and multiple destinations may make that tricky/wrong. Nevermind.

@JoshCheek

Would be a lot easier to get motivated with a set of inputs / outputs. Something like

query | string   | score
------|----------|------
abc   | abc      | 3
abc   | xabc     | 3
abc   | abcx     | 3
abc   | xabcx    | 3
abc   | abcc     | 3
abc   | acb      | 2
abc   | ab       | 2
abc   | ac       | 2
abc   | ab/c     | 3
ab/c  | ab/c     | 4
abc   | a/b/c    | 3
abc   | a/babc/c | 5
...

@supermarin
Author

@JoshCheek not sure I'm getting this right, but why would a/babc/c score better than abc given the search input was abc?

@mcwumbly

mcwumbly commented Jan 6, 2015

There are specs in spec/score_spec.rb for the existing algorithm, with one test ignored for this new behavior.

See also the scoring_redesign branch for updated specs.

Edit: see clarification from Gary below!

@aaronjensen

A naive attempt at dynamic programming led to way too many object allocations because I had to #each_char each choice. The current algorithm is quite a bit faster because the stdlib is able to do most of the heavy string lifting...

    def compute_match_length2(choice, query_chars)
      query_length = query_chars.length
      query_last_index = query_length - 1

      # cache :: [[index, score, distance]]
      cache = []

      choice.each_char.each_with_index do |choice_char, choice_index|
        cache.each do |data|
          index, score, distance = data

          if index == query_last_index
            return score if score == query_length
            next
          end

          if choice_char == query_chars[index + 1]
            data[0] += 1
            data[1] += distance + 1
            data[2] = 0
          else
            data[2] += 1
          end
        end

        if choice_char == query_chars[0]
          cache << [0, 1, 0]
        end
      end

      cache.select { |index, _, _| index == query_last_index }
        .map { |_, score, _| score }.min
    end

@garybernhardt
Owner

@mcwumbly, That pending spec was for other behavior that was discussed at one point. I just removed it to avoid confusion. You're right about the scoring_redesign branch's specs, though; those are useful. Specifically, this bit, which isn't on master. (I changed it slightly here; there are pending specs on that branch and some of them aren't very clear.)

    describe "at word boundaries" do
      it "doesn't score characters before a match at a word boundary" do
        score("fooxbar", "foobar").should == 7
        score("foo-x-bar", "foobar").should == 6
      end

      it "finds optimal non-boundary matches when boundary matches are present" do
        # The "xay" matches in both cases because it's shorter than "xaa-aay"
        # even considering the latter's boundary bonus.
        score("xayz/x-yaaz", "xyz").should == 4
        score("x-yaaz/xayz", "xyz").should == 4
      end

      it "finds optimal boundary matches when non-boundary matches are present" do
        score("x-yaz/xaaaayz", "xyz").should == 4
        score("xaaaayz/x-yaz", "xyz").should == 4
      end
    end

Off the top of my head, here are some more assertions that might trigger corner cases depending on the implementation:

score("x-x-y-y-z-z", "xyz").should == 3
score("xayazxayz", "xyz").should == 4
score("x/yaaaz/yz", "xyz").should == 3

That last assertion will fail even on the (old, outdated) scoring_redesign branch. That shows that even that gnarly code isn't a general solution, because it only finds the nearest occurrence of the next character, with and without a boundary. The truly general solution has to consider both (1) the closest occurrence of the character, whether that's at a boundary or not, and (2) all following boundary occurrences of the character.
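
Sketching that generalization (illustrative only, in the style of the recursion sketched earlier in this thread): besides the nearest plain occurrence, branch on every later boundary occurrence of the character. On "x/yaaaz/yz", the boundary loop is what lets the search skip ahead to the second "y" and find the score of 3.

    def general_best(string, chars, from, score)
      return score if chars.empty?
      char, *rest = chars
      options = []
      # Nearest plain occurrence: costs its distance from the previous match.
      if (i = string.index(char, from))
        options << general_best(string, rest, i + 1, score + (i - from) + 1)
      end
      # Every later boundary occurrence: each costs 1, as if adjacent.
      pos = from
      while (m = /\b#{Regexp.escape(char)}/.match(string, pos))
        options << general_best(string, rest, m.begin(0) + 1, score + 1)
        pos = m.begin(0) + 1
      end
      options.compact.min
    end

    general_best("x/yaaaz/yz", %w[y z], 1, 1)  # => 3 (the "x" already matched at cost 1)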

Finally, here's a corrected version of Josh's table. All of this is covered by the existing specs plus those above, so there's no need to test these cases; they might be useful for understanding the behavior, though.

query | string     | score
------|------------|------
abc   | abc        | 3
abc   | xabcx      | 3
abc   | abcc       | 3
abc   | acb        | 0
abc   | ab         | 0
abc   | ac         | 0
abc   | ab/c       | 3
ab/c  | ab/c       | 4
abc   | a/b/c      | 3

@aaronjensen

Ok, switched to using each_codepoint instead and that got rid of the object allocations, but it's still quite a bit slower.

@bentomas

bentomas commented Jan 6, 2015

I forked selecta a month ago (just locally) because I found that the search algorithm wasn't working as well as I thought it could. I wrote a new algorithm that I find almost reads my mind. I'm rather fond of my algorithm, but it works in a very particular way[1].

I doubt the following is going to be a very popular idea, as it would be a bit of a left turn for this codebase to take, but I've enjoyed thinking about it this evening. I wonder if selecta should be split into two commands: one command handles the interface side of things, and a second command handles the searching. This way it would be easy to use different search algorithms depending on the situation. Searching path names is very different from searching, say, contacts. It would also make this library more reusable, as those of us who aren't happy with the built-in search would be able to choose our own while still getting the benefits of the delightful interface selecta provides.

Calling the command could then look something more like:

find . -type f | selecta --search-command="selecta-search"

Or

find . -type d | selecta --search-command="custom-search"

And having a default command would keep things the way they are today:

find . -type f | selecta

Splitting things up also has the added bonus of making it easy to switch the search algorithm to a different, more performant language, while not losing the ease with which the interface is written in Ruby.

[1]: a brief summary is that it never searches across path parts (maybe like the "word boundaries" @garybernhardt was talking about above?) unless a slash (or a space, for ease of typing) is specified. So, a search for amu would only match file names that contain those letters, and a search for a/m/u would match a path that has a directory containing an a, which contains (recursively) a directory containing an m, which contains (recursively) a file containing a u. This works much better with how I think, in that I want to find files and directories by their names and be in control of when the search moves across path parts. I then also tweaked the algorithm to favor matches at the beginning of file/directory names over matches that start later in a name. It was horribly slow at first, but with a little massaging it's fast enough for my largest project (though it doesn't have anywhere close to 78,000 files).
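
A rough Ruby sketch of that rule (a toy illustration only, not the actual implementation in my fork):

    # Query characters only match within a single path component; a slash
    # (or space) in the query explicitly steps into deeper components.
    def path_query_to_regex(query)
      parts = query.split(%r{[/ ]}).map do |part|
        part.chars.map { |c| Regexp.escape(c) }.join("[^/]*")  # stay inside one component
      end
      Regexp.new(parts.join(".*/.*"))  # each later part must follow a slash
    end

    path_query_to_regex("amu")   =~ "app/models/user.rb"  # => nil: "amu" can't cross slashes
    path_query_to_regex("a/m/u") =~ "app/models/user.rb"  # => 0: parts step across components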

@mcwumbly

mcwumbly commented Jan 6, 2015

@garybernhardt - this passes the specs on scoring_redesign and benches close to master, if I'm interpreting the results correctly (+10 ms on the worst two cases, each of which is ~30 ms vs ~20 ms before on my machine):
mcwumbly@06e90eb

scoring non-matching
  Baseline: |                                                               X--|
  Current:  |                                             X--                  |
            0                                                           4.395 us

scoring matching exactly
  Baseline: |                                                           X------|
  Current:  |  X                                                               |
            0                                                          38.250 us

scoring matching broken up
  Baseline: |                    X--                                           |
  Current:  |                                                           X------|
            0                                                         121.813 us

scoring overlapping matches
  Baseline: |                                                              X---|
  Current:  | X                                                                |
            0                                                          79.938 us

scoring almost overlapping matches
  Baseline: |                 X----                                            |
  Current:  |                                                              X---|
            0                                                         263.125 us

scoring paths, non-matching
  Baseline: |                                                            X-----|
  Current:  |                                                X-                |
            0                                                          18.546 ms

scoring paths, empty query
  Baseline: |                                                           X----  |
  Current:  |                                                             X----|
            0                                                         593.000 us

scoring paths, trivial query
  Baseline: |                                                          X-------|
  Current:  |                  X--                                             |
            0                                                          14.261 ms

search matching a large list of choices filtering with no matches
  Baseline: |                                                            X-----|
  Current:  |                                                X---              |
            0                                                           1.745 ms

search matching a large list of choices filtering with many matches
  Baseline: |                                                               X--|
  Current:  |              X-                                                  |
            0                                                           3.157 ms

search matching a large list of choices progressive query appending, sequential, few matches
  Baseline: |                                                           X------|
  Current:  |                                               X--                |
            0                                                           9.416 ms

search matching a large list of choices progressive query appending, sequential, many matches
  Baseline: |                                           X-----                 |
  Current:  |                                                              X---|
            0                                                          32.357 ms

search matching a large list of choices progressive query appending, non-sequential, few matches
  Baseline: |                                                         X--------|
  Current:  |                                              X-                  |
            0                                                           7.998 ms

search matching a large list of choices progressive query appending, non-sequential, many matches
  Baseline: |                                  X----                           |
  Current:  |                                                            X-----|
            0                                                          32.852 ms

@garybernhardt
Owner

(To run the specs, just gem install rspec and then run rspec.)

@jamii

jamii commented Jan 7, 2015

Convincing rspec to use ruby2.1 was fun...

@jamii

jamii commented Jan 7, 2015

I'm probably just going to play with writing it in rust rather than beat my head against ruby. Sorry :)

@garybernhardt
Owner

I'd love to read a version of the algorithm in Rust if you write it! I've tried to start learning Rust twice and given up in frustration both times.

@jamii

jamii commented Jan 8, 2015

Yeah, it's definitely not optimised for ease of use, but I have another project that is starting to want control over memory layout so I'm more incentivised to push through it.

@garybernhardt
Owner

Yeah, Selecta has basically been that force for me. I assume that I'll eventually push hard enough on it to get over this initial hump.

@jamii

jamii commented Jan 8, 2015

Ok, I think I just ran into the same problem you did where cargo test is compiling some random old cached version of my project.

@jamii

jamii commented Jan 8, 2015

Or maybe I was just editing the wrong file. Embarrassing.

@jamii

jamii commented Jan 8, 2015

This is eerily on par with the js version.

test almost_overlapping_matches      ... bench:      3573 ns/iter (+/- 61)
test matching_broken_up              ... bench:       916 ns/iter (+/- 106)
test matching_exactly                ... bench:      1532 ns/iter (+/- 32)
test non_matching                    ... bench:       201 ns/iter (+/- 98)
test overlapping_matches             ... bench:      3888 ns/iter (+/- 910)
test paths_empty_query               ... bench:      7819 ns/iter (+/- 1584)
test paths_non_matching              ... bench:   1037549 ns/iter (+/- 188323)
test paths_trivial_query             ... bench:    963238 ns/iter (+/- 56167)
test progressive_non_sequential_few  ... bench:    676234 ns/iter (+/- 123964)
test progressive_non_sequential_many ... bench:   2242298 ns/iter (+/- 282439)
test progressive_sequential_few      ... bench:    761329 ns/iter (+/- 66473)
test progressive_sequential_many     ... bench:   2806800 ns/iter (+/- 56153)

https://github.com/jamii/fuzzy

@jamii

jamii commented Jan 8, 2015

It's been pretty pleasant so far, apart from the repeat.take.collect nonsense. The compiler errors were mostly very helpful and often suggested the correct fix. Having benchmarking and testing tools built in is pretty nice.

@jamii

jamii commented Jan 8, 2015

I would like to see a flymake mode that can give me type/lifetime info in the editor.

@jamii

jamii commented Jan 9, 2015

With some help from Reddit, it now handles Unicode correctly and is maybe 30-40% faster than the JS version. There were some gotchas along the way, though, and not having a debugger or REPL was pretty frustrating. On the other hand, callgrind is a lot better than the Chrome profiler, which seems to get confused by its own JIT a lot.

@spion

spion commented Jan 12, 2015

I also took a stab at a dynamic-programming-based solution in JS. Here is the matching function implementation, and there is also a proof-of-concept hacky program (similar to selecta) to play around with.

Performance is comparable to selecta (slower under node 0.10, about the same in iojs master), which means it should be really fast when written in Rust (and will probably handle Unicode much better; this version is only case-insensitive in ASCII).

The main goal was to make a prototype where the scoring function is really easy to tweak and experiment with. Currently it favors matching characters after separators (or at the beginning), as well as multiple consecutive characters, but it's quite easy to use very different scoring schemes. It's based on a gain function that takes the query, the line, and their two index positions, and returns a score that determines how well they match at that position.
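
Roughly this shape, in Ruby (a sketch; the real implementation is JS, and these gain values are placeholders, not the actual constants):

    GAIN = { beginning: 4, two_or_more: 2, normal: 1 }.freeze
    SEPARATORS = ["/", "_", "-", ".", " "].freeze

    # How much does matching query[qi] against line[li] contribute?
    def gain(query, line, qi, li)
      return 0 unless query[qi].downcase == line[li].downcase
      return GAIN[:beginning]   if li.zero? || SEPARATORS.include?(line[li - 1])  # after a separator
      return GAIN[:two_or_more] if qi > 0 && query[qi - 1].downcase == line[li - 1].downcase  # consecutive run
      GAIN[:normal]
    end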

@garybernhardt
Owner

@spion How do I run that?

failbowl:fuzzy-matcher(master) $ echo foo | ./main.js

/Users/grb/proj/selecta/fuzzy-matcher/index.js:12
    var i = 0, j, m = query.length, n = line.length;
                           ^
TypeError: Cannot read property 'length' of undefined
    at regularMatch (/Users/grb/proj/selecta/fuzzy-matcher/index.js:12:28)
    at match (/Users/grb/proj/selecta/fuzzy-matcher/index.js:4:10)
    at Stream.process.stdin.on.pipe.pipe.self (/Users/grb/proj/selecta/fuzzy-matcher/main.js:16:17)
    at Stream.stream.write (/Users/grb/proj/selecta/fuzzy-matcher/node_modules/through/index.js:26:11)
    at Stream.ondata (stream.js:51:26)
    at Stream.emit (events.js:95:17)
    at drain (/Users/grb/proj/selecta/fuzzy-matcher/node_modules/split/node_modules/through/index.js:36:16)
    at Stream.stream.queue.stream.push (/Users/grb/proj/selecta/fuzzy-matcher/node_modules/split/node_modules/through/index.js:45:5)
    at emit (/Users/grb/proj/selecta/fuzzy-matcher/node_modules/split/index.js:36:14)
    at next (/Users/grb/proj/selecta/fuzzy-matcher/node_modules/split/index.js:48:7)
failbowl:fuzzy-matcher(master) $ node --version
v0.10.35

@spion

spion commented Jan 13, 2015

Oops, forgot to add argument checking and usage. Fixed.

That's the non-interactive, one-shot version; input | ./main.js query will output lines matching the query to stdout. fuzzy-select adds some TTY interactivity on top of it.

@garybernhardt
Owner

@spion The results don't seem right:

$ cat files.txt
../.git/refs/remotes/teoljungberg/refactor-search
selecta
../.gem/ruby/2.1.2/gems/rspec-expectations-2.14.3/lib/rspec/expectations/extensions/array.rb
$ cat files.txt | ./main.js selecta
../.gem/ruby/2.1.2/gems/rspec-expectations-2.14.3/lib/rspec/expectations/extensions/array.rb
selecta
../.git/refs/remotes/teoljungberg/refactor-search

@spion

spion commented Jan 23, 2015

That's because r(s)pec-(e)xpectations.../(l)ib/.../(e)xpe(cta)tions/... scores higher:

3 * Gain.Beginning + 2 * Gain.Normal + 2 * Gain.TwoOrMore - lengthPenalty ~= 15

vs. selecta, which is 1 * Gain.Beginning + 5 * Gain.TwoOrMore - lengthPenalty ~= 11,

mostly because Gain.Beginning > Gain.TwoOrMore.

If I adjust Gain.TwoOrMore to 3, then selecta wins.

Originally I thought about a progressive gain the more letters in a sequence are matched, but that may be a bit slow to check.

Edit: I just pushed a version where Gain.TwoOrMore = 3.

@rschmitt
Contributor

rschmitt commented Feb 3, 2015

Since a Rust rewrite has come up a few times in this thread, I'll leave a link to my Rust rewrite of Selecta, called Heatseeker:

https://github.com/rschmitt/heatseeker

It's a fully functional Selecta clone that is tested and working on OS X, Linux, and Windows. It currently only implements the classic Selecta scoring algorithm (I haven't yet looked at the scoring_redesign_4 branch in much detail), but if you're interested in hacking on scoring algorithms, performance optimization, multithreading, or just learning Rust, you might be interested in Heatseeker.

@garybernhardt
Owner

The scoring_redesign_4 branch that @rschmitt mentioned is pushed to GitHub. It (1) is faster than master, (2) consumes multiple input characters at a time, which makes it feel much faster overall, (3) has an algorithm that privileges word boundaries and sequential runs of matching characters, and (4) highlights the matching substring in the UI. I'd love to hear what people think about it, especially the scoring algorithm changes.

@rschmitt
Contributor

rschmitt commented Feb 4, 2015

Does consuming multiple input characters at a time make it possible to support <ESC> as a way to exit Selecta without making a selection? (If not, I would suggest Ctrl-G for that functionality, by analogy with emacs.)

@rschmitt
Contributor

rschmitt commented Feb 4, 2015

The speed of scoring_redesign_4 is awesome, by the way. I've already implemented some of the same optimizations in Heatseeker, but I'm starting to feel like I'm making a faster ballpoint pen. I'll experiment with the new scoring algorithm later this week; I'd like to see what it considers a "word boundary," and how it works with different projects in different languages.

@garybernhardt
Owner

@rschmitt I don't think that this change will make ^G any easier or harder. Selecta used to support ESC as a way to exit with status 0 (as opposed to ^C, which exits with status 1). I backed that out because it got confused by escape codes, like arrow keys. It wouldn't be terribly hard to re-add it with another keystroke, but I'm skeptical of its utility.

When I think about use cases for Selecta exiting with status 0 but printing nothing, they always end up involving a downstream conditional to check whether the output was empty. If that conditional exists, it may as well be a check for $? instead, as I see it.

@rschmitt
Contributor

rschmitt commented Feb 4, 2015

My understanding is that it was hard to distinguish <ESC> (as an actual keypress) from escape codes, because when you got an ESC you would have to wait insert-arbitrary-duration-here in order to disambiguate one from the other. But if you're now using io/wait and handling buffered input all at once, it seems like that disambiguation could be performed much more simply: if the last thing in the input buffer is an ESC, it must not be a control code. (I realize that I'm making an assumption about atomicity here, but this is merely something I'm curious about.)

@garybernhardt
Owner

Right, distinguishing ESC from control codes was the problem. But we'd really need an escape code parser to do that well. An ESC and [ could appear because they were typed that way and got buffered. Of course, nothing saves you from a buffered <ESC>[A, which is indistinguishable from a true up arrow, even if it's typed by a user. (Curses claims to magically take care of all of this, but it also becomes broken sometimes, especially on OS X.)

@airblade

airblade commented Feb 4, 2015

In scoring_redesign_4, if I wanted to boost the weighting given to boundary characters, or sequential runs of matching characters, how would I do it?

As far as I understand the current algorithm, lower scores are better and a boundary or sequential character adds 1 to the score instead of the subsequence length.

Must scores be integers or could one add, say, 0.5 on a boundary character to double its impact?

@wpp

wpp commented Feb 4, 2015

I've pulled scoring_redesign_4, mapped

  • "old selecta" to <leader>f
  • "new selecta" to <leader>b

and compared them throughout the day. scoring_redesign_4 feels much snappier to me. I believe privileging word boundaries often makes the "top group" of matches more relevant. And of course highlighting the matched substrings is a total game-changer!

[screenshot]

My only complaint is that all filenames are downcased.
[screenshot]

@felipesere

I've been working on a port of selecta to Rust.
You can find it here: https://github.com/felipesere/athena

It handles find ~ | target/release/selecta without any hiccups on my current MacBook Pro.
find ~ | wc -l says that I have about 260,000 files in my home folder.

@garybernhardt
Owner

The new algorithm is now merged to master. If you try it, please let me know what you think over in #80.
