Should we keep the new boundary-aware scoring algorithm? #80

garybernhardt · 2015-03-02T23:49:13Z

I've made the scoring algorithm smarter about sequential matching characters and word boundaries (to improve results when querying for acronyms). It's merged to master, along with some other changes, in d874c99. The README contains a summary (search it for "algorithm").

I'd love to hear feedback from actual Selecta users, especially after you've used it on actual projects.


rschmitt · 2015-03-03T00:59:13Z

I'm not really happy with the way Selecta handles case currently. There are two distinct issues. The first issue is this:

(Selecta will correctly output "ASDF" if you hit enter here, but the displayed case is incorrect.) The second issue is that in some languages, ignoring case completely actually throws out a lot of valuable information that could trivially be used for matching. This is best demonstrated by an example. Without case sensitivity:

With case sensitivity:

I think that a good solution here might be to have lowercase query characters match both uppercase and lowercase characters, but uppercase query characters only match uppercase characters in matches. There must be a term for this--partial case sensitivity? case covariance?--but I don't know what it is.
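The rule proposed here can be sketched in a few lines of Python. This is only an illustration of the idea, not code from Selecta (which is written in Ruby); the function names are hypothetical.

```python
def chars_match(query_char: str, candidate_char: str) -> bool:
    """Lowercase query characters match either case; all others must match exactly."""
    if query_char.islower():
        return query_char == candidate_char.lower()
    return query_char == candidate_char


def matches_in_order(query: str, candidate: str) -> bool:
    """True if every query character can be found, in order, in the candidate."""
    remaining = iter(candidate)  # shared iterator, so each search resumes where the last ended
    return all(
        any(chars_match(q, c) for c in remaining)
        for q in query
    )
```

With this rule, `ub` matches both `UserBuilder.java` and `user_builder.rb`, while `UB` matches only the first.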

jwhitley · 2015-03-03T03:48:55Z

@rschmitt That's usually called "smart case", and it's pretty common in interactive search systems these days. E.g. built-in interactive search for both Emacs and Vim has had this for ages.

rschmitt · 2015-03-03T03:56:06Z

@jwhitley That rings a bell. I've gone ahead and implemented it in Heatseeker (rschmitt/heatseeker@7a3aa4b). So far it seems like an awesome improvement for languages that conventionally use CamelCase filenames--Java, Scala, Haskell, some C++, etc.

wpp · 2015-03-03T08:05:01Z

> (Selecta will correctly output "ASDF" if you hit enter here, but the displayed case is incorrect.)

Have to agree with @rschmitt here. So far the only issue I ran into.

airblade · 2015-03-03T09:26:34Z

I prefer the new scoring.

I noticed that a query of amuser will score 3 against banjo/app/models/user.rb instead of 2, because the score count starts at the a in banjo instead of the a in app.

Most of the time I imagine selecta is used to match file paths. File paths aren't uniformly weighted; the tail of the path is more specific, in a way, than the head (big-endian?). Therefore I was wondering about matching from the more specific to the less specific, i.e. from right to left.

Clearly you can't just reverse the query and the choices and pass those to the scoring algorithm. I can't quite tell at the moment how to change the algorithm, and of course benchmarking might well rule it out. But I thought I'd mention it.

mvaltas · 2015-03-03T14:19:28Z

I see that the algorithm favors directory matches over file matches in certain conditions. Here's an example from a Chef project I'm working on:

Note that default.rb is closer (in terms of directory depth) than the other files, and the input partially matches the file name default.rb but not the other file names, which led me to believe that it should take precedence.

PS: There are no more files in this example, all of them show up in this screenshot.

gshutler · 2015-03-03T17:42:03Z

I think it's a general improvement. I'm still getting acquainted with the new behaviour, learning new "first hits", etc.

The boundary-aware matching hasn't worked as I expect in a few cases:

> selecta
gshutler/goselecta
garybernhardt/selecta

I would expect garybernhardt/selecta to rank higher as selecta starts after a boundary.

> core
./rspec-core
./cronofy/core

I would expect ./cronofy/core to rank higher as core starts after a harder boundary. I think of -_ as softer than /\.
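One way to express the hard-versus-soft distinction is a per-separator weight. The sketch below is purely illustrative: the weights and the helper name are assumptions, not anything Selecta implements.

```python
HARD = 2  # assumed weight for path separators
SOFT = 1  # assumed weight for in-name separators

BOUNDARY_WEIGHTS = {"/": HARD, "\\": HARD, "-": SOFT, "_": SOFT}


def boundary_strength(candidate: str, index: int) -> int:
    """Weight of the boundary immediately before candidate[index] (0 if none)."""
    if index == 0:
        return HARD  # the start of the string is treated as a hard boundary
    return BOUNDARY_WEIGHTS.get(candidate[index - 1], 0)
```

Under these assumed weights, `core` starts after a boundary of strength 1 in `./rspec-core` and strength 2 in `./cronofy/core`.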

> ccore
./rspec-core
./cronofy/core

A variant of the above, but I would definitely expect ./cronofy/core to rank higher, as the first c matches the leading c of cronofy (a word start) but only the trailing c of rspec.

> vepres
app/presenters/event_presenter.rb
app/presenters/v_event_presenter.rb
app/presenters/api_event_presenter.rb

I think this is similar to the case @airblade mentioned. I'm expecting [v]_[e]vent_[pres]enter.rb to be chosen but it's using v_e[ve]nt_[pres]enter.rb. I think that's because it's the shorter substring. The only way to avoid this would be to evaluate all possible matches to find the best score which would be slower.
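To make "evaluate all possible matches" concrete, here is a brute-force sketch in Python. It enumerates every in-order alignment of the query against the candidate and picks the one that lands on the most word boundaries. The boundary-hit count is a stand-in scoring rule chosen for illustration, not Selecta's scoring, and all names are hypothetical. The enumeration is exponential in the worst case, which is the cost being discussed.

```python
from typing import Iterator, Optional, Tuple

BOUNDARY_CHARS = "/_-."


def all_alignments(query: str, candidate: str,
                   q_start: int = 0, c_start: int = 0) -> Iterator[Tuple[int, ...]]:
    """Yield every tuple of candidate indices that spells out the query in order."""
    if q_start == len(query):
        yield ()
        return
    for i in range(c_start, len(candidate)):
        if candidate[i] == query[q_start]:
            for rest in all_alignments(query, candidate, q_start + 1, i + 1):
                yield (i,) + rest


def boundary_hits(candidate: str, positions: Tuple[int, ...]) -> int:
    """Count matched characters that sit at the start of a word."""
    return sum(1 for i in positions
               if i == 0 or candidate[i - 1] in BOUNDARY_CHARS)


def best_alignment(query: str, candidate: str) -> Optional[Tuple[int, ...]]:
    """Return the alignment with the most boundary hits, or None if nothing matches."""
    return max(all_alignments(query, candidate),
               key=lambda p: boundary_hits(candidate, p),
               default=None)
```

For the query `vepres` against `app/presenters/v_event_presenter.rb` there are only three alignments, and the one with the most boundary hits is `[v]_[e]vent_[pres]enter.rb`.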

If a primary use case of selecta is selecting files, then I think that matches "further" into the strings should have more weight, as the "deeper" you go the more specific the match is to that string.

It might help if I give an example of where this approach definitely works.

Imagine you've got a Rails project-like structure:

app/controllers/application_controller.rb
app/controllers/special_controller.rb
spec/controllers/application_controller_specs.rb
spec/controllers/special_controller_specs.rb

This splits on boundaries into something like:

[app, controllers, application, controller, rb]
[app, controllers, special, controller, rb]
[spec, controllers, application, controller, specs, rb]
[spec, controllers, special, controller, specs, rb]
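A minimal Python sketch of that split, assuming the boundary characters are `/`, `_`, `-` and `.` (the set and the function name are assumptions made for illustration):

```python
import re

# Assumed boundary characters: path separator, underscore, hyphen, dot.
BOUNDARY = re.compile(r"[/_.\-]+")


def split_on_boundaries(path: str) -> list:
    """Split a path into its words, dropping empty pieces."""
    return [token for token in BOUNDARY.split(path) if token]
```

For example, `split_on_boundaries("app/controllers/special_controller.rb")` gives `['app', 'controllers', 'special', 'controller', 'rb']`.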

When I search for something like appcon I would expect the results:

app/controllers/[app]lication_[con]troller.rb
spec/controllers/[app]lication_[con]troller_specs.rb
[app]/[con]trollers/special_controller.rb

Currently we get:

[app]/[con]trollers/special_controller.rb
[app]/[con]trollers/application_controller.rb
spec/controllers/[app]li[c]ati[on]_controller_specs.rb

If I refine the search to appcons I would expect the results:

spec/controllers/[app]lication_[con]troller_[s]pecs.rb
[app]/[con]trollers/[s]pecial_controller.rb
[app]/[con]troller[s]/application_controller.rb

Currently we get:

[app]/[con]troller[s]/special_controller.rb
[app]/[con]troller[s]/application_controller.rb
spec/controllers/[app]li[c]ati[on]_controller_[s]pecs.rb

I hope that's in some way useful.

garybernhardt · 2015-03-04T01:43:39Z

The UI now prints paths with the correct case; that was a silly little bug.

I think that smart case seems like a good idea, but it sounds hairy and I want to put it off for a bit since it should be independent of these recent algorithm changes.

Comments on left-vs-right in a moment.

garybernhardt · 2015-03-04T01:56:41Z

I see two possible adjustments for left vs. right matching:

1. Favoring strings where the match occurs farther to the right (whether we judge by start point, end point, median, whatever). This is a sorting issue; the scoring stays the same. This is easy to add.
2. Actually matching from right to left: start with the rightmost character of the query, then move left through the candidate string. The algorithm is greedy and not fully general; otherwise it would be O(2^n). Matching from right to left is doable, and I just implemented it, but it does add quite a bit of complication.

I think that (1) should definitely be done, but (2) may not be worth it.
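The difference between the two matching directions can be sketched as follows. This is a simplified, case-sensitive Python illustration with hypothetical function names; it is not the implementation mentioned above, and it does no scoring at all.

```python
from typing import List, Optional


def greedy_left_to_right(query: str, candidate: str) -> Optional[List[int]]:
    """Match each query character at its earliest possible position."""
    positions = []
    start = 0
    for ch in query:
        index = candidate.find(ch, start)
        if index == -1:
            return None
        positions.append(index)
        start = index + 1
    return positions


def greedy_right_to_left(query: str, candidate: str) -> Optional[List[int]]:
    """Match each query character at its latest possible position, last character first."""
    positions = []
    end = len(candidate)
    for ch in reversed(query):
        index = candidate.rfind(ch, 0, end)  # search only to the left of the previous match
        if index == -1:
            return None
        positions.append(index)
        end = index
    positions.reverse()
    return positions
```

For `amuser` against `banjo/app/models/user.rb`, the left-to-right version anchors the `a` in `banjo` (index 1), while the right-to-left version anchors it in `app` (index 6). The right-to-left version also places the final `r` on the `.rb` extension rather than in `user`, which shows that a greedy pass in either direction can miss the best alignment.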

Comments on specific matching examples in yet another moment...

garybernhardt · 2015-03-04T02:06:14Z

In @airblade's example of querying "banjo/app/models/user.rb" for "amuser", the score is 3 because the first character isn't considered for purposes of the boundary and sequential character bonuses. It definitely should be, but I didn't see an obvious way to implement it that way, so I cowardly punted on it.
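The effect of ignoring the first character can be illustrated with a small scoring model in Python. This model is an assumption built to reproduce the two scores quoted for `amuser` (3 today, 2 if the first character counted); it is not taken from Selecta's source, and the names are hypothetical. In the model a lower score is better, a run of boundary matches or a run of sequential matches costs 1 in total, and any other match costs the distance from the previous match.

```python
from typing import Optional


def is_boundary(candidate: str, index: int) -> bool:
    """True at the start of the string or right after a non-alphanumeric character."""
    return index == 0 or not candidate[index - 1].isalnum()


def score_from(candidate: str, query: str, first_index: int,
               first_char_bonus: bool) -> Optional[int]:
    """Greedy score for a match whose first character sits at first_index."""
    score = 1
    last_index = first_index
    last_kind = None
    if first_char_bonus and is_boundary(candidate, first_index):
        last_kind = "boundary"  # the first character now starts a boundary run
    for ch in query[1:]:
        index = candidate.find(ch, last_index + 1)
        if index == -1:
            return None
        if index == last_index + 1:
            kind = "sequential"
        elif is_boundary(candidate, index):
            kind = "boundary"
        else:
            kind = "normal"
        if kind == "normal":
            score += index - last_index  # pay for the whole gap
        elif kind != last_kind:
            score += 1                   # pay once per run
        last_kind = kind
        last_index = index
    return score


def best_score(candidate: str, query: str,
               first_char_bonus: bool = False) -> Optional[int]:
    """Lowest score over every possible position of the first query character."""
    if not query:
        return 0
    scores = [
        s
        for i, ch in enumerate(candidate)
        if ch == query[0]
        for s in [score_from(candidate, query, i, first_char_bonus)]
        if s is not None
    ]
    return min(scores) if scores else None
```

In this model, `best_score("banjo/app/models/user.rb", "amuser")` is 3, and it drops to 2 with `first_char_bonus=True`, because the `a` in `app` then begins the same boundary run as the `m` in `models` and the `u` in `user`.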

For @gshutler's examples, in order:

1. "garybernhardt/selecta" should rank higher than "gshutler/goselecta" for "selecta"; this will happen with the above change.
2. Favoring "./cronofy/core" over "./rspec-core" for "core" due to a "/" before "core": this is specific to file paths. In other input text, this difference may not matter or may even be reversed, so I'd like to avoid this to keep Selecta input-agnostic.
3. Favoring "./cronofy/core" over "./rspec-core" for "ccore" due to the "c" starting a word: like (1), this will be fixed with the above change.
4. I think that you're right: this would require solving the general case, which is O(2^n). There are some middle-ground solutions where you limit how far you'll look. They're quite a bit more complex than the greedy algorithm because you have to keep track of all of that stuff.

I think that we should:

1. Make the boundary/sequential special cases apply to the first matching character as well, which will fix three of the five cases mentioned above.
2. Depending on how that goes, maybe also add a sub-sort that favors matches to the right of the string. (This has some problems. It requires that the scoring algorithm choose the rightmost matching substring if there are multiple substrings tied for a score. It's also somewhat specific to file paths, which I don't like, but it doesn't affect the actual scores, so the impact will be small unless there are many results with identical scores.)
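The sub-sort described above amounts to a two-part sort key. A minimal Python sketch, assuming that a lower score is better and that the scorer reports where each match starts (the `Match` type and its fields are hypothetical):

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    choice: str  # the candidate string
    score: int   # lower is better (assumed)
    start: int   # index where the matched substring begins (assumed to come from the scorer)


def rank(matches: List[Match]) -> List[Match]:
    """Sort by score first; among equal scores, prefer matches that start farther right."""
    return sorted(matches, key=lambda m: (m.score, -m.start))
```

Because the start position only breaks ties, the scores themselves are unchanged, and a better-scoring match always sorts ahead regardless of where it starts.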

garybernhardt changed the title ~~Should we keep the new scoring algorithm?~~ Should we keep the new boundary-aware scoring algorithm? Mar 2, 2015

This was referenced Mar 2, 2015

Scoring algorithm needs improvement #30

Closed

Use word boundary scoring algorithm #69

Closed

Word boundary match #24

Closed

gshutler mentioned this issue Mar 3, 2015

Preserve case of inputs for rendering #84

Merged
