Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quality of matches: Extended acronyms #28

Open
mrkishi opened this issue Oct 17, 2016 · 9 comments
Open

Quality of matches: Extended acronyms #28

mrkishi opened this issue Oct 17, 2016 · 9 comments

Comments

@mrkishi
Copy link

mrkishi commented Oct 17, 2016

I came across another potential improvement.


Suppose we want to match editor localization from a long list of candidates:

const candidates = ['editor localization', 'edit', 'editor layout', 'expand line', 'enforce legion', ...]

We could start typing it from the start, and we'd find out there were conflicts up to editor lo. That's no good. We also realize it'd be foolish to try querying for el, as that's a very common acronym. Smart as we are, we decide to improve on the acronym and type:

fuzz.filter(candidates, 'edlo')

With a smug on our faces, we hit enter.

['deniedlol;p', 'edit localization', ...]

Scoring acronyms similarly to start-of-word matches is really great. However, it looks like it could be improved by also taking into account consecutive matching letters. In the previous example, edlo is a 4 consecutive middle-of-word letters match, but it is also an acronym with 4 start-of-word letter matches.

I have no idea how hard it'd be to make this modification (I still have to dive deeper into the algorithm), but wouldn't it be a desirable addition?

@jeancroy
Copy link
Owner

Yeah mixing acronym and consecutive letters in query don't work well.
There's an heuristic that try to determine if you want acronym or
consecutive letters and you'll confuse it.

eloc probably would work best.

On Mon, Oct 17, 2016, 00:23 mrkishi [email protected] wrote:

I came across another potential improvement.

Suppose we want to match editor localization from a long list of
candidates:

const candidates = ['editor localization', 'edit', 'editor layout', 'expand line', 'enforce legion', ...]

We could start typing it from the start, and we'd find out there were
conflicts up to editor lo. That's no good. We also realize it'd be
foolish to try querying for el, as that's a very common acronym. Smart as
we are, we decide to improve on the acronym and type:

fuzz.filter(candidates, 'edlo')

With a smug on our faces, we hit enter.

['deniedlol;p', 'edit localization', ...]


Scoring acronyms similarly to start-of-word matches is really great.
However, it looks like it could be improved by also taking into account
consecutive matching letters. In the previous example, edlo is a 4
consecutive middle-of-word letters match, but it is also an acronym with 4
start-of-word letter matches.

I have no idea how hard it'd be to make this modification (I still have to
dive deeper into the algorithm), but wouldn't it be a desirable addition?


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#28, or mute the
thread
https://github.com/notifications/unsubscribe-auth/AMLCEh08IhpN4gW0iGFfZPQEgMdgxtPzks5q0vhBgaJpZM4KYOGj
.

@jeancroy
Copy link
Owner

The problem is something like 'ed' vs 'edit database' the d is ambiguous so
we compare two dumb strategy : consecutive letter and consecutive acronym
to take a decision. Mixing consecutive and acronym won't score well for
acronym although those may be word part.

On Mon, Oct 17, 2016, 00:31 Jean Christophe Roy [email protected] wrote:

Yeah mixing acronym and consecutive letters in query don't work well.
There's an heuristic that try to determine if you want acronym or
consecutive letters and you'll confuse it.

eloc probably would work best.

On Mon, Oct 17, 2016, 00:23 mrkishi [email protected] wrote:

I came across another potential improvement.

Suppose we want to match editor localization from a long list of
candidates:

const candidates = ['editor localization', 'edit', 'editor layout', 'expand line', 'enforce legion', ...]

We could start typing it from the start, and we'd find out there were
conflicts up to editor lo. That's no good. We also realize it'd be
foolish to try querying for el, as that's a very common acronym. Smart as
we are, we decide to improve on the acronym and type:

fuzz.filter(candidates, 'edlo')

With a smug on our faces, we hit enter.

['deniedlol;p', 'edit localization', ...]


Scoring acronyms similarly to start-of-word matches is really great.
However, it looks like it could be improved by also taking into account
consecutive matching letters. In the previous example, edlo is a 4
consecutive middle-of-word letters match, but it is also an acronym with 4
start-of-word letter matches.

I have no idea how hard it'd be to make this modification (I still have to
dive deeper into the algorithm), but wouldn't it be a desirable addition?


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#28, or mute the
thread
https://github.com/notifications/unsubscribe-auth/AMLCEh08IhpN4gW0iGFfZPQEgMdgxtPzks5q0vhBgaJpZM4KYOGj
.

@jeancroy
Copy link
Owner

Edlo count as two segment of two letters, we are not able to teleport while
counting consecutive.(we do count consecutive in real space and acronym
space and take the vest of both)

On Mon, Oct 17, 2016, 00:42 Jean Christophe Roy [email protected] wrote:

The problem is something like 'ed' vs 'edit database' the d is ambiguous
so we compare two dumb strategy : consecutive letter and consecutive
acronym to take a decision. Mixing consecutive and acronym won't score well
for acronym although those may be word part.

On Mon, Oct 17, 2016, 00:31 Jean Christophe Roy [email protected] wrote:

Yeah mixing acronym and consecutive letters in query don't work well.
There's an heuristic that try to determine if you want acronym or
consecutive letters and you'll confuse it.

eloc probably would work best.

On Mon, Oct 17, 2016, 00:23 mrkishi [email protected] wrote:

I came across another potential improvement.

Suppose we want to match editor localization from a long list of
candidates:

const candidates = ['editor localization', 'edit', 'editor layout', 'expand line', 'enforce legion', ...]

We could start typing it from the start, and we'd find out there were
conflicts up to editor lo. That's no good. We also realize it'd be
foolish to try querying for el, as that's a very common acronym. Smart as
we are, we decide to improve on the acronym and type:

fuzz.filter(candidates, 'edlo')

With a smug on our faces, we hit enter.

['deniedlol;p', 'edit localization', ...]


Scoring acronyms similarly to start-of-word matches is really great.
However, it looks like it could be improved by also taking into account
consecutive matching letters. In the previous example, edlo is a 4
consecutive middle-of-word letters match, but it is also an acronym with 4
start-of-word letter matches.

I have no idea how hard it'd be to make this modification (I still have to
dive deeper into the algorithm), but wouldn't it be a desirable addition?


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#28, or mute the
thread
https://github.com/notifications/unsubscribe-auth/AMLCEh08IhpN4gW0iGFfZPQEgMdgxtPzks5q0vhBgaJpZM4KYOGj
.

@mrkishi
Copy link
Author

mrkishi commented Oct 17, 2016

Could acronym space be extended to take more than 1 start-of-word letters into account? Without drastic changes to the algorithm, that is.

Something that would represent the following rules:

  • ed could be counted as a segment of two letters: 1*2
  • el, a segment of two acronyms (1-letter): 2*1
  • edl, a segment of two acronyms ([2, 1]-letter): 2+1
  • edlo, a segment of two acronyms ([2, 2]-letter): 2+2

@jeancroy
Copy link
Owner

First, before making large change to algorithm I try to evaluate how far are we from a solution.
In this particular case, typing one or two letters edlo => edloca would probably suffice.

Then it's a bit hard to answer if something is possible. Mostly because I'd hate to answer no.
But currently acronym and consecutive are handled by very separate part of code.

There's a few area where I know the code that handle acronym can be improved.
However because start of word count as much as acronym it does not matter if e belong to el or to ed it will be scored the same.

I rely a lot on sequence length to sort useful from garbage. (And there's multiple example of garbage accidentally hitting acronym letters). A sequence of two is one letter away from an accidental match, hence my first recommandation to try to aim for a sequence of at least 3.

@jeancroy
Copy link
Owner

jeancroy commented Oct 17, 2016

I'll add thank you for the time you put into these report.
Even if I don't see an obvious solution now I keep those in mind for later.

@mrkishi
Copy link
Author

mrkishi commented Oct 17, 2016

Makes sense! For now, I'll train to use longer consecutive matches instead of acronyms.

I'm the one who should be thanking you for such an awesome library. I wish I could help you with more than simple use-case discussions, but I'm afraid it'll take me quite a while to grasp the algorithms in play here (I'm trying to, nevertheless!).

Also, I just saw your FuzzySearch library. It seems to share (maybe more than) a few similarities with fuzzaldrin-plus, while being "fuzzier" in its heuristics usage. Is that a fair statement? Do you see one of them ever deprecating the other?

Thank you, again 💯

@jeancroy
Copy link
Owner

jeancroy commented Oct 17, 2016

Hi

I'll train to use longer consecutive matches instead of acronyms.

One way that works well is typing all the acronym first letters, then if you need more refinement, you add parts of last letters.

Also, I just saw your FuzzySearch library.

FuzzySearch does word-against-word search, including out of order words.

Use case:
I want to .... ( paint my room )
would match against ( room painting )

fuzzaldin-plus does bag-of-letters against programming symbol or path search.

In the end it was natural language vs programming language.
Also FuzzySearch allow more complex structure. IE search in title, author, description.

I'll try to answer your other analysis later today.

@aguynamedben
Copy link

aguynamedben commented Jul 25, 2017

@jeancroy Wow, I came here wondering if there was a way to get fuzzaldrin-plus to match out-of-order queries (i.e. "bar foo" to match "foo bar") and FuzzySearch is exactly what I'm looking for. My project is using more natural language than file paths. Thanks for the time you spend on these projects, and for making them so fast using radical algorithms =)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants