Quality of matches: Proper casing #27

Open
mrkishi opened this issue Oct 17, 2016 · 3 comments

Comments

@mrkishi

mrkishi commented Oct 17, 2016

Hello, folks.

I've come across the following situation:

const data = ['Eat ten pizzas', 'Ten Pizzas']
fuzz.filter(data, 'tp')
// > ['Eat ten pizzas', 'Ten Pizzas']

Wouldn't Ten Pizzas make more sense, here? I haven't studied the algorithm too deeply, yet, but I think this is caused by the proper casing rules, as evidenced by this version of the same test:

const data = ['eat ten pizzas', 'ten pizzas']
fuzz.filter(data, 'tp')
// > ['ten pizzas', 'eat ten pizzas']

Now, while proper casing is indeed important when choosing matches, I feel like it's not a good indicator on lowercase queries, and it's currently being given too much weight.

A query that contains uppercase characters conveys a proper casing intention quite strongly. The opposite, however, is not true: a lowercase query doesn't mean you'd prefer lowercase matches.
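To make the asymmetry concrete, here is a minimal sketch of a smart-case comparison. The helpers `smartCaseCharMatch` and `smartCaseSubsequence` are hypothetical names made up for illustration; they are not part of fuzzaldrin-plus and say nothing about how it scores matches.

```javascript
// Hypothetical helpers, not fuzzaldrin-plus code.

// Compare one query character against one candidate character.
function smartCaseCharMatch(queryChar, candidateChar) {
  const queryIsUpper = queryChar !== queryChar.toLowerCase();
  if (queryIsUpper) {
    // An uppercase query character signals intent: require the same case.
    return queryChar === candidateChar;
  }
  // A lowercase (or caseless) query character accepts either case.
  return queryChar === candidateChar.toLowerCase();
}

// True when the query characters appear in order in the candidate
// under the smart-case rule above.
function smartCaseSubsequence(query, candidate) {
  const q = [...query];
  let qi = 0;
  for (const ch of candidate) {
    if (qi < q.length && smartCaseCharMatch(q[qi], ch)) qi++;
  }
  return qi === q.length;
}

console.log(smartCaseSubsequence('pc', 'Proper Case'));   // true
console.log(smartCaseSubsequence('PC', 'proper case'));   // false
console.log(smartCaseSubsequence('PC', 'a Proper Case')); // true
```

This only decides whether a candidate matches at all; ranking between two matching candidates is a separate question.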

Consider these hypothetical queries:

(['Proper Case', 'A proper case'], 'pc') => 'Proper Case'
(['proper case', 'a Proper Case'], 'PC') => 'a Proper Case'

fuzzaldrin-plus gets the second one, but misses the first. Am I off-base here in what I consider better matches?

@jeancroy
Owner

jeancroy commented Oct 17, 2016

Hi mrkishi, thanks for the report.

while proper casing is indeed important when choosing matches, I feel like it's not a good indicator on lowercase queries,

What you describe is what I call smart-case: uppercase means uppercase, lowercase can mean anything.
That convention is very popular in Vim circles, among others.

There are several reasons I did not go that way. One of them is that I try to be agnostic of programming style: snake_case, CamelCase and kebab-case are weighted pretty much the same.

Also, my main concern is reachability. Imagine you have a local variable named something and also a method named Something or SomethingElse. There must exist a query that allows selecting the lowercase local variable.

Once reachability is there, I know that after a bit of a learning curve we have a useful tool.
If I optimize against reachability, then some options will not be selectable no matter how experienced the user is.

and it's currently being given too much weight.

There was a LOT of pressure for proper casing, often for CamelCase, but there were also some use cases for proper casing as-is.

What to do from here?

It might be a coincidence, but both of your examples fall into what I call an acronym exact match (that is, the acronym of the subject is exactly the query):

  • 'pc' => 'Proper Case'
  • 'tp' => 'Ten Pizzas'

In theory it's also a strong bonus, but it grows with acronym length, so I may need to investigate what happens here.
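For readers following along, the notion of an acronym exact match can be sketched roughly as below. This is only an illustration: `acronymOf` and `isAcronymExactMatch` are hypothetical names, and the word-boundary rules (whitespace, `_`, `-`, and CamelCase humps) are an assumption rather than the library's actual rules.

```javascript
// Hypothetical helpers, not fuzzaldrin-plus code.

// Collect the characters that start a word. Assumed word starts: the start of
// the string, a character after whitespace, '_' or '-', and an uppercase
// letter that follows a lowercase one (CamelCase).
function acronymOf(subject) {
  let acronym = '';
  for (let i = 0; i < subject.length; i++) {
    const ch = subject[i];
    const prev = i > 0 ? subject[i - 1] : ' ';
    const afterSeparator = /[\s_-]/.test(prev);
    const camelHump = /[a-z]/.test(prev) && /[A-Z]/.test(ch);
    if (/[A-Za-z0-9]/.test(ch) && (afterSeparator || camelHump)) {
      acronym += ch;
    }
  }
  return acronym;
}

// The acronym of the subject is exactly the query (case ignored here).
function isAcronymExactMatch(subject, query) {
  return query.length > 0 &&
    acronymOf(subject).toLowerCase() === query.toLowerCase();
}

console.log(isAcronymExactMatch('Ten Pizzas', 'tp'));     // true
console.log(isAcronymExactMatch('Eat ten pizzas', 'tp')); // false
```

With these rules 'Eat ten pizzas' has the acronym 'Etp', so 'tp' matches part of its acronym but not all of it.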


Another possibility is to have an option switch to enable a smartCase mode. It's not that hard to do, and in the end, it would be about testing whether it's too slow to maintain both code paths.

@mrkishi
Author

mrkishi commented Oct 17, 2016

Thank you for the detailed (and prompt) response, @jeancroy!

The reachability argument is extremely convincing, and I didn't think of that. However, I'm not sure I completely understand its impact on these examples. It doesn't seem like smart-case goes against reachability. On the something vs Something example, an s query would favor something regardless of smart-casing support.

But even disregarding smart-case, I still come across some odd behaviors.

Let me preface this message with some (made-up, sorry) term definitions to minimize confusion (it's still pretty confusing..):

Literal [pattern]: a pattern of consecutive letters
Acronym [pattern]: a pattern of consecutive start-of-word letters

Match: any combination of sequential literal and/or acronym patterns

Literal [exact] match: a literal pattern that spans 100% of the query
Acronym [exact] match: an acronym pattern that spans 100% of the query
Exact match: a literal or acronym match

Full-length literal [exact] match: a literal match that also spans 100% of the candidate
Full-length acronym [exact] match: an acronym match that also spans 100% of the candidate
Full-length [exact] match: a full-length literal or full-length acronym match
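The definitions above can be sketched as a small classifier. This is a hypothetical helper written only to pin the terms down: it ignores case entirely, assumes words are separated by whitespace, and is not how fuzzaldrin-plus detects these cases.

```javascript
// Hypothetical classifier for the terms defined above, not fuzzaldrin-plus code.
function classifyMatch(candidate, query) {
  const c = candidate.toLowerCase();
  const q = query.toLowerCase();
  if (q.length === 0) return 'none';

  // Start-of-word letters, e.g. 'a proper case' -> 'apc'.
  const initials = c
    .split(/\s+/)
    .filter(Boolean)
    .map((word) => word[0])
    .join('');

  if (c === q) return 'full-length literal';        // query spans the whole candidate
  if (initials === q) return 'full-length acronym'; // query spans every word start
  if (c.includes(q)) return 'literal';              // consecutive letters
  if (initials.includes(q)) return 'acronym';       // consecutive word starts
  return 'other';
}

console.log(classifyMatch('PROPERCASE', 'propercase')); // 'full-length literal'
console.log(classifyMatch('PROPER CASE', 'pc'));        // 'full-length acronym'
console.log(classifyMatch('a propercase', 'pr'));       // 'literal'
console.log(classifyMatch('a proper case', 'pc'));      // 'acronym'
```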

For instance, proper casing is apparently not as influential on literal matches as it is on acronym matches:

(['A PROPER CASE', 'a proper case'], 'pc') => 'a proper case'
(['A PROPER CASE', 'a proper case'], 'PC') => 'A PROPER CASE'

(['A PROPERCASE', 'a propercase'], 'pr') => 'A PROPERCASE'
(['A PROPERCASE', 'a propercase'], 'PR') => 'A PROPERCASE'

// factor in length
(['A PR', 'a pr'], 'pr') => 'A PR'
(['A PR', 'a pr'], 'PR') => 'A PR'

A full-length literal match will "ignore" case errors, while an equivalent full-length acronym will not:

(['PROPER CASE', 'a proper case'], 'pc') => 'a proper case'
(['PROPERCASE', 'a propercase'], 'propercase') => 'PROPERCASE'

(['PR', 'a pr'], 'pr') => 'PR'

--

(['A PROPER CASE', 'proper case'], 'PC') => 'A PROPER CASE'
(['A PROPERCASE', 'propercase'], 'PROPERCASE') => 'propercase'

(['A PR', 'pr'], 'PR') => 'pr'

I have the feeling that acronyms would work better if these behaviors were aligned: either full-length acronyms should be more lenient towards case mismatches (like full-length literals), or literal matches should favor proper casing over being full-length.

Personally, I think giving full-length acronyms the same text-casing tolerance as full-length literal matches would be the more useful approach:

(['PROPER CASE', 'a proper case'], 'pc') => 'PROPER CASE'
(['PROPERCASE', 'a propercase'], 'propercase') => 'PROPERCASE'

Thoughts?

@jeancroy
Owner

Thank you for the report.

The typical definition of smart case I've seen is that proper casing doesn't
matter for lowercase queries. I guess it can also be interpreted as mattering
less, and you'd be correct that this would allow better reach.

As for the other findings, you may have found a bug. There's an express lane
for some cases, and I may not have synced those properly.

What I can do is implement smart case, then lower the bonus for
case-sensitive matches, and see if all the tests pass with that.
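One way to picture the "matters less" variant is the per-character bonus sketched below. This is a rough illustration under stated assumptions: `caseBonus`, `smartCase`, `weight` and `lenientFactor` are hypothetical names and values chosen for the example, not the library's scoring code or options.

```javascript
// Hypothetical scoring fragment, not fuzzaldrin-plus code.
// Returns the casing bonus for one matched query/candidate character pair.
function caseBonus(
  queryChar,
  candidateChar,
  { smartCase = true, weight = 1, lenientFactor = 0.5 } = {}
) {
  // Different letters: no match, so no bonus.
  if (queryChar.toLowerCase() !== candidateChar.toLowerCase()) return 0;

  // Exact casing always earns the full bonus.
  if (queryChar === candidateChar) return weight;

  // Casing differs. Under smart case, a lowercase query character still earns
  // a reduced bonus; an uppercase query character earns nothing.
  const queryIsLower = queryChar === queryChar.toLowerCase();
  return smartCase && queryIsLower ? weight * lenientFactor : 0;
}

console.log(caseBonus('s', 's')); // 1
console.log(caseBonus('s', 'S')); // 0.5
console.log(caseBonus('S', 's')); // 0
console.log(caseBonus('s', 'S', { smartCase: false })); // 0
```

Because an exact-case match still scores higher than a lenient one, a query of s would keep favoring something over Something, which preserves the reachability property discussed earlier.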

Your idea of being more lenient with exact acronyms seems good. It's hard to
land in that case by chance.

