Feature request: specifying allowed characters when fuzzy-matching #338

Closed
mrabarnett opened this issue Aug 14, 2019 · 6 comments
Labels
enhancement New feature or request minor

Comments

@mrabarnett
Owner

Original report by Anonymous.


It would be a neat feature if, in addition to the type of fuzziness, one could specify character classes when performing a fuzzy match. For instance, when fuzzy-matching a word with spelling errors, it hardly makes sense to allow [^a-z] characters to be introduced.

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Could you mock up an example?

@mrabarnett
Owner Author

Original comment by Peter Holm (Bitbucket: [Peter Holm](https://bitbucket.org/Peter Holm)).


Say I have a large corpus of news articles, blog posts, etc., and I am looking for misspellings of "matthew mcconaughey". His last name is notoriously difficult to get right, so a lot of variations are expected to pop up. With a regex like (?i)matthew (mcconaughey){1<=e<=5} I would also match things like mcconaughey. and _mcconaughey, which aren't misspellings. One could imagine extending the fuzzy-matching syntax to specify which kinds of characters may be inserted or substituted. One idea for a potential syntax: (?i)matthew (mcconaughey){1<=e<=5|[a-z]}.
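
For illustration, the over-matching described here can be reproduced with a minimal sketch, assuming the third-party regex module is installed. The candidate strings are made up for this example, and fullmatch is an assumption of the sketch (not part of the report): it forces the stray character to be consumed by the fuzzy group instead of being left outside the match.

```python
import regex  # third-party module (pip install regex), not the stdlib "re"

# Pattern from the comment: between 1 and 5 errors of any kind in the surname.
pattern = regex.compile(r"(?i)matthew (mcconaughey){1<=e<=5}")

# Hypothetical inputs chosen for this sketch.
candidates = [
    "matthew mcconaughy",    # genuine misspelling (missing "e")
    "matthew mcconaughey.",  # trailing punctuation, not a misspelling
    "matthew _mcconaughey",  # stray underscore, not a misspelling
]

# fullmatch requires the whole string to be accounted for, so any extra
# character has to be absorbed as a fuzzy error.
unconstrained = {text: bool(pattern.fullmatch(text)) for text in candidates}

for text, matched in unconstrained.items():
    print(f"{text!r}: {matched}")
```

Without a restriction on which characters may be introduced, all three candidates would be expected to match, which is the behaviour the request is trying to avoid.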

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Your example looks wrong because it has [^a-z] where I'd expect [a-z].

@mrabarnett
Owner Author

Original comment by Peter Holm (Bitbucket: [Peter Holm](https://bitbucket.org/Peter Holm)).


Sorry, you’re correct; I’ve edited it now :)

@mrabarnett
Owner Author

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).


Done in regex 2019.08.19.

I've chosen to use ":" instead of "|", so your example would be (?i)matthew (mcconaughey){1<=e<=5:[a-z]}. The test isn't limited to a character set; it can also be, say, a property like \d or \p{Digit}.
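
A minimal sketch of how the ":" syntax might be exercised, assuming regex 2019.08.19 or later is installed. The input strings, the use of fullmatch, and the second pattern `(?:2019){s<=1:\d}` are made up for illustration and are not taken from the release itself.

```python
import regex  # third-party module (pip install regex), version 2019.08.19+

# Inserted or substituted characters in the surname must match [a-z].
constrained = regex.compile(r"(?i)matthew (mcconaughey){1<=e<=5:[a-z]}")

# Hypothetical inputs chosen for this sketch.
candidates = [
    "matthew mcconaughy",    # deletion only: no character test needed
    "matthew mcconnaughey",  # inserted "n" is in [a-z]
    "matthew mcconaughey.",  # "." would have to be inserted: not in [a-z]
    "matthew _mcconaughey",  # "_" would have to be inserted: not in [a-z]
]

# fullmatch forces every character of the input to be accounted for.
results = {text: bool(constrained.fullmatch(text)) for text in candidates}
for text, matched in results.items():
    print(f"{text!r}: {matched}")

# The test after ":" can also be a class escape such as \d.
# Hypothetical pattern: at most one substitution, which must be a digit.
digits = regex.compile(r"(?:2019){s<=1:\d}")
digit_results = {text: bool(digits.fullmatch(text)) for text in ("2018", "201x")}
print(digit_results)
```

With the character test in place, the two letter-only misspellings should still match, while the variants that need "." or "_" to be absorbed as an error should be rejected.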

@mrabarnett
Owner Author

Original comment by Peter Holm (Bitbucket: [Peter Holm](https://bitbucket.org/Peter Holm)).


Awesome! I like that solution a lot.
