Feature request: specifying allowed characters when fuzzy-matching #338

mrabarnett · 2019-08-14T09:37:45Z

It would be a neat feature if, in addition to the type of fuzziness, one could specify character classes when performing a fuzzy match. For instance, when fuzzy-matching a word with spelling errors, it hardly makes sense to allow the introduction of [^a-z] characters.

mrabarnett · 2019-08-14T18:18:47Z

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).

Could you mock up an example?

mrabarnett · 2019-08-15T13:21:44Z

Original comment by Peter Holm (Bitbucket: [Peter Holm](https://bitbucket.org/Peter Holm)).

Say I have a large corpus of news articles, blog posts, etc., and I am looking for misspellings of "matthew mcconaughey". His last name is notoriously difficult to get right, so a lot of variations are expected to pop up. With a regex like (?i)matthew (mcconaughey){1<=e<=5} I would match things like mcconaughey. and _mcconaughey, which aren't misspellings. One could imagine extending the fuzzy-matching syntax to specify which kinds of characters are allowed to be inserted or substituted. One idea for a potential syntax: (?i)matthew (mcconaughey){1<=e<=5|[a-z]}.
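To make the idea concrete, here is a minimal pure-Python sketch of the semantics I have in mind. It is only an illustration, not how regex implements fuzzy matching: `constrained_distance` and its `allowed` parameter are hypothetical names, and the sketch simply computes an edit distance in which every inserted or substituted character must come from the allowed set (deletions are always permitted).

```python
import string

LOWERCASE = frozenset(string.ascii_lowercase)


def constrained_distance(pattern, text, allowed=LOWERCASE):
    """Edit distance from `pattern` to `text`, where every character of
    `text` that is inserted or substituted must be in `allowed`.

    Returns None when no alignment satisfies the constraint.
    (Hypothetical helper, for illustration only.)
    """
    inf = float("inf")

    # prev[j] = cost of turning the empty pattern prefix into text[:j];
    # only insertions are possible here, so each character must be allowed.
    prev = [0] + [inf] * len(text)
    for j, tch in enumerate(text, 1):
        prev[j] = prev[j - 1] + 1 if tch in allowed else inf

    for pch in pattern:
        # Turning a non-empty pattern prefix into "" needs only deletions.
        cur = [prev[0] + 1] + [inf] * len(text)
        for j, tch in enumerate(text, 1):
            if pch == tch:
                best = prev[j - 1]          # exact match, no error
            elif tch in allowed:
                best = prev[j - 1] + 1      # substitution with an allowed char
            else:
                best = inf                  # substitution not permitted
            best = min(best, prev[j] + 1)   # deletion of pch (always allowed)
            if tch in allowed:
                best = min(best, cur[j - 1] + 1)  # insertion of an allowed char
            cur[j] = best
        prev = cur

    return None if prev[-1] == inf else prev[-1]


if __name__ == "__main__":
    for candidate in ["mcconaghey", "mcconaughay", "mcconaughey.", "_mcconaughey"]:
        # Lower-casing stands in for the (?i) flag in the regex above.
        print(candidate, constrained_distance("mcconaughey", candidate.lower()))
```

With the default lowercase set, real misspellings such as mcconaghey still get a small distance, while mcconaughey. and _mcconaughey yield None because the extra character could only be explained by a disallowed insertion or substitution.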

mrabarnett · 2019-08-16T18:55:47Z

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).

Your example looks wrong because it has [^a-z] where I'd expect [a-z].

mrabarnett · 2019-08-17T14:10:06Z

Original comment by Peter Holm (Bitbucket: [Peter Holm](https://bitbucket.org/Peter Holm)).

Sorry, you’re correct. I’ve edited it now :)

mrabarnett · 2019-08-19T17:31:50Z

Original comment by Matthew Barnett (Bitbucket: mrabarnett, GitHub: mrabarnett).

Done in regex 2019.08.19.

I've chosen to use ":" instead of a "|", so your example would be (?i)matthew (mcconaughey){1<=e<=5:[a-z]}, and it's not limited to a character set, but can also be, say, a property like \d or \p{Digit}.

mrabarnett · 2019-08-19T21:14:02Z

Original comment by Peter Holm (Bitbucket: [Peter Holm](https://bitbucket.org/Peter Holm)).

Awesome! I like that solution a lot.

mrabarnett closed this as completed Aug 19, 2019

mrabarnett mentioned this issue Jan 8, 2023

For Fuzzy Matching, it is impossible to specify a test character set for both insertion and substitution at the same time. #487

Open
