Break down ligatures in preprocessing to make search result more intuitive #275

Open
hehelego opened this issue Feb 12, 2025 · 0 comments

Is your feature request related to a problem? Please describe.

Many PDF files contain ligature substitutions, so "fi" or "ff" becomes a single character instead of two consecutive but separate ones.
When you use ripgrep-all to search for a string containing "fi" or "ff", the substituted forms are not matched.

Describe the solution you'd like
Break down common ligatures in rga-preproc.
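
A rough sketch of what such a preprocessing step could look like, written in Python for illustration only (it is not rga-preproc's actual code, and `expand_ligatures` is a hypothetical name). It assumes the extracted text uses the Latin ligature code points U+FB00 to U+FB06 from Unicode's Alphabetic Presentation Forms block:

```python
# Hypothetical helper: map each Latin ligature code point to its letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t ligature
    "\ufb06": "st",
}

# str.translate accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with the ligatures above replaced by separate letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    print(expand_ligatures("de\ufb01nition of an o\ufb03ce"))
```

An explicit table only touches ligatures. Unicode NFKC normalization would also expand them, but it rewrites many other characters too, which may or may not be wanted in search output.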

Describe alternatives you've considered
Identify contractable ligatures in the search pattern and replace them with (contracted)|(original).
For example, rga definition should actually search for de((fi)|X)nition, where X is the single-character ligature "ﬁ" (U+FB01).
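
The pattern rewrite described in this alternative could be sketched as follows, again in Python and only as an illustration. `ligature_tolerant_pattern` is a hypothetical name, the input is assumed to be a literal string rather than a regex, and only the f-ligatures are handled:

```python
import re

# Letter sequences and the single-character ligature each may appear as.
EXPANSIONS = {
    "ffi": "\ufb03",
    "ffl": "\ufb04",
    "ff": "\ufb00",
    "fi": "\ufb01",
    "fl": "\ufb02",
}

# Longest sequences first, so "ffi" is tried before "ff" and "fi".
_SEQ = re.compile("|".join(sorted(EXPANSIONS, key=len, reverse=True)))


def ligature_tolerant_pattern(literal: str) -> str:
    """Build a regex matching the literal with or without ligatures."""
    escaped = re.escape(literal)  # letters are left unescaped
    return _SEQ.sub(
        lambda m: f"(?:{m.group(0)}|{EXPANSIONS[m.group(0)]})", escaped
    )


if __name__ == "__main__":
    pattern = ligature_tolerant_pattern("definition")
    print(pattern)
    print(bool(re.search(pattern, "a de\ufb01nition in a PDF")))
```

This sketch does not cover mixed forms such as "ﬀ" followed by a plain "i", and it would have to be adapted before it could be applied to user-supplied regexes, which is part of why expanding ligatures during preprocessing looks simpler.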

Additional context

Background: Wikipedia, "Ligatures in computer typesetting"

Image
The results are 27429, 5986, and 21451.
