[api-minor] Support search with or without diacritics (bug 1508345, bug 916883, bug 1651113) #13261
calixteman commented Apr 18, 2021 (edited)
- get the original index using a binary search instead of a linear one;
- normalize the text using NFD;
- convert the query string into a RegExp;
- replace whitespace in the query with \s+;
- handle hyphens at end of line that are used to break a word;
- add some \s* around punctuation signs
The PR is almost ready but highlights are wrong with RTL languages.
For now it doesn't work with RTL languages and I'll do that in another patch.
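The first two items in the list above (NFD normalization and a binary search to recover original indices) can be sketched as follows. This is only an illustration under stated assumptions, not the code from the PR: the names `normalizeWithShifts` and `toOriginalIndex` are hypothetical, and the text is normalized one code point at a time to keep the bookkeeping simple, which skips the canonical reordering of mark sequences that a whole-string `normalize` would perform.

```javascript
// Sketch only: function names are hypothetical, not taken from the PR.

// Normalize `text` to NFD and record, as [normalizedIndex, shift] pairs,
// every position from which "original index = normalized index + shift"
// starts using a new shift value.
function normalizeWithShifts(text) {
  const shifts = [[0, 0]];
  let normalized = "";
  let originalPos = 0;
  for (const ch of text) {
    // Simplification: one code point at a time (see the note above).
    normalized += ch.normalize("NFD");
    originalPos += ch.length;
    const shift = originalPos - normalized.length;
    if (shift !== shifts[shifts.length - 1][1]) {
      shifts.push([normalized.length, shift]);
    }
  }
  return { normalized, shifts };
}

// Binary search for the last entry whose normalized index is <= pos,
// then apply its shift. Positions that fall inside a decomposed
// character are not treated specially in this sketch.
function toOriginalIndex(shifts, pos) {
  let lo = 0;
  let hi = shifts.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (shifts[mid][0] <= pos) {
      lo = mid;
    } else {
      hi = mid - 1;
    }
  }
  return pos + shifts[lo][1];
}

// Usage: "é" (U+00E9) becomes "e" + U+0301, so later indices move by one.
const { normalized, shifts } = normalizeWithShifts("caf\u00e9 au lait");
const found = normalized.indexOf("au"); // 6 in the normalized text
console.log(toOriginalIndex(shifts, found)); // 5 in the original text
```

Because only the positions where the shift changes are stored, the table stays small for mostly-ASCII pages, and each lookup costs a logarithmic number of comparisons rather than a linear scan.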
You probably want to remove this from the commit message (and PR description) now :-)
Given that this implementation uses a lot more, and more complex, regular expressions during both initial text-parsing and subsequent searching: What sort of performance impact, if any, does this patch have for larger and/or more complex documents? For example, what about e.g. the
Good questions.
Normalization (ms): 71, 63, 77, 77
And the same in master with
Normalization (ms): 29, 29, 27, 20
From a user's point of view, I don't see any difference between the two searches, so I think that this perf regression is acceptable. I added some code to remove the diacritics handling from the query regexp when there are no diacritics on the page (which is the case in kjv.pdf), and there is no significant difference in time between the two searches, so in the search part we pay for the use of regexp instead of
And as usual, I'm open to any good idea to improve this.
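The optimization mentioned above, leaving the diacritics handling out of the query regexp when the page contains no diacritics, might look roughly like the sketch below. It also folds in the `\s+` and `\s*` rules from the PR description. The names `queryToRegExp` and `pageHasDiacritics` are hypothetical, and stripping the marks from the query itself is an assumption about how a search "without diacritics" would behave, not something confirmed by the discussion.

```javascript
// Sketch only: names and structure are hypothetical, not the PR's code.
const SYNTAX_CHARS = /[.*+?^${}()|[\]\\]/g;

function queryToRegExp(query, { pageHasDiacritics = true } = {}) {
  // Assumption: the query is matched without its own diacritics.
  const stripped = query.normalize("NFD").replace(/\p{M}/gu, "").trim();

  const source = stripped.replace(/(\s+)|(\p{P})|./gu, (match, ws, punct) => {
    if (ws) {
      return "\\s+"; // any run of whitespace in the query
    }
    const escaped = match.replace(SYNTAX_CHARS, "\\$&");
    if (punct) {
      return `\\s*${escaped}\\s*`; // tolerate spaces around punctuation
    }
    // Only pay for the optional combining marks when the page has some.
    return pageHasDiacritics ? `${escaped}\\p{M}*` : escaped;
  });
  return new RegExp(source, "gu");
}

// Usage against NFD-normalized page text:
const page = "le  re\u0301sume\u0301 , fini";
console.log(queryToRegExp("le resume, fini").test(page)); // true
console.log(queryToRegExp("resume", { pageHasDiacritics: false }).source); // resume
```

With `pageHasDiacritics` set to `false`, the generated pattern contains no `\p{M}*` groups at all, which is the kind of shortcut described for documents such as kjv.pdf.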
@timvandermeij, @Snuffleupagus do you have any objections to landing this, or any ideas to improve perf or whatever?
I've been a little bit short on time to really review this properly, and have only looked briefly at the implementation, so it'd probably be a very good idea to actually do a "full" review before landing it since this is a significant change to the find implementation.
Same here. I think it would be good to, aside from the full review, wait until #13418 is done before merging since it's quite a change to the implementation and it would be good to get the release out first to avoid any risks. (We usually do this for significant changes, such as the text layer and struct tree PRs.)
From: Bot.io (Linux m4)
Success
Full output at http://54.241.84.105:8877/f79b098ad1ade97/output.txt
Total script time: 4.67 mins
Published
/botio unittest
From: Bot.io (Linux m4)
Received
Command cmd_unittest from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.241.84.105:8877/39d0ab5729ad9a2/output.txt
From: Bot.io (Windows)
Received
Command cmd_unittest from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.193.163.58:8877/174cda3f5d3abe1/output.txt
From: Bot.io (Linux m4)
Success
Full output at http://54.241.84.105:8877/39d0ab5729ad9a2/output.txt
Total script time: 3.15 mins
From: Bot.io (Windows)
Success
Full output at http://54.193.163.58:8877/174cda3f5d3abe1/output.txt
Total script time: 6.12 mins
/botio integrationtest
From: Bot.io (Windows)
Received
Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.193.163.58:8877/aa860f7ef9c0a8e/output.txt
From: Bot.io (Linux m4)
Received
Command cmd_integrationtest from @Snuffleupagus received. Current queue size: 0
Live output at: http://54.241.84.105:8877/1aa7bf3800cc78f/output.txt
From: Bot.io (Linux m4)
Success
Full output at http://54.241.84.105:8877/1aa7bf3800cc78f/output.txt
Total script time: 4.01 mins
From: Bot.io (Windows)
Success
Full output at http://54.193.163.58:8877/aa860f7ef9c0a8e/output.txt
Total script time: 6.58 mins
For your information, we'd like to have this feature in the next nightly cycle (the soft freeze is next week) and we'll ask QA to check that everything is fine.
OK, let's try landing this and see how it goes :-)
I've tried to do some manual testing, but that's obviously quite limited and given the scope/size of these changes we obviously need more widespread testing.
If the quality is not good enough, we'll back it out and I'll work on improving what is needed.
Unless there are really big problems, it might be easier to fix in place (and uplift patches) as necessary since trying to back out a PR this size could very quickly become difficult.
Note that the *browser* findbar in Firefox uses "Title Case" for the labels, and it thus seems like a good idea to ensure that `PDFFindBar` is consistent with that. Furthermore, the new label added in PR mozilla#13261 uses the "Title Case" format, which means that currently the default viewer findbar looks inconsistent.
*Please note:* Based on the official Firefox localization docs, see https://firefox-source-docs.mozilla.org/l10n/overview.html#string-updates, changing only the casing should *not* require updating the key:
> 1) If the change is minor, like fixing a spelling error or case, the developer should update the en-US translation without changing the l10n-id.