How to handle input with characters having more than one byte in UTF-8 #154
Comments
Because Hyperscan reports match offsets in bytes, not in characters.
Thank you for the reply! I understand that that's the reason, but is there any workaround? Or is this a limitation of Hyperscan, in the sense that you cannot get exact character offsets with UTF-8 input?
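For context, the size mismatch is visible in plain Python, without Hyperscan involved at all; a quick illustration using the "test®" string from the report:

```python
# "®" (U+00AE) encodes to two bytes in UTF-8, so the byte length of the
# input is larger than its character length.
text = "test®"
print(len(text))                   # 5 characters
print(len(text.encode("utf-8")))   # 6 bytes -- Hyperscan's offsets count these
```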
Try adding the HS_FLAG_UTF8 flag.
I'm facing the same issue. Adding the UTF-8 flag does not solve it, and the matches returned are still byte offsets rather than character offsets. For instance, if the input contains a two-byte character such as "®", every offset after it is shifted by one.

The problem gets worse if it is a kanji (a Chinese character), katakana, or hiragana (Japanese characters), each of which encodes to 3 bytes, so the match indexes are misplaced by 2 for every such character encountered.

Looks like a bug that should be addressed by the internal processing of the HS_FLAG_UTF8 flag.
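Since the reported positions are byte offsets into the encoded input, one possible workaround (a sketch, not something the library provides) is to translate each byte offset back to a character offset by decoding the UTF-8 prefix that precedes it:

```python
# Sketch: map a byte offset (as reported by Hyperscan) back to a character
# offset by counting the characters in the decoded prefix. This assumes the
# offset falls on a character boundary, which match boundaries normally do
# when the pattern is compiled with HS_FLAG_UTF8 and the input is valid UTF-8.
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    return len(data[:byte_offset].decode("utf-8"))

data = "test®".encode("utf-8")       # 6 bytes for 5 characters
print(byte_to_char_offset(data, 6))  # 5 == len("test®")

data = "漢字".encode("utf-8")         # each kanji encodes to 3 bytes
print(byte_to_char_offset(data, 6))  # 2
```

Decoding the prefix for every match is linear in the offset; for long inputs with many matches it can be cheaper to precompute a byte-index-to-character-index table once per input.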
Hi,
first of all thank you for this amazing library.
While playing around with it I stumbled upon this issue.
When matching on strings containing characters that UTF-8 encodes as more than one byte, the end offset is wrong.
See for instance this example:
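A minimal sketch of such a test, assuming the python-hyperscan bindings (the setup and callback below are illustrative, not the exact code from the report):

```python
# Assumed python-hyperscan API: scan "test®" for the literal pattern "test®"
# and print the offsets Hyperscan reports for the match.
import hyperscan

def on_match(pat_id, start, end, flags, context=None):
    # start/end are positions in the UTF-8 byte stream, not character indexes
    print(pat_id, start, end)

db = hyperscan.Database()
db.compile(
    expressions=["test®".encode("utf-8")],
    ids=[0],
    elements=1,
    flags=[hyperscan.HS_FLAG_UTF8],
)
db.scan("test®".encode("utf-8"), match_event_handler=on_match)
# prints: 0 0 6  -- the end offset 6 is a byte count, not a character count
```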
The highest end offset is `6`, but `len("test®")` is `5`. Is there any workaround to this? Am I misunderstanding something?
Thank you!