How to handle input with characters having more than one byte in UTF-8 #154

Open
mar4th3 opened this issue Aug 29, 2024 · 4 comments

@mar4th3

mar4th3 commented Aug 29, 2024

Hi,

first of all thank you for this amazing library.

While playing around with it I stumbled upon this issue.

When matching on strings containing characters that UTF-8 encodes as more than one byte, the end offset is wrong.

See for instance this example:

import hyperscan

matches = []


def match_event_handler(dbid, start, end, flags, context) -> bool | None:
    matches.append(end)


expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
)


text = "test®"
db.scan(text.encode("utf-8"), match_event_handler=match_event_handler)

print(matches)
# [5, 6]

The highest end offset is 6, but len("test®") is 5.

Is there any workaround to this? Am I misunderstanding something?

Thank you!

@betterlch

Because the offsets are counted in bytes of the encoded input, and len(text.encode()) is 6:
text.encode() == b'test\xc2\xae'
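
To make the difference concrete, here is a small check that uses only the standard library (no hyperscan involved). It compares the character count with the byte count of the example string, and shows that byte offset 6 corresponds to character offset 5. It also suggests that the other reported offset, 5, falls in the middle of the two-byte "®", presumably because the pattern was compiled without HS_FLAG_UTF8 and "." matched a single byte; that last point is an assumption, not something confirmed in this thread.

```python
text = "test®"
data = text.encode("utf-8")

print(len(text))  # 5 characters (code points)
print(len(data))  # 6 bytes, because "®" is encoded as b'\xc2\xae'

# Decoding the first 6 bytes gives back all 5 characters.
print(len(data[:6].decode("utf-8")))  # 5

# The first 5 bytes end inside "®", so they are not valid UTF-8 on their own.
try:
    data[:5].decode("utf-8")
except UnicodeDecodeError:
    print("byte offset 5 is inside a multi-byte character")
```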

@mar4th3
Author

mar4th3 commented Sep 2, 2024

Thank you for the reply! I understand that that's the reason, but is there any workaround?

Or is this a limitation of hyperscan? In the sense that you cannot get exact offsets with UTF-8.

@betterlch

Try adding the HS_FLAG_UTF8 flag:

expressions = ("test.+",)
db = hyperscan.Database()
db.compile(
    expressions=[e.encode("utf-8") for e in expressions],
    flags=[hyperscan.HS_FLAG_UTF8],
)

@LucianoBAF

LucianoBAF commented Jan 24, 2025

I'm facing the same issue. Adding the UTF-8 flag does not solve it: the matches returned by db.scan() still come back with wrong indexes once a non-ASCII character has been encountered.

For instance, encoding my_string="österreich" with bytes(my_string, 'utf-8') or my_string.encode('utf-8') results in b'\xc3\xb6sterreich', which is one byte longer than the number of characters in the original text. Because of this, every hyperscan match offset after the "ö" is shifted one position to the right.

The problem gets worse with kanji (Chinese characters), katakana or hiragana (Japanese characters), which take 3 bytes each when encoded, so the match offsets are off by 2 more for every such character encountered.

Looks like a bug that should be addressed by the internal processing of the HS_FLAG_UTF8 flag.
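
One possible workaround, sketched on the assumption that the offsets passed to the match handler are byte offsets into the encoded buffer (which is consistent with every number quoted in this thread): translate them back to character offsets after the scan. The helper names below (build_char_starts, byte_to_char_offset) are hypothetical and not part of the hyperscan package. The sketch uses only the standard library.

```python
from bisect import bisect_right


def build_char_starts(text: str) -> list[int]:
    """Byte offset at which each character starts, plus the total byte length."""
    starts = []
    pos = 0
    for ch in text:
        starts.append(pos)
        pos += len(ch.encode("utf-8"))
    starts.append(pos)  # sentinel: offset just past the last character
    return starts


def byte_to_char_offset(starts: list[int], byte_offset: int) -> int:
    """Number of characters that begin at or before byte_offset, minus one.

    For an offset on a character boundary this is the exact character offset.
    For an offset inside a multi-byte character it rounds down to the start
    of that character.
    """
    return bisect_right(starts, byte_offset) - 1


starts = build_char_starts("test®")
print(byte_to_char_offset(starts, 6))  # 5, which equals len("test®")
print(byte_to_char_offset(starts, 5))  # 4, rounded down from inside "®"
```

Building the table once per input costs one pass over the text, and each lookup is then a binary search, so this should stay cheap even when a scan reports many matches. In the original example, the handler could keep appending the raw end values and the conversion could be applied to the matches list after db.scan() returns.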
