How to handle input with characters having more than one byte in UTF-8 #154
Comments
Because Hyperscan reports match offsets in bytes, not in characters.
Thank you for the reply! I understand that that's the reason, but is there any workaround? Or is this a limitation of Hyperscan, in the sense that you cannot get exact character offsets with UTF-8 input?
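For context, the size mismatch is visible in plain Python, without Hyperscan involved at all; a quick illustration using the "test®" string from the report:

```python
# "®" (U+00AE) encodes to two bytes in UTF-8, so the byte length of the
# input is larger than its character length.
text = "test®"
print(len(text))                   # 5 characters
print(len(text.encode("utf-8")))   # 6 bytes -- Hyperscan's offsets count these
```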
Try adding the HS_FLAG_UTF8 flag.
I'm facing the same issue. Adding the UTF-8 flag does not solve it, and the matches returned are still byte offsets rather than character offsets. For instance, if the input contains a two-byte character such as "®", every offset after it is shifted by one.

The problem gets worse if it is a kanji (a Chinese character), katakana, or hiragana (Japanese characters), each of which encodes to 3 bytes, so the match indexes are misplaced by 2 for every such character encountered.

Looks like a bug that should be addressed by the internal processing of the HS_FLAG_UTF8 flag.
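Since the reported positions are byte offsets into the encoded input, one possible workaround (a sketch, not something the library provides) is to translate each byte offset back to a character offset by decoding the UTF-8 prefix that precedes it:

```python
# Sketch: map a byte offset (as reported by Hyperscan) back to a character
# offset by counting the characters in the decoded prefix. This assumes the
# offset falls on a character boundary, which match boundaries normally do
# when the pattern is compiled with HS_FLAG_UTF8 and the input is valid UTF-8.
def byte_to_char_offset(data: bytes, byte_offset: int) -> int:
    return len(data[:byte_offset].decode("utf-8"))

data = "test®".encode("utf-8")       # 6 bytes for 5 characters
print(byte_to_char_offset(data, 6))  # 5 == len("test®")

data = "漢字".encode("utf-8")         # each kanji encodes to 3 bytes
print(byte_to_char_offset(data, 6))  # 2
```

Decoding the prefix for every match is linear in the offset; for long inputs with many matches it can be cheaper to precompute a byte-index-to-character-index table once per input.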
Hi,
first of all thank you for this amazing library.
While playing around with it I stumbled upon this issue.
When matching on strings containing characters that UTF-8 encodes as more than one byte, the end offset is wrong.
See for instance this example:
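A minimal sketch of such a test, assuming the python-hyperscan bindings (the setup and callback below are illustrative, not the exact code from the report):

```python
# Assumed python-hyperscan API: scan "test®" for the literal pattern "test®"
# and print the offsets Hyperscan reports for the match.
import hyperscan

def on_match(pat_id, start, end, flags, context=None):
    # start/end are positions in the UTF-8 byte stream, not character indexes
    print(pat_id, start, end)

db = hyperscan.Database()
db.compile(
    expressions=["test®".encode("utf-8")],
    ids=[0],
    elements=1,
    flags=[hyperscan.HS_FLAG_UTF8],
)
db.scan("test®".encode("utf-8"), match_event_handler=on_match)
# prints: 0 0 6  -- the end offset 6 is a byte count, not a character count
```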
The highest end offset is `6`, but `len("test®")` is `5`. Is there any workaround to this? Am I misunderstanding something?
Thank you!