Undefined behaviour using `.` matcher on unicode #375
bend-n changed the title from "Undefined behaviour on unicode" to "Undefined behaviour using `.` matcher on unicode" on Feb 14, 2024
I was debugging a similar issue recently, and I found the problem.
logos/logos-codegen/src/graph/regex.rs, lines 163 to 182 in ba69cc3
These two checks assert that a max size range should always take the ASCII fast path, but this is wildly incorrect for non-ASCII text. With a local clone that correctly checks that |
RustyYato added a commit to RustyYato/logos that referenced this issue on Feb 16, 2024
see maciejhirsz#375 for an example of undefined behavior because of this fast path. TLDR: the ASCII fast path will stop matching on the first matching byte, however this would split multi-byte codepoints. Combined with `Lexer::remaining` (or even just capturing the string like in the issue), this leads to non-utf8 strings escaping into user code. This is UNSOUND.
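The failure mode described in this commit message can be reproduced without logos. The sketch below is not the logos codegen; `naive_dot_len` and `utf8_dot_len` are hypothetical helpers that only model "how many bytes does `.` consume", assuming the usual regex convention that `.` does not match `\n`. The input starts with U+E000, whose UTF-8 encoding is `EE 80 80`, so a one-byte match ends inside the codepoint.

```rust
/// Hypothetical model of the ASCII fast path: `.` consumes exactly one byte.
fn naive_dot_len(s: &str) -> usize {
    match s.as_bytes().first() {
        Some(&b) if b != b'\n' => 1,
        _ => 0,
    }
}

/// Hypothetical model of a UTF-8 aware `.`: consume one whole codepoint.
fn utf8_dot_len(s: &str) -> usize {
    s.chars()
        .next()
        .filter(|c| *c != '\n')
        .map_or(0, char::len_utf8)
}

fn main() {
    // U+E000 is encoded as the three bytes EE 80 80.
    let input = "\u{E000}abc";
    assert_eq!(input.as_bytes()[0], 0xEE);

    // The one-byte match ends in the middle of the codepoint.
    let bad_end = naive_dot_len(input);
    assert!(!input.is_char_boundary(bad_end));

    // Matching a whole codepoint ends on a valid boundary, so slicing is safe.
    let good_end = utf8_dot_len(input);
    assert!(input.is_char_boundary(good_end));
    assert_eq!(&input[..good_end], "\u{E000}");
}
```

Safe slicing (`&input[..bad_end]`) would panic at the bad offset. The problem reported here arises when a slice at such an offset is produced without that check and handed to user code as a `&str`.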
Merged
jeertmans added a commit that referenced this issue on Feb 16, 2024
* The `.` regex should not take the ASCII fast path

  see #375 for an example of undefined behavior because of this fast path. TLDR: the ASCII fast path will stop matching on the first matching byte, however this would split multi-byte codepoints. Combined with `Lexer::remaining` (or even just capturing the string like in the issue), this leads to non-utf8 strings escaping into user code. This is UNSOUND.
* Add tests for unicode dot in both str and bytes
* chore(lib): rewrite using ClassUnicode methods

  As suggested by @RustyYato
* Revert "chore(lib): rewrite using ClassUnicode methods"

  This reverts commit 80bd23f.
* try: fallback to previous impl

  Tests are still passing
* try: add repetition check
* chore(lib): cleanup code
* fix and move
* another fix

---------

Co-authored-by: Jérome Eertmans <[email protected]>
execution of this provides
which is occurring as logos is providing a string with bytes `0xee`, which is clearly invalid UTF-8, and logos is committing library UB by slicing in the middle of a Unicode codepoint.
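To see why a string ending in a lone `0xee` byte cannot be valid UTF-8, the bytes can be run through the standard library's checked conversion. This is a minimal sketch using only `std::str::from_utf8`; `checked_slice` is a hypothetical helper name, and the byte values are chosen for illustration rather than taken from the original report.

```rust
use std::str;

/// Hypothetical helper: returns the text if `bytes` is valid UTF-8,
/// otherwise the number of leading bytes that were valid.
fn checked_slice(bytes: &[u8]) -> Result<&str, usize> {
    str::from_utf8(bytes).map_err(|e| e.valid_up_to())
}

fn main() {
    // 'a' followed by U+E000 (EE 80 80).
    let whole = [b'a', 0xEE, 0x80, 0x80];
    assert_eq!(checked_slice(&whole), Ok("a\u{E000}"));

    // Cutting after the 0xEE lead byte leaves an incomplete sequence.
    let split = &whole[..2];
    assert_eq!(checked_slice(split), Err(1));
}
```

A `&str` is required to hold valid UTF-8, so a value like `split` must never be exposed as a `&str` without this kind of check.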