Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Undefined behaviour using . matcher on unicode #375

Closed
bend-n opened this issue Feb 14, 2024 · 1 comment · Fixed by #376
Closed

Undefined behaviour using . matcher on unicode #375

bend-n opened this issue Feb 14, 2024 · 1 comment · Fixed by #376

Comments

@bend-n
Copy link

bend-n commented Feb 14, 2024

use logos::Logos;

fn dec(tags: &str) {
    #[derive(logos::Logos, PartialEq, Debug)]
    #[logos(skip r"[\s\n]+")]
    enum Tokens<'s> {
        #[regex(r".", priority = 6)] // causes a split of bytes ee, a1, 93 to just ee
        String(&'s str),
    }
    let lexer = Tokens::lexer(&tags).map(Result::unwrap);
    for Tokens::String(x) in lexer {
        println!("{:?}", x.as_bytes());
        for c in x.chars() {} // << bad
    }
}

#[test]
fn dect() {
    dec("");
}

execution of this provides

running 1 test
test dect ... error: Undefined Behavior: entering unreachable code
  --> /home/os/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/validations.rs:49:23
   |
49 |     let y = unsafe { *bytes.next().unwrap_unchecked() };
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ entering unreachable code
   |
   = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
   = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
   = note: BACKTRACE:
   = note: inside `core::str::validations::next_code_point::<'_, std::slice::Iter<'_, u8>>` at /home/os/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/validations.rs:49:23: 49:54
   = note: inside `<std::str::Chars<'_> as std::iter::Iterator>::next` at /home/os/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/iter.rs:45:18: 45:49
note: inside `dec`
  --> src/main.rs:13:18
   |
13 |         for c in x.chars() {} // << bad
   |                  ^^^^^^^^^
note: inside `dect`
  --> src/main.rs:19:5
   |
19 |     dec("");
   |     ^^^^^^^^
note: inside closure
  --> src/main.rs:18:10
   |
17 | #[test]
   | ------- in this procedural macro expansion
18 | fn dect() {
   |          ^
   = note: this error originates in the attribute macro `test` (in Nightly builds, run with -Z macro-backtrace for more info)

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

error: aborting due to 1 previous error; 1 warning emitted

error: test failed, to rerun pass `--bin logo`

which is occuring as logos is providing a string with bytes 0xee, which is clearly invalid utf8, and logos is committing library-ub by slicing a unicode bound.

@bend-n bend-n changed the title Undefined behaviour on unicode Undefined behaviour using . matcher on unicode Feb 14, 2024
@RustyYato
Copy link
Contributor

I was debugging a similar issue recently, and I found the problem.

fn is_ascii(class: &ClassUnicode) -> bool {
class.iter().all(|range| {
let start = range.start() as u32;
let end = range.end() as u32;
start < 128 && (end < 128 || end == 0x0010_FFFF)
})
}
fn is_one_ascii(class: &ClassUnicode) -> bool {
if class.ranges().len() != 1 {
return false;
}
let range = &class.ranges()[0];
let start = range.start() as u32;
let end = range.end() as u32;
start < 128 && (end < 128 || end == 0x0010_FFFF)
}

These two checks assert that a max size range always should take the ASCII fast path, but this is wildly incorrect for non-ASCII text. With a local clone that correctly checks that end < 128 only, fixes the issue.

RustyYato added a commit to RustyYato/logos that referenced this issue Feb 16, 2024
see maciejhirsz#375 for an example of undefined behavior because of this fast path.

TLDR: the ASCII fast path will stop matching on the first matching byte,
however this would split multi-byte codepoints. Combined with
`Lexer::remaining` (or even just capturing the string like in the issue),
this leads to non-utf8 strings escaping into user code. This is UNSOUND.
jeertmans added a commit that referenced this issue Feb 16, 2024
* The `.` regex should not take the ASCII fast path

see #375 for an example of undefined behavior because of this fast path.

TLDR: the ASCII fast path will stop matching on the first matching byte,
however this would split multi-byte codepoints. Combined with
`Lexer::remaining` (or even just capturing the string like in the issue),
this leads to non-utf8 strings escaping into user code. This is UNSOUND.

* Add tests for unicode dot in both str and bytes

* chore(lib): rewrite using ClassUnicode methods

As suggested by @RustyYato

* Revert "chore(lib): rewrite using ClassUnicode methods"

This reverts commit 80bd23f.

* try: fallback to previous impl

Tests are still passing

* try: add repetition check

* chore(lib): cleanup code

* fix and move

* another fix

---------

Co-authored-by: Jérome Eertmans <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants