Undefined behaviour using `.` matcher on unicode #375

bend-n · 2024-02-14T01:36:58Z

use logos::Logos;

fn dec(tags: &str) {
    #[derive(logos::Logos, PartialEq, Debug)]
    #[logos(skip r"[\s\n]+")]
    enum Tokens<'s> {
        #[regex(r".", priority = 6)] // causes a split of bytes ee, a1, 93 to just ee
        String(&'s str),
    }
    let lexer = Tokens::lexer(&tags).map(Result::unwrap);
    for Tokens::String(x) in lexer {
        println!("{:?}", x.as_bytes());
        for c in x.chars() {} // << bad
    }
}

#[test]
fn dect() {
    dec("");
}

execution of this provides

running 1 test
test dect ... error: Undefined Behavior: entering unreachable code
  --> /home/os/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/validations.rs:49:23
   |
49 |     let y = unsafe { *bytes.next().unwrap_unchecked() };
   |                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ entering unreachable code
   |
   = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
   = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
   = note: BACKTRACE:
   = note: inside `core::str::validations::next_code_point::<'_, std::slice::Iter<'_, u8>>` at /home/os/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/validations.rs:49:23: 49:54
   = note: inside `<std::str::Chars<'_> as std::iter::Iterator>::next` at /home/os/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/str/iter.rs:45:18: 45:49
note: inside `dec`
  --> src/main.rs:13:18
   |
13 |         for c in x.chars() {} // << bad
   |                  ^^^^^^^^^
note: inside `dect`
  --> src/main.rs:19:5
   |
19 |     dec("");
   |     ^^^^^^^^
note: inside closure
  --> src/main.rs:18:10
   |
17 | #[test]
   | ------- in this procedural macro expansion
18 | fn dect() {
   |          ^
   = note: this error originates in the attribute macro `test` (in Nightly builds, run with -Z macro-backtrace for more info)

note: some details are omitted, run with `MIRIFLAGS=-Zmiri-backtrace=full` for a verbose backtrace

error: aborting due to 1 previous error; 1 warning emitted

error: test failed, to rerun pass `--bin logo`

which is occuring as logos is providing a string with bytes 0xee, which is clearly invalid utf8, and logos is committing library-ub by slicing a unicode bound.

The text was updated successfully, but these errors were encountered:

RustyYato · 2024-02-16T02:09:05Z

I was debugging a similar issue recently, and I found the problem.

logos/logos-codegen/src/graph/regex.rs

Lines 163 to 182 in ba69cc3

    
           fn is_ascii(class: &ClassUnicode) -> bool { 
        
               class.iter().all(|range| { 
        
                   let start = range.start() as u32; 
        
                   let end = range.end() as u32; 
        
                   start < 128 && (end < 128 || end == 0x0010_FFFF) 
        
               }) 
        
           } 
        
           fn is_one_ascii(class: &ClassUnicode) -> bool { 
        
               if class.ranges().len() != 1 { 
        
                   return false; 
        
               } 
        
               let range = &class.ranges()[0]; 
        
               let start = range.start() as u32; 
        
               let end = range.end() as u32; 
        
               start < 128 && (end < 128 || end == 0x0010_FFFF) 
        
           }

These two checks assert that a max size range always should take the ASCII fast path, but this is wildly incorrect for non-ASCII text. With a local clone that correctly checks that end < 128 only, fixes the issue.

see maciejhirsz#375 for an example of undefined behavior because of this fast path. TLDR: the ASCII fast path will stop matching on the first matching byte, however this would split multi-byte codepoints. Combined with `Lexer::remaining` (or even just capturing the string like in the issue), this leads to non-utf8 strings escaping into user code. This is UNSOUND.

@RustyYato

* The `.` regex should not take the ASCII fast path see #375 for an example of undefined behavior because of this fast path. TLDR: the ASCII fast path will stop matching on the first matching byte, however this would split multi-byte codepoints. Combined with `Lexer::remaining` (or even just capturing the string like in the issue), this leads to non-utf8 strings escaping into user code. This is UNSOUND. * Add tests for unicode dot in both str and bytes * chore(lib): rewrite using ClassUnicode methods As suggested by @RustyYato * Revert "chore(lib): rewrite using ClassUnicode methods" This reverts commit 80bd23f. * try: fallback to previous impl Tests are still passing * try: add repetition check * chore(lib): cleanup code * fix and move * another fix --------- Co-authored-by: Jérome Eertmans <[email protected]>

bend-n changed the title ~~Undefined behaviour on unicode~~ Undefined behaviour using . matcher on unicode Feb 14, 2024

RustyYato mentioned this issue Feb 16, 2024

Fix unicode dot #376

Merged

jeertmans closed this as completed in #376 Feb 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Undefined behaviour using `.` matcher on unicode #375

Undefined behaviour using `.` matcher on unicode #375

bend-n commented Feb 14, 2024 •

edited

Loading

RustyYato commented Feb 16, 2024

Undefined behaviour using . matcher on unicode #375

Undefined behaviour using . matcher on unicode #375

Comments

bend-n commented Feb 14, 2024 • edited Loading

RustyYato commented Feb 16, 2024

Undefined behaviour using `.` matcher on unicode #375

Undefined behaviour using `.` matcher on unicode #375

bend-n commented Feb 14, 2024 •

edited

Loading