Proper grapheme truncation for windows-style newline #61

alexrutar · 2024-11-19T15:31:57Z

Another take at #58 but now with fewer changes: when truncating graphemes, properly map \r\n to \n so that newlines can be detected in the match objects.

pascalkuthe · 2024-12-07T16:44:43Z

this is not really sufficient unfortunately. In the all-ASCII case, we don't do any grapheme segmentation at all. I think we likely need to add some special handling everywhere we do str.is_ascii() to also check that !str.contains('\r')
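A minimal sketch of the guard suggested above, assuming a free-standing helper. The name `can_use_ascii_fast_path` is hypothetical and is not part of nucleo; it only illustrates combining the two checks so that ASCII text containing `\r` falls through to the grapheme-aware path.

```rust
/// Hypothetical helper (not a nucleo API): returns true only when the
/// byte-oriented ASCII fast path is safe, i.e. the text is pure ASCII
/// and contains no carriage return that could be part of a `\r\n` grapheme.
fn can_use_ascii_fast_path(text: &str) -> bool {
    text.is_ascii() && !text.contains('\r')
}

fn main() {
    // Plain ASCII with a Unix newline: fast path is fine.
    assert!(can_use_ascii_fast_path("hello\nworld"));
    // ASCII, but with a Windows newline: needs grapheme handling.
    assert!(!can_use_ascii_fast_path("hello\r\nworld"));
    // Non-ASCII text never takes the fast path.
    assert!(!can_use_ascii_fast_path("h\u{00FC}llo"));
}
```

Checking for a bare `'\r'` rather than the pair `"\r\n"` is the more conservative choice taken from the comment above; it also sends lone carriage returns down the slower path.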

pascalkuthe · 2024-12-07T16:45:08Z

also would like to see test cases and a comment explaining the normalization

alexrutar · 2024-12-07T18:10:09Z

@pascalkuthe

I'm trying to understand the buf.iter().all(|c| c.is_ascii()) here:

nucleo/matcher/src/utf32_str.rs

Lines 44 to 55 in 474edc7

    
    pub fn new(str: &'a str, buf: &'a mut Vec<char>) -> Self {
        if str.is_ascii() {
            Utf32Str::Ascii(str.as_bytes())
        } else {
            buf.clear();
            buf.extend(crate::chars::graphemes(str));
            if buf.iter().all(|c| c.is_ascii()) {
                return Utf32Str::Ascii(str.as_bytes());
            }
            Utf32Str::Unicode(&*buf)
        }
    }

For example, if I enter ü as the pair u\u{0308} (i.e. using 'Combining Diaeresis'), then Utf32Str::new("u\u{0308}") will implicitly drop the diaeresis entirely? I.e.

Utf32Str::new("u\u{0308}") == Utf32Str::Ascii(b"u")

Is this an oversight or intended behaviour?

alexrutar · 2024-12-07T18:14:21Z

I guess the more precise version of this question is what is the proper interpretation of Utf32Str::Ascii? Should I think of it as Utf32Str::Unicode (i.e. as a Vec<char>) except with better cache locality, which means that we can just replace char -> u8 "for free"?

If this is the case (and it makes sense to me in terms of how I understand the internal matcher implementation, i.e. using optimized byte methods on ASCII-only buffers), I think the is_ascii function is extremely unclear since

assert!(Utf32Str::new("u\u{0308}").is_ascii())

pascalkuthe · 2024-12-07T18:36:54Z

> Is this an oversight or intended behaviour?

it's the intended behaviour. Nucleo cannot match full graphemes so we reduce all graphemes to a single char (the first one because that is usually close enough for matching like in this case). I think this specific if condition is a bug though, left over from earlier versions where things worked differently.

> I guess the more precise version of this question is what is the proper interpretation of Utf32Str::Ascii? Should I think of it as Utf32Str::Unicode (i.e. as a Vec<char>) except with better cache locality, which means that we can just replace char -> u8 "for free"?

yes it's entirely an optimization. Matching is much faster in the ASCII case (and generally nucleo is optimized under the assumption that non-ASCII text is rare). That is also the reason for the if condition here.

> assert!(Utf32Str::new("u\u{0308}").is_ascii())

for the most part this method is just a utility function to check for the ascii variant (and whether the corresponding fast paths apply). Indeed it's possible that you end up with a ::Unicode variant made up entirely of ascii chars, so it may be a bit confusing. At the end of the day this type is not meant as a general purpose string type and is only meant to be useful within the matcher and as an input to the matcher.
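To illustrate the point about `is_ascii` reporting the variant rather than the contents, here is a small sketch using a stand-in enum. `SketchView` is hypothetical and only mirrors the two variant names mentioned in this thread; it is not the real `Utf32Str` definition.

```rust
/// Hypothetical stand-in for the matcher's string view (not the nucleo type).
#[derive(Debug, PartialEq)]
enum SketchView<'a> {
    /// Byte storage, eligible for the ASCII fast paths.
    Ascii(&'a [u8]),
    /// One `char` per (truncated) grapheme.
    Unicode(&'a [char]),
}

impl SketchView<'_> {
    /// Checks which variant this is; it does not inspect the stored chars.
    fn is_ascii(&self) -> bool {
        matches!(self, SketchView::Ascii(_))
    }
}

fn main() {
    let chars = ['a', 'b'];
    // A Unicode variant whose chars all happen to be ASCII.
    let unicode = SketchView::Unicode(&chars);
    assert!(!unicode.is_ascii());
    // The same text stored in the Ascii variant.
    let ascii = SketchView::Ascii(b"ab");
    assert!(ascii.is_ascii());
}
```

Under this reading, `is_ascii()` returning `false` says only that the fast path is not in use, not that the text contains non-ASCII characters.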

This commit corrects the internal handling of grapheme truncation. Most notably, it fixes two bugs with the previous implementation of Utf32Str(ing):

1. Fixes a bug where an Ascii variant could have been returned even though the original string was not ASCII. (The converse, where a Unicode variant consists only of ASCII, is totally fine.)
2. Fixes the handling of Windows-style newlines (i.e. `\r\n`), since these are single graphemes. Moreover, the `\r\n` grapheme is now mapped to `\n` rather than `\r`. In particular, Utf32Str(ing)s constructed from text containing Windows-style newlines will result in Unicode variants, even if the string is entirely valid ASCII.
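The two fixes described in the commit message can be sketched roughly as follows. This is not the merged code: `SketchUtf32`, `truncate_graphemes`, and `new_sketch` are hypothetical names, and the truncation step is a crude approximation that only understands `\r\n` and the combining marks in U+0300..=U+036F, whereas the real implementation relies on proper Unicode grapheme segmentation.

```rust
/// Hypothetical owned stand-in for Utf32String (not the nucleo type).
#[derive(Debug, PartialEq)]
enum SketchUtf32 {
    Ascii(String),
    Unicode(Vec<char>),
}

/// Crude approximation of grapheme truncation (an assumption for this sketch):
/// `\r\n` becomes `\n`, and combining diacritical marks that follow a base
/// character are dropped, so each grapheme is reduced to a single char.
fn truncate_graphemes(text: &str) -> Vec<char> {
    let mut out = Vec::new();
    let mut chars = text.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\r' && chars.peek() == Some(&'\n') {
            chars.next(); // consume the '\n' half of the pair
            out.push('\n');
        } else if ('\u{0300}'..='\u{036F}').contains(&c) && !out.is_empty() {
            continue; // combining mark: the base char is already stored
        } else {
            out.push(c);
        }
    }
    out
}

/// Ascii only if the input itself is ASCII and has no '\r'; otherwise
/// always Unicode, even when truncation leaves only ASCII chars.
fn new_sketch(text: &str) -> SketchUtf32 {
    if text.is_ascii() && !text.contains('\r') {
        SketchUtf32::Ascii(text.to_owned())
    } else {
        SketchUtf32::Unicode(truncate_graphemes(text))
    }
}

fn main() {
    assert_eq!(new_sketch("abc"), SketchUtf32::Ascii("abc".to_owned()));
    assert_eq!(new_sketch("u\u{0308}"), SketchUtf32::Unicode(vec!['u']));
    assert_eq!(new_sketch("a\r\nb"), SketchUtf32::Unicode(vec!['a', '\n', 'b']));
}
```

Note that the sketch never re-checks the truncated buffer for ASCII-ness, which is the condition the thread identified as a leftover bug.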

alexrutar · 2024-12-07T20:40:21Z

Thanks, that makes sense! I've added tests and documentation. Most notably, I made an attempt to clarify the API guarantees made by the Utf32Str(ing) implementation, which I hope will be useful for downstream consumers. As far as I understand your earlier comment, the API guarantees are correct, but of course let me know if you have issues with the wording and I will fix it. I also added a number of examples to indicate the ways in which Utf32Str(ing) can behave very much unlike a true string type, which will hopefully help to clarify the situation for anybody else using these types.

Also, I fixed the implementation of Utf32Str::new to no longer construct an Ascii variant when grapheme truncation results in pure ASCII. (Of course, the converse situation, in which a Unicode variant consists only of ASCII, is totally fine.)

pascalkuthe · 2024-12-07T21:18:29Z

thanks!

alexrutar force-pushed the handle-windows-newline branch from 76ebd23 to 36d3747 Compare December 7, 2024 20:34

alexrutar force-pushed the handle-windows-newline branch from 36d3747 to 21fc173 Compare December 7, 2024 20:35

pascalkuthe approved these changes Dec 7, 2024

pascalkuthe merged commit d17e29a into helix-editor:master Dec 7, 2024
5 checks passed

alexrutar mentioned this pull request Dec 7, 2024

Grapheme handling issue with \r\n #58

Closed

alexrutar deleted the handle-windows-newline branch December 9, 2024 14:57
