Proper grapheme truncation for windows-style newline #61
This is not really sufficient, unfortunately. In the all-ASCII case, we don't do any grapheme segmentation at all. I think we likely need to add some special handling everywhere we do […]
I would also like to see test cases and a comment explaining the normalization.
I'm trying to understand the nucleo/matcher/src/utf32_str.rs Lines 44 to 55 in 474edc7
For example, if I enter ü as the pair […] Utf32Str::new("u\u{0308}") == Utf32Str::Ascii(b"u"). Is this an oversight or intended behaviour?
I guess the more precise version of this question is what is the proper interpretation of […] If this is the case (and it makes sense to me in terms of how I understand the internal matcher implementation, i.e. using optimized byte methods on ASCII-only buffers), I think the […] assert!(Utf32Str::new("u\u{0308}").is_ascii())
It's the intended behaviour. Nucleo cannot match full graphemes, so we reduce all graphemes to a single char (the first one, because that is usually close enough for matching, as in this case). I think this specific if condition is a bug though, left over from earlier versions where things worked differently.
Yes, it's entirely an optimization. Matching is much faster in the ASCII case (and generally nucleo is optimized under the assumption that non-ASCII text is rare). That is also the reason for the if condition here.
For the most part this method is just a utility function to check for the ASCII variant (and whether the corresponding fast paths apply). Indeed it's possible that you end up with a […]
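To make the "first char of each grapheme" reduction concrete, here is a minimal sketch. It is not nucleo's code: `is_combining_mark` and `first_chars` are hypothetical helpers, and the combining-mark check (the U+0300..=U+036F block only) is a deliberately crude stand-in for real grapheme segmentation.

```rust
/// Hypothetical helper: true for chars in the Combining Diacritical Marks
/// block (U+0300..=U+036F). This is an assumption made for illustration;
/// real grapheme segmentation follows far more rules than this.
fn is_combining_mark(c: char) -> bool {
    ('\u{0300}'..='\u{036F}').contains(&c)
}

/// Keep only the first char of each (approximate) grapheme.
fn first_chars(text: &str) -> Vec<char> {
    let mut out = Vec::new();
    for c in text.chars() {
        // A combining mark extends the previous grapheme, so it is dropped,
        // unless there is no previous char for it to attach to.
        if is_combining_mark(c) && !out.is_empty() {
            continue;
        }
        out.push(c);
    }
    out
}
```

With this sketch, `first_chars("u\u{0308}")` yields just `['u']`: the truncated result consists only of ASCII chars even though the input string itself is not ASCII, which is the situation the question above is about.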
Force-pushed from 76ebd23 to 36d3747.
This commit corrects the internal handling of grapheme truncation. Most notably, it fixes two bugs with the previous implementation of Utf32Str(ing):
1. Fixes a bug where an Ascii variant could have been returned even though the original string was not ASCII. (The converse, where a Unicode variant consists only of ASCII, is totally fine.)
2. Fixes the handling of windows-style newlines (i.e. `\r\n`), since these are single graphemes. Moreover, the `\r\n` grapheme is now mapped to `\n` rather than `\r`.
In particular, Utf32Str(ing)s constructed from text containing windows-style newlines will result in Unicode variants, even if the string is entirely valid ASCII.
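The variant selection described in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: `Repr` and `build` are hypothetical names standing in for Utf32Str(ing) and its constructor, and the slow path handles only the `\r\n` case, leaving out all other grapheme truncation.

```rust
/// Hypothetical stand-in for the two Utf32Str(ing) variants.
#[derive(Debug, PartialEq)]
enum Repr {
    Ascii(Vec<u8>),
    Unicode(Vec<char>),
}

/// Pick a variant for `text`, mapping each `\r\n` pair to a single `\n`.
fn build(text: &str) -> Repr {
    // The fast path applies only if every byte is ASCII and there is no
    // windows-style newline, since `\r\n` is a multi-char grapheme.
    if text.is_ascii() && !text.contains("\r\n") {
        return Repr::Ascii(text.as_bytes().to_vec());
    }
    let mut chars = Vec::new();
    let mut iter = text.chars().peekable();
    while let Some(c) = iter.next() {
        if c == '\r' && iter.peek() == Some(&'\n') {
            // Consume the `\n` and keep it instead of the `\r`.
            iter.next();
            chars.push('\n');
        } else {
            // Simplification: every other char is kept as-is here.
            chars.push(c);
        }
    }
    Repr::Unicode(chars)
}
```

In this sketch, `build("a\r\nb")` produces a Unicode variant holding `a`, `\n`, `b`, while `build("a\nb")` and a string with a lone `\r` stay on the ASCII path.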
Force-pushed from 36d3747 to 21fc173.
Thanks, that makes sense! I've added tests and documentation. Most notably, I made an attempt to clarify the API guarantees made by the […] Also, I fixed the implementation of […]
Thanks!
Another take at #58 but now with fewer changes: when truncating graphemes, properly map `\r\n` to `\n` so that newlines can be detected in the match objects.