Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper grapheme truncation for windows-style newline #61

Merged

Conversation

alexrutar
Copy link
Contributor

Another take at #58 but now with fewer changes: when truncating graphemes, properly map \r\n to \n so that newlines can be detected in the match objects.

@pascalkuthe
Copy link
Member

this is not really sufficient unfortunately. In the all-ascii case, we don't do any grapheme segmentation at all. I think we likely need to add some special handling everywhere we do str.is_ascii() to also check that !str.contain('\r')

@pascalkuthe
Copy link
Member

pascalkuthe commented Dec 7, 2024

also would like to see testcases and a comment explaining the normalization

@alexrutar
Copy link
Contributor Author

@pascalkuthe

I'm trying to understand the buf.iter().all(|c| c.is_ascii()) here:

pub fn new(str: &'a str, buf: &'a mut Vec<char>) -> Self {
if str.is_ascii() {
Utf32Str::Ascii(str.as_bytes())
} else {
buf.clear();
buf.extend(crate::chars::graphemes(str));
if buf.iter().all(|c| c.is_ascii()) {
return Utf32Str::Ascii(str.as_bytes());
}
Utf32Str::Unicode(&*buf)
}
}

For example, if I enter ü as the pair u\u{0308} (i.e. using 'Combining Diaeresis'), then Utf32Str::new("u\u{0308}") will implicitly drop the diaresis entirely? I.e.

Utf32Str::new("u\u{0308}") == Utf32Str::Ascii(b"u")

Is this an oversight or intended behaviour?

@alexrutar
Copy link
Contributor Author

alexrutar commented Dec 7, 2024

I guess the more precise version of this question is what is the proper interpretation of Utf32Str::Ascii? Should I think of it as Utf32Str::Unicode (i.e. as a Vec<char>) except with better cache locality, which means that we can just replace char -> u8 "for free"?

If this is the case (and it makes sense to me in terms of how I understand the internal matcher implementation, i.e. using optimized byte methods on ASCII-only buffers), I think the is_ascii function is extremely unclear since

assert!(Utf32Str::new("u\u{0308}").is_ascii())

@pascalkuthe
Copy link
Member

Is this an oversight or intended behaviour?

it's the intended behaviour. Nucleo cannot match full graphemes so we reduce all graphemes to a single char (the first one because that is usually close enough for matching like in this case). I think this specific if condition is a bug tough left over from earlier versions where things worked differently.

I guess the more precise version of this question is what is the proper interpretation of Utf32Str::Ascii? Should I think of it as Utf32Str::Unicode (i.e. as a Vec) except with better cache locality, which means that we can just replace char -> u8 "for free"?

yes it's entirely an optimization. Matching is much faster in the ASCII case (and generally nucle is optimized under the assumption that non-ascsii text is rare). That is also the reason for the if condition here.

assert!(Utf32Str::new("u\u{0308}").is_ascii())

for the most part this method is just a utility function to check for the ascii variant (and whether the corresponding fast paths apply). Indeed it's possible that you end up with a ::Unicode variant made up entirely of ascii chars so maybe a bit confusing. At the end of the day this type is not meant as a general purpose string type and only meant to be useful within the matcher and as an input to the matcher

@alexrutar alexrutar force-pushed the handle-windows-newline branch from 76ebd23 to 36d3747 Compare December 7, 2024 20:34
This commit corrects the internal handling of grapheme truncation. Most
notably, it fixes two bugs with the previous implementation of
Utf32Str(ing):

1. Fixes a bug where an Ascii variant could have been returned even
   though the original string was not ASCII. (The converse, where a
   Unicode variant consists only of ASCII, is totally fine).
2. Fixes the handling of windows-style newline (i.e. `\r\n`)
   since these are single graphemes. Moreover, the `\r\n` grapheme is
   now mapped to `\n` rather than `\r`. In particular, Utf32Str(ing)s
   constructed from text containing windows-style newlines will result
   in Unicode variants, even if the string is entirely valid Ascii.
@alexrutar alexrutar force-pushed the handle-windows-newline branch from 36d3747 to 21fc173 Compare December 7, 2024 20:35
@alexrutar
Copy link
Contributor Author

Thanks, that makes sense! I've added tests and documentation. Most notably, I made an attempt to clarify the API guarantees made by the Utf32Str(ing) implementation, which I hope will be useful for downstream consumers. As far as I understand your earlier comment, the API guarantees are correct, but of course let me know if you have issues with the wording and I will fix it. I also added a number of examples to indicate the ways in which Utf32Str(ing) can behave very much unlike a true string type, which will hopefully help to clarify the situation for anybody else using these types.

Also, I fixed the implementation of Utf32Str::new to no longer construct an Ascii variant when grapheme truncation results in pure Ascii. (Of course, the alternative situation in which a Unicode variant which consists only of ASCII is totally fine).

@pascalkuthe
Copy link
Member

thanks!

@pascalkuthe pascalkuthe merged commit d17e29a into helix-editor:master Dec 7, 2024
5 checks passed
@alexrutar alexrutar deleted the handle-windows-newline branch December 9, 2024 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants