Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conserve BOM and properly support UTF16 #6497

Merged
merged 7 commits into from
Apr 30, 2023
Merged

Conserve BOM and properly support UTF16 #6497

merged 7 commits into from
Apr 30, 2023

Conversation

Alexis-Lapierre
Copy link
Contributor

@Alexis-Lapierre Alexis-Lapierre commented Mar 31, 2023

Create an enum EncodingBom in helix-view/src/document.rs that represent the Document encoding with BOM information.

Previous behaviour:

$ printf '\xEF\xBB\xBFtest\n' > utf8
$ file utf8
utf8: Unicode text, UTF-8 (with BOM) text
$ hx utf8    # Only type ':x' inside helix
$ file utf8
utf8: ASCII text

New behaviour:

$ printf '\xEF\xBB\xBFtest\n' > utf8
$ file utf8
utf8: Unicode text, UTF-8 (with BOM) text
$ hx utf8    # Only type ':x' inside helix
$ file utf8
utf8: Unicode text, UTF-8 (with BOM) text

This close #6246.

I'm still new to Rust, so feedback to improve this pull request is appreciated.

Edit

After some work, I removed the enum EncodingBom and replaced it by a boolean value, as per the feedback from @archseer.

I added some more commits on top of it to support UTF16 on top of UTF8. I had to implement an enum Encoder to buffer the writes to UTF16 since encoding_rs does not support writing UTF16 files.

So this pull request now also close #6542 on top of #6246.

I tested my branch against UTF8 with and without BOM, and UTF16 LE and BE.

@kirawi
Copy link
Member

kirawi commented Mar 31, 2023

encoding_rs looks to also handle UTF-16LE and UTF-16BE: https://docs.rs/encoding_rs/latest/src/encoding_rs/lib.rs.html#3757

@pascalkuthe
Copy link
Member

yeah:

    /// Performs non-incremental BOM sniffing.
    ///
    /// The argument must either be a buffer representing the entire input
    /// stream (non-streaming case) or a buffer representing at least the first
    /// three bytes of the input stream (streaming case).
    ///
    /// Returns `Some((UTF_8, 3))`, `Some((UTF_16LE, 2))` or
    /// `Some((UTF_16BE, 2))` if the argument starts with the UTF-8, UTF-16LE
    /// or UTF-16BE BOM or `None` otherwise.
    ///
    /// Available via the C wrapper.
    #[inline]
    pub fn for_bom(buffer: &[u8]) -> Option<(&'static Encoding, usize)> {
        if buffer.starts_with(b"\xEF\xBB\xBF") {
            Some((UTF_8, 3))
        } else if buffer.starts_with(b"\xFF\xFE") {
            Some((UTF_16LE, 2))
        } else if buffer.starts_with(b"\xFE\xFF") {
            Some((UTF_16BE, 2))
        } else {
            None
        }
    }
    ```

@Alexis-Lapierre
Copy link
Contributor Author

Alexis-Lapierre commented Mar 31, 2023

At the start I implemented UTF16 support with this merge request, but I reverted it (8e38adf) because editing UTF16 on master mess up the file.

Here is the behaviour I have on master:

$ hx --version
helix 23.03 (3cf03723)
$ file /tmp/utf16.txt
/tmp/utf16.txt: Unicode text, UTF-16, little-endian text
$ hx /tmp/utf16.txt # Only type ':x' inside the editor
$ file /tmp/utf16.txt
/tmp/utf16.txt: ISO-8859 text

utf16.txt was downloaded from https://github.com/stain/encoding-test-files/blob/master/utf16.txt.

If it is reproducible on more machines, we should probably create an issue for that

@Alexis-Lapierre
Copy link
Contributor Author

Alexis-Lapierre commented Apr 1, 2023

At the start I implemented UTF16 support with this merge request, but I reverted it (8e38adf) because editing UTF16 on master mess up the file.

Here is the behaviour I have on master:

$ hx --version
helix 23.03 (3cf03723)
$ file /tmp/utf16.txt
/tmp/utf16.txt: Unicode text, UTF-16, little-endian text
$ hx /tmp/utf16.txt # Only type ':x' inside the editor
$ file /tmp/utf16.txt
/tmp/utf16.txt: ISO-8859 text

utf16.txt was downloaded from https://github.com/stain/encoding-test-files/blob/master/utf16.txt.

If it is reproducible on more machines, we should probably create an issue for that

I created #6542 to explain in more details the problem I encountered while writing this pull request for UTF16

@kirawi
Copy link
Member

kirawi commented Apr 1, 2023

encoding_rs only supports decoding of UTF-16: hsivonen/encoding_rs#31 (comment)

@pascalkuthe
Copy link
Member

encoding_rs only supports decoding of UTF-16: hsivonen/encoding_rs#31 (comment)

to be fair std supports encoding utf-16 so it shouldn't be too hard to roll our encoder: hsivonen/encoding_rs#31 (comment) (can be done more efficiently and would look different in our case, simply iterating the doc chars and converting each to utf16 should work decently enough)

@kirawi kirawi mentioned this pull request Apr 2, 2023
helix-view/src/document.rs Outdated Show resolved Hide resolved
helix-view/src/document.rs Outdated Show resolved Hide resolved
@Alexis-Lapierre
Copy link
Contributor Author

NB: I'm waiting for #6549 to be merged to remove aacf4e3 from this merge request, so that helix support utf8 and utf16 bom

@Alexis-Lapierre Alexis-Lapierre changed the title UTF8 conserve BOM: No longer overwrite BOM byte if present on save UTF conserve BOM: No longer overwrite BOM byte if present on save Apr 9, 2023
@Alexis-Lapierre Alexis-Lapierre changed the title UTF conserve BOM: No longer overwrite BOM byte if present on save UTF Conserve BOM bytes and properly support UTF16 Apr 9, 2023
@Alexis-Lapierre Alexis-Lapierre requested a review from archseer April 9, 2023 20:27
@Alexis-Lapierre
Copy link
Contributor Author

NB: I'm waiting for #6549 to be merged to remove aacf4e3 from this merge request, so that helix support utf8 and utf16 bom

I implemented the proper way to save files in UTF16 with buffering in this commit: 642d9e7 .

pascalkuthe
pascalkuthe previously approved these changes Apr 12, 2023
Copy link
Member

@pascalkuthe pascalkuthe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. The utf16 encoding implementation is exactly what I had in mind and endsup less disruptive to the codebase (while being faster) and feels more inline with what is actually changed (adding an.improved encoder instead of adding special cases to the code that uses the encoder). This could use some integration tests tough both for the bom and encoding handeling. This kind of stuff should ne pretty easy to write tests for. I don't have a lot of utf-16 text on hand to test this manually :D

@pascalkuthe pascalkuthe added C-enhancement Category: Improvements E-medium Call for participation: Experience needed to fix: Medium / intermediate S-waiting-on-review Status: Awaiting review from a maintainer. labels Apr 13, 2023
@Alexis-Lapierre
Copy link
Contributor Author

This LGTM. The utf16 encoding implementation is exactly what I had in mind and endsup less disruptive to the codebase (while being faster) and feels more inline with what is actually changed (adding an.improved encoder instead of adding special cases to the code that uses the encoder). This could use some integration tests tough both for the bom and encoding handeling. This kind of stuff should ne pretty easy to write tests for. I don't have a lot of utf-16 text on hand to test this manually :D

Thank you for your comment, I am waiting on the second reviewer to see if I need to modify anything in my pull request.
I think I'll make a follow up pull request adding the unit tests after this pr.

@pascalkuthe
Copy link
Member

pascalkuthe commented Apr 17, 2023

We usually keep tests in the same PR if they are not too complex so I would prefer if you add them in this PR. The tests should just be simple integration tests (in helix-term/src/tests) that open an utf16 file with the editor and check that the text of the active document is equal to what is expected and that the file remains unchanged after saving (you can do versions with and without BOM) so even if something about the code changes these tests would not interact with your API so the risk of duplicate work is fairly small IMO.

@Alexis-Lapierre
Copy link
Contributor Author

OK thanks for the feedback. Expect an update on this Pull Request during this week.

@Alexis-Lapierre
Copy link
Contributor Author

I've added 1ec6cdd

The commit add a simple integration test that repeat the steps to reproduce #6542 and #6246.

Copy link
Member

@pascalkuthe pascalkuthe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding those tests! LGTM now altough we will likely wait until after the bugfix release to merge

@pascalkuthe pascalkuthe added this to the 23.03.1 milestone Apr 25, 2023
@kekePower
Copy link

This looks like it's working as expected. Thanks for you efforts!

% file *txt
utf16-helix-01.txt: Unicode text, UTF-16, little-endian text
utf16-nvim-01.txt:  Unicode text, UTF-16, little-endian text
utf16.txt:          Unicode text, UTF-16, little-endian text

I opened utx16.txt and wrote the file to disk using helix 23.03 with this patch and neovim from git master.

@pascalkuthe pascalkuthe merged commit b0b3f45 into helix-editor:master Apr 30, 2023
@pascalkuthe pascalkuthe changed the title UTF Conserve BOM bytes and properly support UTF16 Conserve BOM and properly support UTF16 Apr 30, 2023
@Alexis-Lapierre Alexis-Lapierre deleted the conserve_bom_byte branch May 1, 2023 16:48
Triton171 pushed a commit to Triton171/helix that referenced this pull request Jun 18, 2023
wes-adams pushed a commit to wes-adams/helix that referenced this pull request Jul 4, 2023
smortime pushed a commit to smortime/helix that referenced this pull request Jul 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-enhancement Category: Improvements E-medium Call for participation: Experience needed to fix: Medium / intermediate S-waiting-on-review Status: Awaiting review from a maintainer.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

UTF16: reading file is fine, but writing file is garbage Helix doesn't maintain BOM
6 participants