U+FFFD Substitution of Maximal Subparts #7

Closed · Moosieus opened this issue Oct 8, 2023 · 9 comments · Fixed by #8

Moosieus (Collaborator) commented Oct 8, 2023

Background on the issue:

At the time of writing, U+FFFD substitution isn't built into OTP. I've got an initial "passable" solution at github.com/Moosieus/UniRecover.
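
For illustration: a binary with one stray invalid byte fails String.valid?/1 outright, and the goal is to repair it rather than reject it. (The expected output below assumes UniRecover's sub/1 behaves as described later in this thread.)

iex> String.valid?(<<"ab", 0xFF, "cd">>)
false
iex> UniRecover.sub(<<"ab", 0xFF, "cd">>)
"ab�cd"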

@kipcole9 You floated the idea of rolling the above functionality into here, and I think that's appropriate:

  • The solution's more likely to be maintained here.
  • Its functionality is highly relevant to what's already present.
  • I'm not a fan of is-even-esque "micro-packages", for lack of a better term, and I consider UniRecover to be one.
  • My gut intuition is that including it here would be right-sized, so to speak.

As to next steps - I dunno. My current action items in UniRecover are broadly:

  • Work-shopping a better implementation.
  • Getting some comprehensive tests written.
  • Perhaps implementing more appropriate performance testing (albeit that's a complicated subject).

Thoughts?

kipcole9 (Collaborator) commented Oct 8, 2023

@Moosieus, happy to integrate in this library.

I think you're selling your current implementation short - it's already performant and memory-efficient.

I propose:

  • Integrate your code from UniRecover into Unicode.
  • Review the function name and public API.
  • Add tests and docs.
  • Ship it.

I will add you as a collaborator on the library so you can push changes without me slowing you down, and when you're comfortable I'll publish a new version.

kipcole9 (Collaborator) commented Oct 8, 2023

I've merged UniRecover.UTF8 into Unicode in the validate branch as Unicode.validate_utf8/2 and added some basic docs and one doctest.
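
A doctest-style sketch of the merged function (hedged: this assumes the options argument can be omitted via defaults and that the default replacement is U+FFFD; neither is pinned down in this thread):

iex> Unicode.validate_utf8(<<"ab", 0xFF, "cd">>)
"ab�cd"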

I'm not sure of the utility of adding the UTF-16 or UTF-32 versions, if only because Elixir strings are UTF-8 by definition. Is there a use case you had in mind where this would be valuable? Maybe I'm being overly conservative because the library is called Unicode, not UTF8?

One area of testing that needs some exploration (which you may have already done) is that surrogate pairs aren't valid UTF-8 (range D800-DFFF), but they are required in JSON escapes because JSON can't escape code points beyond U+FFFF directly. RFC 7159 says:

To escape an extended character that is not in the Basic Multilingual
Plane, the character is represented as a 12-character sequence,
encoding the UTF-16 surrogate pair. So, for example, a string
containing only the G clef character (U+1D11E) may be represented as
"\uD834\uDD1E".

Therefore I wonder if the code should consider these surrogate pairs to be valid - but replace them with the correct normalised codepoint? It's a "special case" - but then JSON is used a lot.
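
The surrogate-pair arithmetic itself is small; a minimal Elixir sketch (variable names are illustrative, not library API):

import Bitwise

hi = 0xD834
lo = 0xDD1E
codepoint = 0x10000 + ((hi - 0xD800) <<< 10) + (lo - 0xDC00)
# codepoint => 0x1D11E
<<codepoint::utf8>>
# => "𝄞" (the G clef, four bytes in UTF-8)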

Thoughts?

kipcole9 (Collaborator) commented Oct 8, 2023

I'm not sure validate_utf8/2 is the best function name - thoughts on alternatives? I was thinking force_valid_utf8/2, as the most expressive of the function's intent, might be better?

Moosieus (Collaborator, Author) commented Oct 8, 2023

UTF-16 is in wide circulation - languages like Java, C#, and JavaScript all represent strings as UTF-16, and Windows' native APIs still default to UTF-16 as well. It's also probably a safe bet that there are old APIs and databases still transmitting and storing data as UTF-16.

While most programmers don't interface with UTF-32 due to its inefficient memory usage, its uniformity allows code points to be accessed in constant time. Good ol' [citation needed] Wikipedia says UTF-32 is significant in text rendering. Although apocryphal, it makes intuitive sense to me: rendering graphics typically needs to be compute-efficient, and the characters would only be held in memory briefly.

Part of the beauty of Elixir and Erlang is how intuitive it makes delving into bits and bytes. I say support UTF-16 for the 1 in 100 people that need it, and UTF-32 for the one person that truly needs it. (If that one person's reading this, mad respect.) Plus it's not that much more to add.
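
And bit syntax really does make the extra encodings cheap to handle; here's a minimal sketch that walks UTF-16BE code units (hypothetical module, BMP-only for brevity - surrogate pairs omitted):

defmodule Utf16Peek do
  # Each UTF-16BE code unit is a 16-bit big-endian integer.
  def code_units(<<unit::16, rest::binary>>), do: [unit | code_units(rest)]
  def code_units(<<>>), do: []
end

Utf16Peek.code_units(<<0x00, 0x68, 0x00, 0x69>>)
# => [104, 105]  # "hi" in UTF-16BE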

Moosieus (Collaborator, Author) commented Oct 8, 2023

Concerning naming the public interface, it's an odd problem. I went with sub in UniRecover as an abridged form of substitute, thinking most would use it after some operation that expected proper UTF-8 had failed. Hence there would be invalid data to substitute, and naming it 'substitute' made sense.

Alternatively, one could pass all input through it eagerly, but that's a lot of extra compute at scale. Most people seemingly get along fine without doing so, indicating to me that encoding errors in the wild are rare... although I've spoken to several people who say they've had the exact opposite experience. I think there's some merit in considering the name in either context.

Above all else, I considered it important that the name conveyed some transformation was (potentially) being applied.

Perhaps it'd also be nice to return a keyword list that tells users what was replaced and where - that would make for good logging and observability.
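
One possible shape for that (purely a sketch; replace_invalid_with_report/1 is a hypothetical name, not a settled API):

{repaired, report} = replace_invalid_with_report(<<"ab", 0xFF, "cd">>)
# repaired => "ab�cd"
# report   => [replaced: {2, <<0xFF>>}]  # byte offset 2: one invalid byte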

Moosieus (Collaborator, Author) commented Oct 8, 2023

I peeked at RFC 7159, but the sidebar indicates it was obsoleted by RFC 8259. I'll have to review it in detail later, but it seems the relevant part you quoted still stands. It looks like there are a few amendments to how strings are handled, though.

If I'm interpreting this right, in JSON, characters beyond the BMP are written as escaped UTF-16 surrogate pairs, but the decoded strings are still plain UTF-8. It's UTF-ception.

I'll see about drafting up a quick Livebook to illustrate my point.
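
A quick check of that point (assuming the Jason library; any JSON decoder should behave the same):

json = ~S("\uD834\uDD1E")
Jason.decode!(json)
# => "𝄞"
byte_size(Jason.decode!(json))
# => 4  # U+1D11E occupies four UTF-8 bytes - no surrogates in sight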

kipcole9 (Collaborator) commented Oct 8, 2023

Cameron, all good points. Some additional thoughts:

  • Elixir tends to use replace instead of substitute, so perhaps replace_invalid/2? I typically think of names that assert the return, which is why I was thinking force_valid/2.

  • Got it on UTF-16/32. Makes sense. Perhaps return a {:utf16, binary} in those cases to make the encoding clear, and to make clear that the return type is not an Elixir String.t.

  • Could we add a function to detect encoding? That would be helpful (but not essential), and it would enable Unicode.detect_encoding/1 |> Unicode.replace_invalid/2?

  • I would find a version that returns a representation of what was replaced and where really helpful - great idea.

  • Then to complete the set, a function that transcodes between encodings would be great - it could just wrap Erlang's :unicode.characters_to_binary/3 function. I know that's a bit of an anti-pattern, but if someone uses this library to detect and correct Unicode, I think it would be natural to find a transcoding function too (see the sketch after this list).
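
A hedged sketch of that wrapper (module and function names are illustrative, not a settled API):

defmodule Unicode.Transcode do
  @type encoding :: :utf8 | {:utf16, :big | :little} | {:utf32, :big | :little}

  @spec transcode(binary(), encoding(), encoding()) :: binary() | {:error, term()}
  def transcode(binary, from, to) do
    # :unicode.characters_to_binary/3 returns the converted binary on
    # success, or an :error/:incomplete tuple carrying the unconverted rest.
    case :unicode.characters_to_binary(binary, from, to) do
      result when is_binary(result) -> result
      {:error, _converted, rest} -> {:error, {:invalid, rest}}
      {:incomplete, _converted, rest} -> {:error, {:incomplete, rest}}
    end
  end
end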

kipcole9 (Collaborator) commented Oct 9, 2023

If replace_invalid/2, then perhaps the spec might be:

@type replace_option ::
  {:encoding, :utf8 | :utf16 | :utf32}
  | {:replacement, binary()}

@type replace_options :: [replace_option()]

@spec replace_invalid(binary :: binary(), options :: replace_options()) :: 
  String.t() | {:utf16, binary()} | {:utf32, binary()}

@default_replace_invalid_options [encoding: :utf8, replacement: "�"]

I don't like the inconsistency of the return type. But I think the default case would be most common and therefore returning a String.t() makes sense. And to make it explicit that the other returns aren't String.t(), a tuple with the appropriate encoding makes that super clear.
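
For concreteness, calls under this spec might look like the following (hypothetical values; a big-endian encoding of U+FFFD is shown for the :utf16 case, though the spec doesn't pin down endianness):

replace_invalid(<<"ab", 0xFF, "cd">>)
# => "ab�cd"  # default :utf8 returns a String.t()

replace_invalid(<<0xD8, 0x00>>, encoding: :utf16)
# => {:utf16, <<0xFF, 0xFD>>}  # tagged tuple flags the non-String.t return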

WDYT?

Moosieus (Collaborator, Author) commented

replace_invalid works well. I think it's in line with what most programmers would expect.

Detecting a source's encoding from just the binary input seems challenging. I'd leave it in the hands of users to specify. They'll likely have more information to work with than the API consumes:

  • I'm reading a file on Windows, therefore the encoding is UTF-16LE
  • I'm reading text from a database, and the table collation's utf8mb3_general_ci, therefore UTF-8

Having users specify the encoding also removes the need for it in the return spec, as what you give is what you get.

For the replacement option, I'd just keep it a string and translate it to the encoding specified by the user, as the current API does now.
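
Translating the replacement string to the target encoding could lean on the same OTP function as above (sketch):

:unicode.characters_to_binary("�", :utf8, {:utf16, :big})
# => <<0xFF, 0xFD>>  # U+FFFD as UTF-16BE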

Moosieus linked a pull request Oct 19, 2023 that will close this issue.