U+FFFD Substitution of Maximal Subparts #7
Comments
@Moosieus, happy to integrate in this library. I think you're selling your current implementation short - it's already performant and memory efficient. I propose:
I will add you as a collaborator on the library so you can also push changes to it without me slowing you down, and when you're comfortable I'll publish a new version.
I've merged

I'm not sure of the utility of adding the UTF-16 or UTF-32 versions only because Elixir strings are UTF-8 by definition. Is there a use case you had in mind where this would be valuable? Maybe I'm being overly conservative because the library is called

One area of testing that needs some exploration (which you may have already done) is that surrogate pairs aren't valid UTF-8 (range U+D800 to U+DFFF) but they are required in JSON because JSON can't encode escapes beyond
Therefore I wonder if the code should consider these surrogate pairs to be valid - but replace them with the correct normalised codepoint? It's a "special case" - but then JSON is used a lot. Thoughts?
I'm not sure
UTF-16 is in wide circulation - languages like Java, C#, and JavaScript all represent strings as UTF-16. The default encoding for files on Windows remains UTF-16 as well. It's also probably a safe bet that there are old APIs and databases still transmitting and storing stuff as UTF-16.

While most programmers don't interface with UTF-32 due to its inefficient usage of memory, its uniformity allows code points to be accessed in constant time. Good ol' [citation needed] Wikipedia says UTF-32 is significant in text rendering. Although apocryphal, it makes intuitive sense to me. Rendering graphics typically needs to be compute-efficient, and the characters would only be held in memory briefly.

Part of the beauty of Elixir and Erlang is how intuitive it makes delving into bits and bytes. I say support UTF-16 for the 1 in 100 people that need it, and UTF-32 for the one person that truly needs it. (If that one person's reading this, mad respect.) Plus it's not that much more to add.
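As a rough illustration of the size and access trade-offs described above, the sketch below re-encodes a short UTF-8 string with OTP's `:unicode.characters_to_binary/3`. The module and function names are hypothetical and are not part of UniRecover or this library.

```elixir
defmodule EncodingSketch do
  # Hypothetical helper: re-encode a UTF-8 binary as big-endian UTF-16 or UTF-32.
  def encode(text, :utf16), do: :unicode.characters_to_binary(text, :utf8, {:utf16, :big})
  def encode(text, :utf32), do: :unicode.characters_to_binary(text, :utf8, {:utf32, :big})

  # UTF-32 uses exactly 4 bytes per code point, so the nth code point
  # (zero-based) sits at a fixed byte offset and needs no scanning.
  def utf32_at(utf32, n) do
    <<cp::utf32>> = binary_part(utf32, n * 4, 4)
    cp
  end
end

text = "a\u{1F600}"                      # "a" followed by U+1F600
utf16 = EncodingSketch.encode(text, :utf16)
utf32 = EncodingSketch.encode(text, :utf32)

IO.inspect(byte_size(text))              # 1 byte + 4 bytes in UTF-8
IO.inspect(byte_size(utf16))             # 2 bytes + a 4-byte surrogate pair
IO.inspect(byte_size(utf32))             # 4 bytes per code point
IO.inspect(EncodingSketch.utf32_at(utf32, 1))
```

The fixed-width lookup in `utf32_at/2` is the constant-time access mentioned above; the same lookup in UTF-8 or UTF-16 has to walk the binary because code points vary in width.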
Concerning naming the public interface, it's an odd problem. I went with

Alternatively one could pass all input through it eagerly, but that's a lot of extra compute at-scale. Most people seemingly get along fine without doing so, indicating to me that encoding errors in the wild are rare... Although I've spoken to several people who say they've had the exact opposite experience.

I think there's some merit considering the name in either context. Above all else, I considered it important the name conveyed some transformation was (potentially) being applied.

Perhaps it'd be nice to return a keyword list that told users what was replaced and where. Would make for good logging and observability.
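To make the "what was replaced and where" idea concrete, here is a minimal sketch for UTF-8 input only. It is not UniRecover's implementation: the module name, the `{scrubbed, replacements}` return shape, and the `{byte_offset, invalid_bytes}` entries are all assumptions for illustration (a plain tuple list is used rather than a true keyword list, since the keys here are offsets). The invalid-sequence lengths follow the well-formed byte ranges in the Unicode Standard's Table 3-7.

```elixir
defmodule ScrubSketch do
  @replacement "\uFFFD"

  # Returns {scrubbed_binary, [{byte_offset, invalid_bytes}, ...]}.
  def scrub(bin) when is_binary(bin), do: do_scrub(bin, 0, [], [])

  defp do_scrub(<<>>, _offset, acc, log) do
    {IO.iodata_to_binary(Enum.reverse(acc)), Enum.reverse(log)}
  end

  # Well-formed code point: copy it through unchanged.
  defp do_scrub(<<cp::utf8, rest::binary>>, offset, acc, log) do
    char = <<cp::utf8>>
    do_scrub(rest, offset + byte_size(char), [char | acc], log)
  end

  # Ill-formed: replace one maximal subpart with a single U+FFFD and log it.
  defp do_scrub(bin, offset, acc, log) do
    n = subpart_len(bin)
    <<bad::binary-size(n), rest::binary>> = bin
    do_scrub(rest, offset + n, [@replacement | acc], [{offset, bad} | log])
  end

  # Only called when the front of the binary is NOT a well-formed sequence,
  # so a valid prefix here is always a truncated or interrupted sequence.
  # Three-byte leads with a valid second byte: the third byte is missing or bad.
  defp subpart_len(<<0xE0, b, _::binary>>) when b in 0xA0..0xBF, do: 2
  defp subpart_len(<<a, b, _::binary>>) when a in 0xE1..0xEC and b in 0x80..0xBF, do: 2
  defp subpart_len(<<0xED, b, _::binary>>) when b in 0x80..0x9F, do: 2
  defp subpart_len(<<a, b, _::binary>>) when a in 0xEE..0xEF and b in 0x80..0xBF, do: 2
  # Four-byte leads with a valid second byte: include the third byte if valid.
  defp subpart_len(<<0xF0, b, rest::binary>>) when b in 0x90..0xBF, do: 2 + cont(rest)
  defp subpart_len(<<a, b, rest::binary>>) when a in 0xF1..0xF3 and b in 0x80..0xBF, do: 2 + cont(rest)
  defp subpart_len(<<0xF4, b, rest::binary>>) when b in 0x80..0x8F, do: 2 + cont(rest)
  # Anything else (stray continuation, bad lead, lead with bad second byte).
  defp subpart_len(_), do: 1

  defp cont(<<c, _::binary>>) when c in 0x80..0xBF, do: 1
  defp cont(_), do: 0
end

# The example byte sequence from the Unicode Standard's discussion of U+FFFD substitution.
input = <<0x61, 0xF1, 0x80, 0x80, 0xE1, 0x80, 0xC2, 0x62, 0x80, 0x63, 0x80, 0xBF, 0x64>>
{scrubbed, replacements} = ScrubSketch.scrub(input)
IO.inspect(scrubbed)
IO.inspect(replacements)
```

Returning the log alongside the scrubbed binary keeps the common case cheap (an empty list when nothing was replaced) while giving callers enough to log offsets and the original bytes.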
I peeked at RFC 7159, but the sidebar indicates it was obsoleted by RFC 8259. I'll have to review it in detail later, but it seems the relevant part you quoted remains standing. Looks like there are a few amendments to how strings are handled though. If I'm interpreting this right, in JSON, characters beyond the BMP are written as escaped UTF-16 code units (a surrogate pair), but are still all UTF-8 characters. It's UTF-ception. I'll see about drafting up a quick Livebook to illustrate my point.
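For reference, the arithmetic behind a JSON surrogate-pair escape such as `\uD83D\uDE00` can be sketched as below. This is illustration only, with a hypothetical module name; it assumes the two 16-bit values have already been parsed out of the escape sequences and is not how any particular JSON library does it.

```elixir
defmodule SurrogateSketch do
  # Combine a high (lead) and low (trail) UTF-16 surrogate into one code point.
  # Each surrogate carries 10 bits of the value above 0x10000.
  def combine(hi, lo) when hi in 0xD800..0xDBFF and lo in 0xDC00..0xDFFF do
    0x10000 + (hi - 0xD800) * 0x400 + (lo - 0xDC00)
  end
end

# The JSON escape "\uD83D\uDE00" names the two code units 0xD83D and 0xDE00.
cp = SurrogateSketch.combine(0xD83D, 0xDE00)
IO.inspect(cp, base: :hex)

# Encoded as UTF-8, the combined code point is an ordinary four-byte sequence.
IO.inspect(<<cp::utf8>>, binaries: :as_binaries, base: :hex)
```

The pair only exists in the escaped text; once decoded, the string holds a single supplementary code point in UTF-8 and no surrogate bytes at all.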
Cameron, all good points. Some additional thoughts:
If
I don't like the inconsistency of the return type. But I think the default case would be most common and therefore returning a WDYT?
Detecting a source's encoding from just the binary input seems challenging. I'd leave it in the hands of users to specify. They'll likely have more information to work with than the API consumes:
Having the users specify also removes the need for it in the return spec, as what you give is what you get. For the replacement option, I'd just keep it a string and translate it to the encoding specified by the user, as the current API does now.
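Translating a UTF-8 replacement string into the caller's encoding could look roughly like this. The module, function, and encoding atoms are assumptions for the sketch, not the library's actual option names, and big-endian output is assumed for UTF-16 and UTF-32.

```elixir
defmodule ReplacementSketch do
  # Hypothetical helper: the replacement is always supplied as a normal
  # (UTF-8) Elixir string and converted once to the target encoding.
  def encode(replacement, :utf8), do: replacement

  def encode(replacement, encoding) when encoding in [:utf16, :utf32] do
    :unicode.characters_to_binary(replacement, :utf8, {encoding, :big})
  end
end

IO.inspect(ReplacementSketch.encode("\uFFFD", :utf8), binaries: :as_binaries, base: :hex)
IO.inspect(ReplacementSketch.encode("\uFFFD", :utf16), binaries: :as_binaries, base: :hex)
IO.inspect(ReplacementSketch.encode("\uFFFD", :utf32), binaries: :as_binaries, base: :hex)
```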
Background on the issue:
At time of writing U+FFFD substitution isn't built into OTP. I've got an initial "passable" solution at github.com/Moosieus/UniRecover.
@kipcole9 You floated the idea of rolling the above functionality into here, and I think that's appropriate:
is-even-esque "micro-packages", for lack of a better term, and I consider UniRecover to be one.

As to next steps - I dunno. My current action items in UniRecover are broadly:
Thoughts?