Replace invalid unicode with replacement chars #12
Comments
Hey @michalmuskala, I'm interested in trying to create a PR for this. Do you have any implementation ideas or other pointers?
I think the best way to handle this would be to expand the

As to how to handle the replacement char, there are two options: replace every invalid byte with its own replacement char, or collect each run of consecutive invalid bytes and emit a single replacement char in its place. I've seen references on the internet to both methods, so it would likely take some research to see which one is most compliant with the standard. In that case we should probably also treat invalid escapes as replacement chars rather than errors; basically, turn every string decoding error into replacement chars.
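To make the two options concrete, here is a rough sketch in Python (used purely for illustration; it is not the library's code, and `decode_lossy` and its `collapse` flag are hypothetical names). It tries to decode one well-formed UTF-8 sequence at a time and, on failure, either emits one U+FFFD per bad byte or one per run of bad bytes.

```python
REPLACEMENT = "\ufffd"


def decode_lossy(data: bytes, collapse: bool) -> str:
    """Decode UTF-8, substituting U+FFFD for invalid bytes.

    collapse=False: one U+FFFD per invalid byte.
    collapse=True:  one U+FFFD per run of consecutive invalid bytes.
    """
    out = []
    i = 0
    in_bad_run = False
    while i < len(data):
        # A well-formed UTF-8 sequence is 1 to 4 bytes long, so try each
        # length in turn and accept the first one that decodes strictly.
        for n in (1, 2, 3, 4):
            try:
                ch = data[i:i + n].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(ch)
            i += n
            in_bad_run = False
            break
        else:
            # No valid sequence starts at this byte: treat it as invalid.
            if not (collapse and in_bad_run):
                out.append(REPLACEMENT)
            in_bad_run = True
            i += 1
    return "".join(out)
```

For the input `b"a\xff\xfeb"`, `collapse=False` yields `"a\ufffd\ufffdb"` and `collapse=True` yields `"a\ufffdb"`. Neither variant is claimed here to be the standard-compliant one; they only illustrate the two behaviours under discussion.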
This seems to be an infrequent but challenging issue that people still encounter. Shortly after the last post on this issue, the Unicode Standard was updated to promote W3C's standard for consistent substitution (seen here, under the heading "U+FFFD Substitution of Maximal Subparts"). The basic gist is:
I couldn't find anything in Elixir or Erlang that did this, so I wrote my own. Ideally, though, Elixir or OTP would provide a native solution (as other languages do).
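As an illustration of the "maximal subparts" practice mentioned above, here is a minimal sketch in Python (not the Elixir implementation referred to in this comment; the function names are hypothetical). It assumes the usual reading of the practice: each byte that cannot start a well-formed sequence becomes one U+FFFD, and a truncated sequence is replaced as a whole by a single U+FFFD covering its longest valid prefix.

```python
REPLACEMENT = "\ufffd"


def _lead_byte_spec(lead: int):
    """Return (sequence length, min 2nd byte, max 2nd byte) for a lead
    byte of a well-formed multi-byte UTF-8 sequence, or None otherwise."""
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF  # excludes overlong 3-byte forms
    if 0xE1 <= lead <= 0xEC or 0xEE <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xED:
        return 3, 0x80, 0x9F  # excludes UTF-16 surrogates
    if lead == 0xF0:
        return 4, 0x90, 0xBF  # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    if lead == 0xF4:
        return 4, 0x80, 0x8F  # excludes code points above U+10FFFF
    return None


def decode_maximal_subparts(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:  # ASCII
            out.append(chr(b))
            i += 1
            continue
        spec = _lead_byte_spec(b)
        if spec is None:
            # Stray continuation byte or a byte that is never valid.
            out.append(REPLACEMENT)
            i += 1
            continue
        length, lo, hi = spec
        j = i + 1
        # Consume continuation bytes while the sequence stays well-formed.
        while j < i + length and j < n and lo <= data[j] <= hi:
            j += 1
            lo, hi = 0x80, 0xBF  # later bytes use the generic range
        if j == i + length:
            out.append(data[i:j].decode("utf-8"))
        else:
            # data[i:j] is a maximal subpart: one U+FFFD for all of it.
            out.append(REPLACEMENT)
        i = j
    return "".join(out)
```

With the byte sequence `61 F1 80 80 E1 80 C2 62 80 63 80 BF 64`, this sketch produces `a`, three replacement chars (for `F1 80 80`, `E1 80` and `C2`), `b`, one replacement char, `c`, two replacement chars, then `d`.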
Would it be possible to integrate the matches from String.replace_invalid/2 into the existing
This could be an optional mode for the parser.