Allow different character encodings in HTML #1
Hi Thomas. If I wanted to tackle this, do you have some tips? It seems the way forward is to implement entities as a surface. I understand from the code that surfaces are declared as modules that convert between "data" and the respective surface, where "data" is basically a byte stream, because stuff like base64 or quoted-printable is defined on the bytes, not on the characters. This would be different here, where we basically convert between UCS-2 and entities. So to implement, I guess I would use … Or should I implement this as an intermediate step instead, creating a …?

Background: I have just completed some data-wrangling with a large database dump, encoded as JSON and UTF-8, with some of the fields containing HTML entities and lots of strange data. So I encountered this bug and also #25 and #58. Given that I have some fresh experience now, and a large test case, I thought I'd give these some thought ;-)
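To make the bytes-versus-characters difference concrete, here is a minimal sketch (my own illustration with made-up names, not Recode's module API) of why entity coding cannot be a pure byte transform the way base64 or quoted-printable can: it has to see whole code points:

```c
#include <stdint.h>
#include <stdio.h>

/* Encode one UCS-2 code point the way an entity coder would have to:
   plain ASCII passes through as a literal byte, everything else becomes
   a numeric character reference.  Unlike base64 or quoted-printable,
   this cannot be defined on raw bytes: it needs whole code points. */
static void put_entity_encoded(uint16_t code, FILE *out)
{
  if (code < 0x80 && code != '&' && code != '<' && code != '>')
    fputc((int) code, out);                  /* literal ASCII */
  else
    fprintf(out, "&#%u;", (unsigned) code);  /* numeric entity */
}

int main(void)
{
  uint16_t text[] = { 'c', 'a', 'f', 0x00E9 };   /* "café" in UCS-2 */
  for (size_t i = 0; i < sizeof text / sizeof *text; i++)
    put_entity_encoded(text[i], stdout);
  fputc('\n', stdout);                       /* prints: caf&#233; */
  return 0;
}
```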
Sorry, that's Reuben, right? 🙏

Yes! No worries, it's an easy mistake to make, happens all the time.
I'll do my best here; I'm a bit stale on Recode at the moment, so I don't guarantee that my understanding is that good. Reading the manual, it seems there are two main options: either implement entities as a surface (in which case they are independent of HTML, which seems OK: entity syntax is fairly general), or implement them as a charset (as mentioned in the following paragraph of the manual: …)
The simplest way I can think of to implement entities as a surface is to take literally the idea that, as the manual says, "A 'surface' is the varnish added over a charset so it fits in actual bits and bytes." That is, it's a binary encoding of the characters. So, ASCII characters are represented literally, while any ≥8-bit character is represented as an entity. I think this is compatible with all known charsets: that is, adding such a surface never generates code points that the underlying charset can't handle. So, there's an unambiguous way to add the surface; and removing the surface is OK provided that the original text didn't contain any entities. (Similar problems could occur with other surfaces.) How's that for a start?
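A rough sketch of the removal direction under that interpretation (again with hypothetical names, not Recode's API); the caveat above shows up directly: a literal `&#233;` that was in the text before the surface was applied is indistinguishable from one the surface added, and gets folded too:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Remove the hypothetical entity surface: ASCII bytes become code
   points directly; "&#N;" sequences are folded back into single code
   points.  A pre-existing literal "&#N;" is folded as well -- the
   ambiguity discussed above. */
static size_t remove_entity_surface(const char *in, uint32_t *out)
{
  size_t n = 0;
  while (*in)
    {
      if (in[0] == '&' && in[1] == '#')
        {
          char *end;
          unsigned long code = strtoul(in + 2, &end, 10);
          if (*end == ';')
            {
              out[n++] = (uint32_t) code;
              in = end + 1;
              continue;
            }
        }
      out[n++] = (uint32_t) (unsigned char) *in++;
    }
  return n;
}

int main(void)
{
  uint32_t buf[64];
  size_t n = remove_entity_surface("caf&#233;", buf);
  for (size_t i = 0; i < n; i++)
    printf("U+%04X ", (unsigned) buf[i]);  /* U+0063 U+0061 U+0066 U+00E9 */
  putchar('\n');
  return 0;
}
```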
Hi Reuben. Thanks for your thoughts. I have in the meantime tried implementing this as a surface and did not have much success, but I think I understand the problem better now.
Yes, this seems to apply here, thanks for the quote. Currently "html" is tied to ISO8859-1 as the charset when decoding: it just passes existing bytes through as characters, extending them to Unicode (UCS-2) without further mapping. On encoding, it takes in UCS-2 and produces ASCII.

It seems to me that all that is missing is a rule/mapping for how to encode/decode the raw (non-entity) characters, and ISO8859-1, UCS-2 or UCS-4 are the only reasonable choices, because those mappings are "trivial", as the mathematicians say. Surfaces do not specify a charset at all, and I have not found a way to infer a charset from the data structure that the conversion function gets passed. It is quite possible that I am missing something here. Also, something like "utf8/surface..utf8" gets optimized to "/surface.." early on.
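For illustration, a simplified model of the decoding behaviour described above (my reading of it, not the actual Recode source): raw bytes are widened to UCS-2 unchanged, which is exactly the ISO8859-1 mapping, so input in any other charset comes out wrong:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Model of the current "html" decoding step: each raw byte is widened
   to a UCS-2 code point as-is (i.e. implicitly treated as ISO8859-1);
   only numeric entities are actually mapped. */
static size_t html_decode_latin1(const char *in, uint16_t *out)
{
  size_t n = 0;
  while (*in)
    {
      if (in[0] == '&' && in[1] == '#')
        {
          char *end;
          unsigned long code = strtoul(in + 2, &end, 10);
          if (*end == ';' && code <= 0xFFFF)
            {
              out[n++] = (uint16_t) code;
              in = end + 1;
              continue;
            }
        }
      out[n++] = (uint16_t) (unsigned char) *in++;  /* Latin-1 assumption */
    }
  return n;
}

int main(void)
{
  /* UTF-8 input: 'é' arrives as the two bytes 0xC3 0xA9, which this
     decoder wrongly widens to U+00C3 U+00A9 ("Ã©"). */
  uint16_t buf[64];
  size_t n = html_decode_latin1("caf\xC3\xA9 &#233;", buf);
  for (size_t i = 0; i < n; i++)
    printf("U+%04X ", (unsigned) buf[i]);
  putchar('\n');
  return 0;
}
```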
That is OK when we encode entities (and it is what the current code does), but it is not what we get when we decode text with entities. On decoding we get characters beyond 7-bit ASCII in addition to entities, and we need to pass them on in the same charset as the result of decoding the entities. So we need to know what that charset is. Which is the problem that this bug report is about, after all ;-) I will try a charset "html+ucs4" next and see where that takes me. Regards, benny
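One possible reading of the "html+ucs4" idea (my interpretation only, sketched with hypothetical names): convert the document from its real charset to UCS-4 first, then fold entities entirely at the code-point level, so that raw characters and decoded entities trivially end up in the same charset:

```c
#include <stdint.h>
#include <stdio.h>

/* Fold numeric entities in a stream that has already been decoded to
   UCS-4; raw non-ASCII characters pass through untouched, so no
   charset guessing is needed at this stage. */
static size_t fold_entities_ucs4(const uint32_t *in, size_t len, uint32_t *out)
{
  size_t n = 0, i = 0;
  while (i < len)
    {
      if (in[i] == '&' && i + 1 < len && in[i + 1] == '#')
        {
          uint32_t code = 0;
          size_t j = i + 2;
          while (j < len && in[j] >= '0' && in[j] <= '9')
            code = code * 10 + (in[j++] - '0');
          if (j > i + 2 && j < len && in[j] == ';')
            {
              out[n++] = code;
              i = j + 1;
              continue;
            }
        }
      out[n++] = in[i++];
    }
  return n;
}

int main(void)
{
  /* "é&#228;" already decoded from (say) UTF-8 into UCS-4 */
  uint32_t in[] = { 0xE9, '&', '#', '2', '2', '8', ';' };
  uint32_t out[16];
  size_t n = fold_entities_ucs4(in, sizeof in / sizeof *in, out);
  for (size_t i = 0; i < n; i++)
    printf("U+%04X ", (unsigned) out[i]);  /* U+00E9 U+00E4 */
  putchar('\n');
  return 0;
}
```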
I'm not sure if these are the only reasonable choices, but they're certainly the most useful ones. (Arguably one would want other 8-bit charsets too for old documents.)
As you mentioned originally, the pseudo-charset …

Don't we just need to pass them on as …?
See Debian bug #748984: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=748984