
Allow different character encodings in HTML #1

Open
rrthomas opened this issue Jan 19, 2018 · 6 comments

@rrthomas
Owner

See Debian bug #748984: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=748984

@ccbenny

ccbenny commented Sep 14, 2024

Hi Thomas. If I wanted to tackle this, do you have some tips?

It seems the way forward is to implement entities as a surface. I understand from the code that surfaces are declared as modules that convert between "data" and the respective surface, where "data" is basically a byte stream, because things like base64 or quoted-printable are defined on bytes, not on characters.

This would be different here, where we basically convert between UCS-2 and entities. So to implement it, I guess I would use recode_get/put_ucs2 instead of recode_get/put_byte. But how would we hook this into the framework? Is that supported at all?

Or should I implement this as an intermediate step instead, creating an html-ucs2 charset with the UCS-2 functions and then doing utf-8..html-ucs2..utf-8? How do these intermediates work? What gets passed between the steps? Is that always UCS-2?
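
To make concrete what I mean by converting between UCS-2 and entities, here is a rough standalone sketch of the encoding half; the table and function names are just mine for illustration, not librecode's API:

```c
/* Rough sketch of what the UCS-2 side of an "html-ucs2" step might do
   when producing entities: code points in, ASCII text with named or
   numeric references out.  Standalone code, not librecode's API.  */
#include <stdio.h>

/* A tiny sample of the HTML named-entity table.  */
struct named_entity { unsigned code; const char *name; };
static const struct named_entity named_entities[] = {
  { 0x00E9, "eacute" },
  { 0x00E8, "egrave" },
  { 0x2260, "ne" },
};

static void put_entity (unsigned code, FILE *out)
{
  if (code < 0x80)
    {
      /* ASCII passes through literally.  */
      fputc ((int) code, out);
      return;
    }
  for (size_t i = 0;
       i < sizeof named_entities / sizeof named_entities[0]; i++)
    if (named_entities[i].code == code)
      {
        fprintf (out, "&%s;", named_entities[i].name);
        return;
      }
  /* No name known: fall back to a numeric character reference.  */
  fprintf (out, "&#%u;", code);
}

int main (void)
{
  /* "café ≠ cafê" as code points.  */
  const unsigned text[] = { 'c', 'a', 'f', 0x00E9, ' ', 0x2260, ' ',
                            'c', 'a', 'f', 0x00EA };
  for (size_t i = 0; i < sizeof text / sizeof text[0]; i++)
    put_entity (text[i], stdout);
  putchar ('\n');
  return 0;
}
```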

Background: I have just completed some data-wrangling with a large database dump, encoded as JSON in UTF-8, with some of the fields containing HTML entities and lots of strange data. So I encountered this bug and also #25 and #58. Given that I have some fresh experience now, and a large test case, I thought I'd give these some thought ;-)

@ccbenny

ccbenny commented Sep 14, 2024

Hi Thomas

Sorry, that's Reuben, right? 🙏

@rrthomas
Owner Author

Hi Thomas

Sorry, that's Reuben, right? 🙏

Yes! No worries, it's an easy mistake to make, happens all the time.

@rrthomas
Owner Author

I'll do my best here; I'm a bit stale on Recode at the moment, so I don't guarantee that my understanding is that good.

Reading the manual, it seems there are two main options: either implement entities as a surface (in which case they are independent of HTML, which seems OK: entity syntax is fairly general), or implement them as a charset, as mentioned in the following paragraph from the manual:

Even if surfaces may generally be applied to various charsets, some surfaces were specifically designed for a particular charset, and would not make much sense if applied to other charsets. In such cases, these conceptual surfaces have been implemented as Recode charsets, instead of as surfaces.

The simplest way I can think of to implement entities as a surface is to take literally the idea that, as the manual says, "A 'surface' is the varnish added over a charset so it fits in actual bits and bytes." That is, it's a binary encoding of the characters. So, ASCII characters are represented literally, while any ≥8-bit character is represented as an entity. I think this is compatible with all known charsets: that is, adding such a surface never generates code points that the underlying charset can't handle. So, there's an unambiguous way to add the surface; and removing the surface is OK provided that the original text didn't contain any entities. (Similar problems could occur with other surfaces.)
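
Taken completely literally, the encoding direction of such a surface would be something like this deliberately naive standalone sketch. It works purely on bytes, as a surface does; it is not Recode's step API, and each non-ASCII byte just becomes a numeric reference to its own value:

```c
/* Naive sketch of "entities as a surface" in the encoding direction,
   working purely on bytes: ASCII bytes pass through, any byte >= 0x80
   is written as a numeric character reference to its byte value.
   Standalone illustration, not Recode's step API.  */
#include <stdio.h>

static void apply_entity_surface (FILE *in, FILE *out)
{
  int c;
  while ((c = fgetc (in)) != EOF)
    {
      if (c < 0x80)
        fputc (c, out);
      else
        fprintf (out, "&#%d;", c);
    }
}

int main (void)
{
  apply_entity_surface (stdin, stdout);
  return 0;
}
```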

How's that for a start?

@ccbenny

ccbenny commented Sep 20, 2024

Hi Reuben.

Thanks for your thoughts. In the meantime I have tried to use a surface and did not have much success. But I think I understand the problem better now.

Even if surfaces may generally be applied to various charsets, some surfaces were specifically designed for a particular charset, and would not make much sense if applied to other charsets. In such cases, these conceptual surfaces have been implemented as Recode charsets, instead of as surfaces.

Yes, this seems to apply here, thanks for the quote. Currently "html" is tied to ISO8859-1 as the charset when decoding: it just passes existing bytes through as characters, extending them to Unicode (UCS-2) without further mapping. On encoding, it takes in UCS-2 and produces ASCII. It seems to me that all that is needed here is some rule/mapping for how to encode/decode the non-ASCII characters, and ISO8859-1, UCS-2 or UCS-4 are the only reasonable choices, because those mappings are "trivial", as the mathematicians say.

Surfaces do not specify a charset at all and I have not found a way to infer a charset from the data structure that the conversion function gets passed. It is quite possible that I am missing something here. Also something like "utf8/surface..utf8" gets optimized to "/surface.." early on.

So, ASCII characters are represented literally, while any ≥8-bit character is represented as an entity.

That is OK when we encode entities (and it is what the current code does), but it is not what we get when we decode text with entities. On decoding we get non-ASCII characters in addition to entities, and we need to pass them on in the same charset as the result of decoding the entities. So we need to know what that charset is, which is the problem that this bug report is about, after all ;-)
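
A standalone sketch of the decoding side (again my own code, not Recode's API) to illustrate: the references themselves decode unambiguously to Unicode code points, but the raw bytes in between only get a meaning once some charset is assumed:

```c
/* Sketch of the decoding problem: numeric references decode to Unicode
   code points on their own, but the raw bytes between them only become
   code points once we assume a charset.  Standalone illustration, not
   Recode's API.  */
#include <stdio.h>
#include <stdlib.h>

int main (void)
{
  /* Mixed input: one raw 8-bit byte, one decimal reference.  */
  const char *input = "caf\xE9 &#8800; caf&#233;";

  for (const char *p = input; *p; )
    {
      unsigned code;
      if (p[0] == '&' && p[1] == '#')
        {
          /* The reference itself is charset-independent: it always
             denotes a Unicode code point.  */
          char *end;
          code = (unsigned) strtoul (p + 2, &end, 10);
          p = (*end == ';') ? end + 1 : end;
        }
      else
        {
          /* A raw byte.  Here we simply assume ISO-8859-1, where the
             byte value is the code point; for any other input charset
             this line would have to change, and that is exactly the
             piece of information a surface does not have.  */
          code = (unsigned char) *p++;
        }
      printf ("U+%04X ", code);
    }
  putchar ('\n');
  return 0;
}
```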

I will try a charset "html+ucs4" next and see where that takes me.

Regards, benny

@rrthomas
Owner Author

Yes, this seems to apply here, thanks for the quote. Currently "html" is tied to ISO8859-1 as the charset when decoding: it just passes existing bytes through as characters, extending them to Unicode (UCS-2) without further mapping. On encoding, it takes in UCS-2 and produces ASCII. It seems to me that all that is needed here is some rule/mapping for how to encode/decode the non-ASCII characters, and ISO8859-1, UCS-2 or UCS-4 are the only reasonable choices, because those mappings are "trivial", as the mathematicians say.

I'm not sure if these are the only reasonable choices, but they're certainly the most useful ones. (Arguably one would want other 8-bit charsets too for old documents.)

Surfaces do not specify a charset at all and I have not found a way to infer a charset from the data structure that the conversion function gets passed. It is quite possible that I am missing something here. Also something like "utf8/surface..utf8" gets optimized to "/surface.." early on.

As you mentioned originally, the pseudo-charset data seems to come into play here: surfaces are applied to and removed from data, not any specific encoding.

That is OK when we encode entities (and it is what the current code does), but it is not what we get when we decode text with entities. On decoding we get non-ASCII characters in addition to entities, and we need to pass them on in the same charset as the result of decoding the entities. So we need to know what that charset is, which is the problem that this bug report is about, after all ;-)

Don't we just need to pass them on as data, and then the charset depends on other steps? I can't see how this would interact with the html charset, but as an alternative that just deals with entities as an encoding mechanism it seems to make sense. One could imagine a set of entity surfaces, one for each flavour of HTML entities currently accepted, but not tied to specific charsets. One would need to specify an explicit charset as well as the surface used (although often there would be only one sensible charset for a given entity set).
