Filter U+FEFF (BOM) when decoding input data #3200

kena0ki · 2020-12-31T13:38:05Z

This PR adds a check to skip a leading zero width character such as BOM.

Edit:
Changed the way to fix to filter U+FEFF (BOM) in TextDecoder.ts.

jerch

@kena0ki Thx for the patch. I see that your patch simply skips zero width codepoints at the line start. I am not sure if we can do it that way for any zero width codepoint, we would need some checks against other emulators and how they go about zero width codepoints. This is further complicated by the fact, that we currently only support post-combining characters while the unicode grapheme rules also allow pre-combining under certain circumstances. This post/pre handling decides in which buffer cells combining chars end up with big differences in the final render output.

Note that the original issue was only about the BOM char. I suggest we simply filter it for now not to end up in the buffer at all. Technically a BOM should never occur in a terminal data stream (as it falls under the forbid rules of https://tools.ietf.org/html/rfc3629).

kena0ki · 2021-01-13T11:52:53Z

@jerch Thank you for the review.
I didin't know about pre-combining. I need more study😀.

I suggest we simply filter it for now not to end up in the buffer at all.

I believe the patch would not harm current behaviour, but I'd like to follow your suggestion (filtering out only a BOM char).

Technically a BOM should never occur in a terminal data stream (as it falls under the forbid rules of https://tools.ietf.org/html/rfc3629).

I think I don't understand this line. Which forbid rule is applied? The first one? Since xterm.js only supports UTF-8.

jerch · 2021-01-13T13:26:42Z

I think I don't understand this line. Which forbid rule is applied? The first one? Since xterm.js only supports UTF-8.

Kinda both:

o A protocol SHOULD forbid use of U+FEFF as a signature for those
textual protocol elements that the protocol mandates to be always
UTF-8, the signature function being totally useless in those
cases.
o A protocol SHOULD also forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol provides
character encoding identification mechanisms, when it is expected
that implementations of the protocol will be in a position to
always use the mechanisms properly. This will be the case when
the protocol elements are maintained tightly under the control of
the implementation from the time of their creation to the time of
their (properly labeled) transmission.

First one because we only offer UTF8 input, yes (or Unicode JS-strings). But second one is also true for older terminals in general, as they used to handle different encodings by the much older ECMA-35 standard with proper "signature" (charset commands, a BOM did not even exist that time). We partially support these older charsets, but only if transcoded into the unicode/UTF8 realms beforehand. The BOM is superfluous in any case for us.

If we would allow different UTF versions / endianess as direct input to the terminal a BOM would be helpful. But we are tagged to UTF-8 or JS strings (already LE/BE aligned UTF-16/UCS-2), so BOM handling is an outer issue for us and should have been stripped by outer means. Since thats not the case, maybe we should filter it ourselves during UTF32 decoding in common/input/TextDecoder.ts.

kena0ki · 2021-01-14T10:47:25Z

Thank you for the explanation.

Since thats not the case, maybe we should filter it ourselves during UTF32 decoding in common/input/TextDecoder.ts.

I'll look into it.

kena0ki · 2021-01-23T04:48:53Z

I've updated the PR. Please take a look when you have time.

jerch

The changes look good 👍

Almost there, just one more thing - the UTF8 decoder works with stream chunks, thus there is a chance, that the BOM (3 bytes in UTF8) got split across chunks. I think we should also add the BOM skip rule here

xterm.js/src/common/input/TextDecoder.ts

Line 195 in 93aab95

if (cp < 0x0800 || (cp >= 0xD800 && cp <= 0xDFFF)) {

You can also test this by duplicating the 2/3/4 byte sequences - advance by 3 test and putting some BOMs in the string - they all should get filtered.

kena0ki · 2021-01-23T09:36:07Z

Thank you for the review.
It is my careless mistake. That is probably because I was not sure which is better that we filter only a first FEFF or all FEFFs at any position. (The rendered result would be the same.)
I had chosen the latter, but I found the browser built-in TextDecoder eliminates only the first FEFF.
Don't we need to imitate that behavior?
By adding ignoreBOM option and setting the option to false (i.e. filter FEFF) for only the first chunk of the input data, we can filter only the first FEFF in input data.

jerch · 2021-01-23T09:47:44Z

I had chosen the latter, but I found the browser built-in TextDecoder eliminates only the first FEFF.
Don't we need to imitate that behavior?

Hmm, thats good question. Well the browser is in a more comfortable position here, it can safely assume to deal with documents only (with a proper start and end). In the terminal we only have a byte stream, and dont know about document bounderies normally. Thus I'd say we should removed it at any position. But thats just my guess here.

kena0ki · 2021-01-23T10:01:45Z

In the terminal we only have a byte stream, and dont know about document bounderies normally.

That's true. So I'll choose filtering all FEFFs rather than unnecessarily adding complicated logic.

jerch · 2021-01-23T10:03:27Z

Thinking abit more about this - I think we are safe to always remove a BOM at any position. In theory there are issues with plain binary data transmission, but thats already the case for UTF8-marked terminal data in the first place. Binary data embedded in a terminal data stream is only safe, if mapped into save ASCII regions (e.g. base64), as naked 8bit codes are a nightmare with UTF8 and C0/C1 regions have to be skipped anyway. So yes - skipping the BOM anytime should be safe as it should not occur by random binary data (read as: random binary data should not occur at terminal data "transport level").

That's true. So I'll choose filtering all FEFFs rather than unnecessarily adding complicated logic.

Sounds better than doing complicated stuff, yeah (esp. since the decoders are somewhat hot functions and should be fast as hell).

kena0ki · 2021-01-23T11:48:52Z

I've added a BOM skip rule.

You can also test this by duplicating the 2/3/4 byte sequences - advance by 3 test and putting some BOMs in the string - they all should get filtered.

The string in a test I added only contains BOMs. If you don't like I'll change it.

jerch

LGTM 👍 , thx for dealing with this nasty low level stuff. Imho testing it this way is enough.

jerch requested changes Jan 2, 2021

View reviewed changes

Tyriar marked this pull request as draft January 4, 2021 15:50

Tyriar assigned jerch Jan 11, 2021

filter U+FEFF (BOM) when decoding input data

93aab95

kena0ki force-pushed the issue3032 branch from 08fcebc to 93aab95 Compare January 23, 2021 04:40

kena0ki marked this pull request as ready for review January 23, 2021 04:49

kena0ki requested a review from jerch January 23, 2021 04:49

kena0ki changed the title ~~Skip a leading zero width character~~ Filter U+FEFF (BOM) when decoding input data Jan 23, 2021

jerch requested changes Jan 23, 2021

View reviewed changes

add BOM skip logic

3d941f2

jerch approved these changes Jan 23, 2021

View reviewed changes

jerch added this to the 4.10.0 milestone Jan 23, 2021

jerch merged commit ff74caf into xtermjs:master Jan 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Filter U+FEFF (BOM) when decoding input data #3200

Filter U+FEFF (BOM) when decoding input data #3200

kena0ki commented Dec 31, 2020 •

edited

Loading

jerch left a comment

kena0ki commented Jan 13, 2021

jerch commented Jan 13, 2021

kena0ki commented Jan 14, 2021

kena0ki commented Jan 23, 2021

jerch left a comment •

edited

Loading

kena0ki commented Jan 23, 2021

jerch commented Jan 23, 2021

kena0ki commented Jan 23, 2021

jerch commented Jan 23, 2021 •

edited

Loading

kena0ki commented Jan 23, 2021

jerch left a comment •

edited

Loading

Filter U+FEFF (BOM) when decoding input data #3200

Filter U+FEFF (BOM) when decoding input data #3200

Conversation

kena0ki commented Dec 31, 2020 • edited Loading

jerch left a comment

Choose a reason for hiding this comment

kena0ki commented Jan 13, 2021

jerch commented Jan 13, 2021

kena0ki commented Jan 14, 2021

kena0ki commented Jan 23, 2021

jerch left a comment • edited Loading

Choose a reason for hiding this comment

kena0ki commented Jan 23, 2021

jerch commented Jan 23, 2021

kena0ki commented Jan 23, 2021

jerch commented Jan 23, 2021 • edited Loading

kena0ki commented Jan 23, 2021

jerch left a comment • edited Loading

Choose a reason for hiding this comment

kena0ki commented Dec 31, 2020 •

edited

Loading

jerch left a comment •

edited

Loading

jerch commented Jan 23, 2021 •

edited

Loading

jerch left a comment •

edited

Loading