Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filter U+FEFF (BOM) when decoding input data #3200

Merged
merged 2 commits into from
Jan 23, 2021
Merged

Conversation

kena0ki
Copy link
Contributor

@kena0ki kena0ki commented Dec 31, 2020

Fixes #3032

This PR adds a check to skip a leading zero width character such as BOM.

Edit:
Changed the way to fix to filter U+FEFF (BOM) in TextDecoder.ts.

Copy link
Member

@jerch jerch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kena0ki Thx for the patch. I see that your patch simply skips zero width codepoints at the line start. I am not sure if we can do it that way for any zero width codepoint, we would need some checks against other emulators and how they go about zero width codepoints. This is further complicated by the fact, that we currently only support post-combining characters while the unicode grapheme rules also allow pre-combining under certain circumstances. This post/pre handling decides in which buffer cells combining chars end up with big differences in the final render output.

Note that the original issue was only about the BOM char. I suggest we simply filter it for now not to end up in the buffer at all. Technically a BOM should never occur in a terminal data stream (as it falls under the forbid rules of https://tools.ietf.org/html/rfc3629).

@Tyriar Tyriar marked this pull request as draft January 4, 2021 15:50
@kena0ki
Copy link
Contributor Author

kena0ki commented Jan 13, 2021

@jerch Thank you for the review.
I didin't know about pre-combining. I need more study😀.

I suggest we simply filter it for now not to end up in the buffer at all.

I believe the patch would not harm current behaviour, but I'd like to follow your suggestion (filtering out only a BOM char).

Technically a BOM should never occur in a terminal data stream (as it falls under the forbid rules of https://tools.ietf.org/html/rfc3629).

I think I don't understand this line. Which forbid rule is applied? The first one? Since xterm.js only supports UTF-8.

@jerch
Copy link
Member

jerch commented Jan 13, 2021

I think I don't understand this line. Which forbid rule is applied? The first one? Since xterm.js only supports UTF-8.

Kinda both:

o A protocol SHOULD forbid use of U+FEFF as a signature for those
textual protocol elements that the protocol mandates to be always
UTF-8, the signature function being totally useless in those
cases.
o A protocol SHOULD also forbid use of U+FEFF as a signature for
those textual protocol elements for which the protocol provides
character encoding identification mechanisms, when it is expected
that implementations of the protocol will be in a position to
always use the mechanisms properly. This will be the case when
the protocol elements are maintained tightly under the control of
the implementation from the time of their creation to the time of
their (properly labeled) transmission.

First one because we only offer UTF8 input, yes (or Unicode JS-strings). But second one is also true for older terminals in general, as they used to handle different encodings by the much older ECMA-35 standard with proper "signature" (charset commands, a BOM did not even exist that time). We partially support these older charsets, but only if transcoded into the unicode/UTF8 realms beforehand. The BOM is superfluous in any case for us.

If we would allow different UTF versions / endianess as direct input to the terminal a BOM would be helpful. But we are tagged to UTF-8 or JS strings (already LE/BE aligned UTF-16/UCS-2), so BOM handling is an outer issue for us and should have been stripped by outer means. Since thats not the case, maybe we should filter it ourselves during UTF32 decoding in common/input/TextDecoder.ts.

@kena0ki
Copy link
Contributor Author

kena0ki commented Jan 14, 2021

Thank you for the explanation.

Since thats not the case, maybe we should filter it ourselves during UTF32 decoding in common/input/TextDecoder.ts.

I'll look into it.

@kena0ki
Copy link
Contributor Author

kena0ki commented Jan 23, 2021

I've updated the PR. Please take a look when you have time.

@kena0ki kena0ki marked this pull request as ready for review January 23, 2021 04:49
@kena0ki kena0ki requested a review from jerch January 23, 2021 04:49
@kena0ki kena0ki changed the title Skip a leading zero width character Filter U+FEFF (BOM) when decoding input data Jan 23, 2021
Copy link
Member

@jerch jerch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good 👍

Almost there, just one more thing - the UTF8 decoder works with stream chunks, thus there is a chance, that the BOM (3 bytes in UTF8) got split across chunks. I think we should also add the BOM skip rule here

if (cp < 0x0800 || (cp >= 0xD800 && cp <= 0xDFFF)) {

You can also test this by duplicating the 2/3/4 byte sequences - advance by 3 test and putting some BOMs in the string - they all should get filtered.

@kena0ki
Copy link
Contributor Author

kena0ki commented Jan 23, 2021

Thank you for the review.
It is my careless mistake. That is probably because I was not sure which is better that we filter only a first FEFF or all FEFFs at any position. (The rendered result would be the same.)
I had chosen the latter, but I found the browser built-in TextDecoder eliminates only the first FEFF.
Don't we need to imitate that behavior?
By adding ignoreBOM option and setting the option to false (i.e. filter FEFF) for only the first chunk of the input data, we can filter only the first FEFF in input data.

@jerch
Copy link
Member

jerch commented Jan 23, 2021

I had chosen the latter, but I found the browser built-in TextDecoder eliminates only the first FEFF.
Don't we need to imitate that behavior?

Hmm, thats good question. Well the browser is in a more comfortable position here, it can safely assume to deal with documents only (with a proper start and end). In the terminal we only have a byte stream, and dont know about document bounderies normally. Thus I'd say we should removed it at any position. But thats just my guess here.

@kena0ki
Copy link
Contributor Author

kena0ki commented Jan 23, 2021

In the terminal we only have a byte stream, and dont know about document bounderies normally.

That's true. So I'll choose filtering all FEFFs rather than unnecessarily adding complicated logic.

@jerch
Copy link
Member

jerch commented Jan 23, 2021

Thinking abit more about this - I think we are safe to always remove a BOM at any position. In theory there are issues with plain binary data transmission, but thats already the case for UTF8-marked terminal data in the first place. Binary data embedded in a terminal data stream is only safe, if mapped into save ASCII regions (e.g. base64), as naked 8bit codes are a nightmare with UTF8 and C0/C1 regions have to be skipped anyway. So yes - skipping the BOM anytime should be safe as it should not occur by random binary data (read as: random binary data should not occur at terminal data "transport level").

That's true. So I'll choose filtering all FEFFs rather than unnecessarily adding complicated logic.

Sounds better than doing complicated stuff, yeah (esp. since the decoders are somewhat hot functions and should be fast as hell).

@kena0ki
Copy link
Contributor Author

kena0ki commented Jan 23, 2021

I've added a BOM skip rule.

You can also test this by duplicating the 2/3/4 byte sequences - advance by 3 test and putting some BOMs in the string - they all should get filtered.

The string in a test I added only contains BOMs. If you don't like I'll change it.

Copy link
Member

@jerch jerch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 , thx for dealing with this nasty low level stuff. Imho testing it this way is enough.

@jerch jerch added this to the 4.10.0 milestone Jan 23, 2021
@jerch jerch merged commit ff74caf into xtermjs:master Jan 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Leading U+FEFF is interpreted as single-width space
2 participants