Filter U+FEFF (BOM) when decoding input data #3200
Conversation
@kena0ki Thx for the patch. I see that your patch simply skips zero-width codepoints at the line start. I am not sure we can do that for any zero-width codepoint; we would need some checks against other emulators and how they go about zero-width codepoints. This is further complicated by the fact that we currently only support post-combining characters, while the Unicode grapheme rules also allow pre-combining under certain circumstances. This post/pre handling decides which buffer cells combining chars end up in, with big differences in the final render output.
Note that the original issue was only about the BOM char. I suggest we simply filter it for now so it does not end up in the buffer at all. Technically a BOM should never occur in a terminal data stream (it falls under the forbid rules of https://tools.ietf.org/html/rfc3629).
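The filtering suggested here can be sketched roughly as follows (illustrative TypeScript; the function name is hypothetical and this is not xterm.js's actual code):

```typescript
// Hypothetical sketch: drop U+FEFF from input before it reaches the buffer.
// Per RFC 3629 a BOM should not occur in a UTF-8 terminal data stream,
// so filtering it unconditionally is safe.
function filterBOM(data: string): string {
  let out = '';
  for (const ch of data) {
    if (ch !== '\uFEFF') {
      out += ch;
    }
  }
  return out;
}
```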
@jerch Thank you for the review.
I believe the patch would not harm current behaviour, but I'd like to follow your suggestion (filtering out only a BOM char).
I don't think I understand this line. Which forbid rule applies? The first one, since xterm.js only supports UTF-8?
Kinda both:
The first one, because we only offer UTF-8 input, yes (or Unicode JS strings). But the second one is also true for older terminals in general, as they used to handle different encodings via the much older ECMA-35 standard with a proper "signature" (charset commands; a BOM did not even exist back then). We partially support these older charsets, but only if they are transcoded into the Unicode/UTF-8 realm beforehand. The BOM is superfluous in any case for us. If we allowed different UTF versions / endianness as direct input to the terminal, a BOM would be helpful. But we are tied to UTF-8 or JS strings (already LE/BE-aligned UTF-16/UCS-2), so BOM handling is an outer issue for us and should have been stripped by outer means. Since that's not the case, maybe we should filter it ourselves during UTF-32 decoding.
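Filtering during UTF-32 decoding, as proposed here, could in a very simplified form look like this (a sketch with an invented function name; not necessarily matching xterm.js's actual decoder):

```typescript
// Simplified sketch: expand a JS string into UTF-32 codepoints, skipping U+FEFF.
// for...of iterates a string by codepoint, so surrogate pairs are handled correctly.
function stringToUtf32SkippingBOM(input: string): number[] {
  const result: number[] = [];
  for (const ch of input) {
    const cp = ch.codePointAt(0)!;
    if (cp !== 0xFEFF) {
      result.push(cp);
    }
  }
  return result;
}
```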
Thank you for the explanation.
I'll look into it.
I've updated the PR. Please take a look when you have time.
The changes look good 👍
Almost there, just one more thing: the UTF8 decoder works with stream chunks, thus there is a chance that the BOM (3 bytes in UTF8) got split across chunks. I think we should also add the BOM skip rule here:
xterm.js/src/common/input/TextDecoder.ts, line 195 in 93aab95:
    if (cp < 0x0800 || (cp >= 0xD800 && cp <= 0xDFFF)) {
You can also test this by duplicating the "2/3/4 byte sequences - advance by 3" test and putting some BOMs in the string - they all should get filtered.
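The chunk-splitting issue can be illustrated with a simplified stateful decoder (illustrative only, not xterm.js's actual Utf8ToUtf32 code; handling of invalid sequences is omitted for brevity):

```typescript
// Simplified stateful UTF-8 decoder sketch: drops U+FEFF even when its three
// bytes (EF BB BF) are split across chunks, by carrying incomplete sequences
// over to the next decode() call.
class StreamingUtf8Decoder {
  private pending: number[] = [];

  decode(chunk: Uint8Array): string {
    const bytes = this.pending.concat(Array.from(chunk));
    this.pending = [];
    let out = '';
    let i = 0;
    while (i < bytes.length) {
      const b = bytes[i];
      let len = 1;
      if (b >= 0xF0) len = 4;
      else if (b >= 0xE0) len = 3;
      else if (b >= 0xC0) len = 2;
      if (i + len > bytes.length) {
        // incomplete sequence at chunk end: keep bytes for the next call
        this.pending = bytes.slice(i);
        break;
      }
      let cp = b;
      if (len === 2) cp = ((b & 0x1F) << 6) | (bytes[i + 1] & 0x3F);
      else if (len === 3) cp = ((b & 0x0F) << 12) | ((bytes[i + 1] & 0x3F) << 6) | (bytes[i + 2] & 0x3F);
      else if (len === 4) cp = ((b & 0x07) << 18) | ((bytes[i + 1] & 0x3F) << 12) | ((bytes[i + 2] & 0x3F) << 6) | (bytes[i + 3] & 0x3F);
      if (cp !== 0xFEFF) out += String.fromCodePoint(cp);  // skip the BOM
      i += len;
    }
    return out;
  }
}
```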
Thank you for the review.
Hmm, that's a good question. Well, the browser is in a more comfortable position here: it can safely assume it deals with documents only (with a proper start and end). In the terminal we only have a byte stream and normally don't know about document boundaries. Thus I'd say we should remove it at any position. But that's just my guess here.
That's true. So I'll choose filtering all FEFFs rather than unnecessarily adding complicated logic.
Thinking a bit more about this - I think we are safe to always remove a BOM at any position. In theory there are issues with plain binary data transmission, but that's already the case for UTF8-marked terminal data in the first place. Binary data embedded in a terminal data stream is only safe if mapped into safe ASCII regions (e.g. base64), as naked 8-bit codes are a nightmare with UTF8 and the C0/C1 regions have to be skipped anyway. So yes - skipping the BOM anytime should be safe, as it should not occur in random binary data (read as: random binary data should not occur at the terminal data "transport level").
Sounds better than doing complicated stuff, yeah (esp. since the decoders are somewhat hot functions and should be fast as hell).
I've added a BOM skip rule.
The string in the test I added only contains BOMs. If you don't like that, I'll change it.
LGTM 👍 , thx for dealing with this nasty low level stuff. Imho testing it this way is enough.
Fixes #3032
This PR adds a check to skip a leading zero-width character such as the BOM.
Edit: Changed the approach to filtering U+FEFF (BOM) in TextDecoder.ts.