Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compensate for non-minimal UTF-8 encodings #3380

Merged
merged 4 commits into from
Oct 31, 2019
Merged

Compensate for non-minimal UTF-8 encodings #3380

merged 4 commits into from
Oct 31, 2019

Conversation

miniksa
Copy link
Member

@miniksa miniksa commented Oct 30, 2019

Summary of the Pull Request

  • Permits substitution of Unicode Replacement for non-minimal codepoint encodings in UTF-8.

PR Checklist

Detailed Description of the Pull Request / Additional comments

  • We do a lot of work to figure out whether or not we have some invalid UTF-8 inside our own internal parser. We're correctly identifying in the first full-conversion pass that something is amiss with the stream of text we were given. And inside the second pass for involved parsing, we are identifying and removing obviously wrong sequences like those that have a lead byte without the correct number of continuations, continuations that come from nowhere, and so on. But there's one big gap:

  • We're not correctly identifying non-minimal forms of characters. Specifically, what is causing the crash, is non-minimal representations of the null character U+0000. The minimal form of this character in UTF-8 is 0x00. But technically, it could also be written as any of the following:

  1. 0xC0 0x80 - Lead byte of 2 byte sequence and a single continuation.
  2. 0xE0 0x80 0x80 - Lead byte of 3 byte sequences and two continuations.
  3. 0xF0 0x80 0x80 0x80 - Lead byte of 4 byte sequences and three continuations.
    All of which didn't fill any of the payload bits.
  • The OS does identify these as invalid non-minimal forms when the data is sent to MultiByteToWideChar after we believe we've removed all invalid data and then it errors out because we set the MB_ERR_INVALID_CHARS flag.

  • If we remove the flag, the error goes away and the OS will substitute one or more U+FFFDs for these sequences and continue past them.

  • This is inconsistent with the rest of our invalid behavior (where we just eat the invalid bytes and walk on instead of substituting them) but the OS doesn't offer that provision as an option.

  • We also can't straight up just call the OS in all cases because we want to be available for the case where a caller sends us part of a valid sequence at the end of the buffer and then continues with the next valid pieces in the next call. That is, think of the putc case where someone drops 0xe3, 0x81, and 0x99 on three calls in a row. We want to form those together into the correct U+3059 once the third one comes in. Just using MultiByteToWideChar straight up will convert each of them into their own U+FFFD on the three calls. Not OK. So we must have some knowledge of UTF-8 to allow this valid scenario to happen.

  • The solution here is to let the somewhat inconsistent behavior of "replacements for non-minimal sequences but suppress clearly invalid things" to happen. We're doing this because the fix is needed in the Windows product for 20H1, which is today subject to ever-tightening requirements to prepare to ship. The smallest and least risky fix possible is preferable right now.

  • Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #3378 is filed as a follow on to investigate reconciling the somewhat inconsistent behavior as well as other things noted during this investigation in the future to be consumed into Terminal and whatever Windows release comes after 20H1.

Validation Steps Performed

Ran the repro steps in #3320 on 20h1 previews with WSL2. No longer crashes after (because it no longer returns the error).

Added automated test in utf8ToWideCharParserTests.cpp to ensure that non-minimal forms don't cause .Parse to throw an error and turn into some number of replacements instead

…l forms of characters that get past our initial invalid-sequence screening.
@miniksa miniksa marked this pull request as ready for review October 30, 2019 23:35
Copy link
Member

@zadjii-msft zadjii-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine to me, esp. based on the analysis in the PR body. How does this treat the utf-test file in #3147?

src/host/utf8ToWideCharParser.cpp Show resolved Hide resolved
@miniksa
Copy link
Member Author

miniksa commented Oct 31, 2019

This seems fine to me, esp. based on the analysis in the PR body. How does this treat the utf-test file in #3147?

I believe this resolves #3147 too. But I'm not seeing the pre-change behavior the same way that @egmontkob was in that filing.

I tried catting/typing it out in various forms in conhost and WT (pwsh, cmd, ubuntu, etc.). Before the change, sometimes it gets stuck or incomplete, but not always. And after the change, it seems to be 100% reliable. But I'd prefer if @egmontkob double checked it with the next version to see if the fix satisfies his expectations before closing it.

@ghost
Copy link

ghost commented Nov 26, 2019

🎉Windows Terminal Preview v0.7.3291.0 has been released which incorporates this pull request.:tada:

Handy links:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Writing random output to console output handle fails with no last error set
3 participants