Compensate for non-minimal UTF-8 encodings #3380

miniksa · 2019-10-30T20:49:11Z

Summary of the Pull Request

Permits substitution of the Unicode replacement character (U+FFFD) for non-minimal code point encodings in UTF-8.

PR Checklist

Closes Writing random output to console output handle fails with no last error set #3320
I work here
Tests added/passed
Comments added, follow on issue filed Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #3378
I'm a core contributor

Detailed Description of the Pull Request / Additional comments

We do a lot of work in our own internal parser to figure out whether we have been given invalid UTF-8. In the first full-conversion pass, we correctly identify that something is amiss with the stream of text we were given. In the second pass, for more involved parsing, we identify and remove obviously wrong sequences, such as a lead byte without the correct number of continuations, continuations that come from nowhere, and so on. But there's one big gap:
We're not correctly identifying non-minimal forms of characters. Specifically, the crash is caused by non-minimal representations of the null character U+0000. The minimal form of this character in UTF-8 is 0x00. But technically, it could also be written as any of the following:

0xC0 0x80 - Lead byte of a 2-byte sequence and a single continuation.
0xE0 0x80 0x80 - Lead byte of a 3-byte sequence and two continuations.
0xF0 0x80 0x80 0x80 - Lead byte of a 4-byte sequence and three continuations.
None of these sets any of the payload bits.
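
To make the non-minimal forms listed above concrete, here is a minimal sketch of how such a sequence could be recognized. `IsNonMinimalUtf8` is a hypothetical helper written for illustration and is not taken from utf8ToWideCharParser.cpp. It assumes the caller passes exactly one candidate sequence of 2 to 4 bytes, and it does not check for surrogates or values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (illustration only, not from the parser in this PR).
// Returns true when `bytes` is one structurally well-formed UTF-8 sequence of
// `len` bytes whose payload would have fit in a shorter encoding.
bool IsNonMinimalUtf8(const std::uint8_t* bytes, std::size_t len)
{
    assert(bytes != nullptr);
    if (len < 2 || len > 4)
    {
        return false; // single bytes are always minimal; longer is not UTF-8
    }

    // The lead byte must announce exactly `len` bytes.
    const std::uint8_t lead = bytes[0];
    const bool leadOk = (len == 2 && (lead & 0xE0) == 0xC0) ||
                        (len == 3 && (lead & 0xF0) == 0xE0) ||
                        (len == 4 && (lead & 0xF8) == 0xF0);
    if (!leadOk)
    {
        return false;
    }

    // Collect the payload bits from the lead byte and each continuation.
    std::uint32_t codePoint = lead & (0xFF >> (len + 1));
    for (std::size_t i = 1; i < len; ++i)
    {
        if ((bytes[i] & 0xC0) != 0x80)
        {
            return false; // not a continuation byte, so not well-formed
        }
        codePoint = (codePoint << 6) | (bytes[i] & 0x3F);
    }

    // Smallest code point that genuinely needs a sequence of each length.
    static constexpr std::uint32_t kMinForLength[] = { 0, 0, 0x80, 0x800, 0x10000 };
    return codePoint < kMinForLength[len];
}
```

With this sketch, all three encodings of U+0000 listed above are reported as non-minimal, while a legitimate sequence such as 0xE3 0x81 0x99 is not.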

The OS does identify these as invalid non-minimal forms when the data is sent to MultiByteToWideChar after we believe we've removed all invalid data, and it then errors out because we set the MB_ERR_INVALID_CHARS flag.
If we remove the flag, the error goes away and the OS will substitute one or more U+FFFDs for these sequences and continue past them.
This is inconsistent with how we handle the rest of the invalid input (where we just eat the invalid bytes and move on instead of substituting them), but the OS doesn't offer that as an option.
We also can't just call the OS directly in all cases, because we need to handle a caller that sends us part of a valid sequence at the end of one buffer and then continues with the remaining pieces in the next call. Think of the putc case where someone drops 0xE3, 0x81, and 0x99 on three calls in a row. We want to combine those into the correct U+3059 once the third one comes in. Just passing each call to MultiByteToWideChar would convert each byte into its own U+FFFD. Not OK. So we must have some knowledge of UTF-8 ourselves to allow this valid scenario to work.
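
The putc scenario can be sketched with a small stateful decoder that remembers an unfinished sequence between calls. `PartialUtf8Buffer` and `Feed` are hypothetical names for illustration only. The real parser works differently in detail, and this sketch deliberately skips non-minimal, surrogate, and range checks.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical class (illustration only): keeps the state of an unfinished
// UTF-8 sequence so that its remaining bytes can arrive in later calls.
class PartialUtf8Buffer
{
public:
    // Consumes the bytes of one call and returns every code point completed
    // by them. An unfinished trailing sequence stays buffered.
    std::vector<char32_t> Feed(const std::uint8_t* data, std::size_t len)
    {
        assert(data != nullptr || len == 0);
        std::vector<char32_t> out;
        std::size_t i = 0;
        while (i < len)
        {
            const std::uint8_t b = data[i];
            if (_needed > 0 && (b & 0xC0) != 0x80)
            {
                // The pending sequence was interrupted: drop it and look at
                // this byte again as the start of something new.
                _needed = 0;
                continue;
            }

            if (_needed > 0)
            {
                _codePoint = (_codePoint << 6) | (b & 0x3F);
                if (--_needed == 0)
                {
                    out.push_back(_codePoint);
                }
            }
            else if (b < 0x80)
            {
                out.push_back(b);
            }
            else if ((b & 0xE0) == 0xC0) { _codePoint = b & 0x1F; _needed = 1; }
            else if ((b & 0xF0) == 0xE0) { _codePoint = b & 0x0F; _needed = 2; }
            else if ((b & 0xF8) == 0xF0) { _codePoint = b & 0x07; _needed = 3; }
            // Anything else (stray continuation, invalid lead) is dropped.
            ++i;
        }
        return out;
    }

private:
    char32_t _codePoint = 0;  // payload bits collected so far
    unsigned _needed = 0;     // continuation bytes still expected
};
```

Feeding 0xE3, 0x81, and 0x99 in three separate `Feed` calls returns nothing for the first two and U+3059 for the third, which is the behavior a per-call conversion cannot provide.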
The solution here is to accept the somewhat inconsistent behavior of "substitute replacements for non-minimal sequences but suppress clearly invalid things". We're doing this because the fix is needed in the Windows product for 20H1, which is currently subject to ever-tightening requirements as it prepares to ship. The smallest and least risky fix possible is preferable right now.
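
As a rough picture of that mixed policy, the sketch below folds both stages into one portable function. In the actual change the dropping happens in our parser and the substitution happens inside MultiByteToWideChar, so `DecodeWithMixedPolicy` is a hypothetical stand-in. It also assumes exactly one U+FFFD per non-minimal sequence, whereas the OS may emit more than one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr char32_t kReplacement = 0xFFFD;

// Hypothetical function (illustration only). Clearly invalid bytes are
// dropped; well-formed but non-minimal sequences become U+FFFD.
// Assumption: one U+FFFD per non-minimal sequence.
std::vector<char32_t> DecodeWithMixedPolicy(const std::vector<std::uint8_t>& in)
{
    // Smallest code point that genuinely needs a sequence of each length.
    static constexpr std::uint32_t kMinForLength[] = { 0, 0, 0x80, 0x800, 0x10000 };
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const std::uint8_t lead = in[i];
        std::size_t len = 0;
        if (lead < 0x80) len = 1;
        else if ((lead & 0xE0) == 0xC0) len = 2;
        else if ((lead & 0xF0) == 0xE0) len = 3;
        else if ((lead & 0xF8) == 0xF0) len = 4;

        if (len == 0)
        {
            ++i; // stray continuation or invalid lead byte: drop it
            continue;
        }
        assert(len >= 1 && len <= 4);

        // Count how many bytes of the announced sequence are really there.
        std::size_t have = 1;
        while (have < len && i + have < in.size() && (in[i + have] & 0xC0) == 0x80)
        {
            ++have;
        }
        if (have < len)
        {
            i += have; // truncated sequence: drop the lead and what followed
            continue;
        }

        std::uint32_t codePoint = (len == 1) ? lead : (lead & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k)
        {
            codePoint = (codePoint << 6) | (in[i + k] & 0x3F);
        }

        // Non-minimal forms are substituted instead of dropped.
        out.push_back(codePoint < kMinForLength[len] ? kReplacement
                                                     : static_cast<char32_t>(codePoint));
        i += len;
    }
    return out;
}
```

For the input `A`, 0xC0 0x80, a stray 0x80, 0xE3 0x81 0x99, and a truncated 0xE3 0x81, this sketch yields U+0041, U+FFFD, and U+3059: the non-minimal null is replaced while the stray and truncated bytes vanish.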
Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp #3378 is filed as a follow-on to investigate reconciling the somewhat inconsistent behavior, along with other things noted during this investigation, for consumption into Terminal and whatever Windows release comes after 20H1.

Validation Steps Performed

Ran the repro steps in #3320 on 20H1 previews with WSL2. It no longer crashes after the change (because the conversion no longer returns the error).

Added an automated test in utf8ToWideCharParserTests.cpp to ensure that non-minimal forms don't cause .Parse to throw an error and instead turn into some number of replacements.

src/host/ut_host/Utf8ToWideCharParserTests.cpp

zadjii-msft

This seems fine to me, esp. based on the analysis in the PR body. How does this treat the utf-test file in #3147?

src/host/utf8ToWideCharParser.cpp

miniksa · 2019-10-31T17:05:07Z

> This seems fine to me, esp. based on the analysis in the PR body. How does this treat the utf-test file in #3147?

I believe this resolves #3147 too. But I'm not seeing the pre-change behavior the same way that @egmontkob was in that filing.

I tried catting/typing it out in various forms in conhost and WT (pwsh, cmd, ubuntu, etc.). Before the change, the output sometimes gets stuck or comes out incomplete, but not always. After the change, it seems to be 100% reliable. But I'd prefer that @egmontkob double-check it with the next version to see if the fix satisfies his expectations before closing it.

ghost · 2019-11-26T17:27:17Z

🎉 Windows Terminal Preview v0.7.3291.0 has been released which incorporates this pull request. 🎉

miniksa added 2 commits October 30, 2019 13:36

Allow Mb2Wc to substitute U+FFFD (unicode replacement) for non-minimal forms of characters that get past our initial invalid-sequence screening.

8ece89c

Add test to ensure that non-minimal forms don't choke and come through as a replacement character (for now).

d260307

miniksa marked this pull request as ready for review October 30, 2019 23:35

DHowett-MSFT reviewed Oct 30, 2019

src/host/ut_host/Utf8ToWideCharParserTests.cpp (outdated, resolved)

DHowett-MSFT reviewed Oct 30, 2019

src/host/ut_host/Utf8ToWideCharParserTests.cpp (outdated, resolved)

DHowett-MSFT approved these changes Oct 30, 2019

zadjii-msft approved these changes Oct 31, 2019

src/host/utf8ToWideCharParser.cpp (resolved)

PR feedback, use string comparison in tests, add some more commentary.

2859624

put the test back from how I intentionally broke it to check string comparison. oops

5ae708e

miniksa mentioned this pull request Oct 31, 2019

Missing output parts when cat'ing UTF-8 stress test #3147

Closed

miniksa merged commit 126d489 into master Oct 31, 2019

miniksa deleted the dev/miniksa/3320 branch October 31, 2019 17:50

miniksa mentioned this pull request Nov 11, 2019

Crash to desktop when running an invalid Export Certificate Powershell command #2861

Closed

ghost mentioned this pull request Nov 26, 2019

Writing random output to console output handle fails with no last error set #3320

Closed

j4james mentioned this pull request Dec 30, 2019

Certain invalid UTF-8 sequences can cause the output to fail #4086

Closed
