-
Notifications
You must be signed in to change notification settings - Fork 8.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Compensate for non-minimal UTF-8 encodings #3380
Conversation
…l forms of characters that get past our initial invalid-sequence screening.
…h as a replacement character (for now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine to me, esp. based on the analysis in the PR body. How does this treat the utf-test file in #3147?
I believe this resolves #3147 too. But I'm not seeing the pre-change behavior the same way that @egmontkob was in that filing. I tried catting/typing it out in various forms in conhost and WT (pwsh, cmd, ubuntu, etc.). Before the change, sometimes it gets stuck or incomplete, but not always. And after the change, it seems to be 100% reliable. But I'd prefer if @egmontkob double checked it with the next version to see if the fix satisfies his expectations before closing it. |
🎉 Handy links: |
Summary of the Pull Request
PR Checklist
utf8ToWideCharParser.cpp
#3378Detailed Description of the Pull Request / Additional comments
We do a lot of work to figure out whether or not we have some invalid UTF-8 inside our own internal parser. We're correctly identifying in the first full-conversion pass that something is amiss with the stream of text we were given. And inside the second pass for involved parsing, we are identifying and removing obviously wrong sequences like those that have a lead byte without the correct number of continuations, continuations that come from nowhere, and so on. But there's one big gap:
We're not correctly identifying non-minimal forms of characters. Specifically, what is causing the crash, is non-minimal representations of the null character U+0000. The minimal form of this character in UTF-8 is 0x00. But technically, it could also be written as any of the following:
All of which didn't fill any of the payload bits.
The OS does identify these as invalid non-minimal forms when the data is sent to
MultiByteToWideChar
after we believe we've removed all invalid data and then it errors out because we set theMB_ERR_INVALID_CHARS
flag.If we remove the flag, the error goes away and the OS will substitute one or more U+FFFDs for these sequences and continue past them.
This is inconsistent with the rest of our invalid behavior (where we just eat the invalid bytes and walk on instead of substituting them) but the OS doesn't offer that provision as an option.
We also can't straight up just call the OS in all cases because we want to be available for the case where a caller sends us part of a valid sequence at the end of the buffer and then continues with the next valid pieces in the next call. That is, think of the
putc
case where someone drops 0xe3, 0x81, and 0x99 on three calls in a row. We want to form those together into the correct U+3059 once the third one comes in. Just usingMultiByteToWideChar
straight up will convert each of them into their own U+FFFD on the three calls. Not OK. So we must have some knowledge of UTF-8 to allow this valid scenario to happen.The solution here is to let the somewhat inconsistent behavior of "replacements for non-minimal sequences but suppress clearly invalid things" to happen. We're doing this because the fix is needed in the Windows product for 20H1, which is today subject to ever-tightening requirements to prepare to ship. The smallest and least risky fix possible is preferable right now.
Reconcile UTF-8 behavior in
utf8ToWideCharParser.cpp
#3378 is filed as a follow on to investigate reconciling the somewhat inconsistent behavior as well as other things noted during this investigation in the future to be consumed into Terminal and whatever Windows release comes after 20H1.Validation Steps Performed
Ran the repro steps in #3320 on 20h1 previews with WSL2. No longer crashes after (because it no longer returns the error).
Added automated test in utf8ToWideCharParserTests.cpp to ensure that non-minimal forms don't cause
.Parse
to throw an error and turn into some number of replacements instead