Certain invalid UTF-8 sequences can cause the output to fail #4086

j4james · 2019-12-30T23:45:09Z

Environment

Windows build number: Version 10.0.18362.418
Windows Terminal version (if applicable): 0.7.3451.0

Steps to reproduce

Open a bash shell.
Execute the command printf "\xA3"

Expected behavior

I'd expect to see something like the U+FFFD error character, or worst case nothing at all.

Actual behavior

The output is aborted with an error:

-bash: printf: write error: Input/output error

This looks similar to issue #3320, but PR #3380 doesn't fix it, and a recent build from master still produces the problem.

I did a bit of experimenting, and if you replace the MB_ERR_INVALID_CHARS with 0 in the _ParseFullRange method, that seems to help, although I don't know whether that's the right solution.

I know there was a follow up task created which may end up fixing this (#3378), but I thought it best to have the issue filed as an actual bug. If you don't think that's necessary, though, you can always close this as a dup.

The text was updated successfully, but these errors were encountered:

german-one · 2020-01-01T00:52:56Z

#4093 could be a first step. I'm quite certain the functions and classes could be used to update Utf8ToWideCharParser in a proper way.

j4james · 2020-01-02T18:54:42Z

@german-one I'm a bit confused. Is #4093 still a work in progress, or is it not intended to replace the conhost utf8 parser yet? I was hoping I could just merge your PR and it would fix the problems I was seeing, but that doesn't seem to be the case.

german-one · 2020-01-02T20:19:39Z

It's intended to be the first step. My implementation is able to supersede Utf8ToWideCharParser entirely. And indeed that's the next item on my list. But first I want to get the hand-rolled conversions reviewed by the core contributors. Unifying the UTF-8 parsers is too much for only one PR I guess.

Replace `utf8Parser` with `til::u8u16` in order to have the same conversion algorithms used in terminal and conhost. This PR addresses item 2 in this list: 1. ✉ Implement `til::u8u16` and `til::u16u8` (done in PR #4093) 2. ✔ **Unify UTF-8 handling using `til::u8u16` (this PR)** 2.1. ✔ **Update VtInputThread::_HandleRunInput()** 2.2. ✔ **Update ApiRoutines::WriteConsoleAImpl()** 2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use 3. ❌ Enable BOM discarding (follow up) 3.1. ❌ extend `til::u8u16` and `til::u16u8` with a 3rd parameter to enable discarding the BOM 3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for `til::u8u16` and `til::u16u8` 4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up) Closes #4086 Closes #3378

… single lead byte only (GH#4673) (#4685) ## Summary of the Pull Request Fixes a flaw that happened if `til::u8u16` received a single lead byte. ## PR Checklist * [x] Closes #4673 * [x] Tests added/passed ## Detailed Description of the Pull Request / Additional comments The loop for caching partials didn't run and thus, the lead byte was converted to U+FFFD. That's because the loop starts with `sequenceLen` initialized with 1. And if the string has a length of 1 the initial condition is `1<1` which is evaluated to `false` and the body of the loop was never executed. ## Validation Steps Performed 1) updated the code of the state class and tested manually that `printf "\xE2"; printf "\x98\xBA\n"` prints a U+263A character 2) updated the unit tests to make sure that still up to 3 partials are cached 3) updated the unit tests to make sure caching also works if the string consists of a lead byte only 4) tested manually that #4086 is still resolved

ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Dec 30, 2019

ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Jan 6, 2020

DHowett-MSFT removed the Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting label Jan 6, 2020

DHowett-MSFT added this to the 21H1 milestone Jan 6, 2020

DHowett-MSFT added the Priority-2 A description (P2) label Jan 6, 2020

german-one mentioned this issue Jan 31, 2020

Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl #4422

Merged

5 tasks

ghost added the In-PR This issue has a related PR label Jan 31, 2020

DHowett-MSFT closed this as completed in #4422 Feb 4, 2020

ghost added the Needs-Tag-Fix Doesn't match tag requirements label Feb 4, 2020

ghost added Resolution-Fix-Committed Fix is checked in, but it might be 3-4 weeks until a release. and removed In-PR This issue has a related PR labels Feb 4, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Certain invalid UTF-8 sequences can cause the output to fail #4086

Certain invalid UTF-8 sequences can cause the output to fail #4086

j4james commented Dec 30, 2019

german-one commented Jan 1, 2020

j4james commented Jan 2, 2020

german-one commented Jan 2, 2020 •

edited

Loading

Certain invalid UTF-8 sequences can cause the output to fail #4086

Certain invalid UTF-8 sequences can cause the output to fail #4086

Comments

j4james commented Dec 30, 2019

Environment

Steps to reproduce

Expected behavior

Actual behavior

german-one commented Jan 1, 2020

j4james commented Jan 2, 2020

german-one commented Jan 2, 2020 • edited Loading

german-one commented Jan 2, 2020 •

edited

Loading