-
Notifications
You must be signed in to change notification settings - Fork 8.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Certain invalid UTF-8 sequences can cause the output to fail #4086
Comments
#4093 could be a first step. I'm quite certain the functions and classes could be used to update |
@german-one I'm a bit confused. Is #4093 still a work in progress, or is it not intended to replace the conhost utf8 parser yet? I was hoping I could just merge your PR and it would fix the problems I was seeing, but that doesn't seem to be the case. |
It's intended to be the first step. My implementation is able to supersede |
Replace `utf8Parser` with `til::u8u16` in order to have the same conversion algorithms used in terminal and conhost. This PR addresses item 2 in this list: 1. ✉ Implement `til::u8u16` and `til::u16u8` (done in PR #4093) 2. ✔ **Unify UTF-8 handling using `til::u8u16` (this PR)** 2.1. ✔ **Update VtInputThread::_HandleRunInput()** 2.2. ✔ **Update ApiRoutines::WriteConsoleAImpl()** 2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use 3. ❌ Enable BOM discarding (follow up) 3.1. ❌ extend `til::u8u16` and `til::u16u8` with a 3rd parameter to enable discarding the BOM 3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for `til::u8u16` and `til::u16u8` 4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up) Closes #4086 Closes #3378
… single lead byte only (GH#4673) (#4685) ## Summary of the Pull Request Fixes a flaw that happened if `til::u8u16` received a single lead byte. ## PR Checklist * [x] Closes #4673 * [x] Tests added/passed ## Detailed Description of the Pull Request / Additional comments The loop for caching partials didn't run and thus, the lead byte was converted to U+FFFD. That's because the loop starts with `sequenceLen` initialized with 1. And if the string has a length of 1 the initial condition is `1<1` which is evaluated to `false` and the body of the loop was never executed. ## Validation Steps Performed 1) updated the code of the state class and tested manually that `printf "\xE2"; printf "\x98\xBA\n"` prints a U+263A character 2) updated the unit tests to make sure that still up to 3 partials are cached 3) updated the unit tests to make sure caching also works if the string consists of a lead byte only 4) tested manually that #4086 is still resolved
Environment
Windows build number: Version 10.0.18362.418
Windows Terminal version (if applicable): 0.7.3451.0
Steps to reproduce
printf "\xA3"
Expected behavior
I'd expect to see something like the U+FFFD error character, or worst case nothing at all.
Actual behavior
The output is aborted with an error:
This looks similar to issue #3320, but PR #3380 doesn't fix it, and a recent build from master still produces the problem.
I did a bit of experimenting, and if you replace the
MB_ERR_INVALID_CHARS
with 0 in the_ParseFullRange
method, that seems to help, although I don't know whether that's the right solution.I know there was a follow up task created which may end up fixing this (#3378), but I thought it best to have the issue filed as an actual bug. If you don't think that's necessary, though, you can always close this as a dup.
The text was updated successfully, but these errors were encountered: