Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl #4422

german-one · 2020-01-31T01:23:55Z

Summary of the Pull Request

Replace utf8Parser with til::u8u16 in order to have the same conversion algorithms used in terminal and conhost.

References

This PR is a follow up of #4093

PR Checklist

Closes #4086 (Certain invalid UTF-8 sequences can cause the output to fail); closes #3378 (Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp)
CLA signed. If not, go over here and sign the CLA
Tests added/passed
Requires documentation to be updated
I've discussed this with core contributors already. If not checked, I'm ready to accept this work might be rejected in favor of a different grand plan. Issue number where discussion took place: #xxx

Detailed Description of the Pull Request / Additional comments

This PR addresses item 2 in this list:

1. ✉ Implement til::u8u16 and til::u16u8 (done in PR #4093, "Implement til::u8u16 and til::u16u8 conversion functions")
2. ✔ Unify UTF-8 handling using til::u8u16 (this PR)
2.1. ✔ Update VtInputThread::_HandleRunInput()
2.2. ✔ Update ApiRoutines::WriteConsoleAImpl()
2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
3. ❌ Enable BOM discarding (follow up)
3.1. ❌ Extend til::u8u16 and til::u16u8 with a 3rd parameter to enable discarding the BOM
3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for til::u8u16 and til::u16u8
4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up)

@miniksa @DHowett-MSFT

Please check whether this PR, along with the investigations done in #4093 (Implement til::u8u16 and til::u16u8 conversion functions), really does close #3378 (Reconcile UTF-8 behavior in utf8ToWideCharParser.cpp)
Please advise whether I should remove Utf8ToWideCharParser now that it isn't used anymore

Validation Steps Performed

Wrote long UTF-8 files to the console
Tested printf as shown in #4086 (Certain invalid UTF-8 sequences can cause the output to fail)

DHowett-MSFT · 2020-01-31T01:45:21Z

src/host/_stream.cpp

@@ -1053,36 +1052,25 @@ constexpr unsigned int LOCAL_BUFFER_SIZE = 100;
        const auto codepage = gci.OutputCP;

        // Convert our input parameters to Unicode
-        std::unique_ptr<wchar_t[]> wideCharBuffer{ nullptr };
-        static Utf8ToWideCharParser parser{ gci.OutputCP };


IGNORE, SEE BELOW
This one unfortunately cannot change just yet. gci.OutputCP must be the user's console output codepage for now and the foreseeable future.

This poses an interesting conundrum for u8u16: should we have "ASCII" versions that take a codepage? au16 and u16a? I'm not sure 😄

maybe it's actually part of the u*state itself...

OH, I understand. You can almost entirely ignore this comment. I was confused because this was created outside the CP_UTF8 block.
I also understand that writing in another codepage means we need to kill the u8 state -- so we can't move it inside this block.

@DHowett-MSFT Your understanding is correct. The reset() method is called in the else branch where codepages other than UTF-8 are processed. I knew I would need the reset here, that's why I implemented it from the beginning.
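The interplay between cached partials and reset() can be illustrated with a minimal sketch. It is not the til::u8u16 implementation: the class name `Utf8ChunkState` and the method `completeChunk` are hypothetical, and the actual UTF-8 to UTF-16 conversion is left out. It only shows how an incomplete trailing sequence might be held back until the next write, and how a reset drops it when a write arrives in another codepage.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for a UTF-8 chunking state (not the real til::u8state).
class Utf8ChunkState
{
public:
    // Prepends the cached partial to `in`, returns everything up to the last
    // complete code point, and caches an incomplete trailing sequence.
    std::string completeChunk(std::string_view in)
    {
        std::string buffer;
        buffer.swap(_partial); // take over the bytes cached by the previous call
        buffer.append(in);

        const std::size_t cut = findIncompleteTail(buffer);
        _partial.assign(buffer, cut, std::string::npos);
        assert(_partial.size() < 4); // a partial is at most 3 bytes long
        buffer.resize(cut);
        return buffer;
    }

    // Drops the cached partial, e.g. when the next write uses another codepage.
    void reset() noexcept { _partial.clear(); }

private:
    // Returns the index where an incomplete trailing sequence starts,
    // or s.size() if the string does not end in one.
    static std::size_t findIncompleteTail(std::string_view s) noexcept
    {
        std::size_t i = s.size();
        std::size_t steps = 0;
        while (i > 0 && steps < 4)
        {
            --i;
            ++steps;
            const auto b = static_cast<unsigned char>(s[i]);
            if ((b & 0xC0) == 0x80)
            {
                continue; // continuation byte, keep walking back
            }
            std::size_t need = 1; // ASCII or invalid lead byte
            if ((b & 0xE0) == 0xC0) need = 2;
            else if ((b & 0xF0) == 0xE0) need = 3;
            else if ((b & 0xF8) == 0xF0) need = 4;
            return (steps < need) ? i : s.size();
        }
        return s.size(); // only stray continuation bytes, pass them through
    }

    std::string _partial;
};
```

In this sketch, invalid input is passed through unchanged so that a later conversion step can substitute replacement characters.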

This poses an interesting conundrum for u8u16: should we have "ASCII" versions that take a codepage? au16 and u16a? I'm not sure 😄

Interesting indeed. Well, we already have the functions ConvertToW and ConvertToA. But they are not simply applicable in WriteConsoleAImpl() because we may receive DBCS-encoded text where caching of partials is required, too. @miniksa briefly mentioned that in #386 (comment)
I only have a poor understanding of how DBCS has to be processed though. The manpage of IsDBCSLeadByte states that even if you validated a lead byte you may not rely on MultiByteToWideChar being able to process the substring correctly. So, unfortunately, I don't know enough to revise the DBCS handling. And I don't know if there is any other function in the code base where DBCS has to be converted. Hence it might not be worth the effort to move it into a separate function.
However, I offer to do my best to get rid of new and delete in WriteConsoleAImpl() and instead use the wstring that we already have for the UTF-8 conversion.
// EDIT done.

@german-one, if we theoretically made an au16 and a u16a to supersede ConvertToW and ConvertToA, there would probably be some sort of astate variable required. That variable could be stored on the assorted input handles to ensure that the problem briefly discussed in #386 is rectified.

Offhand, I believe there are probably several points in the code base that could be unified behind such a convergence function in a similar way to converging the u8/u16 problem. But I haven't enumerated them recently, so I may be incorrect.

I believe that the only circumstance where we'd really pay attention to IsDBCSLeadByte and hold onto it until the next call is if a string ends in a lead byte (then it would take the lead byte off the end and cache it for the next call in the astate or equivalent.) If it's in the middle of a run, we wouldn't care and just pass it into MultiByteToWideChar probably with the replacement character flag on so it would remove invalid DBCS representations. The next write would have the stored DBCS lead prefixed to whatever comes next, even if it's not valid together, and we'd let MB2WC sort it out and replace it.

The last provision to consider is that I believe the state would be discarded anytime the code page changed.
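The caching rule described above can be sketched roughly as follows. Everything here is an assumption made for illustration: `AnsiChunkState` is a hypothetical name for the proposed astate, and `isLeadByte` hard-codes Shift-JIS-style lead byte ranges as a portable stand-in for a real IsDBCSLeadByteEx call. The sketch scans forward from the start of the buffer, because a trail byte can fall into the lead byte range and only the pairing decides whether the final byte is an unmatched lead.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Assumption: Shift-JIS-style lead byte ranges, standing in for IsDBCSLeadByteEx.
constexpr bool isLeadByte(unsigned char b) noexcept
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Hypothetical "astate": holds back a lead byte that ends a write.
class AnsiChunkState
{
public:
    std::string completeChunk(unsigned int codepage, std::string_view in)
    {
        if (codepage != _codepage)
        {
            _cachedLead.clear(); // state is discarded when the code page changes
            _codepage = codepage;
        }

        std::string buffer;
        buffer.swap(_cachedLead); // prefix the lead byte cached by the last call
        buffer.append(in);

        // Walk the pairs from the front; only a lead byte in the very last
        // position is cached. Pairs inside the run are passed through as-is,
        // leaving validation to the conversion function.
        std::size_t i = 0;
        while (i < buffer.size())
        {
            if (isLeadByte(static_cast<unsigned char>(buffer[i])))
            {
                if (i + 1 == buffer.size())
                {
                    _cachedLead.assign(1, buffer[i]);
                    buffer.pop_back();
                    break;
                }
                i += 2; // skip lead and trail byte
            }
            else
            {
                ++i;
            }
        }
        assert(_cachedLead.size() <= 1);
        return buffer;
    }

private:
    unsigned int _codepage = 0;
    std::string _cachedLead;
};
```

The returned string would then be handed to the conversion routine (MultiByteToWideChar in the discussion above), which is not part of this sketch.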

@miniksa u16a would be easy. We only need to take care of split surrogates, and that's what u16state already does.
As to au16 - IsDBCSLeadByteEx isn't used in the current implementation of CheckBisectStringA. Using it would be my first attempt at implementing an astate, though. The remarks found on the manpage still make me wonder if it would be bulletproof for DBCS.
But even if it was that simple, we would still only have conversions for SBCS, DBCS (the few mentioned on the manpage), and UTF-8. Guess what happens if we receive UTF-7 that was split inside of a base64 sequence 🤕
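For the UTF-16 side, the split surrogate handling mentioned above could look roughly like this. It is a sketch under stated assumptions, not the til::u16state code: the name `Utf16ChunkState` and its interface are hypothetical, and the conversion to the target codepage is omitted.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical stand-in for a UTF-16 chunking state (not the real til::u16state).
class Utf16ChunkState
{
public:
    // Prepends a cached high surrogate to `in` and holds back a high surrogate
    // that ends the chunk, so that surrogate pairs are never split.
    std::u16string completeChunk(std::u16string_view in)
    {
        std::u16string buffer;
        if (_hasPartial)
        {
            buffer.push_back(_highSurrogate);
            _hasPartial = false;
        }
        buffer.append(in);

        if (!buffer.empty() && isHighSurrogate(buffer.back()))
        {
            _highSurrogate = buffer.back();
            _hasPartial = true;
            buffer.pop_back();
        }
        assert(!_hasPartial || isHighSurrogate(_highSurrogate));
        return buffer;
    }

    void reset() noexcept { _hasPartial = false; }

private:
    static constexpr bool isHighSurrogate(char16_t c) noexcept
    {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    char16_t _highSurrogate = 0;
    bool _hasPartial = false;
};
```

Only one code unit ever needs to be cached here, which is why this direction is the easy one compared with DBCS or UTF-7 input.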

german-one · 2020-01-31T18:22:08Z

Removed new and delete in WriteConsoleAImpl and took the opportunity to clean it up a little.

german-one · 2020-02-02T12:31:47Z

The latest commit is intended to improve the readability of WriteConsoleAImpl, not to change its behavior in any way.
This can be seen as loosely related to the update of the UTF-8 conversion in this function. However, I changed the PR title to make it visible to everyone.

miniksa

This looks fine to me.

There are still some opportunities in _stream.cpp to use some of the new Chromium math stuff instead of the hard gsl casts and the old SizeTToUInt math functions.

There's also the big opportunity to hopefully eliminate a lot of the dangerous-ish pointer dances going on around the MBCS handling. The cleanup is just incremental progress here. I'd really love to see it get to iterators and maybe even make/use a future til::au16 function that eliminates the need to do some of the odd counting.

But for now, this is better than what we had. So I'm good with it.

DHowett-MSFT

Okay, I had to read these functions side-by-side (original and new), but I think I trust it. Thank you, @german-one!

src/host/_stream.cpp

DHowett-MSFT · 2020-04-08T19:59:47Z

🎉 Once again, thanks for the contribution!

This pull request was included in a set of conhost changes that was just
released with Windows Insider Build 19603.

german-one added 2 commits January 30, 2020 22:23

replace utf8Parser with til::u8u16 in VtInputThread

0b59d22

replace utf8Parser with til::u8u16 in WriteConsoleAImpl, get rid of a couple of warnings in `_stream.cpp`

0e57d30

DHowett-MSFT reviewed Jan 31, 2020

zadjii-msft requested review from DHowett-MSFT and miniksa January 31, 2020 14:10

zadjii-msft added the Area-Output label (related to output processing: inserting text into buffer, retrieving buffer text, etc.) Jan 31, 2020

update WriteConsoleAImpl: remove new, delete, unused variables, hungarian notation; unify camel case

ab4c720

try harder to clean up WriteConsoleAImpl

dd12bd6

german-one changed the title from "Unify UTF-8 handling using til::u8u16" to "Unify UTF-8 handling using til::u8u16, revise WriteConsoleAImpl" Feb 2, 2020

miniksa approved these changes Feb 3, 2020

DHowett-MSFT approved these changes Feb 4, 2020

DHowett-MSFT changed the title from "Unify UTF-8 handling using til::u8u16, revise WriteConsoleAImpl" to "Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl" Feb 4, 2020

DHowett-MSFT merged commit 06b3931 into microsoft:master Feb 4, 2020

german-one mentioned this pull request Feb 6, 2020

Implement til::au16 and til::u16a conversion functions & make first use in WriteConsoleAImpl #4493

Closed

5 tasks

german-one mentioned this pull request Feb 20, 2020

WriteConsoleOutputCharacterA doesn't merge UTF-8 partials in successive calls #1851

Open
