Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl #4422

Merged
merged 4 commits into from
Feb 4, 2020
Merged

Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl #4422

merged 4 commits into from
Feb 4, 2020

Conversation

german-one
Copy link
Contributor

Summary of the Pull Request

Replace utf8Parser with til::u8u16 in order to have the same conversion algorithms used in terminal and conhost.

References

This PR is a follow up of #4093

PR Checklist

Detailed Description of the Pull Request / Additional comments

This PR addresses item 2 in this list:

  1. ✉ Implement til::u8u16 and til::u16u8 (done in PR Implement til::u8u16 and til::u16u8 conversion functions #4093)

  2. Unify UTF-8 handling using til::u8u16 (this PR)
    2.1. ✔ Update VtInputThread::_HandleRunInput()
    2.2. ✔ Update ApiRoutines::WriteConsoleAImpl()
    2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use

  3. ❌ Enable BOM discarding (follow up)
    3.1. ❌ extend til::u8u16 and til::u16u8 with a 3rd parameter to enable discarding the BOM
    3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for til::u8u16 and til::u16u8

  4. ❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up)

@miniksa @DHowett-MSFT

Validation Steps Performed

@@ -1053,36 +1052,25 @@ constexpr unsigned int LOCAL_BUFFER_SIZE = 100;
const auto codepage = gci.OutputCP;

// Convert our input parameters to Unicode
std::unique_ptr<wchar_t[]> wideCharBuffer{ nullptr };
static Utf8ToWideCharParser parser{ gci.OutputCP };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IGNORE, SEE BELOW
this one unfortunately cannot change just yet. gci.OutputCP must be the user's console output codepage for now and the forseeable future.

This poses an interesting conundrum for u8u16: should we have "ASCII" versions that take a codepage? au16 and u16a? I'm not sure 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it's actually part of the u*state itself...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OH, I understand. You can almost entirely ignore this comment. I was confused because this was created outside the CP_UTF8 block.
I also understand that writing in another codepage means we need to kill the u8 state -- so we can't move it inside this block.

Copy link
Contributor Author

@german-one german-one Jan 31, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DHowett-MSFT Your understanding is correct. The reset() method is called in the else branch where codepages other than UTF-8 are processed. I knew I would need the reset here, that's why I implemented it from the beginning.

This poses an interesting conundrum for u8u16: should we have "ASCII" versions that take a codepage? au16 and u16a? I'm not sure 😄

Interesting indeed. Well, we already have function ConvertToW (and ConvertToA). But that's not simply applicable in WriteConsoleAImpl() due to the fact that we may receive DBCS-encoded text where caching of partials is required, too. @miniksa briefly mentioned that in #386 (comment)
I only have a poor understanding of how DBCS has to be processed though. The manpage of IsDBCSLeadByte states that even if you validated a lead byte you may not rely on MultiByteToWideChar being able to process the substring correctly. So, unfortunately I don't know enough to revise the DBCS handling. And I don't know if there is any other function in the code base where DBCS has to be converted. Hence it might not worth the effort to bring it into a separate function.
However, I offer to do my best to get rid of new and delete in WriteConsoleAImpl() and instead use the wstring that we already have for the UTF-8 conversion.
// EDIT done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@german-one, if we theoretically made a au16 and a u16a to supercede ConvertToW and ConvertToA, there would probably be some sort of astate variable required. That variable could be stored on the assorted input handles to ensure that the problem briefly discussed in #386 is rectified.

Off hand, I believe there are probably several points in the code base that could be unified behind such a convergence function in a similar way to converging the u8/u16 problem. But I haven't enumerated them recently, so I may be incorrect.

I believe that the only circumstance where we'd really pay attention to IsDBCSLeadByte and hold onto it until the next call is if a string ends in a lead byte (then it would take the lead byte off the end and cache it for the next call in the astate or equivalent.) If it's in the middle of a run, we wouldn't care and just pass it into MultiByteToWideChar probably with the replacement character flag on so it would remove invalid DBCS representations. The next write would have the stored DBCS lead prefixed to whatever comes next, even if it's not valid together, and we'd let MB2WC sort it out and replace it.

The last provision to consider is that I believe the state would be discarded anytime the code page changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miniksa u16a would be easy. We only need to take care of split surrogates, and that's what u16state already does.
As to au16 - IsDBCSLeadByteEx isn't used in the current implementation of CheckBisectStringA. It would be my attempt to implement an astate though. The remarks found on the manpage still make me wonder if it would be bulletproof for DBCS.
But even if it was that simple, we would still only have conversions for SBCS, DBCS (the few mentioned on the manpage), and UTF-8. Guess what happens if we receive UTF-7 that was split inside of a base64 sequence 🤕

@zadjii-msft zadjii-msft added the Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.) label Jan 31, 2020
@german-one
Copy link
Contributor Author

Removed new and delete in WriteConsoleAImpl and on this occasion cleaned it up a little.

@german-one german-one changed the title Unify UTF-8 handling using til::u8u16 Unify UTF-8 handling using til::u8u16, revise WriteConsoleAImpl Feb 2, 2020
@german-one
Copy link
Contributor Author

The latest commit is intended to improve the readability of WriteConsoleAImpl, not to change its behavior in any kind.
This can be taken to be somehow related to the update of the UTF-8 conversion in this function. However, I changed the topic to make it visible for everyone.

Copy link
Member

@miniksa miniksa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me.

There's still some opportunities in _stream.cpp to use some of the new Chromium math stuff instead of the hard gsl casts and the old SizeTToUInt math functions.

There's also the big opportunity to hopefully eliminate a lot of the dangerousish pointer dances going on around the MBCS handling. The clean up is just incremental progress here. I'd really love to see it get to iterators and maybe even make/use a future til::au16 function that eliminates the need to do some of the odd counting.

But for now, this is better than what we had. So I'm good with it.

Copy link
Contributor

@DHowett-MSFT DHowett-MSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I had to read these functions side-by-side (original and new), but I think I trust it. Thank you, @german-one!

src/host/_stream.cpp Show resolved Hide resolved
@DHowett-MSFT DHowett-MSFT changed the title Unify UTF-8 handling using til::u8u16, revise WriteConsoleAImpl Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl Feb 4, 2020
@DHowett-MSFT DHowett-MSFT merged commit 06b3931 into microsoft:master Feb 4, 2020
@DHowett-MSFT
Copy link
Contributor

🎉 Once again, thanks for the contribution!

This pull request was included in a set of conhost changes that was just
released with Windows Insider Build 19603.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Output Related to output processing (inserting text into buffer, retrieving buffer text, etc.)
Projects
None yet
4 participants