-
Notifications
You must be signed in to change notification settings - Fork 8.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl #4422
Conversation
…d of a couple of warnings in `_stream.cpp`
@@ -1053,36 +1052,25 @@ constexpr unsigned int LOCAL_BUFFER_SIZE = 100; | |||
const auto codepage = gci.OutputCP; | |||
|
|||
// Convert our input parameters to Unicode | |||
std::unique_ptr<wchar_t[]> wideCharBuffer{ nullptr }; | |||
static Utf8ToWideCharParser parser{ gci.OutputCP }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IGNORE, SEE BELOW
this one unfortunately cannot change just yet. gci.OutputCP
must be the user's console output codepage for now and the forseeable future.
This poses an interesting conundrum for u8u16: should we have "ASCII" versions that take a codepage? au16 and u16a? I'm not sure 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's actually part of the u*state
itself...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OH, I understand. You can almost entirely ignore this comment. I was confused because this was created outside the CP_UTF8
block.
I also understand that writing in another codepage means we need to kill the u8 state -- so we can't move it inside this block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DHowett-MSFT Your understanding is correct. The reset()
method is called in the else
branch where codepages other than UTF-8 are processed. I knew I would need the reset
here, that's why I implemented it from the beginning.
This poses an interesting conundrum for u8u16: should we have "ASCII" versions that take a codepage? au16 and u16a? I'm not sure 😄
Interesting indeed. Well, we already have function ConvertToW
(and ConvertToA
). But that's not simply applicable in WriteConsoleAImpl()
due to the fact that we may receive DBCS-encoded text where caching of partials is required, too. @miniksa briefly mentioned that in #386 (comment)
I only have a poor understanding of how DBCS has to be processed though. The manpage of IsDBCSLeadByte
states that even if you validated a lead byte you may not rely on MultiByteToWideChar
being able to process the substring correctly. So, unfortunately I don't know enough to revise the DBCS handling. And I don't know if there is any other function in the code base where DBCS has to be converted. Hence it might not worth the effort to bring it into a separate function.
However, I offer to do my best to get rid of new
and delete
in WriteConsoleAImpl()
and instead use the wstring that we already have for the UTF-8 conversion.
// EDIT done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@german-one, if we theoretically made a au16
and a u16a
to supercede ConvertToW
and ConvertToA
, there would probably be some sort of astate
variable required. That variable could be stored on the assorted input handles to ensure that the problem briefly discussed in #386 is rectified.
Off hand, I believe there are probably several points in the code base that could be unified behind such a convergence function in a similar way to converging the u8/u16 problem. But I haven't enumerated them recently, so I may be incorrect.
I believe that the only circumstance where we'd really pay attention to IsDBCSLeadByte
and hold onto it until the next call is if a string ends in a lead byte (then it would take the lead byte off the end and cache it for the next call in the astate
or equivalent.) If it's in the middle of a run, we wouldn't care and just pass it into MultiByteToWideChar
probably with the replacement character flag on so it would remove invalid DBCS representations. The next write would have the stored DBCS lead prefixed to whatever comes next, even if it's not valid together, and we'd let MB2WC sort it out and replace it.
The last provision to consider is that I believe the state would be discarded anytime the code page changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miniksa u16a
would be easy. We only need to take care of split surrogates, and that's what u16state
already does.
As to au16
- IsDBCSLeadByteEx
isn't used in the current implementation of CheckBisectStringA
. It would be my attempt to implement an astate
though. The remarks found on the manpage still make me wonder if it would be bulletproof for DBCS.
But even if it was that simple, we would still only have conversions for SBCS, DBCS (the few mentioned on the manpage), and UTF-8. Guess what happens if we receive UTF-7 that was split inside of a base64 sequence 🤕
…garian notation; unify camel case
Removed |
til::u8u16
til::u8u16
, revise WriteConsoleAImpl
The latest commit is intended to improve the readability of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine to me.
There's still some opportunities in _stream.cpp
to use some of the new Chromium math stuff instead of the hard gsl casts and the old SizeTToUInt math functions.
There's also the big opportunity to hopefully eliminate a lot of the dangerousish pointer dances going on around the MBCS handling. The clean up is just incremental progress here. I'd really love to see it get to iterators and maybe even make/use a future til::au16
function that eliminates the need to do some of the odd counting.
But for now, this is better than what we had. So I'm good with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I had to read these functions side-by-side (original and new), but I think I trust it. Thank you, @german-one!
til::u8u16
, revise WriteConsoleAImpl
🎉 Once again, thanks for the contribution! This pull request was included in a set of conhost changes that was just |
Summary of the Pull Request
Replace
utf8Parser
withtil::u8u16
in order to have the same conversion algorithms used in terminal and conhost.References
This PR is a follow up of #4093
PR Checklist
utf8ToWideCharParser.cpp
#3378Detailed Description of the Pull Request / Additional comments
This PR addresses item 2 in this list:
✉ Implement
til::u8u16
andtil::u16u8
(done in PR Implement til::u8u16 and til::u16u8 conversion functions #4093)✔ Unify UTF-8 handling using
til::u8u16
(this PR)2.1. ✔ Update VtInputThread::_HandleRunInput()
2.2. ✔ Update ApiRoutines::WriteConsoleAImpl()
2.3. ❌ (optional / ask the core team) Remove Utf8ToWideCharParser from the code base to avoid further use
❌ Enable BOM discarding (follow up)
3.1. ❌ extend
til::u8u16
andtil::u16u8
with a 3rd parameter to enable discarding the BOM3.2. ❌ Make use of the 3rd parameter to discard the BOM in all current function callers, or (optional / ask the core team) make it the default for
til::u8u16
andtil::u16u8
❌ Find UTF-16 to UTF-8 conversions and examine if they can be unified, too (follow up)
@miniksa @DHowett-MSFT
utf8ToWideCharParser.cpp
#3378Utf8ToWideCharParser
now that it isn't used anymoreValidation Steps Performed
printf
tested as shown in Certain invalid UTF-8 sequences can cause the output to fail #4086