-
Notifications
You must be signed in to change notification settings - Fork 8.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Unify UTF-8 handling using til::u8u16 & revise WriteConsoleAImpl #4422
Merged
Merged
Changes from 2 commits
Commits
Show all changes
4 commits
Select commit
Hold shift + click to select a range
0b59d22
replace `utf8Parser` with `til::u8u16` in `VtInputThread`
german-one 0e57d30
replace `utf8Parser` with `til::u8u16` in `WriteConsoleAImpl`, get ri…
german-one ab4c720
update `WriteConsoleAImpl`: remove new, delete, unused variables, hun…
german-one dd12bd6
try harder to clean up `WriteConsoleAImpl`
german-one File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IGNORE, SEE BELOW
this one unfortunately cannot change just yet.
gci.OutputCP
must be the user's console output codepage for now and the forseeable future.This poses an interesting conundrum for u8u16: should we have "ASCII" versions that take a codepage? au16 and u16a? I'm not sure 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe it's actually part of the
u*state
itself...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OH, I understand. You can almost entirely ignore this comment. I was confused because this was created outside the
CP_UTF8
block.I also understand that writing in another codepage means we need to kill the u8 state -- so we can't move it inside this block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DHowett-MSFT Your understanding is correct. The
reset()
method is called in theelse
branch where codepages other than UTF-8 are processed. I knew I would need thereset
here, that's why I implemented it from the beginning.Interesting indeed. Well, we already have function
ConvertToW
(andConvertToA
). But that's not simply applicable inWriteConsoleAImpl()
due to the fact that we may receive DBCS-encoded text where caching of partials is required, too. @miniksa briefly mentioned that in #386 (comment)I only have a poor understanding of how DBCS has to be processed though. The manpage of
IsDBCSLeadByte
states that even if you validated a lead byte you may not rely onMultiByteToWideChar
being able to process the substring correctly. So, unfortunately I don't know enough to revise the DBCS handling. And I don't know if there is any other function in the code base where DBCS has to be converted. Hence it might not worth the effort to bring it into a separate function.However, I offer to do my best to get rid of
new
anddelete
inWriteConsoleAImpl()
and instead use the wstring that we already have for the UTF-8 conversion.// EDIT done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@german-one, if we theoretically made a
au16
and au16a
to supercedeConvertToW
andConvertToA
, there would probably be some sort ofastate
variable required. That variable could be stored on the assorted input handles to ensure that the problem briefly discussed in #386 is rectified.Off hand, I believe there are probably several points in the code base that could be unified behind such a convergence function in a similar way to converging the u8/u16 problem. But I haven't enumerated them recently, so I may be incorrect.
I believe that the only circumstance where we'd really pay attention to
IsDBCSLeadByte
and hold onto it until the next call is if a string ends in a lead byte (then it would take the lead byte off the end and cache it for the next call in theastate
or equivalent.) If it's in the middle of a run, we wouldn't care and just pass it intoMultiByteToWideChar
probably with the replacement character flag on so it would remove invalid DBCS representations. The next write would have the stored DBCS lead prefixed to whatever comes next, even if it's not valid together, and we'd let MB2WC sort it out and replace it.The last provision to consider is that I believe the state would be discarded anytime the code page changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@miniksa
u16a
would be easy. We only need to take care of split surrogates, and that's whatu16state
already does.As to
au16
-IsDBCSLeadByteEx
isn't used in the current implementation ofCheckBisectStringA
. It would be my attempt to implement anastate
though. The remarks found on the manpage still make me wonder if it would be bulletproof for DBCS.But even if it was that simple, we would still only have conversions for SBCS, DBCS (the few mentioned on the manpage), and UTF-8. Guess what happens if we receive UTF-7 that was split inside of a base64 sequence 🤕