Improve support for VT character sets (#4496)
This PR improves our VT character set support, enabling the [`SCS`]
escape sequences to designate into all four G-sets with both 94- and
96-character sets, and supports invoking those G-sets into both the GL
and GR areas of the code table, with [locking shifts] and [single
shifts]. It also adds [`DOCS`] sequences to switch between UTF-8 and the
ISO-2022 coding system (which is what the VT character sets require),
and adds support for a lot more character sets, up to around the level
of a VT510.

[`SCS`]: https://vt100.net/docs/vt510-rm/SCS.html
[locking shifts]: https://vt100.net/docs/vt510-rm/LS.html
[single shifts]: https://vt100.net/docs/vt510-rm/SS.html
[`DOCS`]: https://en.wikipedia.org/wiki/ISO/IEC_2022#Interaction_with_other_coding_systems

## Detailed Description of the Pull Request / Additional comments

To make it easier for us to declare a bunch of character sets, I've made
a little `constexpr` class that can build up a mapping table from a base
character set (ASCII or Latin1), along with a collection of mappings for
the characters that deviate from the base set. Many of the character sets
are simple variations of ASCII, so they're easy to define this way.
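
A minimal sketch of that idea, assuming an ASCII base and a 94-character table covering 0x21 to 0x7E. The `Ascii94Table` and `BritishNrcs` names are hypothetical and are not the class added in this PR; the Latin1 base is left out for brevity.

```cpp
#include <array>
#include <cstddef>
#include <initializer_list>
#include <string_view>
#include <utility>

// Hypothetical helper, not the class added in this PR: a 94-character table
// for code points 0x21 to 0x7E, built from ASCII plus a list of overrides.
class Ascii94Table
{
public:
    constexpr Ascii94Table(std::initializer_list<std::pair<wchar_t, wchar_t>> overrides) :
        _table{}
    {
        // Start from the base set, where every code point maps to itself.
        for (std::size_t i = 0; i < _table.size(); ++i)
        {
            _table[i] = static_cast<wchar_t>(FirstChar + i);
        }
        // Then patch in the characters that deviate from the base set.
        for (const auto& entry : overrides)
        {
            _table[static_cast<std::size_t>(entry.first - FirstChar)] = entry.second;
        }
    }

    // Expose the table as a view, which is how the lookups would consume it.
    constexpr operator std::wstring_view() const
    {
        return { _table.data(), _table.size() };
    }

private:
    static constexpr wchar_t FirstChar = L'\x21';
    std::array<wchar_t, 94> _table;
};

// ASCII with '#' replaced by the pound sign, as in a British national set.
inline constexpr Ascii94Table BritishNrcs{ { L'#', L'\u00A3' } };
```

Because the table is built in a `constexpr` constructor, each set declared this way costs one line per deviating character and nothing at run time.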

This class then casts directly to a `wstring_view` which is how the
translation tables are represented in most of the code. We have an array
of four of these tables representing the four G-sets, two instances for
the active left and right tables, and one instance for the single shift
table.

Initially we had just one `DesignateCharset` method, which could select
the active character set. We now have two designate methods (for 94- and
96-character sets), and each takes a G-set number specifying the target
of the designation, and a pair of characters identifying the character
set that will be designated (at the higher VT levels, character sets are
often identified by more than one character).

There are then two new `LockingShift` methods to invoke these G-sets
into either the GL or GR area of the code table, and a `SingleShift`
method which invokes a G-set temporarily (for just the next character
that is output).
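
The interplay between designating and shifting can be sketched roughly as follows. This is an illustration only: the `GsetState` class is hypothetical, it tracks GL and GR as G-set indices rather than as copies of the tables, and the defaults (G0 in GL, G2 in GR) are assumptions.

```cpp
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical model of the four G-sets and the shift state.
class GsetState
{
public:
    // SCS: designate a character set into one of G0 to G3.
    bool Designate(const std::size_t gsetNumber, const std::wstring_view table)
    {
        if (gsetNumber >= _gsets.size()) { return false; }
        _gsets[gsetNumber] = table;
        return true;
    }

    // LS0 to LS3: invoke a G-set into GL until the next locking shift.
    bool LockingShift(const std::size_t gsetNumber)
    {
        if (gsetNumber >= _gsets.size()) { return false; }
        _gl = gsetNumber;
        return true;
    }

    // LS1R to LS3R: invoke a G-set into GR until the next locking shift.
    bool LockingShiftRight(const std::size_t gsetNumber)
    {
        if (gsetNumber >= _gsets.size()) { return false; }
        _gr = gsetNumber;
        return true;
    }

    // SS2 and SS3: invoke a G-set for the next character only.
    bool SingleShift(const std::size_t gsetNumber)
    {
        if (gsetNumber >= _gsets.size()) { return false; }
        _singleShift = gsetNumber;
        return true;
    }

    // Table for the next GL character. A pending single shift is consumed.
    std::wstring_view NextLeftTable()
    {
        if (_singleShift < _gsets.size())
        {
            const auto table = _gsets[_singleShift];
            _singleShift = NoShift;
            return table;
        }
        return _gsets[_gl];
    }

    std::wstring_view RightTable() const { return _gsets[_gr]; }

private:
    static constexpr std::size_t NoShift = 4;
    std::array<std::wstring_view, 4> _gsets{};
    std::size_t _gl = 0; // assumed default: G0 invoked into GL
    std::size_t _gr = 2; // assumed default: G2 invoked into GR
    std::size_t _singleShift = NoShift;
};
```

Designating a set never changes what is displayed on its own; only a locking or single shift makes the designated table active.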

I should mention here that I had to make some changes to the state
machine to make these single shift sequences work. The problem is that
the input state machine treats `SS3` as the start of a control sequence,
while the output state machine needs it to be dispatched immediately
(it's literally the _Single Shift 3_ escape sequence). To make that
work, I've added a `ParseControlSequenceAfterSs3` callback in the
`IStateMachineEngine` interface to decide which behavior is appropriate.
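
Roughly, the decision looks like the sketch below. Only the `ParseControlSequenceAfterSs3` name comes from this PR; the `IEngineSketch` interface, the two engine types, `NextStep`, and `OnEscapeO` are hypothetical stand-ins for the real state machine.

```cpp
// What the state machine does after seeing ESC O (the 7-bit form of SS3).
enum class NextStep
{
    DispatchSingleShift, // treat it as Single Shift 3 and dispatch it now
    ParseSs3Sequence     // keep reading, it introduces a key sequence
};

// Hypothetical cut-down engine interface with just the new callback.
struct IEngineSketch
{
    virtual ~IEngineSketch() = default;
    virtual bool ParseControlSequenceAfterSs3() const = 0;
};

// Input side: ESC O starts a key sequence such as ESC O P (F1).
struct InputEngineSketch final : IEngineSketch
{
    bool ParseControlSequenceAfterSs3() const override { return true; }
};

// Output side: ESC O is the complete Single Shift 3 escape sequence.
struct OutputEngineSketch final : IEngineSketch
{
    bool ParseControlSequenceAfterSs3() const override { return false; }
};

inline NextStep OnEscapeO(const IEngineSketch& engine)
{
    return engine.ParseControlSequenceAfterSs3() ? NextStep::ParseSs3Sequence :
                                                   NextStep::DispatchSingleShift;
}
```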

When it comes to mapping a character, it's simply an array reference
into the appropriate `wstring_view` table. If the single shift table is
set, that takes precedence. Otherwise the GL table is used for
characters in the range 0x20 to 0x7F, and the GR table for characters
0xA0 to 0xFF (technically some character sets will only map up to 0x7E
and 0xFE, but that's easily controlled by the length of the
`wstring_view`).
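
As a sketch of that lookup, assuming tables indexed from 0x20 (GL) and 0xA0 (GR): the `CharsetState` struct and `TranslateChar` function are hypothetical names, and characters that fall outside a table pass through unchanged.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical state: the two active tables plus a pending single shift.
struct CharsetState
{
    std::wstring_view gl;          // active left table, indexed from 0x20
    std::wstring_view gr;          // active right table, indexed from 0xA0
    std::wstring_view singleShift; // empty when no single shift is pending
};

inline wchar_t TranslateChar(CharsetState& state, const wchar_t ch)
{
    // Index into a table whose first entry corresponds to `base`. The length
    // of the view decides how far the mapping extends.
    const auto lookup = [ch](const std::wstring_view table, const wchar_t base) {
        const auto index = static_cast<std::size_t>(ch - base);
        return index < table.size() ? table[index] : ch;
    };

    if (!state.singleShift.empty())
    {
        // A single shift applies to one character only, then it is cleared.
        const auto table = state.singleShift;
        state.singleShift = {};
        return lookup(table, ch < 0xA0 ? 0x20 : 0xA0);
    }
    return ch < 0xA0 ? lookup(state.gl, 0x20) : lookup(state.gr, 0xA0);
}
```

A character below the base produces a very large unsigned index, so it fails the bounds check and is returned untranslated.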

The `DEL` character is a bit of a special case. By default it's meant to
be ignored like the `NUL` character (it's essentially a time-fill
character). However, it's possible that it could be remapped to a
printable character in a 96-character set, so we need to check for that
after the translation. This is handled in the `AdaptDispatch::Print`
method, so it doesn't interfere with the primary `PrintString` code
path.
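
The check amounts to something like the following sketch. The `PrintTranslated` function and its parameters are hypothetical, and the table is assumed to be indexed from 0x20.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

constexpr wchar_t DEL = L'\x7F';

// Translate one character through a GL table indexed from 0x20 and append it
// to `out`, unless it is still DEL after the translation.
inline void PrintTranslated(std::wstring& out, const std::wstring_view glTable, const wchar_t ch)
{
    const auto index = static_cast<std::size_t>(ch - 0x20);
    const wchar_t translated = index < glTable.size() ? glTable[index] : ch;

    // A table that stops at 0x7E leaves DEL untouched, so it is dropped. A
    // 96-character table may map it to something printable, which is kept.
    if (translated != DEL)
    {
        out.push_back(translated);
    }
}
```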

The biggest problem with this whole process, though, is that the GR
mappings only really make sense if you have access to the raw output,
but by the time the output gets to us, it would already have been
translated to Unicode by the active code page. And in the case of UTF-8,
the characters we eventually receive may originally have been composed
from two or more code points.

The way I've dealt with this is to disable the GR translations by
default, and add support for a pair of ISO-2022 `DOCS` sequences,
which can switch the code page between UTF-8 and ISO-8859-1. When the
code page is ISO-8859-1, we're essentially receiving the raw output
bytes, so it's safe to enable the GR translations. This is not strictly
correct ISO-2022 behavior, and there are edge cases where it's not going
to work, but it's the best solution I could come up with.
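
In outline, the switch behaves like the sketch below. The final characters `@` and `G` match the `CodingSystem` values in this commit, and 28591 and 65001 are the Windows code page identifiers for ISO-8859-1 and UTF-8. The `DocsState` struct, its initial values, and the free function are hypothetical.

```cpp
// Hypothetical state touched by a DOCS sequence (ESC % followed by a final).
struct DocsState
{
    unsigned int codePage = 65001;     // assumed starting point: UTF-8
    bool grTranslationEnabled = false; // GR mappings are off by default
};

inline bool DesignateCodingSystem(DocsState& state, const wchar_t finalChar)
{
    switch (finalChar)
    {
    case L'@': // ISO-2022: raw bytes arrive via ISO-8859-1, so GR is safe
        state.codePage = 28591;
        state.grTranslationEnabled = true;
        return true;
    case L'G': // UTF-8: bytes are already decoded, so GR stays disabled
        state.codePage = 65001;
        state.grTranslationEnabled = false;
        return true;
    default: // unknown coding system, leave the state untouched
        return false;
    }
}
```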

## Validation Steps Performed

As a result of the `SS3` changes in the state machine engine, I've had
to move the existing `SS3` tests from the `OutputEngineTest` to the
`InputEngineTest`, otherwise they would now fail (technically they
should never have been output tests).

I've added no additional unit tests, but I have done a lot of manual
testing, and made sure we passed all the character set tests in Vttest
(at least for the character sets we currently support). Note that this
required a slightly hacked version of the app, since by default it
doesn't expose a lot of the tests to low-level terminals, and we
currently identify as a VT100.

Closes #3377
Closes #3487
j4james authored Jun 4, 2020
1 parent 7b48912 commit 96a77cb
Showing 30 changed files with 1,704 additions and 328 deletions.
18 changes: 18 additions & 0 deletions .github/actions/spell-check/dictionary/dictionary.txt
@@ -97786,6 +97786,7 @@ dalesman
dalesmen
dalespeople
daleswoman
dalet
daleth
daleths
Daleville
@@ -107561,6 +107562,7 @@ dialystelic
dialystely
dialytic
dialytically
dialytika
dialyzability
dialyzable
dialyzate
@@ -114004,6 +114006,7 @@ djalmaite
Djambi
djasakid
djave
dje
djebel
djebels
djehad
@@ -120418,11 +120421,13 @@ DZ
dz
dz.
Dzaudzhikau
dze
dzeren
dzerin
dzeron
Dzerzhinsk
Dzhambul
dzhe
Dzhugashvili
dziggetai
dzo
@@ -158966,6 +158971,7 @@ Ghaznevid
Ghazzah
Ghazzali
ghbor
ghe
Gheber
gheber
ghebeta
@@ -160166,6 +160172,7 @@ gizzards
gizzen
gizzened
gizzern
gje
gjedost
Gjellerup
gjetost
@@ -212347,6 +212354,7 @@ Kizilbash
Kizzee
Kizzie
kJ
kje
Kjeldahl
kjeldahlization
kjeldahlize
@@ -224856,6 +224864,7 @@ lizzie
Lizzy
LJ
LJBF
lje
Ljod
Ljoka
Ljubljana
@@ -261607,6 +261616,7 @@ N.J.
NJ
nj
njave
nje
Njord
Njorth
NKGB
@@ -274785,6 +274795,7 @@ Ogma
ogmic
Ogmios
OGO
ogonek
ogonium
Ogor
O'Gowan
@@ -329834,6 +329845,7 @@ QN
qn
QNP
QNS
qof
Qoheleth
Qom
qoph
@@ -371408,6 +371420,7 @@ Shaysite
shazam
Shazar
SHCD
shcha
Shcheglovsk
Shcherbakov
she
@@ -420973,6 +420986,7 @@ tonometry
Tonopah
tonophant
tonoplast
tonos
tonoscope
tonotactic
tonotaxis
@@ -428676,6 +428690,7 @@ tsetses
TSF
TSgt
TSH
tshe
Tshi
tshi
Tshiluba
@@ -477068,6 +477083,7 @@ Yermo
yern
yertchuk
yerth
yeru
yerva
Yerwa-Maiduguri
Yerxa
@@ -478235,6 +478251,7 @@ Z-bar
ZBB
ZBR
ZD
ze
Zea
zea
zeal
@@ -478604,6 +478621,7 @@ ZG
ZGS
Zhang
Zhdanov
zhe
Zhitomir
Zhivkov
Zhmud
4 changes: 4 additions & 0 deletions .github/actions/spell-check/expect/expect.txt
@@ -906,6 +906,7 @@ grep
Greyscale
gridline
groupbox
gset
gsl
GTP
guc
@@ -1530,6 +1531,7 @@ NOYIELD
NOZORDER
NPM
npos
NRCS
NSTATUS
ntapi
ntcon
@@ -2030,6 +2032,7 @@ SCROLLSCALE
SCROLLSCREENBUFFER
Scrollup
Scrolluppage
SCS
scursor
sddl
sdeleted
@@ -2444,6 +2447,7 @@ untimes
UPDATEDISPLAY
UPDOWN
UPKEY
UPSS
upvote
uri
url
4 changes: 3 additions & 1 deletion .github/actions/spell-check/patterns/patterns.txt
@@ -1,11 +1,13 @@
https://(?:(?:[-a-zA-Z0-9?&=]*\.|)microsoft\.com)/[-a-zA-Z0-9?&=_#\/.]*
https://aka\.ms/[-a-zA-Z0-9?&=\/_]*
https://www\.itscj\.ipsj\.or\.jp/iso-ir/[-0-9]+\.pdf
https://www\.vt100\.net/docs/[-a-zA-Z0-9#_\/.]*
https://www.w3.org/[-a-zA-Z0-9?&=\/_#]*
https://(?:(?:www\.|)youtube\.com|youtu.be)/[-a-zA-Z0-9?&=]*
https://[a-z-]+\.githubusercontent\.com/[-a-zA-Z0-9?&=_\/.]*
[Pp]ublicKeyToken="?[0-9a-fA-F]{16}"?
(?:[{"]|UniqueIdentifier>)[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}(?:[}"]|</UniqueIdentifier)
(?:0[Xx]|U\+|#)[a-f0-9A-FGgRr]{2,}[Uu]?[Ll]?\b
(?:0[Xx]|\\x|U\+|#)[a-f0-9A-FGgRr]{2,}[Uu]?[Ll]{0,2}\b
microsoft/cascadia-code\@[0-9a-fA-F]{40}
\d+x\d+Logo
Scro\&ll
35 changes: 20 additions & 15 deletions src/host/getset.cpp
@@ -922,6 +922,25 @@ void ApiRoutines::GetLargestConsoleWindowSizeImpl(const SCREEN_INFORMATION& cont
CATCH_RETURN();
}

[[nodiscard]] HRESULT DoSrvSetConsoleOutputCodePage(const unsigned int codepage)
{
CONSOLE_INFORMATION& gci = ServiceLocator::LocateGlobals().getConsoleInformation();

// Return if it's not known as a valid codepage ID.
RETURN_HR_IF(E_INVALIDARG, !(IsValidCodePage(codepage)));

// Do nothing if no change.
if (gci.OutputCP != codepage)
{
// Set new code page
gci.OutputCP = codepage;

SetConsoleCPInfo(TRUE);
}

return S_OK;
}

// Routine Description:
// - Sets the codepage used for translating text when calling A versions of functions affecting the output buffer.
// Arguments:
@@ -932,23 +951,9 @@ void ApiRoutines::GetLargestConsoleWindowSizeImpl(const SCREEN_INFORMATION& cont
{
try
{
CONSOLE_INFORMATION& gci = ServiceLocator::LocateGlobals().getConsoleInformation();
LockConsole();
auto Unlock = wil::scope_exit([&] { UnlockConsole(); });

// Return if it's not known as a valid codepage ID.
RETURN_HR_IF(E_INVALIDARG, !(IsValidCodePage(codepage)));

// Do nothing if no change.
if (gci.OutputCP != codepage)
{
// Set new code page
gci.OutputCP = codepage;

SetConsoleCPInfo(TRUE);
}

return S_OK;
return DoSrvSetConsoleOutputCodePage(codepage);
}
CATCH_RETURN();
}
1 change: 1 addition & 0 deletions src/host/getset.h
@@ -50,6 +50,7 @@ void DoSrvSetCursorColor(SCREEN_INFORMATION& screenInfo,

void DoSrvPrivateRefreshWindow(const SCREEN_INFORMATION& screenInfo);

[[nodiscard]] HRESULT DoSrvSetConsoleOutputCodePage(const unsigned int codepage);
void DoSrvGetConsoleOutputCodePage(unsigned int& codepage);

[[nodiscard]] NTSTATUS DoSrvPrivateSuppressResizeRepaint();
11 changes: 11 additions & 0 deletions src/host/outputStream.cpp
@@ -553,6 +553,17 @@ bool ConhostInternalGetSet::PrivateWriteConsoleControlInput(const KeyEvent key)
key));
}

// Routine Description:
// - Connects the SetConsoleOutputCP API call directly into our Driver Message servicing call inside Conhost.exe
// Arguments:
// - codepage - the new output codepage of the console.
// Return Value:
// - true if successful (see DoSrvSetConsoleOutputCodePage). false otherwise.
bool ConhostInternalGetSet::SetConsoleOutputCP(const unsigned int codepage)
{
return SUCCEEDED(DoSrvSetConsoleOutputCodePage(codepage));
}

// Routine Description:
// - Connects the GetConsoleOutputCP API call directly into our Driver Message servicing call inside Conhost.exe
// Arguments:
1 change: 1 addition & 0 deletions src/host/outputStream.hpp
@@ -115,6 +115,7 @@ class ConhostInternalGetSet final : public Microsoft::Console::VirtualTerminal::

bool PrivateWriteConsoleControlInput(const KeyEvent key) override;

bool SetConsoleOutputCP(const unsigned int codepage) override;
bool GetConsoleOutputCP(unsigned int& codepage) override;

bool IsConsolePty() const override;
12 changes: 9 additions & 3 deletions src/terminal/adapter/DispatchTypes.hpp
@@ -99,10 +99,16 @@ namespace Microsoft::Console::VirtualTerminal::DispatchTypes
ASB_AlternateScreenBuffer = 1049
};

enum VTCharacterSets : wchar_t
namespace CharacterSets
{
DEC_LineDrawing = L'0',
USASCII = L'B'
constexpr auto DecSpecialGraphics = std::make_pair(L'0', L'\0');
constexpr auto ASCII = std::make_pair(L'B', L'\0');
}

enum CodingSystem : wchar_t
{
ISO2022 = L'@',
UTF8 = L'G'
};

enum TabClearType : unsigned short
7 changes: 6 additions & 1 deletion src/terminal/adapter/ITermDispatch.hpp
@@ -95,7 +95,12 @@ class Microsoft::Console::VirtualTerminal::ITermDispatch
virtual bool DeviceAttributes() = 0; // DA1
virtual bool Vt52DeviceAttributes() = 0; // VT52 Identify

virtual bool DesignateCharset(const wchar_t wchCharset) = 0; // SCS
virtual bool DesignateCodingSystem(const wchar_t codingSystem) = 0; // DOCS
virtual bool Designate94Charset(const size_t gsetNumber, const std::pair<wchar_t, wchar_t> charset) = 0; // SCS
virtual bool Designate96Charset(const size_t gsetNumber, const std::pair<wchar_t, wchar_t> charset) = 0; // SCS
virtual bool LockingShift(const size_t gsetNumber) = 0; // LS0, LS1, LS2, LS3
virtual bool LockingShiftRight(const size_t gsetNumber) = 0; // LS1R, LS2R, LS3R
virtual bool SingleShift(const size_t gsetNumber) = 0; // SS2, SS3

virtual bool SoftReset() = 0; // DECSTR
virtual bool HardReset() = 0; // RIS