Improve support for VT character sets (#4496)

This PR improves our VT character set support. It enables the [`SCS`] escape sequences to designate into all four G-sets with both 94- and 96-character sets, and supports invoking those G-sets into both the GL and GR areas of the code table, with [locking shifts] and [single shifts]. It also adds [`DOCS`] sequences to switch between UTF-8 and the ISO-2022 coding system (which is what the VT character sets require), and adds support for a lot more character sets, up to around the level of a VT510.

[`SCS`]: https://vt100.net/docs/vt510-rm/SCS.html
[locking shifts]: https://vt100.net/docs/vt510-rm/LS.html
[single shifts]: https://vt100.net/docs/vt510-rm/SS.html
[`DOCS`]: https://en.wikipedia.org/wiki/ISO/IEC_2022#Interaction_with_other_coding_systems

## Detailed Description of the Pull Request / Additional comments

To make it easier for us to declare a bunch of character sets, I've made a little `constexpr` class that can build up a mapping table from a base character set (ASCII or Latin1), along with a collection of mappings for the characters that deviate from the base set. Many of the character sets are simple variations of ASCII, so they're easy to define this way. This class then casts directly to a `wstring_view`, which is how the translation tables are represented in most of the code.

We have an array of four of these tables representing the four G-sets, two instances for the active left and right tables, and one instance for the single shift table.

Initially we had just one `DesignateCharset` method, which could select the active character set. We now have two designate methods (for 94- and 96-character sets), and each takes a G-set number specifying the target of the designation, and a pair of characters identifying the character set that will be designated (at the higher VT levels, character sets are often identified by more than one character).

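To illustrate the table-building idea described above, here is a minimal sketch of a `constexpr` class that starts from ASCII, applies a list of replacements, and converts to a `wstring_view`. It is not the class from this PR: the names `AsciiBasedCharset` and `BritishSketch` are hypothetical, and the layout (first entry at 0x20, last at 0x7E) is an assumption made for the example. The only character set fact it relies on is that the British set replaces `#` with `£`.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string_view>
#include <utility>

// Hypothetical sketch, not the PR's class: a table covering 0x20..0x7E,
// initialised to ASCII and then patched with the deviating characters.
class AsciiBasedCharset
{
public:
    static constexpr wchar_t FirstChar = L'\x20'; // assumed table origin
    static constexpr std::size_t Size = 95; // 0x20..0x7E, DEL left unmapped

    constexpr AsciiBasedCharset(std::initializer_list<std::pair<wchar_t, wchar_t>> replacements) :
        _table{}
    {
        // Start with the identity (ASCII) mapping.
        for (std::size_t i = 0; i < Size; ++i)
        {
            _table[i] = static_cast<wchar_t>(FirstChar + i);
        }
        // Overwrite only the positions that differ from the base set.
        for (const auto& replacement : replacements)
        {
            assert(replacement.first >= FirstChar);
            const auto index = static_cast<std::size_t>(replacement.first - FirstChar);
            assert(index < Size);
            _table[index] = replacement.second;
        }
    }

    // The table is consumed elsewhere as a plain view of wide characters.
    constexpr operator std::wstring_view() const noexcept
    {
        return { _table, Size };
    }

private:
    wchar_t _table[Size];
};

// Example definition: ASCII with '#' replaced by the pound sign.
inline constexpr AsciiBasedCharset BritishSketch{ { L'#', L'\u00A3' } };
```

Because the object is `constexpr` with static storage, the resulting `wstring_view` can be handed around freely without copying the table.
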
There are then two new `LockingShift` methods to invoke these G-sets into either the GL or GR area of the code table, and a `SingleShift` method which invokes a G-set temporarily (for just the next character that is output).

I should mention here that I had to make some changes to the state machine to make these single shift sequences work. The problem is that the input state machine treats `SS3` as the start of a control sequence, while the output state machine needs it to be dispatched immediately (it's literally the _Single Shift 3_ escape sequence). To make that work, I've added a `ParseControlSequenceAfterSs3` callback in the `IStateMachineEngine` interface to decide which behavior is appropriate.

When it comes to mapping a character, it's simply an array reference into the appropriate `wstring_view` table. If the single shift table is set, that takes precedence. Otherwise the GL table is used for characters in the range 0x20 to 0x7F, and the GR table for characters 0xA0 to 0xFF (technically some character sets will only map up to 0x7E and 0xFE, but that's easily controlled by the length of the `wstring_view`).

The `DEL` character is a bit of a special case. By default it's meant to be ignored like the `NUL` character (it's essentially a time-fill character). However, it's possible that it could be remapped to a printable character in a 96-character set, so we need to check for that after the translation. This is handled in the `AdaptDispatch::Print` method, so it doesn't interfere with the primary `PrintString` code path.

The biggest problem with this whole process, though, is that the GR mappings only really make sense if you have access to the raw output, but by the time the output gets to us, it would already have been translated to Unicode by the active code page. And in the case of UTF-8, the characters we eventually receive may originally have been composed from two or more code points.

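The character lookup described above (single shift table first, otherwise GL or GR depending on the range) could be sketched roughly like this. The names `TranslationState`, `Lookup` and `Translate` are hypothetical, and two details are assumptions made for the example rather than facts about the PR: that entry 0 of each table corresponds to 0x20 (GL) or 0xA0 (GR), and that a pending single shift is cleared after one character regardless of which range that character falls in. The `DEL` special case is not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical state: empty views mean "no translation".
struct TranslationState
{
    std::wstring_view gl;
    std::wstring_view gr;
    std::wstring_view singleShift; // non-empty only while a single shift is pending
};

// Index into a table whose first entry corresponds to `base`.
// Anything beyond the end of the table is passed through unchanged,
// which is how a shorter table limits the mapped range.
inline wchar_t Lookup(std::wstring_view table, wchar_t ch, wchar_t base)
{
    assert(base == 0x20 || base == 0xA0);
    if (ch >= base && static_cast<std::size_t>(ch - base) < table.size())
    {
        return table[static_cast<std::size_t>(ch - base)];
    }
    return ch;
}

inline wchar_t Translate(TranslationState& state, wchar_t ch)
{
    if (!state.singleShift.empty())
    {
        // A single shift applies to the next character only.
        const auto table = state.singleShift;
        state.singleShift = {};
        return Lookup(table, ch, ch >= 0xA0 ? L'\xA0' : L'\x20');
    }
    if (ch >= 0x20 && ch <= 0x7F)
    {
        return Lookup(state.gl, ch, L'\x20');
    }
    if (ch >= 0xA0 && ch <= 0xFF)
    {
        return Lookup(state.gr, ch, L'\xA0');
    }
    return ch; // control characters and anything above 0xFF are untouched
}
```

With a four-entry GL table such as `L" !\"\u00A3"`, only 0x20 to 0x23 are mapped, so `#` becomes `£` while every other character passes through as-is.
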
The way I've dealt with this is to disable the GR translations by default, and add support for a pair of ISO-2022 `DOCS` sequences, which can switch the code page between UTF-8 and ISO-8859-1. When the code page is ISO-8859-1, we're essentially receiving the raw output bytes, so it's safe to enable the GR translations. This is not strictly correct ISO-2022 behavior, and there are edge cases where it's not going to work, but it's the best solution I could come up with.

## Validation Steps Performed

As a result of the `SS3` changes in the state machine engine, I've had to move the existing `SS3` tests from the `OutputEngineTest` to the `InputEngineTest`, otherwise they would now fail (technically they should never have been output tests).

I've added no additional unit tests, but I have done a lot of manual testing, and made sure we passed all the character set tests in Vttest (at least for the character sets we currently support). Note that this required a slightly hacked version of the app, since by default it doesn't expose a lot of the tests to low-level terminals, and we currently identify as a VT100.

Closes #3377
Closes #3487

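As a rough illustration of the `DOCS` handling from the detailed description, the sketch below maps the final character of a `DOCS` sequence to a code page and only enables GR translation when the code page is ISO-8859-1. The final characters `@` (ISO-2022) and `G` (UTF-8) match the `CodingSystem` values in this commit, and 65001 and 28591 are the Windows code page identifiers for UTF-8 and ISO-8859-1. Everything else is hypothetical: `OutputState`, `CodePageFor` and `ApplyCodingSystem` are illustrative names, and the UTF-8 default is an assumption.

```cpp
#include <cassert>
#include <optional>

constexpr unsigned int Utf8CodePage = 65001;
constexpr unsigned int Iso8859_1CodePage = 28591;

// Hypothetical terminal state for the sketch (UTF-8 default is assumed).
struct OutputState
{
    unsigned int codePage = Utf8CodePage;
    bool grTranslationEnabled = false;
};

// Maps the final character of a DOCS sequence to a code page.
constexpr std::optional<unsigned int> CodePageFor(wchar_t finalChar)
{
    switch (finalChar)
    {
    case L'@': // ISO-2022 coding system
        return Iso8859_1CodePage;
    case L'G': // UTF-8
        return Utf8CodePage;
    default:
        return std::nullopt;
    }
}

// Returns false and leaves the state alone for an unrecognised final character.
inline bool ApplyCodingSystem(OutputState& state, wchar_t finalChar)
{
    const auto codePage = CodePageFor(finalChar);
    if (!codePage.has_value())
    {
        return false;
    }
    state.codePage = *codePage;
    // GR translation is only meaningful when we receive the raw bytes.
    state.grTranslationEnabled = (*codePage == Iso8859_1CodePage);
    assert(!state.grTranslationEnabled || state.codePage == Iso8859_1CodePage);
    return true;
}
```

Tying the GR flag to the code page in one place keeps the two from drifting apart, which is the property the description depends on.
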
microsoft · Jun 4, 2020 · 96a77cb
1 parent 7b48912
commit 96a77cb

Showing 30 changed files with 1,704 additions and 328 deletions.
diff --git a/.github/actions/spell-check/dictionary/dictionary.txt b/.github/actions/spell-check/dictionary/dictionary.txt
@@ -97786,6 +97786,7 @@ dalesman
 dalesmen
 dalespeople
 daleswoman
+dalet
 daleth
 daleths
 Daleville
@@ -107561,6 +107562,7 @@ dialystelic
 dialystely
 dialytic
 dialytically
+dialytika
 dialyzability
 dialyzable
 dialyzate
@@ -114004,6 +114006,7 @@ djalmaite
 Djambi
 djasakid
 djave
+dje
 djebel
 djebels
 djehad
@@ -120418,11 +120421,13 @@ DZ
 dz
 dz.
 Dzaudzhikau
+dze
 dzeren
 dzerin
 dzeron
 Dzerzhinsk
 Dzhambul
+dzhe
 Dzhugashvili
 dziggetai
 dzo
@@ -158966,6 +158971,7 @@ Ghaznevid
 Ghazzah
 Ghazzali
 ghbor
+ghe
 Gheber
 gheber
 ghebeta
@@ -160166,6 +160172,7 @@ gizzards
 gizzen
 gizzened
 gizzern
+gje
 gjedost
 Gjellerup
 gjetost
@@ -212347,6 +212354,7 @@ Kizilbash
 Kizzee
 Kizzie
 kJ
+kje
 Kjeldahl
 kjeldahlization
 kjeldahlize
@@ -224856,6 +224864,7 @@ lizzie
 Lizzy
 LJ
 LJBF
+lje
 Ljod
 Ljoka
 Ljubljana
@@ -261607,6 +261616,7 @@ N.J.
 NJ
 nj
 njave
+nje
 Njord
 Njorth
 NKGB
@@ -274785,6 +274795,7 @@ Ogma
 ogmic
 Ogmios
 OGO
+ogonek
 ogonium
 Ogor
 O'Gowan
@@ -329834,6 +329845,7 @@ QN
 qn
 QNP
 QNS
+qof
 Qoheleth
 Qom
 qoph
@@ -371408,6 +371420,7 @@ Shaysite
 shazam
 Shazar
 SHCD
+shcha
 Shcheglovsk
 Shcherbakov
 she
@@ -420973,6 +420986,7 @@ tonometry
 Tonopah
 tonophant
 tonoplast
+tonos
 tonoscope
 tonotactic
 tonotaxis
@@ -428676,6 +428690,7 @@ tsetses
 TSF
 TSgt
 TSH
+tshe
 Tshi
 tshi
 Tshiluba
@@ -477068,6 +477083,7 @@ Yermo
 yern
 yertchuk
 yerth
+yeru
 yerva
 Yerwa-Maiduguri
 Yerxa
@@ -478235,6 +478251,7 @@ Z-bar
 ZBB
 ZBR
 ZD
+ze
 Zea
 zea
 zeal
@@ -478604,6 +478621,7 @@ ZG
 ZGS
 Zhang
 Zhdanov
+zhe
 Zhitomir
 Zhivkov
 Zhmud

diff --git a/.github/actions/spell-check/expect/expect.txt b/.github/actions/spell-check/expect/expect.txt
@@ -906,6 +906,7 @@ grep
 Greyscale
 gridline
 groupbox
+gset
 gsl
 GTP
 guc
@@ -1530,6 +1531,7 @@ NOYIELD
 NOZORDER
 NPM
 npos
+NRCS
 NSTATUS
 ntapi
 ntcon
@@ -2030,6 +2032,7 @@ SCROLLSCALE
 SCROLLSCREENBUFFER
 Scrollup
 Scrolluppage
+SCS
 scursor
 sddl
 sdeleted
@@ -2444,6 +2447,7 @@ untimes
 UPDATEDISPLAY
 UPDOWN
 UPKEY
+UPSS
 upvote
 uri
 url

diff --git a/.github/actions/spell-check/patterns/patterns.txt b/.github/actions/spell-check/patterns/patterns.txt
@@ -1,11 +1,13 @@
 https://(?:(?:[-a-zA-Z0-9?&=]*\.|)microsoft\.com)/[-a-zA-Z0-9?&=_#\/.]*
 https://aka\.ms/[-a-zA-Z0-9?&=\/_]*
+https://www\.itscj\.ipsj\.or\.jp/iso-ir/[-0-9]+\.pdf
+https://www\.vt100\.net/docs/[-a-zA-Z0-9#_\/.]*
 https://www.w3.org/[-a-zA-Z0-9?&=\/_#]*
 https://(?:(?:www\.|)youtube\.com|youtu.be)/[-a-zA-Z0-9?&=]*
 https://[a-z-]+\.githubusercontent\.com/[-a-zA-Z0-9?&=_\/.]*
 [Pp]ublicKeyToken="?[0-9a-fA-F]{16}"?
 (?:[{"]|UniqueIdentifier>)[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}(?:[}"]|</UniqueIdentifier)
-(?:0[Xx]|U\+|#)[a-f0-9A-FGgRr]{2,}[Uu]?[Ll]?\b
+(?:0[Xx]|\\x|U\+|#)[a-f0-9A-FGgRr]{2,}[Uu]?[Ll]{0,2}\b
 microsoft/cascadia-code\@[0-9a-fA-F]{40}
 \d+x\d+Logo
 Scro\&ll

diff --git a/src/host/getset.cpp b/src/host/getset.cpp
@@ -922,6 +922,25 @@ void ApiRoutines::GetLargestConsoleWindowSizeImpl(const SCREEN_INFORMATION& cont
     CATCH_RETURN();
 }
 
+[[nodiscard]] HRESULT DoSrvSetConsoleOutputCodePage(const unsigned int codepage)
+{
+    CONSOLE_INFORMATION& gci = ServiceLocator::LocateGlobals().getConsoleInformation();
+
+    // Return if it's not known as a valid codepage ID.
+    RETURN_HR_IF(E_INVALIDARG, !(IsValidCodePage(codepage)));
+
+    // Do nothing if no change.
+    if (gci.OutputCP != codepage)
+    {
+        // Set new code page
+        gci.OutputCP = codepage;
+
+        SetConsoleCPInfo(TRUE);
+    }
+
+    return S_OK;
+}
+
 // Routine Description:
 // - Sets the codepage used for translating text when calling A versions of functions affecting the output buffer.
 // Arguments:
@@ -932,23 +951,9 @@ void ApiRoutines::GetLargestConsoleWindowSizeImpl(const SCREEN_INFORMATION& cont
 {
     try
     {
-        CONSOLE_INFORMATION& gci = ServiceLocator::LocateGlobals().getConsoleInformation();
         LockConsole();
         auto Unlock = wil::scope_exit([&] { UnlockConsole(); });
-
-        // Return if it's not known as a valid codepage ID.
-        RETURN_HR_IF(E_INVALIDARG, !(IsValidCodePage(codepage)));
-
-        // Do nothing if no change.
-        if (gci.OutputCP != codepage)
-        {
-            // Set new code page
-            gci.OutputCP = codepage;
-
-            SetConsoleCPInfo(TRUE);
-        }
-
-        return S_OK;
+        return DoSrvSetConsoleOutputCodePage(codepage);
     }
     CATCH_RETURN();
 }

diff --git a/src/host/getset.h b/src/host/getset.h
@@ -50,6 +50,7 @@ void DoSrvSetCursorColor(SCREEN_INFORMATION& screenInfo,
 
 void DoSrvPrivateRefreshWindow(const SCREEN_INFORMATION& screenInfo);
 
+[[nodiscard]] HRESULT DoSrvSetConsoleOutputCodePage(const unsigned int codepage);
 void DoSrvGetConsoleOutputCodePage(unsigned int& codepage);
 
 [[nodiscard]] NTSTATUS DoSrvPrivateSuppressResizeRepaint();

diff --git a/src/host/outputStream.cpp b/src/host/outputStream.cpp
@@ -553,6 +553,17 @@ bool ConhostInternalGetSet::PrivateWriteConsoleControlInput(const KeyEvent key)
                                                           key));
 }
 
+// Routine Description:
+// - Connects the SetConsoleOutputCP API call directly into our Driver Message servicing call inside Conhost.exe
+// Arguments:
+// - codepage - the new output codepage of the console.
+// Return Value:
+// - true if successful (see DoSrvSetConsoleOutputCodePage). false otherwise.
+bool ConhostInternalGetSet::SetConsoleOutputCP(const unsigned int codepage)
+{
+    return SUCCEEDED(DoSrvSetConsoleOutputCodePage(codepage));
+}
+
 // Routine Description:
 // - Connects the GetConsoleOutputCP API call directly into our Driver Message servicing call inside Conhost.exe
 // Arguments:

diff --git a/src/host/outputStream.hpp b/src/host/outputStream.hpp
@@ -115,6 +115,7 @@ class ConhostInternalGetSet final : public Microsoft::Console::VirtualTerminal::
 
     bool PrivateWriteConsoleControlInput(const KeyEvent key) override;
 
+    bool SetConsoleOutputCP(const unsigned int codepage) override;
     bool GetConsoleOutputCP(unsigned int& codepage) override;
 
     bool IsConsolePty() const override;

diff --git a/src/terminal/adapter/DispatchTypes.hpp b/src/terminal/adapter/DispatchTypes.hpp
@@ -99,10 +99,16 @@ namespace Microsoft::Console::VirtualTerminal::DispatchTypes
         ASB_AlternateScreenBuffer = 1049
     };
 
-    enum VTCharacterSets : wchar_t
+    namespace CharacterSets
     {
-        DEC_LineDrawing = L'0',
-        USASCII = L'B'
+        constexpr auto DecSpecialGraphics = std::make_pair(L'0', L'\0');
+        constexpr auto ASCII = std::make_pair(L'B', L'\0');
+    }
+
+    enum CodingSystem : wchar_t
+    {
+        ISO2022 = L'@',
+        UTF8 = L'G'
     };
 
     enum TabClearType : unsigned short

diff --git a/src/terminal/adapter/ITermDispatch.hpp b/src/terminal/adapter/ITermDispatch.hpp
@@ -95,7 +95,12 @@ class Microsoft::Console::VirtualTerminal::ITermDispatch
     virtual bool DeviceAttributes() = 0; // DA1
     virtual bool Vt52DeviceAttributes() = 0; // VT52 Identify
 
-    virtual bool DesignateCharset(const wchar_t wchCharset) = 0; // SCS
+    virtual bool DesignateCodingSystem(const wchar_t codingSystem) = 0; // DOCS
+    virtual bool Designate94Charset(const size_t gsetNumber, const std::pair<wchar_t, wchar_t> charset) = 0; // SCS
+    virtual bool Designate96Charset(const size_t gsetNumber, const std::pair<wchar_t, wchar_t> charset) = 0; // SCS
+    virtual bool LockingShift(const size_t gsetNumber) = 0; // LS0, LS1, LS2, LS3
+    virtual bool LockingShiftRight(const size_t gsetNumber) = 0; // LS1R, LS2R, LS3R
+    virtual bool SingleShift(const size_t gsetNumber) = 0; // SS2, SS3
 
     virtual bool SoftReset() = 0; // DECSTR
     virtual bool HardReset() = 0; // RIS