Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClosePseudoConsole API hanging #1810

Closed
sbatten opened this issue Jul 3, 2019 · 58 comments · Fixed by #2525
Closed

ClosePseudoConsole API hanging #1810

sbatten opened this issue Jul 3, 2019 · 58 comments · Fixed by #2525
Labels
Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-1 A description (P1) Product-Conpty For console issues specifically related to conpty Severity-Crash Crashes are real bad news.

Comments

@sbatten
Copy link
Member

sbatten commented Jul 3, 2019

Environment

Windows build number: Microsoft Windows [Version 10.0.18362.175]

Any other software? Visual Studio Code Insiders

Steps to reproduce

  1. Launch VS Code Insiders.
  2. Ensure using conpty with setting: terminal.integrated.windowsEnableConpty
  3. Open a terminal with Ctrl+`
  4. Close the terminal with the trash can icon
  5. Repeat 3 and 4 until the window hangs. It takes around 10 retries on my machine.

Expected behavior

No hanging.

Actual behavior

The window hangs. Using windbg, I see the following:

conpty_stack
conpty_locals
conpty_leftovers

Notice the many leftover conhosts in the task manager as well as the hang in ClosePseudoConsole.

Here is the link to the code in node-pty calling into the PseudoConsole API.
https://github.com/microsoft/node-pty/blob/04445ed76f90b4f56a190982ea2d4fcdd22a0ee7/src/win/conpty.cc#L429

/cc @daimms

@ghost ghost added Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting Needs-Tag-Fix Doesn't match tag requirements labels Jul 3, 2019
@zadjii-msft zadjii-msft added Issue-Bug It either shouldn't be doing this or needs an investigation. Product-Conpty For console issues specifically related to conpty labels Jul 8, 2019
@zadjii-msft zadjii-msft added this to the 20H1 milestone Jul 8, 2019
@DHowett-MSFT DHowett-MSFT added Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. and removed Needs-Triage It's a new issue that the core contributor team needs to triage at the next triage meeting labels Jul 8, 2019
@ghost ghost removed the Needs-Tag-Fix Doesn't match tag requirements label Jul 8, 2019
@DHowett-MSFT
Copy link
Contributor

This just needs somebody to sit down to debug. Thanks for the report.

@zadjii-msft zadjii-msft self-assigned this Aug 6, 2019
@DHowett-MSFT DHowett-MSFT added the Severity-Crash Crashes are real bad news. label Aug 6, 2019
@miniksa
Copy link
Member

miniksa commented Aug 6, 2019

This is because the ConPTY is being started with CreatePseudoConsole() with the flag PSEUDOCONSOLE_INHERIT_CURSOR. There's an indefinite wait for the calling terminal to respond with the cursor position before the startup lock is freed for other threads to use. A different thread gets the shutdown message and is attempting to acquire the same lock to clean things up before exiting.

It looks like if the PTY starts up fast enough so that the shutdown message comes after the cursor is inherited OR if the shutdown happens before the PTY starts asking for the cursor inherit, then everything is good.

Given this is a race condition, I can't really blame it on the caller holding it wrong. I'll work to resolve it such that a shutdown is a valid way of halting the request for the cursor position while it is starting up.

@miniksa
Copy link
Member

miniksa commented Aug 6, 2019

Oh, yeah, conhost is open source now.

This is the VtIo startup code that is waiting to hear the cursor position from the calling terminal:

terminal/src/host/VtIo.cpp

Lines 238 to 255 in 8fa42e0

// MSFT: 15813316
// If the terminal application wants us to inherit the cursor position,
// we're going to emit a VT sequence to ask for the cursor position, then
// read input until we get a response. Terminals who request this behavior
// but don't respond will hang.
// If we get a response, the InteractDispatch will call SetCursorPosition,
// which will call to our VtIo::SetCursorPosition method.
// We need both handles for this initialization to work. If we don't have
// both, we'll skip it. They either aren't going to be reading output
// (so they can't get the DSR) or they can't write the response to us.
if (_lookingForCursorPosition && _pVtRenderEngine && _pVtInputThread)
{
LOG_IF_FAILED(_pVtRenderEngine->RequestCursor());
while (_lookingForCursorPosition)
{
_pVtInputThread->DoReadInput(false);
}
}

And this is the Signal thread that has noticed the shutdown and is attempting to acquire the lock so it can finish closing client process state:

void CloseConsoleProcessState()
{
const CONSOLE_INFORMATION& gci = ServiceLocator::LocateGlobals().getConsoleInformation();
// If there are no connected processes, sending control events is pointless as there's no one do send them to. In
// this case we'll just exit conhost.
// N.B. We can get into this state when a process has a reference to the console but hasn't connected. For example,
// when it's created suspended and never resumed.
if (gci.ProcessHandleList.IsEmpty())
{
ServiceLocator::RundownAndExit(STATUS_SUCCESS);
}
HandleCtrlEvent(CTRL_CLOSE_EVENT);
// Jiggle the handle: (see MSFT:19419231)
// When we call this function, we'll only actually close the console once
// we're totally unlocked. If our caller has the console locked, great,
// we'll displatch the ctrl event once they unlock. However, if they're
// not running under lock (eg PtySignalInputThread::_GetData), then the
// ctrl event will never actually get dispatched.
// So, lock and unlock here, to make sure the ctrl event gets handled.
LockConsole();
auto Unlock = wil::scope_exit([&] { UnlockConsole(); });
}

Here is the stack of the conhost threads when it is stuck:

   0  Id: f84.9168 Suspend: 1 Teb: 0000002d`0bbc3000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 0000002d`0bcfebf8 00007ffb`69b38b6b ntdll!ZwReadFile+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 227] 
01 0000002d`0bcfec00 00007ff6`650e4d1e KERNELBASE!ReadFile+0x7b [minkernel\kernelbase\filehops.c @ 1397] 
02 0000002d`0bcfec70 00007ff6`650e479c conhost!Microsoft::Console::VtInputThread::DoReadInput+0x4a [onecore\windows\core\console\open\src\host\vtinputthread.cpp @ 111] 
03 0000002d`0bcfedd0 00007ff6`650d0b22 conhost!Microsoft::Console::VirtualTerminal::VtIo::StartIfNeeded+0xc0 [onecore\windows\core\console\open\src\host\vtio.cpp @ 251] 
04 0000002d`0bcfee00 00007ff6`650af84f conhost!ConsoleAllocateConsole+0x18d32 [onecore\windows\core\console\open\src\host\srvinit.cpp @ 609] 
05 0000002d`0bcfeeb0 00007ff6`650aaaec conhost!IoDispatchers::ConsoleHandleConnectionRequest+0x2cb [onecore\windows\core\console\open\src\server\iodispatchers.cpp @ 183] 
06 (Inline Function) --------`-------- conhost!IoSorter::ServiceIoOperation+0x15f [onecore\windows\core\console\open\src\server\iosorter.cpp @ 37] 
07 0000002d`0bcff9b0 00007ffb`6b257034 conhost!ConsoleIoThread+0x23c [onecore\windows\core\console\open\src\host\srvinit.cpp @ 667] 
08 0000002d`0bcffb30 00007ffb`6c27b0a1 KERNEL32!BaseThreadInitThunk+0x14 [clientcore\base\win32\client\thread.c @ 64] 
09 0000002d`0bcffb60 00000000`00000000 ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1153] 

   1  Id: f84.90cc Suspend: 1 Teb: 0000002d`0bbc5000 Unfrozen
 # Child-SP          RetAddr           Call Site
00 0000002d`0bd7f508 00007ffb`6c28fb61 ntdll!ZwWaitForAlertByThreadId+0x14 [minkernel\ntdll\daytona\objfre\amd64\usrstubs.asm @ 3875] 
01 0000002d`0bd7f510 00007ffb`6c28fa16 ntdll!RtlpWaitOnAddressWithTimeout+0x81 [minkernel\ntos\rtl\waitaddr.c @ 851] 
02 0000002d`0bd7f540 00007ffb`6c28f83d ntdll!RtlpWaitOnAddress+0xae [minkernel\ntos\rtl\waitaddr.c @ 1094] 
03 0000002d`0bd7f5b0 00007ffb`6c239e24 ntdll!RtlpWaitOnCriticalSection+0xfd [minkernel\ntos\rtl\resource.c @ 1610] 
04 0000002d`0bd7f690 00007ffb`6c239c52 ntdll!RtlpEnterCriticalSectionContended+0x1c4 [minkernel\ntos\rtl\resource.c @ 2317] 
05 0000002d`0bd7f6f0 00007ff6`650ed4c1 ntdll!RtlEnterCriticalSection+0x42 [minkernel\ntos\rtl\resource.c @ 1923] 
06 (Inline Function) --------`-------- conhost!CONSOLE_INFORMATION::LockConsole+0xe [onecore\windows\core\console\open\src\host\consoleinformation.cpp @ 63] 
07 (Inline Function) --------`-------- conhost!LockConsole+0xe [onecore\windows\core\console\open\src\host\handle.cpp @ 16] 
08 0000002d`0bd7f720 00007ff6`650e536a conhost!CloseConsoleProcessState+0x2d [onecore\windows\core\console\open\src\host\output.cpp @ 525] 
09 (Inline Function) --------`-------- conhost!Microsoft::Console::PtySignalInputThread::_Shutdown+0x5 [onecore\windows\core\console\open\src\host\ptysignalinputthread.cpp @ 203] 
0a 0000002d`0bd7f750 00007ff6`650e53b1 conhost!Microsoft::Console::PtySignalInputThread::_GetData+0x56 [onecore\windows\core\console\open\src\host\ptysignalinputthread.cpp @ 158] 
0b 0000002d`0bd7f790 00007ffb`6b257034 conhost!Microsoft::Console::PtySignalInputThread::_InputThread+0x25 [onecore\windows\core\console\open\src\host\ptysignalinputthread.cpp @ 83] 
0c 0000002d`0bd7f7d0 00007ffb`6c27b0a1 KERNEL32!BaseThreadInitThunk+0x14 [clientcore\base\win32\client\thread.c @ 64] 
0d 0000002d`0bd7f800 00000000`00000000 ntdll!RtlUserThreadStart+0x21 [minkernel\ntdll\rtlstrt.c @ 1153] 

The resolution is likely to make the ReadFile inside VtInputThread.cpp and any other synchronous ReadFile and WriteFile operations into OVERLAPPED ones that are asynchronous immediate calls that then WaitForMultipleObjects on the event in the overlapped structure AND on an event that can be signaled by the signal thread that says "Hey, give that up, we're going down!"

bool fSuccess = !!ReadFile(_hFile.get(), buffer, ARRAYSIZE(buffer), &dwRead, nullptr);

@zadjii-msft
Copy link
Member

@Tyriar You might be able to work around this by not passing PSEUDOCONSOLE_INHERIT_CURSOR to CreatePseudoConsole. I don't think you'll need to be doing anything to inherit the cursor position in your scenario, right? The conpty is always being started in a new terminal, right?

Disabling PSEUDOCONSOLE_INHERIT_CURSOR will make conpty assume that the cursor position is starting at 0,0 in an empty buffer, but for VsCode that seems like a reasonable assumption.

(We should definitely still fix the thing that @miniksa found)

@Tyriar
Copy link
Member

Tyriar commented Aug 8, 2019

We added that because for tasks we write to the terminal before the process is launched, we could probably avoid almost all crashes by just using it when a terminal has already been written too which happens pretty infrequently. I'll start the conversation for the workaround in the vscode issue, thanks.

@Tyriar
Copy link
Member

Tyriar commented Aug 16, 2019

The workaround might not be fixing the issue microsoft/vscode#76548 (comment)

@miniksa
Copy link
Member

miniksa commented Aug 16, 2019

@Tyriar, did this behavior change between Code Insiders July 2019 and Code Insiders August 2019?

I was attempting to fix this issue and then VS Code updated itself and now I can no longer repro it the same way.

@miniksa
Copy link
Member

miniksa commented Aug 16, 2019

@Tyriar, did this behavior change between Code Insiders July 2019 and Code Insiders August 2019?

I was attempting to fix this issue and then VS Code updated itself and now I can no longer repro it the same way.

Ugh, it looks like it did. Now to find where the heck I can get the July 2019 one again and have it not auto update.

@Tyriar
Copy link
Member

Tyriar commented Aug 16, 2019

@miniksa installer/zip at the top: https://code.visualstudio.com/updates/v1_36, you can set "update.mode": "none" to disable automatic updates

@miniksa
Copy link
Member

miniksa commented Aug 17, 2019

Beautiful, thank you! @Tyriar

@Tyriar
Copy link
Member

Tyriar commented Aug 20, 2019

It's sounding like it's still happening when we use that flag (since we use it on tasks), just a lot more often than I would have expected based on my experience reproducing microsoft/vscode#71966 (comment)

We can't really disable it for tasks as then the updates don't get printed to the terminal, that was another bug: #919

@miniksa
Copy link
Member

miniksa commented Aug 20, 2019

@Tyriar, OK. I'm working on a fix for the one specific stack/repro I had right now above.

I have a solution for that particular race, but it revealed a deadlock behind that one. It looks like we're attempting to further paint one last frame out the output channel before everything is torn down. My suspicion here is that the write of the last frame is stuck because the read operation is happening on the same thread on your side of the fence as the call to close the pseudo console. So there's a deadlock as the output channel isn't being drained so the paint frame doesn't think it is done and is holding the session open until it is drained.

Again, not something that is specifically your fault. We should be more robust than that.

A workaround for this case might be to close the handles you're not intending to listen to anymore before calling ClosePseudoConsole. Then we can't get stuck on sending you information during teardown because the pipes will be broken. There's plenty of code handling teardown when the PTY pipes are broken.

It's taking me a while to come up with a complete solution here because it seems like there's a wide spread problem in our code with reliability around synchronization and threading for the PTY. Thanks for your patience.

@Tyriar
Copy link
Member

Tyriar commented Aug 20, 2019

@miniksa
Copy link
Member

miniksa commented Aug 20, 2019

@miniksa so https://github.com/microsoft/node-pty/blob/3e645898f83370db2894cdde09ac180235e1cfb7/src/win/conpty.cc#L438 should move above the ClosePseudoConsole call?

No, I mean the handles hIn and hOut inside your pty_baton that you got from opening up the session. hShell looks like a process handle, not pipe handles.

@miniksa
Copy link
Member

miniksa commented Aug 20, 2019

@zadjii-msft, to totally fix this I need to re-evaluate

// MSFT:15506250
// In VT I/O Mode, a client application might die before we've rendered
// the last bit of text they've emitted. So give the VtRenderer one
// last chance to paint before it is killed.
if (s_globals.pRender)
{
s_globals.pRender->TriggerTeardown();
}
.

I don't have full context on the issue you fixed in the past, so I don't want to break that while I'm fixing this.

Right now I'm deadlocked on the Pty Signal Input thread getting a shutdown message and the TriggerTeardown sending a final paint to the Output channel. But the same process that sent the signal isn't consuming the Output channel so it deadlocks here.

If I in any way attempt to make this final paint asynchronous, there's then no guarantee that it will actually write before our process is torn down because we're calling RundownAndExit here.

I'm trying to determine if there's a difference in context between this type of shutdown and the one that you fixed in MSFT: 15506250.

If not, then we probably need to make it so the Pty Signal receiving a shutdown doesn't force the process closed through RundownAndExit but rather politely notifies everyone else that it's time to go and just lets whatever happen happen (and fix up any further issues that leave behind conhosts unexpectedly from there).

Edit:

Locked stack:

 	conhost.exe!Microsoft::Console::Render::VtEngine::_Flush() Line 109	C++
 	conhost.exe!Microsoft::Console::Render::VtEngine::EndPaint() Line 80	C++
 	conhost.exe!Microsoft::Console::Render::XtermEngine::EndPaint() Line 118	C++
 	conhost.exe!Microsoft::Console::Render::Renderer::_PaintFrameForEngine::__l2::<lambda>() Line 96	C++
 	conhost.exe!wil::details::lambda_call<void <lambda>(void) >::reset() Line 459	C++
 	conhost.exe!Microsoft::Console::Render::Renderer::_PaintFrameForEngine(Microsoft::Console::Render::IRenderEngine * const pEngine) Line 127	C++
 	conhost.exe!Microsoft::Console::Render::Renderer::TriggerTeardown() Line 260	C++
>	conhost.exe!Microsoft::Console::Interactivity::ServiceLocator::RundownAndExit(const HRESULT hr) Line 65	C++
 	conhost.exe!Microsoft::Console::PtySignalInputThread::_Shutdown() Line 215	C++
 	conhost.exe!Microsoft::Console::PtySignalInputThread::_GetData(void * const pBuffer, const unsigned long cbBuffer) Line 149	C++
 	conhost.exe!Microsoft::Console::PtySignalInputThread::_InputThread() Line 83	C++
 	conhost.exe!Microsoft::Console::PtySignalInputThread::StaticThreadProc(void * lpParameter) Line 59	C++

Oh, also a disconnect request is coming in here at the same time from the client:

 	conhost.exe!CONSOLE_INFORMATION::LockConsole() Line 64	C++
 	conhost.exe!LockConsole() Line 17	C++
>	conhost.exe!RemoveConsole(ConsoleProcessHandle * ProcessData) Line 197	C++
 	conhost.exe!IoDispatchers::ConsoleClientDisconnectRoutine(_CONSOLE_API_MSG * pMessage) Line 275	C++
 	conhost.exe!IoSorter::ServiceIoOperation(_CONSOLE_API_MSG * const pMsg, _CONSOLE_API_MSG * * ReplyMsg) Line 41	C++
 	conhost.exe!ConsoleIoThread(void * __formal) Line 668	C++

@zadjii-msft
Copy link
Member

So, the situation MSFT:15506250 is referring to is back in the earliest WSL tests of the conpty API. When you're using conpty to host a commandline like cmd /c dir, the client app could run and exit before we've rendered all their output. In this scenario, the last client exits, and conhost will tear itself down because there are no clients left. The "terminal" (in this scenario, the linux subsystem) isn't breaking the signal pipe to close the conpty. I believe WSL will keep reading the output pipe until the pipe is broken, and use that as a signal that the child process died.

@benhillis can keep me honest if I'm mis-remembering this.

This feels to me specifically different than the ClosePseudoConsole route, and I think you'd be fine changing it as you suggest. If someone called ClosePseudoConsole, then we could presume that they don't care anymore about something we have buffered, while in the scenario where the client exits (triggering our exit), we'd still want to deliver all the buffered output.

@Tyriar
Copy link
Member

Tyriar commented Aug 21, 2019

I just hit the hang when closing a regular terminal (not a task) which has PSEUDOCONSOLE_INHERIT_CURSOR set to 0. Killing conhost freed up the UI again.

@miniksa
Copy link
Member

miniksa commented Aug 22, 2019

I just hit the hang when closing a regular terminal (not a task) which has PSEUDOCONSOLE_INHERIT_CURSOR set to 0. Killing conhost freed up the UI again.

Next time you hit the hang or any other hang like this, can you please take a process dump of the supposedly hung conhost so I can check the stacks?

If you're hitting a similar situation without the flag set, then you're getting hung in a different situation which I'm not necessarily working on fixing right now in this issue.

@Tyriar
Copy link
Member

Tyriar commented Jul 8, 2020

Note that I've made a PR that fixes this in node-pty microsoft/node-pty#415, however it does not yet work in Electron 8 (haven't tried 9) when node integration is enabled as worker_threads doesn't seem to work electron/electron#18540.

@Tyriar
Copy link
Member

Tyriar commented Sep 25, 2020

VS Code update: Electron isn't going to support worker_threads in the renderer thread, so the fix is blocked on moving where we host node-pty/conpty to a real node process: microsoft/node-pty#415 (comment)

@christianparpart
Copy link

christianparpart commented Sep 29, 2020

As mentioned earlier in this thread:

to be 100% clear, VS Code can fix this on the client end by properly threading the handles received from the ConPTY. The hang is because all of the ConPTY handles are being serviced on the same thread in VS Code. To resolve this condition, the output and input handles should be managed on separate threads/queues than the one that is managing the session.

@christianparpart could you do something like that?

I am currently actively looking into finally working around this ConPTY behavior.

Honestly, if your above quoted workaround is the only way to work around this, I'd rather drop Windows support (sadly) or just wait until maybe or maybe not eventually ConPTY can handle with read/write operations in one thread. The reasoning is not that I want to be stubborn but rather that ConPTY (windows) is the only platform that requires this architectural change just to not freeze at shutdown.

I'm trying it though, but it seems like not worth it for now. :)

UPDATE: Okay, I've actually forgotten my own source code, and I remember why I did already split up PTY reads from writes (because ConPTY does not support non-blocking I/O).

My problem was, that if the connected process behind ConPTY terminated, the already started ConPTY's read() call did not return and does hang forever. This is still the case with latest software updates, and I actually do consider this a bad (or buggy?) behaviour compared to the other PTY implementations cancelling any active I/O operation when the other end disconnects.
I did add a process exit watcher that does explicitly close the terminal's end of the PTY, and now the read-call is not hanging anymore.

Thanks :)

@JustinGrote
Copy link

@miniksa @Tyriar has merged microsoft/node-pty#415 (comment) into Node-pty. Can you provide an update when this might make it upstream to VSCode?

@Tyriar
Copy link
Member

Tyriar commented Feb 10, 2021

Probably tomorrow's insiders, follow microsoft/vscode#116185 for updates

@TheLizzard
Copy link

Any updates? I am also having the same bug in my code. When I call ClosePseudoConsole my program hangs. It only happens when I call CreatePseudoConsole with dwFlags=1. Btw my project is written in python using the windows API.

@zadjii-msft zadjii-msft modified the milestones: Windows vNext, 22H2 Jan 4, 2022
@miniksa miniksa removed their assignment Mar 31, 2022
@Anduh
Copy link

Anduh commented May 9, 2022

microsoft/vscode#144016 is likely related

ghost pushed a commit that referenced this issue Oct 13, 2022
Problem:
* Calling `RundownAndExit` tries to flush out the last frame from `VtEngine`
* `VtEngine` indirectly calls `RundownAndExit` if the pipe is gone via `VtIo`
* `RundownAndExit` is called by other parts of OpenConsole
* `RundownAndExit` must be `[[noreturn]]` for most parts of OpenConsole
* `VtIo` itself has a mutex ensuring orderly shutdown
* In short, doing a thread safe orderly shutdown requires us to hold both,
  a lock in `RundownAndExit` and `VtIo` at the same time, but while other parts
  need a blocking `RundownAndExit`, `VtIo` needs a non-blocking one
* Refactoring the code to use optionally non-blocking `RundownAndExit`
  requires refactoring and might prove to be just as buggy

Solution:
* Simply don't call `RundownAndExit` in `VtEngine` at all
* In case the write pipe breaks:
  * `VtEngine` will close the handle
  * The client should notice that their read pipe is broken and
    close their write pipe sooner or later
  * Once we notice that our read pipe is broken, we call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` but
    without a pipe it won't do anything
* In case the read pipe breaks or any other part calls `RundownAndExit`:
  * We call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` and depending on whether
    the write pipe is broken or not it will simply write into it or ignore it

Closes #14132
Pretty sure this also applies to #1810

## Validation Steps Performed
* Open 5 tabs and run MSYS2's `bash --login` in each of them
* `Enter-VsDevShell` in another tab
* Close window
* 5 tab processes are killed instantly, 1 after ~3s ✅
* Replace conhost with OpenConsole via sfpcopy
* Launch Dozens of Git Bash tabs in VS Code
* Close them randomly
* Remaining ones still work, processes are gone ✅
DHowett pushed a commit that referenced this issue Oct 13, 2022
Problem:
* Calling `RundownAndExit` tries to flush out the last frame from `VtEngine`
* `VtEngine` indirectly calls `RundownAndExit` if the pipe is gone via `VtIo`
* `RundownAndExit` is called by other parts of OpenConsole
* `RundownAndExit` must be `[[noreturn]]` for most parts of OpenConsole
* `VtIo` itself has a mutex ensuring orderly shutdown
* In short, doing a thread safe orderly shutdown requires us to hold both,
  a lock in `RundownAndExit` and `VtIo` at the same time, but while other parts
  need a blocking `RundownAndExit`, `VtIo` needs a non-blocking one
* Refactoring the code to use optionally non-blocking `RundownAndExit`
  requires refactoring and might prove to be just as buggy

Solution:
* Simply don't call `RundownAndExit` in `VtEngine` at all
* In case the write pipe breaks:
  * `VtEngine` will close the handle
  * The client should notice that their read pipe is broken and
    close their write pipe sooner or later
  * Once we notice that our read pipe is broken, we call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` but
    without a pipe it won't do anything
* In case the read pipe breaks or any other part calls `RundownAndExit`:
  * We call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` and depending on whether
    the write pipe is broken or not it will simply write into it or ignore it

Closes #14132
Pretty sure this also applies to #1810

## Validation Steps Performed
* Open 5 tabs and run MSYS2's `bash --login` in each of them
* `Enter-VsDevShell` in another tab
* Close window
* 5 tab processes are killed instantly, 1 after ~3s ✅
* Replace conhost with OpenConsole via sfpcopy
* Launch Dozens of Git Bash tabs in VS Code
* Close them randomly
* Remaining ones still work, processes are gone ✅

(cherry picked from commit 1774cfd)
Service-Card-Id: 86174637
Service-Version: 1.16
DHowett pushed a commit that referenced this issue Oct 13, 2022
Problem:
* Calling `RundownAndExit` tries to flush out the last frame from `VtEngine`
* `VtEngine` indirectly calls `RundownAndExit` if the pipe is gone via `VtIo`
* `RundownAndExit` is called by other parts of OpenConsole
* `RundownAndExit` must be `[[noreturn]]` for most parts of OpenConsole
* `VtIo` itself has a mutex ensuring orderly shutdown
* In short, doing a thread safe orderly shutdown requires us to hold both,
  a lock in `RundownAndExit` and `VtIo` at the same time, but while other parts
  need a blocking `RundownAndExit`, `VtIo` needs a non-blocking one
* Refactoring the code to use optionally non-blocking `RundownAndExit`
  requires refactoring and might prove to be just as buggy

Solution:
* Simply don't call `RundownAndExit` in `VtEngine` at all
* In case the write pipe breaks:
  * `VtEngine` will close the handle
  * The client should notice that their read pipe is broken and
    close their write pipe sooner or later
  * Once we notice that our read pipe is broken, we call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` but
    without a pipe it won't do anything
* In case the read pipe breaks or any other part calls `RundownAndExit`:
  * We call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` and depending on whether
    the write pipe is broken or not it will simply write into it or ignore it

Closes #14132
Pretty sure this also applies to #1810

## Validation Steps Performed
* Open 5 tabs and run MSYS2's `bash --login` in each of them
* `Enter-VsDevShell` in another tab
* Close window
* 5 tab processes are killed instantly, 1 after ~3s ✅
* Replace conhost with OpenConsole via sfpcopy
* Launch Dozens of Git Bash tabs in VS Code
* Close them randomly
* Remaining ones still work, processes are gone ✅

(cherry picked from commit 1774cfd)
Service-Card-Id: 86178271
Service-Version: 1.15
carlos-zamora pushed a commit that referenced this issue Dec 5, 2022
Problem:
* Calling `RundownAndExit` tries to flush out the last frame from `VtEngine`
* `VtEngine` indirectly calls `RundownAndExit` if the pipe is gone via `VtIo`
* `RundownAndExit` is called by other parts of OpenConsole
* `RundownAndExit` must be `[[noreturn]]` for most parts of OpenConsole
* `VtIo` itself has a mutex ensuring orderly shutdown
* In short, doing a thread safe orderly shutdown requires us to hold both,
  a lock in `RundownAndExit` and `VtIo` at the same time, but while other parts
  need a blocking `RundownAndExit`, `VtIo` needs a non-blocking one
* Refactoring the code to use optionally non-blocking `RundownAndExit`
  requires refactoring and might prove to be just as buggy

Solution:
* Simply don't call `RundownAndExit` in `VtEngine` at all
* In case the write pipe breaks:
  * `VtEngine` will close the handle
  * The client should notice that their read pipe is broken and
    close their write pipe sooner or later
  * Once we notice that our read pipe is broken, we call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` but
    without a pipe it won't do anything
* In case the read pipe breaks or any other part calls `RundownAndExit`:
  * We call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` and depending on whether
    the write pipe is broken or not it will simply write into it or ignore it

Closes #14132
Pretty sure this also applies to #1810

## Validation Steps Performed
* Open 5 tabs and run MSYS2's `bash --login` in each of them
* `Enter-VsDevShell` in another tab
* Close window
* 5 tab processes are killed instantly, 1 after ~3s ✅
* Replace conhost with OpenConsole via sfpcopy
* Launch Dozens of Git Bash tabs in VS Code
* Close them randomly
* Remaining ones still work, processes are gone ✅
@zadjii-msft zadjii-msft modified the milestones: 22H2, Terminal v1.17 Dec 5, 2022
lhecker added a commit that referenced this issue Dec 5, 2022
Problem:
* Calling `RundownAndExit` tries to flush out the last frame from `VtEngine`
* `VtEngine` indirectly calls `RundownAndExit` if the pipe is gone via `VtIo`
* `RundownAndExit` is called by other parts of OpenConsole
* `RundownAndExit` must be `[[noreturn]]` for most parts of OpenConsole
* `VtIo` itself has a mutex ensuring orderly shutdown
* In short, doing a thread safe orderly shutdown requires us to hold both,
  a lock in `RundownAndExit` and `VtIo` at the same time, but while other parts
  need a blocking `RundownAndExit`, `VtIo` needs a non-blocking one
* Refactoring the code to use optionally non-blocking `RundownAndExit`
  requires refactoring and might prove to be just as buggy

Solution:
* Simply don't call `RundownAndExit` in `VtEngine` at all
* In case the write pipe breaks:
  * `VtEngine` will close the handle
  * The client should notice that their read pipe is broken and
    close their write pipe sooner or later
  * Once we notice that our read pipe is broken, we call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` but
    without a pipe it won't do anything
* In case the read pipe breaks or any other part calls `RundownAndExit`:
  * We call `RundownAndExit`
  * `RundownAndExit` might call back into `VtEngine` and depending on whether
    the write pipe is broken or not it will simply write into it or ignore it

Closes #14132
Pretty sure this also applies to #1810

## Validation Steps Performed
* Open 5 tabs and run MSYS2's `bash --login` in each of them
* `Enter-VsDevShell` in another tab
* Close window
* 5 tab processes are killed instantly, 1 after ~3s ✅
* Replace conhost with OpenConsole via sfpcopy
* Launch Dozens of Git Bash tabs in VS Code
* Close them randomly
* Remaining ones still work, processes are gone ✅

(cherry picked from commit 1774cfd)
@zadjii-msft
Copy link
Member

We are pretty sure that #14160 fixed this?

@Tyriar
Copy link
Member

Tyriar commented Dec 6, 2022

Nice! Sounds like it.

@lhecker
Copy link
Member

lhecker commented Dec 6, 2022

There's still an easily reproducible bug in WSL however. It calls ClosePseudoConsole on the same thread that's reading from the output pipe and so it can deadlock (conhost is blocked writing to the pipe on exit, WSL is blocked waiting for conhost to exit). I'm planning to contact them and introduce a new ClosePseudoConsoleTimeout function to simplify the process (a timeout of 0 will equate to a fully async shutdown).

The bug reproduces if you execute any CLI application that writes more than 4kB of text before conhost has a chance to exit. This happens with applications similar to cat in particular and is a race condition in most situations. The deadlock is resolved once you run wsl --shutdown.

@zadjii-msft
Copy link
Member

I'm gonna close this out, because we think this was fixed in 1.17. It'll obviously be a while before we can get a newer version ingested into VsCode (cause we basically need to do... #6999? That doesn't look right but it's close), but the code on our end should be finished.

If this still repros with the 1.17+ version of the API, then we'll probably need something that builds on that fix, rather than revert that and try something new.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Server Down in the muck of API call servicing, interprocess communication, eventing, etc. Issue-Bug It either shouldn't be doing this or needs an investigation. Needs-Tag-Fix Doesn't match tag requirements Priority-1 A description (P1) Product-Conpty For console issues specifically related to conpty Severity-Crash Crashes are real bad news.
Projects
None yet