Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Reverse Diagnostics Server #33307

Merged
merged 52 commits into from
Apr 20, 2020

Conversation

josalem
Copy link
Contributor

@josalem josalem commented Mar 6, 2020

This PR adds the ability for the runtime to connect to a pre-existing IPC Transport for the Diagnostics Server. This enables several interesting scenarios, including:

  • Many apps connecting to a single diagnostics agent
  • tracing via IPC at startup (imagine doing dotnet trace run <executable> ...)
  • A busybox-esque agent being configurable in a side-car container
  • A host-based agent running and tracing an application in a container

This is achieved, by adding the non-blocking Accept (think listen on Linux), Connect, and Poll APIs to the diagnostics PAL. Using these, the runtime is able to listen on the original IPC Transport it creates ($TMPDIR/dotnet-diagnostics-<pid>) and this reversed connection if configured.

This reversed connection is opt-in and only activates when a path is specified in the DOTNET_DiagnosticsMonitorAddress environment variable.

The following logic is almost entirely inside DiagnosticsIpcFactory::GetNExtAvailableStream(...)

If a user configures this mechanism, the runtime will infinitely attempt to connect the specified transport and is resilient to that transport closing and reopening. The retry logic is as follows:

  • Reverse connection is not configured => infinitely block on server connection
  • Reverse connection is configured
    • Reverse connection is successful => cache connection, set timeout to max (30s)
    • Reverse connection is unsuccessful or closed => set timeout to minimum (250ms) and poll on other connections, multiply by falloff factor (2) for each attempt until max (30s)

The runtime caches reverse connections so that it is not constantly attaching to the reverse transport every time poll times out or the server connection is used. These cache entries get cleared when a connection is used or the connection is hung up.

This PR is currently missing The following will be a separate PR:

  • Semantics for blocking startup of the runtime
    • To enable collecting startup events, we would add a configurable blocking timeout to runtime startup
    • an environment variable will take a value of -1 (infinite), or 0 to n milliseconds
    • this value will be used to determine how long to block eemain before an EventPipe session is started.
    • the default will be 0 (no block)

CC - @tommcdon @shirhatti @noahfalk @sywhang

@josalem josalem added this to the 5.0 milestone Mar 6, 2020
@josalem josalem requested review from noahfalk and sywhang March 6, 2020 22:28
@josalem josalem self-assigned this Mar 6, 2020
@Dotnet-GitSync-Bot
Copy link
Collaborator

I couldn't add an area label to this PR.

Checkout this page to find out which area owner to ping, or please add exactly one area label to help train me in the future.

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments and questions. Holding off on a more detailed review in expectation that things are still changing : )

src/coreclr/src/vm/diagnosticserver.cpp Outdated Show resolved Hide resolved
src/coreclr/src/vm/diagnosticserver.cpp Outdated Show resolved Hide resolved
src/coreclr/src/vm/diagnosticserver.cpp Outdated Show resolved Hide resolved
src/coreclr/src/inc/clrconfigvalues.h Outdated Show resolved Hide resolved
src/coreclr/src/debug/debug-pal/win/diagnosticsipc.cpp Outdated Show resolved Hide resolved
src/coreclr/src/vm/diagnosticsserverprotocolhelper.h Outdated Show resolved Hide resolved
@josalem josalem marked this pull request as ready for review March 18, 2020 00:06
@josalem josalem requested a review from noahfalk March 18, 2020 00:06
@josalem josalem changed the title [Draft] Add Reverse Diagnostics Server (client-mode) Add Reverse Diagnostics Server (client-mode) Mar 18, 2020
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its shaping up but will still need some more work to be robust. Happy to chat if you've got questions about anything.

src/coreclr/src/vm/diagnosticserver.cpp Outdated Show resolved Hide resolved
src/coreclr/src/vm/diagnosticserver.cpp Outdated Show resolved Hide resolved
src/coreclr/src/vm/diagnosticsipcfactory.cpp Outdated Show resolved Hide resolved
src/coreclr/src/vm/diagnosticsprotocol.h Outdated Show resolved Hide resolved
src/coreclr/src/vm/diagnosticsipcfactory.cpp Outdated Show resolved Hide resolved
src/coreclr/src/debug/inc/diagnosticsipc.h Show resolved Hide resolved
src/coreclr/src/debug/debug-pal/win/diagnosticsipc.cpp Outdated Show resolved Hide resolved
src/coreclr/src/debug/debug-pal/win/diagnosticsipc.cpp Outdated Show resolved Hide resolved

const BOOL fSuccessCloseHandle = ::CloseHandle(_hPipe);
assert(fSuccessCloseHandle != 0);
_ASSERTE(fSuccessCloseHandle != 0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add an assert that that OVERLAPPED operation is complete. I would also test to confirm that DisconnectNamedPipe() finishes a ConnectNamedPipe operation when it is still pending. Destroying an OVERLAPPED struct before the operation ends usually causes painful to debug memory corruption issues.

src/coreclr/src/vm/diagnosticsipcfactory.cpp Outdated Show resolved Hide resolved
@josalem josalem force-pushed the dev/josalem/diag-server-handshake branch from 9f108d8 to 20f3745 Compare April 1, 2020 22:41
John Salem added 11 commits April 2, 2020 14:59
* no select on unix
* untested on unix
* Change DOTNET_DiagnosticsServerAddress to DOTNET_DiagnosticsClientModeAddress
* works for original connection mode
* untested for client mode
* fix array alloc
* properly cast array size
John Salem added 7 commits April 13, 2020 16:33
* Adds ConnectionState class for hiding server/client diff
* simplifies code for easier reading
* test was creating a pipe with a 0 buffer
* runtime needs to handle a 0 buffer namedpipe
* makes advertisement not block for more than 100 ms
* TODO: implement on non-windows
@josalem
Copy link
Contributor Author

josalem commented Apr 17, 2020

/azp run runtime

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@josalem josalem requested a review from noahfalk April 17, 2020 22:18
Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, things are starting to look ship shape 👍 I did find a few more minor potential issues, comments inline.

src/coreclr/src/debug/debug-pal/win/diagnosticsipc.cpp Outdated Show resolved Hide resolved
src/coreclr/src/debug/inc/diagnosticsipc.h Outdated Show resolved Hide resolved
src/coreclr/src/debug/inc/diagnosticsipc.h Outdated Show resolved Hide resolved
src/coreclr/src/debug/inc/diagnosticsipc.h Outdated Show resolved Hide resolved
src/coreclr/src/vm/diagnosticsprotocol.h Show resolved Hide resolved
src/coreclr/src/vm/diagnosticsprotocol.h Show resolved Hide resolved
src/coreclr/src/vm/ipcstreamfactory.cpp Show resolved Hide resolved
_pIpc->Close(callback);
if (_pStream != nullptr)
_pStream->Close(callback);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_pIpc and _pStream need to be deleted (not that a leak on shutdown would really matter, but may as well clean up properly)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I specifically leaked these since the PAL is unaware of shutdown paths and has access to these pointers. This method is specifically called from a different thread than the server thread. If we delete them in this code path it can lead to AVs on the server thread in the PAL (inside IpcStream::DiagnosticsIpc::Poll specifically). Rather than introduce some form of knowledge of runtime state to the PAL or potentially adding locks protecting access I opted to just leak the memory on shutdown, since that is the only place this method is called. I should add a comment that calls out this assumption, though.

src/coreclr/tests/src/tracing/eventpipe/reverse/reverse.cs Outdated Show resolved Hide resolved
@josalem
Copy link
Contributor Author

josalem commented Apr 20, 2020

Remaining failure appears to be an AzDO Package Feed or NuGet manifest error. All test runs passed.

@josalem josalem changed the title Add Reverse Diagnostics Server (client-mode) Add Reverse Diagnostics Server Apr 20, 2020
@josalem josalem merged commit 629dba5 into dotnet:master Apr 20, 2020
jkotas added a commit that referenced this pull request May 3, 2020
jkotas added a commit that referenced this pull request May 3, 2020
josalem pushed a commit to josalem/runtime that referenced this pull request May 7, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants