Test failure: tracing/eventpipe/reverse/reverse/reverse.sh #38156

josalem · 2020-06-19T16:09:51Z

Happened again: https://helix.dot.net/api/2019-06-17/jobs/24babe78-7797-4a21-b7d0-6157ef54bf62/workitems/PayloadGroup0/console

Any update on this?

Originally posted by @safern in #35270 (comment)

dotnet#38156 tracks work for solution

#38156 tracks work for solution

josalem · 2020-06-24T18:32:05Z

I'm not positive this is the same issue, but I've managed to recreate an issue locally using a modified version of the test that infinitely loops under a debugger. My conjecture is that the target process is exhausting its limit of file descriptors. My local repro infinitely loops the one test case without recycling the app (which may exacerbate the problem slightly) and eventually the monitoring app will pause indefinitely waiting for an advertisement from the target app. Looking in the stresslog for the target app shows that the pause is caused by the target app failing to connect to the reverse server with an error message indicating that there are too many open files. Looking at the output of lsof -p <pid> | wc -l prints something on the order of 10,000 file descriptors. most of them are "null" unix domain socket entries, e.g., corerun 32611 josalem 1132u unix 0x4e11cdfe222d8ee1 0t0 ->(none). At first I thought we might be leaking these file descriptors, but I'm starting to think it might have more to do with the test abusing the underlying heuristics for the socket APIs.

Specifically, the socket(7) man page says this about the linger option:

When enabled, a close(2) or shutdown(2) will not return until
all queued messages for the socket have been successfully sent
or the linger timeout has been reached. Otherwise, the call
returns immediately and the closing is done in the background.
When the socket is closed as part of exit(2), it always
lingers in the background.

The main point I take from that snippet is that if linger isn't specified, close will return immediately, and the OS handles the actual close/cleanup in the background.

I'm currently looking into ways to change the calls to close from both the monitor app and the target app to validate whether this is a product issue, e.g., leaking the fds, or a misuse of the APIs, e.g., missing a call to properly close the sockets.

josalem · 2020-06-25T16:56:17Z

On further reflection, I'm not convinced that the potential issue I found in the previous comment is the same as the one from the CI log messages, or that it is necessarily an issue at all. The presentation is similar, but the steps for reproducing the mentioned error consistently require ~155 seconds of constantly deleting and recreating the unix domain socket without recycling the target app. I think that by not recycling the app and constantly deleting and recreating the socket, I was abusing the socket and close APIs such that the OS never got around to cleaning the underlying, orphaned file descriptors. They've been closed by the code, but the OS hasn't cleaned them yet.

The error in CI seems to reproduce inconsistently, occurs only in the TEST_ReverseConnectionCanRecycleWhileTracing method, and always occurs in the same spot where the socket is only recycled 4 times. I left an infinitely looping variant of the test that is in CI on overnight and it never caught the issue. I'm going to try running this same experiment under an Ubuntu container and see if that susses out any new info. I might also try creating a no-merge PR and running the outer loop tests in CI with some modifications to try and collect additional info.

josalem · 2020-06-26T17:50:20Z

Managed to repro the failure. I ran the test that is in CI with the timeouts taken out in an infinite loop in 3 separate terminals. One of the instances eventually hit an infinite pause with identical symptoms to the CI failures. I managed to take a core dump of the subprocess, but attempting to run bt all in lldb while I had the process under a debugger caused the target process to crash. I'll take a look at the dump as time permits. I have had 3 terminals running the tests in a loop for >12 hours now and the issue hasn't happened again. I take this to mean that the issue is either some form of race condition, or non-deterministic behavior. Since this issue is only hit in a pathologic test case, I'll split that test out into its own test in CI and move it to outer loop. Additional verbosity in the test will hopefully give some more insight. I'll also look into adding the ability for the helix dump functionality to also dump any child processes of a test. That will be helpful for any tests using RemoteExecutor, which I understand many tests in the Libraries use.

Dotnet-GitSync-Bot added area-Tracing-coreclr untriaged New issue has not been triaged by the area owner labels Jun 19, 2020

josalem added this to the 5.0.0 milestone Jun 19, 2020

mangod9 added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Jun 19, 2020

jashook mentioned this issue Jun 19, 2020

Infrastructure - Status/Health #702

Closed

tommcdon added p1 and removed untriaged New issue has not been triaged by the area owner labels Jun 22, 2020

josalem mentioned this issue Jun 23, 2020

PayloadGroup0 is timing out #38284

Closed

josalem pushed a commit to josalem/runtime that referenced this issue Jun 23, 2020

temporarily disable the reverse eventpipe

278b7ef

dotnet#38156 tracks work for solution

josalem mentioned this issue Jun 23, 2020

[Test] temporarily disable the reverse eventpipe test #38293

Merged

josalem pushed a commit that referenced this issue Jun 23, 2020

[Test] temporarily disable the reverse eventpipe test (#38293)

4046b5e

#38156 tracks work for solution

josalem mentioned this issue Jun 26, 2020

[Test] Reverse EventPipe test improvements #38474

Merged

This was referenced Jul 7, 2020

Test failure: tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh #38843

Closed

Fix fd leak in Diagnostics Server and add better error handling #39129

Merged

josalem closed this as completed in #39129 Jul 20, 2020

josalem mentioned this issue Jul 28, 2020

Test failures: reverseouter.sh, reverse.sh #39979

Closed

ghost locked as resolved and limited conversation to collaborators Dec 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Test failure: tracing/eventpipe/reverse/reverse/reverse.sh #38156

Test failure: tracing/eventpipe/reverse/reverse/reverse.sh #38156

josalem commented Jun 19, 2020

josalem commented Jun 24, 2020

josalem commented Jun 25, 2020

josalem commented Jun 26, 2020

Test failure: tracing/eventpipe/reverse/reverse/reverse.sh #38156

Test failure: tracing/eventpipe/reverse/reverse/reverse.sh #38156

Comments

josalem commented Jun 19, 2020

josalem commented Jun 24, 2020

josalem commented Jun 25, 2020

josalem commented Jun 26, 2020