Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure: tracing/eventpipe/reverse/reverse/reverse.sh #38156

Closed
josalem opened this issue Jun 19, 2020 · 3 comments · Fixed by #39129
Closed

Test failure: tracing/eventpipe/reverse/reverse/reverse.sh #38156

josalem opened this issue Jun 19, 2020 · 3 comments · Fixed by #39129
Labels
area-Tracing-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Milestone

Comments

@josalem
Copy link
Contributor

josalem commented Jun 19, 2020

Happened again: https://helix.dot.net/api/2019-06-17/jobs/24babe78-7797-4a21-b7d0-6157ef54bf62/workitems/PayloadGroup0/console

Any update on this?

Originally posted by @safern in #35270 (comment)

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Tracing-coreclr untriaged New issue has not been triaged by the area owner labels Jun 19, 2020
@josalem josalem added this to the 5.0.0 milestone Jun 19, 2020
@mangod9 mangod9 added the blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms' label Jun 19, 2020
@tommcdon tommcdon added p1 and removed untriaged New issue has not been triaged by the area owner labels Jun 22, 2020
josalem pushed a commit to josalem/runtime that referenced this issue Jun 23, 2020
josalem pushed a commit that referenced this issue Jun 23, 2020
@josalem
Copy link
Contributor Author

josalem commented Jun 24, 2020

I'm not positive this is the same issue, but I've managed to recreate an issue locally using a modified version of the test that infinitely loops under a debugger. My conjecture is that the target process is exhausting its limit of file descriptors. My local repro infinitely loops the one test case without recycling the app (which may exacerbate the problem slightly) and eventually the monitoring app will pause indefinitely waiting for an advertisement from the target app. Looking in the stresslog for the target app shows that the pause is caused by the target app failing to connect to the reverse server with an error message indicating that there are too many open files. Looking at the output of lsof -p <pid> | wc -l prints something on the order of 10,000 file descriptors. most of them are "null" unix domain socket entries, e.g., corerun 32611 josalem 1132u unix 0x4e11cdfe222d8ee1 0t0 ->(none). At first I thought we might be leaking these file descriptors, but I'm starting to think it might have more to do with the test abusing the underlying heuristics for the socket APIs.

Specifically, the socket(7) man page says this about the linger option:

When enabled, a close(2) or shutdown(2) will not return until
all queued messages for the socket have been successfully sent
or the linger timeout has been reached. Otherwise, the call
returns immediately and the closing is done in the background.
When the socket is closed as part of exit(2), it always
lingers in the background.

The main point I take from that snippet is that if linger isn't specified, close will return immediately, and the OS handles the actual close/cleanup in the background.

I'm currently looking into ways to change the calls to close from both the monitor app and the target app to validate whether this is a product issue, e.g., leaking the fds, or a misuse of the APIs, e.g., missing a call to properly close the sockets.

@josalem
Copy link
Contributor Author

josalem commented Jun 25, 2020

On further reflection, I'm not convinced that the potential issue I found in the previous comment is the same as the one from the CI log messages, or that it is necessarily an issue at all. The presentation is similar, but the steps for reproducing the mentioned error consistently require ~155 seconds of constantly deleting and recreating the unix domain socket without recycling the target app. I think that by not recycling the app and constantly deleting and recreating the socket, I was abusing the socket and close APIs such that the OS never got around to cleaning the underlying, orphaned file descriptors. They've been closed by the code, but the OS hasn't cleaned them yet.

The error in CI seems to reproduce inconsistently, occurs only in the TEST_ReverseConnectionCanRecycleWhileTracing method, and always occurs in the same spot where the socket is only recycled 4 times. I left an infinitely looping variant of the test that is in CI on overnight and it never caught the issue. I'm going to try running this same experiment under an Ubuntu container and see if that susses out any new info. I might also try creating a no-merge PR and running the outer loop tests in CI with some modifications to try and collect additional info.

@josalem
Copy link
Contributor Author

josalem commented Jun 26, 2020

Managed to repro the failure. I ran the test that is in CI with the timeouts taken out in an infinite loop in 3 separate terminals. One of the instances eventually hit an infinite pause with identical symptoms to the CI failures. I managed to take a core dump of the subprocess, but attempting to run bt all in lldb while I had the process under a debugger caused the target process to crash. I'll take a look at the dump as time permits. I have had 3 terminals running the tests in a loop for >12 hours now and the issue hasn't happened again. I take this to mean that the issue is either some form of race condition, or non-deterministic behavior. Since this issue is only hit in a pathologic test case, I'll split that test out into its own test in CI and move it to outer loop. Additional verbosity in the test will hopefully give some more insight. I'll also look into adding the ability for the helix dump functionality to also dump any child processes of a test. That will be helpful for any tests using RemoteExecutor, which I understand many tests in the Libraries use.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Tracing-coreclr blocking-clean-ci Blocking PR or rolling runs of 'runtime' or 'runtime-extra-platforms'
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants