trace_event: destroy platform before tracing #22938
Linter re-run: https://ci.nodejs.org/job/node-test-linter/22199/
Windows re-run: https://ci.nodejs.org/job/node-test-commit-windows-fanned/20863/
win10 failures are odd but they have to be unrelated, right?
Windows rebuild: https://ci.nodejs.org/job/node-test-commit-windows-fanned/20869/
The Windows tests are consistently failing with:
3221225477 is 0xc0000005, which is an access violation. This needs investigating.
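The decimal-to-hex correspondence quoted above can be checked with a few lines of C++. This is only an illustrative sketch: the helper names `ExitCodeToHex` and `IsAccessViolation` are hypothetical and are not part of Node.js or its test runner. The value 0xC0000005 is the Windows status code known as STATUS_ACCESS_VIOLATION.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a process exit code as lowercase hex,
// the form in which Windows NTSTATUS values are usually written.
std::string ExitCodeToHex(std::uint32_t code) {
  char buf[16];
  std::snprintf(buf, sizeof(buf), "0x%08x", static_cast<unsigned>(code));
  return std::string(buf);
}

// Hypothetical helper: true when the exit code is 0xC0000005
// (STATUS_ACCESS_VIOLATION on Windows).
bool IsAccessViolation(std::uint32_t code) {
  return code == 0xC0000005u;
}
```

Calling `ExitCodeToHex(3221225477u)` yields the string `0xc0000005`, which matches the exit code reported by the failing child processes.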
LGTM
/ping @nodejs/platform-windows on the access-violation Windows failures that this change apparently causes.
I plan to take a look at this once I get some free cycles; today has been quite busy. I would appreciate others' help, however.
For safer shutdown, we should destroy the platform – and background threads - before the tracing infrastructure is destroyed. This change fixes the relative order of NodePlatform disposition and the tracing agent shutting down. This matches the nesting order for startup. Make the tracing agent own the tracing controller instead of platform to match the above. Fixes: nodejs#22865
b2718fe to db9b5ff
I have been having a hard time reproducing this failure on my local Windows machine. Even in the CI it seems to be a flaky issue, with different tests failing depending on shutdown timing. Failures typically take the form of a segfault in a child process. If someone could help me grab the crashing stack trace somehow, that would help tremendously.
BTW, it would be a nice feature if our test infrastructure were capable of capturing stack traces on segfaults automatically. It would make root-cause analysis of failures much simpler.
@nodejs/build-infra @nodejs/build would it be possible to support what @ofrobots suggested?
@@ -48,8 +48,7 @@ using v8::platform::tracing::TraceConfig;
 using v8::platform::tracing::TraceWriter;
 using std::string;

-Agent::Agent() {
-  tracing_controller_ = new TracingController();
+Agent::Agent() : tracing_controller_(new TracingController()) {
If the TracingController lifetime is 1:1 with the Agent, make it a member instead of a pointer.
At this point, the lifetimes aren't perfectly aligned. After this PR lands, I intend to refactor and merge the Agent and TracingController concepts into a single structure.
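For readers unfamiliar with the trade-off in this review thread, here is a rough sketch of the two ownership styles. `Controller`, `AgentWithPointer`, and `AgentWithMember` are hypothetical stand-ins, not the real `TracingController` or `Agent` classes; the sketch assumes nothing about how Node.js actually lays these out.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a tracing controller.
struct Controller {
  int events_seen = 0;
};

// Pointer ownership: the controller is heap-allocated and could, in
// principle, be released or replaced independently of the agent.
class AgentWithPointer {
 public:
  AgentWithPointer() : controller_(new Controller()) {}
  Controller* controller() { return controller_.get(); }

 private:
  std::unique_ptr<Controller> controller_;
};

// Member ownership: the controller lives exactly as long as the agent,
// with no separate allocation and no possibility of a null pointer.
class AgentWithMember {
 public:
  Controller* controller() { return &controller_; }

 private:
  Controller controller_;
};
```

The member form only fits when the two lifetimes really are identical, which is the condition the reviewer attaches to the suggestion.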
After a lot of flailing (due to my inexperience with Windows) I was able to capture a crash dump. On Unix this would have been as simple as
There were lots of false starts though. E.g. the internet suggests that setting some registry keys should enable generation of dumps. This didn't work for me. There are instructions on our issue tracker about 'Dumps on Silent Process Exit'. This didn't help as this would generate a dump every time a child process exited 'silently'. This happens frequently. Ultimately, I added this code to
Anyway, back to the problem at hand: the tests that are flaking are ones that manually call
I have verified that this Windows crash is not caused by the change here; it is a long-standing issue with thread timing on Windows that happens to be exposed by this change. Without my patch, I can reproduce a crash by introducing a manual delay in the background worker thread startup:

static void PlatformWorkerThread(void* data) {
  fprintf(stderr, "");  // write an empty string to stderr, just to introduce a delay.
  TRACE_EVENT_METADATA1("__metadata", "thread_name", "name",
                        "PlatformWorkerThread");
  TaskQueue<Task>* pending_worker_tasks = static_cast<TaskQueue<Task>*>(data);
  while (std::unique_ptr<Task> task = pending_worker_tasks->BlockingPop()) {
    task->Run();
    pending_worker_tasks->NotifyOfCompletion();
  }
}

This results in a segfault in the child process (3221225477 is the decimal code for an access violation).
So far I can make the crash happen on Windows only. The problem is that the background worker is still doing I/O when the main thread calls exit. I'll open a separate issue for this. This PR is blocked until that is resolved.
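One general way to avoid a worker that is still running when the main thread exits is to stop the queue and join the worker first. The following is a minimal sketch of that idea only; it is not Node's `TaskQueue` and not the fix that was eventually applied. The `BlockingPop` name is borrowed from the snippet above, while `Stop` and `RunAndShutDown` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Simplified, hypothetical task queue (not the one in node_platform.cc).
class TaskQueue {
 public:
  void Push(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  // Asks consumers to finish once the remaining tasks are drained.
  void Stop() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      stopped_ = true;
    }
    cv_.notify_all();
  }

  // Blocks until a task is available. Returns an empty function once the
  // queue has been stopped and fully drained.
  std::function<void()> BlockingPop() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return stopped_ || !tasks_.empty(); });
    if (tasks_.empty()) return nullptr;
    std::function<void()> task = std::move(tasks_.front());
    tasks_.pop();
    return task;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool stopped_ = false;
};

// Runs task_count tasks on one worker and returns how many completed.
// The worker is joined before returning, so no task is still running
// when the caller goes on to exit.
int RunAndShutDown(int task_count) {
  TaskQueue queue;
  std::atomic<int> completed{0};
  std::thread worker([&queue] {
    while (std::function<void()> task = queue.BlockingPop()) task();
  });
  for (int i = 0; i < task_count; ++i) {
    queue.Push([&completed] { ++completed; });
  }
  queue.Stop();
  worker.join();
  return completed.load();
}
```

Because `Stop` only takes effect after the queue is empty, every pushed task runs before `join` returns; a process that exits after `RunAndShutDown` has no worker left doing I/O.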
Opened issue #23065.
Added
#23065 is resolved, so I believe this is now unblocked.
For safer shutdown, we should destroy the platform – and background threads - before the tracing infrastructure is destroyed. This change fixes the relative order of NodePlatform disposition and the tracing agent shutting down. This matches the nesting order for startup. Make the tracing agent own the tracing controller instead of platform to match the above. Fixes: nodejs#22865 PR-URL: nodejs#22938 Reviewed-By: Eugene Ostroukhov <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Matteo Collina <[email protected]>
Landed in 68b3e46
Should this be backported to
v10.x Backport on #23398
For safer shutdown, we should destroy the platform – and platform
threads - before the tracing infrastructure is destroyed. This change
fixes the relative order of NodePlatform disposition and the tracing
agent shutting down. This matches the nesting order for startup.
Make the tracing agent own the tracing controller instead of platform
to match the above rationale.
Fixes: #22865
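The ordering argument in the description can be illustrated with plain C++ scoping. This is a sketch under stated assumptions: `Tracing` and `Platform` are hypothetical stand-ins for the tracing agent and `NodePlatform`, and the event log exists only to make the order visible. The point is that whatever starts second (the platform, whose threads emit trace events) should be torn down first.

```cpp
#include <cassert>
#include <string>
#include <vector>

using EventLog = std::vector<std::string>;

// Hypothetical stand-in for the tracing agent; it owns the event sink.
class Tracing {
 public:
  explicit Tracing(EventLog* log) : log_(log) { log_->push_back("tracing up"); }
  ~Tracing() { log_->push_back("tracing down"); }
  void Emit(const std::string& event) { log_->push_back(event); }

 private:
  EventLog* log_;
};

// Hypothetical stand-in for the platform; it emits events through Tracing,
// so Tracing must still be alive whenever Platform runs, including during
// Platform's own destruction.
class Platform {
 public:
  explicit Platform(Tracing* tracing) : tracing_(tracing) {
    tracing_->Emit("platform up");
  }
  ~Platform() { tracing_->Emit("platform down"); }

 private:
  Tracing* tracing_;
};

// Starts tracing, then the platform; locals are destroyed in reverse
// order of construction, so the platform goes away before tracing does.
EventLog RunLifecycle() {
  EventLog log;
  {
    Tracing tracing(&log);
    Platform platform(&tracing);
  }
  return log;
}
```

`RunLifecycle()` records "tracing up", "platform up", "platform down", "tracing down", which is the nesting the description asks for: shutdown mirrors startup.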
This should fix the thread races we have been observing with trace events. I have been running this on the FreeBSD box that was showing flakes in the CI:
I believe this makes 92b695e unnecessary (but harmless). I can revert that if this change sticks and the CI proves that the flakiness is gone.
Checklist
make -j4 test (UNIX), or vcbuild test (Windows) passes
CI: https://ci.nodejs.org/job/node-test-pull-request/17302/