-
Notifications
You must be signed in to change notification settings - Fork 29.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
worker: refactor thread life cycle management #26099
Conversation
@addaleax - One thing I noticed is that the |
Yes, it’s working as designed.
It’s a question of correctness – the |
|
|
@gireeshpunathil Let me know if you need anything investigating those failures :) |
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:67
67 ../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x3fffa1d54e60 (LWP 32165))]
(gdb) where
#0 __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:67
#1 0x000000001077a3d0 in ?? () at ../deps/uv/src/unix/thread.c:288
#2 0x0000000010669de4 in node::worker::Worker::StopThread(v8::FunctionCallbackInfo<v8::Value> const&)
()
#3 0x00000000108dfc4c in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#4 0x00000000108e09e4 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#5 0x0000000011b5bc68 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit () void Worker::Exit(int code) {
Mutex::ScopedLock lock(mutex_);
Debug(this, "Worker %llu called Exit(%d)", thread_id_, code);
if (!thread_stopper_->IsStopped()) { // --------------------------> here
exit_code_ = code;
Debug(this, "Received StopEventLoop request");
thread_stopper_->Stop();
if (isolate_ != nullptr)
isolate_->TerminateExecution();
}
} Looking at this line (where it crashed), I believe that the main thread is trying to terminate the worker even before it came to life - which means Prior to this PR it was not an issue as we were using primitive fields under worker. Right now I am constructing I guess we could create it in the parent itself, and the async_handle can be late attached in the worker thread, of course. With that change I ran ppc linux 1000 times and see no crashes. @addaleax - what do you think? |
btw AIX also failed the same test, and I was anticipating it! |
new CI with the changes: https://ci.nodejs.org/job/node-test-pull-request/20794/ |
@gireeshpunathil Yes, that makes a lot of sense. Do these |
you mean to unwrap the |
btw wrote a new test that recreated the crash in xlinux with the old code and confirmed the theory. When the main thread went to terminate, the worker was still being cloned. (gdb) where
#0 0x00007f1d0de23c30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1 0x0000000001019239 in uv_mutex_lock (mutex=0x18) at ../deps/uv/src/unix/thread.c:287
#2 0x0000000000d9b0a0 in node::LibuvMutexTraits::mutex_lock (mutex=0x18) at ../src/node_mutex.h:108
#3 0x0000000000d9ce86 in node::MutexBase<node::LibuvMutexTraits>::ScopedLock::ScopedLock (
this=0x7ffe42fea980, mutex=...) at ../src/node_mutex.h:164
#4 0x0000000000efd41a in node::worker::AsyncRequest::IsStopped (this=0x0) at ../src/node_worker.cc:87
#5 0x0000000000f00105 in node::worker::Worker::Exit (this=0x5060d20, code=1)
at ../src/node_worker.cc:549
#6 0x0000000000efff41 in node::worker::Worker::StopThread (args=...) at ../src/node_worker.cc:526
#7 0x0000000001185b4d in v8::internal::FunctionCallbackArguments::Call (
this=this@entry=0x7ffe42feab90, handler=handler@entry=0x18d96d7f5391)
at ../deps/v8/src/api-arguments-inl.h:140
#8 0x0000000001186802 in v8::internal::(anonymous namespace)::HandleApiCallHelper<false> (
isolate=isolate@entry=0x4f6da90, function=..., function@entry=..., new_target=...,
new_target@entry=..., fun_data=..., receiver=..., receiver@entry=..., args=...)
at ../deps/v8/src/builtins/builtins-api.cc:109
#9 0x000000000118ac2b in v8::internal::Builtin_Impl_HandleApiCall (args=...,
isolate=isolate@entry=0x4f6da90) at ../deps/v8/src/builtins/builtins-api.cc:139
#10 0x000000000118b641 in v8::internal::Builtin_HandleApiCall (args_length=5,
args_object=0x7ffe42feadb8, isolate=0x4f6da90) at ../deps/v8/src/builtins/builtins-api.cc:127
#11 0x000000000249ea95 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit ()
#12 0x00000840f988cb8e in ?? ()
#13 0x000004c1108025a1 in ?? ()
#14 0x000018d96d7f5621 in ?? ()
#15 0x0000000500000000 in ?? ()
#16 0x000004c110802681 in ?? ()
#17 0x00003a17f9dd3be9 in ?? ()
(gdb) f 5
#5 0x0000000000f00105 in node::worker::Worker::Exit (this=0x5060d20, code=1)
at ../src/node_worker.cc:549
549 if (!thread_stopper_->IsStopped()) {
(gdb) thr 8
[Switching to thread 8 (Thread 0x7f1d06ffd700 (LWP 42711))]
#0 0x00007f1d0db4bb01 in clone () from /lib64/libc.so.6
(gdb) where
#0 0x00007f1d0db4bb01 in clone () from /lib64/libc.so.6
#1 0x00007f1d0de21d10 in ?? () from /lib64/libpthread.so.0
#2 0x00007f1d06ffd700 in ?? ()
#3 0x0000000000000000 in ?? ()
(gdb) |
Sorry for being unclear – what I meant was to use |
thanks @addaleax - I followed that suggestion. Several test were crashing in windows. Here is what I found: 00007FF6A9505BE0 push rbx
00007FF6A9505BE2 sub rsp,20h
00007FF6A9505BE6 mov rbx,qword ptr [rcx]
00007FF6A9505BE9 mov edx,0E0h
00007FF6A9505BEE mov rax,qword ptr [rbx]
>> 00007FF6A9505BF1 dec dword ptr [rax+7D0h]
00007FF6A9505BF7 mov rax,qword ptr [rbx+10h]
00007FF6A9505BFB mov qword ptr [rcx],rax
00007FF6A9505BFE call operator delete (07FF6A93C1B4Ah)
00007FF6A9505C03 mov edx,18h
00007FF6A9505C08 mov rcx,rbx
00007FF6A9505C0B add rsp,20h
00007FF6A9505C0F pop rbx this points to uv_close(reinterpret_cast<uv_handle_t*>(handle), [](uv_handle_t* handle) {
std::unique_ptr<CloseData> data { static_cast<CloseData*>(handle->data) };
data->env->handle_cleanup_waiting_--;
handle->data = data->original_data;
data->callback(reinterpret_cast<T*>(handle));
}); It is possible that when we issue the If I nullify the Looking for suggestions at this point! |
ok, I guess I found the issue - I was issuing |
@gireeshpunathil Yeah, I think that makes sense. :) |
0035d2c
to
f53a1e4
Compare
@@ -34,14 +55,13 @@ class Worker : public AsyncWrap { | |||
tracker->TrackFieldWithSize( | |||
"isolate_data", sizeof(IsolateData), "IsolateData"); | |||
tracker->TrackFieldWithSize("env", sizeof(Environment), "Environment"); | |||
tracker->TrackField("thread_exit_async", *thread_exit_async_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We’re not tracking the uv_async_t
anymore, right? Maybe we should add something like tracker->TrackInlineField()
that allows us to keep track of MemoryRetainer
s that are direct members of the class…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you mean - the one represented by thread_exit_async_
? that is replaced withAsyncRequest
objects that creates uv_async_t
objects, and tracks through the interface method. the async_
field in AsyncRequest
is still a pointer, direct member of neither AsyncRequest
nor Worker
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Neither *thread_stopper_.async_
nor *on_thread_finished_.async_
are tracked, yes, because we don’t inform the tracker
about the existence of the AsyncRequest
fields.
Also, side note: I’m just noticing that we have the IsolateData
and Environment
fields listed here as well, which I’m not sure makes sense given that they are no longer directly allocated by this object…
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I don't understand. *thread_stopper_.async
and *on_thread_finished_.async_
are not tracked through the tracker instance or of worker, but those are tracked through the tracker instance of AsyncRequest
object (line 98):
void AsyncRequest::MemoryInfo(MemoryTracker* tracker) const {
Mutex::ScopedLock lock(mutex_);
if (async_ != nullptr) tracker->TrackField("async_request", *async_);
}
Isn't it enough? I hope we don't need multiple trackers for the same allocation?
For the IsolateData
and Environment
: I just removed those from being actively tracked by the worker and pushed in under this PR itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gireeshpunathil The problem is that the memory tracker doesn’t know that it should call AsyncRequest::MemoryInfo
. Currently, the way to inform it would be adding tracker->TrackField("thread_stopper_", &thread_stopper_);
, but then we would end up tracking the memory for the AsyncRequest
itself twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gireeshpunathil Should we change this PR to use TrackInlineField
now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current mechanism of uses two async handles, one owned by the creator of the worker thread to terminate a running worker, and another one employed by the worker to interrupt its creator on its natural termination. The force termination piggybacks on the message- passing mechanism to inform the worker to quiesce. Also there are few flags that represent the other thread's state / request state because certain code path is shared by multiple control flows, and there are certain code path where the async handles may not have come to life. Refactor into a LoopStopper abstraction that exposes routines to install a handle as well as to save a state. Refs: nodejs#21283
338d590
to
db9edc1
Compare
Landed in d14cba4 :) |
The current mechanism of uses two async handles, one owned by the creator of the worker thread to terminate a running worker, and another one employed by the worker to interrupt its creator on its natural termination. The force termination piggybacks on the message- passing mechanism to inform the worker to quiesce. Also there are few flags that represent the other thread's state / request state because certain code path is shared by multiple control flows, and there are certain code path where the async handles may not have come to life. Refactor into an AsyncRequest abstraction that exposes routines to install a handle as well as to save a state. PR-URL: #26099 Refs: #21283 Reviewed-By: Anna Henningsen <[email protected]>
The current mechanism of uses two async handles, one owned by the creator of the worker thread to terminate a running worker, and another one employed by the worker to interrupt its creator on its natural termination. The force termination piggybacks on the message- passing mechanism to inform the worker to quiesce. Also there are few flags that represent the other thread's state / request state because certain code path is shared by multiple control flows, and there are certain code path where the async handles may not have come to life. Refactor into an AsyncRequest abstraction that exposes routines to install a handle as well as to save a state. PR-URL: #26099 Refs: #21283 Reviewed-By: Anna Henningsen <[email protected]>
The current mechanism uses two async handles, one owned by the
creator of the worker thread to terminate a running worker,
and another one employed by the worker to interrupt its creator on its
natural termination. The force termination piggybacks on the message-
passing mechanism to inform the worker to quiesce.
Also there are few flags that represent the other thread's state /
request state because certain code path is shared by multiple
control flows, and there are certain code path where the async
handles may not have come to life.
Refactor into a LoopStopper abstraction that exposes routines to
install a handle as well as to save a state.
Refs: #21283
The approach can be re-used for stopping the main Node application thread in-flight.
cc @addaleax @nodejs/workers
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passes