Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

worker: refactor thread life cycle management #26099

Closed
wants to merge 4 commits into from

Conversation

gireeshpunathil
Copy link
Member

@gireeshpunathil gireeshpunathil commented Feb 14, 2019

The current mechanism uses two async handles, one owned by the
creator of the worker thread to terminate a running worker,
and another one employed by the worker to interrupt its creator on its
natural termination. The force termination piggybacks on the message-
passing mechanism to inform the worker to quiesce.

Also there are few flags that represent the other thread's state /
request state because certain code path is shared by multiple
control flows, and there are certain code path where the async
handles may not have come to life.

Refactor into a LoopStopper abstraction that exposes routines to
install a handle as well as to save a state.

Refs: #21283

The approach can be re-used for stopping the main Node application thread in-flight.

cc @addaleax @nodejs/workers

Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • tests and/or benchmarks are included
  • documentation is changed or added
  • commit message follows commit guidelines

@nodejs-github-bot nodejs-github-bot added the c++ Issues and PRs that require attention from people who are familiar with C++. label Feb 14, 2019
@gireeshpunathil gireeshpunathil added the worker Issues and PRs related to Worker support. label Feb 14, 2019
src/node_worker.cc Outdated Show resolved Hide resolved
src/node_worker.h Show resolved Hide resolved
src/node_worker.h Outdated Show resolved Hide resolved
src/node_worker.h Show resolved Hide resolved
src/node_worker.cc Outdated Show resolved Hide resolved
@gireeshpunathil
Copy link
Member Author

@addaleax - One thing I noticed is that the Environment that is newly created in Worker::Run is never attached to the worker object, instead it has the creator's Environment, but within the run loop the new env_ is in effect. While I don't see any issues with this approach, is this working as designed? What advantage it has over attaching the new one onto it?

src/node_worker.cc Outdated Show resolved Hide resolved
src/node_worker.cc Outdated Show resolved Hide resolved
src/node_worker.cc Outdated Show resolved Hide resolved
src/node_worker.h Show resolved Hide resolved
@addaleax
Copy link
Member

While I don't see any issues with this approach, is this working as designed?

Yes, it’s working as designed.

What advantage it has over attaching the new one onto it?

It’s a question of correctness – the Worker object is an object in the parent thread, so env() should return the parent thread for it. We also don’t want the child thread’s Environment to be used outside of the child thread, so making it local to that and not exposing it otherwise seems to make sense?

src/node_worker.cc Outdated Show resolved Hide resolved
@addaleax
Copy link
Member

@gireeshpunathil
Copy link
Member Author

parallel/test-worker-messageport-transfer-terminate is consistently failing in power linux, and I am able to recreate locally. investigating...

@gireeshpunathil
Copy link
Member Author

parallel/test-worker-messageport-transfer-terminate fails in centos and osx too.
parallel/test-worker-debug fails in windows.

@addaleax
Copy link
Member

@gireeshpunathil Let me know if you need anything investigating those failures :)

@gireeshpunathil
Copy link
Member Author

Program terminated with signal SIGSEGV, Segmentation fault.
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:67
67	../nptl/pthread_mutex_lock.c: No such file or directory.
[Current thread is 1 (Thread 0x3fffa1d54e60 (LWP 32165))]
(gdb) where
#0  __GI___pthread_mutex_lock (mutex=0x18) at ../nptl/pthread_mutex_lock.c:67
#1  0x000000001077a3d0 in ?? () at ../deps/uv/src/unix/thread.c:288
#2  0x0000000010669de4 in node::worker::Worker::StopThread(v8::FunctionCallbackInfo<v8::Value> const&)
    ()
#3  0x00000000108dfc4c in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#4  0x00000000108e09e4 in v8::internal::Builtin_HandleApiCall(int, v8::internal::Object**, v8::internal::Isolate*) ()
#5  0x0000000011b5bc68 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit ()
void Worker::Exit(int code) {
  Mutex::ScopedLock lock(mutex_);

  Debug(this, "Worker %llu called Exit(%d)", thread_id_, code);
  if (!thread_stopper_->IsStopped()) { // --------------------------> here
    exit_code_ = code;
    Debug(this, "Received StopEventLoop request");
    thread_stopper_->Stop();
    if (isolate_ != nullptr)
      isolate_->TerminateExecution();
  }
}

Looking at this line (where it crashed), I believe that the main thread is trying to terminate the worker even before it came to life - which means thread_stopper_ is still not created.

Prior to this PR it was not an issue as we were using primitive fields under worker.

Right now I am constructing thread_stopper_ as soon as I enter the thread (::Run) but looks like that itself is too late!

I guess we could create it in the parent itself, and the async_handle can be late attached in the worker thread, of course.

With that change I ran ppc linux 1000 times and see no crashes.

@addaleax - what do you think?

@gireeshpunathil
Copy link
Member Author

btw AIX also failed the same test, and I was anticipating it!

@gireeshpunathil
Copy link
Member Author

new CI with the changes: https://ci.nodejs.org/job/node-test-pull-request/20794/

@addaleax
Copy link
Member

@gireeshpunathil Yes, that makes a lot of sense. Do these AsyncRequest fields need to be allocated separately, now that they are both constructed in the constructor anyway? They could be direct members of Worker, right?

@gireeshpunathil
Copy link
Member Author

you mean to unwrap the AsyncRequest object and make everything part of the worker::Worker? but then we can't re-use it elsewhere? for example #21283 where we would want this (which is outside the scope of worker_threads)

@gireeshpunathil
Copy link
Member Author

btw wrote a new test that recreated the crash in xlinux with the old code and confirmed the theory. When the main thread went to terminate, the worker was still being cloned.

(gdb) where
#0  0x00007f1d0de23c30 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x0000000001019239 in uv_mutex_lock (mutex=0x18) at ../deps/uv/src/unix/thread.c:287
#2  0x0000000000d9b0a0 in node::LibuvMutexTraits::mutex_lock (mutex=0x18) at ../src/node_mutex.h:108
#3  0x0000000000d9ce86 in node::MutexBase<node::LibuvMutexTraits>::ScopedLock::ScopedLock (
    this=0x7ffe42fea980, mutex=...) at ../src/node_mutex.h:164
#4  0x0000000000efd41a in node::worker::AsyncRequest::IsStopped (this=0x0) at ../src/node_worker.cc:87
#5  0x0000000000f00105 in node::worker::Worker::Exit (this=0x5060d20, code=1)
    at ../src/node_worker.cc:549
#6  0x0000000000efff41 in node::worker::Worker::StopThread (args=...) at ../src/node_worker.cc:526
#7  0x0000000001185b4d in v8::internal::FunctionCallbackArguments::Call (
    this=this@entry=0x7ffe42feab90, handler=handler@entry=0x18d96d7f5391)
    at ../deps/v8/src/api-arguments-inl.h:140
#8  0x0000000001186802 in v8::internal::(anonymous namespace)::HandleApiCallHelper<false> (
    isolate=isolate@entry=0x4f6da90, function=..., function@entry=..., new_target=..., 
    new_target@entry=..., fun_data=..., receiver=..., receiver@entry=..., args=...)
    at ../deps/v8/src/builtins/builtins-api.cc:109
#9  0x000000000118ac2b in v8::internal::Builtin_Impl_HandleApiCall (args=..., 
    isolate=isolate@entry=0x4f6da90) at ../deps/v8/src/builtins/builtins-api.cc:139
#10 0x000000000118b641 in v8::internal::Builtin_HandleApiCall (args_length=5, 
    args_object=0x7ffe42feadb8, isolate=0x4f6da90) at ../deps/v8/src/builtins/builtins-api.cc:127
#11 0x000000000249ea95 in Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit ()
#12 0x00000840f988cb8e in ?? ()
#13 0x000004c1108025a1 in ?? ()
#14 0x000018d96d7f5621 in ?? ()
#15 0x0000000500000000 in ?? ()
#16 0x000004c110802681 in ?? ()
#17 0x00003a17f9dd3be9 in ?? ()
(gdb) f 5
#5  0x0000000000f00105 in node::worker::Worker::Exit (this=0x5060d20, code=1)
    at ../src/node_worker.cc:549
549	  if (!thread_stopper_->IsStopped()) {
(gdb) thr 8
[Switching to thread 8 (Thread 0x7f1d06ffd700 (LWP 42711))]
#0  0x00007f1d0db4bb01 in clone () from /lib64/libc.so.6
(gdb) where
#0  0x00007f1d0db4bb01 in clone () from /lib64/libc.so.6
#1  0x00007f1d0de21d10 in ?? () from /lib64/libpthread.so.0
#2  0x00007f1d06ffd700 in ?? ()
#3  0x0000000000000000 in ?? ()
(gdb) 

@addaleax
Copy link
Member

you mean to unwrap the AsyncRequest object and make everything part of the worker::Worker?

Sorry for being unclear – what I meant was to use AsyncRequest thread_stopper_; instead of std::unique_ptr<AsyncRequest> thread_stopper_; (and to perform the necessary replacements of -> with . etc.).

@gireeshpunathil
Copy link
Member Author

thanks @addaleax - I followed that suggestion.

Several test were crashing in windows. Here is what I found:

00007FF6A9505BE0  push        rbx  
00007FF6A9505BE2  sub         rsp,20h  
00007FF6A9505BE6  mov         rbx,qword ptr [rcx]  
00007FF6A9505BE9  mov         edx,0E0h  
00007FF6A9505BEE  mov         rax,qword ptr [rbx]  
>> 00007FF6A9505BF1  dec         dword ptr [rax+7D0h]  
00007FF6A9505BF7  mov         rax,qword ptr [rbx+10h]  
00007FF6A9505BFB  mov         qword ptr [rcx],rax  
00007FF6A9505BFE  call        operator delete (07FF6A93C1B4Ah)  
00007FF6A9505C03  mov         edx,18h  
00007FF6A9505C08  mov         rcx,rbx  
00007FF6A9505C0B  add         rsp,20h  
00007FF6A9505C0F  pop         rbx  

this points to data->env access in CloseHandle method

  uv_close(reinterpret_cast<uv_handle_t*>(handle), [](uv_handle_t* handle) {
    std::unique_ptr<CloseData> data { static_cast<CloseData*>(handle->data) };
    data->env->handle_cleanup_waiting_--;
    handle->data = data->original_data;
    data->callback(reinterpret_cast<T*>(handle));
  });

It is possible that when we issue the CloseHandle call the Environment was live, but by the time the callback was issued (may be in the next tick it is torn down? And in windows we have seen in the past that freed memory gets filled with garbage.

If I nullify the env_ field in the OnScopeLeave and skip the CloseHandle call based on env_'s existence, the problem surfaces as handle leak, reported by CheckedUvLoopClose.

Looking for suggestions at this point!

@gireeshpunathil
Copy link
Member Author

ok, I guess I found the issue - I was issuing thread_stopper_.Uninstall() very late in the cycle, based on my original premise of keeping the pointer as late as possible. Following the existing code, I see it closes the handle in MessagePort::OnClose. Following the same sequence, I am able to solve the issue. Currently running tests in Windows locally.

@addaleax
Copy link
Member

@gireeshpunathil Yeah, I think that makes sense. :)

@gireeshpunathil gireeshpunathil force-pushed the uvcontrol branch 3 times, most recently from 0035d2c to f53a1e4 Compare February 17, 2019 10:55
src/node_worker.h Outdated Show resolved Hide resolved
src/node_worker.h Outdated Show resolved Hide resolved
@@ -34,14 +55,13 @@ class Worker : public AsyncWrap {
tracker->TrackFieldWithSize(
"isolate_data", sizeof(IsolateData), "IsolateData");
tracker->TrackFieldWithSize("env", sizeof(Environment), "Environment");
tracker->TrackField("thread_exit_async", *thread_exit_async_);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We’re not tracking the uv_async_t anymore, right? Maybe we should add something like tracker->TrackInlineField() that allows us to keep track of MemoryRetainers that are direct members of the class…

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean - the one represented by thread_exit_async_ ? that is replaced withAsyncRequest objects that creates uv_async_t objects, and tracks through the interface method. the async_ field in AsyncRequest is still a pointer, direct member of neither AsyncRequest nor Worker.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neither *thread_stopper_.async_ nor *on_thread_finished_.async_ are tracked, yes, because we don’t inform the tracker about the existence of the AsyncRequest fields.

Also, side note: I’m just noticing that we have the IsolateData and Environment fields listed here as well, which I’m not sure makes sense given that they are no longer directly allocated by this object…

Copy link
Member Author

@gireeshpunathil gireeshpunathil Feb 17, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, I don't understand. *thread_stopper_.async and *on_thread_finished_.async_ are not tracked through the tracker instance or of worker, but those are tracked through the tracker instance of AsyncRequest object (line 98):

void AsyncRequest::MemoryInfo(MemoryTracker* tracker) const {
  Mutex::ScopedLock lock(mutex_);
  if (async_ != nullptr) tracker->TrackField("async_request", *async_);
}

Isn't it enough? I hope we don't need multiple trackers for the same allocation?

For the IsolateData and Environment: I just removed those from being actively tracked by the worker and pushed in under this PR itself.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gireeshpunathil The problem is that the memory tracker doesn’t know that it should call AsyncRequest::MemoryInfo. Currently, the way to inform it would be adding tracker->TrackField("thread_stopper_", &thread_stopper_);, but then we would end up tracking the memory for the AsyncRequest itself twice.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gireeshpunathil Should we change this PR to use TrackInlineField now?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@addaleax - yes. Though I knew this depend on #26161 for a moment I forgot about that!

@addaleax addaleax mentioned this pull request Feb 17, 2019
4 tasks
@gireeshpunathil
Copy link
Member Author

@addaleax addaleax added the author ready PRs that have at least one approval, no pending requests for changes, and a CI started. label Feb 28, 2019
@addaleax
Copy link
Member

The current mechanism of uses two async handles, one owned by the
creator of the worker thread to terminate a running worker,
and another one employed by the worker to interrupt its creator on its
natural termination. The force termination piggybacks on the message-
passing mechanism to inform the worker to quiesce.

Also there are few flags that represent the other thread's state /
request state because certain code path is shared by multiple
control flows, and there are certain code path where the async
handles may not have come to life.

Refactor into a LoopStopper abstraction that exposes routines to
install a handle as well as to save a state.

Refs: nodejs#21283
@gireeshpunathil
Copy link
Member Author

@addaleax
Copy link
Member

addaleax commented Mar 1, 2019

Landed in d14cba4 :)

@addaleax addaleax closed this Mar 1, 2019
addaleax pushed a commit that referenced this pull request Mar 1, 2019
The current mechanism of uses two async handles, one owned by the
creator of the worker thread to terminate a running worker,
and another one employed by the worker to interrupt its creator on its
natural termination. The force termination piggybacks on the message-
passing mechanism to inform the worker to quiesce.

Also there are few flags that represent the other thread's state /
request state because certain code path is shared by multiple
control flows, and there are certain code path where the async
handles may not have come to life.

Refactor into an AsyncRequest abstraction that exposes routines to
install a handle as well as to save a state.

PR-URL: #26099
Refs: #21283
Reviewed-By: Anna Henningsen <[email protected]>
addaleax pushed a commit that referenced this pull request Mar 1, 2019
The current mechanism of uses two async handles, one owned by the
creator of the worker thread to terminate a running worker,
and another one employed by the worker to interrupt its creator on its
natural termination. The force termination piggybacks on the message-
passing mechanism to inform the worker to quiesce.

Also there are few flags that represent the other thread's state /
request state because certain code path is shared by multiple
control flows, and there are certain code path where the async
handles may not have come to life.

Refactor into an AsyncRequest abstraction that exposes routines to
install a handle as well as to save a state.

PR-URL: #26099
Refs: #21283
Reviewed-By: Anna Henningsen <[email protected]>
@BridgeAR BridgeAR mentioned this pull request Mar 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
author ready PRs that have at least one approval, no pending requests for changes, and a CI started. c++ Issues and PRs that require attention from people who are familiar with C++. worker Issues and PRs related to Worker support.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants