
storage hang on terminate #3005

Closed
kikimo opened this issue Oct 7, 2021 · 4 comments
Labels: type/bug Type: something is unexpected

@kikimo
Contributor

kikimo commented Oct 7, 2021


Describe the bug (must be provided)

Storage hangs on terminate; this seems to be a bug introduced by #2843.

Your Environments (must be provided)

  • OS: uname -a
  • Compliler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id (e.g. a3ffc7d8)

How To Reproduce (must be provided)

Start a cluster and run kill ${PID_OF_STORAGED}: the storaged process never exits, and pstack shows:

...
thread: 0, lwp: 71129, type: 0
#0  0x00007ffff7e57376 in __pthread_cond_wait()+534 in /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:183
#1  0x0000000004b187f0 in _ZNSt18condition_variable4waitERSt11unique_lockISt5mutexE!()+15 in /root/src/nebula/build/bin/nebula-storaged
#2  0x0000000002b8c048 in _ZZN6nebula7storage16AdminTaskManager21handleUnreportedTasksEvENKUlvE_clEv!()+117 in /root/src/nebula/build/bin/nebula-storaged at AdminTaskManager.cpp:47
#3  0x0000000002bb07ee in std::__invoke_impl<void, nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> >()+29 in /root/src/nebula/build/bin/nebula-storaged at invoke.h:60
#4  0x0000000002bb07a4 in std::__invoke<nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> >()+29 in /root/src/nebula/build/bin/nebula-storaged at invoke.h:95
#5  0x0000000002bb0752 in std::thread::_Invoker<std::tuple<nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> > >::_M_invoke<0>()+37 in /root/src/nebula/build/bin/nebula-storaged at thread:244
#6  0x0000000002bb0708 in std::thread::_Invoker<std::tuple<nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> > >::operator()()+21 in /root/src/nebula/build/bin/nebula-storaged at thread:251
#7  0x0000000002bb0502 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> > > >::_M_run()+29 in /root/src/nebula/build/bin/nebula-storaged at thread:195
#8  0x0000000004b8f2b4 in execute_native_thread_routine!()+19 in /root/src/nebula/build/bin/nebula-storaged
#9  0x00007ffff7e50609 in start_thread()+216 in /lib/x86_64-linux-gnu/libpthread.so.0 at pthread_create.c:477
#10 0x00007ffff7d77293 in __GI___clone!()+66 in /lib/x86_64-linux-gnu/libc.so.6 at clone.S:95
...

Code around AdminTaskManager.cpp:47:

void AdminTaskManager::handleUnreportedTasks() {
  using futTuple =
      std::tuple<JobID, TaskID, std::string, folly::Future<StatusOr<nebula::cpp2::ErrorCode>>>;
  if (env_ == nullptr) return;
  unreportedAdminThread_.reset(new std::thread([this] {
    bool ifAny = true;
    while (true) {
      std::unique_lock<std::mutex> lk(unreportedMutex_);
      if (!ifAny) unreportedCV_.wait(lk);
      ifAny = false;
      std::unique_ptr<kvstore::KVIterator> iter;
      auto kvRet = env_->adminStore_->scan(&iter);
      if (kvRet != nebula::cpp2::ErrorCode::SUCCEEDED || iter == nullptr) continue;
      std::vector<std::string> keys;
      std::vector<futTuple> futVec;
      for (; iter->valid(); iter->next()) {

and the log shows:

I1007 07:58:03.991796 75386 NebulaStore.cpp:49] ~NebulaStore()
I1007 07:58:03.992041 75485 StorageServer.cpp:269] The admin service stopped
I1007 07:58:03.992151 75486 StorageServer.cpp:294] The internal storage  service stopped
I1007 07:58:03.992262 75484 StorageServer.cpp:240] The storage service stopped
I1007 07:58:03.992362 75386 StorageDaemon.cpp:147] The storage Daemon stopped

It seems that the thread unreportedAdminThread_, waiting on the condition variable unreportedCV_, is blocking the whole process from exiting.
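
For reference, a minimal sketch of the usual fix pattern for this kind of hang: a stop flag checked as the wait predicate, set and notified at shutdown, with the thread joined before the condition variable is destroyed. The names here (Worker, stopped_, hasWork_, shutdown()) are hypothetical and only illustrate the pattern, not the project's actual fix:

#include <condition_variable>
#include <mutex>
#include <thread>

class Worker {
 public:
  Worker() {
    thread_ = std::thread([this] {
      std::unique_lock<std::mutex> lk(mutex_);
      while (!stopped_) {
        // Wake up either for new work or for shutdown; a bare wait()
        // with no predicate can block forever once nothing notifies.
        cv_.wait(lk, [this] { return stopped_ || hasWork_; });
        if (stopped_) break;
        hasWork_ = false;
        // ... handle the pending work ...
      }
    });
  }

  void shutdown() {
    {
      std::lock_guard<std::mutex> lk(mutex_);
      stopped_ = true;
    }
    cv_.notify_all();                        // wake the waiting worker
    if (thread_.joinable()) thread_.join();  // no waiter left on cv_
  }

  ~Worker() { shutdown(); }  // cv_ is destroyed only after the join

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stopped_{false};
  bool hasWork_{false};
  std::thread thread_;
};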

@kikimo kikimo added the type/bug Type: something is unexpected label Oct 7, 2021
@Sophie-Xie Sophie-Xie added this to the v2.6.0 milestone Oct 8, 2021
@kikimo
Contributor Author

kikimo commented Oct 8, 2021

Marking this: we should also add an integration test later to check graceful cluster termination. @kikimo @HarrisChu

@liwenhui-soul
Contributor

#0  futex_wait (private=<optimized out>, expected=12, futex_word=0x58d1664 <nebula::storage::AdminTaskManager::instance(nebula::storage::StorageEnv*)::sAdminTaskManager+2660>)
    at ../sysdeps/nptl/futex-internal.h:141
#1  futex_wait_simple (private=<optimized out>, expected=12, futex_word=0x58d1664 <nebula::storage::AdminTaskManager::instance(nebula::storage::StorageEnv*)::sAdminTaskManager+2660>)
    at ../sysdeps/nptl/futex-internal.h:172
#2  __pthread_cond_destroy (cond=0x58d1640 <nebula::storage::AdminTaskManager::instance(nebula::storage::StorageEnv*)::sAdminTaskManager+2624>) at pthread_cond_destroy.c:54
#3  0x00000000029e3ab2 in nebula::storage::AdminTaskManager::~AdminTaskManager (this=0x58d0c00 <nebula::storage::AdminTaskManager::instance(nebula::storage::StorageEnv*)::sAdminTaskManager>,
    __in_chrg=<optimized out>) at /root/src/nebula/src/storage/admin/AdminTaskManager.h:23
#4  0x00007ffff7c9ea27 in __run_exit_handlers (status=0, listp=0x7ffff7e40718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#5  0x00007ffff7c9ebe0 in __GI_exit (status=<optimized out>) at exit.c:139
#6  0x00007ffff7c7c0ba in __libc_start_main (main=0x29cd5a9 <main(int, char**)>, argc=23, argv=0x7fffffffe3b8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffe3a8) at ../csu/libc-start.c:342
#7  0x00000000029c824e in _start ()

@kikimo
Contributor Author

kikimo commented Oct 9, 2021

This problem happens on Ubuntu 20.04 but cannot be reproduced on CentOS 7.9.2009. The above stack shows a thread blocked in __pthread_cond_destroy(), and the man page of pthread_cond_destroy() states:

It shall be safe to destroy an initialized condition variable upon which no threads are currently blocked. Attempting to destroy a condition variable upon which other threads are currently blocked results in undefined behavior.

So it might be a problem related to the pthread implementation; in any case, we had better terminate a worker thread gracefully at shutdown.
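
To make that concrete, a hedged sketch of what graceful termination could look like for this singleton, reusing the member names from the snippet above; the shutdown_ flag (and the corresponding check it implies in the worker loop) is an assumption for illustration, not the actual patch in #3014:

// Hypothetical destructor: signal the worker, wake it, and join it
// before the members (including unreportedCV_) are destroyed, so that
// pthread_cond_destroy() never runs while a thread is still blocked.
AdminTaskManager::~AdminTaskManager() {
  {
    std::lock_guard<std::mutex> lk(unreportedMutex_);
    shutdown_ = true;  // assumed stop flag, checked by the worker loop
  }
  unreportedCV_.notify_all();
  if (unreportedAdminThread_ != nullptr && unreportedAdminThread_->joinable()) {
    unreportedAdminThread_->join();
  }
}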

@critical27
Contributor

Fixed in #3014
