
storage hang on terminate #3005

Closed
kikimo opened this issue Oct 7, 2021 · 4 comments
Labels: type/bug Type: something is unexpected

@kikimo
Contributor

kikimo commented Oct 7, 2021


Describe the bug (must be provided)

Storage hangs on terminate; this seems to be a bug introduced by #2843.

Your Environments (must be provided)

  • OS: uname -a
  • Compliler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id (e.g. a3ffc7d8)

How To Reproduce (must be provided)

Start a cluster and run kill ${PID_OF_STORAGED}: the storaged process never exits, and pstack shows:

...
thread: 0, lwp: 71129, type: 0
#0  0x00007ffff7e57376 in __pthread_cond_wait()+534 in /lib/x86_64-linux-gnu/libpthread.so.0 at futex-internal.h:183
#1  0x0000000004b187f0 in _ZNSt18condition_variable4waitERSt11unique_lockISt5mutexE!()+15 in /root/src/nebula/build/bin/nebula-storaged
#2  0x0000000002b8c048 in _ZZN6nebula7storage16AdminTaskManager21handleUnreportedTasksEvENKUlvE_clEv!()+117 in /root/src/nebula/build/bin/nebula-storaged at AdminTaskManager.cpp:47
#3  0x0000000002bb07ee in std::__invoke_impl<void, nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> >()+29 in /root/src/nebula/build/bin/nebula-storaged at invoke.h:60
#4  0x0000000002bb07a4 in std::__invoke<nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> >()+29 in /root/src/nebula/build/bin/nebula-storaged at invoke.h:95
#5  0x0000000002bb0752 in std::thread::_Invoker<std::tuple<nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> > >::_M_invoke<0>()+37 in /root/src/nebula/build/bin/nebula-storaged at thread:244
#6  0x0000000002bb0708 in std::thread::_Invoker<std::tuple<nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> > >::operator()()+21 in /root/src/nebula/build/bin/nebula-storaged at thread:251
#7  0x0000000002bb0502 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<nebula::storage::AdminTaskManager::handleUnreportedTasks()::<lambda()> > > >::_M_run()+29 in /root/src/nebula/build/bin/nebula-storaged at thread:195
#8  0x0000000004b8f2b4 in execute_native_thread_routine!()+19 in /root/src/nebula/build/bin/nebula-storaged
#9  0x00007ffff7e50609 in start_thread()+216 in /lib/x86_64-linux-gnu/libpthread.so.0 at pthread_create.c:477
#10 0x00007ffff7d77293 in __GI___clone!()+66 in /lib/x86_64-linux-gnu/libc.so.6 at clone.S:95
...

Code around AdminTaskManager.cpp:47:

void AdminTaskManager::handleUnreportedTasks() {
  using futTuple =
      std::tuple<JobID, TaskID, std::string, folly::Future<StatusOr<nebula::cpp2::ErrorCode>>>;
  if (env_ == nullptr) return;
  unreportedAdminThread_.reset(new std::thread([this] {
    bool ifAny = true;
    while (true) {
      std::unique_lock<std::mutex> lk(unreportedMutex_);
      if (!ifAny) unreportedCV_.wait(lk);
      ifAny = false;
      std::unique_ptr<kvstore::KVIterator> iter;
      auto kvRet = env_->adminStore_->scan(&iter);
      if (kvRet != nebula::cpp2::ErrorCode::SUCCEEDED || iter == nullptr) continue;
      std::vector<std::string> keys;
      std::vector<futTuple> futVec;
      for (; iter->valid(); iter->next()) {

and the log shows:

I1007 07:58:03.991796 75386 NebulaStore.cpp:49] ~NebulaStore()
I1007 07:58:03.992041 75485 StorageServer.cpp:269] The admin service stopped
I1007 07:58:03.992151 75486 StorageServer.cpp:294] The internal storage  service stopped
I1007 07:58:03.992262 75484 StorageServer.cpp:240] The storage service stopped
I1007 07:58:03.992362 75386 StorageDaemon.cpp:147] The storage Daemon stopped

It seems that the thread unreportedAdminThread_, waiting on the condition variable unreportedCV_, is blocking the whole process from exiting.
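
For reference, a minimal sketch of the usual fix pattern for this kind of hang: a stop flag checked as the wait predicate, set and notified at shutdown, with the thread joined before the condition variable is destroyed. The names here (Worker, stopped_, hasWork_, shutdown()) are hypothetical and only illustrate the pattern, not the project's actual fix:

#include <condition_variable>
#include <mutex>
#include <thread>

class Worker {
 public:
  Worker() {
    thread_ = std::thread([this] {
      std::unique_lock<std::mutex> lk(mutex_);
      while (!stopped_) {
        // Wake up either for new work or for shutdown; a bare wait()
        // with no predicate can block forever once nothing notifies.
        cv_.wait(lk, [this] { return stopped_ || hasWork_; });
        if (stopped_) break;
        hasWork_ = false;
        // ... handle the pending work ...
      }
    });
  }

  void shutdown() {
    {
      std::lock_guard<std::mutex> lk(mutex_);
      stopped_ = true;
    }
    cv_.notify_all();                        // wake the waiting worker
    if (thread_.joinable()) thread_.join();  // no waiter left on cv_
  }

  ~Worker() { shutdown(); }  // cv_ is destroyed only after the join

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool stopped_{false};
  bool hasWork_{false};
  std::thread thread_;
};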

@kikimo kikimo added the type/bug Type: something is unexpected label Oct 7, 2021
@Sophie-Xie Sophie-Xie added this to the v2.6.0 milestone Oct 8, 2021
@kikimo
Contributor Author

kikimo commented Oct 8, 2021

Marking this: we should also add an integration test later to check graceful cluster termination. @kikimo @HarrisChu

@liwenhui-soul
Contributor

#0  futex_wait (private=<optimized out>, expected=12, futex_word=0x58d1664 <nebula::storage::AdminTaskManager::instance(nebula::storage::StorageEnv*)::sAdminTaskManager+2660>)
    at ../sysdeps/nptl/futex-internal.h:141
#1  futex_wait_simple (private=<optimized out>, expected=12, futex_word=0x58d1664 <nebula::storage::AdminTaskManager::instance(nebula::storage::StorageEnv*)::sAdminTaskManager+2660>)
    at ../sysdeps/nptl/futex-internal.h:172
#2  __pthread_cond_destroy (cond=0x58d1640 <nebula::storage::AdminTaskManager::instance(nebula::storage::StorageEnv*)::sAdminTaskManager+2624>) at pthread_cond_destroy.c:54
#3  0x00000000029e3ab2 in nebula::storage::AdminTaskManager::~AdminTaskManager (this=0x58d0c00 <nebula::storage::AdminTaskManager::instance(nebula::storage::StorageEnv*)::sAdminTaskManager>,
    __in_chrg=<optimized out>) at /root/src/nebula/src/storage/admin/AdminTaskManager.h:23
#4  0x00007ffff7c9ea27 in __run_exit_handlers (status=0, listp=0x7ffff7e40718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#5  0x00007ffff7c9ebe0 in __GI_exit (status=<optimized out>) at exit.c:139
#6  0x00007ffff7c7c0ba in __libc_start_main (main=0x29cd5a9 <main(int, char**)>, argc=23, argv=0x7fffffffe3b8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
    stack_end=0x7fffffffe3a8) at ../csu/libc-start.c:342
#7  0x00000000029c824e in _start ()

@kikimo
Contributor Author

kikimo commented Oct 9, 2021

This problem happens on Ubuntu 20.04 but cannot be reproduced on CentOS 7.9.2009. The above stack shows a thread blocked in __pthread_cond_destroy(), and the man page of pthread_cond_destroy() states:

It shall be safe to destroy an initialized condition variable upon which no threads are currently blocked. Attempting to destroy a condition variable upon which other threads are currently blocked results in undefined behavior.

So it might be a problem related to the pthread implementation; in any case, we had better terminate a worker thread gracefully at shutdown.
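
To make that concrete, a hedged sketch of what graceful termination could look like for this singleton, reusing the member names from the snippet above; the shutdown_ flag (and the corresponding check it implies in the worker loop) is an assumption for illustration, not the actual patch in #3014:

// Hypothetical destructor: signal the worker, wake it, and join it
// before the members (including unreportedCV_) are destroyed, so that
// pthread_cond_destroy() never runs while a thread is still blocked.
AdminTaskManager::~AdminTaskManager() {
  {
    std::lock_guard<std::mutex> lk(unreportedMutex_);
    shutdown_ = true;  // assumed stop flag, checked by the worker loop
  }
  unreportedCV_.notify_all();
  if (unreportedAdminThread_ != nullptr && unreportedAdminThread_->joinable()) {
    unreportedAdminThread_->join();
  }
}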

@critical27
Contributor

Fixed in #3014
