Rebuild edge index running forever and storaged hang on exit #3353

kikimo · 2021-11-25T06:54:00Z

Describe the bug (required)

Rebuilding an edge index runs forever, and storaged hangs on exit.

It seems that the index rebuild task is stuck on a baton.wait():

#28 0x0000000004621d26 in folly::EventBase::loop() ()
Thread 132 (Thread 0x7fdda00ff700 (LWP 398938) "TaskManager1"):
#0  0x00007fde20ce8d19 in syscall () from /lib64/libc.so.6
#1  0x00000000041415f1 in folly::detail::futexWaitImpl(std::atomic<unsigned int> const*, unsigned int, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, unsigned int) ()
#2  0x0000000002a9343e in folly::detail::futexWaitUntil<std::atomic<unsigned int>, std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (futex=0x7fdda00fbc4c, expected=2, deadline=..., waitMask=4294967295) at /data/src/wwl/nebula/build/third-party/install/include/folly/detail/Futex-inl.h:119
#3  0x0000000002a8d8c7 in folly::detail::MemoryIdler::futexWaitUntil<std::atomic<unsigned int>, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (fut=..., expected=2, deadline=..., waitMask=4294967295, idleTimeout=..., stackToRetain=1024, timeoutVariationFrac=0.5) at /data/src/wwl/nebula/build/third-party/install/include/folly/detail/MemoryIdler.h:164
#4  0x0000000002b4d5b4 in folly::Baton<true, std::atomic>::tryWaitSlow<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7fdda00fbc4c, deadline=..., opt=...) at /data/src/wwl/nebula/build/third-party/install/include/folly/synchronization/Baton.h:305
#5  0x0000000002c7648c in folly::Baton<true, std::atomic>::wait (opt=..., this=<optimized out>) at /data/src/wwl/nebula/build/third-party/install/include/folly/synchronization/Baton.h:177
#6  nebula::storage::RebuildIndexTask::removeLegacyLogs (this=0x7fde18934b00, space=1, part=1) at /data/src/wwl/nebula/src/storage/admin/RebuildIndexTask.cpp:211
#7  0x0000000002c74907 in nebula::storage::RebuildIndexTask::invoke (this=0x7fde18934b00, space=1, part=1, items=...) at /data/src/wwl/nebula/src/storage/admin/RebuildIndexTask.cpp:82
#8  0x0000000002c81848 in std::__invoke_impl<nebula::cpp2::ErrorCode, nebula::cpp2::ErrorCode (nebula::storage::RebuildIndexTask::*&)(int, int, std::vector<std::shared_ptr<nebula::meta::cpp2::IndexItem>, std::allocator<std::shared_ptr<nebula::meta::cpp2::IndexItem> > > const&), nebula::storage::RebuildIndexTask*&, int&, int&, std::vector<std::shared_ptr<nebula::meta::cpp2::IndexItem>, std::allocator<std::shared_ptr<nebula::meta::cpp2::IndexItem> > >&> (__f=@0x7fdded4270c0: (nebula::cpp2::ErrorCode (nebula::storage::RebuildIndexTask::*)(nebula::storage::RebuildIndexTask * const, int, int, const std::vector<std::shared_ptr<nebula::meta::cpp2::IndexItem>, std::allocator<std::shared_ptr<nebula::meta::cpp2::IndexItem> > > &)) 0x2c748c6 <nebula::storage::RebuildIndexTask::invoke(int, int, std::vector<std::shared_ptr<nebula::meta::cpp2::IndexItem>, std::allocator<std::shared_ptr<nebula::meta::cpp2::IndexItem> > > const&)>, __t=@0x7fdded4270f0: 0x7fde18934b00) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:73

The method it is stuck in:

nebula/src/storage/admin/RebuildIndexTask.cpp

Lines 195 to 214 in a14d7b4

    
    nebula::cpp2::ErrorCode RebuildIndexTask::removeLegacyLogs(GraphSpaceID space, PartitionID part) {
      auto operationPrefix = OperationKeyUtils::operationPrefix(part);
      folly::Baton<true, std::atomic> baton;
      auto result = nebula::cpp2::ErrorCode::SUCCEEDED;
      env_->kvstore_->asyncRemoveRange(space,
                                       part,
                                       NebulaKeyUtils::firstKey(operationPrefix, sizeof(int64_t)),
                                       NebulaKeyUtils::lastKey(operationPrefix, sizeof(int64_t)),
                                       [&result, &baton](nebula::cpp2::ErrorCode code) {
                                         if (code != nebula::cpp2::ErrorCode::SUCCEEDED) {
                                           LOG(ERROR) << "Modify the index failed";
                                           result = code;
                                         }
                                         baton.post();
                                       });
      baton.wait();
      return nebula::cpp2::ErrorCode::SUCCEEDED;
    }
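For illustration only, here is a minimal std-only sketch of the same "start an async operation, then block until its callback fires" pattern, with a bounded wait added. It is not the Nebula code and not the fix discussed in this thread: `ErrorCode`, `Callback`, `E_TIMEOUT` and `waitWithTimeout` are hypothetical names, and `std::promise`/`std::future` stand in for `folly::Baton` (which, as far as I know, also offers a timed wait via `try_wait_for`). Unlike the quoted method, which always returns SUCCEEDED after the wait, the sketch returns whatever code the callback reported.

```cpp
#include <chrono>
#include <functional>
#include <future>
#include <memory>

// Hypothetical error codes, standing in for nebula::cpp2::ErrorCode.
enum class ErrorCode { SUCCEEDED, E_UNKNOWN, E_TIMEOUT };

using Callback = std::function<void(ErrorCode)>;

// Starts `asyncOp`, then waits for its completion callback for at most `timeout`.
// If the callback never fires (as in the reported hang), E_TIMEOUT is returned
// instead of blocking forever.
ErrorCode waitWithTimeout(const std::function<void(Callback)>& asyncOp,
                          std::chrono::milliseconds timeout) {
  // The promise is shared with the callback, so a callback that fires after
  // the timeout does not touch a dead stack frame.
  auto done = std::make_shared<std::promise<ErrorCode>>();
  auto fut = done->get_future();

  asyncOp([done](ErrorCode code) { done->set_value(code); });

  if (fut.wait_for(timeout) != std::future_status::ready) {
    return ErrorCode::E_TIMEOUT;
  }
  return fut.get();  // propagate the code reported by the callback
}
```

Passing an operation that invokes the callback right away yields the reported code, while an operation that drops the callback yields `E_TIMEOUT` once the timeout elapses. Sharing the promise through a `shared_ptr` is a deliberate choice here: capturing locals by reference, as the quoted method does, is only safe when the wait is unbounded.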

full pstack:
365749.txt

Your Environments (required)

Commit id c6d1046



liuyu85cn · 2021-11-25T11:21:57Z

This happens because two threads deadlock.
(It only occurs when the whole cluster is shut down, and only on the leader, if its two followers shut down before it does.)

Thread 1 (Thread 132 in the pstack):
It is running a rebuild index job
and calls kvstore::asyncRemoveRange.
Because its two followers have already shut down,
this thread loops forever until raft.stop() is called.
raft.stop() is called by the destructor of NebulaStore.

Thread 2 (Thread 1 in the pstack):
It is calling StorageServer::stop() to stop all services.
We need to stop TaskManager before resetting NebulaStore.
However, TaskManager has to wait for all of its running tasks to finish,
and one of them is the thread above.

Solution:
We could stop the raft service in the stop() function of NebulaStore
instead of in its destructor.
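The ordering problem and the proposed change can be modelled with a small std-only sketch. Everything in it is a hypothetical stand-in rather than Nebula code: `FakeStore` plays the role of NebulaStore, `removeRange()` models a write that keeps retrying because no follower is reachable, and `shutdownWithExplicitStop()` models the shutdown path. The point is only that an explicit `stop()` lets the pending operation return, so waiting for the task can finish.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical result codes for this model.
enum class Code { SUCCEEDED, E_STOPPED };

// Hypothetical stand-in for NebulaStore.
class FakeStore {
 public:
  // Models a write with no reachable followers: it can never commit,
  // so it keeps retrying until the store is told to stop.
  Code removeRange() {
    while (!stopped_.load(std::memory_order_acquire)) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return Code::E_STOPPED;
  }

  // Ends the retry loop explicitly, instead of leaving that to the destructor.
  void stop() { stopped_.store(true, std::memory_order_release); }

 private:
  std::atomic<bool> stopped_{false};
};

// Models the shutdown path: stop the store first, then wait for the task.
Code shutdownWithExplicitStop() {
  FakeStore store;
  Code taskResult = Code::SUCCEEDED;

  // The "rebuild index" task, blocked inside the store operation.
  std::thread task([&] { taskResult = store.removeRange(); });

  store.stop();  // unblock pending store operations before waiting
  task.join();   // waiting for the task now terminates
  return taskResult;
}
```

If `task.join()` came before `store.stop()`, or if the loop were only ended from `FakeStore`'s destructor, this model would hang the same way the thread describes: the waiter needs the task to finish, and the task needs the store to be stopped.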

critical27 · 2021-11-29T02:20:17Z

fixed in #3358

kikimo added the type/bug Type: something is unexpected label Nov 25, 2021

kikimo assigned liwenhui-soul, liuyu85cn and critical27 Nov 25, 2021

kikimo added this to Nebula Graph v3.0.0 Nov 25, 2021

kikimo moved this to Todo in Nebula Graph v3.0.0 Nov 25, 2021

liuyu85cn mentioned this issue Nov 25, 2021

[fix] storage may stuck, if we have kvstore operation while shutdown cluster #3358

Merged

3 tasks

Sophie-Xie added this to the v3.0.0 milestone Nov 26, 2021

Sophie-Xie removed this from Nebula Graph v3.0.0 Nov 26, 2021

jamieliu1023 mentioned this issue Nov 27, 2021

Weekly Report 2021-11-26 vesoft-inc/nebula-community#49

Closed

critical27 closed this as completed Nov 29, 2021

jamieliu1023 mentioned this issue Dec 4, 2021

Weekly Report 2021-12-03 vesoft-inc/nebula-community#51

Closed
