Rebuild edge index running forever and storaged hang on exit #3353

Closed

kikimo opened this issue Nov 25, 2021 · 2 comments

@kikimo (Contributor) commented Nov 25, 2021

Please check the FAQ documentation and old issues before raising an issue in case someone has asked the same question that you are asking.

Describe the bug (required)

Rebuild edge index running forever and storaged hang on exit.

It seems that the index build task is stuck on a baton.wait():

#28 0x0000000004621d26 in folly::EventBase::loop() ()
Thread 132 (Thread 0x7fdda00ff700 (LWP 398938) "TaskManager1"):
#0  0x00007fde20ce8d19 in syscall () from /lib64/libc.so.6
#1  0x00000000041415f1 in folly::detail::futexWaitImpl(std::atomic<unsigned int> const*, unsigned int, std::chrono::time_point<std::chrono::_V2::system_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > const*, unsigned int) ()
#2  0x0000000002a9343e in folly::detail::futexWaitUntil<std::atomic<unsigned int>, std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (futex=0x7fdda00fbc4c, expected=2, deadline=..., waitMask=4294967295) at /data/src/wwl/nebula/build/third-party/install/include/folly/detail/Futex-inl.h:119
#3  0x0000000002a8d8c7 in folly::detail::MemoryIdler::futexWaitUntil<std::atomic<unsigned int>, std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > >, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (fut=..., expected=2, deadline=..., waitMask=4294967295, idleTimeout=..., stackToRetain=1024, timeoutVariationFrac=0.5) at /data/src/wwl/nebula/build/third-party/install/include/folly/detail/MemoryIdler.h:164
#4  0x0000000002b4d5b4 in folly::Baton<true, std::atomic>::tryWaitSlow<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l> > > (this=0x7fdda00fbc4c, deadline=..., opt=...) at /data/src/wwl/nebula/build/third-party/install/include/folly/synchronization/Baton.h:305
#5  0x0000000002c7648c in folly::Baton<true, std::atomic>::wait (opt=..., this=<optimized out>) at /data/src/wwl/nebula/build/third-party/install/include/folly/synchronization/Baton.h:177
#6  nebula::storage::RebuildIndexTask::removeLegacyLogs (this=0x7fde18934b00, space=1, part=1) at /data/src/wwl/nebula/src/storage/admin/RebuildIndexTask.cpp:211
#7  0x0000000002c74907 in nebula::storage::RebuildIndexTask::invoke (this=0x7fde18934b00, space=1, part=1, items=...) at /data/src/wwl/nebula/src/storage/admin/RebuildIndexTask.cpp:82
#8  0x0000000002c81848 in std::__invoke_impl<nebula::cpp2::ErrorCode, nebula::cpp2::ErrorCode (nebula::storage::RebuildIndexTask::*&)(int, int, std::vector<std::shared_ptr<nebula::meta::cpp2::IndexItem>, std::allocator<std::shared_ptr<nebula::meta::cpp2::IndexItem> > > const&), nebula::storage::RebuildIndexTask*&, int&, int&, std::vector<std::shared_ptr<nebula::meta::cpp2::IndexItem>, std::allocator<std::shared_ptr<nebula::meta::cpp2::IndexItem> > >&> (__f=@0x7fdded4270c0: (nebula::cpp2::ErrorCode (nebula::storage::RebuildIndexTask::*)(nebula::storage::RebuildIndexTask * const, int, int, const std::vector<std::shared_ptr<nebula::meta::cpp2::IndexItem>, std::allocator<std::shared_ptr<nebula::meta::cpp2::IndexItem> > > &)) 0x2c748c6 <nebula::storage::RebuildIndexTask::invoke(int, int, std::vector<std::shared_ptr<nebula::meta::cpp2::IndexItem>, std::allocator<std::shared_ptr<nebula::meta::cpp2::IndexItem> > > const&)>, __t=@0x7fdded4270f0: 0x7fde18934b00) at /data/vesoft/toolset/gcc/7.5.0/include/c++/7.5.0/bits/invoke.h:73

The stuck method:

nebula::cpp2::ErrorCode RebuildIndexTask::removeLegacyLogs(GraphSpaceID space, PartitionID part) {
  auto operationPrefix = OperationKeyUtils::operationPrefix(part);
  folly::Baton<true, std::atomic> baton;
  auto result = nebula::cpp2::ErrorCode::SUCCEEDED;
  env_->kvstore_->asyncRemoveRange(space,
                                   part,
                                   NebulaKeyUtils::firstKey(operationPrefix, sizeof(int64_t)),
                                   NebulaKeyUtils::lastKey(operationPrefix, sizeof(int64_t)),
                                   [&result, &baton](nebula::cpp2::ErrorCode code) {
                                     if (code != nebula::cpp2::ErrorCode::SUCCEEDED) {
                                       LOG(ERROR) << "Modify the index failed";
                                       result = code;
                                     }
                                     baton.post();
                                   });
  baton.wait();
  return nebula::cpp2::ErrorCode::SUCCEEDED;
}
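
For context only, a minimal defensive sketch (not the project's actual fix): replacing the final baton.wait() with a bounded folly::Baton::try_wait_for loop would let the task thread notice shutdown instead of blocking indefinitely. The isStopped() cancellation hook is hypothetical, and <chrono> is assumed to be included.

// Sketch: poll with a bounded wait and re-check a cancellation flag, so a
// raft write that can never commit does not pin this TaskManager thread.
while (!baton.try_wait_for(std::chrono::seconds(1))) {
  if (isStopped()) {  // hypothetical shutdown/cancellation hook
    LOG(WARNING) << "Aborting removeLegacyLogs: shutting down";
    return nebula::cpp2::ErrorCode::E_UNKNOWN;  // illustrative error code
  }
}
return result;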

Full pstack:
365749.txt

Your Environments (required)

  • OS: uname -a
  • Compiler: g++ --version or clang++ --version
  • CPU: lscpu
  • Commit id: c6d1046

How To Reproduce (required)

Steps to reproduce the behavior:

  1. Step 1
  2. Step 2
  3. Step 3

Expected behavior

A clear and concise description of what you expected to happen.

Additional context

Provide logs and configs, or any other context to trace the problem.

@kikimo added the type/bug label on Nov 25, 2021
@kikimo moved this to Todo in Nebula Graph v3.0.0 on Nov 25, 2021
@liuyu85cn (Contributor) commented:

We got into this situation because two threads deadlock. (It only occurs when the whole cluster shuts down, and only on a leader whose two followers shut down before it.)

Thread 1 (Thread 132 in the pstack):
runs a rebuild index job and calls kvstore::asyncRemoveRange. Because its two followers have already shut down, the write can never commit, so the callback never fires and this thread waits forever, until raft is stopped. raft.stop() is only called by the dtor of NebulaStore.

Thread 2 (Thread 1 in the pstack):
calls StorageServer::stop() to stop all services. We need to stop TaskManager before resetting NebulaStore; however, TaskManager waits for all of its running tasks to finish, and one of those tasks is the thread above.
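
The cycle can be reduced to a small self-contained illustration (using plain std::thread primitives rather than Nebula types): the "server" joins its worker before releasing the event the worker is blocked on, so the join never returns.

#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool raftStopped = false;

int main() {
  // "Thread 1": the rebuild-index task, blocked until raft stops.
  std::thread task([] {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [] { return raftStopped; });  // baton.wait() analogue
  });

  // "Thread 2": StorageServer::stop() joins TaskManager first...
  task.join();  // deadlocks: the task is still waiting for raftStopped

  // ...and only the NebulaStore dtor, which runs later, would stop raft.
  {
    std::lock_guard<std::mutex> lk(m);
    raftStopped = true;
  }
  cv.notify_all();
  return 0;
}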

Solution:
One possible fix is to stop the raft service in NebulaStore::stop() instead of in its dtor, as sketched below.
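
A minimal sketch of that direction (illustrative only; the actual patch is in #3358, and the member/method names below are assumptions rather than verified API):

// Sketch: stop raft explicitly in NebulaStore::stop(), so the callbacks of
// pending writes such as asyncRemoveRange fire and blocked task threads
// are released before TaskManager joins them.
void NebulaStore::stop() {
  if (raftService_ != nullptr) {
    raftService_->stop();  // abort in-flight raft appends, run callbacks
  }
}

With the stop ordered this way, StorageServer::stop() can stop the store (and thus raft) before waiting on TaskManager, which breaks the cycle described above.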

@critical27 (Contributor) commented:

Fixed in #3358.
