All storage instances crashed after 7 hours of pressure test #3373
Comments
Did graph and meta crash?
No, they are running well.
There are thread-unsafe operations (see the sketch below), but I think they should not be the cause of the crash: threadLocalInfo.localCache_[spaceId] = infoDeepCopy; // infoDeepCopy is a shared_ptr
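A minimal sketch of the kind of race that assignment implies, with hypothetical stand-in types (SpaceInfo, ThreadLocalInfo, and the loop bounds are made up for illustration, not the real MetaClient layout): writing into an unordered_map from one thread while another thread reads it is undefined behavior, even though the shared_ptr control block itself is thread-safe.

```cpp
#include <memory>
#include <thread>
#include <unordered_map>

// Hypothetical stand-ins; not the real MetaClient internals.
struct SpaceInfo {};

struct ThreadLocalInfo {
  std::unordered_map<int, std::shared_ptr<SpaceInfo>> localCache_;
};

int main() {
  ThreadLocalInfo threadLocalInfo;
  auto infoDeepCopy = std::make_shared<SpaceInfo>();

  // Writer: each insertion may rehash and mutate the map's buckets.
  std::thread writer([&] {
    for (int spaceId = 0; spaceId < 100000; ++spaceId) {
      threadLocalInfo.localCache_[spaceId] = infoDeepCopy;  // unsynchronized write
    }
  });

  // Reader: a concurrent find() on the same map is a data race (UB);
  // the atomic refcount inside shared_ptr does not protect the container.
  std::thread reader([&] {
    for (int spaceId = 0; spaceId < 100000; ++spaceId) {
      auto it = threadLocalInfo.localCache_.find(spaceId);  // unsynchronized read
      (void)it;
    }
  });

  writer.join();
  reader.join();
  return 0;
}
```

Running a pattern like this under ThreadSanitizer (-fsanitize=thread) flags the race immediately, but a race of this kind would typically corrupt one map rather than bring down all three storage instances at once, which matches the view that it is probably not the root cause.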
Perhaps related to #3192: there is a hidden bug in MetaClient. Frequent leader changes would cause the meta version to be updated, and the meta client would then pull data from the meta server as a consequence.
I found a similar bug in the folly repo: facebook/folly#1252.
The current guess is that a NULL pointer returned under OOM is not checked, so memory near address 0x00 gets modified, and the program then crashes when destructors run at stop.
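A minimal sketch of the failure pattern that guess describes, using a made-up Record type and a raw malloc for illustration (the actual allocation site in storaged is not identified here): if an allocation fails under memory pressure and the NULL result is used unchecked, the subsequent member stores land at small offsets from 0x00; if such an address happens to be mapped, the write corrupts memory silently and the damage only surfaces later, e.g. when destructors run at shutdown.

```cpp
#include <cstdlib>

// Hypothetical record; the large leading member pushes the second field's
// address to roughly NULL + 1 MB when the base pointer is NULL.
struct Record {
  char buffer[1 << 20];
  long refCount;
};

int main() {
  // Under OOM, malloc can return NULL.
  Record* r = static_cast<Record*>(std::malloc(sizeof(Record)));

  // Missing check, e.g.:
  //   if (r == nullptr) { /* report OOM and bail out */ }
  //
  // With r == NULL, the store below targets a low address
  // (offsetof(Record, refCount) from 0x00). If that region happens to be
  // mapped, the write does not fault; it silently clobbers whatever lives
  // there, and the process only crashes later, e.g. in a destructor at stop.
  r->refCount = 1;

  std::free(r);
  return 0;
}
```

If this theory holds, auditing allocation sites for unchecked NULL returns while repeating the pressure test should surface the offending path.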
Please check the FAQ documentation before raising an issue
Describe the bug (required)
A Nebula cluster of 3 storaged + 1 graphd + 1 metad; we keep inserting edges and triggering leader changes. After running for about 7 hours, all storage instances crash almost at the same time, and they all have similar crash stacks:
Your Environments (required)
g++ --version
or clang++ --version
lscpu
Commit id (e.g. a3ffc7d8): c6d1046
How To Reproduce (required)
Steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Provide logs and configs, or any other context to trace the problem.