rdsn coredump - add primary replica as a learner #315

hycdong · 2019-04-04T02:30:51Z

Server Version

Pegasus Server 1.11.3 (b45cb06) Release
c6 ai cluster (Date: 4.4 09:38)

Coredump Stack

(gdb) bt
#0  0x00007f838254f1d7 in raise () from /lib64/libc.so.6
#1  0x00007f83825508c8 in abort () from /lib64/libc.so.6
#2  0x00007f838601b3ee in dsn_coredump () at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/service_api_c.cpp:73
#3  0x00007f8385f1c2b7 in dsn::replication::replica::update_local_configuration (this=this@entry=0x394d600, config=..., same_ballot=same_ballot@entry=true)
    at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_config.cpp:807
#4  0x00007f8385f811cb in dsn::replication::replica::on_add_learner (this=0x394d600, request=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_learn.cpp:1377
#5  0x00007f8385ee2ce2 in dsn::replication::replica_stub::on_add_learner (this=0x31b6580, request=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_stub.cpp:1004
#6  0x00007f8385efd9e0 in operator() (request=<optimized out>, __closure=0x3a399660) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/include/dsn/cpp/serverlet.h:169
#7  std::_Function_handler<void (dsn::message_ex*), bool dsn::serverlet<dsn::replication::replica_stub>::register_rpc_handler<dsn::replication::group_check_request>(dsn::task_code, char const*, void (dsn::replication::replica_stub::*)(dsn::replication::group_check_request const&))::{lambda(dsn::message_ex*)#1}>::_M_invoke(std::_Any_data const&, dsn::message_ex*) (__functor=..., __args#0=<optimized out>)
    at /home/work/qinzuoyan/Pegasus/toolchain/output/include/c++/4.8.2/functional:2071
#8  0x00007f838606cce9 in dsn::task::exec_internal (this=this@entry=0x57be163e4) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task.cpp:180
#9  0x00007f83860ed42d in dsn::task_worker::loop (this=0x2f65ce0) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task_worker.cpp:211
#10 0x00007f83860ed5f9 in dsn::task_worker::run_internal (this=0x2f65ce0) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task_worker.cpp:191
#11 0x00007f8382ea7600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at /home/qinzuoyan/git.xiaomi/pegasus/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#12 0x00007f8383b14dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f838261173d in clone () from /lib64/libc.so.6

frame 3 local

(gdb) f 3
#3  0x00007f8385f1c2b7 in dsn::replication::replica::update_local_configuration (this=this@entry=0x394d600, config=..., same_ballot=same_ballot@entry=true)
    at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_config.cpp:807
807	/home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_config.cpp: No such file or directory.
(gdb) i locals
__FUNCTION__ = "update_local_configuration"
old_status = dsn::replication::partition_status::PS_PRIMARY
old_ballot = 3
r = false
oldTs = 1553249692836


hycdong · 2019-05-09T07:06:29Z

The current failure detection mechanism is shown in the diagram below:

                        |--- lease period ----| lease IsExpired, commit suicide
             |--- lease period ---|
worker: ---------------------------------------------------------------->
             \    /     \    /      _\
         beacon ack  beacon ack       x (beacon deliver failed)
              _\/        _\/
master: ---------------------------------------------------------------->
                |---- grace period ----|
                           |--- grace period ----| grace IsExpired, declare worker dead

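The replica server (worker) periodically (every 3 seconds) sends a heartbeat (beacon) to the meta server (master).
Both the worker and the master periodically (every 2 seconds) check whether the lease has expired.

The worker decides whether it has lost contact with the master based on the time of its most recent successfully sent beacon.
- If the worker concludes that it has lost contact with the master, it changes the status of all its replicas to inactive.
The master decides whether a worker has lost contact based on the time it last received a beacon from that worker.
- If the master concludes that a worker has lost contact, it handles all replicas on that node:
  - For a secondary, it sends a proposal to that replica's primary to remove the secondary.
  - For a primary, it picks one replica among the secondaries and sends it a proposal to assign it as primary.

The key to the correctness of the current failure detection mechanism is that the grace period used by the master is slightly longer than the worker's lease. This ensures that the worker detects that it has lost contact with the master before the master detects that the worker is lost.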
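
The ordering requirement between the worker's lease and the master's grace period can be illustrated with a small simulation. This is only a sketch, not the rdsn implementation: the class `LeaseClock`, the helper `first_expiry`, and the 9 s / 10 s timeouts are hypothetical values chosen for illustration; only the 2-second check interval comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class LeaseClock:
    """One side's view of its peer: a timeout and the last good beacon time."""
    timeout: float          # lease (worker) or grace period (master), in seconds
    last_ok: float = 0.0    # time of the last successful beacon, in seconds

    def expired(self, now: float) -> bool:
        return now - self.last_ok > self.timeout


def first_expiry(clock: LeaseClock, check_interval: float, horizon: float) -> float:
    """Return the first periodic check time at which the clock has expired."""
    t = check_interval
    while t <= horizon:
        if clock.expired(t):
            return t
        t += check_interval
    raise ValueError("clock never expired within the horizon")


# Hypothetical timeouts: the master's grace period is slightly longer
# than the worker's lease. The last beacon succeeded at t=0 on both sides
# and every later beacon is lost.
worker = LeaseClock(timeout=9.0)
master = LeaseClock(timeout=10.0)

worker_detects = first_expiry(worker, check_interval=2.0, horizon=60.0)
master_detects = first_expiry(master, check_interval=2.0, horizon=60.0)
print(worker_detects, master_detects)
```

With these assumed numbers the worker notices the loss at an earlier check than the master, so it has already marked its replicas inactive by the time the master starts reassigning them.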

hycdong · 2019-05-09T07:34:35Z

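The core occurred because the precondition above was violated: the master decided that a worker had lost contact while the worker itself had not detected any loss of contact. In other words, the worker was "falsely disconnected".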

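Specifically, once the master considers the worker lost, it reassigns a primary for every primary on that worker. During this time, the "falsely disconnected" worker's beacons make the master consider it alive again. The new primary then selects this "falsely disconnected" replica server as a new secondary and meta sends it a proposal, while the "falsely disconnected" server still believes it is the replica's primary. We consider that a replica must not change from primary to secondary; this is an illegal state transition, and that is what produced the core.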
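
The illegal transition can be sketched as a guard on status changes. This is a simplified illustration under assumptions, not the check in `replica_config.cpp`: `check_transition` and the `FORBIDDEN` set are hypothetical, and the status names other than `PS_PRIMARY` (which appears in the frame 3 locals) are illustrative.

```python
from enum import Enum


class PartitionStatus(Enum):
    PS_INACTIVE = 0
    PS_PRIMARY = 1
    PS_SECONDARY = 2
    PS_POTENTIAL_SECONDARY = 3  # a learner that is being added


# Assumed rule for this sketch: a primary may not be turned into a
# secondary or a learner directly.
FORBIDDEN = {
    (PartitionStatus.PS_PRIMARY, PartitionStatus.PS_SECONDARY),
    (PartitionStatus.PS_PRIMARY, PartitionStatus.PS_POTENTIAL_SECONDARY),
}


def check_transition(old: PartitionStatus, new: PartitionStatus) -> PartitionStatus:
    """Return the new status, or raise if the transition is forbidden."""
    if (old, new) in FORBIDDEN:
        raise RuntimeError(f"illegal transition: {old.name} -> {new.name}")
    return new


# A replica that became inactive after its lease expired can be added as a learner.
ok = check_transition(PartitionStatus.PS_INACTIVE, PartitionStatus.PS_POTENTIAL_SECONDARY)

# A "falsely disconnected" replica still holds PS_PRIMARY when the
# add-learner request arrives, so the guard fires.
try:
    check_transition(PartitionStatus.PS_PRIMARY, PartitionStatus.PS_POTENTIAL_SECONDARY)
    rejected = False
except RuntimeError:
    rejected = True
```

In the real server a failed check of this kind aborts the process, which matches the `dsn_coredump` frame in the stack above; the sketch raises an exception instead so that it stays runnable.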

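In production we also saw cases where the master considered a worker "falsely disconnected" but no core was produced.
The replica server and the meta server synchronize config information through on_config_sync. After the master considers a worker "falsely disconnected", it can still receive on_config_sync requests from the replica and reply to them with an ack. When the replica server receives the ack and finds that meta believes it holds no replicas, it proactively removes all of its own replicas, so no illegal state transition occurs when the add secondary proposal arrives. In addition, the "_add_secondary_max_count_for_one_node" setting on that cluster is small, which effectively delays the add secondary proposals.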
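
The config-sync path that avoids the core can be sketched as a set difference between what the replica server holds and what meta attributes to it. The function `replicas_to_remove` and the partition ids are hypothetical; the real on_config_sync handling is more involved.

```python
from typing import Dict, Set


def replicas_to_remove(local: Dict[str, str], meta_view: Set[str]) -> Set[str]:
    """Partitions held locally that meta no longer attributes to this node."""
    return {pid for pid in local if pid not in meta_view}


# Hypothetical local state of a "falsely disconnected" replica server.
local = {"1.0": "PS_PRIMARY", "1.3": "PS_SECONDARY"}

# The config sync ack says meta believes this node holds no replicas.
stale = replicas_to_remove(local, meta_view=set())
for pid in stale:
    del local[pid]
```

Once the stale replicas are gone, a later add secondary proposal finds no local primary, so the forbidden primary-to-secondary transition is never attempted.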

hycdong · 2019-05-09T07:51:58Z

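Log analysis shows that the worker did not detect any loss of contact. First, no log about losing contact or switching master was printed. Second, when a beacon's error code is not ERR_OK, the perf-counter _recent_beacon_fail_count is incremented by 1, yet on the affected clusters this perf-counter was always 0. Third, after the master considered the worker lost, it could still receive the worker's beacons and set the worker back to alive. The preliminary conclusion is therefore that the worker's behavior is normal.

I have added some logging to the 1.11.3 code and run it on a test cluster for 3 weeks without reproducing the problem, yet the bug shows up irregularly on production clusters. The affected clusters are not always bulk-loading data, and the times at which the bug occurs follow no pattern.

I now plan to set up a meta node on the meta server machines of the affected cluster and replica nodes on the test cluster, and try to reproduce the problem first.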

neverchanje · 2019-09-16T09:55:08Z

This bug is fixed in release 1.11.6 https://github.com/XiaoMi/pegasus/releases/tag/v1.11.6.
Closing it now.


hycdong self-assigned this Apr 4, 2019

hycdong added the type/bug This issue reports a bug. label Apr 4, 2019

hycdong changed the title ~~Add primary replica as a learner~~ rdsn core - add primary replica as a learner Apr 4, 2019

neverchanje pinned this issue May 9, 2019

qinzuoyan changed the title ~~rdsn core - add primary replica as a learner~~ rdsn coredump - add primary replica as a learner May 20, 2019

hycdong mentioned this issue May 23, 2019

fd: update failure detection log XiaoMi/rdsn#256

Merged

hycdong mentioned this issue Jul 18, 2019

fd: fix ingeter overflow XiaoMi/rdsn#272

Merged

neverchanje closed this as completed Sep 16, 2019

neverchanje unpinned this issue Sep 16, 2019

neverchanje mentioned this issue Nov 18, 2019

Release 1.11.6 #354

Closed

acelyc111 pushed a commit that referenced this issue Jun 23, 2022

feat(dup): verify private log validity before starting to duplicate (#…

b8c5adf

…315)
