rdsn coredump - add primary replica as a learner #315

Closed
hycdong opened this issue Apr 4, 2019 · 4 comments
Assignees
Labels
type/bug This issue reports a bug.

Comments

@hycdong
Contributor

hycdong commented Apr 4, 2019

Server Version

  • Pegasus Server 1.11.3 (b45cb06) Release
  • c6 ai cluster (Date: 4.4 09:38)

Coredump Stack

(gdb) bt
#0  0x00007f838254f1d7 in raise () from /lib64/libc.so.6
#1  0x00007f83825508c8 in abort () from /lib64/libc.so.6
#2  0x00007f838601b3ee in dsn_coredump () at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/service_api_c.cpp:73
#3  0x00007f8385f1c2b7 in dsn::replication::replica::update_local_configuration (this=this@entry=0x394d600, config=..., same_ballot=same_ballot@entry=true)
    at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_config.cpp:807
#4  0x00007f8385f811cb in dsn::replication::replica::on_add_learner (this=0x394d600, request=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_learn.cpp:1377
#5  0x00007f8385ee2ce2 in dsn::replication::replica_stub::on_add_learner (this=0x31b6580, request=...) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_stub.cpp:1004
#6  0x00007f8385efd9e0 in operator() (request=<optimized out>, __closure=0x3a399660) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/include/dsn/cpp/serverlet.h:169
#7  std::_Function_handler<void (dsn::message_ex*), bool dsn::serverlet<dsn::replication::replica_stub>::register_rpc_handler<dsn::replication::group_check_request>(dsn::task_code, char const*, void (dsn::replication::replica_stub::*)(dsn::replication::group_check_request const&))::{lambda(dsn::message_ex*)#1}>::_M_invoke(std::_Any_data const&, dsn::message_ex*) (__functor=..., __args#0=<optimized out>)
    at /home/work/qinzuoyan/Pegasus/toolchain/output/include/c++/4.8.2/functional:2071
#8  0x00007f838606cce9 in dsn::task::exec_internal (this=this@entry=0x57be163e4) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task.cpp:180
#9  0x00007f83860ed42d in dsn::task_worker::loop (this=0x2f65ce0) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task_worker.cpp:211
#10 0x00007f83860ed5f9 in dsn::task_worker::run_internal (this=0x2f65ce0) at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/core/core/task_worker.cpp:191
#11 0x00007f8382ea7600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at /home/qinzuoyan/git.xiaomi/pegasus/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#12 0x00007f8383b14dc5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f838261173d in clone () from /lib64/libc.so.6

Frame 3 Locals

(gdb) f 3
#3  0x00007f8385f1c2b7 in dsn::replication::replica::update_local_configuration (this=this@entry=0x394d600, config=..., same_ballot=same_ballot@entry=true)
    at /home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_config.cpp:807
807	/home/work/qinzuoyan/Pegasus/pegasus/rdsn/src/dist/replication/lib/replica_config.cpp: No such file or directory.
(gdb) i locals
__FUNCTION__ = "update_local_configuration"
old_status = dsn::replication::partition_status::PS_PRIMARY
old_ballot = 3
r = false
oldTs = 1553249692836
hycdong self-assigned this Apr 4, 2019
hycdong added the type/bug label Apr 4, 2019
hycdong changed the title from "Add primary replica as a learner" to "rdsn core - add primary replica as a learner" Apr 4, 2019
@hycdong
Contributor Author

hycdong commented May 9, 2019

The existing failure detection mechanism is shown in the diagram below:

                        |--- lease period ----| lease IsExpired, commit suicide
             |--- lease period ---|
worker: ---------------------------------------------------------------->
             \    /     \    /      _\
         beacon ack  beacon ack       x (beacon deliver failed)
              _\/        _\/
master: ---------------------------------------------------------------->
                |---- grace period ----|
                           |--- grace period ----| grace IsExpired, declare worker dead

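The replica server (worker) sends a heartbeat (beacon) to the meta server (master) periodically, every 3 seconds.
Both the worker and the master check periodically, every 2 seconds, whether the lease has expired.

  • The worker decides whether it has lost contact with the master based on the time of its most recent successfully sent beacon.
    • If the worker concludes that it has lost contact with the master, it changes the status of all of its replicas to inactive.
  • The master decides whether a worker is lost based on the time it last received a beacon from that worker.
    • If the master concludes that a worker is lost, it handles every replica on that node:
      • For a secondary, it sends a proposal to that replica's primary to remove the secondary.
      • For a primary, it picks one replica among the secondaries and sends it a proposal to assign it as primary.

The correctness of the existing failure detection mechanism hinges on one point: the grace period (grace lease) used by the master is slightly longer than the worker's lease, which ensures that the worker notices it has lost contact with the master before the master declares the worker lost.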
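The timing condition above can be sketched as two independent expiry checks. This is a simplified model, not the rdsn implementation: the function names are hypothetical, and the lease and grace values (9 and 10 seconds) are assumptions chosen only so that grace is slightly longer than lease, since the issue does not state the configured values.

```python
# Assumed values for illustration only; the issue does not state them.
LEASE_S = 9    # worker-side lease period, in seconds
GRACE_S = 10   # master-side grace period, slightly longer than the lease


def worker_considers_master_lost(now, last_acked_beacon_send_time, lease=LEASE_S):
    """Worker side: the lease is measured from the send time of the last acked beacon."""
    return now - last_acked_beacon_send_time > lease


def master_considers_worker_lost(now, last_beacon_recv_time, grace=GRACE_S):
    """Master side: the grace period is measured from the last received beacon."""
    return now - last_beacon_recv_time > grace


def lease_expires_first(send_time, recv_time, lease=LEASE_S, grace=GRACE_S):
    """True if the worker-side deadline precedes the master-side deadline."""
    return send_time + lease < recv_time + grace


if __name__ == "__main__":
    send_time, recv_time = 0.0, 0.1  # the beacon reaches the master 100 ms after it is sent
    for now in (9.5, 10.5):
        print(now,
              worker_considers_master_lost(now, send_time),
              master_considers_worker_lost(now, recv_time))
```

The sketch compares deadlines only. Because both sides evaluate these conditions on a periodic check, the moment at which each side actually acts can lag its deadline by up to one check interval.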

@hycdong
Contributor Author

hycdong commented May 9, 2019

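The coredump happened because the precondition above was violated: the master decided that a worker was lost while the worker itself never noticed, i.e. the worker was "falsely lost".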

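Specifically, after deciding that the worker is lost, the master reassigns a primary for every primary on that worker. In the meantime, the "falsely lost" worker's beacons make the master consider it alive again, so the new primary picks this "falsely lost" replica server as a new secondary and the meta server sends it a proposal, while the "falsely lost" server still believes it is the primary of that replica. We treat a change from primary to secondary as an illegal state transition for a replica, which is what triggered this coredump.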
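A minimal sketch of the guard that this scenario trips, under stated assumptions: only `PS_PRIMARY` appears in the gdb output above, so the other status names and the `apply_status_change` helper are hypothetical, and the real check in `update_local_configuration` covers more cases than this model. The model treats "learner" as a potential secondary.

```python
from enum import Enum


class Status(Enum):
    # PS_PRIMARY is taken from the gdb locals; the other names are assumptions.
    INACTIVE = "PS_INACTIVE"
    PRIMARY = "PS_PRIMARY"
    SECONDARY = "PS_SECONDARY"
    POTENTIAL_SECONDARY = "PS_POTENTIAL_SECONDARY"  # a learner


class IllegalTransition(RuntimeError):
    """Stands in for the abort seen in the backtrace."""


# The only rule modeled here: a primary may not become a learner or a
# secondary directly; it has to go through INACTIVE first.
ILLEGAL = {
    (Status.PRIMARY, Status.SECONDARY),
    (Status.PRIMARY, Status.POTENTIAL_SECONDARY),
}


def apply_status_change(old, new):
    """Return the new status, or raise if the transition is modeled as illegal."""
    if (old, new) in ILLEGAL:
        raise IllegalTransition(f"{old.value} -> {new.value}")
    return new


if __name__ == "__main__":
    # A "falsely lost" server still holds PRIMARY when the add-learner request arrives.
    try:
        apply_status_change(Status.PRIMARY, Status.POTENTIAL_SECONDARY)
    except IllegalTransition as err:
        print("rejected:", err)
```

In this model a worker that had noticed the lost lease would already be `INACTIVE`, so the same add-learner request would be accepted.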

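In production we have also seen cases where the master considered a worker "falsely lost" but no coredump occurred.
The replica server and the meta server synchronize config information through on_config_sync. After the master considers the worker "falsely lost", it still receives on_config_sync requests from the replica server and replies with an ack. When the replica server receives the ack and finds that meta believes it holds no replicas, it proactively removes all of its own replicas, so no illegal state transition occurs when the add-secondary proposal arrives. In addition, that cluster's "_add_secondary_max_count_for_one_node" setting was small, which effectively delayed the add-secondary proposals.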
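The cleanup path described above can be modeled roughly as follows. This is a sketch under assumptions, not the rdsn code: `on_config_sync_ack`, the dictionary layout, and the partition ids are hypothetical, and the real config-sync handling is more involved.

```python
def replicas_to_remove(local_replicas, meta_view):
    """Return the local partition ids that meta does not attribute to this server."""
    return set(local_replicas) - set(meta_view)


def on_config_sync_ack(local_replicas, meta_view):
    """Drop every local replica that is missing from meta's view; return what was dropped.

    local_replicas: dict mapping a partition id to its local status (hypothetical layout).
    meta_view: set of partition ids that meta believes this server holds.
    """
    stale = replicas_to_remove(local_replicas, meta_view)
    for pid in stale:
        del local_replicas[pid]
    return stale


if __name__ == "__main__":
    # Meta has declared this server lost, so its view of the server is empty.
    local = {"1.0": "PS_PRIMARY", "1.1": "PS_SECONDARY"}
    removed = on_config_sync_ack(local, meta_view=set())
    print(sorted(removed), local)
```

Once the stale primary has been removed this way, a later add-secondary proposal finds no local primary for that partition, which matches the no-coredump cases described above.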

@hycdong
Contributor Author

hycdong commented May 9, 2019

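Log analysis shows that the worker never detected that it was lost. First, no log about losing contact or switching master was printed. Second, the perf-counter _recent_beacon_fail_count is incremented whenever a beacon's error code is not ERR_OK, yet on the affected clusters this counter stayed at 0. Third, after the master considered the worker lost, it still received the worker's beacons and marked the worker alive again. My preliminary conclusion is therefore that the worker behaved correctly.

I have added some logging to the 1.11.3 code and run it on a test cluster for 3 weeks without reproducing the problem, yet the bug shows up irregularly on production clusters. The affected clusters are not always under write load, and the times at which the bug occurs show no pattern either.

I now plan to set up a meta node on the meta server machines of the cluster where the bug occurred and replica nodes on the test cluster, to try to reproduce the problem first.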

neverchanje pinned this issue May 9, 2019
qinzuoyan changed the title from "rdsn core - add primary replica as a learner" to "rdsn coredump - add primary replica as a learner" May 20, 2019
@neverchanje
Contributor

This bug is fixed in release 1.11.6 https://github.com/XiaoMi/pegasus/releases/tag/v1.11.6.
Close it now.
