Better log failure handling instead of assert false #287

Closed
hycdong opened this issue Feb 22, 2019 · 1 comment
@hycdong
Contributor

hycdong commented Feb 22, 2019

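Trigger environment

  • Machines: 2 meta servers, 5 replica servers
  • server version: hycdong's split branch https://github.com/hycdong/pegasus/tree/split
  • QPS: 20k get, 30k set, with a random kill test running at the same time
  • Table: 1 TB in size, partition_count of 128; the usertable table of the c3tst-sample-krb cluster, restored from fds
  • Other conditions: after one split; can be triggered either during the split or after the split completes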

coredump

(gdb) bt
#0 0x0000003f852328a5 in raise () from /lib64/libc.so.6
#1 0x0000003f85234085 in abort () from /lib64/libc.so.6
#2 0x00007f93ada6125e in dsn_coredump () at /home/heyuchen/split/pegasus/rdsn/src/core/core/service_api_c.cpp:73
#3 0x00007f93ad93caee in dsn::replication::replica_stub::handle_log_failure (this=<optimized out>, err=...) at /home/heyuchen/split/pegasus/rdsn/src/dist/replication/lib/replica_stub.cpp:1962
#4 0x00007f93ad98eef5 in dsn::replication::replica::on_append_log_completed (this=0x7f920d1eac60, mu=..., err=..., size=<optimized out>) at /home/heyuchen/split/pegasus/rdsn/src/dist/replication/lib/replica_2pc.cpp:526
#5 0x00007f93ada5f5b8 in operator() (__args#1=<optimized out>, __args#0=..., this=<optimized out>) at /home/heyuchen/toolchain/output/include/c++/4.8.2/functional:2464
#6 dsn::aio_task::exec (this=<optimized out>) at /home/heyuchen/split/pegasus/rdsn/include/dsn/tool-api/task.h:597
#7 0x00007f93ada5d1f9 in dsn::task::exec_internal (this=this@entry=0x7f8cb6f11a88) at /home/heyuchen/split/pegasus/rdsn/src/core/core/task.cpp:180
#8 0x00007f93adab1d9d in dsn::task_worker::loop (this=0x2305f00) at /home/heyuchen/split/pegasus/rdsn/src/core/core/task_worker.cpp:211
#9 0x00007f93adab1f69 in dsn::task_worker::run_internal (this=0x2305f00) at /home/heyuchen/split/pegasus/rdsn/src/core/core/task_worker.cpp:191
#10 0x00007f93ab431600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at /home/heyuchen/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#11 0x0000003f85607851 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003f852e811d in clone () from /lib64/libc.so.6
(gdb)

Related logs

E2019-02-20 18:30:54.767 (1550658654767145919 3211) replica.replica7.04050007005c7e26: native_aio_provider.linux.cpp:218:aio_internal(): io_submit error, ret = -11
E2019-02-20 18:30:54.767 (1550658654767175695 31ec) replica.default2.040100010017b3a7: mutation_log.cpp:193:operator()(): write shared log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767210218 31ef) replica.default5.04050001008ef578: mutation_log.cpp:457:operator()(): write private log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767285730 31ee) replica.default4.04050014007090e7: mutation_log.cpp:457:operator()(): write private log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767310415 31eb) replica.default1.040500170074149b: mutation_log.cpp:457:operator()(): write private log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767357562 31f0) replica.default6.040500150068bf23: mutation_log.cpp:457:operator()(): write private log failed, err = ERR_FILE_OPERATION_FAILED
E2019-02-20 18:30:54.767 (1550658654767400994 321e) replica.replica20.0405001400709062: native_aio_provider.linux.cpp:218:aio_internal(): io_submit error, ret = -11
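  • The logs locate the problem: in native_aio_provider.linux.cpp, the aio_internal function calls the io_submit system call and gets back -11 (EAGAIN), which is caused by insufficient system resources.
  • For details on io_submit, see the Linux man page: http://man7.org/linux/man-pages/man2/io_submit.2.html
  • Our current code cannot handle this kind of I/O error and returns an ERR_FILE_OPERATION_FAILED error code. Because the failure happened while writing the shared log, the error is passed up to replica_stub, which ends in a direct assert false and produces the core dump.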
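
Since EAGAIN from io_submit signals a temporary resource shortage, one possible direction is to retry the submission a bounded number of times before reporting failure. The following is only a minimal sketch of that idea, not existing rDSN code: `submit_with_retry` is a hypothetical helper, and `submit` stands in for whatever wraps the real io_submit call and returns a negative errno value on error (as the `ret = -11` in the log suggests).

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper, not part of rDSN: retries a submit call while it
// reports -EAGAIN, sleeping with exponential backoff between attempts.
// `submit` is assumed to return the number of submitted requests on
// success, or a negative errno value on failure.
int submit_with_retry(const std::function<int()> &submit,
                      int max_attempts,
                      std::chrono::milliseconds initial_backoff)
{
    assert(max_attempts > 0);
    auto backoff = initial_backoff;
    int ret = -EAGAIN;
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        ret = submit();
        if (ret != -EAGAIN) {
            return ret; // success, or an error that retrying will not fix
        }
        if (attempt + 1 < max_attempts) {
            std::this_thread::sleep_for(backoff);
            backoff *= 2; // double the wait before the next attempt
        }
    }
    return ret; // still -EAGAIN after all attempts; the caller decides
}
```

Sleeping inside an aio worker thread may not be acceptable in practice; rescheduling the write as a delayed task would be an alternative, but that depends on the task framework and is not shown here.
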

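Thoughts on a solution

  • Investigate how other databases or storage systems handle system I/O errors, and improve our error-handling mechanism
  • Add monitoring metrics for I/O speed and I/O busy, so that problems can be found early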
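
To make the two points above concrete, here is a rough sketch of what a finer-grained log failure policy with basic counters could look like, instead of a single assert false. All names (`log_failure_action`, `io_failure_counters`, `classify_log_failure`) are hypothetical and do not exist in rDSN, and the errno-to-action mapping is an illustrative assumption, not a recommendation.

```cpp
#include <atomic>
#include <cerrno>
#include <cstdint>

// Hypothetical names for illustration only; none of these exist in rDSN.
enum class log_failure_action {
    retry,         // transient resource shortage, try again later
    close_replica, // fail only the affected replica
    abort_process  // unknown or unrecoverable error, keep today's behavior
};

// Counters that a monitoring system could export to spot I/O trouble early.
struct io_failure_counters {
    std::atomic<uint64_t> transient{0};
    std::atomic<uint64_t> permanent{0};
};

// Maps a negative errno value to an action and updates the counters.
// The mapping below is an assumption made for the sake of the example.
log_failure_action classify_log_failure(int neg_errno, io_failure_counters &counters)
{
    switch (-neg_errno) {
    case EAGAIN:
    case EINTR:
        counters.transient.fetch_add(1);
        return log_failure_action::retry;
    case ENOSPC:
    case EIO:
        counters.permanent.fetch_add(1);
        return log_failure_action::close_replica;
    default:
        counters.permanent.fetch_add(1);
        return log_failure_action::abort_process;
    }
}
```

With something like this, a failure handler could branch on the returned action, and a rising `transient` count would point at I/O pressure before it turns into a crash.
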
@acelyc111
Member

XiaoMi/rdsn#818
