
[Core][CoreWorker] ensure task_execution_service_ is destructed first #18913

Merged 1 commit into ray-project:master on Sep 28, 2021

Conversation

@scv119 (Contributor) commented Sep 27, 2021

Looking at the two recent CoreWorker crashes (stack traces below), the common pattern is that task_execution_service_ is still running while either the CoreWorkerDirectTaskReceiver or the gRPC server is being destructed.

This can't happen in the CoreWorker::Shutdown path, since there we always stop task_execution_service_ first. So it can only happen when CoreWorker is destructed without Shutdown being called.

To cover that case, this PR fixes the destruction order so that task_execution_service_ is always destructed first.

logging.cc:315: *** SIGSEGV received at time=1632393798 on cpu 5 ***
logging.cc:315: PC: @     0x7f000bb6f6ba  (unknown)  pollset_kick()
logging.cc:315:     @     0x7f000d4fb980  (unknown)  (unknown)
logging.cc:315:     @     0x7f000bb3201b         96  cq_end_op_for_next()
logging.cc:315:     @     0x7f000bb39aff        144  post_batch_completion()
logging.cc:315:     @     0x7f000bb277b5         48  grpc_core::Server::CallData::RecvTrailingMetadataReady()
logging.cc:315:     @     0x7f000bae63af         48  recv_trailing_metadata_ready()
logging.cc:315:     @     0x7f000bae1d58         48  hs_recv_trailing_metadata_ready()
logging.cc:315:     @     0x7f000bae2ee5         32  grpc_core::(anonymous namespace)::CallData::OnRecvTrailingMetadataReady()
logging.cc:315:     @     0x7f000bb6bf04         48  grpc_core::ExecCtx::Flush()
logging.cc:315:     @     0x7f000bb3b619        192  grpc_call_start_batch
logging.cc:315:     @     0x7f000b602796        656  grpc::internal::CallOpSet<>::ContinueFillOpsAfterInterception()
logging.cc:315:     @     0x7f000b651ca5        160  ray::rpc::ServerCallImpl<>::SendReply()
logging.cc:315:     @     0x7f000b651f51        144  std::_Function_handler<>::_M_invoke()
logging.cc:315:     @     0x7f000b6b4364        368  ray::core::CoreWorkerDirectTaskReceiver::HandleTask()::{lambda()#1}::operator()()
logging.cc:315:     @     0x7f000b6b48ba         80  std::_Function_handler<>::_M_invoke()
logging.cc:315:     @     0x7f000b6a18c2         80  std::_Function_handler<>::_M_invoke()
logging.cc:315:     @     0x7f000b6aa772        208  boost::asio::detail::executor_op<>::do_complete()
logging.cc:315:     @     0x7f000bc90928        112  boost::asio::detail::scheduler::do_run_one()
logging.cc:315:     @     0x7f000bc914e1        160  boost::asio::detail::scheduler::run()
logging.cc:315:     @     0x7f000bc915eb         32  boost::asio::detail::posix_thread::func<>::run()
logging.cc:315:     @     0x7f000bc8a7c1         32  boost_asio_detail_posix_thread_function
logging.cc:315:     @     0x7f000d4f06db  (unknown)  start_thread
(pid=14053, ip=172.31.63.35) *** SIGSEGV received at time=1629779882 on cpu 40 ***
(pid=14053, ip=172.31.63.35) PC: @     0x7f8a60f35d10  (unknown)  ray::core::CoreWorkerDirectTaskReceiver::HandleTask()::{lambda()#1}::operator()()
(pid=14053, ip=172.31.63.35)     @     0x7f8a62b8f980  758251280  (unknown)
(pid=14053, ip=172.31.63.35)     @     0x7f8a60f3648a         80  std::_Function_handler<>::_M_invoke()
(pid=14053, ip=172.31.63.35)     @     0x7f8a60eb7b35        448  ray::core::NormalSchedulingQueue::ScheduleRequests()
(pid=14053, ip=172.31.63.35)     @     0x7f8a612755e6        112  boost::asio::detail::completion_handler<>::do_complete()
(pid=14053, ip=172.31.63.35)     @     0x7f8a61377e58        112  boost::asio::detail::scheduler::do_run_one()
(pid=14053, ip=172.31.63.35)     @     0x7f8a61378a11        160  boost::asio::detail::scheduler::run()
(pid=14053, ip=172.31.63.35)     @     0x7f8a6137a560         64  boost::asio::io_context::run()
(pid=14053, ip=172.31.63.35)     @     0x7f8a60f22405        144  ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(pid=14053, ip=172.31.63.35)     @     0x7f8a60dad3d7         32  __pyx_pw_3ray_7_raylet_10CoreWorker_9run_task_loop()
(pid=14053, ip=172.31.63.35)     @     0x5576d487ab71  (unknown)  _PyMethodDef_RawFastCallKeywords
(pid=14053, ip=172.31.63.35)     @     0x7f8a60dad3c0  (unknown)  (unknown)
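
For context, here is a minimal sketch (hypothetical types, not the actual CoreWorker members) of the C++ rule this kind of fix relies on: non-static data members are destructed in the reverse of their declaration order, so the member declared last is destructed first.

// Minimal illustration only; the real CoreWorker members differ.
#include <iostream>

struct GrpcServer {                    // stands in for the gRPC server / task receiver
  ~GrpcServer() { std::cout << "GrpcServer destructed\n"; }
};

struct TaskExecutionService {          // stands in for task_execution_service_
  ~TaskExecutionService() { std::cout << "TaskExecutionService destructed\n"; }
};

struct CoreWorkerLike {
  GrpcServer grpc_server_;                        // declared first -> destructed last
  TaskExecutionService task_execution_service_;   // declared last  -> destructed first
};

int main() {
  CoreWorkerLike worker;
  // On scope exit this prints "TaskExecutionService destructed",
  // then "GrpcServer destructed".
}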

@scv119 linked an issue Sep 27, 2021 that may be closed by this pull request
@ericl (Contributor) left a comment

Is there a way we could have programmatically had the correct shutdown order without relying on class member ordering? Would shared_ptr have helped?
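For illustration, a hedged sketch of the shared_ptr alternative the reviewer alludes to (hypothetical names, not Ray's actual API): if handlers posted to the execution service captured a shared_ptr to the receiver, the receiver could not be destructed while a posted task still referenced it, regardless of member declaration order.

// Sketch only, assuming Boost.Asio as in the stack traces above.
#include <boost/asio.hpp>
#include <iostream>
#include <memory>

struct Receiver {
  void HandleTask() { std::cout << "handling task\n"; }
};

int main() {
  boost::asio::io_context task_execution_service;
  auto receiver = std::make_shared<Receiver>();

  // The copied shared_ptr keeps *receiver alive until the handler runs,
  // even if every external reference is dropped first.
  boost::asio::post(task_execution_service,
                    [receiver] { receiver->HandleTask(); });

  receiver.reset();                // drop our reference; the handler still owns one
  task_execution_service.run();    // safely prints "handling task"
}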

@ericl added the @author-action-required label Sep 27, 2021
@scv119 added the do-not-merge label Sep 27, 2021
@scv119 removed the @author-action-required and do-not-merge labels Sep 27, 2021
@ericl merged commit 25d14cb into ray-project:master Sep 28, 2021
This pull request may close the following issue:
[Core] CoreWorker crash due to destruction order