
[Core][CoreWorker] ensure task_execution_service_ is destructed first #18913

Merged 1 commit into ray-project:master on Sep 28, 2021

Conversation

@scv119 (Contributor) commented Sep 27, 2021

Looking at the two recent CoreWorker crashes (stack traces below), the common pattern is that task_execution_service_ is still running while either the CoreWorkerDirectTaskReceiver or the gRPC server is being destructed.

This can't happen in the CoreWorker::Shutdown path, since there we always stop task_execution_service_ first. So it can only happen when CoreWorker is destructed without Shutdown being called.

To cover that case, this PR fixes the destruction order so that task_execution_service_ is always destructed first.

logging.cc:315: *** SIGSEGV received at time=1632393798 on cpu 5 ***
logging.cc:315: PC: @     0x7f000bb6f6ba  (unknown)  pollset_kick()
logging.cc:315:     @     0x7f000d4fb980  (unknown)  (unknown)
logging.cc:315:     @     0x7f000bb3201b         96  cq_end_op_for_next()
logging.cc:315:     @     0x7f000bb39aff        144  post_batch_completion()
logging.cc:315:     @     0x7f000bb277b5         48  grpc_core::Server::CallData::RecvTrailingMetadataReady()
logging.cc:315:     @     0x7f000bae63af         48  recv_trailing_metadata_ready()
logging.cc:315:     @     0x7f000bae1d58         48  hs_recv_trailing_metadata_ready()
logging.cc:315:     @     0x7f000bae2ee5         32  grpc_core::(anonymous namespace)::CallData::OnRecvTrailingMetadataReady()
logging.cc:315:     @     0x7f000bb6bf04         48  grpc_core::ExecCtx::Flush()
logging.cc:315:     @     0x7f000bb3b619        192  grpc_call_start_batch
logging.cc:315:     @     0x7f000b602796        656  grpc::internal::CallOpSet<>::ContinueFillOpsAfterInterception()
logging.cc:315:     @     0x7f000b651ca5        160  ray::rpc::ServerCallImpl<>::SendReply()
logging.cc:315:     @     0x7f000b651f51        144  std::_Function_handler<>::_M_invoke()
logging.cc:315:     @     0x7f000b6b4364        368  ray::core::CoreWorkerDirectTaskReceiver::HandleTask()::{lambda()#1}::operator()()
logging.cc:315:     @     0x7f000b6b48ba         80  std::_Function_handler<>::_M_invoke()
logging.cc:315:     @     0x7f000b6a18c2         80  std::_Function_handler<>::_M_invoke()
logging.cc:315:     @     0x7f000b6aa772        208  boost::asio::detail::executor_op<>::do_complete()
logging.cc:315:     @     0x7f000bc90928        112  boost::asio::detail::scheduler::do_run_one()
logging.cc:315:     @     0x7f000bc914e1        160  boost::asio::detail::scheduler::run()
logging.cc:315:     @     0x7f000bc915eb         32  boost::asio::detail::posix_thread::func<>::run()
logging.cc:315:     @     0x7f000bc8a7c1         32  boost_asio_detail_posix_thread_function
logging.cc:315:     @     0x7f000d4f06db  (unknown)  start_thread
(pid=14053, ip=172.31.63.35) *** SIGSEGV received at time=1629779882 on cpu 40 ***
(pid=14053, ip=172.31.63.35) PC: @     0x7f8a60f35d10  (unknown)  ray::core::CoreWorkerDirectTaskReceiver::HandleTask()::{lambda()#1}::operator()()
(pid=14053, ip=172.31.63.35)     @     0x7f8a62b8f980  758251280  (unknown)
(pid=14053, ip=172.31.63.35)     @     0x7f8a60f3648a         80  std::_Function_handler<>::_M_invoke()
(pid=14053, ip=172.31.63.35)     @     0x7f8a60eb7b35        448  ray::core::NormalSchedulingQueue::ScheduleRequests()
(pid=14053, ip=172.31.63.35)     @     0x7f8a612755e6        112  boost::asio::detail::completion_handler<>::do_complete()
(pid=14053, ip=172.31.63.35)     @     0x7f8a61377e58        112  boost::asio::detail::scheduler::do_run_one()
(pid=14053, ip=172.31.63.35)     @     0x7f8a61378a11        160  boost::asio::detail::scheduler::run()
(pid=14053, ip=172.31.63.35)     @     0x7f8a6137a560         64  boost::asio::io_context::run()
(pid=14053, ip=172.31.63.35)     @     0x7f8a60f22405        144  ray::core::CoreWorkerProcess::RunTaskExecutionLoop()
(pid=14053, ip=172.31.63.35)     @     0x7f8a60dad3d7         32  __pyx_pw_3ray_7_raylet_10CoreWorker_9run_task_loop()
(pid=14053, ip=172.31.63.35)     @     0x5576d487ab71  (unknown)  _PyMethodDef_RawFastCallKeywords
(pid=14053, ip=172.31.63.35)     @     0x7f8a60dad3c0  (unknown)  (unknown)
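
For context, here is a minimal sketch (hypothetical types, not the actual CoreWorker members) of the C++ rule this kind of fix relies on: non-static data members are destructed in the reverse of their declaration order, so the member declared last is destructed first.

// Minimal illustration only; the real CoreWorker members differ.
#include <iostream>

struct GrpcServer {                    // stands in for the gRPC server / task receiver
  ~GrpcServer() { std::cout << "GrpcServer destructed\n"; }
};

struct TaskExecutionService {          // stands in for task_execution_service_
  ~TaskExecutionService() { std::cout << "TaskExecutionService destructed\n"; }
};

struct CoreWorkerLike {
  GrpcServer grpc_server_;                        // declared first -> destructed last
  TaskExecutionService task_execution_service_;   // declared last  -> destructed first
};

int main() {
  CoreWorkerLike worker;
  // On scope exit this prints "TaskExecutionService destructed",
  // then "GrpcServer destructed".
}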

@scv119 linked an issue Sep 27, 2021 that may be closed by this pull request
@ericl (Contributor) left a comment

Is there a way we could have programmatically had the correct shutdown order without relying on class member ordering? Would shared_ptr have helped?
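For illustration, a hedged sketch of the shared_ptr alternative the reviewer alludes to (hypothetical names, not Ray's actual API): if handlers posted to the execution service captured a shared_ptr to the receiver, the receiver could not be destructed while a posted task still referenced it, regardless of member declaration order.

// Sketch only, assuming Boost.Asio as in the stack traces above.
#include <boost/asio.hpp>
#include <iostream>
#include <memory>

struct Receiver {
  void HandleTask() { std::cout << "handling task\n"; }
};

int main() {
  boost::asio::io_context task_execution_service;
  auto receiver = std::make_shared<Receiver>();

  // The copied shared_ptr keeps *receiver alive until the handler runs,
  // even if every external reference is dropped first.
  boost::asio::post(task_execution_service,
                    [receiver] { receiver->HandleTask(); });

  receiver.reset();                // drop our reference; the handler still owns one
  task_execution_service.run();    // safely prints "handling task"
}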

@ericl added the @author-action-required label Sep 27, 2021
@scv119 added the do-not-merge label Sep 27, 2021
@scv119 removed the @author-action-required and do-not-merge labels Sep 27, 2021
@ericl merged commit 25d14cb into ray-project:master Sep 28, 2021
This pull request may close the following issue:
[Core] CoreWorker crash due to destruction order