-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix](scanner) Fix deadlock when scanner submit failed #40495
Conversation
Thank you for your contribution to Apache Doris. Since 2024-03-18, the Document has been moved to doris-website. |
run buildall |
TeamCity be ut coverage result: |
TPC-H: Total hot run time: 37953 ms
|
TPC-DS: Total hot run time: 192227 ms
|
ClickBench: Total hot run time: 31.35 s
|
be/src/olap/tablet.h
Outdated
@@ -83,6 +83,8 @@ enum SortType : int; | |||
|
|||
enum TabletStorageType { STORAGE_TYPE_LOCAL, STORAGE_TYPE_REMOTE, STORAGE_TYPE_REMOTE_AND_LOCAL }; | |||
|
|||
std::string to_string(const TabletStorageType&); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this method could be a static method
@@ -139,10 +139,12 @@ class ScannerContext : public std::enable_shared_from_this<ScannerContext>, | |||
// set the next scanned block to `ScanTask::current_block` | |||
// set the error state to `ScanTask::status` | |||
// set the `eos` to `ScanTask::eos` if there is no more data in current scanner | |||
void submit_scan_task(std::shared_ptr<ScanTask> scan_task); | |||
[[nodiscard]] Status submit_scan_task(std::shared_ptr<ScanTask> scan_task); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not need discard, because status class is always nodiscard
auto scan_task_ptr = std::make_shared<ScanTask>(next_scanner); | ||
Status submit_status = submit_scan_task(scan_task_ptr); | ||
if (!submit_status.ok()) { | ||
append_block_to_queue_locked(scan_task_ptr, l); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just return error if submit failed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also should update the scanner context's status
if (!_process_status.ok()) {
_set_scanner_done();
return _process_status;
}
submit_scan_task(scan_task); | ||
Status submit_status = submit_scan_task(scan_task); | ||
if (!submit_status.ok()) { | ||
append_block_to_queue_locked(scan_task, l); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not need
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set process status to abnormal and return the error status
run buildall |
TPC-H: Total hot run time: 38608 ms
|
TPC-DS: Total hot run time: 192673 ms
|
TeamCity be ut coverage result: |
ClickBench: Total hot run time: 31.77 s
|
PR approved by at least one committer and no changes requested. |
PR approved by anyone and no changes requested. |
3f0b67f
to
7fcfcf3
Compare
run buildall |
TeamCity be ut coverage result: |
TPC-H: Total hot run time: 37835 ms
|
TPC-DS: Total hot run time: 193218 ms
|
ClickBench: Total hot run time: 31.89 s
|
run cloud p0 |
We have dead lock when submit scanner to scheduler failed. pstack looks like ```txt Thread 2012 (Thread 0x7f87363fb700 (LWP 4179707) "Pipe_normal [wo"): .h:749 ``` Deallock happens with following ```cpp Status ScannerContext::get_block_from_queue { std::unique_lock l(_transfer_lock); ... if (scan_task->is_eos()) { ... } else { // resubmit current running scanner to read the next block submit_scan_task(scan_task); } } ScannerContext::submit_scan_task(std::shared_ptr<ScanTask> scan_task) { _scanner_scheduler->submit(shared_from_this(), scan_task); } void ScannerScheduler::submit(std::shared_ptr<ScannerContext> ctx, std::shared_ptr<ScanTask> scan_task) { ... if (auto ret = sumbit_task(); !ret) { scan_task->set_status(Status::InternalError( "Failed to submit scanner to scanner pool reason:" + std::string(ret.msg()) + "|type:" + std::to_string(type))); ctx->append_block_to_queue(scan_task); return; } } void ScannerContext::append_block_to_queue(std::shared_ptr<ScanTask> scan_task) { ... std::lock_guard<std::mutex> l(_transfer_lock); ... } ``` Since mutex in cpp is not re-enterable, so the scanner thread will deadlock with itself. This pr fix the problem by making `ScannerScheduler::submit` return a Status instead of doing append failed task to the ScannerContext. The caller itself will decide where resubmit the scanner or just abort the execution of the query.
We have dead lock when submit scanner to scheduler failed. pstack looks like ```txt Thread 2012 (Thread 0x7f87363fb700 (LWP 4179707) "Pipe_normal [wo"): #0 0x00007f8b8f3dc82d in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f8b8f3d5ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0 #2 0x000055b20f333e7a in __gthread_mutex_lock (__mutex=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default .h:749 #3 std::mutex::lock (this=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100 #4 std::lock_guard<std::mutex>::lock_guard (__m=..., this=<optimized out>) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:229 #5 doris::vectorized::ScannerContext::append_block_to_queue (this=<optimized out>, scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:234 #6 0x000055b20f32c0f9 in doris::vectorized::ScannerScheduler::submit (this=<optimized out>, ctx=..., scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_scheduler.cpp:209 #7 0x000055b20f3338fc in doris::vectorized::ScannerContext::submit_scan_task (this=this@entry=0x7f8733d96010, scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:217 #8 0x000055b20f3346cd in doris::vectorized::ScannerContext::get_block_from_queue (this=0x7f8733d96010, state=<optimized out>, block=0x7f871f728de0, eos=0x7f871abce470, id=<optimized out>) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:290 #9 0x000055b214cb4f13 in doris::pipeline::ScanOperatorX<doris::pipeline::OlapScanLocalState>::get_block (this=<optimized out>, state=0x7f872f0eb400, block=0x7f8b8f3dc82d <__lll_lock_wait+29>, eos=0x7f871abce470) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/scan_operator.cpp:1292 #10 0x000055b2142b5772 in doris::pipeline::ScanOperatorX<doris::pipeline::OlapScanLocalState>::get_block_after_projects (this=0x80, state=0x0, block=0x7f8b8f3dc82d <__lll_lock_wait+29>, eos=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/scan_operator.h:363 #11 0x000055b2142e7880 in doris::pipeline::StatefulOperatorX<doris::pipeline::StreamingAggLocalState>::get_block (this=0x7f871f9bee00, state=0x7f872f0eb400, block=0x7f8716d49060, eos=0x7f87363f4937) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/operator.cpp:587 ``` Deallock happens with following ```cpp Status ScannerContext::get_block_from_queue { std::unique_lock l(_transfer_lock); ... if (scan_task->is_eos()) { ... } else { // resubmit current running scanner to read the next block submit_scan_task(scan_task); } } ScannerContext::submit_scan_task(std::shared_ptr<ScanTask> scan_task) { _scanner_scheduler->submit(shared_from_this(), scan_task); } void ScannerScheduler::submit(std::shared_ptr<ScannerContext> ctx, std::shared_ptr<ScanTask> scan_task) { ... if (auto ret = sumbit_task(); !ret) { scan_task->set_status(Status::InternalError( "Failed to submit scanner to scanner pool reason:" + std::string(ret.msg()) + "|type:" + std::to_string(type))); ctx->append_block_to_queue(scan_task); return; } } void ScannerContext::append_block_to_queue(std::shared_ptr<ScanTask> scan_task) { ... std::lock_guard<std::mutex> l(_transfer_lock); ... } ``` Since mutex in cpp is not re-enterable, so the scanner thread will deadlock with itself. This pr fix the problem by making `ScannerScheduler::submit` return a Status instead of doing append failed task to the ScannerContext. The caller itself will decide where resubmit the scanner or just abort the execution of the query.
We have dead lock when submit scanner to scheduler failed. pstack looks like ```txt Thread 2012 (Thread 0x7f87363fb700 (LWP 4179707) "Pipe_normal [wo"): #0 0x00007f8b8f3dc82d in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007f8b8f3d5ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0 #2 0x000055b20f333e7a in __gthread_mutex_lock (__mutex=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default .h:749 #3 std::mutex::lock (this=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100 #4 std::lock_guard<std::mutex>::lock_guard (__m=..., this=<optimized out>) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:229 #5 doris::vectorized::ScannerContext::append_block_to_queue (this=<optimized out>, scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:234 #6 0x000055b20f32c0f9 in doris::vectorized::ScannerScheduler::submit (this=<optimized out>, ctx=..., scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_scheduler.cpp:209 #7 0x000055b20f3338fc in doris::vectorized::ScannerContext::submit_scan_task (this=this@entry=0x7f8733d96010, scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:217 #8 0x000055b20f3346cd in doris::vectorized::ScannerContext::get_block_from_queue (this=0x7f8733d96010, state=<optimized out>, block=0x7f871f728de0, eos=0x7f871abce470, id=<optimized out>) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:290 #9 0x000055b214cb4f13 in doris::pipeline::ScanOperatorX<doris::pipeline::OlapScanLocalState>::get_block (this=<optimized out>, state=0x7f872f0eb400, block=0x7f8b8f3dc82d <__lll_lock_wait+29>, eos=0x7f871abce470) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/scan_operator.cpp:1292 #10 0x000055b2142b5772 in doris::pipeline::ScanOperatorX<doris::pipeline::OlapScanLocalState>::get_block_after_projects (this=0x80, state=0x0, block=0x7f8b8f3dc82d <__lll_lock_wait+29>, eos=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/scan_operator.h:363 #11 0x000055b2142e7880 in doris::pipeline::StatefulOperatorX<doris::pipeline::StreamingAggLocalState>::get_block (this=0x7f871f9bee00, state=0x7f872f0eb400, block=0x7f8716d49060, eos=0x7f87363f4937) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/operator.cpp:587 ``` Deallock happens with following ```cpp Status ScannerContext::get_block_from_queue { std::unique_lock l(_transfer_lock); ... if (scan_task->is_eos()) { ... } else { // resubmit current running scanner to read the next block submit_scan_task(scan_task); } } ScannerContext::submit_scan_task(std::shared_ptr<ScanTask> scan_task) { _scanner_scheduler->submit(shared_from_this(), scan_task); } void ScannerScheduler::submit(std::shared_ptr<ScannerContext> ctx, std::shared_ptr<ScanTask> scan_task) { ... if (auto ret = sumbit_task(); !ret) { scan_task->set_status(Status::InternalError( "Failed to submit scanner to scanner pool reason:" + std::string(ret.msg()) + "|type:" + std::to_string(type))); ctx->append_block_to_queue(scan_task); return; } } void ScannerContext::append_block_to_queue(std::shared_ptr<ScanTask> scan_task) { ... std::lock_guard<std::mutex> l(_transfer_lock); ... } ``` Since mutex in cpp is not re-enterable, so the scanner thread will deadlock with itself. This pr fix the problem by making `ScannerScheduler::submit` return a Status instead of doing append failed task to the ScannerContext. The caller itself will decide where resubmit the scanner or just abort the execution of the query.
We have dead lock when submit scanner to scheduler failed.
pstack looks like
Deallock happens with following
Since mutex in cpp is not re-enterable, so the scanner thread will deadlock with itself.
This pr fix the problem by making
ScannerScheduler::submit
return a Status instead of doing append failed task to the ScannerContext. The caller itself will decide where resubmit the scanner or just abort the execution of the query.