[improvement] Optimize send fragment logic to reduce send fragment timeout error #9720
LGTM
Force-pushed from fb9a06e to 75b42c2.
I have run the following tests:
    if (_need_wait_execution_trigger) {
        // if _need_wait_execution_trigger is true, which means this instance
        // is prepared but need to wait for the signal to do the rest execution.
        _fragments_ctx->wait_for_start();
Add a timeout here, to avoid occupying the running thread for too long when exceptions occur.
No need. There is a "timeout checker" on the BE side that checks for this and sends the notification.
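To make the exchange above concrete, here is a minimal C++ sketch of a start gate that a separate checker can release. It assumes nothing about the real BE code beyond the `wait_for_start()` call quoted above: the class name `StartGate` and the methods `start()` and `cancel_if_not_started()` are hypothetical, and the actual "timeout checker" may work differently.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the fragments context. Only wait_for_start()
// mirrors a name from the quoted diff; everything else is an assumption.
class StartGate {
public:
    // Blocks until start() or cancel_if_not_started() releases the waiter.
    // Returns true if execution should proceed, false if it was cancelled.
    bool wait_for_start() {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait(lock, [this] { return _started || _cancelled; });
        return _started;
    }

    // Called when the "start" signal for the prepared fragments arrives.
    void start() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _started = true;
        }
        _cv.notify_all();
    }

    // Called by a background checker once the deadline has passed.
    // Returns true if it cancelled the wait, false if start() already ran.
    bool cancel_if_not_started() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            if (_started) {
                return false;
            }
            _cancelled = true;
        }
        _cv.notify_all();
        return true;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _started = false;
    bool _cancelled = false;
};
```

In this sketch the waiting thread needs no timeout of its own, because the checker calls `cancel_if_not_started()` and the wait returns `false`. An untimed wait is only safe if such a checker is guaranteed to run.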
I think it would be better to split this into two RPC interfaces.
In addition to the observed reduction in the number of RPCs, are there any mitigations for the timeout issue?
Let me try.
I tested a SQL query with 3 fragments on 1 FE and 1 BE, using JMeter with 100 threads.
LGTM
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
…meout error (#9720)

This CL mainly changes:

1. Reducing the rpc timeout problem caused by rpc waiting for the worker thread of brpc.
    1. Merge multiple fragment instances on the same BE to send requests to reduce the number of send fragment rpcs
    2. If fragments size >= 3, use 2 phase RPC: one is to send all fragments, two is to start these fragments. So that there will be at most 2 RPC for each query on one BE.
3. Set the timeout of send fragment rpc to the query timeout to ensure the consistency of users' expectation of query timeout period.
4. Do not close the connection anymore when rpc timeout occurs.
5. Change some log level from info to debug to simplify the fe.log content.

NOTICE:

1. Change the definition of execPlanFragment rpc, must first upgrade BE.
3. Remove FE config `remote_fragment_exec_timeout_ms`
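The RPC-count reasoning in this commit message can be illustrated with a small C++ sketch. It is not the FE implementation. The `FragmentInstance` struct, the `rpcs_per_backend` function, and the backend names are hypothetical. The one-phase branch also rests on an assumption the message does not spell out: with fewer than 3 fragments, each fragment's instances for a given BE travel in one merged request.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical description of one fragment instance and its target BE.
struct FragmentInstance {
    int fragment_id;
    std::string backend;  // illustrative backend name, e.g. "be1"
};

// Returns the number of send-fragment RPCs each backend would receive.
std::map<std::string, std::size_t> rpcs_per_backend(
        const std::vector<FragmentInstance>& instances) {
    // Collect the distinct fragments overall and per backend.
    std::set<int> all_fragments;
    std::map<std::string, std::set<int>> fragments_on_backend;
    for (const auto& inst : instances) {
        all_fragments.insert(inst.fragment_id);
        fragments_on_backend[inst.backend].insert(inst.fragment_id);
    }

    // Threshold taken from the commit message: fragments size >= 3.
    const bool two_phase = all_fragments.size() >= 3;

    std::map<std::string, std::size_t> result;
    for (const auto& entry : fragments_on_backend) {
        // Two-phase: one RPC sends all fragments, one RPC starts them.
        // One-phase (assumed): one merged RPC per fragment on this backend.
        result[entry.first] = two_phase ? 2 : entry.second.size();
    }
    return result;
}
```

Under these assumptions no backend receives more than 2 RPCs for a query, however many instances it hosts, which matches the "at most 2 RPC for each query on one BE" claim.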
This bug was introduced from apache#9720
…pache#12495)

1. For a query with 1656 unions, the plan thrift size is reduced from 400MB+ to 2MB. This optimization was introduced in apache#4904 but lost after apache#9720.
2. Disable ExprSubstitutionMap.verify when debug is disabled, so that the plan time of a query with 1656 unions is reduced from 20s to 2s.
Proposed changes
Issue Number: close #8942
Problem Summary:
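This CL mainly changes:

1. Reduce the RPC timeouts caused by RPCs waiting for a brpc worker thread.
    1. Merge the fragment instances bound for the same BE into one request, to reduce the number of send-fragment RPCs.
    2. If the number of fragments is >= 3, use a 2-phase RPC: the first phase sends all fragments, the second starts them. There will then be at most 2 RPCs for each query on one BE.
3. Set the timeout of the send-fragment RPC to the query timeout, so that it matches users' expectation of the query timeout period.
4. Do not close the connection anymore when an RPC timeout occurs.
5. Change some log levels from info to debug to simplify the fe.log content.

NOTICE:

1. The definition of the execPlanFragment RPC changes, so BE must be upgraded first.
3. The FE config `remote_fragment_exec_timeout_ms` is removed.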