[improvement] Optimize send fragment logic to reduce send fragment timeout error #9720

morningman · 2022-05-21T15:40:36Z

Proposed changes

Issue Number: close #8942

Problem Summary:

This CL mainly changes:

1. Reduce the RPC timeouts caused by requests waiting for a brpc worker thread:
   1. Merge the fragment instances destined for the same BE into one request, which reduces the number of send fragment RPCs.
   2. If the fragments size is >= 3, use a two-phase RPC: the first phase sends all fragments and the second starts them, so there are at most 2 RPCs per query on each BE.
2. Set the timeout of the send fragment RPC to the query timeout, so that it matches what users expect from the query timeout.
3. Do not close the connection anymore when an RPC timeout occurs.
4. Change some log levels from info to debug to simplify the fe.log content.
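
The merging and the two-phase threshold described in this list can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FragmentInstance`, `group_by_backend`, and `use_two_phase` are hypothetical names, and the real coordinator logic lives on the FE side.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for one fragment instance scheduled on a BE.
struct FragmentInstance {
    int fragment_id;
    std::string backend;  // address of the target BE, e.g. "host:port"
};

// Group instances by target BE so that each BE can receive one merged request
// instead of one request per fragment instance.
std::map<std::string, std::vector<FragmentInstance>> group_by_backend(
        const std::vector<FragmentInstance>& instances) {
    std::map<std::string, std::vector<FragmentInstance>> grouped;
    for (const auto& inst : instances) {
        grouped[inst.backend].push_back(inst);
    }
    return grouped;
}

// Threshold taken from the description: 3 or more fragments use the
// two-phase path (send all fragments first, then start them).
bool use_two_phase(std::size_t fragment_count) {
    return fragment_count >= 3;
}
```

With this grouping, the number of requests depends on the number of BEs and phases rather than on the number of fragment instances.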

NOTICE:

1. This changes the definition of the execPlanFragment RPC, so BE must be upgraded first.
2. The FE config `remote_fragment_exec_timeout_ms` is removed.

Checklist(Required)

Does it affect the original behavior: (Yes)
Has unit tests been added: (No Need)
Has document been added or modified: (Yes)
Does it need to update dependencies: (No)
Are there any changes that cannot be rolled back: (No)


xinyiZzz · 2022-05-24T00:04:03Z

LGTM

morningman · 2022-05-29T06:01:16Z

I have run the following tests:

1. Complex query with 47 fragments. Before: 47 RPCs. After: 2 RPCs.
2. High concurrency test for a query with 3 fragments: there is no impact on QPS.
3. High concurrency test for a query with 2 fragments: there is no impact on QPS.
4. The new BE is compatible with requests from the old FE.

morningman · 2022-05-29T07:57:21Z

@xinyiZzz @yangzhg PTAL

yiguolei · 2022-05-31T00:52:05Z

be/src/runtime/fragment_mgr.cpp

+    if (_need_wait_execution_trigger) {
+        // if _need_wait_execution_trigger is true, which means this instance
+        // is prepared but need to wait for the signal to do the rest execution.
+        _fragments_ctx->wait_for_start();


Add a timeout here to avoid occupying the running thread for too long when an exception occurs.

morningman · 2022-05-31

No need. There is a "timeout checker" on the BE side that checks for this and notifies the waiting instance.
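
To illustrate the point in this exchange, here is a minimal sketch of a wait that has no timeout of its own but can still be released by an external checker. It is an assumption-laden stand-in, not the Doris implementation: `StartGate`, `set_ready`, and `cancel` are hypothetical names, and only `wait_for_start` mirrors a name from the diff.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the per-query context that instances wait on.
class StartGate {
public:
    // Blocks until either the start signal or a cancellation arrives.
    // Returns true if the instance should go on to execute.
    bool wait_for_start() {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait(lock, [this] { return _ready || _cancelled; });
        return _ready && !_cancelled;
    }

    // Called when the second-phase "start" request arrives.
    void set_ready() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _ready = true;
        }
        _cv.notify_all();
    }

    // Called by a separate checker (assumed here) when the query has timed out,
    // so waiting threads are released without a per-wait timeout.
    void cancel() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _cancelled = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _ready = false;
    bool _cancelled = false;
};
```

In this design the waiter stays simple, and the responsibility for bounding the wait sits with whichever component calls `cancel()`.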

yangzhg · 2022-05-31T03:08:26Z

I think it is better to split this into separate RPC interfaces. exec_plan_fragment would handle only the case where the fragments size is < 3. When a two-phase RPC is required, use two new RPC interfaces, exec_plan_fragment_prepare and exec_plan_fragment_start. This may be clearer.
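
The split proposed in this comment could look roughly like the sketch below. The three interface names are taken from the comment; the function `rpcs_for_backend` is hypothetical, and it assumes that the single-phase case needs only one merged call per BE.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the RPC interfaces that would be called, in order, for one BE.
// Hypothetical illustration of the proposal, not the merged implementation.
std::vector<std::string> rpcs_for_backend(std::size_t fragment_count) {
    if (fragment_count < 3) {
        // Single phase: assumed to be one merged request that prepares and starts.
        return {"exec_plan_fragment"};
    }
    // Two phases: send all fragments first, then trigger their execution.
    return {"exec_plan_fragment_prepare", "exec_plan_fragment_start"};
}
```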

yangzhg · 2022-05-31T11:55:05Z

> I have run the following tests:
>
> 1. Complex query with 47 fragments. Before: 47 RPCs. After: 2 RPCs.
> 2. High concurrency test for a query with 3 fragments: there is no impact on QPS.
> 3. High concurrency test for a query with 2 fragments: there is no impact on QPS.
> 4. The new BE is compatible with requests from the old FE.

In addition to the observed reduction in the number of RPCs, are there any mitigations for the timeout issue?

morningman · 2022-06-01T01:21:55Z

> I think it is better to split this into separate RPC interfaces. exec_plan_fragment would handle only the case where the fragments size is < 3. When a two-phase RPC is required, use two new RPC interfaces, exec_plan_fragment_prepare and exec_plan_fragment_start. This may be clearer.

Let me try

morningman · 2022-06-01T01:23:58Z

> In addition to the observed reduction in the number of RPCs, are there any mitigations for the timeout issue?

I tested a SQL query with 3 fragments on 1 FE and 1 BE, using JMeter with 100 threads.
Before: send fragment timeouts appeared after running for a few seconds.
After: no errors.

dataroaring

LGTM

github-actions · 2022-06-03T03:40:54Z

PR approved by at least one committer and no changes requested.

github-actions · 2022-06-03T03:40:56Z

PR approved by anyone and no changes requested.

…12495)
1. For a query with 1656 unions, the plan thrift size is reduced from 400MB+ to 2MB. This optimization was introduced in #4904 but lost after #9720.
2. Disable ExprSubstitutionMap.verify when debug is disabled, so that the planning time of a query with 1656 unions is reduced from 20s to 2s.

morningman added the kind/improvement label May 21, 2022

github-actions bot added the kind/docs Categorizes issue or PR as related to documentation. label May 21, 2022

morningman added this to the v1.1 milestone May 27, 2022

morningman added the dev/1.0.1-deprecated should be merged into dev-1.0.1 branch label May 27, 2022

morningman force-pushed the send_timeout branch from 2e13cfd to 407e224 Compare May 27, 2022 15:56

morningman marked this pull request as draft May 28, 2022 03:04

morningman force-pushed the send_timeout branch 3 times, most recently from fb9a06e to 75b42c2 Compare May 29, 2022 05:53

github-actions bot added area/load Issues or PRs related to all kinds of load area/routine load labels May 29, 2022

morningman marked this pull request as ready for review May 29, 2022 06:05

yiguolei reviewed May 31, 2022

morningman force-pushed the send_timeout branch from 0a9b92a to d310a5d Compare May 31, 2022 01:57

morningman force-pushed the send_timeout branch from 139ccaa to f6a180c Compare June 1, 2022 15:49

morningman added 7 commits June 3, 2022 09:01

pass

6f2971a

code format

969000f

add log

9f4a1e0

add log

593560d

fix ut

b6d5d68

1

2dd009d

fix ut

a98a82e

morningman force-pushed the send_timeout branch from f6a180c to a98a82e Compare June 3, 2022 01:18

add missing service

d606a3e

dataroaring approved these changes Jun 3, 2022

github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Jun 3, 2022

morningman merged commit c996334 into apache:master Jun 3, 2022

morningman added dev/merged-1.0.1-deprecated PR has been merged into dev-1.0.1 and removed dev/1.0.1-deprecated should be merged into dev-1.0.1 branch labels Jun 3, 2022

morningman added a commit to morningman/doris that referenced this pull request Jun 7, 2022

[fix](coordinator) fix bug that unable to generate query profile

6a4b29c

This bug was introduced from apache#9720

morningman mentioned this pull request Jun 7, 2022

[fix](coordinator) fix bug that unable to generate query profile #10002

Merged

morningman added a commit to morningman/doris that referenced this pull request Jun 8, 2022

[fix](coordinator) fix bug that unable to generate query profile

b881abe

This bug was introduced from apache#9720

yiguolei pushed a commit that referenced this pull request Jun 8, 2022

[fix](coordinator) fix bug that unable to generate query profile (#10002)

dcdfc5b

This bug was introduced from #9720

morningman added a commit that referenced this pull request Jun 8, 2022

[fix](coordinator) fix bug that unable to generate query profile (#10002)

440ad03

This bug was introduced from #9720

morningman mentioned this pull request Sep 8, 2022

[improvement](planner) unset common fields to reduce plan thrift size #12495

Merged
