[refactor](Coordinator) refactor coordinator #41730

Merged
merged 3 commits into from
Nov 7, 2024

@924060929 924060929 commented Oct 11, 2024

Proposed changes

Use NereidsSqlCoordinator instead of Coordinator, because the Coordinator code has become too hard to maintain.

The main design approach is as follows:

  1. Divide the original flat Coordinator into multiple modules, with each module maintaining high cohesion.
  • DistributePlanner: The logic for calculating parallelism has been extracted in [refactor](nereids) New distribute planner #36531, and in the future, we will dynamically calculate parallelism based on cost.
  • CoordinatorContext: Some global parameters and states related to the Coordinator are encapsulated within CoordinatorContext.
  • PipelineExecutionTask: The entire scheduling task is encapsulated by PipelineExecutionTask, which includes the mapping relationship between each Backend and Pipeline task. PipelineExecutionTask contains two layers of tasks, each responsible for specific duties, with state maintenance handled internally rather than being centralized in the Coordinator.
    • MultiFragmentsPipelineTask: A Backend will generate multiple fragment tasks, which are bundled together and sent concurrently to the corresponding Backend.
    • SingleFragmentPipelineTask: A single fragment task for a Backend.
  • JobProcessor: Describes two types of tasks: SQL tasks and Load tasks.
    • QueryProcessor: Represents query tasks and provides a ResultReceiver to obtain query results.
    • LoadProcessor: Represents Insert into and Broker load tasks, providing a blocking function to wait for load completion.
  • ThriftPlansBuilder: Uses the DistributedPlan structure to build thrift parameters and encapsulates some intermediate temporary variables within functions, rather than placing them in the Coordinator.
  2. The overall Coordinator logic is more clearly organized. The NereidsCoordinator consists of only a few functions, so the main flow can be understood quickly when reading the code.
  • Construct CoordinatorContext.
  • Enqueue the tasks.
  • Handle different sinks accordingly.
  • Register the Coordinator with QeProcessorImpl for cancellation and progress tracking.
  • Construct thrift parameters.
  • Build PipelineTask.
  • Initiate RPC calls to each Backend.
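
The two design points above can be sketched in a few lines of Java. This is a minimal illustration, not the code from this PR: the nested types reuse the names CoordinatorContext, JobProcessor, QueryProcessor and LoadProcessor from the description, but every field, constructor, and method signature here (including `execute` and `kind`) is a hypothetical stand-in.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: every type and method here is a hypothetical stand-in, not Doris code. */
final class CoordinatorFlowSketch {

    /** Mirrors the JobProcessor split: one implementation per job kind. */
    interface JobProcessor {
        String kind();
    }

    static final class QueryProcessor implements JobProcessor {
        @Override public String kind() { return "query"; }
    }

    static final class LoadProcessor implements JobProcessor {
        @Override public String kind() { return "load"; }
    }

    /** Holds the global parameters and state that used to live in the flat Coordinator. */
    static final class CoordinatorContext {
        final String queryId;
        final JobProcessor jobProcessor;

        CoordinatorContext(String queryId, JobProcessor jobProcessor) {
            this.queryId = queryId;
            this.jobProcessor = jobProcessor;
        }
    }

    /** Walks the seven listed steps in order and records each one. */
    static List<String> execute(String queryId, boolean isLoad) {
        List<String> steps = new ArrayList<>();
        CoordinatorContext context =
                new CoordinatorContext(queryId, isLoad ? new LoadProcessor() : new QueryProcessor());
        steps.add("construct CoordinatorContext for " + context.queryId);
        steps.add("enqueue the task");
        steps.add("handle sink for " + context.jobProcessor.kind() + " job");
        steps.add("register with QeProcessorImpl");
        steps.add("construct thrift parameters");
        steps.add("build PipelineTask");
        steps.add("send RPCs to each Backend");
        return steps;
    }

    public static void main(String[] args) {
        execute("query-1", false).forEach(System.out::println);
    }
}
```

The point of the sketch is the shape: state lives in the context and in the processors, so the coordinator's entry point reduces to a short, ordered sequence of calls.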

TODO:

  1. delete old Coordinator
  2. support cloud mode

clang-tidy review says "All clean, LGTM! 👍"

@924060929 924060929 marked this pull request as ready for review October 12, 2024 13:09
@924060929
Copy link
Contributor Author

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Oct 25, 2024
@924060929 924060929 force-pushed the new_scheduler5 branch 2 times, most recently from bad6e2a to 51ce209 Compare October 29, 2024 07:28
github-actions bot commented Nov 7, 2024

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 7, 2024
@924060929 924060929 merged commit 46e5294 into apache:master Nov 7, 2024
30 of 32 checks passed
924060929 added a commit that referenced this pull request Nov 12, 2024
…or (#43763)

Fix "QueryProcessor cannot be cast to class LoadProcessor", introduced by #41730.

Problem Summary:

SQL: any SELECT statement.

The error only occurs when debug logging is enabled, so I could not write a test.
```
2024-11-12 08:15:52,266 WARN (mysql-nio-pool-0|206) [ConnectProcessor.handleQueryException():480] Process one query failed because unknown reason: 
java.lang.ClassCastException: class org.apache.doris.qe.runtime.QueryProcessor cannot be cast to class org.apache.doris.qe.runtime.LoadProcessor (org.apache.doris.qe.runtime.QueryProcessor and org.apache.doris.qe.runtime.LoadProcessor are in unnamed module of loader 'app')
	at org.apache.doris.qe.CoordinatorContext.asLoadProcessor(CoordinatorContext.java:262) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.NereidsCoordinator.getJobId(NereidsCoordinator.java:202) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.QeProcessorImpl.registerQuery(QeProcessorImpl.java:116) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.StmtExecutor.executeAndSendResult(StmtExecutor.java:1925) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.StmtExecutor.handleQueryStmt(StmtExecutor.java:1897) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.StmtExecutor.handleQueryWithRetry(StmtExecutor.java:901) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.StmtExecutor.executeByNereids(StmtExecutor.java:833) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:605) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.StmtExecutor.queryRetry(StmtExecutor.java:568) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.StmtExecutor.execute(StmtExecutor.java:558) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.ConnectProcessor.executeQuery(ConnectProcessor.java:340) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.ConnectProcessor.handleQuery(ConnectProcessor.java:243) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.MysqlConnectProcessor.handleQuery(MysqlConnectProcessor.java:209) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.MysqlConnectProcessor.dispatch(MysqlConnectProcessor.java:237) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.qe.MysqlConnectProcessor.processOnce(MysqlConnectProcessor.java:414) ~[doris-fe.jar:1.2-SNAPSHOT]
	at org.apache.doris.mysql.ReadListener.lambda$handleEvent$0(ReadListener.java:52) ~[doris-fe.jar:1.2-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:840) ~[?:?] 
```
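
The top frames of the trace show `NereidsCoordinator.getJobId` reaching `CoordinatorContext.asLoadProcessor` while the job is a query. The sketch below reproduces that failure shape and one possible guard. It is not the fix applied in #43763: the nested types only borrow the names from the trace, and `unsafeJobId`, `safeJobId`, the `jobId` field, and the `-1` placeholder are all assumptions made for illustration.

```java
/** Illustrative only: hypothetical stand-ins that reproduce the shape of the failure above. */
final class ProcessorCastSketch {

    interface JobProcessor {}

    static final class QueryProcessor implements JobProcessor {}

    static final class LoadProcessor implements JobProcessor {
        final long jobId;

        LoadProcessor(long jobId) {
            this.jobId = jobId;
        }
    }

    /** Unchecked downcast: throws ClassCastException when the job is a query. */
    static long unsafeJobId(JobProcessor processor) {
        return ((LoadProcessor) processor).jobId;
    }

    /** Guarded variant: only load jobs carry a job id; -1 is an assumed placeholder for "none". */
    static long safeJobId(JobProcessor processor) {
        if (processor instanceof LoadProcessor) {
            return ((LoadProcessor) processor).jobId;
        }
        return -1L;
    }

    public static void main(String[] args) {
        JobProcessor query = new QueryProcessor();
        System.out.println(safeJobId(query));
        try {
            unsafeJobId(query);
        } catch (ClassCastException e) {
            System.out.println("unsafe cast failed: " + e.getClass().getSimpleName());
        }
    }
}
```

Because the commit message says the failing path is only exercised with debug logging on, a guard of this kind would matter mainly for code that reads load-only properties from a shared code path such as logging.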
py023 pushed a commit to py023/doris that referenced this pull request Nov 13, 2024
…or (apache#43763)

924060929 added a commit that referenced this pull request Nov 13, 2024
Fix the new coordinator computing a wrong scanRangeNum, introduced by #41730.

This bug shows incorrect progress in S3 load:
```
Progress: 0.00%(73/2147483647)
```
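
The denominator 2147483647 equals Java's `Integer.MAX_VALUE`, which is why the percentage rounds to 0.00% even after 73 units have finished. The sketch below shows the arithmetic only. The `format` helper and its parameters are hypothetical, and treating the denominator as the computed scan range count is an assumption based on the commit message, not on the Doris source.

```java
import java.util.Locale;

/** Illustrative only: a hypothetical formatter showing why a huge total prints as 0.00%. */
final class LoadProgressSketch {

    /** Formats finished/total the same way as the progress line quoted above. */
    static String format(long finished, long total) {
        double percent = total <= 0 ? 0.0 : finished * 100.0 / total;
        return String.format(Locale.ROOT, "Progress: %.2f%%(%d/%d)", percent, finished, total);
    }

    public static void main(String[] args) {
        // With a sentinel-sized total, 73 finished units are far below 0.01 percent.
        System.out.println(format(73, Integer.MAX_VALUE));
        // With a realistic total (100 is an arbitrary example), the same count is visible.
        System.out.println(format(73, 100));
    }
}
```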
924060929 added a commit that referenced this pull request Nov 18, 2024
Optimize the new distribute planner's performance on TPC-H, because #41730 introduced a performance regression.

1. Fix the wrong runtime filter thrift parameters.
2. Do not print the distribute plan in the profile by default; run `set profile_level=3` to see it.
3. For a shuffle join whose two sides have natural + execution_bucketed distributions, support comparing the cost of the shuffle-to-left and shuffle-to-right plans.
924060929 added a commit that referenced this pull request Dec 2, 2024
…s CTE (#44753)

Fix NereidsCoordinator computing a wrong result when a CTE exists, introduced by #41730.
Labels
approved Indicates a PR has been approved by one committer. reviewed
3 participants