[SPARK-44872][CONNECT] Server testing infra and ReattachableExecuteSuite #42560
Conversation
```scala
// Other suites using mocks leave a mess in the global executionManager,
// shut it down so that it's cleared before starting server.
SparkConnectService.executionManager.shutdown()
```
The fact that we use singletons is ugly. I raised https://issues.apache.org/jira/browse/SPARK-44779 for it some time ago already.
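A toy model of the problem being worked around (illustrative only, not Spark's actual code): when suites share a process-wide singleton, one suite's leftover state leaks into the next unless setup explicitly resets it, which is what the `shutdown()` call above does.

```python
# Illustrative sketch of resetting a process-wide singleton between suites.
# Names are hypothetical; only the pattern mirrors the Spark Connect case.
class ExecutionManager:
    def __init__(self):
        self.executions = {}

    def register(self, op_id):
        self.executions[op_id] = "RUNNING"

    def shutdown(self):
        # Drop all tracked executions so the next suite starts clean.
        self.executions.clear()

# Module-level singleton, analogous to SparkConnectService.executionManager.
execution_manager = ExecutionManager()

def setup_suite():
    # A previous suite using mocks may have left entries behind; shut the
    # manager down so its state does not leak into this suite's server.
    execution_manager.shutdown()
```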
…rror

### What changes were proposed in this pull request?
Make INVALID_CURSOR.DISCONNECTED a retriable error.

### Why are the changes needed?
This error can happen if two RPCs are racing to reattach to the query, and the client is still using the losing one. SPARK-44833 was a bug that exposed such a situation. That was fixed, but to be more robust, we can make this error retryable.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests will be added in #42560

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42818 from juliuszsompolski/SPARK-44835.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit f13743d; signed off by Hyukjin Kwon <[email protected]>)
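The reattach race described above can be sketched as a client-side classification rule (a hypothetical helper, not Spark's actual API): the RPC that loses the race to reattach receives INVALID_CURSOR.DISCONNECTED, and since the client can simply issue a fresh reattach, that error is safe to retry, unlike, say, a handle that already exists.

```python
# Hedged sketch of client-side retry classification. Error class names
# follow the PR discussion; the helper itself is hypothetical.
RETRIABLE_ERROR_CLASSES = {
    "INVALID_CURSOR.DISCONNECTED",  # another RPC won the race to reattach
    "UNAVAILABLE",                  # transient transport failure
}

def should_retry(error_class: str) -> bool:
    """Return True when the client should reattach and retry the RPC."""
    return error_class in RETRIABLE_ERROR_CLASSES
```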
a6e63a3 to 48aa78e
(deleting comment - was supposed to be on another PR)
https://github.com/juliuszsompolski/apache-spark/actions/runs/6157633940/job/16708789265
flaky
Alright, merging.
### What changes were proposed in this pull request?
Add `SparkConnectServerTest` with infra to test real server with real client in the same process, but communicating over RPC.

Add `ReattachableExecuteSuite` with some tests for reattachable execute.

Two bugs were found by the tests:
* Fix bug in `SparkConnectExecutionManager.createExecuteHolder` when attempting to resubmit an operation that was deemed abandoned. This bug is benign in reattachable execute, because reattachable execute would first send a ReattachExecute, which would be handled correctly in SparkConnectReattachExecuteHandler. For non-reattachable execute (disabled or old client), this is also a very unlikely scenario, because the retrying mechanism should be able to resubmit before the query is declared abandoned, and hence get an INVALID_HANDLE.OPERATION_ALREADY_EXISTS. This bug can manifest only if a non-reattachable execution is retried with so much delay that the operation was declared abandoned.
* In `ExecuteGrpcResponseSender` there was an assertion that assumed that if `sendResponse` did not send, it was because the deadline was reached. But it can also be because of interrupt. This would have resulted in interrupt returning an assertion error instead of CURSOR_DISCONNECTED in testing. Outside of testing assertions are not enabled, so this was not a problem outside of testing.

### Why are the changes needed?
Testing of reattachable execute.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests added.

Closes #42560 from juliuszsompolski/sc-reattachable-tests.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
(cherry picked from commit 4b96add)
Signed-off-by: Herman van Hovell <[email protected]>
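The second bug fix can be modeled in a few lines (a toy sketch, not the actual `ExecuteGrpcResponseSender` code): a send loop can stop early for two distinct reasons, deadline expiry or interruption, so the post-loop assertion must accept either, not just the deadline.

```python
import threading
import time

def send_responses(responses, deadline, interrupted):
    """Toy sender loop: stops on completion, deadline expiry, or interrupt.

    `interrupted` is a threading.Event standing in for thread interruption.
    Returns the number of responses "sent".
    """
    sent = 0
    for _ in responses:
        if time.monotonic() >= deadline or interrupted.is_set():
            break
        sent += 1  # stand-in for pushing one response over the stream
    # The fixed assertion: an early stop is legal for EITHER reason. The
    # buggy version only allowed the deadline case, so an interrupt tripped
    # the assertion instead of surfacing CURSOR_DISCONNECTED.
    assert (sent == len(responses)
            or time.monotonic() >= deadline
            or interrupted.is_set())
    return sent
```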
…st result task is finished

### What changes were proposed in this pull request?
In the situation before, query will only be FINISHED when all results have been pushed into the output buffers (not necessarily received by client, but pushed out of the server). For LocalTableScanExec, post FINISHED before sending result batches, because nothing is executed, only cached local results are returned. For regular execution, post FINISHED after all task results have been returned from Spark, not after they have been processed and sent out.

### Why are the changes needed?
Currently, even if a query finished running in Spark, it keeps being RUNNING until all results are sent. Then there is a very small difference between FINISHED and CLOSED. This change makes it behave more similar to e.g. Thriftserver.

### Does this PR introduce _any_ user-facing change?
Yes. Queries will be posted as FINISHED when they finish executing, not when they finish sending results.

### How was this patch tested?
Will add test in #42560

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42889 from juliuszsompolski/SPARK-45133.
Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
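The event-ordering change above can be illustrated with a toy model (not Spark's code): FINISHED is posted once the last task result has returned, before result batches are streamed out, and CLOSED still comes last.

```python
# Toy model of the new query lifecycle ordering. Event names mirror the
# states discussed in the commit message; the function is hypothetical.
def run_query(batches):
    events = ["RUNNING"]
    # ... tasks execute and their results return from Spark ...
    events.append("FINISHED")        # new: posted before sending results
    for i, _ in enumerate(batches):
        events.append(f"SENT[{i}]")  # result batches streamed to the client
    events.append("CLOSED")
    return events
```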
Hi, @juliuszsompolski and @hvanhovell. `ReattachableExecuteSuite` seems to be very unstable.
Since this PR was merged, all CI runs have failed consecutively, except for one doc-only PR.
- https://github.com/apache/spark/actions/runs/6167442896/job/16738435813
- https://github.com/apache/spark/actions/runs/6161837332/job/16734028280
- https://github.com/apache/spark/actions/runs/6161871902/job/16722114515
- https://github.com/apache/spark/actions/runs/6167186123/job/16737786210
- https://github.com/apache/spark/actions/runs/6166972473/job/16737277903
Apologies. I ran the final version 10+ times on my machine without flakes, but it may be related to the CI runner having fewer resources. Looking into it.
All the linked runs fail with the same error:
It doesn't reproduce for me, but I know what could be going on, so I will tweak the test.
Thank you for checking, @juliuszsompolski.
@juliuszsompolski For the daily tests of Scala 2.13, in addition to https://github.com/apache/spark/actions/runs/6163764050/job/16728064919
What changes were proposed in this pull request?

Add `SparkConnectServerTest` with infra to test a real server with a real client in the same process, but communicating over RPC.

Add `ReattachableExecuteSuite` with some tests for reattachable execute.

Two bugs were found by the tests:
- Fix bug in `SparkConnectExecutionManager.createExecuteHolder` when attempting to resubmit an operation that was deemed abandoned. This bug is benign in reattachable execute, because reattachable execute would first send a ReattachExecute, which would be handled correctly in SparkConnectReattachExecuteHandler. For non-reattachable execute (disabled or old client), this is also a very unlikely scenario, because the retrying mechanism should be able to resubmit before the query is declared abandoned, and hence get an INVALID_HANDLE.OPERATION_ALREADY_EXISTS. This bug can manifest only if a non-reattachable execution is retried with so much delay that the operation was declared abandoned.
- In `ExecuteGrpcResponseSender` there was an assertion that assumed that if `sendResponse` did not send, it was because the deadline was reached. But it can also be because of an interrupt. This would have resulted in an interrupt returning an assertion error instead of CURSOR_DISCONNECTED in testing. Outside of testing, assertions are not enabled, so this was not a problem.

Why are the changes needed?

Testing of reattachable execute.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Tests added.
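The testing pattern the PR adds, a real server and a real client in one process talking over a local connection instead of mocks, can be sketched minimally (assumption: a plain TCP echo stands in for the gRPC transport; names are hypothetical, not Spark's test infra).

```python
import socket
import threading

def start_server():
    """Start an in-process echo server on an OS-assigned free port."""
    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        conn.sendall(conn.recv(1024))  # echo = stand-in for an RPC handler
        conn.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv, srv.getsockname()[1]

def client_roundtrip(port, payload):
    """Real client in the same process, communicating over the socket."""
    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(payload)
        return c.recv(1024)
```

The point of the pattern is that both ends exercise the real wire protocol, so races like the reattach scenarios above can actually occur in tests, unlike with mocks.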