[SPARK-48258][PYTHON][CONNECT] Checkpoint and localCheckpoint in Spark Connect

### What changes were proposed in this pull request?

This PR proposes to add the `DataFrame.checkpoint` and `DataFrame.localCheckpoint` APIs in Spark Connect.

#### Overview

1. The Spark Connect client invokes `[local]checkpoint`.
   - It connects to the server, which stores a (session ID, UUID) <> checkpointed DataFrame mapping.
2. The server executes `[local]checkpoint`.
3. The server returns a UUID for the checkpointed DataFrame.
   - The client side holds the UUID, replacing (truncating) the original protobuf plan with it.
4. When the client-side DataFrame is garbage collected, a request is sent to clear its state on the Spark Connect server.
5. If the checkpointed RDD is no longer referenced (e.g., not even by a temp view), it is cleaned by `ContextCleaner` (which runs separately, and periodically).
6. *When the session is closed, the server attempts to clear all mapped state, because `DataFrame.__del__` is not guaranteed to be called in Python upon garbage collection.
7. *If the checkpointed RDD is no longer referenced (e.g., not even by a temp view), it is cleaned by `ContextCleaner` (which runs separately, and periodically).

*In 99.999% of cases, the state (`map<(session_id, uuid), checkpointed DataFrame>`) is cleared when the DataFrame is garbage collected (see the sketch below), unless something crashes; practically, Py4J relies on the same mechanism to clean up its Java objects. Steps 6 and 7 cover the remaining 0.001% of cases: both happen when the session is closed and the session holder is released. See also [#41580](#41580).
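The garbage-collection path in steps 4–7 can be pictured with a small client-side sketch. This is illustrative only, not the actual implementation: the class and the `client.remove_cached_remote_relation` helper are hypothetical stand-ins for issuing `RemoveCachedRemoteRelationCommand`.

```python
import weakref


class CachedRemoteRelationRef:
    """Illustrative client-side handle for a checkpointed DataFrame.

    The server keeps a (session_id, uuid) -> checkpointed DataFrame map;
    this handle asks the server to drop that entry once the client-side
    object is garbage collected.
    """

    def __init__(self, client, dataframe_uuid: str):
        self._uuid = dataframe_uuid
        # weakref.finalize fires when this handle is garbage collected
        # (or at interpreter exit), even if __del__ never runs, which
        # mirrors the best-effort cleanup in step 4 above.
        self._finalizer = weakref.finalize(
            self, CachedRemoteRelationRef._release, client, dataframe_uuid
        )

    @staticmethod
    def _release(client, dataframe_uuid: str) -> None:
        # Hypothetical helper standing in for the real RPC; the server
        # then drops its (session_id, uuid) entry, and ContextCleaner
        # later reclaims the RDD once nothing references it (step 5/7).
        client.remove_cached_remote_relation(dataframe_uuid)
```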
#### Command/RPCs

Reuse `CachedRemoteRelation` (from [#41580](#41580)):

```proto
message Command {
  oneof command_type {
    ...
    CheckpointCommand checkpoint_command = 14;
    RemoveCachedRemoteRelationCommand remove_cached_remote_relation_command = 15;
    ...
  }
}

// Command to remove `CachedRemoteRelation`
message RemoveCachedRemoteRelationCommand {
  // (Required) The cached remote relation to be removed.
  CachedRemoteRelation relation = 1;
}

message CheckpointCommand {
  // (Required) The logical plan to checkpoint.
  Relation relation = 1;

  // (Optional) Whether to checkpoint locally, using a local temporary
  // directory on the Spark Connect server (Spark driver).
  optional bool local = 2;

  // (Optional) Whether to checkpoint this DataFrame immediately.
  optional bool eager = 3;
}

message CheckpointCommandResult {
  // (Required) The logical plan checkpointed.
  CachedRemoteRelation relation = 1;
}
```

```proto
message ExecutePlanResponse {
  ...
  oneof response_type {
    ...
    CheckpointCommandResult checkpoint_command_result = 19;
  }
  ...
  message Checkpoint {
    // (Required) The logical plan checkpointed.
    CachedRemoteRelation relation = ...;
  }
}
```

#### Usage

```bash
./sbin/start-connect-server.sh --conf spark.checkpoint.dir=/path/to/checkpoint
```

```python
spark.range(1).localCheckpoint()
spark.range(1).checkpoint()
```
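Both methods keep their standard PySpark signatures, so the `eager` flag maps onto the optional `eager` field of `CheckpointCommand`. For example (assuming an active `spark` session, as in the snippets above):

```python
# Eager (default): checkpoint the DataFrame immediately.
df = spark.range(10).checkpoint()

# Lazy: defer checkpointing until the DataFrame is first computed.
lazy_df = spark.range(10).localCheckpoint(eager=False)
lazy_df.count()  # materializes the local checkpoint
```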
{"ts":"2024-05-14T06:18:01.711Z","level":"INFO","msg":"Caching DataFrame with id 7316f315-d20d-446d-b5e7-ac848870e280","context":{"dataframe_id":"7316f315-d20d-446d-b5e7-ac848870e280"},"logger":"SparkConnectAnalyzeHandler"} ... {"ts":"2024-05-14T06:18:11.718Z","level":"INFO","msg":"Removing DataFrame with id 7316f315-d20d-446d-b5e7-ac848870e280 from the cache","context":{"dataframe_id":"7316f315-d20d-446d-b5e7-ac848870e280"},"logger":"SparkConnectPlanner"} ... ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46570 from HyukjinKwon/SPARK-48258. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
1 parent 0393ab4 · commit 7d6bb74 · 18 changed files with 656 additions and 295 deletions