[Spark 4.0] Account for PartitionedFileUtil.splitFiles signature change. #10857

Merged
mythrocks merged 11 commits into NVIDIA:branch-24.08 on May 31, 2024

Conversation

mythrocks
Collaborator

@mythrocks mythrocks commented May 21, 2024

Fixes #10299.

In Apache Spark 4.0, the signature of PartitionedFileUtil.splitFiles was changed to remove unused parameters (apache/spark@eabea643c74). This causes the Spark RAPIDS plugin build to break with Spark 4.0.

This commit introduces a shim to account for the signature change.
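
As a rough illustration of the shim idea described above, the sketch below hides a dropped parameter behind a common interface so that callers compile against either variant. It is a minimal sketch under stated assumptions: `FileInfo`, `Split`, `Session`, `OldFileUtil`, `NewFileUtil`, and `SplitFilesShim` are hypothetical stand-ins invented for this example, not the actual Spark or spark-rapids classes, and the splitting logic is simplified.

```scala
// Hypothetical stand-in types; none of these are real Spark or spark-rapids classes.
final case class FileInfo(path: String, length: Long)
final case class Split(path: String, start: Long, length: Long)
final class Session // placeholder for a parameter that the newer signature drops

// Simplified splitting logic shared by both stand-in utilities.
object SplitLogic {
  def split(file: FileInfo, isSplitable: Boolean, maxSplitBytes: Long): Seq[Split] = {
    require(maxSplitBytes > 0, "maxSplitBytes must be positive")
    if (isSplitable && file.length > 0) {
      // One split per maxSplitBytes-sized chunk; the last chunk may be shorter.
      (0L until file.length by maxSplitBytes).map { offset =>
        Split(file.path, offset, math.min(maxSplitBytes, file.length - offset))
      }
    } else {
      Seq(Split(file.path, 0L, file.length))
    }
  }
}

// Older-style signature: accepts a session argument that it never uses.
object OldFileUtil {
  def splitFiles(session: Session, file: FileInfo,
      isSplitable: Boolean, maxSplitBytes: Long): Seq[Split] =
    SplitLogic.split(file, isSplitable, maxSplitBytes)
}

// Newer-style signature: the unused argument has been removed.
object NewFileUtil {
  def splitFiles(file: FileInfo, isSplitable: Boolean, maxSplitBytes: Long): Seq[Split] =
    SplitLogic.split(file, isSplitable, maxSplitBytes)
}

// The shim interface that callers compile against, whichever variant is present.
trait SplitFilesShim {
  def splitFiles(session: Session, file: FileInfo,
      isSplitable: Boolean, maxSplitBytes: Long): Seq[Split]
}

// Shim for the older signature: forwards every argument.
object OldShim extends SplitFilesShim {
  override def splitFiles(session: Session, file: FileInfo,
      isSplitable: Boolean, maxSplitBytes: Long): Seq[Split] =
    OldFileUtil.splitFiles(session, file, isSplitable, maxSplitBytes)
}

// Shim for the newer signature: drops the argument that no longer exists.
object NewShim extends SplitFilesShim {
  override def splitFiles(session: Session, file: FileInfo,
      isSplitable: Boolean, maxSplitBytes: Long): Seq[Split] =
    NewFileUtil.splitFiles(file, isSplitable, maxSplitBytes)
}
```

In a real multi-version build, only one of the two shim objects would be compiled for a given Spark version, typically by placing each in a version-specific source directory.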

Signed-off-by: MithunR <[email protected]>
@mythrocks mythrocks added the audit_4.0.0 Audit related tasks for 4.0.0 label May 21, 2024
@mythrocks mythrocks self-assigned this May 21, 2024
@mythrocks mythrocks requested a review from razajafri May 22, 2024 01:23
@mythrocks
Collaborator Author

Build

@mythrocks
Collaborator Author

Build

@mythrocks mythrocks changed the title Account for PartitionedFileUtil.splitFiles signature change. [Spark 4.0] Account for PartitionedFileUtil.splitFiles signature change. May 22, 2024
Collaborator

@razajafri razajafri left a comment

LGTM. Just update the copyright year on one file.

@mythrocks
Collaborator Author

mythrocks commented May 28, 2024

(Working on the style fixes.)

Edit: Fixed.

@mythrocks mythrocks changed the base branch from branch-24.06 to branch-24.08 May 28, 2024 19:32
@mythrocks
Collaborator Author

Build

@mythrocks mythrocks requested a review from razajafri May 28, 2024 19:41
@mythrocks
Collaborator Author

I've merged up to pull in #10933.

@mythrocks mythrocks requested a review from razajafri May 29, 2024 17:27
razajafri previously approved these changes May 29, 2024
@razajafri
Collaborator

build

@razajafri razajafri added Spark 4.0+ Spark 4.0+ issues and removed audit_4.0.0 Audit related tasks for 4.0.0 labels May 29, 2024
@razajafri
Collaborator

CI failed with Databricks build result : FAILURE

@razajafri
Collaborator

build

@mythrocks
Collaborator Author

Ah, this PR is also failing CI on the following tangential problem:

2024-05-29T18:16:38.0437840Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark330db/scala/org/apache/spark/rapids/execution/GpuSubqueryBroadcastMeta.scala:33: GpuSubqueryBroadcastMeta is already defined as class GpuSubqueryBroadcastMeta
2024-05-29T18:16:38.0438346Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [ERROR] class GpuSubqueryBroadcastMeta(
2024-05-29T18:16:38.0438690Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [ERROR]       ^
2024-05-29T18:16:38.0439076Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [ERROR] one error found
2024-05-29T18:16:38.0439680Z [2024-05-29T18:16:04.105Z] [2024-05-29T18:15:57.180Z] [INFO] ------------------------------------------------------------------------

#10945 should help.

@razajafri
Collaborator

build

@mythrocks
Collaborator Author

Looks like this will need special handling for Databricks:

/org/apache/spark/sql/execution/rapids/shims/SplitFiles.scala:59: value length is not a member of Nothing
2024-05-30T07:07:13.7388140Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]       }.sortBy(_.length)(implicitly[Ordering[Long]].reverse)
2024-05-30T07:07:13.7389160Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]                  ^
2024-05-30T07:07:13.7391095Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/execution/rapids/shims/SplitFiles.scala:82: value sortBy is not a member of Array[Nothing]
2024-05-30T07:07:13.7393196Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] possible cause: maybe a semicolon is missing before `value sortBy'?
2024-05-30T07:07:13.7394447Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]     }.sortBy(_.length)(implicitly[Ordering[Long]].reverse)
2024-05-30T07:07:13.7395420Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]       ^
2024-05-30T07:07:13.7397343Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/execution/rapids/shims/SplitFiles.scala:40: local method canBeSplit in method splitFiles is never used
2024-05-30T07:07:13.7399511Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]     def canBeSplit(filePath: Path, hadoopConf: Configuration): Boolean = {
2024-05-30T07:07:13.7400581Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]         ^
2024-05-30T07:07:13.7402484Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR] /home/ubuntu/spark-rapids/sql-plugin/src/main/spark341db/scala/org/apache/spark/sql/execution/rapids/shims/SplitFiles.scala:72: local val isSplitable in value $anonfun is never used
2024-05-30T07:07:13.7404528Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]         val isSplitable = relation.fileFormat.isSplitable(
2024-05-30T07:07:13.7405531Z [2024-05-30T07:06:29.793Z] [2024-05-30T07:06:20.064Z] [ERROR]             ^

I'll get on this shortly.

@mythrocks
Collaborator Author

> Looks like this will need special handling for Databricks...

Barked up the wrong tree for a bit. This was only a missing import. Testing the fix now.
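
For context, the line flagged in the log sorts the file splits by length, largest first. Here is a self-contained sketch of that idiom, assuming a hypothetical `FileSplit` stand-in type rather than the real Spark class. When the element type resolves correctly, `_.length` is a `Long` and the reversed `Ordering[Long]` applies. A plausible reading of the `Nothing` errors is that the element type could not be resolved without the import, but that is an inference from the log, not something the thread states.

```scala
// Hypothetical stand-in for a file split; not the real Spark class.
final case class FileSplit(path: String, start: Long, length: Long)

object SortSplits {
  // Sort by length in descending order, using the same sortBy/reverse idiom
  // that appears in the compile log.
  def largestFirst(splits: Seq[FileSplit]): Seq[FileSplit] =
    splits.sortBy(_.length)(implicitly[Ordering[Long]].reverse)
}
```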

@mythrocks
Collaborator Author

Build

@mythrocks mythrocks merged commit 822ad9b into NVIDIA:branch-24.08 May 31, 2024
44 checks passed
@mythrocks
Collaborator Author

This change has been merged. Thank you for the reviews, @razajafri, @NVnavkumar.

SurajAralihalli pushed a commit to SurajAralihalli/spark-rapids that referenced this pull request Jul 12, 2024
…ange. (NVIDIA#10857)

* Account for PartitionedFileUtil.splitFiles signature change.

* Common base for PartitionFileUtilsShims.

Signed-off-by: MithunR <[email protected]>

* Reusing existing PartitionedFileUtilsShims.

* More refactor, for pre-3.5 compile.

* Updated Copyright date.

* Fixed style error.

* Re-fixed the copyright year.

* Added missing import.

---------

Signed-off-by: MithunR <[email protected]>
Labels
Spark 4.0+ Spark 4.0+ issues
Development

Successfully merging this pull request may close these issues.

[AUDIT][SPARK-42821][SQL] Remove unused parameters in splitFiles methods
3 participants