Shim Layer to support multiple Spark versions #414
Merged: tgravescs merged 62 commits into NVIDIA:branch-0.2 from tgravescs:shimBranch0.2-spark301 on Jul 23, 2020
Conversation
Note: I'm in the process of running a final pass over the integration tests now.
revans2 approved these changes on Jul 23, 2020
jlowe approved these changes on Jul 23, 2020
Ran integration tests on 3.0.0 and 3.0.1 and all pass; 3.1.0 has a few failures that I expected and have an issue filed to investigate: #416
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request on Jun 9, 2021: "Shim Layer to support multiple Spark versions - adds Spark 3.0.0, 3.0.1, and 3.1.0" Signed-off-by: Thomas Graves <[email protected]>
pxLi pushed a commit to pxLi/spark-rapids that referenced this pull request on May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request on Nov 30, 2023. Signed-off-by: spark-rapids automation <[email protected]>
fixes #355
This adds a shim layer to support multiple Spark versions. This PR adds the framework and support for Apache Spark 3.0.0, Apache Spark 3.0.1, and Apache Spark 3.1.0. All the tests pass for Spark 3.0.0 and 3.0.1, but there are still a few failures for 3.1.0; we can finish resolving those after this is merged. The problem I keep running into is that both Spark 3.1 and the plugin itself keep changing, which causes conflicts and requires continuous retesting and fixing. So if we can get this part reviewed and in, I think it will be easier to finish the remaining work and allow multiple people to look at it.
Note that to shim the ShuffleManager, the user currently has to specify the version-specific Spark package.
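As a hypothetical illustration of what that means for the user, the shuffle manager class name would embed the Spark version, so it has to match the cluster's Spark release (the exact package and class names here are assumptions and may differ from the final 0.2 release):

```shell
# Hypothetical example: select the 3.0.1-specific RapidsShuffleManager
# explicitly, since Spark's spark.shuffle.manager config takes a concrete
# class name and cannot be resolved through the shim loader at that point.
spark-shell \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager
```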
A lot of the changes are around joins. HashJoin had changes that required a lot of code to be copied into the shim layer. There may be ways to improve this to share code, but I would like to look at that more in a follow-up; if you have ideas, let me know. If you diff those files there should be very few differences. The other issue there is that BuildSide, BuildRight, and BuildLeft all moved packages.
Spark 3.1.0 had API changes for other things as well: TimeSub, ScalaUDF, MapOutputTracker, ShuffleManager, FileSourceScan, First, and Last.
For the shim layer itself we use service loaders for each of the versions. There is a lightweight loader class that first determines whether it applies to the running Spark version, and then a buildShim function that loads the entire shim. This keeps us from loading a bunch of classes we don't really need.
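The two-phase pattern described above (cheap version match first, full shim construction second) can be sketched roughly as follows. This is a simplified illustration, not the actual plugin code: the trait and class names are assumptions, and the real implementation would discover providers via java.util.ServiceLoader and META-INF/services entries rather than a hard-wired list.

```scala
// Hypothetical provider interface: each version shim advertises which
// Spark version it supports and can build the full shim on demand.
trait SparkShimServiceProvider {
  def matchesVersion(version: String): Boolean
  def buildShim(): String // placeholder for the real shim object
}

// One lightweight provider per supported Spark version. Only the
// matching provider's buildShim() is ever called, so classes for the
// other versions are never loaded.
class Spark301ShimProvider extends SparkShimServiceProvider {
  def matchesVersion(version: String): Boolean = version.startsWith("3.0.1")
  def buildShim(): String = "Spark301Shims"
}

object ShimLoader {
  // The real plugin would populate this via ServiceLoader; we hard-wire
  // a single provider to keep the sketch self-contained.
  private val providers: Seq[SparkShimServiceProvider] = Seq(new Spark301ShimProvider)

  def getShim(sparkVersion: String): String =
    providers.find(_.matchesVersion(sparkVersion))
      .map(_.buildShim())
      .getOrElse(throw new IllegalStateException(s"No shim found for Spark $sparkVersion"))
}

object Demo extends App {
  println(ShimLoader.getShim("3.0.1")) // prints Spark301Shims
}
```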
I added profiles spark30tests and spark31tests that you can specify to run the tests against different versions. Once we have things working we can hook that up to CI, along with running the integration tests on both versions. slf4j in the tests conflicted with the cudf version, so for now I made the tests pull in the newer version. I want to update cudf to the same version as well.
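Using those profiles would look something like the following (the profile names come from the PR description; the exact invocation from the repository root is an assumption):

```shell
# Hypothetical usage: run the unit tests against a specific Spark version
# by activating the matching Maven profile.
mvn -Pspark30tests test   # test against Spark 3.0.x
mvn -Pspark31tests test   # test against Spark 3.1.0
```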
Note that I tested the core sql-plugin code by building against both Spark 3.0 and Spark 3.1 to ensure I didn't miss anything there. Tests were run against both versions: there is 1 unit test failure on 3.1, and 3.1 has some integration test failures (some FullOuter join tests, a log test, and the TimeSub tests).
I filed a few follow-up issues for the remaining items: commonizing the join code, changing the loaders to use priority, investigating whether we want the version match to be strict, and investigating a common base class for shuffle.