Shim Layer to support multiple Spark versions #414
Merged: tgravescs merged 62 commits into NVIDIA:branch-0.2 from tgravescs:shimBranch0.2-spark301 on Jul 23, 2020
Conversation
Note: I'm in the process of running a final pass over the integration tests now.
revans2 approved these changes on Jul 23, 2020
jlowe approved these changes on Jul 23, 2020
Ran integration tests on 3.0.0 and 3.0.1 and all pass; 3.1.0 has a few failures that I expected and have an issue filed to investigate: #416
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request on Jun 9, 2021: "Shim Layer to support multiple Spark versions - adds Spark 3.0.0, 3.0.1, and 3.1.0" Signed-off-by: Thomas Graves <[email protected]>
pxLi pushed a commit to pxLi/spark-rapids that referenced this pull request on May 12, 2022
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request on Nov 30, 2023. Signed-off-by: spark-rapids automation <[email protected]>
fixes #355
This adds a shim layer to support multiple Spark versions. This PR adds the framework and support for Apache Spark 3.0.0, Apache Spark 3.0.1, and Apache Spark 3.1.0. All the tests pass for Spark 3.0.0 and 3.0.1, but there are still a few failures for 3.1.0; we can finish resolving those after this is merged. The problem I keep running into is that both Spark 3.1 and the plugin itself keep changing, which causes conflicts and requires continuous retesting and fixing. So if we can get this part reviewed and in, I think it will be easier to finish the remaining work and allow multiple people to look at it.
Note that to shim the ShuffleManager, the user currently has to specify the version-specific Spark package.
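As a hypothetical illustration of what that means for the user, the shuffle manager class name would embed the Spark version, so it has to match the cluster's Spark release (the exact package and class names here are assumptions and may differ from the final 0.2 release):

```shell
# Hypothetical example: select the 3.0.1-specific RapidsShuffleManager
# explicitly, since Spark's spark.shuffle.manager config takes a concrete
# class name and cannot be resolved through the shim loader at that point.
spark-shell \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager
```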
A lot of the changes are around joins. HashJoin had changes that required a lot of code to be copied into the shim layer. There may be ways to improve this to share code, but I would like to look at that more in a follow-up; if you have ideas, let me know. If you diff those files there should be very few differences. The other issue there is that BuildSide, BuildRight, and BuildLeft all moved packages.
Spark 3.1.0 had API changes for other things as well: TimeSub, ScalaUDF, MapOutputTracker, ShuffleManager, FileSourceScan, First, and Last.
For the shim layer itself we use service loaders for each of the versions. There is a lightweight loader class that first determines whether it applies to the running Spark version, and then a buildShim function that loads the entire shim. This keeps us from loading a bunch of classes we don't really need.
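The two-phase pattern described above (cheap version match first, full shim construction second) can be sketched roughly as follows. This is a simplified illustration, not the actual plugin code: the trait and class names are assumptions, and the real implementation would discover providers via java.util.ServiceLoader and META-INF/services entries rather than a hard-wired list.

```scala
// Hypothetical provider interface: each version shim advertises which
// Spark version it supports and can build the full shim on demand.
trait SparkShimServiceProvider {
  def matchesVersion(version: String): Boolean
  def buildShim(): String // placeholder for the real shim object
}

// One lightweight provider per supported Spark version. Only the
// matching provider's buildShim() is ever called, so classes for the
// other versions are never loaded.
class Spark301ShimProvider extends SparkShimServiceProvider {
  def matchesVersion(version: String): Boolean = version.startsWith("3.0.1")
  def buildShim(): String = "Spark301Shims"
}

object ShimLoader {
  // The real plugin would populate this via ServiceLoader; we hard-wire
  // a single provider to keep the sketch self-contained.
  private val providers: Seq[SparkShimServiceProvider] = Seq(new Spark301ShimProvider)

  def getShim(sparkVersion: String): String =
    providers.find(_.matchesVersion(sparkVersion))
      .map(_.buildShim())
      .getOrElse(throw new IllegalStateException(s"No shim found for Spark $sparkVersion"))
}

object Demo extends App {
  println(ShimLoader.getShim("3.0.1")) // prints Spark301Shims
}
```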
I added profiles spark30tests and spark31tests that you can specify to run the tests against different versions. Once we have things working we can hook that up to CI, along with running the integration tests on both versions. slf4j in the tests conflicted with the cudf version, so for now I made the tests pull in the newer version. I want to update cudf to the same version as well.
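Using those profiles would look something like the following (the profile names come from the PR description; the exact invocation from the repository root is an assumption):

```shell
# Hypothetical usage: run the unit tests against a specific Spark version
# by activating the matching Maven profile.
mvn -Pspark30tests test   # test against Spark 3.0.x
mvn -Pspark31tests test   # test against Spark 3.1.0
```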
Note that I tested the core sql-plugin code by building against both Spark 3.0 and Spark 3.1 to ensure I didn't miss anything there. Tests were run against both versions: there is 1 unit test failure on 3.1, and 3.1 has some integration test failures (some FullOuter join tests, a log test, and the TimeSub tests).
I filed a few follow-up issues for the remaining items: commonizing the join code, changing the loaders to use priority, investigating whether we want the version match to be strict, and investigating a common base class for shuffle.