Shimplify: Simplify shim source code management #7487
Labels: build (Related to CI / CD or cleanly building)
gerashegalov added a commit to gerashegalov/spark-rapids that referenced this issue on Jan 27, 2023:
Contributes to NVIDIA#7487 Signed-off-by: Gera Shegalov <[email protected]>
gerashegalov added a commit that referenced this issue on Jan 28, 2023:
…ge in submodules (#7600) Contributes to #7487. Use a consistent scheme of properties for sql-plugin and tests modules for easier automation. Signed-off-by: Gera Shegalov <[email protected]>
gerashegalov added a commit that referenced this issue on Mar 6, 2023:

…7222) Closes #7487

Command executed:
```bash
mvn generate-sources -Dshimplify=true -Dshimplify.move=true
```

Verified with:

1) Build a multi-shim jar
```bash
mvn clean
echo 314 321 331 | xargs -t -n 1 bash -c 'mvn package -pl aggregator -am -Dbuildver=$1 -DskipTests -Dskip -Dmaven.jadadoc.skip' '{}'
mvn package -pl dist -Dincluded_buildvers=314,321,331 -Ddist.jar.compress=false
```

2) Run a smoke test for all shims
```bash
SPARK_HOME=~/gits/apache/spark NUM_LOCAL_EXECS=2 \
  PYSP_TEST_spark_rapids_shuffle_mode=MULTITHREADED \
  PYSP_TEST_spark_rapids_shuffle_multiThreaded_writer_threads=2 \
  PYSP_TEST_spark_rapids_shuffle_multiThreaded_reader_threads=2 \
  PYSP_TEST_spark_shuffle_manager=com.nvidia.spark.rapids.spark314.RapidsShuffleManager \
  PYSP_TEST_spark_rapids_memory_gpu_minAllocFraction=0 \
  PYSP_TEST_spark_rapids_memory_gpu_maxAllocFraction=0.1 \
  PYSP_TEST_spark_rapids_memory_gpu_allocFraction=0.1 \
  ./integration_tests/run_pyspark_from_build.sh -k test_hash_grpby_sum
```
```bash
SPARK_HOME=~/dist/spark-3.2.1-bin-hadoop3.2 NUM_LOCAL_EXECS=2 \
  PYSP_TEST_spark_rapids_shuffle_mode=MULTITHREADED \
  PYSP_TEST_spark_rapids_shuffle_multiThreaded_writer_threads=2 \
  PYSP_TEST_spark_rapids_shuffle_multiThreaded_reader_threads=2 \
  PYSP_TEST_spark_shuffle_manager=com.nvidia.spark.rapids.spark321.RapidsShuffleManager \
  PYSP_TEST_spark_rapids_memory_gpu_minAllocFraction=0 \
  PYSP_TEST_spark_rapids_memory_gpu_maxAllocFraction=0.1 \
  PYSP_TEST_spark_rapids_memory_gpu_allocFraction=0.1 \
  ./integration_tests/run_pyspark_from_build.sh -k test_hash_grpby_sum
```
```bash
SPARK_HOME=~/dist/spark-3.3.1-bin-hadoop3 NUM_LOCAL_EXECS=2 \
  PYSP_TEST_spark_rapids_shuffle_mode=MULTITHREADED \
  PYSP_TEST_spark_rapids_shuffle_multiThreaded_writer_threads=2 \
  PYSP_TEST_spark_rapids_shuffle_multiThreaded_reader_threads=2 \
  PYSP_TEST_spark_shuffle_manager=com.nvidia.spark.rapids.spark331.RapidsShuffleManager \
  PYSP_TEST_spark_rapids_memory_gpu_minAllocFraction=0 \
  PYSP_TEST_spark_rapids_memory_gpu_maxAllocFraction=0.1 \
  PYSP_TEST_spark_rapids_memory_gpu_allocFraction=0.1 \
  ./integration_tests/run_pyspark_from_build.sh -k test_hash_grpby_sum
```

Separate PR will be posted removing old sparkXYZ.sources definitions for the previous build. We'll merge it once we are confident in the transition robustness.

Signed-off-by: Gera Shegalov <[email protected]>
Problem Statement
We currently support 14 unique Spark builds:
```bash
$ mvn -q help:evaluate -pl dist \
    -Dexpression=included_buildvers \
    -PsnapshotsWithDatabricks -DforceStdout | wc -l
14
```
with the source code spread over 51 directories representing version ranges with numerous exceptions.
This is a proposal to eliminate hard-to-manage version-range directories.
Instead, each source code file typically resides in the earliest version directory it applies to. Every shim source file has a special comment in its header with an explicit list of all the shims the code applies to, in a format that is still TBD but resembles JSON Lines:
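A purely illustrative sketch of what such a header could look like at the top of a shim source file (the delimiter tokens and the `spark` field name here are assumptions, since the exact format is TBD):

```
/*** spark-rapids-shim-json-lines
{"spark": "321"}
{"spark": "330"}
{"spark": "331"}
spark-rapids-shim-json-lines ***/
```

Each JSON line names one buildver the file participates in.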
These headers are processed to generate symlinks under
`sql-plugin/target/spark${buildver}/src`.
We have the option to create copies instead of symlinks, but symlinks let us immediately see the changes as tracked by git.
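To make the mechanism concrete, here is a minimal bash sketch of the generation step, assuming the header format shown above and a run from the repository root; the real implementation would live in the Maven build rather than in a standalone script:

```bash
# Hypothetical sketch only: symlink every source file whose header lists
# this buildver into a per-shim source root under target/.
buildver=331
shim_root="sql-plugin/target/spark${buildver}/src"
git grep -l "\"spark\": \"${buildver}\"" -- sql-plugin/src | while read -r f; do
  dest="${shim_root}/${f#sql-plugin/src/}"
  mkdir -p "$(dirname "$dest")"
  ln -sf "$(git rev-parse --show-toplevel)/$f" "$dest"
done
```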
Motivation
First, it is currently hard to see which set of source directories constitutes a shim. This was easy with submodules before #3381.
With this approach we can easily query git without even building:
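For example, assuming the header format sketched above, listing all files that make up a single shim becomes a one-line query:

```bash
# Which files constitute the spark331 shim? No build required.
git grep -l '"spark": "331"' -- sql-plugin/src
```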
Second, IDEs support dynamically added source roots (via build-helper:add-source), though often capriciously. There are far too many to choose from in https://github.com/NVIDIA/spark-rapids/tree/branch-22.12/sql-plugin/src/main, and we lack intuitive-enough naming to fix this manually.
With this approach, we have simple per-shim source roots to fix manually in the IDE, available after running the build or after running a script based on git grep.
Third, vendor patchwork distributions are really hard to squeeze into the version range scheme. See #7152.
With this approach it is easy to decide where to place a physical shim file, and to update its header with whatever other shims it applies to, without re-engineering the build.
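For instance, admitting a hypothetical vendor build into an existing file would reduce to adding one line to its header (the buildver value below is made up for illustration):

```
/*** spark-rapids-shim-json-lines
{"spark": "330"}
{"spark": "331"}
{"spark": "330db"}
spark-rapids-shim-json-lines ***/
```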
Task List:
Signed-off-by: Gera Shegalov [email protected]