Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dynamic Spark configuration for Databricks #2116

Merged
merged 6 commits into from
Apr 15, 2021

Conversation

NvTimLiu
Copy link
Collaborator

@NvTimLiu NvTimLiu commented Apr 10, 2021

We need a way to set spark confs dynamically for the Databricks,
e.g., when we test cuDF sonatype release jars, we need to disable cudf-rapids version match by adding
"--conf spark.rapids.cudfVersionOverride=true", or enable/disable AQE, or anything else.

By adding the parameter -f " spark.foo=1,spark.bar=2 ......" for the script 'run-tests.py',
we can dynamically add whatever confs for the Databricks cluster.

Signed-off-by: Tim Liu [email protected]

We need a way to set spark confs dynamically for the Databricks,
e.g., when we test cuDF sonatype release jars, we need to disable cudf-rapids version match by adding
"--conf spark.rapids.cudfVersionOverride=true", or enable/disable AQE, or anything else.

By adding the parameter spark_conf="--conf spark.xxx.xxx=xxx --conf ......" for the script 'run-tests.py',
we can dynamically add whatever confs for the Databricks cluster.

Signed-off-by: Tim Liu <[email protected]>
@NvTimLiu NvTimLiu added the build Related to CI / CD or cleanly building label Apr 10, 2021
@NvTimLiu NvTimLiu self-assigned this Apr 10, 2021
@@ -35,7 +35,7 @@ def main():
print("rsync command: %s" % rsync_command)
subprocess.check_call(rsync_command, shell = True)

ssh_command = "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@%s -p 2200 -i %s %s %s 2>&1 | tee testout; if [ `echo ${PIPESTATUS[0]}` -ne 0 ]; then false; else true; fi" % (master_addr, params.private_key_file, params.script_dest, params.jar_path)
ssh_command = "ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@%s -p 2200 -i %s %s %s '%s' 2>&1 | tee testout; if [ `echo ${PIPESTATUS[0]}` -ne 0 ]; then false; else true; fi" % (master_addr, params.private_key_file, params.script_dest, params.jar_path, params.spark_conf)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here add the single quotas '%s', because we need to escape the backspaces to make the confs string --conf spark.xxx.xxx=xxx --conf spark.xxx.yyy=zzz ...... as one parameter

Otherwise the config string will be parsed as several parameters asperated by the backspaces

@@ -38,21 +39,32 @@ CUDF_UDF_TEST_ARGS="--conf spark.python.daemon.module=rapids.daemon_databricks \
--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.concurrentPythonWorkers=2"

## '--conf spark.xxx.xxx=xxx' to 'export PYSP_TEST_spark_xxx_xxx=xxx'
if [ -n "$SPARK_CONF" ]; then
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to translate --conf spark.xxx=xxx into export PYSP_TEST_spark_xxx=xxx, as we are calling the script run_pyspark_from_build.sh to parallelly run the integration tests.

The scripts run_pyspark_from_build.sh requires us to translate spark confs into export PYSP_TEST_spark_xxx=xxx to run IT with parallelism.

TEST_TYPE="nightly"
if [ -d "$LOCAL_JAR_PATH" ]; then
## Run tests with jars in the LOCAL_JAR_PATH dir downloading from the denpedency repo
LOCAL_JAR_PATH=$LOCAL_JAR_PATH bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" --test_type=$TEST_TYPE

## Run cudf-udf tests
CUDF_UDF_TEST_ARGS="$CUDF_UDF_TEST_ARGS --conf spark.executorEnv.PYTHONPATH=`ls $LOCAL_JAR_PATH/rapids-4-spark_*.jar | grep -v 'tests.jar'`"
LOCAL_JAR_PATH=$LOCAL_JAR_PATH SPARK_SUBMIT_FLAGS=$CUDF_UDF_TEST_ARGS TEST_PARALLEL=1 \
LOCAL_JAR_PATH=$LOCAL_JAR_PATH SPARK_SUBMIT_FLAGS="$SPARK_CONF $CUDF_UDF_TEST_ARGS" TEST_PARALLEL=1 \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

append the dynamic spark confs here

@NvTimLiu
Copy link
Collaborator Author

build

@NvTimLiu NvTimLiu requested a review from pxLi April 11, 2021 10:31
@jlowe jlowe changed the title Add daynamic spark confs for the Databricks Add dynamic Spark configuration for Databricks Apr 12, 2021
@@ -26,19 +26,21 @@
clusterid = ''
build_profiles = 'databricks,!snapshot-shims'
jar_path = ''
# `spark_conf` can take comma seperated mutiple spark configurations, e.g., spark.foo=1,spark.bar=2,...'
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment to make the opt '-f' string format clear

@NvTimLiu
Copy link
Collaborator Author

build

@@ -38,21 +39,35 @@ CUDF_UDF_TEST_ARGS="--conf spark.python.daemon.module=rapids.daemon_databricks \
--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.concurrentPythonWorkers=2"

## 'spark.foo=1,spar.bar=2,...' to 'export PYSP_TEST_spark_foo=1 export PYSP_TEST_spark_bar=2'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit s/spar.bar/spark.bar/

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed typo

TEST_TYPE="nightly"
if [ -d "$LOCAL_JAR_PATH" ]; then
## Run tests with jars in the LOCAL_JAR_PATH dir downloading from the denpedency repo
LOCAL_JAR_PATH=$LOCAL_JAR_PATH bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" --test_type=$TEST_TYPE

## Run cudf-udf tests
CUDF_UDF_TEST_ARGS="$CUDF_UDF_TEST_ARGS --conf spark.executorEnv.PYTHONPATH=`ls $LOCAL_JAR_PATH/rapids-4-spark_*.jar | grep -v 'tests.jar'`"
LOCAL_JAR_PATH=$LOCAL_JAR_PATH SPARK_SUBMIT_FLAGS=$CUDF_UDF_TEST_ARGS TEST_PARALLEL=1 \
LOCAL_JAR_PATH=$LOCAL_JAR_PATH SPARK_SUBMIT_FLAGS="$SPARK_ARGS $CUDF_UDF_TEST_ARGS" TEST_PARALLEL=1 \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is SPARK_ARGS, I'm not seeing it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it is SPARK_CONF, forgot to change it back

@NvTimLiu
Copy link
Collaborator Author

build

@NvTimLiu
Copy link
Collaborator Author

As the blossom is having some network issue, I'll trigger the pre-merge build after Blossom team fix the network issue.

@NvTimLiu
Copy link
Collaborator Author

build

@NvTimLiu
Copy link
Collaborator Author

Both nightly build and nightly test pipelines PASS

@NvTimLiu NvTimLiu merged commit 0509509 into NVIDIA:branch-0.5 Apr 15, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Add daynamic spark confs for the Databricks

We need a way to set spark confs dynamically for the Databricks,
e.g., when we test cuDF sonatype release jars, we need to disable cudf-rapids version match by adding
"--conf spark.rapids.cudfVersionOverride=true", or enable/disable AQE, or anything else.

By adding the parameter spark_conf="--conf spark.xxx.xxx=xxx --conf ......" for the script 'run-tests.py',
we can dynamically add whatever confs for the Databricks cluster.

Signed-off-by: Tim Liu <[email protected]>

* Comma separated list of spark configurations

Signed-off-by: Tim Liu <[email protected]>

* Add a comment to make the '-f' format clear

* Add a comment to make the '-f' format clear

* Fix typo

* Add '--conf' if the SPARK_CONF is not empty
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Add daynamic spark confs for the Databricks

We need a way to set spark confs dynamically for the Databricks,
e.g., when we test cuDF sonatype release jars, we need to disable cudf-rapids version match by adding
"--conf spark.rapids.cudfVersionOverride=true", or enable/disable AQE, or anything else.

By adding the parameter spark_conf="--conf spark.xxx.xxx=xxx --conf ......" for the script 'run-tests.py',
we can dynamically add whatever confs for the Databricks cluster.

Signed-off-by: Tim Liu <[email protected]>

* Comma separated list of spark configurations

Signed-off-by: Tim Liu <[email protected]>

* Add a comment to make the '-f' format clear

* Add a comment to make the '-f' format clear

* Fix typo

* Add '--conf' if the SPARK_CONF is not empty
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Related to CI / CD or cleanly building
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants