
Enable build for Databricks 13.3 [databricks] #9677

Merged: 34 commits merged into NVIDIA:branch-23.12 on Nov 23, 2023

Conversation

@razajafri (Collaborator) commented Nov 12, 2023

This PR builds on previous PRs to add Databricks 13.3 support to the Spark RAPIDS plugin; specifically, it adds the POM changes needed to build the plugin against Databricks 13.3.

Changes Made:

POM changes: All modules have been updated with a profile for 341db support.
XFAIL failing tests: Failing tests were marked with the pytest xfail marker; these markers should be removed once support for them is added.
PythonUDAF: Added support for PythonUDAF, similar to Spark 3.5 (see the sketch below).
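
For context, a grouped-aggregate pandas UDF is the kind of Python UDF that Spark models as PythonUDAF. The following is a minimal illustrative sketch, not code from this PR; it assumes a live SparkSession named spark:

# Minimal sketch of a grouped-aggregate Python UDF (illustrative; assumes an
# existing SparkSession named `spark`). A Series -> scalar pandas_udf is treated
# by Spark as a grouped-aggregate UDF, i.e. a PythonUDAF in the plan.
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return float(v.mean())

df = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v"])
df.groupBy("id").agg(mean_udf(df["v"]).alias("mean_v")).show()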

Tests:
All the tests were updated

This PR is in draft mode because it should be merged only after #9644 is merged.

@gerashegalov (Collaborator) left a comment

Follow the approach of #9508 to reduce bloat in the POMs.

Resolved review threads (outdated) on: datagen/pom.xml, integration_tests/pom.xml, shuffle-plugin/pom.xml, tests/pom.xml
@gerashegalov self-requested a review November 13, 2023 20:21
@jlowe (Contributor) commented Nov 16, 2023

build

@jlowe (Contributor) commented Nov 17, 2023

The latest failure is in the fastparquet compatibility test, which I could not reproduce on a Databricks 13.3 instance. Kicking the build again to see if it's reproducible.

@jlowe (Contributor) commented Nov 17, 2023

build

@jlowe (Contributor) commented Nov 17, 2023

I'm now able to reproduce the fastparquet failures, and it appears to be an issue with the fastparquet setup on Databricks 13.3: fastparquet is reading NaNs as nulls, whereas the GPU is reading NaNs as NaNs. It's not yet clear why we get different fastparquet behavior in the DB 13.3 environment, with an explicit install of fastparquet, than in the other Databricks environments.
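
An illustrative repro sketch of the behavior described above (this is not the project's actual fastparquet compatibility test; the file path is arbitrary):

# Write a Parquet file whose float column contains a real NaN value, then read
# it back with fastparquet and inspect what comes out.
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet

# Write with pyarrow so the NaN is stored as a concrete float value rather
# than a null slot at the Parquet definition level.
pq.write_table(
    pa.table({"f": pa.array([1.0, float("nan"), 3.0], type=pa.float32())}),
    "/tmp/nan_check.parquet")

# Read the same file back with fastparquet.
out = fastparquet.ParquetFile("/tmp/nan_check.parquet").to_pandas()

# A plain float column in pandas conflates NaN with null, which is the crux of
# the mismatch: a reader that turns NaN into null looks identical here, while
# the GPU reader preserves NaN as a value.
print(out["f"].tolist())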

@jlowe (Contributor) commented Nov 17, 2023

build

1 similar comment
@jlowe (Contributor) commented Nov 20, 2023

build

@jlowe (Contributor) commented Nov 20, 2023

build

1 similar comment
@sameerz (Collaborator) commented Nov 21, 2023

build

@pxLi (Collaborator) commented Nov 21, 2023

341db failed Delta Lake cases:

[2023-11-21T05:19:07.746Z] FAILED [ 28%]
[2023-11-21T05:19:46.387Z] ../../src/main/python/delta_lake_merge_test.py::test_delta_merge_not_match_insert_only[10-['a', 'b']-False-(range(0, 5), range(0, 5))][DATAGEN_SEED=1700542054, INJECT_OOM, IGNORE_ORDER, ALLOW_NON_GPU(DeserializeToObjectExec,ShuffleExchangeExec,FileSourceScanExec,FilterExec,MapPartitionsExec,MapElementsExec,ObjectHashAggregateExec,ProjectExec,SerializeFromObjectExec,SortExec)] 23/11/21 05:19:41 ERROR Utils: Aborting task
[2023-11-21T05:19:46.387Z] java.lang.OutOfMemoryError: GC overhead limit exceeded
[2023-11-21T05:19:46.387Z] 23/11/21 05:19:42 ERROR FileFormatWriter: Job job_202311210519103400711262770238775_3241 aborted.
[2023-11-21T05:19:46.387Z] 23/11/21 05:19:42 ERROR Executor: Exception in task 2.0 in stage 3241.0 (TID 11704)
[2023-11-21T05:19:46.387Z] org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/tmp/pyspark_tests/1121-014647-nfuszhj3-10-2-128-19-master-371556-540372822/DELTA_DATA/CPU.
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.sql.errors.QueryExecutionErrors$.taskFailedWhileWritingRowsError(QueryExecutionErrors.scala:968)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:551)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.sql.execution.datasources.WriteFilesExec.$anonfun$doExecuteWrite$1(WriteFiles.scala:116)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:931)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:931)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.$anonfun$computeOrReadCheckpoint$1(RDD.scala:407)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:404)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:371)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:82)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:82)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:196)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.Task.doRunTask(Task.scala:181)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$5(Task.scala:146)
[2023-11-21T05:19:46.387Z] 	at com.databricks.unity.EmptyHandle$.runWithAndClose(UCSHandle.scala:125)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:146)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.scheduler.Task.run(Task.scala:99)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$8(Executor.scala:897)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1682)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:900)
[2023-11-21T05:19:46.387Z] 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[2023-11-21T05:19:46.387Z] 	at com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
[2023-11-21T05:19:46.387Z] 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:795)
[2023-11-21T05:19:46.387Z] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[2023-11-21T05:19:46.387Z] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[2023-11-21T05:19:46.387Z] 	at java.lang.Thread.run(Thread.java:750)
[2023-11-21T05:19:46.387Z] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
[2023-11-21T05:19:52.904Z] 23/11/21 05:19:51 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 2.0 in stage 3241.0 (TID 11704),5,main]
[2023-11-21T05:19:52.905Z] org.apache.spark.SparkException: [TASK_WRITE_FAILED] Task failed while writing rows to file:/tmp/pyspark_tests/1121-014647-nfuszhj3-10-2-128-19-master-371556-540372822/DELTA_DATA/CPU.
[2023-11-21T05:19:52.905Z] 	... (stack trace identical to the TASK_WRITE_FAILED trace above)
[2023-11-21T05:19:52.905Z] Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded

@jlowe (Contributor) commented Nov 21, 2023

build

@pxLi (Collaborator) commented Nov 22, 2023

build

Comment on lines +110 to +112
pytest.param(FloatGen(nullable=False),
             marks=pytest.mark.xfail(is_databricks_runtime(),
                                     reason="https://github.com/NVIDIA/spark-rapids/issues/9778")),
I was thinking of including the following:

Suggested change:

pytest.param(FloatGen(nullable=False),
             marks=pytest.mark.xfail(is_databricks_runtime(),
                                     reason="https://github.com/NVIDIA/spark-rapids/issues/9778")),
FloatGen(nullable=False, no_nans=True),

This isn't strictly in the purview of this change; I can add it as a follow-on.
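
For context, a hedged sketch of how these params would typically feed a parametrized test (the test name is hypothetical; FloatGen, is_databricks_runtime, and idfn are assumed to come from the repo's integration-test helpers):

import pytest
# Assumed imports from the repo's integration-test helpers.
from data_gen import FloatGen, idfn
from spark_session import is_databricks_runtime

float_gens = [
    pytest.param(FloatGen(nullable=False),
                 marks=pytest.mark.xfail(is_databricks_runtime(),
                                         reason="https://github.com/NVIDIA/spark-rapids/issues/9778")),
    # The suggested addition: keeps coverage of non-nullable floats on
    # Databricks, since with no NaNs the xfail'd NaN discrepancy cannot occur.
    FloatGen(nullable=False, no_nans=True),
]

@pytest.mark.parametrize('data_gen', float_gens, ids=idfn)
def test_float_case(data_gen):  # hypothetical test name
    ...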

@mythrocks (Collaborator) left a comment

LGTM, barring that single (optional) suggestion. Thanks for disabling the float-double tests.

@jlowe (Contributor) commented Nov 22, 2023

build

1 similar comment
@jlowe (Contributor) commented Nov 22, 2023

build

@pxLi (Collaborator) commented Nov 23, 2023

Thanks! Also cc @NvTimLiu to help set up the nightly build later.

@pxLi merged commit d3629fd into NVIDIA:branch-23.12 Nov 23, 2023
36 checks passed
@jlowe (Contributor) commented Nov 23, 2023

Thanks for merging, @pxLi. I ran the build three times to make sure CI would not be flaky with the heap GC OOM or other problems, and it passed three times in a row, so we should be good to enable this for premerge and nightly.

@sameerz added the "task" label (Work required that improves the product but is not user facing) on Nov 26, 2023
@razajafri deleted the final-pr branch November 27, 2023 17:14