
[BUG] fastparquet test fails with DATAGEN_SEED=1700171382 on Databricks (Spark 3.4.1) #9767

Open
mythrocks opened this issue Nov 17, 2023 · 5 comments
Labels: bug (Something isn't working)

@mythrocks
Collaborator

From a Databricks premerge build:

[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Float(not_null)][DATAGEN_SEED=1700171382] - AssertionError: GPU and CPU are not both null at [76, 'a']

[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Double(not_null)][DATAGEN_SEED=1700171382] - AssertionError: GPU and CPU are not both null at [12, 'a']

[2023-11-17T00:19:56.484Z] Starting with datagen test seed: 1700171382. Set env variable SPARK_RAPIDS_TEST_DATAGEN_SEED to override.
[2023-11-17T00:19:56.484Z] Starting with OOM injection seed: 1700171382. Set env variable SPARK_RAPIDS_TEST_INJECT_OOM_SEED to override.
[2023-11-17T00:19:56.484Z] 2023-11-16 21:49:42 INFO     Executing global initialization tasks before test launches
[2023-11-17T00:19:56.484Z] 2023-11-16 21:49:42 INFO     Creating directory /home/ubuntu/spark-rapids/integration_tests/target/run_dir-20231116214942-eeD8/hive with permissions 0o777
[2023-11-17T00:19:56.484Z] 2023-11-16 21:49:42 INFO     Skipping findspark init because on xdist master
[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Struct(not_null)(('first', Integer(not_null)),('second', Float(not_null)))][DATAGEN_SEED=1700171382, INJECT_OOM] - AssertionError: GPU and CPU are not both null at [37, 'a.second']
[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_gpu[Float(not_null)][DATAGEN_SEED=1700171382] - AssertionError: GPU and CPU are not both null at [38, 'a']
[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_gpu[Double(not_null)][DATAGEN_SEED=1700171382, INJECT_OOM] - AssertionError: GPU and CPU are not both null at [57, 'a']
[2023-11-17T00:19:56.484Z] = 5 failed, 19923 passed, 1045 skipped, 624 xfailed, 302 xpassed, 414 warnings in 9013.84s (2:30:13) =

This seems similar to the other DATAGEN_SEED-related failures in the pytests.

@mythrocks added the "bug" and "? - Needs Triage" labels on Nov 17, 2023
@mythrocks self-assigned this on Nov 17, 2023
@jlowe
Contributor

jlowe commented Nov 20, 2023

Duplicated by #9778, which has some details on why this fails.

@mythrocks
Collaborator Author

Thank you, @jlowe. Looks like another xfail case. There are similar xfails in the tests already, pertaining to DataFrame conversions through Pandas.

@sameerz removed the "? - Needs Triage" label on Nov 21, 2023
@sameerz
Collaborator

sameerz commented Nov 21, 2023

The xfail is being handled in #9677. From issue #9778, Jason pointed out:

On Databricks 13.3, when a Pandas DataFrame is converted to a Spark DataFrame, values that Pandas considers null (represented as NaNs) are honored as nulls in the resulting Spark DataFrame: Pandas thinks there are nulls in the data, and those nulls propagate to Spark.

fastparquet loads the NaNs properly, but then when converting the data to pandas, pandas thinks the NaN values are null. This, in turn, causes spark.createDataFrame to produce corresponding nulls. When comparing this to the GPU direct load of the data that contains NaNs (not nulls), the test fails. The problem is not in the way the GPU loads the data, it's the way the NaNs get converted into nulls due to sending the data through pandas before converting to a Spark DataFrame.

Basically he is asking to fix the test case so NaNs do not become nulls when passing through Pandas.
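The core of the problem can be seen with Pandas alone. This is a minimal sketch (not the actual test code, and assuming only that Pandas is installed) showing that a float column cannot keep NaN and null apart:

```python
import math

import pandas as pd

# A float column containing a genuine NaN value (not a missing value).
s = pd.Series([1.0, float("nan"), 3.0], dtype="float64")

# Pandas cannot tell NaN apart from null: isna() flags the NaN as missing.
print(s.isna().tolist())      # [False, True, False]

# The stored value is still a NaN float, not a Python None...
print(math.isnan(s.iloc[1]))  # True

# ...but any consumer that maps "missing" to null (per the discussion above,
# spark.createDataFrame on Databricks 13.3 behaves this way) turns it into
# a null in the resulting Spark DataFrame.
```

So once the fastparquet output passes through a Pandas float column, the NaN-vs-null distinction is already lost before Spark ever sees the data.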

@mythrocks
Collaborator Author

The xfail is being handled in #9677.

Do we mean #9776? If so, no, it's not: #9776 is a different xfail case, for timestamps. The current issue is about floating-point failures.

so NaNs do not become nulls when passing through Pandas...

I cannot do that in the short term. The shortest path to constructing a Spark DataFrame from what fastparquet reads is to route it through Pandas, and Pandas does not distinguish between NaN and null values.

This will go onto the already long list of fastparquet/Pandas/Spark incompatibility xfail conditions. We can revisit routing around Pandas once the higher-priority tasks are sorted.

@mythrocks
Collaborator Author

Do we mean #9776? Then, no, it's not. #9776 is a different xfail case, for timestamps. This current issue was for floating-point failures.

Ah, I just spoke with @jlowe, and looked more closely at #9677. I understand now: @jlowe has that xfailed out already.
