
[BUG] fastparquet test fails with DATAGEN_SEED=1700171382 on Databricks (Spark 3.4.1) #9767

Open
mythrocks opened this issue Nov 17, 2023 · 5 comments
Labels: bug (Something isn't working)

@mythrocks
Collaborator

From a Databricks premerge build:

[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Float(not_null)][DATAGEN_SEED=1700171382] - AssertionError: GPU and CPU are not both null at [76, 'a']

[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Double(not_null)][DATAGEN_SEED=1700171382] - AssertionError: GPU and CPU are not both null at [12, 'a']

[2023-11-17T00:19:56.484Z] Starting with datagen test seed: 1700171382. Set env variable SPARK_RAPIDS_TEST_DATAGEN_SEED to override.
[2023-11-17T00:19:56.484Z] Starting with OOM injection seed: 1700171382. Set env variable SPARK_RAPIDS_TEST_INJECT_OOM_SEED to override.
[2023-11-17T00:19:56.484Z] 2023-11-16 21:49:42 INFO     Executing global initialization tasks before test launches
[2023-11-17T00:19:56.484Z] 2023-11-16 21:49:42 INFO     Creating directory /home/ubuntu/spark-rapids/integration_tests/target/run_dir-20231116214942-eeD8/hive with permissions 0o777
[2023-11-17T00:19:56.484Z] 2023-11-16 21:49:42 INFO     Skipping findspark init because on xdist master
[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_by_spark_cpu[Struct(not_null)(('first', Integer(not_null)),('second', Float(not_null)))][DATAGEN_SEED=1700171382, INJECT_OOM] - AssertionError: GPU and CPU are not both null at [37, 'a.second']
[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_gpu[Float(not_null)][DATAGEN_SEED=1700171382] - AssertionError: GPU and CPU are not both null at [38, 'a']
[2023-11-17T00:19:56.484Z] FAILED ../../src/main/python/fastparquet_compatibility_test.py::test_reading_file_written_with_gpu[Double(not_null)][DATAGEN_SEED=1700171382, INJECT_OOM] - AssertionError: GPU and CPU are not both null at [57, 'a']
[2023-11-17T00:19:56.484Z] = 5 failed, 19923 passed, 1045 skipped, 624 xfailed, 302 xpassed, 414 warnings in 9013.84s (2:30:13) =

This seems similar to the other DATAGEN_SEED-related failures in the pytests.

@mythrocks added the "bug" and "? - Needs Triage" labels on Nov 17, 2023
@mythrocks self-assigned this on Nov 17, 2023
@jlowe
Contributor

jlowe commented Nov 20, 2023

Duplicated by #9778, which has some details on why this fails.

@mythrocks
Collaborator Author

Thank you, @jlowe. Looks like another xfail case. There are similar xfails in the tests already, pertaining to DataFrame conversions through Pandas.

@sameerz removed the "? - Needs Triage" label on Nov 21, 2023
@sameerz
Collaborator

sameerz commented Nov 21, 2023

The xfail is being handled in #9677. From issue #9778, Jason pointed out:

On Databricks 13.3, when a Pandas DataFrame is converted to a Spark DataFrame, values that Pandas considers null (represented as NaNs) are honored as nulls in the resulting Spark DataFrame: Pandas thinks there are nulls in the data, and those nulls propagate to Spark.

fastparquet loads the NaNs properly, but then when converting the data to pandas, pandas thinks the NaN values are null. This, in turn, causes spark.createDataFrame to produce corresponding nulls. When comparing this to the GPU direct load of the data that contains NaNs (not nulls), the test fails. The problem is not in the way the GPU loads the data, it's the way the NaNs get converted into nulls due to sending the data through pandas before converting to a Spark DataFrame.

Basically he is asking to fix the test case so NaNs do not become nulls when passing through Pandas.
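The core of the problem can be seen with Pandas alone. This is a minimal sketch (not the actual test code, and assuming only that Pandas is installed) showing that a float column cannot keep NaN and null apart:

```python
import math

import pandas as pd

# A float column containing a genuine NaN value (not a missing value).
s = pd.Series([1.0, float("nan"), 3.0], dtype="float64")

# Pandas cannot tell NaN apart from null: isna() flags the NaN as missing.
print(s.isna().tolist())      # [False, True, False]

# The stored value is still a NaN float, not a Python None...
print(math.isnan(s.iloc[1]))  # True

# ...but any consumer that maps "missing" to null (per the discussion above,
# spark.createDataFrame on Databricks 13.3 behaves this way) turns it into
# a null in the resulting Spark DataFrame.
```

So once the fastparquet output passes through a Pandas float column, the NaN-vs-null distinction is already lost before Spark ever sees the data.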

@mythrocks
Collaborator Author

The xfail is being handled in #9677.

Do we mean #9776? If so, no, it's not: #9776 is a different xfail case, for timestamps. The current issue is about floating-point failures.

so NaNs do not become nulls when passing through Pandas...

I cannot do that in the short term. The shortest path to constructing a Spark DataFrame from what fastparquet reads is to route it through Pandas, and Pandas does not distinguish between NaN and null values.

This will go onto the already long list of fastparquet/Pandas/Spark incompatibility xfail conditions. We can revisit routing around Pandas once the higher-priority tasks are sorted.

@mythrocks
Collaborator Author

Do we mean #9776? Then, no, it's not. #9776 is a different xfail case, for timestamps. This current issue was for floating-point failures.

Ah, I just spoke with @jlowe, and looked more closely at #9677. I understand now: @jlowe has that xfailed out already.
