[BUG] fastparquet tests fail on Databricks 13.3 due to NaNs becoming nulls when converting from pandas #9778

Closed
jlowe opened this issue Nov 17, 2023 · 2 comments
Labels
bug Something isn't working duplicate This issue or pull request already exists test Only impacts tests

Comments

@jlowe
Contributor

jlowe commented Nov 17, 2023

On Databricks 13.3, converting a Pandas DataFrame to a Spark DataFrame honors nulls in the Pandas DataFrame (represented as NaNs) as nulls in the resulting Spark DataFrame. Pandas thinks there are nulls in the data, and those nulls propagate to the Spark DataFrame.

fastparquet loads the NaNs properly, but when the data is converted to pandas, pandas treats the NaN values as null. This, in turn, causes spark.createDataFrame to produce corresponding nulls. The test then fails when this result is compared against the GPU's direct load of the data, which contains NaNs (not nulls). The problem is not in the way the GPU loads the data; it is that the NaNs are converted into nulls when the data passes through pandas before becoming a Spark DataFrame.
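
To illustrate why the comparison fails once a NaN has turned into a null, here is a minimal sketch in plain Python. It is not the project's actual test harness: `values_match` and the sample columns are hypothetical, and the sketch only models the rule that NaN matches NaN while null matches only null.

```python
import math

def values_match(cpu_value, gpu_value):
    """Hypothetical comparison helper: NaN matches NaN, null (None) matches only null."""
    if cpu_value is None or gpu_value is None:
        return cpu_value is None and gpu_value is None
    if math.isnan(cpu_value) and math.isnan(gpu_value):
        return True
    return cpu_value == gpu_value

# Column as loaded directly on the GPU: the NaN stays a NaN.
gpu_column = [1.0, float("nan"), 3.0]
# Same column after the pandas round trip described above: the NaN became a null.
cpu_column = [1.0, None, 3.0]

# Collect the row indices where the two sides disagree.
mismatches = [
    i for i, (c, g) in enumerate(zip(cpu_column, gpu_column))
    if not values_match(c, g)
]
print(mismatches)
```

Under these assumptions only the row that held the NaN is reported as a mismatch, which matches the description above: the GPU load is correct, and the difference comes from the NaN-to-null conversion on the other side.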

@jlowe added the "bug", "? - Needs Triage", and "test" labels Nov 17, 2023
@sameerz added the "duplicate" label and removed the "? - Needs Triage" label Nov 21, 2023
@sameerz
Collaborator

sameerz commented Nov 21, 2023

Duplicate of #9776

@sameerz marked this as a duplicate of #9776 Nov 21, 2023
@sameerz closed this as completed Nov 21, 2023
@mythrocks reopened this Nov 21, 2023
@mythrocks
Collaborator

mythrocks commented Nov 21, 2023

> Duplicate of #9776

Sorry, no, it's not. #9776 is for timestamps. This failure is for floating point types.

This is a dupe of #9767, though. And it's been xfailed as part of #9677.
