Support timezone aware pandas inputs in cudf #15935
Conversation
    }
)
pdf.to_parquet(path)
# cudf.read_parquet does not support reading timezone aware types yet, so check dtypes
Ah, so now that `cudf.DataFrame.from_pandas` works for timezone-aware pandas dtypes, `dask_cudf.read_parquet(path).meta.dtypes` will no longer agree with the computed result. Am I understanding correctly?

This means that we will still need to modify the `dask_cudf.read_parquet` logic a bit to erase the timezone information until `cudf.read_parquet` also supports timezone-aware types. It's probably fine to make that change after this PR is merged.
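The timezone "erasure" being proposed could look something like the following pandas-only sketch. The helper name `strip_timezones` is hypothetical, not part of dask_cudf; it just illustrates converting tz-aware datetime columns to naive ones so a schema matches what a tz-unaware reader would return.

```python
import pandas as pd

# Hypothetical helper: drop the timezone from any tz-aware datetime
# columns, keeping the local wall-clock values.
def strip_timezones(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.DatetimeTZDtype):
            out[col] = out[col].dt.tz_localize(None)
    return out

df = pd.DataFrame(
    {"ts": pd.date_range("2024-01-01", periods=3, freq="h", tz="US/Pacific")}
)
naive = strip_timezones(df)
print(df["ts"].dtype)     # datetime64[ns, US/Pacific]
print(naive["ts"].dtype)  # datetime64[ns]
```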
I'm not really familiar with how dask_cudf's parquet reader works under the hood, but if by "computed result" you mean the result of using cudf's parquet reader, then yes. I've hopefully clarified the test to show that the `dask_cudf` result keeps the timezone in the data type while the `cudf` result drops it (IMO the `dask_cudf` result is "more correct").
> Not really familiar with how dask cudf's parquet reader works under the hood, but if by "computed result" ...
Sorry, there is no reason for the behavior of dask_cudf to be obvious to anyone. So, I should definitely clarify my statements a bit:
A `dask_cudf.DataFrame` collection is "lazy" in the sense that we will not actually read the parquet data when you call `dask_cudf.read_parquet`. However, in order to keep track of the current column names and dtypes, the collection will keep an empty pd/cudf `DataFrame` object in memory called `meta`. This `meta` object is currently populated in the `dask_cudf.read_parquet` call by reading the Parquet footer metadata with pyarrow, and then converting to cudf (via pandas).
The problem is that the `meta` object created by `dask_cudf.read_parquet` can be "wrong", because dask_cudf will simply use `cudf.read_parquet` to actually read in the data when `compute`/`persist` is called on the collection. In other words: the `dask_cudf` result is not actually more correct than the `cudf` result; its eager `meta` just happens to be more correct by accident :)
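The lazy-collection pattern described above can be sketched in plain Python, without dask or cudf. The `LazyFrame` class below is a toy stand-in (not the real dask_cudf API): `meta` promises a schema up front, the actual read is deferred, and nothing forces the deferred reader to honor the promise, which is exactly how the two can disagree.

```python
import pandas as pd

# Toy stand-in for a lazy dataframe collection. `meta` is an empty frame
# that records the expected column names/dtypes; the real read happens
# only when .compute() is called.
class LazyFrame:
    def __init__(self, meta: pd.DataFrame, read):
        self.meta = meta    # eager schema promise
        self._read = read   # deferred reader

    def compute(self) -> pd.DataFrame:
        return self._read()

# meta promises a tz-aware column...
meta = pd.DataFrame({"ts": pd.Series(dtype="datetime64[ns, UTC]")})

# ...but the reader (standing in for cudf.read_parquet) drops the tz.
def reader():
    return pd.DataFrame({"ts": pd.to_datetime(["2024-01-01"])})

lazy = LazyFrame(meta, reader)
print(lazy.meta["ts"].dtype)       # datetime64[ns, UTC]
print(lazy.compute()["ts"].dtype)  # datetime64[ns]
```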
Thanks for that explanation!
> The dask_cudf result is not actually more correct than the cudf result; its eager meta just happens to be more correct by accident

Ah yes, I think this conclusion is correct and is what you alluded to in your prior comment. In a follow-up to this PR, the timezone-aware type in `meta` probably needs to be converted to a timezone-naive type in `dask_cudf`.
I think this makes sense, thanks!
/merge
Description
closes #13611
(This technically does not support pandas objects that have interval types which are timezone aware.)
@rjzamora let me know if the test I adapted from your PR in #15929 is adequate
Checklist