Support timezone aware pandas inputs in cudf #15935
Conversation
    }
)
pdf.to_parquet(path)
# cudf.read_parquet does not support reading timezone aware types yet, so check dtypes
Ah, so now that `cudf.DataFrame.from_pandas` works for timezone-aware pandas dtypes, `dask_cudf.read_parquet(path).meta.dtypes` will no longer agree with the computed result. Am I understanding correctly?

This means that we will still need to modify the `dask_cudf.read_parquet` logic a bit to erase the timezone information until `cudf.read_parquet` also supports timezone-aware types. It's probably fine to make that change after this PR is merged.
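The timezone "erasure" being proposed could look something like the following pandas-only sketch. The helper name `strip_timezones` is hypothetical, not part of dask_cudf; it just illustrates converting tz-aware datetime columns to naive ones so a schema matches what a tz-unaware reader would return.

```python
import pandas as pd

# Hypothetical helper: drop the timezone from any tz-aware datetime
# columns, keeping the local wall-clock values.
def strip_timezones(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.DatetimeTZDtype):
            out[col] = out[col].dt.tz_localize(None)
    return out

df = pd.DataFrame(
    {"ts": pd.date_range("2024-01-01", periods=3, freq="h", tz="US/Pacific")}
)
naive = strip_timezones(df)
print(df["ts"].dtype)     # datetime64[ns, US/Pacific]
print(naive["ts"].dtype)  # datetime64[ns]
```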
I'm not really familiar with how dask_cudf's parquet reader works under the hood, but if by "computed result" you mean the result of using cudf's parquet reader, then yes. I've hopefully clarified the test to show that the `dask_cudf` result keeps the timezone in the data type while the `cudf` result drops it (IMO the `dask_cudf` result is "more correct").
> Not really familiar with how dask cudf's parquet reader works under the hood, but if by "computed result" ...
Sorry, there is no reason for the behavior of dask_cudf to be obvious to anyone. So, I should definitely clarify my statements a bit:
A `dask_cudf.DataFrame` collection is "lazy" in the sense that we will not actually read the parquet data when you call `dask_cudf.read_parquet`. However, in order to keep track of the current column names and dtypes, the collection will keep an empty pd/cudf `DataFrame` object in memory called `meta`. This `meta` object is currently populated in the `dask_cudf.read_parquet` call by reading the Parquet footer metadata with pyarrow, and then converting to cudf (via pandas).
The problem is that the `meta` object created by `dask_cudf.read_parquet` can be "wrong", because dask_cudf will simply use `cudf.read_parquet` to actually read in the data when `compute`/`persist` is called on the collection. In other words: the `dask_cudf` result is not actually more correct than the `cudf` result; its eager `meta` just happens to be more correct by accident :)
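The lazy-collection pattern described above can be sketched in plain Python, without dask or cudf. The `LazyFrame` class below is a toy stand-in (not the real dask_cudf API): `meta` promises a schema up front, the actual read is deferred, and nothing forces the deferred reader to honor the promise, which is exactly how the two can disagree.

```python
import pandas as pd

# Toy stand-in for a lazy dataframe collection. `meta` is an empty frame
# that records the expected column names/dtypes; the real read happens
# only when .compute() is called.
class LazyFrame:
    def __init__(self, meta: pd.DataFrame, read):
        self.meta = meta    # eager schema promise
        self._read = read   # deferred reader

    def compute(self) -> pd.DataFrame:
        return self._read()

# meta promises a tz-aware column...
meta = pd.DataFrame({"ts": pd.Series(dtype="datetime64[ns, UTC]")})

# ...but the reader (standing in for cudf.read_parquet) drops the tz.
def reader():
    return pd.DataFrame({"ts": pd.to_datetime(["2024-01-01"])})

lazy = LazyFrame(meta, reader)
print(lazy.meta["ts"].dtype)       # datetime64[ns, UTC]
print(lazy.compute()["ts"].dtype)  # datetime64[ns]
```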
Thanks for that explanation!
> The dask_cudf result is not actually more correct than the cudf result; its eager meta just happens to be more correct by accident

Ah yes, I think this conclusion is correct and is what you alluded to in your prior comment. In a follow-up to this PR, the timezone-aware type in `meta` probably needs to be converted to a timezone-naive type in `dask_cudf`.
I think this makes sense, thanks!
/merge
Description
closes #13611
(This technically does not support pandas objects that have interval types which are timezone aware.)
@rjzamora let me know if the test I adapted from your PR in #15929 is adequate
Checklist