[BUG] as_column of pandas timestamps delivers different resolution datetime depending on whether we pass a scalar or list #14627

Open
wence- opened this issue Dec 14, 2023 · 3 comments
Labels
bug Something isn't working

Comments

Contributor

wence- commented Dec 14, 2023

Describe the bug

import pandas as pd
from cudf.core.column import as_column

data = pd.Timestamp("2000-01-01")

from_scalar = as_column(data)
from_list = as_column([data])

assert from_scalar.dtype == from_list.dtype  # fails: the two dtypes have different resolutions

Expected behavior

The resolution should be inferred consistently. Note that cudf.Scalar(data) infers the same (nanosecond) resolution as as_column([data]).

wence- added the bug and Needs Triage labels Dec 14, 2023
Contributor Author

wence- commented Dec 14, 2023

This is because scalar values get handled through:

from_arrow(pa.array(pd.Series([data]), from_pandas=True))

whereas a list is handled by:

from_arrow(pa.array([data]))

pyarrow infers a different resolution for the timestamp than pandas does.

This appears to be a bug in pyarrow: it does not pick up the nanosecond resolution of pandas Timestamp objects and instead treats them like builtin datetime objects, which have microsecond resolution.

How do we want to treat this case?

Contributor

shwina commented Dec 14, 2023

Can we supply an explicit data type to the pa.array() call in the latter case?

Contributor Author

wence- commented Dec 14, 2023

That requires another round of introspection. I do not know the history of as_column. In the case that we don't hit an "easily handled" path (arrow/numpy/pandas/cudf/cupy), is there a reason we don't just always go via pandas.Series?

bdice removed the Needs Triage label Mar 4, 2024