[FEA] dt.date to extract date from a timestamp #7880

Open
bschifferer opened this issue Apr 7, 2021 · 12 comments
Labels
feature request New feature or request Python Affects Python cuDF API. wontfix This will not be worked on

Comments

@bschifferer

Is your feature request related to a problem? Please describe.
As a user, I want to be able to extract the date from a timestamp.
Pandas uses .dt.date, for example df_train['timestamp'].to_pandas().dt.date

The reason is that my timestamp contains year-month-day and hour-minute.
I want to use the DATE as the index and apply shift functions based on differences in days.
Therefore, I want to extract only the date and not include the hour or minute.

Describe the solution you'd like
Similar to pandas, support .dt.date
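
For illustration, here is a minimal pandas sketch of the use case described above. The column names, the sample values, and the per-day aggregation are assumptions made for the example, and the row-based shift only corresponds to "one day earlier" when every day is present in the data.

```python
import datetime

import pandas as pd

# Hypothetical sample data: timestamps with hour/minute resolution.
df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2021-04-07 09:15", "2021-04-07 17:40", "2021-04-08 10:43"]
        ),
        "value": [1.0, 2.0, 4.0],
    }
)

# pandas-only: .dt.date yields Python datetime.date objects (object dtype).
df["date"] = df["timestamp"].dt.date

# Roll the data up to one row per day, indexed by date.
daily = df.groupby("date")["value"].sum()

# Shift by one row, i.e. by one day when no day is missing.
previous_day = daily.shift(1)
```

In cuDF the `.dt.date` line is the piece this request asks for.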

@bschifferer bschifferer added Needs Triage Need team to review and classify feature request New feature or request labels Apr 7, 2021
@kkraus14 kkraus14 added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify libcudf Affects libcudf (C++/CUDA) code. labels Apr 8, 2021
@kkraus14
Collaborator

kkraus14 commented Apr 8, 2021

@bschifferer the .dt.date function returns a numpy array of datetime.date objects which we can't 100% replicate in cuDF. We could maybe have this return a datetime64[D] typed column if that would work for you?
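
To make the distinction concrete, the sketch below contrasts the two return types on the CPU, using pandas and NumPy only. It is an illustration of the dtypes involved, not of any cuDF API; the sample values are made up.

```python
import datetime

import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-04-08 10:43", "2021-04-09 23:15"]))

# pandas: an object-dtype result holding Python datetime.date instances.
dates = s.dt.date

# NumPy: a typed day-resolution array; the time of day is truncated.
days = s.to_numpy().astype("datetime64[D]")

print(dates.dtype, type(dates.iloc[0]))
print(days.dtype, days)
```

The object-dtype result stores arbitrary Python objects, which is what makes it hard to reproduce in a typed columnar layout; the `datetime64[D]` array is an ordinary fixed-width type.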

@brandon-b-miller any ideas on a workaround to this for the time being?

@bschifferer
Author

> @bschifferer the .dt.date function returns a numpy array of datetime.date objects which we can't 100% replicate in cuDF. We could maybe have this return a datetime64[D] typed column if that would work for you?

I think that is fine as long as the returned date (in datetime64[D]) contains only the year, month, and day and cuts off the hours and minutes. For example, if the timestamp represents "2021-04-08 10:43", the resulting datetime64[D] is "2021-04-08 00:00".

@brandon-b-miller
Contributor

There are a couple of approaches to making this happen, but none that I think are implemented today.

  1. We can currently use astype to truncate to a given unit; the problem is that we only half support [D]. I say half because libcudf seems to support it, so we'd have to wrap it and expose it through dt.day, among other places, presumably.
  2. This capability should be supported by DateOffset, which, if we want full pandas compatibility, should eventually be able to set the hour, minute, second, etc. components of the date to 0, but this is probably further off.
  3. I think floor could perhaps handle this once it is implemented.
  4. As a last resort, one could work around this by using the existing .dt.year and .dt.day functionality to create a bunch of string columns and concatenate them together, but that would be wildly inefficient.

My suspicion is that we need to implement this natively or leverage TIMESTAMP_DAYS and a cast. In any event, we should definitely do this, because rolling data up to the day level is a very common and important operation.
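
As a rough illustration of options 3 and 4 above, the pandas sketch below rebuilds day-level timestamps from string pieces and checks that the result matches a day-level floor. It runs on pandas rather than cuDF, the sample values are invented, and the use of .dt.month alongside .dt.year and .dt.day is an assumption added so that the strings form a complete date.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-04-08 10:43", "2021-12-31 23:59"]))

# Option 4 (string workaround): build "YYYY-MM-DD" text from the components.
parts = (
    s.dt.year.astype(str)
    + "-"
    + s.dt.month.astype(str).str.zfill(2)
    + "-"
    + s.dt.day.astype(str).str.zfill(2)
)
rebuilt = pd.to_datetime(parts)

# Option 3 (floor): truncate directly, with no string round trip.
floored = s.dt.floor("D")

print((rebuilt == floored).all())
```

Both routes give midnight timestamps for each day; the string route allocates and parses text per row, which is the inefficiency the comment refers to.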

@github-actions

github-actions bot commented May 8, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@charlesbluca
Member

Bumping this issue as the need for DatetimeProperties.date is coming up in dask-contrib/dask-sql#343

@shwina
Contributor

shwina commented Jan 20, 2022

@charlesbluca apologies for missing this request.

What's the expected return type here? Pandas returns an object column composed of Python datetime.date objects.

@charlesbluca
Member

For dask-sql's purposes, ideally any column that can later be cast to a datetime64[ns], so I think string columns should work here if Python datetime objects can't be done.

That being said, the usage in dask-sql is for a workaround to address the fact that cuDF and pandas (for timezoned datetime columns) don't support astype("datetime64[D]") - depending on the work required, it might be easier to add those features to unblock dask-contrib/dask-sql#343
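
A small pandas sketch of the string-column idea mentioned above: format the timestamps as date-only text, then cast that text back to datetime64[ns] later. This is only an illustration on the CPU with invented sample values; it does not show what dask-sql actually does.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-04-08 10:43", "2021-04-09 23:15"]))

# Date-only text representation of each timestamp.
as_text = s.dt.strftime("%Y-%m-%d")

# The text can later be parsed and cast back to datetime64[ns].
restored = pd.to_datetime(as_text).astype("datetime64[ns]")

print(as_text.tolist())
print(restored.dtype)
```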

@beckernick
Member

Your workaround may be the best approach to this for now. Perhaps pandas behavior here is something we simply don't want to enable, like iteration? cc @shwina

@shwina
Contributor

shwina commented Feb 4, 2022

Correct, although if all you care about is removing all the resolutions smaller than D (day), you could also use floor:

In [42]: s
Out[42]:
0   2001-01-01 12:12:12
1   2001-01-02 01:02:03
dtype: datetime64[ns]

In [43]: s.dt.floor("D")
Out[43]:
0   2001-01-01
1   2001-01-02
dtype: datetime64[ns]
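
Building on the session above, this pandas sketch shows that the floored values stay in a datetime64 column, so day-based arithmetic keeps working, and that they agree with what pandas' .dt.date reports. The third timestamp is an invented value added to produce a multi-day gap, and the code runs on pandas rather than cuDF.

```python
import pandas as pd

s = pd.Series(
    pd.to_datetime(
        ["2001-01-01 12:12:12", "2001-01-02 01:02:03", "2001-01-05 08:00:00"]
    )
)

# Drop everything below day resolution; the result is still a datetime column.
days = s.dt.floor("D")

# Differences between consecutive rows, expressed in whole days.
gap_in_days = days.diff().dt.days

# The floored timestamps carry the same calendar dates as .dt.date.
same_dates = (days.dt.date == s.dt.date).all()

print(gap_in_days.tolist(), same_dates)
```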

@shwina shwina added the wontfix This will not be worked on label Feb 4, 2022
@shwina
Contributor

shwina commented Feb 4, 2022

I'm going to close this as a wontfix since we can't return an array of datetime objects without copying to Pandas first. Hopefully, floor will provide a sufficient alternative for most use cases.

@MarcoGorelli
Contributor

> we can't return an array of datetime objects

Now that pandas returns a proper Series in the pyarrow-dtypes case:

>>> pd.Series([datetime(2020,1,1)]).convert_dtypes(dtype_backend='pyarrow').dt.date
0    2020-01-01
dtype: date32[day][pyarrow]

would you be open to reconsidering?

We're xfailing some tests in Narwhals for cuDF because of this, though of course our preference would be not to have to do so.

@vyasr
Contributor

vyasr commented Oct 31, 2024

I'd be open to reconsidering, but before this goes anywhere I think we need to figure out our story w.r.t. pandas and pyarrow dtypes going forward, especially in light of discussions like pandas-dev/pandas#57073. Aside from the strings-focused questions, cudf.pandas has raised a number of questions for me around the continued viability of supporting two different type systems in pandas (pyarrow-backed and not-pyarrow-backed) simultaneously. Given the differential level of support for these two in pandas and how tightly we couple ourselves to trying to reproduce pandas behaviors, I'm pretty bearish on trying to support both and would probably want to sort something out in that direction before trying to resolve specific issues like this one.

@vyasr vyasr reopened this Oct 31, 2024
@vyasr vyasr added this to cuDF Python Nov 5, 2024
Projects
Status: Todo
Development

No branches or pull requests

9 participants