Missing resample() implementation for grouped data frame #1134
Thanks for the report. I think resample isn't implemented for groupby in Dask. The error message could certainly be better; adding an actual resample implementation would also be fine.
@phofl, thanks for the quick reply. Is there any workaround to run arbitrary Pandas functions on groups, like resample()?
Do you want the whole group in a single partition? If yes, you can use groupby.apply / groupby.transform.
@phofl, it works, thanks:

```python
import pandas as pd
import dask.dataframe as dd

data = {
    'id': [1, 1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2023-01-01', '2023-01-04', '2023-01-05',
                            '2023-01-01', '2023-01-04', '2023-01-05']),
    'metric': [1, 1, 1, 1, 1, 1],
}
df = dd.from_pandas(
    pd.DataFrame(data).astype({
        'id': 'int64[pyarrow]',
        'metric': 'int64[pyarrow]',
        'date': 'timestamp[ns][pyarrow]',
    })
)
print(
    df
    .groupby(by=['id'])
    .apply(
        lambda x: x.resample("D", on="date").sum(),
        include_groups=False,
        meta={"metric": "int64[pyarrow]"},
    )
    .reset_index()
)
```
FYI, for those who come across this ticket: it was a bit unexpected for me that Dask keeps each whole group in a single partition, which means we can hit OOM if a group is too large, so we should keep that in mind while grouping. Otherwise, we need a different approach, like the workaround further down.
Apply and transform do this by design; there is no way around that, FWIW.
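As a rough sanity check before relying on groupby.apply, a small sketch (reusing the df from the example above) that estimates group sizes first:

```python
# Each whole group must fit into a single partition under groupby.apply,
# so estimate the largest group's row count before committing to that approach.
group_sizes = df.groupby('id').size().compute()
print(group_sizes.max())  # a very large count here signals OOM risk
```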
@phofl, I've spent a lot of time working around the missing resample() support for grouped data frames and ran into a few related issues along the way.
That's odd; thanks for digging these up. I'll try to take a look tomorrow.
@phofl, it might be related to a bug in pandas that I spotted during this investigation: pandas-dev/pandas#59823
Here is a workaround that only works in my case:

```python
import pandas as pd
import dask.dataframe as dd

data = {
    'id': [1, 1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2023-01-01', '2023-01-04', '2023-01-05',
                            '2023-01-01', '2023-01-04', '2023-01-05']),
    'metric': [1, 1, 1, 1, 1, 1],
}
df = dd.from_pandas(
    pd.DataFrame(data).astype({
        'id': 'int64[pyarrow]',
        'metric': 'int64[pyarrow]',
        'date': 'timestamp[ns][pyarrow]',
    })
)
print(
    df
    # Partition by id as a replacement for groupby
    .set_index('id')
    # Cast away the pyarrow timestamp to dodge a pandas bug:
    # https://github.com/pandas-dev/pandas/issues/59823
    .astype({'date': 'datetime64[ns]'})
    # Apply the required pandas function to each partition; set_index above
    # guarantees that each partition holds all rows for a given id
    .map_partitions(
        lambda x: x.groupby('id').resample("D", on="date").sum().reset_index(),
        meta={'id': 'int64[pyarrow]', 'date': 'timestamp[ns][pyarrow]',
              'metric': 'int64[pyarrow]'},
    )
    # Drop the unnecessary index
    .reset_index(drop=True)
    .compute()
)
```
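For context, a minimal sketch (reusing the toy df above) of why this partitioning trick holds: set_index sorts and repartitions by the index, so every row sharing an id value lands in exactly one partition.

```python
# After set_index('id') the divisions are sorted index boundaries,
# so all rows with the same id end up in the same partition.
indexed = df.set_index('id')
print(indexed.divisions)    # boundary values depend on the data and npartitions
print(indexed.npartitions)
```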
Describe the issue:
resample() is not implemented for grouped Dask DataFrames, and the error it raises does not explain this.

Minimal Complete Verifiable Example:
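A minimal sketch of the failing call, reconstructed from the discussion above (the exact data and error text are illustrative):

```python
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({
    'id': [1, 1, 2, 2],
    'date': pd.to_datetime(['2023-01-01', '2023-01-03',
                            '2023-01-01', '2023-01-03']),
    'metric': [1, 2, 3, 4],
})
df = dd.from_pandas(pdf, npartitions=2)

# pandas supports GroupBy.resample, but Dask's groupby does not,
# so this fails instead of returning a resampled result.
df.groupby('id').resample('D', on='date').sum()
```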
Environment:
dask-expr==1.1.10