Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: groupby then resample on column gives incorrect results if the index is out of order #59408

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

aram-cinnamon
Copy link
Contributor

@mroeschke mroeschke requested a review from rhshadrach August 5, 2024 18:07
@WillAyd
Copy link
Member

WillAyd commented Aug 9, 2024

This PR looks good to me. I don't think the CI failures are related so @aram-cinnamon if you merge in the main branch will likely fix those.

@rhshadrach @mroeschke any comments?

@rhshadrach
Copy link
Member

rhshadrach commented Aug 10, 2024

Removed

I think the example below "works" on main (i.e. doesn't raise - but it still gives incorrect results as in the issue), but would raise if we merge this PR.

df = pd.DataFrame(dict(
    datetime=[pd.to_datetime('2024-07-30T00:00Z'), pd.to_datetime('2024-07-30T00:01Z')],
    group=['A', 'A'],
    value=[100, 200],
), index=pd.Index([1, 0], name="datetime"))

result = df.groupby('group').resample('1min', on='datetime').aggregate(dict(value='sum'))
print(result)

@aram-cinnamon - can you confirm?

Ignore me - I missed the drop=True in the reset_index.

pandas/core/resample.py Outdated Show resolved Hide resolved
@aram-cinnamon aram-cinnamon force-pushed the groupby-then-resample-if-index-out-of-order branch from 5baf8eb to 3ff8145 Compare August 10, 2024 14:44
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes based on comment above.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking much better - a question below.

pandas/core/resample.py Outdated Show resolved Hide resolved
@aram-cinnamon aram-cinnamon force-pushed the groupby-then-resample-if-index-out-of-order branch from 70e71f7 to 55fe96b Compare August 17, 2024 22:39
@rhshadrach
Copy link
Member

aram-cinnamon force-pushed the groupby-then-resample-if-index-out-of-order branch from 70e71f7 to 55fe96b

@aram-cinnamon - not sure if you're aware, but once a review has been done, force-pushing means that reviewers need to review the entire PR again rather than being able to look at your subsequent changes.

Comment on lines +352 to +356
elif isinstance(obj.index, RangeIndex):
ax = self._grouper.take(obj.index)
else:
# GH 59350
ax = self._grouper
Copy link
Member

@rhshadrach rhshadrach Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason we need to do a .take in the RangeIndex case, but not otherwise?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, returning to this. I put in that condition to make this test pass: def test_groupby_resample_on_api_with_getitem(self): which was apparently added as part of #17813
I have not had a chance to look too deeply.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding index=list("xyzwt") to the DataFrame in that test makes the op fail.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the issue happens on pandas.core.resample:1559. We pass only the group x but the entire axis self.ax. I wonder if splitting the axis as we do data on L967 there and passing it to f would resolve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rhshadrach, coming back to this, sorry for the hiatus. Can you elaborate on the above please? (I think maybe the word "not" is missing somewhere; I'm not entirely sure what you mean.) What file are you referring to re: L967? Can you give me a link to a line?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did in my previous comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the issue happens on pandas.core.resample:1559.

I think in your previous comment you provided a link for this ^

I wonder if splitting the axis as we do data on L967 there

I ask again because the link you provided in your previous comment does not contain data. Are you sure you didn't mean a different line here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does it need to contain data? We pass the func from the highlighted line to groupby, which has a DataSplitter. This splits data. My comment is that I wonder if we should be splitting axis as we do this data on L967.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rhshadrach for your patience. I am new to making changes to groupby.

  • By L967 you mean resample.py#L1577? (What is the significance of the number 967 in L967?)
  • Are you saying that the code you highlighted causes data to be split downstream? Can you please elaborate on how the data is split? It would help greatly.
  • Can you explain in greater detail how we should split axis?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • By L967 you mean resample.py#L1577? (What is the significance of the number 967 in L967?)

I'm not sure anymore and I doubt it would be helpful.

  • Can you please elaborate on how the data is split?

groupby splits data into groups and operates on each group. x in the code above is one of these groups, not the entire input data, while self.ax is the entire input's axis.

  • Can you explain in greater detail how we should split axis?

I would suggest determining first if this is the right direction to go before spending time on figuring out how to do it. As I've stated, I'm not sure this is the root cause of the issue (or even an issue at all!).

pandas/tests/resample/test_resampler_grouper.py Outdated Show resolved Hide resolved
pandas/tests/resample/test_resampler_grouper.py Outdated Show resolved Hide resolved
pandas/tests/resample/test_resampler_grouper.py Outdated Show resolved Hide resolved
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Oct 30, 2024
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Nov 7, 2024
@aram-cinnamon
Copy link
Contributor Author

@mroeschke Hey sorry, I'm coming back to this after a break. Can you please reopen? I merged in the main branch, and I responded to a review comment.

@rhshadrach rhshadrach reopened this Nov 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: groupby then resample on column gives incorrect results if the index is out of order
4 participants