-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: groupby
then resample
on column gives incorrect results if the index is out of order
#59408
base: main
Are you sure you want to change the base?
BUG: groupby
then resample
on column gives incorrect results if the index is out of order
#59408
Conversation
This PR looks good to me. I don't think the CI failures are related so @aram-cinnamon if you merge in the main branch will likely fix those. @rhshadrach @mroeschke any comments? |
RemovedI think the example below "works" on main (i.e. doesn't raise - but it still gives incorrect results as in the issue), but would raise if we merge this PR. df = pd.DataFrame(dict(
datetime=[pd.to_datetime('2024-07-30T00:00Z'), pd.to_datetime('2024-07-30T00:01Z')],
group=['A', 'A'],
value=[100, 200],
), index=pd.Index([1, 0], name="datetime"))
result = df.groupby('group').resample('1min', on='datetime').aggregate(dict(value='sum'))
print(result) @aram-cinnamon - can you confirm? Ignore me - I missed the |
5baf8eb
to
3ff8145
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting changes based on comment above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking much better - a question below.
70e71f7
to
55fe96b
Compare
@aram-cinnamon - not sure if you're aware, but once a review has been done, force-pushing means that reviewers need to review the entire PR again rather than being able to look at your subsequent changes. |
elif isinstance(obj.index, RangeIndex): | ||
ax = self._grouper.take(obj.index) | ||
else: | ||
# GH 59350 | ||
ax = self._grouper |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason we need to do a .take
in the RangeIndex
case, but not otherwise?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, returning to this. I put in that condition to make this test pass: def test_groupby_resample_on_api_with_getitem(self):
which was apparently added as part of #17813
I have not had a chance to look too deeply.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding index=list("xyzwt")
to the DataFrame in that test makes the op fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the issue happens on pandas.core.resample:1559
. We pass only the group x
but the entire axis self.ax
. I wonder if splitting the axis as we do data
on L967 there and passing it to f
would resolve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @rhshadrach, coming back to this, sorry for the hiatus. Can you elaborate on the above please? (I think maybe the word "not" is missing somewhere; I'm not entirely sure what you mean.) What file are you referring to re: L967? Can you give me a link to a line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did in my previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the issue happens on
pandas.core.resample:1559
.
I think in your previous comment you provided a link for this ^
I wonder if splitting the axis as we do
data
on L967 there
I ask again because the link you provided in your previous comment does not contain data
. Are you sure you didn't mean a different line here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it need to contain data? We pass the func
from the highlighted line to groupby, which has a DataSplitter. This splits data
. My comment is that I wonder if we should be splitting axis
as we do this data
on L967.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rhshadrach for your patience. I am new to making changes to groupby.
- By L967 you mean resample.py#L1577? (What is the significance of the number 967 in L967?)
- Are you saying that the code you highlighted causes
data
to be split downstream? Can you please elaborate on how the data is split? It would help greatly. - Can you explain in greater detail how we should split
axis
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- By L967 you mean resample.py#L1577? (What is the significance of the number 967 in L967?)
I'm not sure anymore and I doubt it would be helpful.
- Can you please elaborate on how the data is split?
groupby splits data into groups and operates on each group. x
in the code above is one of these groups, not the entire input data, while self.ax
is the entire input's axis.
- Can you explain in greater detail how we should split
axis
?
I would suggest determining first if this is the right direction to go before spending time on figuring out how to do it. As I've stated, I'm not sure this is the root cause of the issue (or even an issue at all!).
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
@mroeschke Hey sorry, I'm coming back to this after a break. Can you please reopen? I merged in the main branch, and I responded to a review comment. |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.