BUG: groupby then resample on column gives incorrect results if the index is out of order #59408

Open. Wants to merge 14 commits into base: main. Changes from 10 commits.

1 change: 1 addition & 0 deletions doc/source/whatsnew/v3.0.0.rst
@@ -121,6 +121,7 @@ These improvements also fixed certain bugs in groupby:
- :meth:`.DataFrameGroupBy.agg` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`36698`)
- :meth:`.DataFrameGroupBy.groups` with ``sort=False`` would sort groups; they now occur in the order they are observed (:issue:`56966`)
- :meth:`.DataFrameGroupBy.nunique` would fail when there are multiple groupings, unobserved groups, and ``as_index=False`` (:issue:`52848`)
- :meth:`.DataFrameGroupBy.resample` with an ``on`` value that is not ``None`` would have incorrect values when the index is out of order (:issue:`59350`)
- :meth:`.DataFrameGroupBy.sum` would have incorrect values when there are multiple groupings, unobserved groups, and non-numeric data (:issue:`43891`)
- :meth:`.DataFrameGroupBy.value_counts` would produce incorrect results when used with some categorical and some non-categorical groupings and ``observed=False`` (:issue:`56016`)
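The `DataFrameGroupBy.resample` entry above can be sanity-checked with a small sketch. It reuses the data from the tests in this PR and computes a baseline on a copy whose index has been reset to a default `RangeIndex`, on the assumption that this case is the one that already behaved correctly. The variable names are illustrative only.

```python
import pandas as pd

# Same data as the tests in this PR: index labels are [1, 0], so labels
# no longer coincide with row positions.
df = pd.DataFrame(
    data={
        "datetime": [
            pd.to_datetime("2024-07-30T00:00Z"),
            pd.to_datetime("2024-07-30T00:01Z"),
        ],
        "group": ["A", "A"],
        "numbers": [100, 200],
    },
    index=[1, 0],
)

# Baseline: reset to a default RangeIndex before grouping, so that labels
# and positions agree (assumed here to be the unaffected case).
baseline = (
    df.reset_index(drop=True)
    .groupby("group")
    .resample("1min", on="datetime")
    .aggregate({"numbers": "sum"})
)
print(baseline)
```

With the fix in place, calling `df.groupby("group").resample("1min", on="datetime")` directly on the out-of-order frame is expected to give the same values as `baseline`.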

6 changes: 5 additions & 1 deletion pandas/core/groupby/grouper.py
@@ -34,6 +34,7 @@
 from pandas.core.indexes.api import (
     Index,
     MultiIndex,
+    RangeIndex,
     default_index,
 )
 from pandas.core.series import Series
@@ -348,8 +349,11 @@ def _set_grouper(
                     reverse_indexer = self._indexer.argsort()
                     unsorted_ax = self._grouper.take(reverse_indexer)
                     ax = unsorted_ax.take(obj.index)
-                else:
+                elif isinstance(obj.index, RangeIndex):
                     ax = self._grouper.take(obj.index)
+                else:
+                    # GH 59350
+                    ax = self._grouper
Comment on lines +352 to +356
Member

@rhshadrach, Sep 3, 2024

What is the reason we need to do a .take in the RangeIndex case, but not otherwise?

Contributor Author

Hi, returning to this. I added that condition to make this test pass: test_groupby_resample_on_api_with_getitem, which was apparently added as part of #17813.
I have not had a chance to look too deeply.

Member

Adding index=list("xyzwt") to the DataFrame in that test makes the op fail.
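A minimal sketch of one possible reading of the `.take` question above, assuming the relevant fact is that `Index.take` selects by position rather than by label. The `grouper` index below is an invented stand-in for `self._grouper`, not the real object, and the label sets are made up for illustration.

```python
import pandas as pd

# Invented stand-in for self._grouper (assumption, not the real object).
grouper = pd.Index(["t0", "t1", "t2"])

# Default RangeIndex: labels are 0, 1, 2, which are also the positions,
# so take() returns the values in their original order.
range_labels = pd.RangeIndex(3)
same = grouper.take(range_labels)

# Integer labels that are out of order: take() treats them as positions
# and silently reorders the values.
shuffled_labels = pd.Index([2, 0, 1])
reordered = grouper.take(shuffled_labels)

print(same.tolist())
print(reordered.tolist())
```

Under that assumption, passing `obj.index` to `take` is only harmless when the labels happen to equal the positions, which a default `RangeIndex` guarantees. This is an illustration of `take` semantics, not a claim about the root cause of GH 59350.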

Member

It looks like the issue happens on pandas.core.resample:1559. We pass only the group x but the entire axis self.ax. I wonder if splitting the axis as we do data on L967 there and passing it to f would resolve.

Contributor Author

Hi @rhshadrach, coming back to this, sorry for the hiatus. Can you elaborate on the above please? (I think maybe the word "not" is missing somewhere; I'm not entirely sure what you mean.) What file are you referring to re: L967? Can you give me a link to a line?

Member

I did in my previous comment.

Contributor Author

> It looks like the issue happens on pandas.core.resample:1559.

I think in your previous comment you provided a link for this ^

> I wonder if splitting the axis as we do data on L967 there

I ask again because the link you provided in your previous comment does not contain data. Are you sure you didn't mean a different line here?

Member

Why does it need to contain data? We pass the func from the highlighted line to groupby, which has a DataSplitter. This splits data. My comment is that I wonder if we should be splitting axis as we do this data on L967.

Contributor Author

Thanks @rhshadrach for your patience. I am new to making changes to groupby.

  • By L967 you mean resample.py#L1577? (What is the significance of the number 967 in L967?)
  • Are you saying that the code you highlighted causes data to be split downstream? Can you please elaborate on how the data is split? It would help greatly.
  • Can you explain in greater detail how we should split axis?

Member

> • By L967 you mean resample.py#L1577? (What is the significance of the number 967 in L967?)

I'm not sure anymore and I doubt it would be helpful.

> • Can you please elaborate on how the data is split?

groupby splits data into groups and operates on each group. x in the code above is one of these groups, not the entire input data, while self.ax is the entire input's axis.

> • Can you explain in greater detail how we should split axis?

I would suggest determining first if this is the right direction to go before spending time on figuring out how to do it. As I've stated, I'm not sure this is the root cause of the issue (or even an issue at all!).
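To make the distinction between a single group `x` and the whole axis concrete, here is a small pure-Python sketch. It does not use pandas internals: `split_by_group` is a hypothetical helper, and the lists are invented stand-ins for the input data and for `self.ax`. Whether splitting the axis is the right fix is left open, as the comment above says.

```python
from collections import defaultdict


def split_by_group(keys, values):
    """Hypothetical helper: split `values` into per-group lists by `keys`."""
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)
    return dict(groups)


keys = ["A", "B", "A", "B"]
data = [100, 200, 300, 400]
axis = ["00:00", "00:00", "00:01", "00:01"]  # invented stand-in for self.ax

data_groups = split_by_group(keys, data)
axis_groups = split_by_group(keys, axis)

for name, x in data_groups.items():
    # Each group x holds only its own rows; the unsplit axis covers every
    # row, while the split axis lines up with x one-to-one.
    print(name, len(x), len(axis), len(axis_groups[name]))
```

The point of the sketch is only that a per-group function receiving `x` together with the unsplit axis sees two objects of different lengths, whereas a split axis matches each group row for row.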

             else:
                 if key not in obj._info_axis:
                     raise KeyError(f"The grouper name {key} is not found")
243 changes: 243 additions & 0 deletions pandas/tests/resample/test_resampler_grouper.py
@@ -689,3 +689,246 @@ def test_groupby_resample_on_index_with_list_of_keys_missing_column():
    rs = gb.resample("2D")
    with pytest.raises(KeyError, match="Columns not found"):
        rs[["val_not_in_dataframe"]]


def test_groupby_resample_after_set_index_and_not_on_column():
    # GH 59350
    df = DataFrame(
        data={
            "datetime": [
                pd.to_datetime("2024-07-30T00:00Z"),
                pd.to_datetime("2024-07-30T00:01Z"),
            ],
            "group": ["A", "A"],
            "numbers": [100, 200],
        },
        index=[1, 0],
    ).set_index("datetime")
    gb = df.groupby("group")
    rs = gb.resample("1min")
    result = rs.aggregate({"numbers": "sum"})

    index = pd.MultiIndex.from_arrays(
        [
            ["A", "A"],
            [pd.to_datetime("2024-07-30T00:00Z"), pd.to_datetime("2024-07-30T00:01Z")],
        ],
        names=[
            "group",
            "datetime",
        ],
    )
    expected = DataFrame({"numbers": [100, 200]}, index=index)

    tm.assert_frame_equal(result, expected)


@pytest.mark.parametrize(
    "df",
    [
        DataFrame(
            data={
                "datetime": [
                    pd.to_datetime("2024-07-30T00:00Z"),
                    pd.to_datetime("2024-07-30T00:01Z"),
                ],
                "group": ["A", "A"],
                "numbers": [100, 200],
            },
            index=[1, 0],
        ),
        DataFrame(
            data={
                "datetime": [
                    pd.to_datetime("2024-07-30T00:00Z"),
                    pd.to_datetime("2024-07-30T00:01Z"),
                ],
                "group": ["A", "A"],
                "numbers": [100, 200],
            },
        ).set_index("group"),
        DataFrame(
            data={
                "datetime": [
                    pd.to_datetime("2024-07-30T00:00Z"),
                    pd.to_datetime("2024-07-30T00:01Z"),
                ],
                "group": ["A", "A"],
                "numbers": [100, 200],
            },
        ).set_index("datetime", drop=False),
    ],
)
def test_groupby_resample_on_column_when_index_is_unusual(df):
    # GH 59350
    gb = df.groupby("group")
    rs = gb.resample("1min", on="datetime")
    result = rs.aggregate({"numbers": "sum"})

    index = pd.MultiIndex.from_arrays(
        [
            ["A", "A"],
            [pd.to_datetime("2024-07-30T00:00Z"), pd.to_datetime("2024-07-30T00:01Z")],
        ],
        names=[
            "group",
            "datetime",
        ],
    )
    expected = DataFrame({"numbers": [100, 200]}, index=index)

    tm.assert_frame_equal(result, expected)


def test_groupby_resample_then_groupby_is_reused_when_index_is_out_of_order():
    # GH 59350
    df = DataFrame(
        data={
            "datetime": [
                pd.to_datetime("2024-07-30T00:00Z"),
                pd.to_datetime("2024-07-30T00:01Z"),
            ],
            "group": ["A", "A"],
            "numbers": [100, 200],
        },
        index=[1, 0],
    )

    gb = df.groupby("group")

    # use gb
    result_1 = gb[["numbers"]].transform("sum")

    index = Index([1, 0])
    expected = DataFrame({"numbers": [300, 300]}, index=index)

    tm.assert_frame_equal(result_1, expected)

    # resample gb, unrelated to above
    rs = gb.resample("1min", on="datetime")
    result_2 = rs.aggregate({"numbers": "sum"})

    index = pd.MultiIndex.from_arrays(
        [
            ["A", "A"],
            [pd.to_datetime("2024-07-30T00:00Z"), pd.to_datetime("2024-07-30T00:01Z")],
        ],
        names=[
            "group",
            "datetime",
        ],
    )
    expected = DataFrame({"numbers": [100, 200]}, index=index)

    tm.assert_frame_equal(result_2, expected)

    # reuse gb, unrelated to above
    result_3 = gb[["numbers"]].transform("sum")

    tm.assert_frame_equal(result_1, result_3)


def test_groupby_resample_then_groupby_is_reused_when_index_is_set_from_column():
    # GH 59350
    df = DataFrame(
        data={
            "datetime": [
                pd.to_datetime("2024-07-30T00:00Z"),
                pd.to_datetime("2024-07-30T00:01Z"),
            ],
            "group": ["A", "A"],
            "numbers": [100, 200],
        },
    ).set_index("group")

    gb = df.groupby("group")

    # use gb
    result_1 = gb[["numbers"]].transform("sum")

    index = Index(["A", "A"], name="group")
    expected = DataFrame({"numbers": [300, 300]}, index=index)

    tm.assert_frame_equal(result_1, expected)

    # resample gb, unrelated to above
    rs = gb.resample("1min", on="datetime")
    result_2 = rs.aggregate({"numbers": "sum"})

    index = pd.MultiIndex.from_arrays(
        [
            ["A", "A"],
            [pd.to_datetime("2024-07-30T00:00Z"), pd.to_datetime("2024-07-30T00:01Z")],
        ],
        names=[
            "group",
            "datetime",
        ],
    )
    expected = DataFrame({"numbers": [100, 200]}, index=index)

    tm.assert_frame_equal(result_2, expected)

    # reuse gb, unrelated to above
    result_3 = gb[["numbers"]].transform("sum")

    tm.assert_frame_equal(result_1, result_3)


def test_groupby_resample_then_groupby_is_reused_when_groupby_selection_is_not_none():
    # GH 59350
    df = DataFrame(
        data={
            "datetime": [
                pd.to_datetime("2024-07-30T00:00Z"),
                pd.to_datetime("2024-07-30T00:01Z"),
            ],
            "group": ["A", "A"],
            "numbers": [100, 200],
        },
        index=[1, 0],
    )

    gb = df.groupby("group")
    gb = gb[["numbers", "datetime"]]  # gb._selection is ["numbers", "datetime"]

    # use gb
    result_1 = gb.transform("max")

    index = Index([1, 0])
    expected = DataFrame(
        {
            "numbers": [200, 200],
            "datetime": [
                pd.to_datetime("2024-07-30T00:01Z"),
                pd.to_datetime("2024-07-30T00:01Z"),
            ],
        },
        index=index,
    )

    tm.assert_frame_equal(result_1, expected)

    # resample gb, unrelated to above
    rs = gb.resample("1min", on="datetime")
    result_2 = rs.aggregate({"numbers": "sum"})  # Enter the `except IndexError:` block

    index = pd.MultiIndex.from_arrays(
        [
            ["A", "A"],
            [pd.to_datetime("2024-07-30T00:00Z"), pd.to_datetime("2024-07-30T00:01Z")],
        ],
        names=[
            "group",
            "datetime",
        ],
    )
    columns = pd.MultiIndex.from_arrays([["numbers"], ["numbers"]])
    expected = DataFrame([[100], [200]], index=index, columns=columns)

    tm.assert_frame_equal(result_2, expected)

    # reuse gb, unrelated to above
    result_3 = gb.transform("max")

    tm.assert_frame_equal(result_1, result_3)