-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Fix not to reindex on non-Categorical groups (GH9049) #9177
Conversation
@@ -1883,6 +1883,7 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None, | |||
# pre-computed | |||
self._was_factor = False | |||
self._should_compress = True | |||
self._is_categorical = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, this basically duplicates _was_factor
. I need you to disambiguate when that is being set incorrectly.
According to 8d2c2a8, original |
maybe better to create a new flag maybe |
Ok, I'll try with this approach. |
Improved by using |
@@ -3414,6 +3414,19 @@ def test_groupby_categorical_unequal_len(self): | |||
# len(bins) != len(series) here | |||
self.assertRaises(ValueError,lambda : series.groupby(bins).mean()) | |||
|
|||
def test_groupby_multiindex_missing_pair(self): | |||
df = DataFrame({'group1': ['a','a','a','b'], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add an issue reference here as a comment
pls add a release note (bug fix section) referencing the original issue |
if self._was_factor: # pragma: no cover | ||
raise Exception('Should not call this method grouping by level') | ||
if self._grouping_type in ("level", "categorical"): # pragma: no cover | ||
raise Exception( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you know if this is tested anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, it's not tested anywhere.
I think it's better to move all making self._labels
and self._group_index
logic from __init__
to this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to refactor it when I have a free time, using another PR :)
Added a comment and release note, and squashed changes. |
@ledmonster pls rebase this when you can and see if you can address the comments above. |
Rebased and refactored
Any feedbacks are welcome, thank you. |
this looks pretty good - cleaned up s lot of older code can u run s perf test (the vbench suite) |
I tried BTW, I got some problems on running perf test, and made Pull #9332 for them. |
My Air Mac freezes by running |
Finally, I could run performance test on my AIr Mac. (I don't know what was wrong before .. 😓) It seems that this pull request doesn't break current performance. I'll try to write a performance test, which represents an effect of this pull request. BTW, some of tests ( detailsenvironment:
command:
vb_suite.log
Results are not so stable, probably because other applications are running on my Mac. For example, I ran
|
Add new performance test for this issue. A test result is like this. Performance improvement is not linear, so with larger DataFrame, performance ratio would be much smaller.
|
def groups(self): | ||
if self._groups is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this makes the grouper lazy. Why did you change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced "@Property" decorator with "@cache_readonly", so groups still be lazy, I think.
can you incorporate at test by #9344 as well (in that the example both by_levels and by_columns should be the same). I think its reindexing in that case as well (and shouldn't). |
Okay, I'll try to add tests for #9344 on this weekend. |
Added a test for #9344. Now the test passes. |
Rebased. |
@@ -3297,6 +3297,33 @@ def test_groupby_categorical(self): | |||
expected.index.names = ['myfactor', None] | |||
assert_frame_equal(desc_result, expected) | |||
|
|||
def test_groupby_datetime_categorical(self): | |||
levels = pd.date_range('2014-01-01', periods=4) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the issue number as a comment here
looks good. minor doc change. pls squash into 1-2 commits. and ping when ready. |
@@ -3297,6 +3297,34 @@ def test_groupby_categorical(self): | |||
expected.index.names = ['myfactor', None] | |||
assert_frame_equal(desc_result, expected) | |||
|
|||
def test_groupby_datetime_categorical(self): | |||
# GH9049: ensure backward compatibility |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is for ensuring backward compatibility, so added a comment like this.
@jreback thank you for reviewing. squashed to 2 commits (one is for bug fix, and the other is for refactoring), and rebased. |
BUG: Fix not to reindex on non-Categorical groups (GH9049)
thanks for this! |
My pleasure 😸 |
closes #9049.
closes #9344
_self.was_factor is not appropriate to judge whether grouper is Categorical or not, because it can be "True" when we groupby indices (not columns). So, I added another flag _self.is_categorical to judge Categorical state.
Also, I added a GroupBy test for MultiIndexed data, which was failed before this fix.