BUG: Fix not to reindex on non-Categorical groups (GH9049) #9177

Merged: 2 commits into pandas-dev:master, Feb 10, 2015

ledmonster

closes #9049.
closes #9344

self._was_factor is not a reliable way to tell whether the grouper is Categorical, because it can also be True when we group by the index (not columns). So I added another flag, self._is_categorical, to track the Categorical state.

Also, I added a GroupBy test for MultiIndexed data, which failed before this fix.
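A minimal sketch of the behavior this fix targets, assuming a recent pandas. The frame below is illustrative and its values are made up, not copied from the PR's test. Grouping by index levels should give the same result as grouping by the equivalent columns, and a level pair that never occurs in the data (here ("b", "d")) should not appear in the output.

```python
import pandas as pd

# Illustrative data: the pair ("b", "d") never occurs.
df = pd.DataFrame({
    "group1": ["a", "a", "a", "b"],
    "group2": ["c", "c", "d", "c"],
    "value": [1, 1, 1, 5],
}).set_index(["group1", "group2"])

# Group by the MultiIndex levels ...
by_levels = df.groupby(level=["group1", "group2"]).sum()
# ... and by the same keys as ordinary columns.
by_columns = df.reset_index().groupby(["group1", "group2"]).sum()

# Both paths should agree, with no row for the unobserved pair.
pd.testing.assert_frame_equal(by_levels, by_columns)
assert ("b", "d") not in by_levels.index
```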

@@ -1883,6 +1883,7 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
# pre-computed
self._was_factor = False
self._should_compress = True
self._is_categorical = False
Contributor

hmm, this basically duplicates _was_factor. I need you to disambiguate when that is being set incorrectly.

@jreback jreback added Performance Memory or execution speed performance Numeric Operations Arithmetic, Comparison, and Logical operations labels Jan 2, 2015
@jreback jreback added this to the 0.16.0 milestone Jan 2, 2015
@ledmonster
Author

According to 8d2c2a8, the original _was_factor seems to mean "level-based grouping" rather than "Categorical". ea1186d probably confused the meaning of _was_factor. How about renaming _was_factor to _grouping_by_level and using both it and _is_categorical?

@jreback
Contributor

jreback commented Jan 4, 2015

maybe better to create a new flag, maybe
_grouping_type, defaulting to None then set to a string (level or categorical)
it's internal so ok to change this

@ledmonster
Author

Ok, I'll try with this approach.

@ledmonster
Author

Improved by using _grouping_type instead of two boolean flags.
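The idea of one string-valued flag in place of two booleans can be sketched roughly as follows. The class and the needs_reindex helper are hypothetical stand-ins for illustration, not pandas' Grouping class; only the flag name and its values ("level", "categorical", default None) come from the discussion above.

```python
class GroupingSketch:
    """Hypothetical stand-in for illustration; not pandas' Grouping class."""

    def __init__(self, by_level=False, categorical=False):
        # One flag with three states instead of two booleans:
        # None (plain grouping), "level", or "categorical".
        self._grouping_type = None
        if categorical:
            self._grouping_type = "categorical"
        elif by_level:
            self._grouping_type = "level"

    def needs_reindex(self):
        # Per the PR title, only Categorical groupers should be reindexed.
        return self._grouping_type == "categorical"


plain = GroupingSketch()
level = GroupingSketch(by_level=True)
cat = GroupingSketch(categorical=True)
print([g.needs_reindex() for g in (plain, level, cat)])  # [False, False, True]
```

With a single flag, the two states cannot both be set by accident, which is the ambiguity the two-boolean version invited.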

@@ -3414,6 +3414,19 @@ def test_groupby_categorical_unequal_len(self):
# len(bins) != len(series) here
self.assertRaises(ValueError,lambda : series.groupby(bins).mean())

def test_groupby_multiindex_missing_pair(self):
df = DataFrame({'group1': ['a','a','a','b'],
Contributor

add an issue reference here as a comment

@jreback
Contributor

jreback commented Jan 10, 2015

pls add a release note (bug fix section) referencing the original issue

if self._was_factor: # pragma: no cover
raise Exception('Should not call this method grouping by level')
if self._grouping_type in ("level", "categorical"): # pragma: no cover
raise Exception(
Contributor

do you know if this is tested anywhere?

Author

No, it's not tested anywhere.

I think it's better to move all the logic that builds self._labels and self._group_index from __init__ into this method.

Author

I'll try to refactor it in a separate PR when I have some free time :)

@ledmonster
Author

Added a comment and release note, and squashed changes.

@jreback
Contributor

jreback commented Jan 18, 2015

@ledmonster pls rebase this when you can and see if you can address the comments above.

@ledmonster
Author

Rebased and refactored the Grouping class for Categorical groupers.

  • in _make_labels, check _labels and _group_index instead of using the _grouping_type flag.
  • stop converting a Categorical self.grouper to an ndarray. I checked some of the logic that uses self.grouper, and it seems to work fine without converting a Categorical to an ndarray (after modifying L1948 a little).
  • use cache_readonly instead of property for the groups property.
  • add tests for a datetime Categorical grouper.

Any feedback is welcome, thank you.

@jreback
Contributor

jreback commented Jan 22, 2015

this looks pretty good - cleaned up a lot of older code

can you run a perf test (the vbench suite)
and report if anything anomalous comes up

@ledmonster
Author

I tried ./test_perf.sh -b master -t HEAD, but it used too many machine resources. So I'll try again tonight while I'm in bed, with the -r groupby option.

BTW, I ran into some problems running the perf test and opened pull request #9332 for them.

@ledmonster
Author

My MacBook Air freezes when running ./test_perf.sh -b master -t HEAD -r groupby, so I'll try it on an EC2 instance later.

@ledmonster
Author

@jreback

I was finally able to run the performance test on my MacBook Air. (I don't know what was wrong before .. 😓)

It seems that this pull request doesn't regress current performance. I'll try to write a performance test that demonstrates the effect of this pull request.

BTW, some of the tests (groupby_nth_float64_any and groupby_nth_float32_any) failed with AssertionError, even on the master branch.

details

environment:

  • python 2.7.8
  • numpy 1.8.1

command:

./test_perf.sh -b master -t HEAD -r groupby

vb_suite.log

Invoked with :
--ncalls: 3
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_ngroups_100_cumprod                  |  20.3053 |  28.8743 |   0.7032 |
groupby_ngroups_100_last                     |   0.6303 |   0.8399 |   0.7504 |
groupby_ngroups_100_min                      |   0.6464 |   0.7593 |   0.8513 |
groupby_ngroups_100_any                      |  14.2833 |  16.3154 |   0.8755 |
groupby_ngroups_100_head                     |   0.9576 |   1.0840 |   0.8834 |
groupby_int_count                            |   5.5679 |   6.1113 |   0.9111 |
groupby_ngroups_10000_var                    |   2.8907 |   3.1310 |   0.9232 |
groupby_last_float64                         |   4.9033 |   5.3020 |   0.9248 |
groupby_multi_size                           |  30.7086 |  33.1620 |   0.9260 |
groupby_ngroups_10000_last                   |   3.0739 |   3.2957 |   0.9327 |
groupby_ngroups_10000_mad                    | 6308.1783 | 6728.5376 |   0.9375 |
groupby_frame_singlekey_integer              |   3.1730 |   3.3813 |   0.9384 |
groupby_pivot_table                          |  26.1277 |  27.7747 |   0.9407 |
groupby_multi_different_functions            |  15.0627 |  15.9260 |   0.9458 |
groupby_ngroups_10000_value_counts           | 6888.7103 | 7216.8677 |   0.9545 |
groupby_ngroups_100_var                      |   0.4910 |   0.5110 |   0.9608 |
groupby_ngroups_100_mad                      |  64.2216 |  66.7377 |   0.9623 |
groupby_ngroups_10000_min                    |   2.9553 |   3.0704 |   0.9625 |
groupby_ngroups_100_all                      |  14.5303 |  14.9563 |   0.9715 |
groupby_ngroups_100_first                    |   0.6137 |   0.6301 |   0.9740 |
groupby_indices                              |  10.8583 |  11.1383 |   0.9749 |
groupby_first_float32                        |   4.5627 |   4.6653 |   0.9780 |
groupby_dt_size                              |  30.5503 |  31.2183 |   0.9786 |
groupby_ngroups_10000_any                    | 1387.1396 | 1417.0703 |   0.9789 |
groupby_series_nth_any                       |   5.7160 |   5.8363 |   0.9794 |
groupby_ngroups_10000_skew                   | 2316.1206 | 2363.0474 |   0.9801 |
groupby_ngroups_10000_cumsum                 | 1902.1820 | 1931.6096 |   0.9848 |
groupby_ngroups_100_std                      |   0.5527 |   0.5607 |   0.9857 |
groupby_ngroups_100_value_counts             |  70.4860 |  71.4906 |   0.9859 |
groupby_transform                            | 211.0896 | 214.0510 |   0.9862 |
groupby_series_simple_cython                 | 283.6780 | 287.3253 |   0.9873 |
groupby_ngroups_100_size                     |   0.6260 |   0.6333 |   0.9885 |
groupby_transform_ufunc                      | 170.6970 | 172.5584 |   0.9892 |
groupby_ngroups_10000_size                   |   5.5243 |   5.5800 |   0.9900 |
groupby_sum_booleans                         |   1.5647 |   1.5793 |   0.9907 |
groupby_multi_different_numpy_functions      |  16.0370 |  16.1840 |   0.9909 |
groupby_first_float64                        |   4.3423 |   4.3807 |   0.9912 |
groupby_ngroups_100_median                   |   0.5113 |   0.5156 |   0.9917 |
groupby_ngroups_10000_cummin                 | 1904.2930 | 1919.6667 |   0.9920 |
groupby_ngroups_100_cumsum                   |  20.0423 |  20.1993 |   0.9922 |
groupby_ngroups_10000_std                    |   2.9030 |   2.9247 |   0.9926 |
groupby_ngroups_100_mean                     |   0.4861 |   0.4890 |   0.9940 |
groupby_transform_series                     |  27.8977 |  28.0623 |   0.9941 |
groupby_ngroups_100_max                      |   0.6270 |   0.6304 |   0.9946 |
groupby_ngroups_100_describe                 | 226.6529 | 227.5487 |   0.9961 |
groupby_ngroups_10000_rank                   | 1884.1354 | 1885.8920 |   0.9991 |
groupby_ngroups_10000_max                    |   3.1803 |   3.1820 |   0.9995 |
groupby_dt_timegrouper_size                  |  26.5330 |  26.5294 |   1.0001 |
groupby_frame_cython_many_columns            |   4.1690 |   4.1670 |   1.0005 |
groupby_ngroups_10000_cumprod                | 1911.1340 | 1909.4703 |   1.0009 |
groupby_ngroups_10000_cummax                 | 1900.2120 | 1897.1416 |   1.0016 |
groupby_ngroups_10000_nunique                | 1431.4320 | 1427.3647 |   1.0028 |
groupby_nth_float64_none                     |  95.1933 |  94.8970 |   1.0031 |
groupby_ngroups_100_sem                      |   1.1113 |   1.1070 |   1.0039 |
groupby_first_object                         |  21.0333 |  20.9324 |   1.0048 |
groupby_multi_cython                         |  21.1920 |  21.0533 |   1.0066 |
groupby_series_nth_none                      |   1.7360 |   1.7220 |   1.0081 |
groupby_nth_object_none                      | 716.1887 | 709.5633 |   1.0093 |
groupby_ngroups_10000_cumcount               |  91.9007 |  90.9967 |   1.0099 |
groupby_agg_builtins2                        |  60.5311 |  59.9030 |   1.0105 |
groupby_ngroups_10000_pct_change             | 5491.0356 | 5427.6030 |   1.0117 |
groupby_ngroups_100_skew                     |  25.4494 |  25.0760 |   1.0149 |
groupby_ngroups_10000_prod                   |   3.0660 |   3.0204 |   1.0151 |
groupby_ngroups_10000_count                  |   2.9557 |   2.9027 |   1.0183 |
groupby_transform_multi_key3                 | 1221.3247 | 1198.5786 |   1.0190 |
groupby_ngroups_100_cumcount                 |   0.8957 |   0.8787 |   1.0194 |
groupby_transform_multi_key1                 | 114.1393 | 111.8406 |   1.0206 |
groupby_first_datetimes                      |  15.2536 |  14.9384 |   1.0211 |
groupby_frame_nth_none                       |   3.0247 |   2.9610 |   1.0215 |
groupby_ngroups_10000_all                    | 1471.4240 | 1439.2070 |   1.0224 |
groupby_ngroups_10000_tail                   | 105.2350 | 102.9210 |   1.0225 |
groupby_ngroups_10000_sum                    |   3.0403 |   2.9710 |   1.0233 |
groupby_int64_overflow                       | 482.9050 | 471.1819 |   1.0249 |
groupby_ngroups_10000_unique                 | 993.9707 | 969.5260 |   1.0252 |
groupby_frame_apply_overhead                 |  12.8290 |  12.5104 |   1.0255 |
groupby_apply_dict_return                    |  58.3300 |  56.8620 |   1.0258 |
groupby_ngroups_10000_diff                   | 1913.2260 | 1864.8180 |   1.0260 |
groupby_frame_apply                          |  58.9256 |  57.4137 |   1.0263 |
groupby_transform_multi_key4                 | 212.1150 | 206.3340 |   1.0280 |
groupby_ngroups_10000_describe               | 23349.1937 | 22703.3477 |   1.0284 |
groupby_nth_datetimes_any                    | 1401.2586 | 1361.6330 |   1.0291 |
groupby_ngroups_100_diff                     |  20.6420 |  20.0340 |   1.0303 |
groupby_last_datetimes                       |  15.2263 |  14.7664 |   1.0311 |
groupby_ngroups_100_unique                   |  10.5879 |  10.2584 |   1.0321 |
groupby_ngroups_100_tail                     |   1.1206 |   1.0823 |   1.0354 |
groupby_ngroups_10000_head                   | 101.8883 |  98.2103 |   1.0375 |
groupby_nth_datetimes_none                   | 690.9646 | 665.8187 |   1.0378 |
groupby_ngroups_100_rank                     |  21.0533 |  20.2037 |   1.0421 |
groupby_ngroups_100_pct_change               |  56.0367 |  53.7550 |   1.0424 |
groupby_ngroups_100_sum                      |   0.6863 |   0.6580 |   1.0431 |
groupby_ngroups_100_cummin                   |  21.7299 |  20.7803 |   1.0457 |
groupby_transform_multi_key2                 |  78.0807 |  74.6547 |   1.0459 |
groupby_last_object                          |  21.3246 |  20.3857 |   1.0461 |
groupby_transform_series2                    | 174.5074 | 166.2600 |   1.0496 |
groupby_multi_python                         | 206.9554 | 196.9930 |   1.0506 |
groupby_nth_float32_none                     | 117.9597 | 110.9044 |   1.0636 |
groupby_nth_object_any                       | 1407.9730 | 1323.5826 |   1.0638 |
groupby_multi_series_op                      |  19.2900 |  18.0667 |   1.0677 |
groupby_ngroups_100_cummax                   |  21.5353 |  20.1233 |   1.0702 |
groupby_ngroups_100_nunique                  |  15.7940 |  14.7266 |   1.0725 |
groupby_simple_compress_timing               |  43.8840 |  40.8143 |   1.0752 |
groupby_agg_builtins1                        |  15.7374 |  14.3619 |   1.0958 |
groupby_multi_count                          |  12.3970 |  11.0126 |   1.1257 |
groupby_frame_median                         |  10.1300 |   8.7897 |   1.1525 |
groupby_ngroups_100_count                    |   0.7610 |   0.6503 |   1.1702 |
groupby_last_float32                         |   5.0333 |   4.2017 |   1.1979 |
groupby_ngroups_10000_mean                   |   3.0813 |   2.5270 |   1.2194 |
groupby_frame_nth_any                        |   9.7840 |   7.9927 |   1.2241 |
groupby_ngroups_10000_median                 |   4.3104 |   3.3850 |   1.2734 |
groupby_ngroups_10000_first                  |   3.9930 |   2.9980 |   1.3319 |
groupby_ngroups_100_prod                     |   0.8580 |   0.6284 |   1.3654 |
groupby_ngroups_10000_sem                    |   6.3477 |   4.4421 |   1.4290 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [9f0842d] : ENH: Refactor groupby for Categorical grouper
Base   [9f439f0] : Merge pull request #9380 from behzadnouri/i8grby

bug in groupby when key space exceeds int64 bounds

The results are not very stable, probably because other applications were running on my Mac. For example, I ran groupby_ngroups_10000_sem again and got the following, different result.

-------------------------------------------------------------------------------
groupby_ngroups_10000_sem                    |   4.0150 |   4.0263 |   0.9972 |
-------------------------------------------------------------------------------
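As a quick check of how to read these rows: the ratio column is the head time divided by the base time, so values below 1.0 mean HEAD is faster. The small helper below is a hypothetical illustration (it is not part of vbench) that reproduces the two ratios reported for groupby_ngroups_10000_sem.

```python
def ratio(head_ms, base_ms):
    """Return head/base; values below 1.0 mean the head commit is faster."""
    return head_ms / base_ms


# Timings copied from the two runs of groupby_ngroups_10000_sem above.
first_run = ratio(6.3477, 4.4421)
second_run = ratio(4.0150, 4.0263)
print(f"{first_run:.4f} {second_run:.4f}")  # 1.4290 0.9972
```

The same benchmark swings from an apparent 43% slowdown to parity between runs, which supports treating the larger ratios in the table as noise rather than regressions.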

groupby_nth_float64_any and groupby_nth_float32_any failed with the following traceback, which was recorded in benchmarks.db.

$ sqlite3 benchmarks.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> select * from results where checksum = '30b1995bdbe3973901ead3d9f96ff129' or checksum = '643e7456fe9d19192098079f8e257b0a';
30b1995bdbe3973901ead3d9f96ff129|5fd1fbd|2015-01-21 08:43:49.000000|||Traceback (most recent call last):
  File "/usr/local/opt/pyenv/versions/pandas27/lib/python2.7/site-packages/vbench/benchmark.py", line 90, in run
    repeat=self.repeat, force_ms=True)
  File "/usr/local/opt/pyenv/versions/pandas27/lib/python2.7/site-packages/vbench/benchmark.py", line 377, in magic_timeit
    best = min(timer.repeat(repeat, number)) / number
  File "/usr/local/opt/pyenv/versions/2.7.8/lib/python2.7/timeit.py", line 223, in repeat
    t = self.timeit(number)
  File "/usr/local/opt/pyenv/versions/2.7.8/lib/python2.7/timeit.py", line 195, in timeit
    timing = self.inner(it, self.timer)
  File "<magic-timeit>", line 6, in inner
  File "/private/var/folders/yh/7k_x_m811yvfffs2h44qw7cc0000gn/T/tmp1_vz7s/pandas/core/groupby.py", line 894, in nth
    level=self.level, sort=self.sort)
  File "/private/var/folders/yh/7k_x_m811yvfffs2h44qw7cc0000gn/T/tmp1_vz7s/pandas/core/groupby.py", line 2141, in _get_grouper
    level=level, sort=sort, in_axis=in_axis)
  File "/private/var/folders/yh/7k_x_m811yvfffs2h44qw7cc0000gn/T/tmp1_vz7s/pandas/core/groupby.py", line 1870, in __init__
    self.grouper = _convert_grouper(index, grouper)
  File "/private/var/folders/yh/7k_x_m811yvfffs2h44qw7cc0000gn/T/tmp1_vz7s/pandas/core/groupby.py", line 2168, in _convert_grouper
    raise AssertionError('Grouper and axis must be same length')
AssertionError: Grouper and axis must be same length

643e7456fe9d19192098079f8e257b0a|5fd1fbd|2015-01-21 08:43:49.000000|||Traceback (most recent call last):
  File "/usr/local/opt/pyenv/versions/pandas27/lib/python2.7/site-packages/vbench/benchmark.py", line 90, in run
    repeat=self.repeat, force_ms=True)
  File "/usr/local/opt/pyenv/versions/pandas27/lib/python2.7/site-packages/vbench/benchmark.py", line 377, in magic_timeit
    best = min(timer.repeat(repeat, number)) / number
  File "/usr/local/opt/pyenv/versions/2.7.8/lib/python2.7/timeit.py", line 223, in repeat
    t = self.timeit(number)
  File "/usr/local/opt/pyenv/versions/2.7.8/lib/python2.7/timeit.py", line 195, in timeit
    timing = self.inner(it, self.timer)
  File "<magic-timeit>", line 6, in inner
  File "/private/var/folders/yh/7k_x_m811yvfffs2h44qw7cc0000gn/T/tmp1_vz7s/pandas/core/groupby.py", line 894, in nth
    level=self.level, sort=self.sort)
  File "/private/var/folders/yh/7k_x_m811yvfffs2h44qw7cc0000gn/T/tmp1_vz7s/pandas/core/groupby.py", line 2141, in _get_grouper
    level=level, sort=sort, in_axis=in_axis)
  File "/private/var/folders/yh/7k_x_m811yvfffs2h44qw7cc0000gn/T/tmp1_vz7s/pandas/core/groupby.py", line 1870, in __init__
    self.grouper = _convert_grouper(index, grouper)
  File "/private/var/folders/yh/7k_x_m811yvfffs2h44qw7cc0000gn/T/tmp1_vz7s/pandas/core/groupby.py", line 2168, in _convert_grouper
    raise AssertionError('Grouper and axis must be same length')
AssertionError: Grouper and axis must be same length

@ledmonster
Author

Added a new performance test for this issue.

The test result is shown below. The performance improvement is not linear, so with a larger DataFrame the ratio would be much smaller.


Invoked with :
--ncalls: 3
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
groupby_sum_multiindex                       |   2.0807 |   4.8820 |   0.4262 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.
Seed used: 1234

Target [9f0842d] : ENH: Refactor groupby for Categorical grouper
Base   [9f439f0] : Merge pull request #9380 from behzadnouri/i8grby

bug in groupby when key space exceeds int64 bounds

def groups(self):
if self._groups is None:
Contributor

this makes the grouper lazy. Why did you change this?

Author

I replaced the "@property" decorator with "@cache_readonly", so groups should still be lazy, I think.
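The point about laziness can be illustrated with a rough sketch. It uses the standard library's functools.cached_property as a stand-in for pandas' cache_readonly (an assumption made to keep the example self-contained; the two are not identical, for instance cached_property does not make the attribute read-only). The LazyGroups class is hypothetical, not pandas code.

```python
from functools import cached_property


class LazyGroups:
    """Hypothetical example class; not pandas code."""

    def __init__(self, labels):
        self.labels = labels
        self.compute_count = 0

    @cached_property
    def groups(self):
        # Runs only on first access; the result is then stored on the instance.
        self.compute_count += 1
        result = {}
        for position, label in enumerate(self.labels):
            result.setdefault(label, []).append(position)
        return result


g = LazyGroups(["a", "b", "a"])
assert g.compute_count == 0  # nothing is computed at construction time
first = g.groups
second = g.groups
assert first is second and g.compute_count == 1  # computed once, then cached
```

A cached attribute is still lazy: the body does not run until groups is first read, and later reads reuse the stored value.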

@jreback
Contributor

jreback commented Feb 4, 2015

can you incorporate a test for #9344 as well (in that example both by_levels and by_columns should be the same). I think it's reindexing in that case as well (and shouldn't).
thanks

@ledmonster
Author

Okay, I'll try to add tests for #9344 this weekend.

@ledmonster
Author

Added a test for #9344. Now the test passes.

@ledmonster
Author

Rebased.

@@ -3297,6 +3297,33 @@ def test_groupby_categorical(self):
expected.index.names = ['myfactor', None]
assert_frame_equal(desc_result, expected)

def test_groupby_datetime_categorical(self):
levels = pd.date_range('2014-01-01', periods=4)
Contributor

add the issue number as a comment here

@jreback
Contributor

jreback commented Feb 8, 2015

looks good. minor doc change. pls squash into 1-2 commits. and ping when ready.

@@ -3297,6 +3297,34 @@ def test_groupby_categorical(self):
expected.index.names = ['myfactor', None]
assert_frame_equal(desc_result, expected)

def test_groupby_datetime_categorical(self):
# GH9049: ensure backward compatibility
Author

This test is there to ensure backward compatibility, so I added a comment like this.

@ledmonster
Author

@jreback thank you for reviewing.

Squashed into 2 commits (one for the bug fix, the other for the refactoring) and rebased.

jreback added a commit that referenced this pull request Feb 10, 2015
BUG: Fix not to reindex on non-Categorical groups (GH9049)
@jreback jreback merged commit 0efd4b3 into pandas-dev:master Feb 10, 2015
@jreback
Contributor

jreback commented Feb 10, 2015

thanks for this!

@ledmonster
Author

My pleasure 😸
