When grouping by several levels of a MultiIndex, groupby evaluates all possible combinations of the groupby keys. When grouping by column name, it only evaluates the combinations that exist in the DataFrame. Also, this behavior does not exist in 0.14.1, but does in all final releases from 0.15.0 on.
This may be a new feature, not a bug, but I couldn't find anything in the docs, open or closed issues, etc. (closest was Issue #8138). If this is the intended behavior, it would be nice to have in the docs.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(12).reshape(-1, 3))
df.index = pd.MultiIndex.from_tuples([(1, 1), (1, 2), (3, 4), (5, 6)])
idx_names = ['x', 'y']
df.index.names = idx_names

# Adds NaNs for (x, y) combinations that aren't in the data
by_levels = df.groupby(level=idx_names).mean()

# This does not add missing combinations of the groupby keys
by_columns = df.reset_index().groupby(idx_names).mean()

print(by_levels)
print(by_columns)

# This passes in 0.14.1, but not >=0.15.0 final
assert by_levels.equals(by_columns)
The change in behavior happened in ea0a13c, probably the new _reindex_output function. I understand this is the desired behavior for categoricals, but is it also desired for any groupby on more than one level of a MultiIndex?
I think that was an unintended change, in that by default a multi-indexed groupby should not reindex to the cartesian product of the levels (i.e. what a categorical does).
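To make the "cartesian product of the levels" concrete, here is a minimal sketch that reproduces the shape of the reported output by reindexing explicitly. It is not the internal `_reindex_output` code; the names `observed`, `full_index`, `expanded` and `recovered` are made up for illustration, and it assumes the same `df` as in the report.

```python
import numpy as np
import pandas as pd

# Same data as in the report: four observed (x, y) pairs.
df = pd.DataFrame(np.arange(12).reshape(-1, 3))
df.index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (3, 4), (5, 6)], names=['x', 'y']
)

# Grouping by column name only produces the observed (x, y) pairs.
observed = df.reset_index().groupby(['x', 'y']).mean()

# Every combination of the level values: 3 x-values times 4 y-values.
full_index = pd.MultiIndex.from_product(df.index.levels, names=['x', 'y'])

# Reindexing to the product adds an all-NaN row for each unobserved pair,
# which is the shape of result described for the level-based groupby.
expanded = observed.reindex(full_index)

# Dropping the all-NaN rows gets back to the observed pairs only.
recovered = expanded.dropna(how='all')
```

On an affected version, something like `dropna(how='all')` on the level-based result may serve as a stopgap, with the caveat that it would also drop any observed group whose values are all NaN.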
INSTALLED VERSIONS
commit: None
python: 2.7.7.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.15.2
nose: 1.3.4
Cython: 0.20.1
numpy: 1.9.1
scipy: 0.15.1
statsmodels: 0.7.0.dev-161a0f8
IPython: 2.3.0
sphinx: 1.2.2
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.2
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.5
lxml: 3.3.5
bs4: 4.3.1
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None