Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: groupby missing data in index #28097

Merged
merged 3 commits into from
Oct 25, 2019
Merged

Conversation

proost
Copy link
Contributor

@proost proost commented Aug 22, 2019

.groupby intialize "grouper" by _get_grouper_for_level. Base on "code", "level_index" fills with non-NA value if there is NA value. Problem is if there are only NA values, "level_index" can't fill with values.
So, check "level_index" whether all "level_index" values is NA. then "level_index" are all NA values, fill with NaN

@proost
Copy link
Contributor Author

proost commented Aug 23, 2019

@jreback could you merge this pr?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@proost this needs review and will file comments at some point

@simonjayhawkins simonjayhawkins added Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex labels Aug 25, 2019
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't deeply reviewed change but I see a comment a few lines up that says "Handle NA" - is that comment or block of code not applicable? Would like to refactor instead of adding logic if possible

pandas/tests/indexes/test_base.py Outdated Show resolved Hide resolved
pandas/tests/indexes/test_base.py Outdated Show resolved Hide resolved
@WillAyd WillAyd added this to the 1.0 milestone Aug 26, 2019
doc/source/whatsnew/v0.25.1.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls merge master and address comments

pandas/core/indexes/multi.py Outdated Show resolved Hide resolved
pandas/tests/indexes/test_base.py Outdated Show resolved Hide resolved
pandas/tests/indexes/test_base.py Outdated Show resolved Hide resolved
@jreback jreback removed this from the 1.0 milestone Sep 8, 2019
@pep8speaks
Copy link

pep8speaks commented Sep 10, 2019

Hello @proost! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-10-25 04:59:47 UTC

@proost proost force-pushed the update-groupby branch 2 times, most recently from 4a7e25e to ab639ca Compare September 10, 2019 06:58
doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
pandas/tests/groupby/test_grouping.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_grouping.py Outdated Show resolved Hide resolved
@proost proost force-pushed the update-groupby branch 8 times, most recently from dd5ea8b to 892126f Compare September 14, 2019 15:44
@proost proost requested review from jreback and WillAyd September 16, 2019 01:32
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking pretty good

doc/source/whatsnew/v1.0.0.rst Outdated Show resolved Hide resolved
pandas/tests/groupby/test_grouping.py Outdated Show resolved Hide resolved
@proost proost requested a review from WillAyd September 18, 2019 08:09
@WillAyd WillAyd added this to the 1.0 milestone Sep 18, 2019
def test_groupby_level_index_value_all_na(self):
# issue 20519
df = pd.DataFrame(
[["x", np.nan, 10], [None, np.nan, 20]], columns=["A", "B", "C"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I cast df["B"] = df["B"].astype("datetime64") before doing the set_index, I get a different error on the groupby call (in master). Should that be fixed by this PR? If so, please test.

side-note, I'd find this easier to follow in smaller steps:

df = pd.DataFrame(...)
df = df.set_index(["A", "B"])
gb = df.groupby(level=["A", "B"])
result = gb.sum()

Especially relevant as it is the gb = ... line that raises in master

Copy link
Contributor Author

@proost proost Oct 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel
I'm bit confused. you df["B"] = df["B"].astype("datetime64") means df["B"] = df["B"].astype("datetime64[ns]") write?
in master,
df = pd.DataFrame(...)
df["B"] = df["B"].astype("datetime64[ns]")
df = df.set_index(["A", "B"])
gb = df.groupby(level=["A", "B"])
and
df = pd.DataFrame(...)
df = df.set_index(["A", "B"])
gb = df.groupby(level=["A", "B"])
raise same IndexError :cannot do a non-empty take from an empty axes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm bit confused. you df["B"] = df["B"].astype("datetime64") means df["B"] = df["B"].astype("datetime64[ns]") write?

Yes, I meant to write "datetime64[ns]" and not just "datetime64".

raise same IndexError

Huh, not sure how I got to a different error. My bad.

@WillAyd
Copy link
Member

WillAyd commented Oct 22, 2019

@proost can you fix up merge conflict?

…exError (pandas-dev#20519)

* if all the values in a level of a MultiIndex were missing, fill with numpy nan
@proost proost force-pushed the update-groupby branch 2 times, most recently from 6b9aef6 to 6a01c73 Compare October 25, 2019 01:57
if len(level_index):
grouper = level_index.take(codes)
else:
grouper = level_index.take(codes, fill_value=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, can you eliminate the branch, and just always pass fill_value=True ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If remove the branch, for example,

df = DataFrame([["x", 1, 10], ["y", 2, 20]], columns=["A", "B", "C"]).set_index(["A", "B"])
result = df.groupby(level=["A", "B"]).sum() 

raise exception.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i think this is incorrect

@WillAyd wasn’t done here

this branch can be simplified i think

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry misread the approval above

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@proost do you mind simplifying this in a follow up PR?

@WillAyd WillAyd merged commit f03ed62 into pandas-dev:master Oct 25, 2019
@WillAyd
Copy link
Member

WillAyd commented Oct 25, 2019

Thanks @proost

@proost proost deleted the update-groupby branch October 25, 2019 16:40
@proost
Copy link
Contributor Author

proost commented Oct 25, 2019

@jreback @WillAyd @jbrockmendel @TomAugspurger
Thank you for you review and feedback

proost added a commit to proost/pandas that referenced this pull request Oct 27, 2019
proost added a commit to proost/pandas that referenced this pull request Oct 28, 2019
proost added a commit to proost/pandas that referenced this pull request Oct 28, 2019
HawkinsBA pushed a commit to HawkinsBA/pandas that referenced this pull request Oct 29, 2019
Reksbril pushed a commit to Reksbril/pandas that referenced this pull request Nov 18, 2019
proost added a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost added a commit to proost/pandas that referenced this pull request Dec 19, 2019
bongolegend pushed a commit to bongolegend/pandas that referenced this pull request Jan 1, 2020
proost added a commit to proost/pandas that referenced this pull request Jan 2, 2020
proost added a commit to proost/pandas that referenced this pull request Jan 15, 2020
keechongtan added a commit to keechongtan/pandas that referenced this pull request Jan 27, 2020
…ndexing-1row-df

* upstream/master: (194 commits)
  DOC Remove Python 2 specific comments from documentation (pandas-dev#31198)
  Follow up PR: pandas-dev#28097 Simplify branch statement (pandas-dev#29243)
  BUG: DatetimeIndex.snap incorrectly setting freq (pandas-dev#31188)
  Move DataFrame.info() to live with similar functions (pandas-dev#31317)
  ENH: accept a dictionary in plot colors (pandas-dev#31071)
  PERF: add shortcut to Timestamp constructor (pandas-dev#30676)
  CLN/MAINT: Clean and annotate stata reader and writers (pandas-dev#31072)
  REF: define _get_slice_axis in correct classes (pandas-dev#31304)
  BUG: DataFrame.floordiv(ser, axis=0) not matching column-wise bheavior (pandas-dev#31271)
  PERF: optimize is_scalar, is_iterator (pandas-dev#31294)
  BUG: Series rolling count ignores min_periods (pandas-dev#30923)
  xfail sparse warning; closes pandas-dev#31310 (pandas-dev#31311)
  REF: DatetimeIndex.get_value wrap DTI.get_loc (pandas-dev#31314)
  CLN: internals.managers (pandas-dev#31316)
  PERF: avoid copies if possible in fill_binop (pandas-dev#31300)
  Add test for multiindex json (pandas-dev#31307)
  BUG: passing TDA and wrong freq to TimedeltaIndex (pandas-dev#31268)
  BUG: inconsistency between PeriodIndex.get_value vs get_loc (pandas-dev#31172)
  CLN: remove _set_subtyp (pandas-dev#31301)
  CI: Updated version of macos image (pandas-dev#31292)
  ...
meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Feb 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate MultiIndex
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Groupby on multiindex with missing data in group keys raises IndexError
7 participants