ENH: groupby missing data in index #28097
@jreback could you merge this PR?
@proost this needs review and will file comments at some point
Haven't deeply reviewed change but I see a comment a few lines up that says "Handle NA" - is that comment or block of code not applicable? Would like to refactor instead of adding logic if possible
pls merge master and address comments
85b9eef to 6576242
4a7e25e to ab639ca
dd5ea8b to 892126f
Looking pretty good
5b89726 to 52cf65a
def test_groupby_level_index_value_all_na(self):
    # issue 20519
    df = pd.DataFrame(
        [["x", np.nan, 10], [None, np.nan, 20]], columns=["A", "B", "C"]
If I cast df["B"] = df["B"].astype("datetime64") before doing the set_index, I get a different error on the groupby call (in master). Should that be fixed by this PR? If so, please test.
Side note: I'd find this easier to follow in smaller steps:
df = pd.DataFrame(...)
df = df.set_index(["A", "B"])
gb = df.groupby(level=["A", "B"])
result = gb.sum()
Especially relevant as it is the gb = ... line that raises in master.
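For anyone following along, the stepwise version suggested above could be written out roughly as below. This is only a sketch built from the frame in the test; the expectation that the result is empty assumes a pandas version that already includes this fix, where groups with NA keys are dropped by default.

```python
import numpy as np
import pandas as pd

# Frame from the test: every value in level "B" is missing.
df = pd.DataFrame(
    [["x", np.nan, 10], [None, np.nan, 20]], columns=["A", "B", "C"]
)
df = df.set_index(["A", "B"])

# Before the fix, this is the step reported to raise IndexError.
gb = df.groupby(level=["A", "B"])

# Assumption: with the fix in place, every row has an NA key,
# so no groups remain and the aggregated frame is empty.
result = gb.sum()
print(result)
```

Splitting the steps makes it obvious whether constructing the groupby object or the aggregation is the failing call.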
@jbrockmendel
I'm a bit confused. By df["B"] = df["B"].astype("datetime64") do you mean df["B"] = df["B"].astype("datetime64[ns]")?
In master, both
df = pd.DataFrame(...)
df["B"] = df["B"].astype("datetime64[ns]")
df = df.set_index(["A", "B"])
gb = df.groupby(level=["A", "B"])
and
df = pd.DataFrame(...)
df = df.set_index(["A", "B"])
gb = df.groupby(level=["A", "B"])
raise the same IndexError: cannot do a non-empty take from an empty axes.
> I'm a bit confused. By df["B"] = df["B"].astype("datetime64") do you mean df["B"] = df["B"].astype("datetime64[ns]")?
Yes, I meant to write "datetime64[ns]" and not just "datetime64".
> raise the same IndexError
Huh, not sure how I got to a different error. My bad.
@proost can you fix up the merge conflict?
8d26f5a to 85e066b
…exError (pandas-dev#20519) * if all the values in a level of a MultiIndex were missing, fill with numpy nan
6b9aef6 to 6a01c73
if len(level_index):
    grouper = level_index.take(codes)
else:
    grouper = level_index.take(codes, fill_value=True)
hmm, can you eliminate the branch, and just always pass fill_value=True?
If I remove the branch, then for example
df = DataFrame([["x", 1, 10], ["y", 2, 20]], columns=["A", "B", "C"]).set_index(["A", "B"])
result = df.groupby(level=["A", "B"]).sum()
raises an exception.
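A minimal sketch of why always passing fill_value=True can fail, using only the public Index.take call rather than the groupby internals. The integer index below stands in for level "B" of the example frame; the variable names are illustrative and not taken from the PR.

```python
import numpy as np
import pandas as pd

# Stand-in for an integer level, like level "B" in the example above.
int_level = pd.Index([1, 2])
codes = np.array([0, 1])

# A plain take works for any index.
plain = int_level.take(codes)

# Asking for NA filling on an index that cannot hold NA is rejected,
# even though none of the codes is -1.
try:
    int_level.take(codes, fill_value=True)
    fill_error = None
except ValueError as exc:
    fill_error = exc

print(plain)
print(repr(fill_error))
```

So an unconditional fill_value=True would need some guard for levels that cannot contain NA, which is what the branch provides for now.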
Sorry misread the approval above
@proost do you mind simplifying this in a follow up PR?
Thanks @proost
@jreback @WillAyd @jbrockmendel @TomAugspurger
…ndexing-1row-df
* upstream/master: (194 commits)
  DOC Remove Python 2 specific comments from documentation (pandas-dev#31198)
  Follow up PR: pandas-dev#28097 Simplify branch statement (pandas-dev#29243)
  BUG: DatetimeIndex.snap incorrectly setting freq (pandas-dev#31188)
  Move DataFrame.info() to live with similar functions (pandas-dev#31317)
  ENH: accept a dictionary in plot colors (pandas-dev#31071)
  PERF: add shortcut to Timestamp constructor (pandas-dev#30676)
  CLN/MAINT: Clean and annotate stata reader and writers (pandas-dev#31072)
  REF: define _get_slice_axis in correct classes (pandas-dev#31304)
  BUG: DataFrame.floordiv(ser, axis=0) not matching column-wise bheavior (pandas-dev#31271)
  PERF: optimize is_scalar, is_iterator (pandas-dev#31294)
  BUG: Series rolling count ignores min_periods (pandas-dev#30923)
  xfail sparse warning; closes pandas-dev#31310 (pandas-dev#31311)
  REF: DatetimeIndex.get_value wrap DTI.get_loc (pandas-dev#31314)
  CLN: internals.managers (pandas-dev#31316)
  PERF: avoid copies if possible in fill_binop (pandas-dev#31300)
  Add test for multiindex json (pandas-dev#31307)
  BUG: passing TDA and wrong freq to TimedeltaIndex (pandas-dev#31268)
  BUG: inconsistency between PeriodIndex.get_value vs get_loc (pandas-dev#31172)
  CLN: remove _set_subtyp (pandas-dev#31301)
  CI: Updated version of macos image (pandas-dev#31292)
  ...
… branch statement
…31689) Co-authored-by: proost <[email protected]>
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
.groupby initializes the "grouper" through _get_grouper_for_level. Based on the codes, level_index fills NA positions with a non-NA value. The problem is that when a level holds only NA values, level_index has no values to fill with. The fix checks whether every value in the level is NA and, if so, fills with NaN.
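The situation described above can be reproduced outside groupby with public MultiIndex attributes. This is a rough sketch, not the code in _get_grouper_for_level; it only assumes that a level made entirely of NaN is stored as an empty levels entry with every code set to -1.

```python
import numpy as np
import pandas as pd

# Level "B" contains only missing values.
mi = pd.MultiIndex.from_arrays(
    [["x", None], [np.nan, np.nan]], names=["A", "B"]
)
level_index = mi.levels[1]  # empty: NaN is never stored as a level value
codes = mi.codes[1]         # all -1, the marker for "missing"

# A plain take has nothing to take from, which matches the error in the issue.
try:
    level_index.take(codes)
    plain_take_error = None
except IndexError as exc:
    plain_take_error = exc

# Treating -1 as "fill with NA" yields one NaN per row instead.
filled = level_index.take(codes, fill_value=True)

print(len(level_index), list(codes))
print(repr(plain_take_error))
print(filled)
```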