ENH: Allow for join between two multi-index dataframe instances #20356
'LinkType', 'Distance'])
.set_index(['Origin', 'Destination', 'Period', 'LinkType']))

def f():
are there other error conditions to test?
This test fails and I'm not sure how we should handle join on empty levels
can you show a mini-example?
@jreback Here is an example of joining two DataFrames that share the same multi-level index, using two different methods:
- pd.merge(df1.reset_index(), df2.reset_index(),...)
- df1.join(df2)
The results differ. Do you think that's an issue? I'm facing a similar issue when I try a multi-level join.
import numpy as np
import pandas as pd

join_type = 'left'

left_multi = (
    pd.DataFrame(
        dict(Origin=['A', 'A', 'B', 'B', 'C'],
             Destination=[np.nan] * 5,
             Trips=[1987, 3647, 2470, 4296, 4444]),
        columns=['Origin', 'Destination', 'Trips'])
    .set_index(['Origin', 'Destination']))

right_multi = (
    pd.DataFrame(
        dict(Origin=['A', 'A', 'B', 'B', 'C', 'C', 'E'],
             Destination=[np.nan] * 7,
             Distance=[100, 80, 90, 80, 75, 35, 55]),
        columns=['Origin', 'Destination', 'Distance'])
    .set_index(['Origin', 'Destination']))

on_cols = ['Origin', 'Destination']
idx_cols = ['Origin', 'Destination']

expected = (pd.merge(left_multi.reset_index(),
                     right_multi.reset_index(),
                     how=join_type, on=on_cols).set_index(idx_cols)
            .sort_index())
result = left_multi.join(right_multi, how=join_type).sort_index()

print(expected)
print(result)
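One thing that may contribute to the difference, offered as a hedged sketch rather than a diagnosis: `pd.merge` matches null keys against each other, so NaN values in a key column behave like ordinary keys. The toy frames below (`left`, `right` and their columns are made up for illustration) show only that `merge` behaviour; they do not reproduce what `join` does on the MultiIndex above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one ordinary key and one NaN key on each side.
left = pd.DataFrame({'key': ['A', np.nan], 'x': [1, 2]})
right = pd.DataFrame({'key': ['A', np.nan], 'y': [10, 20]})

# pd.merge pairs the NaN key on the left with the NaN key on the right,
# unlike a SQL join, where NULL never compares equal to NULL.
merged = pd.merge(left, right, on='key', how='left')
print(merged)
```

If `join` on the index does not treat NaN levels the same way, the two code paths would be expected to disagree on inputs like the one above.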
pandas/core/reshape/merge.py
# Inject -1 in the labels list where a join was not possible
# IOW indexer[i]=-1
labels = [restore_labels[i] if i != -1 else -1 for i in indexer]
this should be a set operation on the arrays i think
Sorry @jreback but I'm not sure what you mean
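For reference, the list comprehension can be vectorised with NumPy. This is only a sketch of one possible reading of the suggestion, not necessarily what was meant or what the PR ended up using; the sample `restore_labels` and `indexer` values are made up.

```python
import numpy as np

# Made-up sample data; in the PR these come from the join machinery.
restore_labels = np.array([2, 0, 1])
indexer = np.array([0, -1, 2, 1, -1])

# Loop version from the diff.
labels_loop = [restore_labels[i] if i != -1 else -1 for i in indexer]

# Vectorised version: take along the indexer, then overwrite the
# positions where no join was possible (indexer == -1) with -1.
labels_vec = np.where(indexer == -1, -1, restore_labels[indexer])

print(labels_vec.tolist())
```

Note that `restore_labels[indexer]` on its own is not enough, because NumPy interprets -1 as "last element"; the `np.where` mask is what restores the -1 sentinel.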
I think this was addressed here (Thanks to @TomAugspurger):
Codecov Report
@@            Coverage Diff             @@
##           master   #20356      +/-   ##
==========================================
+ Coverage   92.24%   92.25%   +<.01%
==========================================
  Files         161      161
  Lines       51339    51376      +37
==========================================
+ Hits        47360    47397      +37
  Misses       3979     3979
Hello @harisbal! Thanks for updating the PR.
Comment last updated on November 11, 2018 at 04:31 Hours UTC
Any progress on this?
sorry, let me take a look. i know this has been outstanding for quite some time.
can you rebase and let's get this in
I'll take a look asap. Cheers
Any idea why pandas-dev.pandas failed?
looks good, @jorisvandenbossche @TomAugspurger if you'd have a look
…ex-join # Conflicts: # doc/source/whatsnew/v0.24.0.txt # pandas/core/reshape/merge.py # pandas/tests/reshape/merge/test_multi.py
How's this looking? I haven't checked on the changes in a while, but CI is passing.
@TomAugspurger I had some more comments. let me have a look again.
@harisbal can you merge master @TomAugspurger this lgtm. let's merge and can followup on any small issues.
Merged master. Ping on green.
Shall I try to merge again?
I restarted that crashed worker. I haven't seen that failure before.
All green. Merging. Thanks!
@jreback @TomAugspurger @WillAyd Thank you so much for everything!!
* upstream/master:
  BUG: to_html misses truncation indicators (...) when index=False (pandas-dev#22786)
  API/DEPR: replace "raise_conflict" with "errors" for df.update (pandas-dev#23657)
  BUG: Append DataFrame to Series with dateutil timezone (pandas-dev#23685)
  CLN/CI: Catch that stderr-warning! (pandas-dev#23706)
  ENH: Allow for join between two multi-index dataframe instances (pandas-dev#20356)
  Ensure Index._data is an ndarray (pandas-dev#23628)
  DOC: flake8-per-pr for windows users (pandas-dev#23707)
  DOC: Handle exceptions when computing contributors. (pandas-dev#23714)
  DOC: Validate space before colon docstring parameters pandas-dev#23483 (pandas-dev#23506)
  BUG-22984 Fix truncation of DataFrame representations (pandas-dev#22987)
closes #16162
closes #6360
Allow joining on multiple levels for multi-indexed DataFrame instances
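A minimal sketch of the behaviour this PR enables, assuming a pandas version that includes it (0.24 or later). The frames and the level names `key`, `X` and `Y` are illustrative and not taken from the PR's tests. The two indexes share only the `key` level, so the join is performed on that level.

```python
import pandas as pd

# Illustrative data: both frames have a two-level index and overlap on 'key'.
index_left = pd.MultiIndex.from_tuples(
    [('K0', 'X0'), ('K0', 'X1'), ('K1', 'X2')], names=['key', 'X'])
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']}, index=index_left)

index_right = pd.MultiIndex.from_tuples(
    [('K0', 'Y0'), ('K1', 'Y1'), ('K2', 'Y2'), ('K2', 'Y3')],
    names=['key', 'Y'])
right = pd.DataFrame({'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']}, index=index_right)

# Join directly on the overlapping index level.
result = left.join(right, how='inner')

# The reset_index/merge/set_index round trip that was needed before.
roundtrip = (pd.merge(left.reset_index(), right.reset_index(),
                      on=['key'], how='inner')
             .set_index(['key', 'X', 'Y']))

print(result)
print(roundtrip)
```

Rows of `right` whose `key` has no counterpart in `left` (here `K2`) drop out of the inner join.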