-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Multi-level join on multi-indexes #16162
Conversation
pandas/tests/reshape/test_merge.py
Outdated
@@ -1136,14 +1136,14 @@ def test_join_multi_levels(self): | |||
|
|||
def f(): | |||
household.join(portfolio, how='inner') | |||
pytest.raises(ValueError, f) | |||
self.assertRaises(ValueError, f) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do not change this. We're using the pytest
framework.
pandas/tests/reshape/test_merge.py
Outdated
|
||
portfolio2 = portfolio.copy() | ||
portfolio2.index.set_names(['household_id', 'foo']) | ||
|
||
def f(): | ||
portfolio2.join(portfolio, how='inner') | ||
pytest.raises(ValueError, f) | ||
self.assertRaises(ValueError, f) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as here.
pandas/tests/reshape/test_merge.py
Outdated
|
||
def f(): | ||
matrix.join(distances2, how='left') | ||
self.assertRaises(TypeError, f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as here.
pandas/tests/reshape/test_merge.py
Outdated
|
||
def f(): | ||
matrix.join(distances2, how='left') | ||
self.assertRaises(ValueError, f) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as here.
Codecov Report
@@ Coverage Diff @@
## master #16162 +/- ##
==========================================
- Coverage 90.87% 90.85% -0.02%
==========================================
Files 162 162
Lines 50816 50852 +36
==========================================
+ Hits 46178 46202 +24
- Misses 4638 4650 +12
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #16162 +/- ##
=========================================
Coverage ? 91.22%
=========================================
Files ? 163
Lines ? 49837
Branches ? 0
=========================================
Hits ? 45462
Misses ? 4375
Partials ? 0
Continue to review full report at Codecov.
|
Can anyone give some insight on the failure of Travis CI tests? |
Some linting (style) issues, starting at https://travis-ci.org/pandas-dev/pandas/jobs/232494137#L1403 You can |
a6ac2d3
to
4ed3070
Compare
pandas/core/indexes/base.py
Outdated
|
||
return join_index, lidx, ridx | ||
|
||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need the else here
pandas/tests/reshape/test_merge.py
Outdated
@@ -1195,7 +1191,7 @@ def f(): | |||
expected = ( | |||
DataFrame(dict( | |||
household_id=[1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4], | |||
asset_id=["nl0000301109", "nl0000289783", "gb00b03mlx29", | |||
asset_id=["nl0000301109", "nl0000301109", "gb00b03mlx29", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it was a mistake in the first instance. The household dataframe in this method suggests this correction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does the existing code not fail then? before this change
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part was raising a NotImplementedError so it was never actually evaluated
pandas/tests/reshape/test_merge.py
Outdated
|
||
assert_frame_equal(result, expected) | ||
|
||
def test_join_multi_levels3(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where are the error cases, e.g. mistmatch on multi-levels? misnamed levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/core/indexes/base.py
Outdated
other_tmp = other.droplevel(rdrop_levels) | ||
|
||
if not (other_tmp.is_unique and self_tmp.is_unique): | ||
raise TypeError(" The index resulting from the overlapping " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need testing for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pandas/core/indexes/base.py
Outdated
self_is_mi = isinstance(self, MultiIndex) | ||
other_is_mi = isinstance(other, MultiIndex) | ||
|
||
def _complete_join(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be a separate method, too confusing here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes implemented
pandas/core/indexes/base.py
Outdated
@@ -3028,27 +3028,61 @@ def join(self, other, how='left', level=None, return_indexers=False, | |||
|
|||
def _join_multi(self, other, how, return_indexers=True): | |||
from .multi import MultiIndex | |||
self_is_mi = isinstance(self, MultiIndex) | |||
other_is_mi = isinstance(other, MultiIndex) | |||
|
|||
# figure out join names | |||
self_names = [n for n in self.names if n is not None] | |||
other_names = [n for n in other.names if n is not None] | |||
overlap = list(set(self_names) & set(other_names)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should remain a set
pandas/core/indexes/base.py
Outdated
# make the indices into mi's that match | ||
if not (self_is_mi and other_is_mi): | ||
if not (self_tmp.is_unique and other_tmp.is_unique): | ||
raise TypeError(" The index resulting from the overlapping " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use the original error message and it should be a ValueError
pandas/core/indexes/base.py
Outdated
join_index = self | ||
elif how == 'right': | ||
join_index = other | ||
else: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else is unecessary
pandas/core/indexes/base.py
Outdated
|
||
return join_index, lidx, ridx | ||
else: | ||
jl = overlap[0] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't need the else here as you are returning
pandas/core/indexes/base.py
Outdated
ldrop_lvls = [l for l in self_names if l not in overlap] | ||
rdrop_lvls = [l for l in other_names if l not in overlap] | ||
|
||
self_is_mi = isinstance(self, MultiIndex) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
instead of doing this inline it should be a separate method, its too long
pandas/core/indexes/base.py
Outdated
raise NotImplementedError("merging with both multi-indexes is not " | ||
"implemented") | ||
def _complete_multi_join(self, other, join_idx, lidx, ridx, dropped_lvls): | ||
new_lvls = join_idx.levels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
doc-string on what this is
pandas/core/indexes/base.py
Outdated
for n in dropped_lvls: | ||
if n in self.names: | ||
idx = lidx | ||
lvls = self.levels[self.names.index(n)].values |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't abbreviate on the variable names, use new_levels and new_labels etc
pandas/tests/reshape/test_merge.py
Outdated
result = matrix.join(distances, how='outer') | ||
assert_frame_equal(result, expected) | ||
|
||
# Non-unique resulting index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
separate test for non-uniques
pandas/tests/reshape/test_merge.py
Outdated
Destination=[1, 2, 1, 3, 1], | ||
Period=['AM', 'PM', 'IP', 'AM', 'OP'], | ||
TripPurp=['hbw', 'nhb', 'hbo', 'nhb', 'hbw'], | ||
Trips=[1987, 3647, 2470, 4296, 4444], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact why don't you move all of these added tests to a new TestMultiMulti class for testing (and other relevant tests if any).
pls rebase and update |
Hello @harisbal! Thanks for updating the PR.
Comment last updated on March 11, 2018 at 13:40 Hours UTC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebased
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jreback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a number of comments
pandas/core/indexes/base.py
Outdated
other_tmp = other.droplevel(rdrop_levels) | ||
|
||
if not (self_tmp.is_unique and other_tmp.is_unique): | ||
raise ValueError("Join on level between two MultiIndex objects" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test for this?
pandas/core/indexes/base.py
Outdated
elif how == 'right': | ||
join_index = other | ||
|
||
levels = join_index.levels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move these 3 assignments after the .join, then you can make the 'how' an if/elif (for each of outer/left/right)
if isinstance(result, tuple): | ||
return result[0], result[2], result[1] | ||
return result | ||
flip_order = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
huh?
if you need to do something based on the how, then do it above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The part from jl = list(overlap)[0]
and onwards refers to cases where not both indexes are multiindexes.
This is legacy code that I didn't modify. Shall I try to rewrite?
pandas/core/indexes/base.py
Outdated
result = self._join_level(other, level, how=how, | ||
return_indexers=return_indexers) | ||
|
||
if flip_order: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is non-intuitive here, have no idea what is going one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above. Legacy code that I didn't modify
pandas/core/indexes/base.py
Outdated
return result[0], result[2], result[1] | ||
return result | ||
|
||
def _complete_outer_join(self, other, join_idx, lidx, ridx, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move to this pandas/core/reshape/merge.py (below _get_join_indexers) and just import where you need
pandas/tests/reshape/test_merge.py
Outdated
# Multi-index join tests | ||
# Self join | ||
matrix = ( | ||
pd.DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls parametrize or use fixtures for this.
pandas/tests/reshape/test_merge.py
Outdated
|
||
assert_frame_equal(result, expected) | ||
|
||
def test_join_multi_levels3(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename to something more useful
also the docs in merging.rst need updating. pls add a small sub-section in 0.21.0 as well showcasing the new functionaility. |
Thank you @jreback, I'll implement the changes |
@harisbal great. sorry it took so long on this. |
It was actually my fault. Thanks again for all the comments :) |
pandas/core/reshape/merge.py
Outdated
@@ -1066,6 +1066,54 @@ def _get_join_indexers(left_keys, right_keys, sort=False, how='inner', | |||
return join_func(lkey, rkey, count, **kwargs) | |||
|
|||
|
|||
def _complete_outer_join(self, other, join_idx, lidx, ridx, dropped_levels): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call these left/right (not self, other), add to doc-string
pandas/core/reshape/merge.py
Outdated
Complete the index in case of outer join | ||
|
||
Parameters | ||
---------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
document like
join_index : Index
the index of the join between common leves
lidx : intp array
left indexer
....
etc
pandas/core/indexes/base.py
Outdated
dropped_lvls = ldrop_levels + rdrop_levels | ||
|
||
# tmp_index is equivalent of index when how='inner' | ||
tmp_index, lidx, ridx = self_tmp.join(other_tmp, how=how, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use tmp_index
call this something else (and self_tmp)
pandas/core/indexes/base.py
Outdated
# Append to the returned Index the non-overlapping levels | ||
dropped_lvls = ldrop_levels + rdrop_levels | ||
|
||
# tmp_index is equivalent of index when how='inner' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explicity enumerate all of the how cases
if how == 'innter':
pass
elif how == 'outer':
# outer_join
elif how == 'left'
.....
doc/source/merging.rst
Outdated
@@ -1120,6 +1120,13 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using | |||
labels=['left', 'right'], vertical=False); | |||
plt.close('all'); | |||
|
|||
For previous versions can be done using the following. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather I would say this is equivalent to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some tests versus empty frames (matching levels but empty). test all the hows's
doc/source/merging.rst
Outdated
@@ -1098,7 +1098,8 @@ This is equivalent but less verbose and more memory efficient / faster than this | |||
Joining with two multi-indexes | |||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |||
|
|||
This is not Implemented via ``join`` at-the-moment, however it can be done using the following. | |||
.. versionadded:: 0.21.0 | |||
As of version 0.21.0 joining on two multi-indexes is possible: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a similar example to a sub-section in whatsnew (and put a ref to here). you can use this example and show how we could do it previously (e.g. the below section), and how it will just work now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall I wait to include it in 0.21.1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to be for next version (so you can wait to move it is ok), after we tag can merge shortly after.
pandas/core/indexes/base.py
Outdated
verify_integrity=False) | ||
|
||
return multilevel_join_index, lidx, ridx | ||
|
||
if not (self_is_mi and other_is_mi): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this just becomes an else here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or you can just remove the if as well (IOW we have covered both cases).
pandas/core/reshape/merge.py
Outdated
right indexer | ||
dropped_levels : str array | ||
list of non-common levels | ||
Returns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blank line before Returns
pandas/core/reshape/merge.py
Outdated
""" | ||
|
||
if how == 'outer': | ||
levels = join_idx.levels |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a few comments on what you are doing here (for the outer case), IOW why iterating over the dropped levels, etc.
pandas/tests/reshape/test_merge.py
Outdated
|
||
result = left.join(right, how=how) | ||
tm.assert_frame_equal(result, expected) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it possible to generically also tests the equivalency (this is just a copy-paste from above, obviously have to adapt here)
result = (merge(household.reset_index(), log_return.reset_index(),
on=['asset_id'], how='inner')
.set_index(['household_id', 'asset_id', 't']))
pandas/tests/reshape/test_merge.py
Outdated
|
||
|
||
class TestJoinMultiMulti(object): | ||
@pytest.fixture |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would simply put the fixture outside the class, right above TestJoinMultiMulti, then you can just pass them in as needed
pandas/tests/reshape/test_merge.py
Outdated
('right', expected_rightj), | ||
('inner', expected_innerj), | ||
('outer', expected_outerj)]) | ||
def test_join_multi_multi(self, how, expected): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. you pass left, right in here
pandas/tests/reshape/test_merge.py
Outdated
.set_index(['Origin', 'Destination', 'Period', | ||
'TripPurp', 'LinkType'])) | ||
|
||
@pytest.mark.parametrize('how, expected', [('left', expected_leftj), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these you actually DO need to evaluate (e.g. expected_left()
) because these are simply calling a function and are not 'parameters' (if they are in the signature they are parameters).
pandas/tests/reshape/test_merge.py
Outdated
class TestMergeCategorical(object): | ||
|
||
def test_identical(self, left): | ||
@pytest.fixture |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would leave these where they were
pandas/tests/reshape/test_merge.py
Outdated
@@ -1439,9 +1597,11 @@ def test_identical(self, left): | |||
index=['X', 'Y_x', 'Y_y']) | |||
assert_series_equal(result, expected) | |||
|
|||
def test_basic(self, left, right): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so just leave this, it is very typical, writing left = self.left()
is not.
@jreback I have implemented the changes but one test is failing and I'm struggling to understand why. Shall I upload so you can give it a look? |
@harisbal what test is failing? l |
can you rebase |
… Series construction. (pandas-dev#19762)
* Added seek to buffer to fix xlwt asv failure * Added conditional to check for seek on xlrd object
f5d9c20
to
2ccbe5b
Compare
# Conflicts: # doc/source/merging.rst # doc/source/whatsnew/v0.23.0.txt # pandas/core/frame.py # pandas/core/generic.py # pandas/core/indexes/base.py # pandas/core/ops.py # pandas/core/reshape/merge.py # pandas/plotting/_misc.py # pandas/tests/reshape/merge/test_merge.py
c562204
to
5da2e44
Compare
This pull request is dirty beyond repair. I'll squash and start a new one |
closes #6360
Allow for merging on multiple levels of multi-indexes