ENH: Multi-level join on multi-indexes #16162

Closed
wants to merge 221 commits into from

Contributor

@harisbal harisbal commented Apr 27, 2017

closes #6360

Allow for merging on multiple levels of multi-indexes
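
For illustration, here is a minimal sketch of the kind of join this PR targets: two frames whose MultiIndexes share only some level names, joined on the overlapping level. The frames, level names and column names are invented for the example. The `merge`/`reset_index`/`set_index` spelling should work regardless of this PR, while the direct `left.join(right, ...)` call assumes a pandas release that supports joining on overlapping MultiIndex levels.

```python
import pandas as pd

# Hypothetical frames: both indexes have a "key" level; "X" and "Y" are not shared.
left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"]},
    index=pd.MultiIndex.from_tuples(
        [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
    ),
)
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"]},
    index=pd.MultiIndex.from_tuples(
        [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
        names=["key", "Y"],
    ),
)

# Version-independent spelling: flatten both indexes, merge on the shared
# level name, then rebuild a three-level index.
via_merge = pd.merge(
    left.reset_index(), right.reset_index(), on="key", how="inner"
).set_index(["key", "X", "Y"])

# Direct spelling (assumes a pandas version with multi-level index joins).
via_join = left.join(right, how="inner")
```

Both results should contain the three rows whose `key` appears on both sides, indexed by `key`, `X` and `Y`.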

@harisbal harisbal changed the title ENH: Multi-level merge on multi-indexes ENH: Multi-level join on multi-indexes Apr 27, 2017
@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Apr 27, 2017
@@ -1136,14 +1136,14 @@ def test_join_multi_levels(self):

         def f():
             household.join(portfolio, how='inner')
-        pytest.raises(ValueError, f)
+        self.assertRaises(ValueError, f)

Member

@gfyoung gfyoung Apr 29, 2017

Do not change this. We're using the pytest framework.
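
As a sketch of the pytest idiom being asked for, the snippet below shows both the context-manager form and the callable form of `pytest.raises`. The `check_how` function is a hypothetical stand-in that exists only so the tests have something that raises; it is not part of pandas.

```python
import pytest

VALID_HOW = {"left", "right", "inner", "outer"}


def check_how(how):
    """Hypothetical validator, used only to have something that raises."""
    if how not in VALID_HOW:
        raise ValueError(f"unsupported join type: {how!r}")
    return how


def test_invalid_how_raises():
    # Context-manager form: only the code inside the block is expected to raise.
    with pytest.raises(ValueError, match="unsupported join type"):
        check_how("sideways")


def test_invalid_how_raises_callable_form():
    # Callable form, as in the diff above: pytest.raises(exc, func, *args).
    pytest.raises(ValueError, check_how, "sideways")
```

The context-manager form also allows checking the message with `match`, which keeps the assertion tied to the specific error rather than to any `ValueError`.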


         portfolio2 = portfolio.copy()
         portfolio2.index.set_names(['household_id', 'foo'])

         def f():
             portfolio2.join(portfolio, how='inner')
-        pytest.raises(ValueError, f)
+        self.assertRaises(ValueError, f)

Member

Same as here.


def f():
matrix.join(distances2, how='left')
self.assertRaises(TypeError, f)
Member

Same as here.


def f():
matrix.join(distances2, how='left')
self.assertRaises(ValueError, f)
Member

Same as here.

codecov bot commented May 15, 2017

Codecov Report

Merging #16162 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #16162      +/-   ##
==========================================
- Coverage   90.87%   90.85%   -0.02%     
==========================================
  Files         162      162              
  Lines       50816    50852      +36     
==========================================
+ Hits        46178    46202      +24     
- Misses       4638     4650      +12

| Flag | Coverage Δ |
|---|---|
| #multiple | 88.63% <100%> (-0.02%) ⬇️ |
| #single | 40.3% <0%> (-0.03%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/indexes/base.py | 96.27% <100%> (+0.07%) ⬆️ |
| pandas/plotting/_converter.py | 63.54% <0%> (-1.82%) ⬇️ |

Powered by Codecov. Last update 075eca1...75dea48. Read the comment docs.

codecov bot commented May 15, 2017

Codecov Report

❗ No coverage uploaded for pull request base (master@0d86742). Click here to learn what that means.
The diff coverage is 100%.

@@            Coverage Diff            @@
##             master   #16162   +/-   ##
=========================================
  Coverage          ?   91.22%           
=========================================
  Files             ?      163           
  Lines             ?    49837           
  Branches          ?        0           
=========================================
  Hits              ?    45462           
  Misses            ?     4375           
  Partials          ?        0

| Flag | Coverage Δ |
|---|---|
| #multiple | 89.02% <100%> (?) |
| #single | 40.3% <3.92%> (?) |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/indexes/base.py | 96.37% <100%> (ø) |
| pandas/core/reshape/merge.py | 94.52% <100%> (ø) |

Powered by Codecov. Last update 0d86742...5da2e44. Read the comment docs.

@harisbal
Contributor Author

Can anyone give some insight on the failure of Travis CI tests?

@TomAugspurger
Contributor

Some linting (style) issues, starting at https://travis-ci.org/pandas-dev/pandas/jobs/232494137#L1403

You can pip install flake8 and run it on the files you're changing.

@harisbal harisbal force-pushed the multi-index-merge branch 2 times, most recently from a6ac2d3 to 4ed3070 Compare May 31, 2017 22:16

return join_index, lidx, ridx

else:
Contributor

you don't need the else here

@@ -1195,7 +1191,7 @@ def f():
expected = (
DataFrame(dict(
household_id=[1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4],
- asset_id=["nl0000301109", "nl0000289783", "gb00b03mlx29",
+ asset_id=["nl0000301109", "nl0000301109", "gb00b03mlx29",
Contributor

why is this changed?

Contributor Author

I believe it was a mistake in the original test. The household dataframe in this method suggests this correction.

Contributor

why does the existing code not fail then? before this change

Contributor Author

@harisbal harisbal Jun 5, 2017

This part was raising a NotImplementedError so it was never actually evaluated


assert_frame_equal(result, expected)

def test_join_multi_levels3(self):
Contributor

where are the error cases, e.g. mismatch on multi-levels? misnamed levels?

Contributor Author

other_tmp = other.droplevel(rdrop_levels)

if not (other_tmp.is_unique and self_tmp.is_unique):
raise TypeError(" The index resulting from the overlapping "
Contributor

need testing for this

Contributor Author

self_is_mi = isinstance(self, MultiIndex)
other_is_mi = isinstance(other, MultiIndex)

def _complete_join():
Contributor

this should be a separate method, too confusing here

Contributor Author

@harisbal harisbal left a comment

Changes implemented

@@ -3028,27 +3028,61 @@ def join(self, other, how='left', level=None, return_indexers=False,

def _join_multi(self, other, how, return_indexers=True):
from .multi import MultiIndex
self_is_mi = isinstance(self, MultiIndex)
other_is_mi = isinstance(other, MultiIndex)

# figure out join names
self_names = [n for n in self.names if n is not None]
other_names = [n for n in other.names if n is not None]
overlap = list(set(self_names) & set(other_names))
Contributor

this should remain a set
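
A minimal sketch of what keeping the overlap as a set could look like. The helper name `overlapping_names` is made up for this illustration and is not taken from the PR; it only mirrors the "figure out join names" step in the diff above.

```python
def overlapping_names(self_names, other_names):
    """Return the level names shared by both indexes, ignoring unnamed levels.

    Hypothetical helper: it keeps the result as a set instead of
    converting it back to a list.
    """
    self_set = {name for name in self_names if name is not None}
    other_set = {name for name in other_names if name is not None}
    return self_set & other_set


overlap = overlapping_names(
    ["Origin", "Destination", "Period", "TripPurp"],
    ["Origin", "Destination", "Period", "LinkType"],
)
```

A set makes the later membership checks (`name not in overlap`) cheap and makes it explicit that the order of the overlapping names carries no meaning.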

# make the indices into mi's that match
if not (self_is_mi and other_is_mi):
if not (self_tmp.is_unique and other_tmp.is_unique):
raise TypeError(" The index resulting from the overlapping "
Contributor

use the original error message and it should be a ValueError

join_index = self
elif how == 'right':
join_index = other
else:
Contributor

else is unnecessary


return join_index, lidx, ridx
else:
jl = overlap[0]
Contributor

don't need the else here as you are returning
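
The point about dropping the `else` after a `return` can be sketched as follows. `choose_join_index` is a hypothetical helper written for this illustration, not code from the PR.

```python
def choose_join_index(how, left_index, right_index):
    """Hypothetical helper: pick which side's index a one-sided join keeps."""
    if how == "left":
        return left_index
    if how == "right":
        return right_index
    # No `else` needed: every branch above has already returned,
    # so this line is only reached for unsupported values.
    raise ValueError(f"unsupported join type: {how!r}")
```

Because each branch returns, the remaining code can stay at the outer indentation level, which keeps long functions flatter and easier to follow.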

ldrop_lvls = [l for l in self_names if l not in overlap]
rdrop_lvls = [l for l in other_names if l not in overlap]

self_is_mi = isinstance(self, MultiIndex)
Contributor

instead of doing this inline it should be a separate method, it's too long

raise NotImplementedError("merging with both multi-indexes is not "
"implemented")
def _complete_multi_join(self, other, join_idx, lidx, ridx, dropped_lvls):
new_lvls = join_idx.levels
Contributor

doc-string on what this is

for n in dropped_lvls:
if n in self.names:
idx = lidx
lvls = self.levels[self.names.index(n)].values
Contributor

don't abbreviate on the variable names, use new_levels and new_labels etc

result = matrix.join(distances, how='outer')
assert_frame_equal(result, expected)

# Non-unique resulting index
Contributor

separate test for non-uniques

Destination=[1, 2, 1, 3, 1],
Period=['AM', 'PM', 'IP', 'AM', 'OP'],
TripPurp=['hbw', 'nhb', 'hbo', 'nhb', 'hbw'],
Trips=[1987, 3647, 2470, 4296, 4444],
Contributor

in fact why don't you move all of these added tests to a new TestMultiMulti class for testing (and other relevant tests if any).

@jreback
Contributor

jreback commented Jul 27, 2017

pls rebase and update

@pep8speaks

pep8speaks commented Jul 27, 2017

Hello @harisbal! Thanks for updating the PR.

Line 1433:5: E303 too many blank lines (2)

Comment last updated on March 11, 2018 at 13:40 Hours UTC

Contributor Author

@harisbal harisbal left a comment

Rebased

Contributor

@TomAugspurger TomAugspurger left a comment

Contributor

@jreback jreback left a comment

a number of comments

other_tmp = other.droplevel(rdrop_levels)

if not (self_tmp.is_unique and other_tmp.is_unique):
raise ValueError("Join on level between two MultiIndex objects"
Contributor

test for this?

elif how == 'right':
join_index = other

levels = join_index.levels
Contributor

move these 3 assignments after the .join, then you can make the 'how' an if/elif (for each of outer/left/right)

if isinstance(result, tuple):
return result[0], result[2], result[1]
return result
flip_order = False
Contributor

huh?

if you need to do something based on the how, then do it above

Contributor Author

The part from jl = list(overlap)[0] onwards handles the cases where the two indexes are not both MultiIndexes.
This is legacy code that I didn't modify. Shall I try to rewrite it?

result = self._join_level(other, level, how=how,
return_indexers=return_indexers)

if flip_order:
Contributor

this is non-intuitive here, have no idea what is going on

Contributor Author

Same as above. Legacy code that I didn't modify

return result[0], result[2], result[1]
return result

def _complete_outer_join(self, other, join_idx, lidx, ridx,
Contributor

move this to pandas/core/reshape/merge.py (below _get_join_indexers) and just import where you need

# Multi-index join tests
# Self join
matrix = (
pd.DataFrame(
Contributor

pls parametrize or use fixtures for this.
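
One possible shape for a parametrized version is sketched below. Everything here is illustrative: `make_left`, `make_right`, the level names and the expected row counts belong to this toy example, not to the PR's test data. The sketch checks the `merge`/`reset_index` equivalent of the join so that it does not depend on the new code path; swapping in `left.join(right, how=how)` would exercise the feature itself.

```python
import pandas as pd
import pytest


def make_left():
    # Hypothetical left frame with index levels ("key", "X").
    index = pd.MultiIndex.from_tuples(
        [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
    )
    return pd.DataFrame({"A": [1, 2, 3]}, index=index)


def make_right():
    # Hypothetical right frame with index levels ("key", "Y").
    index = pd.MultiIndex.from_tuples(
        [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
        names=["key", "Y"],
    )
    return pd.DataFrame({"C": [10, 20, 30, 40]}, index=index)


@pytest.mark.parametrize(
    "how, expected_rows",
    [("left", 3), ("right", 5), ("inner", 3), ("outer", 5)],
)
def test_join_row_count(how, expected_rows):
    left, right = make_left(), make_right()
    # merge-based equivalent of joining on the shared "key" level
    result = pd.merge(
        left.reset_index(), right.reset_index(), on="key", how=how
    ).set_index(["key", "X", "Y"])
    assert len(result) == expected_rows
```

Each `(how, expected_rows)` pair becomes its own test case, so a failure report names the join type that broke instead of stopping at the first failing assertion in one long test.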


assert_frame_equal(result, expected)

def test_join_multi_levels3(self):
Contributor

rename to something more useful

@jreback
Contributor

jreback commented Sep 26, 2017

also the docs in merging.rst need updating. pls add a small sub-section in 0.21.0 as well showcasing the new functionality.

@harisbal
Contributor Author

Thank you @jreback, I'll implement the changes

@jreback
Contributor

jreback commented Sep 26, 2017

@harisbal great. sorry it took so long on this.

@harisbal
Contributor Author

It was actually my fault. Thanks again for all the comments :)

@@ -1066,6 +1066,54 @@ def _get_join_indexers(left_keys, right_keys, sort=False, how='inner',
return join_func(lkey, rkey, count, **kwargs)


def _complete_outer_join(self, other, join_idx, lidx, ridx, dropped_levels):
Contributor

call these left/right (not self, other), add to doc-string

Complete the index in case of outer join

Parameters
----------
Contributor

document like

join_index : Index
    the index of the join between common levels
lidx : intp array
    left indexer
....

etc

dropped_lvls = ldrop_levels + rdrop_levels

# tmp_index is equivalent of index when how='inner'
tmp_index, lidx, ridx = self_tmp.join(other_tmp, how=how,
Contributor

don't use tmp_index call this something else (and self_tmp)

# Append to the returned Index the non-overlapping levels
dropped_lvls = ldrop_levels + rdrop_levels

# tmp_index is equivalent of index when how='inner'
Contributor

explicitly enumerate all of the how cases

if how == 'inner':
    pass
elif how == 'outer':
    # outer_join
elif how == 'left':
.....

@@ -1120,6 +1120,13 @@ This is not Implemented via ``join`` at-the-moment, however it can be done using
labels=['left', 'right'], vertical=False);
plt.close('all');

For previous versions can be done using the following.
Contributor

rather I would say this is equivalent to

Contributor

@jreback jreback left a comment

can you add some tests versus empty frames (matching levels but empty). test all the how's

@@ -1098,7 +1098,8 @@ This is equivalent but less verbose and more memory efficient / faster than this
Joining with two multi-indexes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-This is not Implemented via ``join`` at-the-moment, however it can be done using the following.
+.. versionadded:: 0.21.0
+As of version 0.21.0 joining on two multi-indexes is possible:
Contributor

can you add a similar example to a sub-section in whatsnew (and put a ref to here). you can use this example and show how we could do it previously (e.g. the below section), and how it will just work now.

Contributor Author

Shall I wait to include it in 0.21.1?

Contributor

this is going to be for the next version (so waiting to move it is ok); after we tag, we can merge shortly after.

verify_integrity=False)

return multilevel_join_index, lidx, ridx

if not (self_is_mi and other_is_mi):
Contributor

I think this just becomes an else here

Contributor

or you can just remove the if as well (IOW we have covered both cases).

right indexer
dropped_levels : str array
list of non-common levels
Returns
Contributor

blank line before Returns

"""

if how == 'outer':
levels = join_idx.levels
Contributor

maybe a few comments on what you are doing here (for the outer case), IOW why iterating over the dropped levels, etc.


result = left.join(right, how=how)
tm.assert_frame_equal(result, expected)

Contributor

is it possible to also generically test the equivalence (this is just a copy-paste from above, obviously have to adapt here)

result = (merge(household.reset_index(), log_return.reset_index(),
                          on=['asset_id'], how='inner')
                    .set_index(['household_id', 'asset_id', 't']))



class TestJoinMultiMulti(object):
@pytest.fixture
Contributor

I would simply put the fixture outside the class, right above TestJoinMultiMulti, then you can just pass them in as needed

('right', expected_rightj),
('inner', expected_innerj),
('outer', expected_outerj)])
def test_join_multi_multi(self, how, expected):
Contributor

e.g. you pass left, right in here

.set_index(['Origin', 'Destination', 'Period',
'TripPurp', 'LinkType']))

@pytest.mark.parametrize('how, expected', [('left', expected_leftj),
Contributor

these you actually DO need to evaluate (e.g. expected_left()) because these are simply calling a function and are not 'parameters' (if they are in the signature they are parameters).

class TestMergeCategorical(object):

def test_identical(self, left):
@pytest.fixture
Contributor

would leave these where they were

@@ -1439,9 +1597,11 @@ def test_identical(self, left):
index=['X', 'Y_x', 'Y_y'])
assert_series_equal(result, expected)

def test_basic(self, left, right):
Contributor

so just leave this, it is very typical, writing left = self.left() is not.

@harisbal
Contributor Author

@jreback I have implemented the changes but one test is failing and I'm struggling to understand why. Shall I push it so you can take a look?

@jreback
Contributor

jreback commented Oct 19, 2017

@harisbal what test is failing?

@jreback
Contributor

jreback commented Nov 10, 2017

can you rebase

WillAyd and others added 21 commits February 28, 2018 00:04
* Added seek to buffer to fix xlwt asv failure

* Added conditional to check for seek on xlrd object
Rebase
@harisbal harisbal force-pushed the multi-index-merge branch from f5d9c20 to 2ccbe5b Compare March 11, 2018 13:34
harisbal and others added 3 commits March 11, 2018 19:35
# Conflicts:
#	doc/source/merging.rst
#	doc/source/whatsnew/v0.23.0.txt
#	pandas/core/frame.py
#	pandas/core/generic.py
#	pandas/core/indexes/base.py
#	pandas/core/ops.py
#	pandas/core/reshape/merge.py
#	pandas/plotting/_misc.py
#	pandas/tests/reshape/merge/test_merge.py
@harisbal harisbal force-pushed the multi-index-merge branch from c562204 to 5da2e44 Compare March 11, 2018 20:20
@harisbal harisbal closed this Mar 11, 2018
@harisbal
Contributor Author

This pull request is dirty beyond repair. I'll squash and start a new one

Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Development

Successfully merging this pull request may close these issues.

ENH: merge multi-index with a multi-index