ENH: merge multi-index with a multi-index #6360

Closed
jreback opened this issue Feb 15, 2014 · 25 comments · Fixed by #20356
Labels
Algos, Enhancement, MultiIndex, Reshaping

Comments

jreback commented Feb 15, 2014

#3662 is about merging a single-level index with a MultiIndex.

This issue is about a MultiIndex-with-MultiIndex merge.

# example described there:
In [11]: import numpy as np; import pandas as pd
    ...: df = pd.DataFrame(np.random.randn(10, 5))
    ...: df1 = df[range(0, 3)].set_index([0, 2])
    ...: df2 = df[range(2, 5)].set_index([2, 4])
    ...: res = df1.join(df2, how='inner')  # empty. Do I need to specify something here?
    ...: exp = pd.DataFrame({1: df[1], 3: df[3]})
    ...: exp.index = df[2]
jreback added this to the 0.14.0 milestone Feb 15, 2014
jreback modified the milestones: 0.15.0, 0.14.0 Feb 15, 2014
jreback commented Feb 15, 2014

Here's a partial implementation: jreback@0c38215

PKEuS commented Mar 2, 2014

I tried to get your partial implementation working. As-is, it complained about a non-matching shape. I then enabled the commented-out code, which I rewrote to this:
PKEuS@90c307c

The commented code gave these tuples:

[('2', 'gb00b03mlx29', '233'), ('3', 'gb00b03mlx29', '234'), ('3', 'lu0197800237', '235'), (4.0, nan, '180')]

It now creates these tuples:

[('2', 'asset_id', '233'), ('2', 'asset_id', '234'), ('2', 'asset_id', '235'), ('3', 'asset_id', '233'), ('3', 'asset_id', '234'), ('3', 'asset_id', '235'), ('3', 'asset_id', '180'), ('3', 'asset_id', '181')]
['household_id', 'asset_id', 't']

Compared to the tuples created by your commented code, this looks correct to me. (You can ignore the line with asset_id=nan for now.)

However, it crashes somewhere in the merging code that runs after the MultiIndex has been returned. This crashes the Python interpreter, which makes nosetests abort without printing any results.

Any ideas?

jreback commented Mar 2, 2014

The tuples here are not that complicated; creating the correct indexers is the tricky part.
These are the indexers that allow a take_nd from the original values to the new ones.
Essentially it is a mapping with the new shape whose values point into the original shape, possibly with duplicates.
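
A minimal sketch of that mapping idea, using flat single-level indexes purely for illustration (the names are made up, and Index.join with return_indexers=True stands in here for the internal take_nd machinery):

import numpy as np
import pandas as pd

left = pd.Index([1, 2, 3], name='key')
right = pd.Index([2, 3, 4], name='key')

# For every position of the joined index, the indexers give the position to
# take from the original left/right values; -1 means "no source row".
joined, lidx, ridx = left.join(right, how='outer', return_indexers=True)

left_values = np.array([10.0, 20.0, 30.0])
realigned = np.full(len(joined), np.nan)
mask = lidx != -1
realigned[mask] = left_values.take(lidx[mask])
# joined    -> Index([1, 2, 3, 4]), lidx -> [0, 1, 2, -1]
# realigned -> [10., 20., 30., nan]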

PKEuS commented Mar 4, 2014

Indeed. The reason for the crash was mismatched sizes of indexers and indexes. Fixed that. Now it creates two objects which are, imho, identical, but assert_frame_equal fails. And surprisingly, it creates the same output for "inner" and "outer" merges. However, I would like to have one of these working first.

If I print the two dataframes (result and expected), they look equal:

                               share  log_return
household_id asset_id     t
2            gb00b03mlx29 233   0.60    0.096050
                          234   0.60   -0.065241
                          235   0.60    0.035324
3            gb00b03mlx29 233   0.15    0.096050
                          234   0.15   -0.065241
                          235   0.15    0.035324
             lu0197800237 180   0.60    0.030254
                          181   0.60    0.036997

[8 rows x 2 columns]
                               share  log_return
household_id asset_id     t
2            gb00b03mlx29 233   0.60    0.096050
                          234   0.60   -0.065241
                          235   0.60    0.035324
3            gb00b03mlx29 233   0.15    0.096050
                          234   0.15   -0.065241
                          235   0.15    0.035324
             lu0197800237 180   0.60    0.030254
                          181   0.60    0.036997

[8 rows x 2 columns]

However, assert_frame_equal prints:

======================================================================
FAIL: test_join_multi_levels2 (pandas.tools.tests.test_merge.TestMergeMulti)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python\pandas\pandas\tools\tests\test_merge.py", line 1121, in test_join_multi_levels2
    assert_frame_equal(result,expected)
  File "C:\Python\pandas\pandas\util\testing.py", line 523, in assert_frame_equal
    assert_index_equal(left.index, right.index)
  File "C:\Python\pandas\pandas\util\testing.py", line 467, in assert_index_equal
    right.dtype))
nose.proxy.AssertionError: AssertionError: [index] left [object household_id  asset_id      t
2             gb00b03mlx29  233
                            234
                            235
3             gb00b03mlx29  233
                            234
                            235
              lu0197800237  180
                            181], right [household_id  asset_id      t
2             gb00b03mlx29  233
                            234
                            235
3             gb00b03mlx29  233
                            234
                            235
              lu0197800237  180
                            181 object]

jreback commented Mar 4, 2014

Hmm... grab the 2 indexes before the assertion and look at them; they should both be MultiIndexes with the same names / levels and such. Try left.equals(right). If that works, then it's the test that is not comparing these correctly; if it fails, then it's possible your created index is somehow not being created quite equal.
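
For instance, a quick way to inspect the two indexes piece by piece (a sketch; the helper name is made up, and you would call it on result.index and expected.index just before the assertion):

def compare_multiindexes(left_idx, right_idx):
    # compare the attributes an index-equality check cares about
    print('names equal: ', left_idx.names == right_idx.names)
    print('values equal:', left_idx.equals(right_idx))
    for i, (l, r) in enumerate(zip(left_idx.levels, right_idx.levels)):
        print('level %d: %s vs %s, equal=%s' % (i, l.dtype, r.dtype, l.equals(r)))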

PKEuS commented Mar 9, 2014

The indexes look equal, however, both

print(result.index.equals(expected.index))

and

print(result.equals(expected))

print "False".

Apart from that, I feel I have reinvented the wheel somehow, since my code hardcodes an "inner" merge. Since achieving an "outer" (or left/right) merge is harder that way, I guess I should use some already existing method. Those seem not to work with MultiIndex objects, even if join levels are given. Does anybody have an idea about the best way to use the existing methods on the MultiIndex?

PKEuS commented Apr 5, 2014

At the moment I have a problem (which is the reason for the test failures) with this line:

t.extend(tuple(np.array(left).take(left_indexer)))

It creates an array from left (which is needed to use the take method), but all non-string indices become strings due to that conversion. Do you have any suggestions on how to solve that?

jreback commented Apr 5, 2014

maybe you could put up an example?

PKEuS commented Apr 10, 2014

The context of that line: https://github.com/PKEuS/pandas/blob/master/pandas/core/index.py#L1528

The point is: we have a row (of type FrozenList, for example: ["foo", 1, "bar"]). We need a tuple out of it, consisting only of those elements indexed by left_indexer. That is what take() does. This function requires an array, so the FrozenList ["foo", 1, "bar"] is converted to an np.array: ["foo", "1", "bar"] is the result (integer 1 becomes string "1").

jreback commented Apr 10, 2014

Do this:

In [1]: l = ['foo',1,'bar']

In [2]: np.array(l)
Out[2]: 
array(['foo', '1', 'bar'], 
      dtype='|S3')

In [3]: np.array(l,dtype='object')
Out[3]: array(['foo', 1, 'bar'], dtype=object)
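
Applied to the problematic line, that would be something like the following (a sketch; t, left and left_indexer are the variables from the linked branch):

t.extend(tuple(np.array(left, dtype='object').take(left_indexer)))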

PKEuS commented Apr 10, 2014

Thank you! That fixes the issue.

Now it seems that only one problem is left: it behaves somewhat strangely on missing values in the index (None/np.nan) in the unit tests. _join_level seems to replace the missing value with an arbitrary value.
Is a missing value even allowed in an index? If not, the test case could just be modified.

jreback commented Apr 10, 2014

Hmm. Well, missing values are technically allowed in an Index (I don't find it useful, but it's allowed).

Normally you remove them from the indexer first, then put them back,

something like:

indexer = np.array([1,2,5,-1,6,-1])
check = indexer != -1
result = values.take(indexer[check])

It's a bit tricky to actually put the NaN back in the right place.

Have a look at this: https://github.com/pydata/pandas/blob/master/pandas/core/indexing.py#L904

It gets unbelievably complicated when you have duplicates.
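
A small self-contained version of that pattern, with the NaNs put back at the -1 positions (the values are made up):

import numpy as np

values = np.array([10., 11., 12., 13., 14., 15., 16.])
indexer = np.array([1, 2, 5, -1, 6, -1])

check = indexer != -1
result = np.full(len(indexer), np.nan)   # start with NaN everywhere
result[check] = values.take(indexer[check])
# result -> [11., 12., 15., nan, 16., nan]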

PKEuS commented Apr 10, 2014

I guess the issue is on the _join_level side? Is it actually worth caring about these cases (missing index values, duplicate missing index values)?

jreback commented Apr 10, 2014

can you put up the test case that is failing?

@hmgaudecker commented

I would argue for "yes" in cases where just a subset of the levels of a MultiIndex is missing. The test case above is a real-world example: people answer in a survey that they own some asset, but I cannot match the name to any actual stock/fund, so the ISIN number is missing. I don't want those observations missing from the merged dataset, though.

But obviously you cannot match to anything if the missing value is in one of the levels to merge on. I would think of something like the following (a rough sketch follows the list):

  • Check whether there are missing values in the index levels that will be used for merging
  • If so, remove those observations
  • Merge
  • Put the observations with missing index levels back in
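
A rough sketch of that procedure, under the simplifying assumption that the keys sit in ordinary columns rather than index levels (the helper and its signature are hypothetical, not pandas API):

import pandas as pd

def merge_keeping_missing_keys(left, right, on):
    # inner-merge on the key columns, but keep left rows whose keys contain NaN
    missing_key = left[on].isnull().any(axis=1)
    merged = left[~missing_key].merge(right, on=on, how='inner')
    # put the observations with missing keys back in; their right-hand
    # columns simply stay NaN
    return pd.concat([merged, left[missing_key]], ignore_index=True)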

PKEuS commented May 15, 2014

I squashed my commits into one, rebased it, and "fixed" the Travis failure: PKEuS@b9e81b8

I was unable to implement support for missing values in MultiIndexes. I tried several ways (fixing the function that replaces missing values with other values, taking the NaN values out before merging), but I was not able to get any of them working. I think that ticket #5074 is related to this issue, since the problems are somewhat similar.

The unit test test_join_multi_levels2() can be switched to contain missing values by changing "None" to None in two places. It will then fail and could be used that way when fixing the missing-value issue.

If you consider this to be ok for merging into master, I would open a pull-request.

jreback commented May 15, 2014

Needs a few more tests. Also, there is a lot of duplicated-looking code (e.g. for the different types of joining). And this should be done in a vectorized way (not Python looping).

PKEuS commented May 19, 2014

Addressed some of these issues in PKEuS@9ab5c05

Regarding doing this in a vectorized way: I have no idea how to achieve that, sorry.

jreback commented May 19, 2014

Can you do a perf comparison of using this method (vs. the resetting method)?

It's possible vectorization can be done later.

Use a reasonably sized example.

PKEuS commented May 21, 2014

Can you do a perf comparison of using this method (vs. the resetting method)?

Actually, I don't really know what I should compare this method against. Could you explain, please?

jreback commented May 21, 2014

The benchmark is resetting, merging, and setting (same as the test comparisons); see here: http://pandas-docs.github.io/pandas-docs-travis/merging.html#joining-with-two-multi-indexes
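
In other words, the reset/merge/set pattern, roughly (a sketch with placeholder level names; left and right are DataFrames whose MultiIndexes share the levels 'key1' and 'key2'):

result = (left.reset_index()
              .merge(right.reset_index(), on=['key1', 'key2'], how='inner')
              .set_index(['key1', 'key2']))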

PKEuS commented Jul 16, 2014

Benchmark results:

inner strategy:
join: 0.03857225257896634
merge: 0.01556221417412993

outer strategy:
join: 0.5167438638723917
merge: 0.016556284920703934

Setup: two DataFrames of shape 1000x3 (two index columns, one data column)
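
For reference, a setup of roughly that shape could be built like this (a hedged sketch using the column names from the test data above; this is not the actual benchmark code):

import time
import numpy as np
import pandas as pd

n = 1000
left = pd.DataFrame({'household_id': np.random.randint(0, 100, n),
                     'asset_id': np.random.randint(0, 50, n),
                     'share': np.random.rand(n)}).set_index(['household_id', 'asset_id'])
right = pd.DataFrame({'asset_id': np.random.randint(0, 50, n),
                      't': np.random.randint(0, 10, n),
                      'log_return': np.random.randn(n)}).set_index(['asset_id', 't'])

# time the reset/merge/set ("merge") strategy
start = time.time()
merged = (left.reset_index()
              .merge(right.reset_index(), on='asset_id', how='inner')
              .set_index(['household_id', 'asset_id', 't']))
print('merge strategy: %.6f s' % (time.time() - start))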

PKEuS commented Aug 4, 2014

@jreback: Any comments on these results?

A speedup of ~10% is possible with some micro-optimizations; however, the merge function is still faster.

jreback commented Aug 4, 2014

I think there are some vbenches already for these (or maybe we need to add them). Can you do that and post the vbench results?

@gandhis1 commented

A question on this for anyone involved: as I understand it, as of 0.19.0 this still hasn't made it in, and one still has to do the manual steps of reset_index(), pd.merge() on specific columns, and then set_index(). Is that correct?
