ENH: merge multi-index with a multi-index #6360

Closed
jreback opened this issue Feb 15, 2014 · 25 comments · Fixed by #20356
Labels
Algos, Enhancement, MultiIndex, Reshaping

Comments

jreback commented Feb 15, 2014

#3662 is about merging a single-level index with a MultiIndex.

This issue is about a MultiIndex-with-MultiIndex merge.

# example described there:
In [11]: import numpy as np; import pandas as pd
    ...: df = pd.DataFrame(np.random.randn(10, 5))
    ...: df1 = df[range(0, 3)].set_index([0, 2])
    ...: df2 = df[range(2, 5)].set_index([2, 4])
    ...: res = df1.join(df2, how='inner')  # empty. Do I need to specify something here?
    ...: exp = pd.DataFrame({1: df[1], 3: df[3]})
    ...: exp.index = df[2]
jreback added this to the 0.14.0 milestone Feb 15, 2014
jreback modified the milestones: 0.15.0, 0.14.0 Feb 15, 2014
jreback commented Feb 15, 2014

Here's a partial implementation: jreback@0c38215

PKEuS commented Mar 2, 2014

I tried to get your partial implementation working. As-is, it complained about a non-matching shape. I then enabled the commented-out code, which I rewrote to this:
PKEuS@90c307c

The commented code gave these tuples:

[('2', 'gb00b03mlx29', '233'), ('3', 'gb00b03mlx29', '234'), ('3', 'lu0197800237', '235'), (4.0, nan, '180')]

It now creates these tuples:

[('2', 'asset_id', '233'), ('2', 'asset_id', '234'), ('2', 'asset_id', '235'), ('3', 'asset_id', '233'), ('3', 'asset_id', '234'), ('3', 'asset_id', '235'), ('3', 'asset_id', '180'), ('3', 'asset_id', '181')]
['household_id', 'asset_id', 't']

Compared to the tuples created by your commented code, this looks correct to me. (You can ignore the line with asset_id=nan for now.)

However, it crashes somewhere in the merging code that runs after the MultiIndex has been returned. This crashes the Python interpreter, which makes nosetests abort without printing any results.

Any ideas?

jreback commented Mar 2, 2014

The tuples here are not that complicated; creating the correct indexers is the tricky part.
These are the indexers that allow a take_nd from the original values to the new ones.
Essentially it is a mapping with the new shape whose values point into the original shape, possibly with duplicates.
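
A minimal sketch of that mapping idea, using flat single-level indexes purely for illustration (the names are made up, and Index.join with return_indexers=True stands in here for the internal take_nd machinery):

import numpy as np
import pandas as pd

left = pd.Index([1, 2, 3], name='key')
right = pd.Index([2, 3, 4], name='key')

# For every position of the joined index, the indexers give the position to
# take from the original left/right values; -1 means "no source row".
joined, lidx, ridx = left.join(right, how='outer', return_indexers=True)

left_values = np.array([10.0, 20.0, 30.0])
realigned = np.full(len(joined), np.nan)
mask = lidx != -1
realigned[mask] = left_values.take(lidx[mask])
# joined    -> Index([1, 2, 3, 4]), lidx -> [0, 1, 2, -1]
# realigned -> [10., 20., 30., nan]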

PKEuS commented Mar 4, 2014

Indeed. The reason for the crash was mismatched sizes of indexers and indexes. Fixed that. Now it creates two objects which are, imho, identical, but assert_frame_equal fails. And surprisingly, it creates the same output for "inner" and "outer" merges. However, I would like to have one of these working first.

If I print the two dataframes (result and expected), they look equal:

                               share  log_return
household_id asset_id     t
2            gb00b03mlx29 233   0.60    0.096050
                          234   0.60   -0.065241
                          235   0.60    0.035324
3            gb00b03mlx29 233   0.15    0.096050
                          234   0.15   -0.065241
                          235   0.15    0.035324
             lu0197800237 180   0.60    0.030254
                          181   0.60    0.036997

[8 rows x 2 columns]
                               share  log_return
household_id asset_id     t
2            gb00b03mlx29 233   0.60    0.096050
                          234   0.60   -0.065241
                          235   0.60    0.035324
3            gb00b03mlx29 233   0.15    0.096050
                          234   0.15   -0.065241
                          235   0.15    0.035324
             lu0197800237 180   0.60    0.030254
                          181   0.60    0.036997

[8 rows x 2 columns]

However, assert_frame_equal prints:

======================================================================
FAIL: test_join_multi_levels2 (pandas.tools.tests.test_merge.TestMergeMulti)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python\pandas\pandas\tools\tests\test_merge.py", line 1121, in test_join_multi_levels2
    assert_frame_equal(result,expected)
  File "C:\Python\pandas\pandas\util\testing.py", line 523, in assert_frame_equal
    assert_index_equal(left.index, right.index)
  File "C:\Python\pandas\pandas\util\testing.py", line 467, in assert_index_equal
    right.dtype))
nose.proxy.AssertionError: AssertionError: [index] left [object household_id  asset_id      t
2             gb00b03mlx29  233
                            234
                            235
3             gb00b03mlx29  233
                            234
                            235
              lu0197800237  180
                            181], right [household_id  asset_id      t
2             gb00b03mlx29  233
                            234
                            235
3             gb00b03mlx29  233
                            234
                            235
              lu0197800237  180
                            181 object]

jreback commented Mar 4, 2014

Hmm... grab the 2 indexes before the assertion and look at them; they should both be MultiIndexes with the same names / levels and such. Try left.equals(right). If that works, then it's the test that is not comparing these correctly; if it fails, then it's possible your created index is somehow not being created quite equal.
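
For instance, a quick way to inspect the two indexes piece by piece (a sketch; the helper name is made up, and you would call it on result.index and expected.index just before the assertion):

def compare_multiindexes(left_idx, right_idx):
    # compare the attributes an index-equality check cares about
    print('names equal: ', left_idx.names == right_idx.names)
    print('values equal:', left_idx.equals(right_idx))
    for i, (l, r) in enumerate(zip(left_idx.levels, right_idx.levels)):
        print('level %d: %s vs %s, equal=%s' % (i, l.dtype, r.dtype, l.equals(r)))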

PKEuS commented Mar 9, 2014

The indexes look equal, however, both

print(result.index.equals(expected.index))

and

print(result.equals(expected))

print "False".

Apart from that, I feel I have reinvented the wheel somehow, since my code hardcodes an "inner" merge. Since achieving an "outer" (or left/right) merge is harder that way, I guess I should use some already existing method. Those seem not to work with MultiIndex objects, even if join levels are given. Does anybody have an idea about the best way to use the existing methods on the MultiIndex?

PKEuS commented Apr 5, 2014

At the moment I have a problem (which is the reason for the test failures) with this line:

t.extend(tuple(np.array(left).take(left_indexer)))

It creates an array from left (which is needed to use the take method), but all non-string indices become strings due to that conversion. Do you have any suggestions on how to solve that?

jreback commented Apr 5, 2014

maybe you could put up an example?

PKEuS commented Apr 10, 2014

The context of that line: https://github.com/PKEuS/pandas/blob/master/pandas/core/index.py#L1528

The point is: we have a row (of type FrozenList, for example: ["foo", 1, "bar"]). We need a tuple out of it, consisting only of those elements indexed by left_indexer. That is what take() does. This function requires an array, so the FrozenList ["foo", 1, "bar"] is converted to an np.array: ["foo", "1", "bar"] is the result (integer 1 becomes string "1").

jreback commented Apr 10, 2014

Do this:

In [1]: l = ['foo',1,'bar']

In [2]: np.array(l)
Out[2]: 
array(['foo', '1', 'bar'], 
      dtype='|S3')

In [3]: np.array(l,dtype='object')
Out[3]: array(['foo', 1, 'bar'], dtype=object)
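
Applied to the problematic line, that would be something like the following (a sketch; t, left and left_indexer are the variables from the linked branch):

t.extend(tuple(np.array(left, dtype='object').take(left_indexer)))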

PKEuS commented Apr 10, 2014

Thank you! That fixes the issue.

Now it seems that only one problem is left: it behaves somewhat strangely on missing values in the index (None/np.nan) in the unit tests. _join_level seems to replace the missing value with an arbitrary value.
Is a missing value even allowed in an index? If not, the test case could just be modified.

jreback commented Apr 10, 2014

Hmm. Well, missing values are technically allowed in an Index (I don't find it useful, but it's allowed).

Normally you remove them from the indexer first, then put them back,

something like:

indexer = np.array([1,2,5,-1,6,-1])
check = indexer != -1
result = values.take(indexer[check])

It's a bit tricky to actually put the NaN back in the right place.

Have a look at this: https://github.com/pydata/pandas/blob/master/pandas/core/indexing.py#L904

It gets unbelievably complicated when you have duplicates.
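
A small self-contained version of that pattern, with the NaNs put back at the -1 positions (the values are made up):

import numpy as np

values = np.array([10., 11., 12., 13., 14., 15., 16.])
indexer = np.array([1, 2, 5, -1, 6, -1])

check = indexer != -1
result = np.full(len(indexer), np.nan)   # start with NaN everywhere
result[check] = values.take(indexer[check])
# result -> [11., 12., 15., nan, 16., nan]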

PKEuS commented Apr 10, 2014

I guess the issue is on the _join_level side? Is it actually worth caring about these cases (missing index values, duplicate missing index values)?

jreback commented Apr 10, 2014

can you put up the test case that is failing?

@hmgaudecker commented

I would argue for "yes" in cases where just a subset of the levels of a MultiIndex is missing. The test case above is a real-world example: people answer in a survey that they own some asset, but I cannot match the name to any actual stock/fund, so the ISIN number is missing. I don't want those observations missing from the merged dataset, though.

But obviously you cannot match to anything if the missing value is in one of the levels to merge on. I would think of something like the following (a rough sketch follows the list):

  • Check whether there are missing values in the index levels that will be used for merging
  • If so, remove those observations
  • Merge
  • Put the observations with missing index levels back in
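
A rough sketch of that procedure, under the simplifying assumption that the keys sit in ordinary columns rather than index levels (the helper and its signature are hypothetical, not pandas API):

import pandas as pd

def merge_keeping_missing_keys(left, right, on):
    # inner-merge on the key columns, but keep left rows whose keys contain NaN
    missing_key = left[on].isnull().any(axis=1)
    merged = left[~missing_key].merge(right, on=on, how='inner')
    # put the observations with missing keys back in; their right-hand
    # columns simply stay NaN
    return pd.concat([merged, left[missing_key]], ignore_index=True)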

PKEuS commented May 15, 2014

I squashed my commits into one, rebased it, and "fixed" the Travis failure: PKEuS@b9e81b8

I was unable to implement support for missing values in MultiIndexes. I tried several ways (fixing the function that replaces missing values with other values, taking the NaN values out before merging), but I was not able to get any of them working. I think that ticket #5074 is related to this issue, since the problems are somewhat similar.

The unit test test_join_multi_levels2() can be switched to contain missing values by changing "None" to None in two places. It will then fail and could be used that way when fixing the missing-value issue.

If you consider this to be ok for merging into master, I would open a pull-request.

jreback commented May 15, 2014

Needs a few more tests. Also, there is a lot of duplicated-looking code (e.g. for the different types of joining). And this should be done in a vectorized way (not Python looping).

PKEuS commented May 19, 2014

Addressed some of these issues in PKEuS@9ab5c05

Regarding doing this in a vectorized way: I have no idea how to achieve that, sorry.

jreback commented May 19, 2014

Can you do a perf comparison of using this method (vs. the resetting method)?

It's possible vectorization can be done later.

Use a reasonably sized example.

PKEuS commented May 21, 2014

Can you do a perf comparison of using this method (vs. the resetting method)?

Actually, I don't really know what I should compare this method against. Could you explain, please?

jreback commented May 21, 2014

The benchmark is resetting, merging, and setting (same as the test comparisons); see here: http://pandas-docs.github.io/pandas-docs-travis/merging.html#joining-with-two-multi-indexes
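
In other words, the reset/merge/set pattern, roughly (a sketch with placeholder level names; left and right are DataFrames whose MultiIndexes share the levels 'key1' and 'key2'):

result = (left.reset_index()
              .merge(right.reset_index(), on=['key1', 'key2'], how='inner')
              .set_index(['key1', 'key2']))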

PKEuS commented Jul 16, 2014

Benchmark results:

inner strategy:
join: 0.03857225257896634
merge: 0.01556221417412993

outer strategy:
join: 0.5167438638723917
merge: 0.016556284920703934

Setup: two DataFrames of shape 1000x3 (two index columns, one data column)
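
For reference, a setup of roughly that shape could be built like this (a hedged sketch using the column names from the test data above; this is not the actual benchmark code):

import time
import numpy as np
import pandas as pd

n = 1000
left = pd.DataFrame({'household_id': np.random.randint(0, 100, n),
                     'asset_id': np.random.randint(0, 50, n),
                     'share': np.random.rand(n)}).set_index(['household_id', 'asset_id'])
right = pd.DataFrame({'asset_id': np.random.randint(0, 50, n),
                      't': np.random.randint(0, 10, n),
                      'log_return': np.random.randn(n)}).set_index(['asset_id', 't'])

# time the reset/merge/set ("merge") strategy
start = time.time()
merged = (left.reset_index()
              .merge(right.reset_index(), on='asset_id', how='inner')
              .set_index(['household_id', 'asset_id', 't']))
print('merge strategy: %.6f s' % (time.time() - start))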

PKEuS commented Aug 4, 2014

@jreback: Any comments on these results?

A speedup of ~10% is possible with some micro-optimizations; however, the merge function is still faster.

jreback commented Aug 4, 2014

I think there are some vbenches already for these (or maybe we need to add them). Can you do that and post the vbench results?

@gandhis1 commented

A question on this for anyone involved: as I understand it, as of 0.19.0 this still hasn't made it in, and one still has to do the manual steps of reset_index(), pd.merge() on specific columns, and then set_index(). Is that correct?
