-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: merge multi-index with a multi-index #6360
Comments
Here's a partial implementation: jreback@0c38215 |
I tried to get your partial implementation working. As-is, it complained about non-fitting shape. I enabled the commented code then, which I rewrote to this one: The commented code gave these tuples:
It now creates this tuples
Compared to the tuples created by your commented code, it looks correct to me. (You can ignore the line with asset_id=nan for now) However, it crashes after returning the MultiIndex somewhere in the merging code called after merging the index. This crashs the python interpreter, which makes nosetests abort without printing any results. Any ideas? |
the tuples here are not that complicated |
Indeed. The reason for the crash were non-fitting sizes of indexers and indexes. Fixed that. Now it creates two objects which are imho identical, but assert_frame_equal returns false. And surprisingly, it creates the same output for "inner" and "outer" merge. However, I would like to have one of these working first. If I print the two dataframes (result and expected), they look equal:
However, assert_frame_equals prints:
|
hmm...grab the 2 indexes before the assertion and look at them, they should both be multi-indexes with the same names / levels and such, try |
The indexes look equal, however, both
and
print "False". Besides from that, I feel I reinvented the wheel somehow, since my code hardcodes an "inner merge". Since achieving an "outer merge" or left/right is harder that way, I guess I should use some already existing method. Those seem to not work with multiindex objects, even if join levels are given. Has anybody an idea about the best way to use the existing mehtods on the multiindex? |
At the moment I have a problem (which is the reason for the test failures) with this line:
It creates an array from left (which is needed to use the take method), but all non-string indices become strings due to that conversion. Do you have any suggestions on how to solve that? |
maybe you could put up an example? |
The context of that line: https://github.com/PKEuS/pandas/blob/master/pandas/core/index.py#L1528 The point is: We have a row (type FrozenList, for example: ["foo", 1, "bar"]). We need a tuple out of it, consisting only of those elements indexed by left_indexer. That is, what take() does. This function requires an array, so the FrozenList ["foo", 1, "bar"] is converted to np.array: ["foo", "1", "bar"] is the result (integer 1 becomes string "1"). |
do this
|
Thank you! That fixes this issue. Now, it seems that only one problem is left: It behaves somehow strange on the missing value in the index (None/np.nan) in the unit tests. _join_level seems to replace missing value by a random value. |
hmm. well missing values are technically allowed in an Index (I don't find it useful but its allowed). normally remove them from the indexer first them put them back something like
its a bit tricky to actually put the nan back in the right place, have a look at this: https://github.com/pydata/pandas/blob/master/pandas/core/indexing.py#L904 it gets unbelievable complicated when you have duplicates |
I guess, the issue is on the _join_level side? Is it actually worth to care about these cases (missing index, duplicate missing indices)? |
can you put up the test case that is failing? |
I would argue for "yes" in cases where just a subset of the levels of a MultiIndex is missing. The test case above is a real-world example: People answer they own some asset in a survey but I cannot match the name to any actual stock/fund, so the ISIN number is missing. I don't want to have those observations missing from the merged dataset, though. But obviously you cannot match to anything if the missing value is in one of the levels to merge on. I would think of something as the following:
|
I squshed my commits into one, rebased it and "fixed" travis failure: PKEuS@b9e81b8 I was unable to implement support for missing values in multiindices. I tried several ways (fixing the function that replaces missings by other values, taking out the NaN-values before merging), but I was not able to get one of these ways working. I think that ticket #5074 is related to this issue, since the problems are somewhat similar. The unit test test_join_multi_levels2() can be switched to contain missing values by changing "None" to None in two places. It will fail then and could be used that way when fixing the missing value issue. If you consider this to be ok for merging into master, I would open a pull-request. |
needs a bit more tests. Also, lots of duplicated looking code (e.g. for the different types of joining). And this should be done in a vectorized way (not python looping). |
Addressed some of these issues in PKEuS@9ab5c05 Regarding doing this in a vectorized way: I have no idea how to achieve that, sorry. |
can u do a perf comparison of using this method (and the resetting method) it's possible vectorization can be done later use a reasonable sized example |
Actually, i don't really know to what I have to compare this method. Could you explain, please? |
the benchmarch is resetting,merging,setting (same as the test comparisons), see here: http://pandas-docs.github.io/pandas-docs-travis/merging.html#joining-with-two-multi-indexes |
Benchmark results: inner strategy: outer strategy: Setting: two Dataframes 1000x3 (two index columns, 1 data column) |
@jreback: Any comments on these results? A speedup by ~10% is possible by some micro-optimizations, however, the merge function is still faster. |
I think their are some vbenchs already for these (or maybe need to add). Can you do that and post the vbench results? |
Question on this for anyone involved with this? As I understand it, as of 0.19.0, this still hasn't made it in, and one still has to do the manually steps of |
#3662 is about merging a single-level index with a mi
This is is about a multi-multi merge
The text was updated successfully, but these errors were encountered: