BUG: New feature allowing merging on combination of columns and index levels drops levels of index #20452
Comments
This is the expected behavior.
Rationale
When index levels are not included as
In [1]: df1.reset_index('abc').merge(df2.reset_index('abc'), on='abc')
Out[1]:
abc v1 v2
0 a 0 100
1 a 0 200
2 a 1 100
3 a 1 200
4 a 2 100
5 a 2 200
6 a 3 100
7 a 3 200
8 b 4 300
9 b 4 400
10 b 5 300
11 b 5 400
12 b 6 300
13 b 6 400
14 b 7 300
15 b 7 400
16 c 8 500
17 c 8 600
18 c 9 500
19 c 9 600
20 c 10 500
21 c 10 600
22 c 11 500
23 c 11 600
Now (in 0.23), when index levels are included as
In [2]: df1.merge(df2, on='abc')
Out[2]:
v1 v2
abc
a 0 100
a 0 200
a 1 100
a 1 200
a 2 100
a 2 200
a 3 100
a 3 200
b 4 300
b 4 400
b 5 300
b 5 400
b 6 300
b 6 400
b 7 300
b 7 400
c 8 500
c 8 600
c 9 500
c 9 600
c 10 500
c 10 600
c 11 500
c 11 600
Preserving all index levels (even those not referenced as
Documentation
In terms of documentation, there is a note in the sphinx docs for Merging on a combination of columns and index levels that says:
I think this is accurate, but it could be more explicit regarding the fate of the remaining index levels. Thanks for pointing out the errant New in version version in the documentation! I guess this one was missed when 0.22 turned into 0.23. Care to contribute some changes to the documentation that would have helped clarify the expected behavior for you?
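For readers who want to keep every index level, the reset_index route mentioned above can be sketched as follows. The frames below are an assumption: they are reconstructed from the outputs shown in this thread, since the original code sample is not reproduced here. The idea is to move all levels into columns, merge on the shared keys, and then restore the left frame's levels.

```python
import pandas as pd

# Assumed frames, reconstructed from the outputs quoted in this thread.
idx1 = pd.MultiIndex.from_product(
    [["a", "b", "c"], ["x", "y"], [1, 2]], names=["abc", "xy", "num"]
)
df1 = pd.DataFrame({"v1": list(range(12))}, index=idx1)

idx2 = pd.MultiIndex.from_product(
    [["a", "b", "c"], ["x", "y"]], names=["abc", "xy"]
)
df2 = pd.DataFrame({"v2": [100, 200, 300, 400, 500, 600]}, index=idx2)

# Move all index levels into columns, merge on the shared keys,
# then restore the three levels of the left frame as the index.
merged = (
    df1.reset_index()
    .merge(df2.reset_index(), on=["abc", "xy"])
    .set_index(["abc", "xy", "num"])
)

print(list(merged.index.names))
```

Because the levels travel through the merge as ordinary columns, none of them is dropped, at the cost of an explicit set_index at the end.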
I agree with @jmmease here.
@jreback I think this is about whether levels of the row axis
So would we consider an option to
What @jmmease is saying is that when you do a
So with 0.23 development, here is an interesting behavior for
In [7]: df1.join(df2, on=['abc','xy'], how='inner')
Out[7]:
v1 v2
num
a x 1 0 100
2 1 100
y 1 2 200
2 3 200
b x 1 4 300
2 5 300
y 1 6 400
2 7 400
c x 1 8 500
2 9 500
y 1 10 600
2 11 600
In [8]: df1.join(df2, on='xy', how='inner')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
#### Traceback omitted
ValueError: len(left_on) must equal the number of levels in the index of "right"
In [9]: df2.join(df1, on=['abc','xy'], how='inner')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
### Traceback omitted
ValueError: len(left_on) must equal the number of levels in the index of "right"
So the first
The second
So explaining all of this behavior in the docs is a bit of a challenge.
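One way to sidestep the ValueError from In [8] is to merge on plain columns instead of calling join with a partial key. This is only a sketch under assumptions: the frames are reconstructed from the outputs in this thread, and the suffixes are illustrative names, not anything the thread prescribes.

```python
import pandas as pd

# Assumed frames, reconstructed from the outputs quoted in this thread.
idx1 = pd.MultiIndex.from_product(
    [["a", "b", "c"], ["x", "y"], [1, 2]], names=["abc", "xy", "num"]
)
df1 = pd.DataFrame({"v1": list(range(12))}, index=idx1)

idx2 = pd.MultiIndex.from_product(
    [["a", "b", "c"], ["x", "y"]], names=["abc", "xy"]
)
df2 = pd.DataFrame({"v2": [100, 200, 300, 400, 500, 600]}, index=idx2)

# With every level turned into a column, a merge on the single key "xy"
# is an ordinary column merge; the clashing "abc" columns get suffixes.
by_xy = df1.reset_index().merge(
    df2.reset_index(), on="xy", suffixes=("_left", "_right")
)

print(by_xy.columns.tolist())
print(len(by_xy))
```

Each of the 12 rows of df1 matches the three rows of df2 that share its "xy" value, so the result is a many-to-many merge with a default integer index.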
I have a fix for the bug I reported in the previous comment about
Code Sample, a copy-pastable example if possible
Problem description
It seems that the new feature implemented in #17484 that allows merging on a combination of columns and index levels can drop index levels, which is really non-intuitive. In the first example, the index level named "num" gets dropped, while in the last example, both "abc" and "xy" are dropped.
If this is the desired behavior, then it needs to be carefully documented.
N.B. There is also an error in the docs in merging.rst, which says this feature was introduced in v0.22, but it will be introduced in v0.23.
I'm guessing @jmmease will need to look at this.
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None
pandas: 0.23.0.dev0+657.g01882ba5b
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: 0.25.1
numpy: 1.14.1
scipy: 1.0.0
pyarrow: 0.8.0
xarray: None
IPython: 6.2.1
sphinx: 1.7.1
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: 1.5.1
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.2.0
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: 1.0.2
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.5
pymysql: 0.8.0
psycopg2: None
jinja2: 2.10
s3fs: 0.1.3
fastparquet: None
pandas_gbq: None
pandas_datareader: None