Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge with how='inner' does not always preserve the order of the left keys #18776

Closed
haimivan opened this issue Dec 14, 2017 · 8 comments · Fixed by #54611
Closed

Merge with how='inner' does not always preserve the order of the left keys #18776

haimivan opened this issue Dec 14, 2017 · 8 comments · Fixed by #54611
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@haimivan
Copy link

see also:
https://stackoverflow.com/questions/47793302/python-pandas-dataframe-merge-strange-sort-order-for-how-inner

I do not understand the sort order for Python Pandas DataFrame merge function with how="inner". Example:

import pandas as pd

df2 = pd.DataFrame({'a': (6, 7, 8, 6), 'b': ("w", "x", "y", "z")})
print(df2)
print("left:")
dfMerge2 = pd.merge(df2, df2, on='a', how="left")
print(dfMerge2)
dfMerge = pd.merge(df2, df2, on='a', how="inner")
print("inner:")
print(dfMerge)

Result:

   a  b
0  6  w
1  7  x
2  8  y
3  6  z
left:
   a b_x b_y
0  6   w   w
1  6   w   z
2  7   x   x
3  8   y   y
4  6   z   w
5  6   z   z
inner:
   a b_x b_y
0  6   w   w
1  6   w   z
2  6   z   w
3  6   z   z
4  7   x   x
5  8   y   y

I would expect that for how="inner" the order of the resulting rows with

6 z w and

6 z z

would be the same as with how="left", as the documentation https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html says:

  • left: use only keys from left frame, similar to a SQL left outer join; preserve key order
  • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-103-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Usage Question and removed Usage Question labels Dec 15, 2017
@dhimmel
Copy link
Contributor

dhimmel commented Jan 8, 2018

I just noticed the same issue with pandas 0.22.0:

pandas-inner-merge-order

The expected behavior would be for rows ordering of affil_map_df to be preserved. Instead, it seems that instead the order of affiliation_df or perhaps the sorted affiliation column was used.

This behavior does not match the documentation:

pandas/pandas/core/frame.py

Lines 147 to 148 in a00154d

* inner: use intersection of keys from both frames, similar to a SQL inner
join; preserve the order of the left keys

To me, the documented behavior is intuitive and the actual behavior should be updated?

@jschendel
Copy link
Member

To expand on this, the issue appears to occur when the merge key is non-unique.

Setup:

In [2]: pd.__version__
Out[2]: '0.24.0.dev0+88.gdefdb34'

In [3]: df = pd.DataFrame({'key': [7, 6, 8, 6], 'other': ['foo', 'bar', 'baz', 'qux']})

In [4]: df
Out[4]:
   key other
0    7   foo
1    6   bar
2    8   baz
3    6   qux

Non-unique merge key causes improper ordering:

In [5]: pd.merge(df, df, how='inner', on='key')
Out[5]:
   key other_x other_y
0    7     foo     foo
1    6     bar     bar
2    6     bar     qux
3    6     qux     bar
4    6     qux     qux
5    8     baz     baz

Restricting to a unique portion seems fine:

In [6]: pd.merge(df.loc[:2], df.loc[:2], how='inner', on='key')
Out[6]:
   key other_x other_y
0    7     foo     foo
1    6     bar     bar
2    8     baz     baz

Using how='left' maintains proper order:

In [7]: pd.merge(df, df, how='left', on='key')
Out[7]:
   key other_x other_y
0    7     foo     foo
1    6     bar     bar
2    6     bar     qux
3    8     baz     baz
4    6     qux     bar
5    6     qux     qux

@jschendel jschendel changed the title Python Pandas DataFrame Merge: Strange Sort Order for How = Inner Merge with how='inner' and non-unique join key does not preserve the order of the left keys Jun 12, 2018
@TartySG
Copy link

TartySG commented Jun 12, 2018

@jschendel
It has more to do with the order in which the merge encounters the values rather than "non-uniqueness".
The order is the same as the first time it encounters the join key in the column.
So a unique value will only be encountered once, hence the absence of noticeable change in order.

df = pd.DataFrame([['A', 1],
                   ['B', 2],
                   ['B', 3],
                   ['A', 4]
                  ], columns=['Col1', 'Col2'])

	Col1	Col2
0	A	1
1	B	2
2	B	3
3	A	4

df['Col1'] = pd.Categorical(df.Col1, categories=['A','B'], ordered=True)
pd.merge(df, df, on='Col1', how='inner')

	Col1	Col2_x	Col2_y
0	A	1	1
1	A	1	4
2	A	4	1
3	A	4	4
4	B	2	2
5	B	2	3
6	B	3	2
7	B	3	3

will produce a merge with all 'A' first.
on the other hand :

df = pd.DataFrame([['B', 2],
                   ['A', 1],
                   ['B', 3],
                   ['A', 4]
                  ], columns=['Col1', 'Col2'])

	Col1	Col2
0	B	2
1	A	1
2	B	3
3	A	4

df['Col1'] = pd.Categorical(df.Col1, categories=['A','B'], ordered=True)
pd.merge(df, df, on='Col1', how='inner')

	Col1	Col2_x	Col2_y
0	B	2	2
1	B	2	3
2	B	3	2
3	B	3	3
4	A	1	1
5	A	1	4
6	A	4	1
7	A	4	4

will produce a merge with all the 'B' first, regardless of the "Order" of the categorical data (or any ordered type e.g. interger).

@jschendel
Copy link
Member

jschendel commented Jun 12, 2018

Yes, looks like I was a bit premature attributing the issue to non-uniquness.

@jschendel jschendel changed the title Merge with how='inner' and non-unique join key does not preserve the order of the left keys Merge with how='inner' does not always preserve the order of the left keys Jun 12, 2018
@asishm
Copy link
Contributor

asishm commented Dec 25, 2020

@phofl Can this be fixed with similar changes as PR #37406 ?

@phofl
Copy link
Member

phofl commented Dec 29, 2020

@asishm No, the Cython function perfoming the actual inner join does not support sort. In #37406 the sort keyword was not passed through so this was an easy fix. Not quitte sure why this was not implemented.

@mroeschke mroeschke added the Bug label Jun 12, 2021
@rickbeeloo
Copy link

This still appears to be an issue

@jreback
Copy link
Contributor

jreback commented Jan 20, 2022

@rickbeeloo hence the open status

pull requests to patch are welcome

Sky9x pushed a commit to Sky9x/mbta-gtfs-data that referenced this issue Sep 9, 2024
<https://cdn.mbtace.com/archive/20220325.zip>

Ensures that, when a transfers or pathways validation error occurs, the correct row's information is displayed in the error message. In Pandas, inner merges can result in non-documented reordering: See pandas-dev/pandas#18776 to follow the open issue.
Sky9x pushed a commit to Sky9x/mbta-gtfs-data that referenced this issue Sep 17, 2024
Ensures that, when a transfers or pathways validation error occurs, the correct row's information is displayed in the error message. In Pandas, inner merges can result in non-documented reordering: See pandas-dev/pandas#18776 to follow the open issue.

https://cdn.mbtace.com/archive/20220325.zip
Sky9x pushed a commit to Sky9x/mbta-gtfs-data that referenced this issue Dec 1, 2024
Ensures that, when a transfers or pathways validation error occurs, the correct row's information is displayed in the error message. In Pandas, inner merges can result in non-documented reordering: See pandas-dev/pandas#18776 to follow the open issue.

https://cdn.mbtace.com/archive/20220325.zip
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants