Merge with how='inner' does not always preserve the order of the left keys #18776
Comments
I just noticed the same issue with pandas 0.22.0: The expected behavior would be for rows ordering of This behavior does not match the documentation: Lines 147 to 148 in a00154d
To me, the documented behavior is intuitive and the actual behavior should be updated?
To expand on this, the issue appears to occur when the merge key is non-unique.
Setup:
In [2]: pd.__version__
Out[2]: '0.24.0.dev0+88.gdefdb34'
In [3]: df = pd.DataFrame({'key': [7, 6, 8, 6], 'other': ['foo', 'bar', 'baz', 'qux']})
In [4]: df
Out[4]:
key other
0 7 foo
1 6 bar
2 8 baz
3 6 qux
Non-unique merge key causes improper ordering:
In [5]: pd.merge(df, df, how='inner', on='key')
Out[5]:
key other_x other_y
0 7 foo foo
1 6 bar bar
2 6 bar qux
3 6 qux bar
4 6 qux qux
5 8 baz baz
Restricting to a unique portion seems fine:
In [6]: pd.merge(df.loc[:2], df.loc[:2], how='inner', on='key')
Out[6]:
key other_x other_y
0 7 foo foo
1 6 bar bar
2 8 baz baz
Using how='left':
In [7]: pd.merge(df, df, how='left', on='key')
Out[7]:
key other_x other_y
0 7 foo foo
1 6 bar bar
2 6 bar qux
3 8 baz baz
4 6 qux bar
5 6 qux qux
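Until the ordering itself is addressed, one possible workaround is to record each left row's position before merging and sort on it afterwards. The following is only a sketch under assumptions: `inner_merge_keep_left_order` and the `_left_pos` helper column are hypothetical names chosen for illustration, not part of pandas, and the helper column name is assumed not to clash with an existing column.

```python
import numpy as np
import pandas as pd


def inner_merge_keep_left_order(left, right, **merge_kwargs):
    """Inner merge, then explicitly restore the row order of `left`.

    Hypothetical helper for illustration only; not a pandas API.
    """
    pos = "_left_pos"  # assumed not to exist in either frame
    # Tag every left row with its original position.
    tagged = left.assign(**{pos: np.arange(len(left))})
    merged = tagged.merge(right, how="inner", **merge_kwargs)
    # A stable sort keeps the merge's relative order within each left row.
    merged = merged.sort_values(pos, kind="mergesort")
    return merged.drop(columns=pos).reset_index(drop=True)


df = pd.DataFrame({"key": [7, 6, 8, 6], "other": ["foo", "bar", "baz", "qux"]})
result = inner_merge_keep_left_order(df, df, on="key")
print(result)
```

With the frame from the setup above, the keys of `result` come out in the same order as in the `how='left'` output, because the sort is on the recorded left position rather than on whatever order the inner merge happens to produce.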
@jschendel
df = pd.DataFrame([['A', 1],
['B', 2],
['B', 3],
['A', 4]
], columns=['Col1', 'Col2'])
Col1 Col2
0 A 1
1 B 2
2 B 3
3 A 4
df['Col1'] = pd.Categorical(df.Col1, categories=['A','B'], ordered=True)
pd.merge(df, df, on='Col1', how='inner')
Col1 Col2_x Col2_y
0 A 1 1
1 A 1 4
2 A 4 1
3 A 4 4
4 B 2 2
5 B 2 3
6 B 3 2
7 B 3 3
will produce a merge with all 'A' first.
df = pd.DataFrame([['B', 2],
['A', 1],
['B', 3],
['A', 4]
], columns=['Col1', 'Col2'])
Col1 Col2
0 B 2
1 A 1
2 B 3
3 A 4
df['Col1'] = pd.Categorical(df.Col1, categories=['A','B'], ordered=True)
pd.merge(df, df, on='Col1', how='inner')
Col1 Col2_x Col2_y
0 B 2 2
1 B 2 3
2 B 3 2
3 B 3 3
4 A 1 1
5 A 1 4
6 A 4 1
7 A 4 4
will produce a merge with all the 'B' first, regardless of the "Order" of the categorical data (or any ordered type, e.g. integer).
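If the category order is what is actually wanted in the result, it seems safer to ask for it explicitly than to rely on the merge. A minimal sketch, reusing the second frame above and assuming a stable sort on the key column is acceptable (`by_category` is just an illustrative variable name):

```python
import pandas as pd

df = pd.DataFrame([["B", 2], ["A", 1], ["B", 3], ["A", 4]],
                  columns=["Col1", "Col2"])
df["Col1"] = pd.Categorical(df["Col1"], categories=["A", "B"], ordered=True)

merged = pd.merge(df, df, on="Col1", how="inner")

# Sorting the key column orders rows by category order ('A' before 'B');
# mergesort is stable, so rows within one category keep their merged order.
by_category = merged.sort_values("Col1", kind="mergesort").reset_index(drop=True)
print(by_category)
```

This does not change what the merge returns; it only makes the final ordering independent of it.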
Yes, looks like I was a bit premature attributing the issue to non-uniqueness.
This still appears to be an issue.
@rickbeeloo hence the open status; pull requests to patch are welcome.
<https://cdn.mbtace.com/archive/20220325.zip> Ensures that, when a transfers or pathways validation error occurs, the correct row's information is displayed in the error message. In Pandas, inner merges can result in non-documented reordering: See pandas-dev/pandas#18776 to follow the open issue.
see also:
https://stackoverflow.com/questions/47793302/python-pandas-dataframe-merge-strange-sort-order-for-how-inner
I do not understand the sort order for the Python pandas DataFrame merge function with how="inner".
Example:
Result:
I would expect that for how="inner" the order of the resulting rows with 6 z w and 6 z z would be the same as with how="left", as the documentation https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html says:
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-103-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: None
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.13
pymysql: 0.7.9.None
psycopg2: None
jinja2: None
s3fs: None
pandas_gbq: None
pandas_datareader: None