a.merge(b, left_on='name', right_index=True) merges on index whereas a.merge(b, on='name') where name is the index of b merges on column #32067

arobrien · 2020-02-17T23:50:58Z

Code Sample

a = pd.DataFrame({'a':[10,11,15,21],'b':[7,6,4,2]}).set_index('a')
b = pd.DataFrame({'b':[7,6,4,2],'c':['a','b','c','d']}).set_index('b')
a.merge(b, on='b')

Problem description

If I want to merge two DataFrame's, where column 'name' of a matches index named 'name' on b, with the old syntax I needed to write a.merge(b, left_on='name', right_index=True). This would produce a new DataFrame with the index of a retained as an index, the shared column/index retained as a column, and other columns of a and b also retained (as shown in expected output).

With recent versions, however, I could just use a.merge(b, on='name'), but it drops the index of a (merging on column mode instead of merging on index).

The documentation states (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html):

The join is done on columns or indexes. If joining columns on columns, the
DataFrame indexes will be ignored. Otherwise if joining indexes on indexes
or indexes on a column or columns, the index will be passed on.

Perhaps this could be clarified in the doc, but the behaviour is at least inconsistent. For me it is more intuitive to perform an index merge internally in this case.

Expected Output

	b	c
a
10	7	a
11	6	b
15	4	c
21	2	d

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Windows
OS-release : 7
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : None
hypothesis : None
sphinx : 2.4.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.8.4 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

The text was updated successfully, but these errors were encountered:

arobrien · 2020-02-18T00:19:22Z

This example also merges on index:

a = pd.DataFrame({'a':[10,11,15,21],'b':[7,6,4,2]}).set_index('a')
b = pd.DataFrame({'a':[10,11,15,21],'c':['a','b','c','d']}).set_index('a')
a.merge(b, on='a')

MarcoGorelli · 2020-02-24T11:53:23Z

Thanks @arobrien for the report

Looking into this - in pandas/core/reshape/merge.py, when we get here

            (left_indexer, right_indexer) = self._get_join_indexers()

            if self.right_index:
                if len(self.left) > 0:
                    join_index = self._create_join_index(
                        self.left.index,
                        self.right.index,
                        left_indexer,
                        right_indexer,
                        how="right",
                    )
                else:
                    join_index = self.right.index.take(right_indexer)
                    left_indexer = np.array([-1] * len(join_index))
            elif self.left_index:
                if len(self.right) > 0:
                    join_index = self._create_join_index(
                        self.right.index,
                        self.left.index,
                        right_indexer,
                        left_indexer,
                        how="left",
                    )
                else:
                    join_index = self.left.index.take(left_indexer)
                    right_indexer = np.array([-1] * len(join_index))
            else:
                join_index = Index(np.arange(len(left_indexer)))

we have self.right_index is False, and so the join index is set to Index(np.arange(len(left_indexer)))

phofl · 2020-05-28T21:17:23Z

@MarcoGorelli Is this the desired behavior? I would have expected the same as @arobrien

MarcoGorelli · 2020-05-30T10:31:13Z

@phofl I don't think it's expected, IIRC I'd left my comment above because I thought it might be useful to anyone interested in debugging this

MarcoGorelli added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Feb 18, 2020

mroeschke added the Bug label Jul 28, 2021

topper-123 mentioned this issue May 4, 2023

BUG: Unexpected Behaviour pd.merge using left_on and right_index=True with single level MultiIndex #52331

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

a.merge(b, left_on='name', right_index=True) merges on index whereas a.merge(b, on='name') where name is the index of b merges on column #32067

a.merge(b, left_on='name', right_index=True) merges on index whereas a.merge(b, on='name') where name is the index of b merges on column #32067

arobrien commented Feb 17, 2020

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

arobrien commented Feb 18, 2020

MarcoGorelli commented Feb 24, 2020

phofl commented May 28, 2020

MarcoGorelli commented May 30, 2020

a.merge(b, left_on='name', right_index=True) merges on index whereas a.merge(b, on='name') where name is the index of b merges on column #32067

a.merge(b, left_on='name', right_index=True) merges on index whereas a.merge(b, on='name') where name is the index of b merges on column #32067

Comments

arobrien commented Feb 17, 2020

Code Sample

Problem description

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line] INSTALLED VERSIONS

arobrien commented Feb 18, 2020

MarcoGorelli commented Feb 24, 2020

phofl commented May 28, 2020

MarcoGorelli commented May 30, 2020

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS