DataFrame.join left_index right_index inverted #22449

mariano22 · 2018-08-21T18:05:13Z

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df_left = pd.DataFrame(data=['X'],columns=['C'],index=[22])
df_right = pd.DataFrame(data=['X'],columns=['C'],index=[999])
merge = pd.merge(df_left,df_right,on=['C'], left_index=True)

print merge.index

Problem description

The copied code print a DataFrame where the key is 999. As I understand from the documentation where left_index=True the keys from the left DataFrame should be used as join keys.
My output:
Int64Index([999], dtype='int64')
Expected output:
Int64Index([22], dtype='int64')

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.15.0-32-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None

pandas: 0.23.3
pytest: None
pip: 18.0
setuptools: 20.7.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

TomAugspurger · 2018-08-22T15:02:05Z

I don't understand your expected output. As you say, left_index means you're joining the index [22] from left with the ['X'] from right, which correctly raises an error if typed like

In [17]: pd.merge(df_left, df_right, left_index=True, right_on="C")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-b87fd30f8f6f> in <module>()
----> 1 pd.merge(df_left, df_right, left_index=True, right_on="C")

~/sandbox/pandas/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     59                          right_index=right_index, sort=sort, suffixes=suffixes,
     60                          copy=copy, indicator=indicator,
---> 61                          validate=validate)
     62     return op.get_result()
     63

~/sandbox/pandas/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    548         # validate the merge keys dtypes. We may need to coerce
    549         # to avoid incompat dtypes
--> 550         self._maybe_coerce_merge_keys()
    551
    552         # If argument passed to validate,

~/sandbox/pandas/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
    970             elif ((is_numeric_dtype(lk) and not is_bool_dtype(lk))
    971                     and not is_numeric_dtype(rk)):
--> 972                 raise ValueError(msg)
    973             elif (not is_numeric_dtype(lk)
    974                     and (is_numeric_dtype(rk) and not is_bool_dtype(rk))):

ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

I think the root issue is that you're specifying what you're joining on twice for the left dataframe, once with left and once with left_index. I'm not sure what the expected behavior is, but that case should perhaps raise.

TColl · 2019-09-06T09:33:39Z

I agree with @mariano22 - the left_index and right_index logic seems inverted in this specific use case where on is being used to select the merge columns instead of specifying columns for left and right individually.

I'm using this technique with an inner merge to combine two datasets, using the common columns across both datasets (on=[col for col in left.columns if col in right.columns]). I create a new df containing data combined from both sources, but retaining the keys from either left or right in order to subsequently modify some data in the original dataframes. In my case, passing left_index=True gives me the common records correctly, but with the keys preserved from right, whereas right_index=True gives the same common records but with key preservation from left.

In this use case I would expect left_index=True to preserve keys from left, and right_index=True to preserve keys from right, but the opposite is happening.

Passing both left_index=True and right_index=True at the same time gets even weirder, but that's another issue...

TomAugspurger · 2019-09-19T21:16:06Z

Perhaps I'm misunderstanding, but I view

pd.merge(left, right, left_index=True, right_on="on")

as an alias for

pd.merge(left.reset_index(), right, left_on=left.index.name, right_on="on")

Are you proposing changing that? Am I missing something?

TColl · 2019-11-04T15:53:40Z

I think we're not quite on the same page here.

on allows us to specify the merge columns to use in both dataframes via one argument, in this application we're using on instead of left_on and/or right_on.

In this situation, if left_index or right_index are included (but left_on and right_on are excluded), the behaviour mentioned in this issue occurs.

here's an example:

import pandas as pd

age_data = {
    'Name':['Ash','Bob','Charlie'],
    'ID':[1,2,3],
    'Age':[18,80,55]
}
height_data = {
    'Name':['Ash','Charlie','Derek'],
    'ID':[1,3,4],
    'Height':[140,162,180]
}

ages = pd.DataFrame(data=age_data, index=[1,2,3])
heights = pd.DataFrame(data=height_data, index=[91,92,93])

common_columns = ['Name', 'ID']
               
common_records_left_index = pd.merge(ages, heights, how='inner', on=common_columns, left_index=True)
common_records_right_index = pd.merge(ages, heights, how='inner', on=common_columns, right_index=True)

Here, we should end up with two dataframes which both contain the combined age and height data for Ash and Charlie (as they're the only records with both an age and a height provided), with index values as follows:

common_records_left_index should have index keys of 1 and 3 (preserved from the ages dataframe), and
common_records_right_index should have index keys of 91 and 93 (preserved from the heights dataframe)

However, the opposite case is true - left_index=True preserves the keys from the right dataframe during the merge, and right_index=True preserves the keys from the left dataframe.

phofl · 2020-11-21T23:49:46Z

The op raises MergeError, which is correct because of dtype issues, and

common_records_left_index = pd.merge(ages, heights, how='inner', on=common_columns, left_index=True)
common_records_right_index = pd.merge(ages, heights, how='inner', on=common_columns, right_index=True)

raises too, because on and right_index is not allowed. I think this is a closing candidate?

TColl · 2020-11-23T08:31:26Z

I don't think this should be closed as the issue still persists - working through my example given above still yields the behaviour I documented and doesn't raise MergeError (at least in v.1.1.2)

If I've not been clear enough, please let me know and I'll do my best to explain in more detail 👍

jreback · 2020-11-23T12:08:00Z

@TColl this persists on master?

mariano22 · 2020-11-23T12:34:59Z

I think that it can be closed.

Thanks @TomAugspurger for the explanaition, I was misunderstanding the documentation. I think that maybe @TColl was interpreting the same as I did: I thought left_index/right_index indicates which indexes you would use in the result dataframe (to construct result.index). But it's to indicate a join key (as on, left_on, rigth_on does). With this BugFix if you use 'on' + 'left_index' it fails because you are using two differents way of specifying the join key for the left data frame, am I right?

I still don't know how to specify which index to use to construct the result index. But @TColl can do:

import pandas as pd

age_data = {
    'Name':['Ash','Bob','Charlie'],
    'ID':[1,2,3],
    'Age':[18,80,55]
}
height_data = {
    'Name':['Ash','Charlie','Derek'],
    'ID':[1,3,4],
    'Height':[140,162,180]
}

ages = pd.DataFrame(data=age_data, index=[1,2,3])
heights = pd.DataFrame(data=height_data, index=[91,92,93])

common_columns = ['Name', 'ID']
               
common_records_left_index = pd.merge(ages.reset_index(), heights, how='inner', on=common_columns).set_index('index')
common_records_right_index = pd.merge(ages, heights.reset_index(), how='inner', on=common_columns).set_index('index')

I think the names are kind of confusing. Maybe instead of left_index/right_index I would had choosed something like on_left_index/on_right_index despite the fact is much more verbose. I saw this misunderstanding in many practitioners. But the documentation is clear.

TColl · 2020-11-23T16:09:14Z

@jreback just confirming this has been fixed on master (sorry for being lazy and testing on an older version earlier!)

I agree this issue can now be closed, though I agree with @mariano22 that on_left_index would be a lot less confusing than left_index

jreback · 2020-11-23T16:22:29Z

would take a PR for a test that replicates the OP

TomAugspurger added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Error Reporting Incorrect or improved errors from pandas labels Aug 22, 2018

jreback mentioned this issue Nov 21, 2020

BUG: Fix inconsistent ordering between left and right in merge #37406

Merged

5 tasks

phofl added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Nov 23, 2020

mroeschke mentioned this issue May 29, 2021

TST: More old issues #41712

Merged

15 tasks

mroeschke added this to the 1.3 milestone May 29, 2021

jreback closed this as completed in #41712 May 31, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DataFrame.join left_index right_index inverted #22449

DataFrame.join left_index right_index inverted #22449

mariano22 commented Aug 21, 2018

TomAugspurger commented Aug 22, 2018

TColl commented Sep 6, 2019

TomAugspurger commented Sep 19, 2019

TColl commented Nov 4, 2019

phofl commented Nov 21, 2020

TColl commented Nov 23, 2020

jreback commented Nov 23, 2020

mariano22 commented Nov 23, 2020 •

edited

Loading

TColl commented Nov 23, 2020 •

edited

Loading

jreback commented Nov 23, 2020

DataFrame.join left_index right_index inverted #22449

DataFrame.join left_index right_index inverted #22449

Comments

mariano22 commented Aug 21, 2018

Code Sample, a copy-pastable example if possible

Problem description

TomAugspurger commented Aug 22, 2018

TColl commented Sep 6, 2019

TomAugspurger commented Sep 19, 2019

TColl commented Nov 4, 2019

phofl commented Nov 21, 2020

TColl commented Nov 23, 2020

jreback commented Nov 23, 2020

mariano22 commented Nov 23, 2020 • edited Loading

TColl commented Nov 23, 2020 • edited Loading

jreback commented Nov 23, 2020

mariano22 commented Nov 23, 2020 •

edited

Loading

TColl commented Nov 23, 2020 •

edited

Loading