Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrame.join left_index right_index inverted #22449

Closed
mariano22 opened this issue Aug 21, 2018 · 10 comments · Fixed by #41712
Closed

DataFrame.join left_index right_index inverted #22449

mariano22 opened this issue Aug 21, 2018 · 10 comments · Fixed by #41712
Labels
Error Reporting Incorrect or improved errors from pandas good first issue Needs Tests Unit test(s) needed to prevent regressions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@mariano22
Copy link

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

df_left = pd.DataFrame(data=['X'],columns=['C'],index=[22])
df_right = pd.DataFrame(data=['X'],columns=['C'],index=[999])
merge = pd.merge(df_left,df_right,on=['C'], left_index=True)

print merge.index

Problem description

The copied code print a DataFrame where the key is 999. As I understand from the documentation where left_index=True the keys from the left DataFrame should be used as join keys.
My output:
Int64Index([999], dtype='int64')
Expected output:
Int64Index([22], dtype='int64')

INSTALLED VERSIONS ------------------ commit: None python: 2.7.12.final.0 python-bits: 64 OS: Linux OS-release: 4.15.0-32-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: None.None

pandas: 0.23.3
pytest: None
pip: 18.0
setuptools: 20.7.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Copy link
Contributor

I don't understand your expected output. As you say, left_index means you're joining the index [22] from left with the ['X'] from right, which correctly raises an error if typed like

In [17]: pd.merge(df_left, df_right, left_index=True, right_on="C")
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-17-b87fd30f8f6f> in <module>()
----> 1 pd.merge(df_left, df_right, left_index=True, right_on="C")

~/sandbox/pandas/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
     59                          right_index=right_index, sort=sort, suffixes=suffixes,
     60                          copy=copy, indicator=indicator,
---> 61                          validate=validate)
     62     return op.get_result()
     63

~/sandbox/pandas/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
    548         # validate the merge keys dtypes. We may need to coerce
    549         # to avoid incompat dtypes
--> 550         self._maybe_coerce_merge_keys()
    551
    552         # If argument passed to validate,

~/sandbox/pandas/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
    970             elif ((is_numeric_dtype(lk) and not is_bool_dtype(lk))
    971                     and not is_numeric_dtype(rk)):
--> 972                 raise ValueError(msg)
    973             elif (not is_numeric_dtype(lk)
    974                     and (is_numeric_dtype(rk) and not is_bool_dtype(rk))):

ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

I think the root issue is that you're specifying what you're joining on twice for the left dataframe, once with left and once with left_index. I'm not sure what the expected behavior is, but that case should perhaps raise.

@TomAugspurger TomAugspurger added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Error Reporting Incorrect or improved errors from pandas labels Aug 22, 2018
@TColl
Copy link

TColl commented Sep 6, 2019

I agree with @mariano22 - the left_index and right_index logic seems inverted in this specific use case where on is being used to select the merge columns instead of specifying columns for left and right individually.

I'm using this technique with an inner merge to combine two datasets, using the common columns across both datasets (on=[col for col in left.columns if col in right.columns]). I create a new df containing data combined from both sources, but retaining the keys from either left or right in order to subsequently modify some data in the original dataframes. In my case, passing left_index=True gives me the common records correctly, but with the keys preserved from right, whereas right_index=True gives the same common records but with key preservation from left.

In this use case I would expect left_index=True to preserve keys from left, and right_index=True to preserve keys from right, but the opposite is happening.

Passing both left_index=True and right_index=True at the same time gets even weirder, but that's another issue...

@TomAugspurger
Copy link
Contributor

Perhaps I'm misunderstanding, but I view

pd.merge(left, right, left_index=True, right_on="on")

as an alias for

pd.merge(left.reset_index(), right, left_on=left.index.name, right_on="on")

Are you proposing changing that? Am I missing something?

@TColl
Copy link

TColl commented Nov 4, 2019

I think we're not quite on the same page here.

on allows us to specify the merge columns to use in both dataframes via one argument, in this application we're using on instead of left_on and/or right_on.

In this situation, if left_index or right_index are included (but left_on and right_on are excluded), the behaviour mentioned in this issue occurs.

here's an example:

import pandas as pd

age_data = {
    'Name':['Ash','Bob','Charlie'],
    'ID':[1,2,3],
    'Age':[18,80,55]
}
height_data = {
    'Name':['Ash','Charlie','Derek'],
    'ID':[1,3,4],
    'Height':[140,162,180]
}

ages = pd.DataFrame(data=age_data, index=[1,2,3])
heights = pd.DataFrame(data=height_data, index=[91,92,93])

common_columns = ['Name', 'ID']
               
common_records_left_index = pd.merge(ages, heights, how='inner', on=common_columns, left_index=True)
common_records_right_index = pd.merge(ages, heights, how='inner', on=common_columns, right_index=True)

Here, we should end up with two dataframes which both contain the combined age and height data for Ash and Charlie (as they're the only records with both an age and a height provided), with index values as follows:

  • common_records_left_index should have index keys of 1 and 3 (preserved from the ages dataframe), and
  • common_records_right_index should have index keys of 91 and 93 (preserved from the heights dataframe)

However, the opposite case is true - left_index=True preserves the keys from the right dataframe during the merge, and right_index=True preserves the keys from the left dataframe.

@phofl
Copy link
Member

phofl commented Nov 21, 2020

The op raises MergeError, which is correct because of dtype issues, and

common_records_left_index = pd.merge(ages, heights, how='inner', on=common_columns, left_index=True)
common_records_right_index = pd.merge(ages, heights, how='inner', on=common_columns, right_index=True)

raises too, because on and right_index is not allowed. I think this is a closing candidate?

@TColl
Copy link

TColl commented Nov 23, 2020

I don't think this should be closed as the issue still persists - working through my example given above still yields the behaviour I documented and doesn't raise MergeError (at least in v.1.1.2)

If I've not been clear enough, please let me know and I'll do my best to explain in more detail 👍

@jreback
Copy link
Contributor

jreback commented Nov 23, 2020

@TColl this persists on master?

@mariano22
Copy link
Author

mariano22 commented Nov 23, 2020

I think that it can be closed.

Thanks @TomAugspurger for the explanaition, I was misunderstanding the documentation. I think that maybe @TColl was interpreting the same as I did: I thought left_index/right_index indicates which indexes you would use in the result dataframe (to construct result.index). But it's to indicate a join key (as on, left_on, rigth_on does). With this BugFix if you use 'on' + 'left_index' it fails because you are using two differents way of specifying the join key for the left data frame, am I right?

I still don't know how to specify which index to use to construct the result index. But @TColl can do:

import pandas as pd

age_data = {
    'Name':['Ash','Bob','Charlie'],
    'ID':[1,2,3],
    'Age':[18,80,55]
}
height_data = {
    'Name':['Ash','Charlie','Derek'],
    'ID':[1,3,4],
    'Height':[140,162,180]
}

ages = pd.DataFrame(data=age_data, index=[1,2,3])
heights = pd.DataFrame(data=height_data, index=[91,92,93])

common_columns = ['Name', 'ID']
               
common_records_left_index = pd.merge(ages.reset_index(), heights, how='inner', on=common_columns).set_index('index')
common_records_right_index = pd.merge(ages, heights.reset_index(), how='inner', on=common_columns).set_index('index')

I think the names are kind of confusing. Maybe instead of left_index/right_index I would had choosed something like on_left_index/on_right_index despite the fact is much more verbose. I saw this misunderstanding in many practitioners. But the documentation is clear.

@TColl
Copy link

TColl commented Nov 23, 2020

@jreback just confirming this has been fixed on master (sorry for being lazy and testing on an older version earlier!)

I agree this issue can now be closed, though I agree with @mariano22 that on_left_index would be a lot less confusing than left_index

@jreback
Copy link
Contributor

jreback commented Nov 23, 2020

would take a PR for a test that replicates the OP

@phofl phofl added good first issue Needs Tests Unit test(s) needed to prevent regressions labels Nov 23, 2020
@mroeschke mroeschke mentioned this issue May 29, 2021
15 tasks
@mroeschke mroeschke added this to the 1.3 milestone May 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas good first issue Needs Tests Unit test(s) needed to prevent regressions Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants