DataFrame.join left_index right_index inverted #22449
Comments
I don't understand your expected output. As you say,
In [17]: pd.merge(df_left, df_right, left_index=True, right_on="C")
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-17-b87fd30f8f6f> in <module>()
----> 1 pd.merge(df_left, df_right, left_index=True, right_on="C")
~/sandbox/pandas/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate)
59 right_index=right_index, sort=sort, suffixes=suffixes,
60 copy=copy, indicator=indicator,
---> 61 validate=validate)
62 return op.get_result()
63
~/sandbox/pandas/pandas/core/reshape/merge.py in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate)
548 # validate the merge keys dtypes. We may need to coerce
549 # to avoid incompat dtypes
--> 550 self._maybe_coerce_merge_keys()
551
552 # If argument passed to validate,
~/sandbox/pandas/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self)
970 elif ((is_numeric_dtype(lk) and not is_bool_dtype(lk))
971 and not is_numeric_dtype(rk)):
--> 972 raise ValueError(msg)
973 elif (not is_numeric_dtype(lk)
974 and (is_numeric_dtype(rk) and not is_bool_dtype(rk))):
ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
I think the root issue is that you're specifying what you're joining on twice for the left dataframe, once with |
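For readers who want to see the dtype mismatch from the traceback in isolation, here is a minimal sketch. The frame contents are assumptions (only the names `df_left`, `df_right` and the column `C` appear in the thread); the point is that joining an integer index against a string column is rejected.

```python
import pandas as pd

# Assumed frames: an integer index on the left, a string column "C" on the right.
df_left = pd.DataFrame({"C": ["X"]}, index=[22])
df_right = pd.DataFrame({"C": ["X"]}, index=[999])

try:
    # left_index=True makes the left *index* (int64) the join key,
    # while right_on="C" makes the right *column* (strings) the join key.
    pd.merge(df_left, df_right, left_index=True, right_on="C")
    raised = False
except ValueError:
    raised = True

# Joining column against column works; the result gets a fresh default index.
merged = pd.merge(df_left, df_right, on="C")
```

Merging on the column from both sides avoids mixing an index key with a column key, at the cost of discarding both original indexes.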
I agree with @mariano22 - the I'm using this technique with an inner merge to combine two datasets, using the common columns across both datasets ( In this use case I would expect Passing both |
Perhaps I'm misunderstanding, but I view pd.merge(left, right, left_index=True, right_on="on") as an alias for pd.merge(left.reset_index(), right, left_on=left.index.name, right_on="on") Are you proposing changing that? Am I missing something? |
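To make the "alias" reading concrete, here is a minimal sketch with hypothetical frames (`left`, `right` and the index name `key` are assumptions, not taken from the report). It compares only the matched row values, since the two calls need not produce the same result index or the same set of columns.

```python
import pandas as pd

# Hypothetical frames for illustration only.
left = pd.DataFrame({"a": [10, 20, 30]}, index=pd.Index([1, 2, 3], name="key"))
right = pd.DataFrame({"on": [2, 3, 4], "b": ["x", "y", "z"]})

# Join the left index against the right column "on".
via_index = pd.merge(left, right, left_index=True, right_on="on")

# Same join keys, but with the left index first turned into a column.
via_column = pd.merge(
    left.reset_index(), right, left_on=left.index.name, right_on="on"
)

# Compare the matched rows only, ignoring the result index and the extra "key" column.
cols = ["a", "on", "b"]
rows_via_index = via_index[cols].sort_values("on").reset_index(drop=True)
rows_via_column = via_column[cols].sort_values("on").reset_index(drop=True)
same_rows = rows_via_index.equals(rows_via_column)
```

In both calls the left index acts as a join key; neither call is asked which index the result should carry.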
I think we're not quite on the same page here.
In this situation, if here's an example:
import pandas as pd
age_data = {
'Name':['Ash','Bob','Charlie'],
'ID':[1,2,3],
'Age':[18,80,55]
}
height_data = {
'Name':['Ash','Charlie','Derek'],
'ID':[1,3,4],
'Height':[140,162,180]
}
ages = pd.DataFrame(data=age_data, index=[1,2,3])
heights = pd.DataFrame(data=height_data, index=[91,92,93])
common_columns = ['Name', 'ID']
common_records_left_index = pd.merge(ages, heights, how='inner', on=common_columns, left_index=True)
common_records_right_index = pd.merge(ages, heights, how='inner', on=common_columns, right_index=True)
Here, we should end up with two dataframes which both contain the combined age and height data for Ash and Charlie (as they're the only records with both an age and a height provided), with index values as follows:
However, the opposite case is true - |
The op raises MergeError, which is correct because of dtype issues, and
raises too, because on and right_index is not allowed. I think this is a closing candidate? |
I don't think this should be closed as the issue still persists - working through my example given above still yields the behaviour I documented and doesn't raise If I've not been clear enough, please let me know and I'll do my best to explain in more detail 👍 |
@TColl this persists on master? |
I think that it can be closed. Thanks @TomAugspurger for the explanation, I was misunderstanding the documentation. I think that maybe @TColl was interpreting it the same way I did: I thought left_index/right_index indicated which index would be used in the result dataframe (to construct result.index). But they indicate a join key (as on, left_on and right_on do). With this bug fix, if you use 'on' + 'left_index' it fails because you are using two different ways of specifying the join key for the left data frame, am I right? I still don't know how to specify which index to use to construct the result index. But @TColl can do:
I think the names are kind of confusing. Maybe instead of left_index/right_index I would have chosen something like on_left_index/on_right_index, despite the fact that it is much more verbose. I have seen this misunderstanding in many practitioners. But the documentation is clear. |
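On the open question of choosing the result index, one possible workaround (a sketch, not necessarily what was suggested above) is to carry the wanted index through the merge as an ordinary column and restore it afterwards. The data is the `ages`/`heights` example from earlier in the thread; the column name `orig_index` is a hypothetical placeholder.

```python
import pandas as pd

ages = pd.DataFrame(
    {"Name": ["Ash", "Bob", "Charlie"], "ID": [1, 2, 3], "Age": [18, 80, 55]},
    index=[1, 2, 3],
)
heights = pd.DataFrame(
    {"Name": ["Ash", "Charlie", "Derek"], "ID": [1, 3, 4], "Height": [140, 162, 180]},
    index=[91, 92, 93],
)
common_columns = ["Name", "ID"]

# Keep the LEFT index: move it into a column, merge on the shared columns,
# then put it back as the index of the result.
left_keyed = ages.rename_axis("orig_index").reset_index()
keep_left = left_keyed.merge(heights, how="inner", on=common_columns).set_index("orig_index")

# Keep the RIGHT index: same idea, applied to the right frame.
right_keyed = heights.rename_axis("orig_index").reset_index()
keep_right = ages.merge(right_keyed, how="inner", on=common_columns).set_index("orig_index")
```

Here the join keys are stated once (`on=common_columns`) and the choice of result index is made explicitly, so `left_index`/`right_index` are not needed at all.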
@jreback just confirming this has been fixed on master (sorry for being lazy and testing on an older version earlier!) I agree this issue can now be closed, though I agree with @mariano22 that |
would take a PR for a test that replicates the OP |
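A rough sketch of what such a test could check, assuming a pandas version that includes the fix discussed above. The frames are an assumption built from the values in the report (index 22 on the left, 999 on the right, a shared column `C`), and the helper name is hypothetical.

```python
import pandas as pd
from pandas.errors import MergeError


def merge_rejects_on_with_left_index():
    """Return True if combining `on` with `left_index=True` raises MergeError."""
    # Assumed frames modelled on the values quoted in the report.
    df_left = pd.DataFrame({"C": ["X"]}, index=[22])
    df_right = pd.DataFrame({"C": ["X"]}, index=[999])
    try:
        # Two different ways of naming the left join key at once.
        pd.merge(df_left, df_right, on=["C"], left_index=True)
    except MergeError:
        return True
    return False


result = merge_rejects_on_with_left_index()
```

In the pandas test suite this would more likely be written with `pytest.raises`; the plain `try`/`except` form is used here only to keep the sketch dependency-free.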
Code Sample, a copy-pastable example if possible
Problem description
The copied code prints a DataFrame where the key is 999. As I understand from the documentation, when left_index=True the keys from the left DataFrame should be used as join keys.
My output:
Int64Index([999], dtype='int64')
Expected output:
Int64Index([22], dtype='int64')
pandas: 0.23.3
pytest: None
pip: 18.0
setuptools: 20.7.0
Cython: None
numpy: 1.15.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: 1.1.0
xlwt: None
xlsxwriter: 1.0.5
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None