Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Right/outer merge behaviour on left column and right index is unexpected #17257

Open
ianepreston opened this issue Aug 15, 2017 · 13 comments
Open
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@ianepreston
Copy link

ianepreston commented Aug 15, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
big_index = [123, 124, 125, 126, 127, 128, 129, 130]
big_dat = {'year': pd.Series([2000, 2000, 2000, 2001, 2002, 2002, 2002, 2004], index=big_index),
          'other': pd.Series(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'], index=big_index)}
big_df = pd.DataFrame(big_dat)

year_index = [2003, 2000, 2001, 2002]
year_dat = {'a': pd.Series([1, 2, 3, 4], index=year_index),
            'b': pd.Series([5, 6, 7, 8], index=year_index)}
year_df = pd.DataFrame(year_dat)

merged_right = pd.merge(
        big_df,
        year_df,
        how='right',
        left_on='year',
        right_index=True
        )
merged_outer = pd.merge(
        big_df,
        year_df,
        how='outer',
        left_on='year',
        right_index=True
        )
merged_right
Out[5]: 
    other  year  a  b
123     a  2000  2  6
124     b  2000  2  6
125     c  2000  2  6
126     d  2001  3  7
127     e  2002  4  8
128     f  2002  4  8
129     g  2002  4  8
130   NaN  2003  1  5

merged_outer
Out[6]: 
    other  year    a    b
123     a  2000  2.0  6.0
124     b  2000  2.0  6.0
125     c  2000  2.0  6.0
126     d  2001  3.0  7.0
127     e  2002  4.0  8.0
128     f  2002  4.0  8.0
129     g  2002  4.0  8.0
130     h  2004  NaN  NaN
130   NaN  2003  1.0  5.0

Problem description

merged outer and merged right both return year 2003, a 1, b 5 associated with index 130.

Expected Output

Either a NaN index entry or an error/warning. There's no obvious reason to associate that data with 130 index

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: en
LOCALE: None.None

pandas: 0.20.2
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.5.6
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Aug 15, 2017
@gfyoung
Copy link
Member

gfyoung commented Aug 15, 2017

@cornucrapia : Thanks for reporting this! Could you instead print out the repr version of the DataFrame that you are getting from the merge? That would be easier for us to read.

@ianepreston
Copy link
Author

@gfyoung absolutely. See below.

repr(merged_right)
Out[3]: '    other  year  a  b\n123     a  2000  2  6\n124     b  2000  2  6\n125     c  2000  2  6\n126     d  2001  3  7\n127     e  2002  4  8\n128     f  2002  4  8\n129     g  2002  4  8\n130   NaN  2003  1  5'

repr(merged_outer)
Out[4]: '    other  year    a    b\n123     a  2000  2.0  6.0\n124     b  2000  2.0  6.0\n125     c  2000  2.0  6.0\n126     d  2001  3.0  7.0\n127     e  2002  4.0  8.0\n128     f  2002  4.0  8.0\n129     g  2002  4.0  8.0\n130     h  2004  NaN  NaN\n130   NaN  2003  1.0  5.0'

@gfyoung
Copy link
Member

gfyoung commented Aug 15, 2017

@cornucrapia : Sorry, I misspoke 😄. Can you just print out the DataFrame i.e.:

In[5]: merged_right
Out[5]: <DataFrame>

@ianepreston
Copy link
Author

merged_right
Out[5]: 
    other  year  a  b
123     a  2000  2  6
124     b  2000  2  6
125     c  2000  2  6
126     d  2001  3  7
127     e  2002  4  8
128     f  2002  4  8
129     g  2002  4  8
130   NaN  2003  1  5

merged_outer
Out[6]: 
    other  year    a    b
123     a  2000  2.0  6.0
124     b  2000  2.0  6.0
125     c  2000  2.0  6.0
126     d  2001  3.0  7.0
127     e  2002  4.0  8.0
128     f  2002  4.0  8.0
129     g  2002  4.0  8.0
130     h  2004  NaN  NaN
130   NaN  2003  1.0  5.0

@gfyoung
Copy link
Member

gfyoung commented Aug 15, 2017

Ah, that's a little better to read now! I do agree: I too am a little perplexed why there would be such a row, as I don't know where that would show up in either such join.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 24, 2019

I believe this is the same example

import pandas as pd
left = pd.DataFrame({'a': [1, 2, 3], 'key': [0, 1, 1]})
right = pd.DataFrame({'b': [1, 2, 3]})
result = left.merge(right, left_on='key', right_index=True, how='right')

output

In [5]: left
Out[5]:
   a  key
0  1    0
1  2    1
2  3    1

In [6]: right
Out[6]:
   b
0  1
1  2
2  3

In [7]: result
Out[7]:
     a  key  b
0  1.0    0  1
1  2.0    1  2
2  3.0    1  2
2  NaN    2  3

I may have expected something like

In [11]: pd.merge(left, right.rename_axis('key').reset_index(), on='key', how='right').set_index('key', drop=False)
Out[11]:
       a  key  b
key
0    1.0    0  1
1    2.0    1  2
1    3.0    1  2
2    NaN    2  3

https://github.com/pandas-dev/pandas/pull/24904/files is adding a test with that. I'll add a note to that test that it's probably buggy, and to refer here.

@phofl
Copy link
Member

phofl commented May 28, 2020

@TomAugspurger I would expect the output

     a  key  b
0  1.0    0  1
1  2.0    1  2
1  3.0    1  2
2  NaN    NaN  3

for the operation

left = pd.DataFrame({'a': [1, 2, 3], 'key': [0, 1, 1]})
right = pd.DataFrame({'b': [1, 2, 3]})
result = left.merge(right, left_on='key', right_index=True, how='right')

Could you give me a hint, why you would expect a 2 in the key column (not index) in the last row and not NaN?

@TomAugspurger
Copy link
Contributor

Is that from the index in right being 2 there?

@phofl
Copy link
Member

phofl commented May 28, 2020

I would have expected, that the key column is the original key column, not influenced by the index. But I am not sure if this is the desired behavior.

I would expected index = index of right df, a and key cols with values from left df and b with values from right df.

I am currently working on the index issue, which is unrelated with the key column problem. But I would try to work on this after figuring the index issue out, if I know how this should work.

@TomAugspurger
Copy link
Contributor

I thought right_index=True means that the index would be involved, but I might be mistaken.

@phofl
Copy link
Member

phofl commented May 28, 2020

Yes you join the column key onto the index of the right DataFrame. The index has values [0, 1, 2] while the key column has [0, 1, 1]. I would expected the result [(0, 0), (1, 1), (1, 1), (2, NaN)] with the index as first component of the tuples and the key column as the second. The index value 2 has no match in the key column, so I would expect that the key column is NaN in this row.

@TomAugspurger
Copy link
Contributor

Ah right. Mine is probably mistaken.

@phofl
Copy link
Member

phofl commented May 28, 2020

Ok thanks.

I have a small follow up question about the initial example. I would expect that the index there is the index of the right DataFrame (year_df) because we have a right join. It seems wrong that we get the left index after a right join.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants