Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

join type changes silently when left-joining with a list of dataframes #19607

Closed
elrubio opened this issue Feb 8, 2018 · 2 comments · Fixed by #19624
Closed

join type changes silently when left-joining with a list of dataframes #19607

elrubio opened this issue Feb 8, 2018 · 2 comments · Fixed by #19624
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Milestone

Comments

@elrubio
Copy link
Contributor

elrubio commented Feb 8, 2018

Code sample, copy-pastable

import pandas as pd

# Note that index of d2 is _not_ unique
d1 = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]}).set_index('k')
d2 = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]}).set_index('k')

print('DataFrame d1:', d1, sep="\n")
print('DataFrame d2:', d2, sep="\n")

print("d1.join(d2, how='left') looks fine:")
d = d1.join(d2, how='left', lsuffix='_left', rsuffix='_right')
print(d)

print("But with d1.join([d2], how='left', ...), K3 shows up in index but shouldn't:")
d = d1.join([d2], how='left', lsuffix='_left', rsuffix='_right')
print(d)

print("pd.merge(d1, d2, how='left') works as expected:")
d = pd.merge(d1, d2, how='left', left_index=True, right_index=True)
print(d)

print("d1.join([d3], ...) also works as expected when indices are unique:")
d3 = pd.DataFrame({'k': ['K0', 'K4', 'K3'], 'v': [4, 5, 6]}).set_index('k')
d = d1.join(d3, how='left', lsuffix='_left', rsuffix='_right')
print(d)

# Output:
#
# DataFrame d1:
#     v
# k    
# K0  1
# K1  2
# K2  3
#
# DataFrame d2:
#     v
# k    
# K0  4
# K0  5
# K3  6
#
# d1.join(d2, how='left') looks fine:
#     v_left  v_right
# k                  
# K0       1      4.0
# K0       1      5.0
# K1       2      NaN
# K2       3      NaN
#
# But with d1.join([df], how='left', ...), K3 shows up in index but shouldn't:
#     v_x  v_y
# k           
# K0  1.0  4.0
# K0  1.0  5.0
# K1  2.0  NaN
# K2  3.0  NaN
# K3  NaN  6.0        <----- UNEXPECTED
#
# pd.merge(d1, d2, how='left') works as expected:
#     v_x  v_y
# k           
# K0    1  4.0
# K0    1  5.0
# K1    2  NaN
# K2    3  NaN
#
# d1.join([d3], ...) also works as expected when indices are unique:
#     v_left  v_right
# k                  
# K0       1      4.0
# K1       2      NaN
# K2       3      NaN

Problem description

df.join() silently changes a 'left' join to an 'outer' join when these conditions are met:

  • join() is called with a sequence of dataframes, e.g. d1.join([d2], how='left', ...)
  • some dataframe has a non-unique index

Note that the equivalent call to merge(d1, d2, how='left', ...) works just fine.

When looking at the join() code, I see concat() being discarded when dataframe indices are not unique and merge() being called instead. This would work perfectly if not how='left' had just been changed to how='outer' a couple of lines above.

Expected Output

d1.join(d2, how='left', ...) should give the same result as pd.merge(d1, d2, how='left', ...). The join type should not be changed.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.3.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 44 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.5.0.post20170921
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.1.0
openpyxl: 2.4.8
xlrd: 1.1.0
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Feb 9, 2018

I think this is the same issue as #17257. maybe also related to #16304. Welcome to have a look / PR to fix. These should be the same.

In [13]: pd.merge(d1, d2, left_index=True, right_index=True, how='left')
Out[13]: 
    v_x  v_y
k           
K0    1  4.0
K0    1  5.0
K1    2  NaN
K2    3  NaN

In [14]: d1.join([d2], how='left')
Out[14]: 
    v_x  v_y
k           
K0  1.0  4.0
K0  1.0  5.0
K1  2.0  NaN
K2  3.0  NaN
K3  NaN  6.0

@jreback jreback added Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode Difficulty Intermediate labels Feb 9, 2018
@jreback jreback added this to the Next Major Release milestone Feb 9, 2018
@elrubio
Copy link
Contributor Author

elrubio commented Feb 9, 2018

Well, I had a look at #17257 before reporting this. IMHO, they are not related: there, it's the merge() method itself that shows unexpected behaviour, while here, some helper method of join() simply calls merge() with the wrong argument how='outer'. If merge() had been called correctly using how='left', everything would be fine.

Seems easy to fix, but might break existing code. I will look into how to fix it..

elrubio added a commit to elrubio/pandas that referenced this issue Feb 9, 2018
Prevents changing the join type from 'left' to 'outer' when merge() is used. Fixes pandas-dev#19607.
elrubio added a commit to elrubio/pandas that referenced this issue Feb 9, 2018
elrubio added a commit to elrubio/pandas that referenced this issue Feb 9, 2018
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Feb 10, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants