Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str.cat does not align on index? #18657

Closed
h-vetinari opened this issue Dec 6, 2017 · 5 comments
Closed

str.cat does not align on index? #18657

h-vetinari opened this issue Dec 6, 2017 · 5 comments
Labels
API Design Strings String extension data type and string data
Milestone

Comments

@h-vetinari
Copy link
Contributor

h-vetinari commented Dec 6, 2017

The implicit index-matching of pandas for operations between different DataFrame/Series is great and most of the times, it just works. It does so consistently enough, that the expectation (for me) is that different Series will be aligned before an operation is performed.

For some reason, str.cat does not seem to do so.

import pandas as pd # 0.21.0
import numpy as np # 1.13.3
col = pd.Series(['a','b','c','d','e','f','g','h','i','j'])

# choose random subsets
ss1 = [8, 1, 2, 0, 6] # list(col.sample(5).index) 
ss2 = [4, 0, 9, 2, 6] # list(col.sample(5).index)

# perform str.cat
col.loc[ss1].str.cat(col.loc[ss2], sep = '').sort_index()
# 0    ac <-- UNMATCHED!
# 1    ba <-- UNMATCHED!
# 2    cj <-- UNMATCHED!
# 6    gg <-- correct by sheer luck
# 8    ie <-- UNMATCHED!

# compared for example with Boolean operations on unmatched series
# (matching indices and returning Series with union of both indices),
# this is inconsistent!
b = col.loc[ss1].astype(bool) & col.loc[ss2].astype(bool)
b
# 0     True
# 1    False
# 2     True
# 4    False
# 6     True
# 8    False
# 9    False

# if we manually align the Series
# (easy here by masking from the Series we just subsampled, hard in practice),
# then the NaNs are handled as expected:
m = col.where(np.isin(col.index, ss1)).str.cat(col.where(np.isin(col.index, ss2)), sep = '')
m
# 0     aa
# 1    NaN
# 2     cc
# 3    NaN
# 4    NaN
# 5    NaN
# 6     gg
# 7    NaN
# 8    NaN
# 9    NaN

# based on the normal pandas-behaviour for unmatched Series
# (for example as for Boolean "and" above), the following would be
# the expected result of col.loc[ss1].str.cat(col.loc[ss2], sep = '').sort_index() !
m.loc[b.index]
# 0     aa <-- MATCHED!
# 1    NaN
# 2     cc <-- MATCHED!
# 4    NaN
# 6     gg <-- MATCHED!
# 8    NaN
# 9    NaN
@TomAugspurger TomAugspurger added API Design Strings String extension data type and string data labels Dec 6, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Dec 6, 2017
@TomAugspurger
Copy link
Contributor

Yes, this should probably align when others is a Series. It'll require some care to get right, since it's an API breaking change.

I'd favor a keyword like align for controlling it. For now it'll be None by default and if others is a series it will not align, but warn that the default will change in the future to align.

@TomAugspurger
Copy link
Contributor

Do we have any other string operations that take a Series / DataFrame and should be handled similarly?

@h-vetinari
Copy link
Contributor Author

Cool! I think your proposal for dealing with the API-breaking sounds good, except maybe calling it align = False by default first, and changing to True at some point in the future (not unlike the expand-keyword that .str.extract gained in 0.18)

Looking at https://pandas-docs.github.io/pandas-docs-travis/text.html#method-summary,
it seems that .str.cat is the only method that deals with two different Series. Maybe that's why the alignment for it fell through the cracks?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Dec 6, 2017 via email

@h-vetinari
Copy link
Contributor Author

@TomAugspurger @jreback
This can now be closed.

@jreback jreback modified the milestones: Next Major Release, 0.23.0 May 3, 2018
@jreback jreback closed this as completed May 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Strings String extension data type and string data
Projects
None yet
Development

No branches or pull requests

3 participants