str.cat should return categorical data for categorical caller #20845

Open
h-vetinari opened this issue Apr 27, 2018 · 2 comments
Labels
Categorical Categorical Data Type Enhancement Needs Discussion Requires discussion from core team before further action Strings String extension data type and string data

Comments

@h-vetinari
Contributor

The str.cat method (on the .str accessor) works for both Series and Index, and returns an object of the corresponding type:

s = pd.Series(['a', 'b', 'a'])
t = pd.Index(['a', 'b', 'a'])
## all of the following return the same Series
s.str.cat(s)
s.str.cat(t)
s.str.cat(s.values)
s.str.cat(list(s))
# 0    aa
# 1    bb
# 2    aa
# dtype: object

## all of the following return the same Index
t.str.cat(s)
t.str.cat(t)
t.str.cat(s.values)
t.str.cat(list(s))
# Index(['aa', 'bb', 'aa'], dtype='object')

But a categorical caller loses its category dtype after str.cat, which is inconsistent, IMO:

sc = s.astype('category')
tc = pd.Index(['a', 'b', 'a'], dtype='category') # conversion does not work, see #20843
sc.str.cat(s)
# 0    aa
# 1    bb
# 2    aa
# dtype: object
## as opposed to:
sc.str.cat(s).astype('category')
# 0    aa
# 1    bb
# 2    aa
# dtype: category
# Categories (2, object): [aa, bb]
tc.str.cat(s) # crashes, see #20842

xref #20842 #20843

@WillAyd
Member

WillAyd commented Apr 30, 2018

The return type here is part of the documentation (though perhaps could be improved):

https://pandas.pydata.org/pandas-docs/stable/categorical.html#string-and-datetime-accessors

FWIW I don't really see how you could return a Categorical after a concatenation and make guarantees about the returned metadata (ordering comes to mind here). IMO doing concat on a large array of values would in most cases generate a ton of unique values and defeat the purpose of a Categorical in the first place.
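
To make the concern about unique values concrete, here is a minimal sketch (not taken from the thread) of a worst case: a categorical caller with 10 categories is concatenated with a second column whose values vary independently. The variable names and the choice of 10 letters are assumptions for illustration only.

```python
import itertools

import pandas as pd

# Illustrative data: every ordered pair of 10 letters, i.e. 100 rows.
letters = list("abcdefghij")
pairs = list(itertools.product(letters, repeat=2))

left = pd.Series([p[0] for p in pairs], dtype="category")  # 10 categories
right = pd.Series([p[1] for p in pairs])                    # plain strings

# str.cat on a categorical caller returns a non-categorical Series.
combined = left.str.cat(right)

n_left = left.nunique()          # distinct values in the caller
n_combined = combined.nunique()  # distinct values after concatenation
print(n_left, n_combined)
```

In this constructed case every row of the result is distinct, so converting it back to a categorical would save nothing. Real data may of course be far more repetitive than this.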

@h-vetinari
Contributor Author

@WillAyd
Thanks for that reference in the docs (I had only seen it in individual docstrings). However, I don't think it's fair to assume what kind of data would result; I can imagine several cases where returning a categorical would be sensible. I still find it something worth considering, but at least there's an easy workaround with .astype('category').
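
For reference, the workaround can be wrapped in a small helper. This is only a sketch of what a caller could do today, not a proposal for how pandas itself should implement it; the name cat_like_caller is hypothetical and does not exist in pandas.

```python
import pandas as pd


def cat_like_caller(caller: pd.Series, others, sep: str = "") -> pd.Series:
    """Hypothetical helper: run str.cat, then convert the result back to
    a categorical if (and only if) the caller was categorical."""
    result = caller.str.cat(others, sep=sep)
    if isinstance(caller.dtype, pd.CategoricalDtype):
        # Categories are re-inferred from the concatenated values.
        result = result.astype("category")
    return result


s = pd.Series(["a", "b", "a"])
sc = s.astype("category")

out = cat_like_caller(sc, s)   # categorical in, categorical out
plain = cat_like_caller(s, s)  # non-categorical caller stays non-categorical
print(out.dtype, plain.dtype)
```

Note that astype("category") infers fresh, unordered categories from the result, so any ordering defined on the caller is not carried over. That is exactly the metadata question raised above, and this sketch does not try to answer it.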

@jbrockmendel jbrockmendel added Strings String extension data type and string data Categorical Categorical Data Type labels Aug 1, 2018
@mroeschke mroeschke added Enhancement Needs Discussion Requires discussion from core team before further action labels May 3, 2020
4 participants