Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

str.cat not working with binary data on Python3 #22721

Closed
h-vetinari opened this issue Sep 15, 2018 · 5 comments · Fixed by #22725
Closed

str.cat not working with binary data on Python3 #22721

h-vetinari opened this issue Sep 15, 2018 · 5 comments · Fixed by #22725
Labels
Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@h-vetinari
Copy link
Contributor

h-vetinari commented Sep 15, 2018

In Python 2, the following works:

s = pd.Series(np.array(list('abc'), 'S1').astype(object))
t = pd.Series(np.array(list('def'), 'S1').astype(object))
s.str.cat(t)
# 0    b'ad'
# 1    b'be'
# 2    b'cf'
# dtype: object

whereas in Python 3, s.str.cat(t) yields:

TypeError: sequence item 0: expected str instance, bytes found

Clearly, Python 3 fundamentally changed the separation between bytes and strings, but in principle, there is no reason (other than implementation) why this should not have the same result in Python 3 as in Python 2.

@WillAyd
Copy link
Member

WillAyd commented Sep 15, 2018

Not sure I agree here at the .str accessor is for handling strings not bytes. Let's see what others think though

@WillAyd WillAyd added the Needs Discussion Requires discussion from core team before further action label Sep 15, 2018
@jreback
Copy link
Contributor

jreback commented Sep 15, 2018

yeah this is correct
.str is strictly for strings and not bytes

@h-vetinari
Copy link
Contributor Author

h-vetinari commented Sep 16, 2018

@WillAyd @jreback

The example is out of pandas/tests/test_strings.test_method_on_bytes - it works on Python 2. I see the point that Python 2 didn't have the right separation between bytes/str and I'm fine with not having this functionality. I was just working on this functionality by refactoring the str.cat internals (#22725), and found that this now works in Python 3 as well (so I opened this issue before the PR).

Beyond that, it's IMO the .str-accessor that should be enforcing the correct types, but since there's no dedicated string-dtype yet, the methods that do work for other object data (e.g. lists) are also used like that (e.g. people use .str.len() to get the different length of a Series of lists).

I guess the better place to have this discussion is #22725, because the implementation would have to be adapted to restore the current test behaviour.

@jreback jreback added this to the 0.24.0 milestone Oct 5, 2018
@h-vetinari
Copy link
Contributor Author

@jreback
I haven't edited the OP of #22725 yet, but I'll open another issue to supersede this based on your review that the .str accessor should not work for bytes.

In other words, this does not need the 0.24 milestone, IMO

@h-vetinari
Copy link
Contributor Author

Superseded by #23011

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants