Unicode : change df.to_string() and friends to always return unicode objects #2224
This seems pretty reasonable. Should I take a chance merging this for 0.9.1? I've encountered the bug you fixed here before |
I would at least wait a few days before merging this (perhaps @jseabold or someone else would like |
I guess the question is what code will break because the string is coming back as unicode. Obviously if you had |
it depends whether you consider this a bug fix or a breaking change. I'm fine with 0.10 though. |
Let's wait 'til 0.10. Let's merge it into master as soon as the release is out though. |
Agreed... |
This would be great. As of right now, you have to do something dirty (at least that's the only way I found it works) like |
I took this a step further, Realizing that the unicode issue really matters only So:
Yell if something broke. |
@aldanor I see you deleted your comment but I checked that your example works now, at least on my environment... |
@wesm Thanks, sounds good. I just didn't want to confuse everyone cause I wasn't sure this wasn't something specific to my environment. I will try and test it again soon as I can. |
…e force_unicode #2225 using pprint_thing will try to decode using utf-8 as a fallback, but by these functions will now return unicode() rather than str() objects.
…ter, Index.format, etc'
DOC: add note about formatters needing to return unicode (if returning strings). We need to keep everything unicode at the bottom levels, so that we can combine strings with other unicode strings at the I/O choke-points; otherwise Python tries to coerce the bytestring into unicode using the 'ascii' encoding, and we get a UnicodeDecodeError.
…f/series containing unicode
…ries,df,panel
- If you put in proper unicode data, you're good.
- If you put in utf-8 bytestrings you should still be good (it works if rendering is wrapped by pprint_thing, I may have missed a few spots).
- If you put in non utf-8 bytestrings, with the encoding unknown, and expect unicode(x) or str(x) to do the right thing - you're doing it wrong.
Added str/unicode/bytes support for |
takeback |
closes #2225
Note: Although all the tests pass with minor fixes, this PR has an above-average chance of
breaking things for people who have relied on broken behaviour thus far.
df.tidy_repr combines several strings to produce a result. When one component is unicode and the other is a non-ascii bytestring, it tries to convert the latter back to a unicode string using the 'ascii' codec and fails.
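A minimal sketch of that failure mode, for illustration only. Python 2 performs the 'ascii' decode implicitly when a bytestring is combined with a unicode object; the sketch spells the step out so that it also runs on Python 3. The names header, cell and implicit_concat are hypothetical and are not part of pandas.

```python
# Python 2 decodes a bytestring with the 'ascii' codec implicitly when it is
# combined with a unicode object. The decode is written out here so the
# example also runs on Python 3, where no implicit coercion exists.
header = u"caf\xe9"                     # a unicode component
cell = u"na\xefve".encode("utf-8")      # a non-ASCII bytestring component


def implicit_concat(text, raw):
    """Mimic Python 2's `text + raw`: decode raw as ASCII, then join."""
    return text + raw.decode("ascii")


try:
    implicit_concat(header, cell)
    failed = False
except UnicodeDecodeError:
    # This is the error the combined repr runs into.
    failed = True

# Decoding at the boundary, before combining, avoids the failure.
combined = header + cell.decode("utf-8")
```

Once every component is already unicode, the join cannot hit the 'ascii' codec at all.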
I suggest that _get_repr -> to_string should always return unicode, as implemented by this PR, and that the force_unicode argument be deprecated everywhere.
The force_unicode argument in to_string conflates two things:
The first is now no longer necessary since pprint_thing already resorts to the same hack of using utf-8 (with errors='replace') as a fallback.
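A rough sketch of that kind of fallback, assuming Python 3, where the str/bytes split is explicit. to_text is a hypothetical helper written for this note; it is not the pprint_thing implementation.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper: always return text, never bytes.

    Bytes are decoded with a lossy fallback (errors='replace'), so bytes
    that are invalid in the given encoding become U+FFFD instead of
    raising UnicodeDecodeError.
    """
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return str(value)


good = to_text(u"caf\xe9".encode("utf-8"))   # valid utf-8 bytes decode cleanly
lossy = to_text(b"\xff")                     # invalid utf-8 byte is replaced
plain = to_text(u"caf\xe9")                  # text passes through unchanged
number = to_text(3)                          # non-string values are stringified
```

The trade-off is that bytestrings in an unknown non utf-8 encoding render with replacement characters rather than raising.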
I believe making the latter optional is wrong, precisely because it brings about situations like the test case above.
to_string, like all internal functions, should utilize unicode objects whenever feasible.