convert numeric column to dedicated pd.StringDtype()
#31204
Comments
Overlaps with #22384, which is trying to solve this problem in general. In the meantime, we can add support for this in IntegerArray.astype. |
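@TomAugspurger Was skimming issues and saw this one. I'm wondering if Series.astype('string') should be treated as a special case independent of the underlying dtype of the underlying Series. That's because the following code should always work, assuming s is a Series:
pd.Series([str(x) if not pd.isna(x) else pd.NA for x in s], dtype="string")
Since we know that the underlying objects in the EA have to support str(), there is a straightforward way of doing that conversion. If you agree, I can look into doing a PR. |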
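A minimal sketch of the element-wise conversion proposed in this comment, wrapped in a helper. The name to_string_series is hypothetical and is not part of pandas; passing index=s.index is an addition to the one-liner so the result keeps the original labels. This is only an illustration of the idea, not how pandas implements astype.

```python
import pandas as pd


def to_string_series(s: pd.Series) -> pd.Series:
    """Hypothetical helper: str() every element, keep missing values as pd.NA."""
    values = [str(x) if not pd.isna(x) else pd.NA for x in s]
    # index=s.index is an assumption added here to preserve the labels.
    return pd.Series(values, index=s.index, dtype="string")


# A nullable integer column with one missing value.
ints = pd.Series([5, None, 7], dtype="Int64")
result = to_string_series(ints)
print(result)
```

Because the loop runs in Python, this trades speed for generality: it relies only on each element supporting str().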
I don't think we should preempt the array from having a chance to perform
the conversion.
And code-wise, I don't think we'd want a special case for this in
NDFrame.astype.
On Fri, Jan 24, 2020 at 10:19 AM Irv Lustig wrote:
@TomAugspurger Was skimming issues and saw this one.
I'm wondering if Series.astype('string') should be treated as a special
case independent of the underlying dtype of the underlying Series. That's
because the following code should always work assuming s is a Series:
pd.Series([str(x) if not pd.isna(x) else pd.NA for x in s], dtype="string")
So since we know that the underlying objects in the EA have to support
str(), there is a straightforward way of doing that conversion.
If you agree, I can look into doing a PR
|
Well, I'm suggesting that we do want a special case just for
|
Right, that's definitely desirable. But I don't think NDFrame.astype is the place for the fix. |
For example, consider SparseArray. It implements astype such that the result is also a SparseArray and preserves the sparsity. So |
I guess this depends on the semantics of If I have an EA of type "current_dtype" and I write
I think you are saying that the design we have supports (1), and I'm suggesting a design corresponding to (2). Now, the reason that I prefer (2) is that when I construct a Another possible design would be to have a property of EA's called something like |
We have another issue for an astype dispatch mechanism. |
I want to chime in just to give another use-case from the duplicated issue #31839. In addition to conversion to
s = pd.Series(['0', pd.NA], dtype='string')
s.astype('object').replace(pd.NA, np.nan).astype('float64').astype('Int8')
It should be possible to do simply |
Your solution can give rounding errors when dealing with large integers. (I've been bitten by this when importing production data: the batch numbers were too large to fit exactly in a float64.)
import numpy as np
import pandas as pd

s = pd.Series(["0", pd.NA, str(2 ** 60 + 2)], dtype="string")
s.to_frame().assign(
    # a: detour through float64, which cannot represent 2 ** 60 + 2 exactly
    a=s.astype("object")
    .replace(pd.NA, np.nan)
    .astype("float64")
    .astype("Int64"),
    # b: parse each string with int(), leaving missing values untouched
    b=s.apply(lambda x: int(x) if pd.notnull(x) else x).astype("Int64"),
)
Option b explicitly loops over the column, so it is not ideal performance-wise. |
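The precision loss can be seen without pandas at all. The sketch below uses plain Python to compare a float64 round trip with direct integer parsing; the variable names are illustrative only.

```python
# A float64 has a 53-bit significand, so integers near 2 ** 60 are spaced
# 256 apart and 2 ** 60 + 2 cannot be stored exactly.
big = 2 ** 60 + 2

via_float = int(float(big))  # round trip through float64
direct = int(str(big))       # parse the decimal string directly

print(via_float == big)  # False
print(direct == big)     # True
print(big - via_float)   # 2
```

Any conversion path that passes through float64 is exposed to this, which is why parsing the strings as integers directly is the safer route for large identifiers.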
@vadella, thanks for the code example. I had the same problem too and I currently side-step it by loading the data directly in string format. Another reason why it is important that this conversion is handled by pandas internally. |
Code Sample, a copy-pastable example if possible
pd.Series(range(5, 10), dtype="Int64").astype("string")
raises
TypeError: data type not understood
while
pd.Series(range(5, 10)).astype("string")
raises
ValueError: StringArray requires a sequence of strings or missing values.
If you first do astype(str):
and
work as expected:
While astype(object) raises in both cases
ValueError: StringArray requires a sequence of strings or missing values.
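A sketch of the astype(str) workaround described above, assuming the two Series from the problem description (a plain integer column and a nullable Int64 column). The variable names are illustrative, and the data deliberately contains no missing values, since how astype(str) renders them is not covered here.

```python
import pandas as pd

plain = pd.Series(range(5, 10))
nullable = pd.Series(range(5, 10), dtype="Int64")

# Convert to Python strings first, then to the dedicated string dtype.
plain_str = plain.astype(str).astype("string")
nullable_str = nullable.astype(str).astype("string")

print(plain_str.dtype, nullable_str.dtype)
```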
Problem description
I can understand the ValueError, since you don't feed strings to the StringArray. Best for me would be if astype("string") converted the values to strings, or if astype(str) returned a StringArray, but in any case I would expect both pd.Series(range(5, 10), dtype="Int64").astype("string") and pd.Series(range(5, 10)).astype("string") to raise the same error.
Expected Output
or
ValueError: StringArray requires a sequence of strings or missing values.