-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: astype fill_value for SparseArray.astype #23547
BUG: astype fill_value for SparseArray.astype #23547
Conversation
I don't think we have a specific issue for this. This fixes strange things like ```python In [1]: import pandas as pd; import numpy as np In [2]: a = pd.SparseArray([0, 1]) In [3]: a.astype(bool) Out[3]: [0, True] Fill: 0 IntIndex Indices: array([1], dtype=int32) ```
Hello @TomAugspurger! Thanks for updating the PR.
Comment last updated on November 07, 2018 at 17:12 Hours UTC |
@@ -614,7 +614,7 @@ def __array__(self, dtype=None, copy=True): | |||
# Can't put pd.NaT in a datetime64[ns] | |||
fill_value = np.datetime64('NaT') | |||
try: | |||
dtype = np.result_type(self.sp_values.dtype, fill_value) | |||
dtype = np.result_type(self.sp_values.dtype, type(fill_value)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was having trouble with string fill values.
pandas/core/arrays/sparse.py
Outdated
dtype).item() | ||
dtype = SparseDtype(dtype, fill_value=fill_value) | ||
|
||
# Typically we'll just astype the sp_values to dtype.subtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind ugly, but it's backwards compatible, consistent with the rest of pandas, and does what we need.
Basically, unless we want to support actual numpy string dtypes (which we probably don't), then we need a way of differentiating between array.astype(object)
and array.astype(str)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this a method on the Dtype itself to avoid cluttering this up here? maybe dtype.astype_type
alternatively . we could actually add .astype_nansafe(value, copy=False)
as a Dtype method (kind of makes sense actually)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was playing around with that earlier (didn't push it though). I called it SparseDtype.astype
. I'll give it another shot and see what it de-duplicates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, can you clarify what you had in mind for astype_nansafe(value, copy=False)
? What would value here be? An array or a scalar?
I'll have a followup PR soon (hopefully today) for ensuring that the dtype of SparseArray.sp_values
is consistent with the type of SparseArray.dtype.fill_value
. I think my SparseDtype.astype
is more useful there. It wouldn't make sense for using here, since we're astyping the actual array of values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right I think adding a method that returns the dtype of the .astype
values on the Dtype iself is what I am looking. The conversion still happens in the Array. Basically the code you added here should be on the Dtype object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to add two methods
SparseDtype.astype
: convert from a SparseDtype to a new dtype, taking care to astypeself.fill_value
if needed.SparseDtype._subtype_with_str
to hold the logic for determining what the "real" subtype is, if we actually want str.
SparseDtype.astype
seems reasonably useful to users, so I made it public.
Codecov Report
@@ Coverage Diff @@
## master #23547 +/- ##
==========================================
+ Coverage 92.23% 92.23% +<.01%
==========================================
Files 161 161
Lines 51324 51334 +10
==========================================
+ Hits 47339 47349 +10
Misses 3985 3985
Continue to review full report at Codecov.
|
pandas/core/arrays/sparse.py
Outdated
@@ -284,6 +284,83 @@ def is_dtype(cls, dtype): | |||
return True | |||
return isinstance(dtype, np.dtype) or dtype == 'Sparse' | |||
|
|||
def astype(self, dtype): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make a method on the base Dtype class as well which just returns .dtype
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as this will make it an offical part of the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any other types that this would be useful for? IMO it's not important enough to add to the interface.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anything with a subtype? so Categorical and Interval?
return dtype | ||
|
||
@property | ||
def _subtype_with_str(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess this is only for Sparse which is ok
We wouldn't be able to share with IntervalDtype, since the astype there
needs to look at the
actual values to know whether or not it's going to succeed.
I guess we have something similar on `CategoricalDtype` called
`CategoricalDtype.update_dtype`.
Not the best name in the world, but I suppose we should match it (we won't
be able to share code
though).
…On Sun, Nov 11, 2018 at 1:24 PM Jeff Reback ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In pandas/core/arrays/sparse.py
<#23547 (comment)>:
> @@ -284,6 +284,83 @@ def is_dtype(cls, dtype):
return True
return isinstance(dtype, np.dtype) or dtype == 'Sparse'
+ def astype(self, dtype):
anything with a subtype? so Categorical and Interval?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#23547 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHIve4UpFry1GK6cM6TErGcZfkwJFjks5uuHmAgaJpZM4YS1QX>
.
|
ok that sounds good |
* upstream/master: BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527) DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635) More helpful Stata string length error. (pandas-dev#23629) BUG: astype fill_value for SparseArray.astype (pandas-dev#23547) CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587) CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627) CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249) DOC: Enhancing pivot / reshape docs (pandas-dev#21038) TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
…fixed * upstream/master: DOC: avoid SparseArray.take error (pandas-dev#23637) CLN: remove incorrect usages of com.AbstractMethodError (pandas-dev#23625) DOC: Adding validation of the section order in docstrings (pandas-dev#23607) BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527) DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635) More helpful Stata string length error. (pandas-dev#23629) BUG: astype fill_value for SparseArray.astype (pandas-dev#23547) CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587) CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627) CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249)
I don't think we have a specific issue for this. This is not a fix / change for #23125
This fixes strange things like
restoring the behavior of 0.23.x