API: more permissive conversion to StringDtype #33465
Conversation
#33254 has a discussion about the appropriate strictness of ...
The doctest failures are very strange; it doesn't make sense that they're failing in my PR but not in other PRs...
From the discussion in #33254 it looks like a dedicated ... Any comments on the current PR?
The implementation itself looks good to me.
I've updated the PR:
The last bullet point is because of a situation where you combine two ...
pandas/core/arrays/sparse/dtype.py
Outdated
@@ -320,7 +320,7 @@ def update_dtype(self, dtype):
         dtype = pandas_dtype(dtype)

         if not isinstance(dtype, cls):
-            fill_value = astype_nansafe(np.array(self.fill_value), dtype).item()
+            fill_value = astype_nansafe(np.array([self.fill_value]), dtype)[0]
ExtensionArrays don't have an .item method.
Do you remember why this was needed? (Sparse currently doesn't support using EAs under the hood, so in principle, we should never get an extension dtype here)
The situation that fails if we do .item is:
dtype is a StringDtype and cls is a SparseDtype. I'm not very familiar with sparse arrays, but it seems reasonable that string arrays can be converted to sparse arrays?
it seems reasonable that string arrays can be converted to sparse arrays?
In principle, yes, but in practice they don't support EAs as underlying values (the SparseArray.__init__ converts any input to a numpy array). So I think this should raise an informative error when dtype is an extension dtype and not a numpy dtype.
@@ -172,15 +172,16 @@ def test_combine_le(self, data_repeated):
         orig_data1, orig_data2 = data_repeated(2)
         s1 = pd.Series(orig_data1)
         s2 = pd.Series(orig_data2)
-        result = s1.combine(s2, lambda x1, x2: x1 <= x2)
+        result = s1.combine(s2, lambda x1, x2: x1 <= x2, dtype="boolean")
Inferring to a different dtype than the original in a general way is quite difficult, and it's better to be explicit in such cases IMO.
Although I agree in general (certainly when using combine for constructing the expected result), is this change required in this specific case? (I would assume nothing changed related to this?) Because it might still be good to test the case without specifying the dtype explicitly, as that is public API.
The problem here is that almost anything can be converted to a string. I've made a change, though: it now infers the dtype, which isn't ideal performance-wise in the case where the series is a StringDtype but the result might not be.
If the user passes dtype explicitly, the performance is fine.
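To illustrate the inference behaviour being discussed, here is a small sketch using only the existing combine signature (the dtype keyword itself is what this PR proposes, so only the inferring path is shown; the values are made up):

```python
import pandas as pd

# Two nullable-integer series; values are illustrative only.
s1 = pd.Series([1, 2, 3], dtype="Int64")
s2 = pd.Series([3, 2, 1], dtype="Int64")

# Without an explicit dtype, Series.combine infers the result dtype
# from the combined values.
result = s1.combine(s2, lambda a, b: a <= b)
```

Depending on the pandas version, the inferred dtype here may be boolean or an integer dtype; being explicit avoids that ambiguity, which is what the proposed dtype keyword is for.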
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -89,6 +115,7 @@ Other enhancements
 - :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`)
 - :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`)
 - :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` accessor that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`).
+- :meth:`Series.combine` has gained a ``dtype`` argument. If supplied, the combined series will get that dtype (:issue:`33465`)
not averse to this, but can we do as a separate change?
@@ -431,6 +431,11 @@ def astype(self, dtype, copy=True):
         array : ndarray
             NumPy ndarray with 'dtype' for its dtype.
         """
+        from pandas.core.arrays.string_ import StringDtype
Do we actually ever hit this base type? I think we override this everywhere.
Float arrays are PandasArrays, and those get their astype from ExtensionArray:
>>> pd.array([1.5, 2.5])
<PandasArray>
[1.5, 2.5]
Length: 2, dtype: float64
Allowing any ExtensionArray by default to convert to StringArray seems reasonable to me (and if subclassers don't want that, they can make their own astype implementation disallowing StringArrays).
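A quick sketch of the behaviour described here. Note that in the pandas version under review pd.array on plain floats gave a NumPy-backed PandasArray; newer versions may infer a nullable float dtype instead, but the string conversion works the same way:

```python
import pandas as pd

# A float-backed extension array; in the version under review this is a
# PandasArray whose astype comes from the ExtensionArray base class.
arr = pd.array([1.5, 2.5])

# Converting to the nullable string dtype via that astype.
result = arr.astype("string")
```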
OK, can you add a comment to this effect here (I agree with your statement), but comment for future readers.
can you add a comment to this effect here
All methods in this base class are there for subclasses to (potentially) use, so I don't think a comment about that is needed
(a comment about always being able to astype to string dtype is fine though)
I understand. I would move the impl that is currently in Decimal to here, as it correctly handles the astype from the same type (whereas this one will coerce to a numpy array).
        expected = pd.Series(
-            [a <= b for (a, b) in zip(list(orig_data1), list(orig_data2))]
+            [a <= b for (a, b) in zip(list(orig_data1), list(orig_data2))],
What happens if we don't specify dtype? Can you add examples for this as well (it's possible we should also raise in some cases)?
I've made a change to infer, i.e. give the same result as before.
Thanks for this, @topper-123 (and sorry for the slow review). As I said in #33254, I think we can fix this bug regardless of the progress of the bigger discussion in that issue.
@@ -152,13 +157,13 @@ class StringArray(PandasArray):
     ['This is', 'some text', <NA>, 'data.']
     Length: 4, dtype: string

     Unlike ``object`` dtype arrays, ``StringArray`` doesn't allow non-string
     values.
I think it is important to keep the original intent of this part, in some way, as it was meant to explain that the StringArray will only contain strings (as opposed to object dtype, which is not strict here). This could maybe also be shown by setting a non-string value (arr[0] = 1) or so.
How you changed it still doesn't explain the original intent, as mentioned above, IMO. I would explain both (only strings allowed + showing the automatic conversion)
Ok, I've tried it differently.
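The two behaviours being discussed (strict element assignment versus permissive astype conversion) can be sketched as follows; the exact exception type raised on assignment may vary between pandas versions:

```python
import pandas as pd

# StringArray only holds strings: assigning a non-string raises.
arr = pd.array(["a", "b"], dtype="string")
try:
    arr[0] = 1
    raised = False
except (TypeError, ValueError):
    raised = True

# Conversion of non-string data instead goes through astype,
# which stringifies the values first.
converted = pd.array([1, 2]).astype("string")
```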
pandas/core/series.py
Outdated
@@ -2682,6 +2682,11 @@ def combine(self, other, func, fill_value=None) -> "Series":
             The value to assume when an index is missing from
             one Series or the other. The default specifies to use the
             appropriate NaN value for the underlying dtype of the Series.
+        dtype : str, numpy.dtype, or ExtensionDtype, optional
+            Data type for the output Series. If not specified, this will be
+            inferred from the combined data.
+1 on adding such a keyword!
@topper-123 workflow related question: can you please not force-push, but just add new commits (and merge master for updating with master)? That makes it a lot easier to review what has been updated since a previous review (and ensures github's UI to help with this actually works)
Ok, no problem, I'll stop force-pushing. I thought that made the commits more tidy and easier to read, but if that's wrong, I'll make regular commits in the future.
doc/source/whatsnew/v1.1.0.rst
Outdated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, declaring or converting to :class:`StringDtype` was in general only possible if the data was already only ``str`` or nan-like.
For example:
Can you use the Previous/Current way of formatting (see other notes)?
I would personally just leave out the "previous". Saying that it didn't work before, and only showing the new feature seems sufficient to me.
Please use the prescribed format; it's a standard that we have, and consistency is key.
It's only a standard when there is actually a "Previous" to show. For bugs, we also don't show the error message you got before it was fixed. I personally see the fact that you couldn't do astype("string") to convert to strings as a bug (or a not-yet-implemented feature, given how recent StringDtype is).
doc/source/whatsnew/v1.1.0.rst
Outdated
All dtypes can now be converted to ``StringDtype``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, declaring or converting to :class:`StringDtype` was in general only possible if the data was already only ``str`` or nan-like.
list the closed issues here at the end of the first statement
Ok.
pandas/core/series.py
Outdated
@@ -2768,6 +2768,10 @@ def combine(self, other, func, fill_value=None) -> "Series":
         if is_categorical_dtype(self.dtype):
             pass
         elif is_extension_array_dtype(self.dtype):
+            # Everything can be converted to strings, but we may not want to convert
Really don't add things here; if you need to modify maybe_cast_to_extension_array then it's ok. This becomes hugely complicated to follow otherwise.
Yes, I would move this into maybe_cast_to_extension_array (we will be able to remove this once the "strict" from_scalars is implemented, and having the logic in maybe_cast_to_extension_array will make it easier to not forget to update that).
Ok, found a way to avoid changing series.py at all.
Updated.
doc/source/user_guide/text.rst
Outdated
or convert from existing pandas data:

    s1 = pd.Series([1,2, np.nan], dtype="Int64")
this won't render (needs an ipython block here as well)
Fixed.
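The user-guide snippet under discussion, rendered as runnable code (this conversion relies on the permissive astype this PR introduces):

```python
import numpy as np
import pandas as pd

# Convert existing nullable-integer data to the string dtype;
# the missing value is preserved as <NA>.
s1 = pd.Series([1, 2, np.nan], dtype="Int64")
s2 = s1.astype("string")
```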
Thanks @topper-123, very nice!
pd.StringDtype()
#31204
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This is a proposal to make using StringDtype more permissive and usable in place of dtype=str.

ATM converting to StringDtype will only accept arrays that are str already, meaning you will often have to use astype(str).astype("string") to be sure not to get errors, which can be tedious. For example these fail in master and work in this PR: ... etc. now work. Previously the above all gave errors.
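A sketch of the workaround described above versus the direct conversion this PR enables (the values are illustrative):

```python
import pandas as pd

s = pd.Series([1, 2, 3])

# The tedious two-step route that was needed before:
via_object = s.astype(str).astype("string")

# The direct conversion this PR makes possible:
direct = s.astype("string")
```

Both routes produce the same nullable-string series; the direct route simply skips the intermediate object-dtype copy.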
The proposed solution:

ExtensionArray._from_sequence is in master explicitly stated to expect a sequence of scalars of the correct type. This makes it not usable for type conversion.

I've therefore added a new ExtensionArray._from_sequence_of_any_type that accepts scalars of any type and may change the scalars to the correct type before passing them on to _from_sequence. Currently it just routes everything through to ExtensionArray._from_sequence, except in StringArray._from_sequence_of_any_type, where it massages the input scalars into strings before passing them on.

Obviously tests and doc updates are still missing, but I would appreciate feedback on whether this solution looks ok; I'll add tests and doc updates if so.
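The proposal can be sketched in plain Python. These classes are simplified stand-ins for illustration only, not the actual pandas ExtensionArray/StringArray implementations:

```python
# Simplified stand-ins for the classes named in the proposal; the real
# pandas implementations have many more responsibilities.
class ExtensionArraySketch:
    def __init__(self, data):
        self.data = data

    @classmethod
    def _from_sequence(cls, scalars):
        # Strict: documented to expect scalars of the correct type.
        return cls(list(scalars))

    @classmethod
    def _from_sequence_of_any_type(cls, scalars):
        # Permissive default: route everything through _from_sequence.
        return cls._from_sequence(scalars)


class StringArraySketch(ExtensionArraySketch):
    @classmethod
    def _from_sequence_of_any_type(cls, scalars):
        # Massage arbitrary scalars into strings first, keeping
        # missing values (None here) as-is.
        coerced = [s if s is None else str(s) for s in scalars]
        return cls._from_sequence(coerced)


out = StringArraySketch._from_sequence_of_any_type([1, None, 2.5])
```

The base-class hook keeps existing strict behaviour, while subclasses like the string array override it to coerce input first.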