
API: more permissive conversion to StringDtype #33465

Merged · 10 commits · May 26, 2020

Conversation

@topper-123 (Contributor) commented Apr 10, 2020

This is a proposal to make conversion to StringDtype more permissive, so that it is usable in place of dtype=str.

At the moment, converting to StringDtype only accepts arrays whose elements are already str, meaning you often have to chain astype(str).astype("string") to be sure not to get errors, which can be tedious. For example, these fail in master but work in this PR:

>>> pd.Series([1,2, np.nan], dtype="string")
0       1
1       2
2    <NA>
dtype: string
>>> pd.array([1,2, np.nan], dtype="string")
<StringArray>
['1', '2', <NA>]
Length: 3, dtype: string
>>> pd.Series([1,2, np.nan]).astype("string")
0     1.0
1     2.0
2    <NA>
dtype: string
>>> pd.Series([1,2, np.nan], dtype="Int64").astype("string")
0       1
1       2
2    <NA>
dtype: string

These and similar cases now work; previously they all raised errors.

The proposed solution:

In master, ExtensionArray._from_sequence is explicitly documented to expect a sequence of scalars of the correct type, which makes it unusable for type conversion.

I've therefore added a new ExtensionArray._from_sequence_of_any_type that accepts scalars of any type and may convert them to the correct type before passing them on to _from_sequence. Currently it routes everything straight through to ExtensionArray._from_sequence, except in StringArray._from_sequence_of_any_type, where it massages the input scalars into strings first.
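Conceptually, the new hook boils down to a coercion step in front of the strict constructor. A minimal pure-Python sketch of that idea (the function name mirrors the proposal, but the NA handling is illustrative and this is not the actual pandas implementation):

```python
import math

def from_sequence_of_any_type(scalars):
    # Illustrative sketch only: coerce arbitrary scalars to str, leaving
    # NA-like values untouched, before handing the result to the strict
    # _from_sequence equivalent. None stands in for pd.NA here.
    result = []
    for v in scalars:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            result.append(None)
        else:
            result.append(str(v))
    return result
```

With this in place, `from_sequence_of_any_type([1, 2, float("nan")])` yields `["1", "2", None]`, mirroring how the PR lets `pd.array([1, 2, np.nan], dtype="string")` succeed.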

Tests and doc updates are still missing, but I would appreciate feedback on whether this solution looks OK; I'll add tests and doc updates if it does.

@TomAugspurger (Contributor):

#33254 has a discussion about the appropriate strictness of _from_sequence.

@jreback jreback added NA - MaskedArrays Related to pd.NA and nullable extension arrays Strings String extension data type and string data ExtensionArray Extending pandas with custom dtypes or arrays. and removed NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Apr 10, 2020
@topper-123 (Author) commented Apr 11, 2020

The doctest failures are very strange; it doesn't make sense that they fail in my PR but not in other PRs...

From the discussion in #33254 it looks like a dedicated _from_sequence_of_any_type would not be the consensus approach; instead its content could be pushed into _from_sequence, perhaps with a new _from_scalars added. But it doesn't seem like the issue is resolved in that discussion yet?

Any comments on the current PR?

@jorisvandenbossche (Member):

The implementation itself looks good to me.
Awaiting resolution of the discussion in #33254, I think in principle we can already add this behaviour to _from_sequence, it would at least fix the conversion issues we have right now.

@topper-123 (Author) commented Apr 12, 2020

I've updated the PR:

  • Moved the new functionality to StringArray._from_sequence
  • Added tests to BaseCastingTests.test_astype_string
  • Made changes to make the tests pass
  • Updated docs.
  • Added a dtype argument to Series.combine.

The last bullet point addresses the situation where you combine two string Series and want the result in another extension dtype: doing that in an inferred way is very difficult. I think it's better and safer in those cases to just supply the desired dtype.
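To illustrate the rationale, here is a toy model of combine with an explicit dtype (plain lists stand in for Series, and the dtype is modeled as a cast callable; this is not the real Series.combine signature):

```python
def combine(left, right, func, dtype=None):
    # Apply func elementwise, like Series.combine does.
    result = [func(a, b) for a, b in zip(left, right)]
    if dtype is not None:
        # Explicit cast requested by the caller: no inference needed.
        result = [dtype(v) for v in result]
    return result

# With string inputs, inferring a numeric result dtype is hard;
# passing dtype explicitly sidesteps the problem.
combine(["1", "2"], ["3", "4"], lambda a, b: int(a) + int(b), dtype=str)
```

Without the dtype argument the result keeps whatever types func produced, which is the inference problem the comment describes.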

@topper-123 topper-123 force-pushed the astype_string branch 2 times, most recently from 17c7ff2 to 8efc0d0 Compare April 12, 2020 16:00
@@ -320,7 +320,7 @@ def update_dtype(self, dtype):
dtype = pandas_dtype(dtype)

if not isinstance(dtype, cls):
fill_value = astype_nansafe(np.array(self.fill_value), dtype).item()
fill_value = astype_nansafe(np.array([self.fill_value]), dtype)[0]
Contributor Author:

ExtensionArrays don't have an .item method.

Member:

Do you remember why this was needed? (Sparse currently doesn't support using EAs under the hood, so in principle, we should never get an extension dtype here)

Contributor Author:

The situation that fails if we use .item() is when dtype is a StringDtype and cls is a SparseDtype.

I'm not very familiar with sparse arrays, but it seems reasonable that string arrays can be converted to sparse arrays?

Member:

it seems reasonable that string arrays can be converted to sparse arrays?

In principle, yes, but in practice they don't support EAs as underlying values (SparseArray.__init__ converts any input to a numpy array). So I think this should raise an informative error when dtype is an extension dtype and not a numpy dtype.
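The suggested guard could look roughly like this (a hypothetical helper sketching the review suggestion, not the actual SparseDtype.update_dtype code):

```python
import numpy as np

def sparse_update_fill_value(fill_value, dtype):
    # SparseArray keeps a numpy array under the hood, so requesting an
    # extension dtype should raise an informative error rather than
    # silently coercing (per the review comment above). Anything that is
    # not a np.dtype stands in for "extension dtype" in this sketch.
    if not isinstance(dtype, np.dtype):
        raise TypeError(f"SparseArray cannot hold extension dtype {dtype!r}")
    # Cast via a length-1 array and index, the pattern used in the PR,
    # instead of .item(), which extension arrays don't implement.
    return np.array([fill_value]).astype(dtype)[0]
```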

@topper-123 topper-123 force-pushed the astype_string branch 2 times, most recently from 2a835e4 to eb341bf Compare April 14, 2020 07:07
@@ -172,15 +172,16 @@ def test_combine_le(self, data_repeated):
orig_data1, orig_data2 = data_repeated(2)
s1 = pd.Series(orig_data1)
s2 = pd.Series(orig_data2)
result = s1.combine(s2, lambda x1, x2: x1 <= x2)
result = s1.combine(s2, lambda x1, x2: x1 <= x2, dtype="boolean")
@topper-123 (Author) Apr 14, 2020:

Inferring to a different dtype than the original in a general way is quite difficult, and it's better to be explicit in such cases IMO.

Member:

Although I agree in general (certainly when using combine for constructing the expected result), is this change required in this specific case? (I would assume nothing changed related to this.) It might still be good to test the case without specifying the dtype explicitly, as that is public API.

Contributor Author:

The problem here is that almost anything can be converted to a string. I've made a change though: it now infers the dtype, which is not ideal performance-wise in the case where the series is a StringDtype but the result might not be.

If the user supplies the dtype manually, the performance is OK.

@@ -89,6 +115,7 @@ Other enhancements
- :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`)
- :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`)
- :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` accessor that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`).
- :meth:`Series.combine` has gained a ``dtype`` argument. If supplied, the combined series will get that dtype (:issue:`33465`)
Contributor:

not averse to this, but can we do as a separate change?

@@ -431,6 +431,11 @@ def astype(self, dtype, copy=True):
array : ndarray
NumPy ndarray with 'dtype' for its dtype.
"""
from pandas.core.arrays.string_ import StringDtype
Contributor:

So do we actually ever hit this base type? I think we override this everywhere.

Contributor Author:

Float arrays are PandasArrays, and those get their astype from ExtensionArray:

>>> pd.array([1.5, 2.5])
<PandasArray>
[1.5, 2.5]
Length: 2, dtype: float64

Allowing any ExtensionArray by default to convert to StringArray seems reasonable to me (and if subclassers don't want that, they can make their own astype implementation disallowing StringArrays).
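A rough sketch of the dispatch this adds to the base astype (modeled on plain lists; the dtype handling and names are illustrative, not the pandas code):

```python
import math

def base_astype(values, dtype):
    # If the target is the string dtype, build string values (NaN maps to
    # an NA stand-in, here None); otherwise fall back to a plain cast.
    # This mirrors the shape of the change, not its actual implementation.
    if dtype == "string":
        return [None if (isinstance(v, float) and math.isnan(v)) else str(v)
                for v in values]
    return [dtype(v) for v in values]

base_astype([1.5, 2.5], "string")
```

Any array type can take the string branch, which is why a float-backed PandasArray now converts cleanly via the base class.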

Contributor:

OK, can you add a comment to this effect here (I agree with your statement), as a comment for future readers?

Member:

can you add a comment to this effect here

All methods in this base class are there for subclasses to (potentially) use, so I don't think a comment about that is needed
(a comment about always being able to astype to string dtype is fine though)

Contributor:

I understand. I would move the impl that is currently in Decimal to here, as it correctly handles the astype from the same type (whereas this one will coerce to a numpy array).

pandas/tests/extension/base/casting.py (resolved)
expected = pd.Series(
[a <= b for (a, b) in zip(list(orig_data1), list(orig_data2))]
[a <= b for (a, b) in zip(list(orig_data1), list(orig_data2))],
Contributor:

What happens if we don't specify dtype? Can you add examples for this as well? (It's possible we should also raise in some cases.)

Contributor Author:

I've made a change to infer the dtype, i.e. give the same result as before.

pandas/tests/extension/decimal/array.py (resolved)
@jorisvandenbossche (Member) left a comment:

Thanks for this, @topper-123 (and sorry for the slow review). As I said in #33254, I think we can fix this bug regardless of the progress of the bigger discussion in that issue.

pandas/core/arrays/string_.py (outdated, resolved)
@@ -152,13 +157,13 @@ class StringArray(PandasArray):
['This is', 'some text', <NA>, 'data.']
Length: 4, dtype: string

Unlike ``object`` dtype arrays, ``StringArray`` doesn't allow non-string
values.
Member:

I think it is important to keep the original intent of this part in some way, as it was meant to explain that StringArray will only contain strings (as opposed to object dtype, which is not strict here). This could maybe also be shown by setting a non-string value (arr[0] = 1) or so.

Member:

How you changed it still doesn't explain the original intent, as mentioned above, IMO. I would explain both (only strings allowed, plus showing the automatic conversion).

Contributor Author:

Ok, I've tried it differently.

@@ -2682,6 +2682,11 @@ def combine(self, other, func, fill_value=None) -> "Series":
The value to assume when an index is missing from
one Series or the other. The default specifies to use the
appropriate NaN value for the underlying dtype of the Series.
dtype : str, numpy.dtype, or ExtensionDtype, optional
Data type for the output Series. If not specified, this will be
inferred from the combined data.
Member:

+1 on adding such a keyword!

pandas/tests/extension/base/casting.py (outdated, resolved)
@topper-123 topper-123 force-pushed the astype_string branch 3 times, most recently from 9bcb6a8 to 3d7b050 Compare May 23, 2020 18:56
@jorisvandenbossche (Member):

@topper-123, a workflow-related question: can you please not force-push, but just add new commits (and merge master when updating with master)? That makes it a lot easier to review what has been updated since a previous review (and ensures GitHub's UI for this actually works).

@topper-123 (Author):

Ok, no problem, I’ll stop force-pushing. I thought that made the commits more tidy and easier to read, but if that’s wrong, I’ll make regular commits in the future.

doc/source/whatsnew/v1.1.0.rst (resolved)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, declaring or converting to :class:`StringDtype` was in general only possible if the data was already only ``str`` or nan-like.
For example:
Contributor:

Can you use the "Previous" / "Current" way of formatting? (See other notes.)

Member:

I would personally just leave out the "previous". Saying that it didn't work before, and only showing the new feature seems sufficient to me.

Contributor:

Please use the prescribed format; it's a standard that we have, and consistency is key.

Member:

It's only a standard when there is actually a "Previous" to show. For bugs, we also don't show the error message you got before the fix. I personally see the fact that you couldn't do astype("string") to convert to strings as a bug (or a not-yet-implemented feature, given how recent StringDtype is).

All dtypes can now be converted to ``StringDtype``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, declaring or converting to :class:`StringDtype` was in general only possible if the data was already only ``str`` or nan-like.
Contributor:

List the closed issues here at the end of the first statement.

Contributor Author:

Ok.

pandas/core/arrays/string_.py (resolved)
@@ -2768,6 +2768,10 @@ def combine(self, other, func, fill_value=None) -> "Series":
if is_categorical_dtype(self.dtype):
pass
elif is_extension_array_dtype(self.dtype):
# Everything can be be converted to strings, but we may not want to convert
Contributor:

Really don't add things here; if you need to modify maybe_cast_to_extension_array then that's OK.

It becomes hugely complicated to follow the path otherwise.

Member:

Yes, I would move this into maybe_cast_to_extension_array (we will be able to remove this once the "strict" from_scalars is implemented, and having the logic in maybe_cast_to_extension_array will make it easier to not forget to update that)

Contributor Author:

OK, I found a way to avoid changing series.py at all.

pandas/tests/extension/decimal/array.py (resolved)
@topper-123 topper-123 force-pushed the astype_string branch 2 times, most recently from 1b3a8f5 to 239d0cc Compare May 26, 2020 09:24
@topper-123 (Author):

Updated.


or convert from existing pandas data:

s1 = pd.Series([1,2, np.nan], dtype="Int64")
Contributor:

this won't render (needs an ipython block here as well)

Contributor Author:

Fixed.

pandas/core/arrays/integer.py (resolved)
pandas/tests/extension/decimal/array.py (resolved)
doc/source/whatsnew/v1.1.0.rst (outdated, resolved)
@jreback jreback added this to the 1.1 milestone May 26, 2020
@jreback jreback merged commit b6ea970 into pandas-dev:master May 26, 2020
@jreback (Contributor) commented May 26, 2020

Thanks @topper-123, very nice!

Labels: ExtensionArray (Extending pandas with custom dtypes or arrays), Strings (String extension data type and string data)

Successfully merging this pull request may close these issues:

convert numeric column to dedicated pd.StringDtype()

4 participants