
API: more permissive conversion to StringDtype #33465

Merged · 10 commits · May 26, 2020

Conversation

@topper-123 (Contributor) commented Apr 10, 2020

This is a proposal to make conversion to StringDtype more permissive, so that it is usable in place of dtype=str.

At the moment, converting to StringDtype only accepts arrays whose elements are already str, meaning you often have to chain astype(str).astype("string") to be sure not to get errors, which can be tedious. For example, these fail in master but work in this PR:

>>> pd.Series([1,2, np.nan], dtype="string")
0       1
1       2
2    <NA>
dtype: string
>>> pd.array([1,2, np.nan], dtype="string")
<StringArray>
['1', '2', <NA>]
Length: 3, dtype: string
>>> pd.Series([1,2, np.nan]).astype("string")
0     1.0
1     2.0
2    <NA>
dtype: string
>>> pd.Series([1,2, np.nan], dtype="Int64").astype("string")
0       1
1       2
2    <NA>
dtype: string

These and similar cases now work; previously they all raised errors.

The proposed solution:

In master, ExtensionArray._from_sequence is explicitly documented to expect a sequence of scalars of the correct type, which makes it unusable for type conversion.

I've therefore added a new ExtensionArray._from_sequence_of_any_type that accepts scalars of any type and may convert them to the correct type before passing them on to _from_sequence. Currently it routes everything straight through to ExtensionArray._from_sequence, except in StringArray._from_sequence_of_any_type, where it massages the input scalars into strings first.
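Conceptually, the new hook boils down to a coercion step in front of the strict constructor. A minimal pure-Python sketch of that idea (the function name mirrors the proposal, but the NA handling is illustrative and this is not the actual pandas implementation):

```python
import math

def from_sequence_of_any_type(scalars):
    # Illustrative sketch only: coerce arbitrary scalars to str, leaving
    # NA-like values untouched, before handing the result to the strict
    # _from_sequence equivalent. None stands in for pd.NA here.
    result = []
    for v in scalars:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            result.append(None)
        else:
            result.append(str(v))
    return result
```

With this in place, `from_sequence_of_any_type([1, 2, float("nan")])` yields `["1", "2", None]`, mirroring how the PR lets `pd.array([1, 2, np.nan], dtype="string")` succeed.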

Tests and doc updates are still missing, but I would appreciate feedback on whether this solution looks OK; I'll add tests and doc updates if it does.

@TomAugspurger (Contributor):

#33254 has a discussion about the appropriate strictness of _from_sequence.

@jreback jreback added NA - MaskedArrays Related to pd.NA and nullable extension arrays Strings String extension data type and string data ExtensionArray Extending pandas with custom dtypes or arrays. and removed NA - MaskedArrays Related to pd.NA and nullable extension arrays labels Apr 10, 2020
@topper-123 (Author) commented Apr 11, 2020

The doctest failures are very strange; it doesn't make sense that they fail in my PR but not in other PRs...

From the discussion in #33254 it looks like a dedicated _from_sequence_of_any_type would not be the consensus approach; instead its content could be pushed into _from_sequence, perhaps with a new _from_scalars added. But it doesn't seem like the issue is resolved in that discussion yet?

Any comments on the current PR?

@jorisvandenbossche (Member):

The implementation itself looks good to me.
Awaiting resolution of the discussion in #33254, I think in principle we can already add this behaviour to _from_sequence, it would at least fix the conversion issues we have right now.

@topper-123 (Author) commented Apr 12, 2020

I've updated the PR:

  • Moved the new functionality to StringArray._from_sequence
  • Added tests to BaseCastingTests.test_astype_string
  • Made changes to make the tests pass
  • Updated docs.
  • Added a dtype argument to Series.combine.

The last bullet point addresses the situation where you combine two string Series and want the result in another extension dtype: doing that in an inferred way is very difficult. I think it's better and safer in those cases to just supply the desired dtype.
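To illustrate the rationale, here is a toy model of combine with an explicit dtype (plain lists stand in for Series, and the dtype is modeled as a cast callable; this is not the real Series.combine signature):

```python
def combine(left, right, func, dtype=None):
    # Apply func elementwise, like Series.combine does.
    result = [func(a, b) for a, b in zip(left, right)]
    if dtype is not None:
        # Explicit cast requested by the caller: no inference needed.
        result = [dtype(v) for v in result]
    return result

# With string inputs, inferring a numeric result dtype is hard;
# passing dtype explicitly sidesteps the problem.
combine(["1", "2"], ["3", "4"], lambda a, b: int(a) + int(b), dtype=str)
```

Without the dtype argument the result keeps whatever types func produced, which is the inference problem the comment describes.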

@topper-123 topper-123 force-pushed the astype_string branch 2 times, most recently from 17c7ff2 to 8efc0d0 Compare April 12, 2020 16:00
@@ -320,7 +320,7 @@ def update_dtype(self, dtype):
dtype = pandas_dtype(dtype)

if not isinstance(dtype, cls):
fill_value = astype_nansafe(np.array(self.fill_value), dtype).item()
fill_value = astype_nansafe(np.array([self.fill_value]), dtype)[0]
Contributor Author:

ExtensionArrays don't have an .item method.

Member:

Do you remember why this was needed? (Sparse currently doesn't support using EAs under the hood, so in principle, we should never get an extension dtype here)

Contributor Author:

The situation that fails if we use .item() is when dtype is a StringDtype and cls is a SparseDtype.

I'm not very familiar with sparse arrays, but it seems reasonable that string arrays can be converted to sparse arrays?

Member:

it seems reasonable that string arrays can be converted to sparse arrays?

In principle, yes, but in practice they don't support EAs as underlying values (SparseArray.__init__ converts any input to a numpy array). So I think this should raise an informative error when dtype is an extension dtype and not a numpy dtype.
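The suggested guard could look roughly like this (a hypothetical helper sketching the review suggestion, not the actual SparseDtype.update_dtype code):

```python
import numpy as np

def sparse_update_fill_value(fill_value, dtype):
    # SparseArray keeps a numpy array under the hood, so requesting an
    # extension dtype should raise an informative error rather than
    # silently coercing (per the review comment above). Anything that is
    # not a np.dtype stands in for "extension dtype" in this sketch.
    if not isinstance(dtype, np.dtype):
        raise TypeError(f"SparseArray cannot hold extension dtype {dtype!r}")
    # Cast via a length-1 array and index, the pattern used in the PR,
    # instead of .item(), which extension arrays don't implement.
    return np.array([fill_value]).astype(dtype)[0]
```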

@topper-123 topper-123 force-pushed the astype_string branch 2 times, most recently from 2a835e4 to eb341bf Compare April 14, 2020 07:07
@@ -172,15 +172,16 @@ def test_combine_le(self, data_repeated):
orig_data1, orig_data2 = data_repeated(2)
s1 = pd.Series(orig_data1)
s2 = pd.Series(orig_data2)
result = s1.combine(s2, lambda x1, x2: x1 <= x2)
result = s1.combine(s2, lambda x1, x2: x1 <= x2, dtype="boolean")
@topper-123 (Author) Apr 14, 2020:

Inferring to a different dtype than the original in a general way is quite difficult, and it's better to be explicit in such cases IMO.

Member:

Although I agree in general (certainly when using combine for constructing the expected result), is this change required in this specific case? (I would assume nothing changed related to this.) It might still be good to test the case without specifying the dtype explicitly, as that is public API.

Contributor Author:

The problem here is that almost anything can be converted to a string. I've made a change though: it now infers the dtype, which is not ideal performance-wise in the case where the series is a StringDtype but the result might not be.

If the user supplies the dtype manually, the performance is OK.

@@ -89,6 +115,7 @@ Other enhancements
- :meth:`DataFrame.sample` will now also allow array-like and BitGenerator objects to be passed to ``random_state`` as seeds (:issue:`32503`)
- :meth:`MultiIndex.union` will now raise `RuntimeWarning` if the object inside are unsortable, pass `sort=False` to suppress this warning (:issue:`33015`)
- :class:`Series.dt` and :class:`DatatimeIndex` now have an `isocalendar` accessor that returns a :class:`DataFrame` with year, week, and day calculated according to the ISO 8601 calendar (:issue:`33206`).
- :meth:`Series.combine` has gained a ``dtype`` argument. If supplied, the combined series will get that dtype (:issue:`33465`)
Contributor:

not averse to this, but can we do as a separate change?

@@ -431,6 +431,11 @@ def astype(self, dtype, copy=True):
array : ndarray
NumPy ndarray with 'dtype' for its dtype.
"""
from pandas.core.arrays.string_ import StringDtype
Contributor:

So do we actually ever hit this base type? I think we override this everywhere.

Contributor Author:

Float arrays are PandasArrays, and those get their astype from ExtensionArray:

>>> pd.array([1.5, 2.5])
<PandasArray>
[1.5, 2.5]
Length: 2, dtype: float64

Allowing any ExtensionArray by default to convert to StringArray seems reasonable to me (and if subclassers don't want that, they can make their own astype implementation disallowing StringArrays).
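A rough sketch of the dispatch this adds to the base astype (modeled on plain lists; the dtype handling and names are illustrative, not the pandas code):

```python
import math

def base_astype(values, dtype):
    # If the target is the string dtype, build string values (NaN maps to
    # an NA stand-in, here None); otherwise fall back to a plain cast.
    # This mirrors the shape of the change, not its actual implementation.
    if dtype == "string":
        return [None if (isinstance(v, float) and math.isnan(v)) else str(v)
                for v in values]
    return [dtype(v) for v in values]

base_astype([1.5, 2.5], "string")
```

Any array type can take the string branch, which is why a float-backed PandasArray now converts cleanly via the base class.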

Contributor:

OK, can you add a comment to this effect here (I agree with your statement), as a comment for future readers?

Member:

can you add a comment to this effect here

All methods in this base class are there for subclasses to (potentially) use, so I don't think a comment about that is needed
(a comment about always being able to astype to string dtype is fine though)

Contributor:

I understand. I would move the impl that is currently in Decimal to here, as it correctly handles the astype from the same type (whereas this one will coerce to a numpy array).

pandas/tests/extension/base/casting.py (resolved)
expected = pd.Series(
[a <= b for (a, b) in zip(list(orig_data1), list(orig_data2))]
[a <= b for (a, b) in zip(list(orig_data1), list(orig_data2))],
Contributor:

What happens if we don't specify dtype? Can you add examples for this as well? (It's possible we should also raise in some cases.)

Contributor Author:

I've made a change to infer the dtype, i.e. give the same result as before.

pandas/tests/extension/decimal/array.py (resolved)
@jorisvandenbossche (Member) left a comment:

Thanks for this, @topper-123 (and sorry for the slow review). As I said in #33254, I think we can fix this bug regardless of the progress of the bigger discussion in that issue.

pandas/core/arrays/string_.py (outdated, resolved)
@@ -152,13 +157,13 @@ class StringArray(PandasArray):
['This is', 'some text', <NA>, 'data.']
Length: 4, dtype: string

Unlike ``object`` dtype arrays, ``StringArray`` doesn't allow non-string
values.
Member:

I think it is important to keep the original intent of this part in some way, as it was meant to explain that StringArray will only contain strings (as opposed to object dtype, which is not strict here). This could maybe also be shown by setting a non-string value (arr[0] = 1) or so.

Member:

How you changed it still doesn't explain the original intent, as mentioned above, IMO. I would explain both (only strings allowed, plus showing the automatic conversion).

Contributor Author:

Ok, I've tried it differently.

@@ -2682,6 +2682,11 @@ def combine(self, other, func, fill_value=None) -> "Series":
The value to assume when an index is missing from
one Series or the other. The default specifies to use the
appropriate NaN value for the underlying dtype of the Series.
dtype : str, numpy.dtype, or ExtensionDtype, optional
Data type for the output Series. If not specified, this will be
inferred from the combined data.
Member:

+1 on adding such a keyword!

pandas/tests/extension/base/casting.py (outdated, resolved)
@topper-123 topper-123 force-pushed the astype_string branch 3 times, most recently from 9bcb6a8 to 3d7b050 Compare May 23, 2020 18:56
@jorisvandenbossche (Member):

@topper-123, a workflow-related question: can you please not force-push, but just add new commits (and merge master when updating with master)? That makes it a lot easier to review what has been updated since a previous review (and ensures GitHub's UI for this actually works).

@topper-123 (Author):

Ok, no problem, I’ll stop force-pushing. I thought that made the commits more tidy and easier to read, but if that’s wrong, I’ll make regular commits in the future.

doc/source/whatsnew/v1.1.0.rst (resolved)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, declaring or converting to :class:`StringDtype` was in general only possible if the data was already only ``str`` or nan-like.
For example:
Contributor:

Can you use the "Previous" / "Current" way of formatting? (See other notes.)

Member:

I would personally just leave out the "previous". Saying that it didn't work before, and only showing the new feature seems sufficient to me.

Contributor:

Please use the prescribed format; it's a standard that we have, and consistency is key.

Member:

It's only a standard when there is actually a "Previous" to show. For bugs, we also don't show the error message you got before the fix. I personally see the fact that you couldn't do astype("string") to convert to strings as a bug (or a not-yet-implemented feature, given how recent StringDtype is).

All dtypes can now be converted to ``StringDtype``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Previously, declaring or converting to :class:`StringDtype` was in general only possible if the data was already only ``str`` or nan-like.
Contributor:

List the closed issues here at the end of the first statement.

Contributor Author:

Ok.

pandas/core/arrays/string_.py (resolved)
@@ -2768,6 +2768,10 @@ def combine(self, other, func, fill_value=None) -> "Series":
if is_categorical_dtype(self.dtype):
pass
elif is_extension_array_dtype(self.dtype):
# Everything can be be converted to strings, but we may not want to convert
Contributor:

Really don't add things here; if you need to modify maybe_cast_to_extension_array then that's OK.

It becomes hugely complicated to follow the path otherwise.

Member:

Yes, I would move this into maybe_cast_to_extension_array (we will be able to remove this once the "strict" from_scalars is implemented, and having the logic in maybe_cast_to_extension_array will make it easier to not forget to update that)

Contributor Author:

OK, I found a way to avoid changing series.py at all.

pandas/tests/extension/decimal/array.py (resolved)
@topper-123 topper-123 force-pushed the astype_string branch 2 times, most recently from 1b3a8f5 to 239d0cc Compare May 26, 2020 09:24
@topper-123 (Author):

Updated.


or convert from existing pandas data:

s1 = pd.Series([1,2, np.nan], dtype="Int64")
Contributor:

this won't render (needs an ipython block here as well)

Contributor Author:

Fixed.

pandas/core/arrays/integer.py (resolved)
pandas/tests/extension/decimal/array.py (resolved)
doc/source/whatsnew/v1.1.0.rst (outdated, resolved)
@jreback jreback added this to the 1.1 milestone May 26, 2020
@jreback jreback merged commit b6ea970 into pandas-dev:master May 26, 2020
@jreback (Contributor) commented May 26, 2020

Thanks @topper-123, very nice!

Labels: ExtensionArray (Extending pandas with custom dtypes or arrays), Strings (String extension data type and string data)

Successfully merging this pull request may close these issues:

convert numeric column to dedicated pd.StringDtype()

4 participants