added pyarrow/numpy dtype literals and allowed `str` | `DtypeObj` as input for `Series.astype` #756

randolf-scholz · 2023-07-21T17:43:19Z

Closes error: Argument 1 to "astype" of "DataFrame" has incompatible type "Literal['int32[pyarrow]']"; #733
Closes error: Argument 1 to "astype" of "Series" has incompatible type "dtype[generic] | ExtensionDtype"; expected "type[object] | ExtensionDtype" [arg-type] #747
Tests added: Please use assert_type() to assert the type of any return value
pandas nullable UInt data types were missing
numpy type code literals
numpy alternative literals (half, short, double, etc.)
pyarrow literals
added DtypeObj | _str to final astype overload (resulting in Series[Any])
added few tests
fixed random bug: def __getattr__(self, name: str) -> S1 accidentally using str instead of _str.

Dr-Irv

For all the possible arguments you added for astype(), can you add an appropriate test in test_series.test_updated_astype() ? We were really careful there to make sure the tests matched all the things as possible arguments.

Dr-Irv · 2023-07-21T18:08:00Z

pandas-stubs/_typing.pyi

    | type[object]
+    | str


I don't think we should accept a general string here. The whole idea of the other aliases is to constrain which strings are acceptable.

str must be acceptable, as e.g. a string might be provided dynamically. Same is if you accept generic dtype object, which one must if s.astye(s.dtype) should work without raising a warning.

The correct way to handle this is to be careful with overload order, going from special to generic.
All of the existing tests still work with this change.

It makes the type check too wide. It allows pd.DataFrame().astype("foobar") to pass. I'd rather have the stubs catch the most used cases (not dynamic strings). Having said that, if we want s.astype(s.dtype) to work, maybe we should just say that s.dtype() returns AsTypeArg

If there was a way to specify a non-literal string, that would be ideal. If it sees a literal, the literal should be verified. I am not aware of a way to achieve that with current typing features.

Another note - I think in subjective cases like this, you have to ask, what is the most common use case. I would contend dynamic/runtime based astype calls are atypical. The majority of the time, it's a static transformation, and the developer is aware of the source dtype and target dtype. In that case, they would hard-code the destination type.

@gandhis1 As an example for dynamic astype(string): I have some data-pipelines for reading csv-files where I store config files with the target schema (e.g. column dtypes).

@gandhis1 @Dr-Irv In python 3.11 LiteralString was added, I wonder id this can be used to raise a typing error if none of the defined literals is matched? This would give best of both worlds.

I asked this here: python/typing#1434

Hm, doesn't seem possible currently. Possibly in the future if both Intersection and Not are supported: python/typing#801 (comment)

@gandhis1 As an example for dynamic astype(string): I have some data-pipelines for reading csv-files where I store config files with the target schema (e.g. column dtypes).

That's an interesting example, but that's a case where the argument you are using is dynamic, and you're trying to check it in a static typing context. If you config files had an incorrect string "foobar"not covered by the stubs, your code would fail at runtime.

IMHO, the goal of the static type checks is to catch things before runtime. So we have two alternatives:

Do not include str (as it is right now in the stubs) as a valid argument. Disadvantage is that you can't use a dynamic string as an argument as in your use case. Advantage is it covers only the valid strings that are statically declared.

Include str as a valid argument. Disadvantage is that we don't help people who use .astype("ing") instead of .astype("int") prior to run time. Advantage is your use case.

I would argue that people use .astype("some-string") more than have a dynamic string as the argument. So that's why I would prefer (1) above.

Dr-Irv · 2023-07-21T18:21:21Z

@randolf-scholz tests are failing

randolf-scholz · 2023-07-21T19:12:10Z

added tests for all the types
added ObjectDtypeArg

ObjectDtypeArg: TypeAlias = (
    # Builtin object type and its string alias
    type[object]  # noqa: Y030
    | Literal["object"]
    # Numpy object type and its string alias
    # https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.object_
    | type[np.object_]
    | Literal["O"]  # NOTE: "object_" not assigned
)

Dr-Irv

Tests are still failing.

I also think we should add the following test:

    if TYPE_CHECKING_INVALID_USAGE:
        s.astype("foobar")  # type: ignore[call-arg] # pyright: ignore[reportGeneralTypeIssues]

This will check that we are not allowing arbitrary strings.

randolf-scholz · 2023-07-24T13:06:03Z

@Dr-Irv I removed the str overload for now, and added the test you suggested. But I think this still should be discussed, possibly when LiteralString is supported by typecheckers one can do some overload trickery to get both dynamical string and typo-checking to work.

Apart from that, since I added ObjectDtypeArg one might consider adding object to S1. (and I'd suggest renaming S1 to something more descriptive like SeriesKind).

Also as a side questions: why do TimestampSeries and TimedeltaSeries, etc. exist? Why don't Series[Timestamp] / Series[Timestamp] work?

Dr-Irv · 2023-07-24T15:36:51Z

@Dr-Irv I removed the str overload for now, and added the test you suggested. But I think this still should be discussed, possibly when LiteralString is supported by typecheckers one can do some overload trickery to get both dynamical string and typo-checking to work.

I'm still seeing AsTypeArg including str, so will ask for that to be removed.

Apart from that, since I added ObjectDtypeArg one might consider adding object to S1.

Let's handle in a separate PR. I think if we do that there is a possibility of other errors getting in.

(and I'd suggest renaming S1 to something more descriptive like SeriesKind).

That's a historical artifact. Open to a PR that would change that.

Also as a side questions: why do TimestampSeries and TimedeltaSeries, etc. exist? Why don't Series[Timestamp] / Series[Timestamp] work?

This is because mypy can't handle the overloads correctly for the arithmetic operators on Series[Timestamp] and Series[Timedelta] . See #718 and links to discussion in pyright there.

Dr-Irv

Tests failing on Windows because of the int32 type. Still need to remove str from AsTypeArg

Dr-Irv · 2023-07-24T15:37:53Z

pandas-stubs/_typing.pyi

-    | type[object]
+    | ObjectDtypeArg
+    | DtypeObj
+    | str


This str still has to be removed.

Dr-Irv · 2023-07-24T15:40:27Z

tests/test_series.py

+    # int32
+    check(assert_type(s.astype(np.intc), "pd.Series[int]"), pd.Series, np.intc)
+    check(assert_type(s.astype("intc"), "pd.Series[int]"), pd.Series, np.intc)
+    check(assert_type(s.astype("int32"), "pd.Series[int]"), pd.Series, np.intc)


I think this should be np.int32 as last param to check so Windows will pass.

Hm, I'm confused now: Reading https://numpy.org/doc/stable/reference/arrays.scalars.html it sounds like intc should be the platform independent type. https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases says

Along with their (mostly) C-derived names, the integer, float, and complex data-types are also available using a bit-width convention so that an array of the right size can always be ensured.

>>> import pandas as pd >>> import numpy as np >>> s = pd.Series([1,2,3]) >>> s.dtype dtype('int64') >>> s32 = s.astype("int32") >>> s32.dtype dtype('int32') >>> type(s32[0]) <class 'numpy.int32'> >>> isinstance(s32[0], np.intc) False

The above is on Windows. When you use s.astype("int32"), then the resulting int is not an instance of np.intc. On Linux and Mac, it is. Not sure I understand why there is a difference, but it is what it is!

I refactored the tests to use pytest.mark.parametrize, because I expect more windows tests to fail, and need to find out which. Also added "<M8[...]" and "<m8[...]" type-codes for timestamp/timedelta.

What happens on windows if you try astype(np.intc) ?

What happens on windows if you try astype(np.intc) ?

>>> sc = s.astype(np.intc) >>> sc.dtype dtype('int32') >>> type(sc[0]) <class 'numpy.intc'>

So I guess on Windows s.astype("int32") and s.astype(np.intc) store objects of different types inside the series.

Dr-Irv · 2023-07-24T19:40:51Z

It looks like the Windows tests are failing, with the new parameterize stuff, you can get a complete list of what is failing. You may need to test the O/S to get the values appropriate for Windows.

randolf-scholz · 2023-07-25T09:52:59Z

There is another thing that came up: pandas supports numpys extended type codes:
https://numpy.org/doc/stable/reference/arrays.dtypes.html#string-dtype-note

There are hundreds of possible combinations, and for some data types unlimited number of literal strings that are allowed. For example s.astype("S25") implicitly creates a bytes datatype with 25-bytes. any positive number of integer bytes is valid.

The question is: how many of these should be hard-coded as literals, to avoid false-postives as currently str will not be allowed.

Dr-Irv · 2023-07-25T12:01:01Z

There is another thing that came up: pandas supports numpys extended type codes: https://numpy.org/doc/stable/reference/arrays.dtypes.html#string-dtype-note

There are hundreds of possible combinations, and for some data types unlimited number of literal strings that are allowed. For example s.astype("S25") implicitly creates a bytes datatype with 25-bytes. any positive number of integer bytes is valid.

The question is: how many of these should be hard-coded as literals, to avoid false-postives as currently str will not be allowed.

I say none. The goal of the pandas stubs is to cover the most common use cases. We try to strike a balance between being too narrow and too wide. IMHO, having str as a valid type for astype() is too wide and reduces the usefulness of the stubs in the most common use cases.

If someone actually uses these more esoteric strings with pandas, then they can open an issue and we can add them one by one.

randolf-scholz · 2023-07-25T14:08:19Z

I found np.sctypeDict which seems like a good list of aliases to support. I added a test to check that all these aliases are tested. The windows stuff is hopefully fixed as well.

Dr-Irv

some things were changed inadvertently. There are some tests you are skipping that don't need to be skipped on Windows. Still have some failures in CI for Windows.

There are cases where the results are different for Windows and Linux, so I think you will have to just do a different check call for those few cases based on platform

tests/test_series.py

randolf-scholz · 2023-07-25T15:48:44Z

Tests pass now

Dr-Irv

Thanks @randolf-scholz . Nice PR

randolf-scholz added 2 commits July 21, 2023 19:41

added pyarrow/numpy dtype literals & allowed str as astype input

b4efffd

removed accidental double float

6e422e1

Dr-Irv requested changes Jul 21, 2023

View reviewed changes

added ObjectDtypeArg and lots of unit tests for literals

f92534b

Dr-Irv requested changes Jul 22, 2023

View reviewed changes

removed str overload

a6de696

re-enabled s.astype(s.dtype) test

b56ea5d

randolf-scholz requested a review from Dr-Irv July 24, 2023 13:15

Dr-Irv requested changes Jul 24, 2023

View reviewed changes

refactored astype-tests to use pytest.mark.parametrize

e0154f5

randolf-scholz added 4 commits July 25, 2023 15:19

added VoidDtype, fixed some test issues

3aae26e

attempted fix for float96/complex192

195e790

added coded for testing that all types are tested

fae83c1

small edit

5ebcdb2

randolf-scholz requested a review from Dr-Irv July 25, 2023 14:05

Dr-Irv requested changes Jul 25, 2023

View reviewed changes

randolf-scholz added 3 commits July 25, 2023 17:23

removed float96, complex192 and fixed integer tests

4ad4325

reverted accidental Series renames

2296a85

removed windows check for test_astype_int

2471ec8

randolf-scholz requested a review from Dr-Irv July 25, 2023 15:29

reordered literals

b6f3829

Dr-Irv reviewed Jul 25, 2023

View reviewed changes

tests/test_series.py Outdated Show resolved Hide resolved

tests/test_series.py Outdated Show resolved Hide resolved

randolf-scholz requested a review from Dr-Irv July 25, 2023 15:48

Dr-Irv approved these changes Jul 25, 2023

View reviewed changes

Dr-Irv merged commit 490914f into pandas-dev:main Jul 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

added pyarrow/numpy dtype literals and allowed `str` | `DtypeObj` as input for `Series.astype` #756

added pyarrow/numpy dtype literals and allowed `str` | `DtypeObj` as input for `Series.astype` #756

randolf-scholz commented Jul 21, 2023 •

edited

Loading

Dr-Irv left a comment

Dr-Irv Jul 21, 2023

randolf-scholz Jul 21, 2023

Dr-Irv Jul 21, 2023

gandhis1 Jul 21, 2023

gandhis1 Jul 21, 2023

randolf-scholz Jul 21, 2023

randolf-scholz Jul 21, 2023

randolf-scholz Jul 21, 2023

randolf-scholz Jul 21, 2023

Dr-Irv Jul 22, 2023

Dr-Irv commented Jul 21, 2023

randolf-scholz commented Jul 21, 2023

Dr-Irv left a comment

randolf-scholz commented Jul 24, 2023 •

edited

Loading

Dr-Irv commented Jul 24, 2023

Dr-Irv left a comment

Dr-Irv Jul 24, 2023

Dr-Irv Jul 24, 2023

randolf-scholz Jul 24, 2023

Dr-Irv Jul 24, 2023

randolf-scholz Jul 24, 2023

randolf-scholz Jul 24, 2023

Dr-Irv Jul 24, 2023

Dr-Irv commented Jul 24, 2023

randolf-scholz commented Jul 25, 2023

Dr-Irv commented Jul 25, 2023

randolf-scholz commented Jul 25, 2023

Dr-Irv left a comment

randolf-scholz commented Jul 25, 2023

Dr-Irv left a comment

added pyarrow/numpy dtype literals and allowed str | DtypeObj as input for Series.astype #756

added pyarrow/numpy dtype literals and allowed str | DtypeObj as input for Series.astype #756

Conversation

randolf-scholz commented Jul 21, 2023 • edited Loading

Dr-Irv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dr-Irv commented Jul 21, 2023

randolf-scholz commented Jul 21, 2023

Dr-Irv left a comment

Choose a reason for hiding this comment

randolf-scholz commented Jul 24, 2023 • edited Loading

Dr-Irv commented Jul 24, 2023

Dr-Irv left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Dr-Irv commented Jul 24, 2023

randolf-scholz commented Jul 25, 2023

Dr-Irv commented Jul 25, 2023

randolf-scholz commented Jul 25, 2023

Dr-Irv left a comment

Choose a reason for hiding this comment

randolf-scholz commented Jul 25, 2023

Dr-Irv left a comment

Choose a reason for hiding this comment

added pyarrow/numpy dtype literals and allowed `str` | `DtypeObj` as input for `Series.astype` #756

added pyarrow/numpy dtype literals and allowed `str` | `DtypeObj` as input for `Series.astype` #756

randolf-scholz commented Jul 21, 2023 •

edited

Loading

randolf-scholz commented Jul 24, 2023 •

edited

Loading