Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added pyarrow/numpy dtype literals and allowed str | DtypeObj as input for Series.astype #756

Merged
merged 14 commits into from
Jul 25, 2023

Conversation

randolf-scholz
Copy link
Contributor

@randolf-scholz randolf-scholz commented Jul 21, 2023

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all the possible arguments you added for astype(), can you add an appropriate test in test_series.test_updated_astype() ? We were really careful there to make sure the tests matched all the things as possible arguments.

| type[object]
| str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should accept a general string here. The whole idea of the other aliases is to constrain which strings are acceptable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

str must be acceptable, as e.g. a string might be provided dynamically. Same is if you accept generic dtype object, which one must if s.astye(s.dtype) should work without raising a warning.

The correct way to handle this is to be careful with overload order, going from special to generic.
All of the existing tests still work with this change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes the type check too wide. It allows pd.DataFrame().astype("foobar") to pass. I'd rather have the stubs catch the most used cases (not dynamic strings). Having said that, if we want s.astype(s.dtype) to work, maybe we should just say that s.dtype() returns AsTypeArg

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there was a way to specify a non-literal string, that would be ideal. If it sees a literal, the literal should be verified. I am not aware of a way to achieve that with current typing features.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another note - I think in subjective cases like this, you have to ask, what is the most common use case. I would contend dynamic/runtime based astype calls are atypical. The majority of the time, it's a static transformation, and the developer is aware of the source dtype and target dtype. In that case, they would hard-code the destination type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandhis1 As an example for dynamic astype(string): I have some data-pipelines for reading csv-files where I store config files with the target schema (e.g. column dtypes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandhis1 @Dr-Irv In python 3.11 LiteralString was added, I wonder id this can be used to raise a typing error if none of the defined literals is matched? This would give best of both worlds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I asked this here: python/typing#1434

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, doesn't seem possible currently. Possibly in the future if both Intersection and Not are supported: python/typing#801 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gandhis1 As an example for dynamic astype(string): I have some data-pipelines for reading csv-files where I store config files with the target schema (e.g. column dtypes).

That's an interesting example, but that's a case where the argument you are using is dynamic, and you're trying to check it in a static typing context. If you config files had an incorrect string "foobar"not covered by the stubs, your code would fail at runtime.

IMHO, the goal of the static type checks is to catch things before runtime. So we have two alternatives:

  1. Do not include str (as it is right now in the stubs) as a valid argument. Disadvantage is that you can't use a dynamic string as an argument as in your use case. Advantage is it covers only the valid strings that are statically declared.
  2. Include str as a valid argument. Disadvantage is that we don't help people who use .astype("ing") instead of .astype("int") prior to run time. Advantage is your use case.

I would argue that people use .astype("some-string") more than have a dynamic string as the argument. So that's why I would prefer (1) above.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 21, 2023

@randolf-scholz tests are failing

@randolf-scholz
Copy link
Contributor Author

  • added tests for all the types
  • added ObjectDtypeArg
ObjectDtypeArg: TypeAlias = (
    # Builtin object type and its string alias
    type[object]  # noqa: Y030
    | Literal["object"]
    # Numpy object type and its string alias
    # https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.object_
    | type[np.object_]
    | Literal["O"]  # NOTE: "object_" not assigned
)

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are still failing.

I also think we should add the following test:

    if TYPE_CHECKING_INVALID_USAGE:
        s.astype("foobar")  # type: ignore[call-arg] # pyright: ignore[reportGeneralTypeIssues]

This will check that we are not allowing arbitrary strings.

@randolf-scholz
Copy link
Contributor Author

randolf-scholz commented Jul 24, 2023

@Dr-Irv I removed the str overload for now, and added the test you suggested. But I think this still should be discussed, possibly when LiteralString is supported by typecheckers one can do some overload trickery to get both dynamical string and typo-checking to work.

Apart from that, since I added ObjectDtypeArg one might consider adding object to S1. (and I'd suggest renaming S1 to something more descriptive like SeriesKind).

Also as a side questions: why do TimestampSeries and TimedeltaSeries, etc. exist? Why don't Series[Timestamp] / Series[Timestamp] work?

@randolf-scholz randolf-scholz requested a review from Dr-Irv July 24, 2023 13:15
@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 24, 2023

@Dr-Irv I removed the str overload for now, and added the test you suggested. But I think this still should be discussed, possibly when LiteralString is supported by typecheckers one can do some overload trickery to get both dynamical string and typo-checking to work.

I'm still seeing AsTypeArg including str, so will ask for that to be removed.

Apart from that, since I added ObjectDtypeArg one might consider adding object to S1.

Let's handle in a separate PR. I think if we do that there is a possibility of other errors getting in.

(and I'd suggest renaming S1 to something more descriptive like SeriesKind).

That's a historical artifact. Open to a PR that would change that.

Also as a side questions: why do TimestampSeries and TimedeltaSeries, etc. exist? Why don't Series[Timestamp] / Series[Timestamp] work?

This is because mypy can't handle the overloads correctly for the arithmetic operators on Series[Timestamp] and Series[Timedelta] . See #718 and links to discussion in pyright there.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests failing on Windows because of the int32 type. Still need to remove str from AsTypeArg

| type[object]
| ObjectDtypeArg
| DtypeObj
| str
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This str still has to be removed.

# int32
check(assert_type(s.astype(np.intc), "pd.Series[int]"), pd.Series, np.intc)
check(assert_type(s.astype("intc"), "pd.Series[int]"), pd.Series, np.intc)
check(assert_type(s.astype("int32"), "pd.Series[int]"), pd.Series, np.intc)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be np.int32 as last param to check so Windows will pass.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm confused now: Reading https://numpy.org/doc/stable/reference/arrays.scalars.html it sounds like intc should be the platform independent type. https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases says

Along with their (mostly) C-derived names, the integer, float, and complex data-types are also available using a bit-width convention so that an array of the right size can always be ensured.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1,2,3])
>>> s.dtype
dtype('int64')
>>> s32 = s.astype("int32")
>>> s32.dtype
dtype('int32')
>>> type(s32[0])
<class 'numpy.int32'>
>>> isinstance(s32[0], np.intc)
False

The above is on Windows. When you use s.astype("int32"), then the resulting int is not an instance of np.intc. On Linux and Mac, it is. Not sure I understand why there is a difference, but it is what it is!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored the tests to use pytest.mark.parametrize, because I expect more windows tests to fail, and need to find out which. Also added "<M8[...]" and "<m8[...]" type-codes for timestamp/timedelta.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens on windows if you try astype(np.intc) ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens on windows if you try astype(np.intc) ?

>>> sc = s.astype(np.intc)
>>> sc.dtype
dtype('int32')
>>> type(sc[0])
<class 'numpy.intc'>

So I guess on Windows s.astype("int32") and s.astype(np.intc) store objects of different types inside the series.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 24, 2023

It looks like the Windows tests are failing, with the new parameterize stuff, you can get a complete list of what is failing. You may need to test the O/S to get the values appropriate for Windows.

@randolf-scholz
Copy link
Contributor Author

There is another thing that came up: pandas supports numpys extended type codes:
https://numpy.org/doc/stable/reference/arrays.dtypes.html#string-dtype-note

There are hundreds of possible combinations, and for some data types unlimited number of literal strings that are allowed. For example s.astype("S25") implicitly creates a bytes datatype with 25-bytes. any positive number of integer bytes is valid.

The question is: how many of these should be hard-coded as literals, to avoid false-postives as currently str will not be allowed.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 25, 2023

There is another thing that came up: pandas supports numpys extended type codes: https://numpy.org/doc/stable/reference/arrays.dtypes.html#string-dtype-note

There are hundreds of possible combinations, and for some data types unlimited number of literal strings that are allowed. For example s.astype("S25") implicitly creates a bytes datatype with 25-bytes. any positive number of integer bytes is valid.

The question is: how many of these should be hard-coded as literals, to avoid false-postives as currently str will not be allowed.

I say none. The goal of the pandas stubs is to cover the most common use cases. We try to strike a balance between being too narrow and too wide. IMHO, having str as a valid type for astype() is too wide and reduces the usefulness of the stubs in the most common use cases.

If someone actually uses these more esoteric strings with pandas, then they can open an issue and we can add them one by one.

@randolf-scholz randolf-scholz requested a review from Dr-Irv July 25, 2023 14:05
@randolf-scholz
Copy link
Contributor Author

I found np.sctypeDict which seems like a good list of aliases to support. I added a test to check that all these aliases are tested. The windows stuff is hopefully fixed as well.

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some things were changed inadvertently. There are some tests you are skipping that don't need to be skipped on Windows. Still have some failures in CI for Windows.

There are cases where the results are different for Windows and Linux, so I think you will have to just do a different check call for those few cases based on platform

tests/test_series.py Outdated Show resolved Hide resolved
tests/test_series.py Outdated Show resolved Hide resolved
tests/test_series.py Outdated Show resolved Hide resolved
tests/test_series.py Outdated Show resolved Hide resolved
tests/test_series.py Outdated Show resolved Hide resolved
tests/test_series.py Show resolved Hide resolved
tests/test_series.py Outdated Show resolved Hide resolved
tests/test_series.py Show resolved Hide resolved
@randolf-scholz randolf-scholz requested a review from Dr-Irv July 25, 2023 15:29
tests/test_series.py Outdated Show resolved Hide resolved
tests/test_series.py Outdated Show resolved Hide resolved
@randolf-scholz randolf-scholz requested a review from Dr-Irv July 25, 2023 15:48
@randolf-scholz
Copy link
Contributor Author

Tests pass now

Copy link
Collaborator

@Dr-Irv Dr-Irv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @randolf-scholz . Nice PR

@Dr-Irv Dr-Irv merged commit 490914f into pandas-dev:main Jul 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants