added pyarrow/numpy dtype literals and allowed str | DtypeObj as input for Series.astype #756
For all the possible arguments you added for astype(), can you add an appropriate test in test_series.test_updated_astype()? We were really careful there to make sure the tests covered everything that is accepted as an argument.
pandas-stubs/_typing.pyi (Outdated)
    | type[object]
    | str
I don't think we should accept a general string here. The whole idea of the other aliases is to constrain which strings are acceptable.
str must be acceptable, as e.g. a string might be provided dynamically. The same applies if you accept a generic dtype object, which one must if s.astype(s.dtype) should work without raising a warning.
The correct way to handle this is to be careful with overload order, going from special to generic.
All of the existing tests still work with this change.
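The "special to generic" ordering can be illustrated with a minimal sketch. This is not the actual pandas-stubs code: the function, the `IntName` alias and the list-based return types are hypothetical stand-ins chosen so the example runs without pandas.

```python
from typing import Any, List, Literal, overload

# Hypothetical, heavily trimmed stand-in for one of the literal aliases.
IntName = Literal["int", "int32", "int64"]


@overload
def astype(values: List[Any], dtype: IntName) -> List[int]: ...  # specific first
@overload
def astype(values: List[Any], dtype: str) -> List[Any]: ...  # generic fallback last


def astype(values, dtype):
    """Toy runtime implementation; the overloads only matter to a type checker."""
    if dtype in ("int", "int32", "int64"):
        return [int(v) for v in values]
    if dtype == "str":
        return [str(v) for v in values]
    raise TypeError(f"data type {dtype!r} not understood")


ints = astype(["1", "2"], "int32")  # a checker should pick the first overload
name: str = "str"                   # e.g. a value read from a config file
anys = astype([1, 2], name)         # a checker should fall back to the str overload
```

Because the fallback accepts any `str`, a call such as `astype([1], "foobar")` also type-checks and only fails at runtime, which is the trade-off debated in the rest of this thread.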
It makes the type check too wide. It allows pd.DataFrame().astype("foobar") to pass. I'd rather have the stubs catch the most used cases (not dynamic strings). Having said that, if we want s.astype(s.dtype) to work, maybe we should just say that s.dtype returns AsTypeArg.
If there was a way to specify a non-literal string, that would be ideal. If it sees a literal, the literal should be verified. I am not aware of a way to achieve that with current typing features.
Another note - I think in subjective cases like this, you have to ask what the most common use case is. I would contend that dynamic/runtime-based astype calls are atypical. The majority of the time, it's a static transformation, and the developer is aware of the source dtype and target dtype. In that case, they would hard-code the destination type.
@gandhis1 As an example for dynamic astype(string): I have some data-pipelines for reading csv-files where I store config files with the target schema (e.g. column dtypes).
@gandhis1 @Dr-Irv In Python 3.11 LiteralString was added. I wonder if this can be used to raise a typing error if none of the defined literals is matched? This would give the best of both worlds.
I asked this here: python/typing#1434
Hm, doesn't seem possible currently. Possibly in the future if both Intersection and Not are supported: python/typing#801 (comment)
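A small sketch of why LiteralString on its own does not solve this: it accepts every string literal, including misspelled ones, so it cannot express "any str except an unrecognised literal". The function name below is hypothetical, and the import is guarded so the snippet also runs on Python versions older than 3.11.

```python
from __future__ import annotations  # keep annotations unevaluated at runtime

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from typing import LiteralString  # available from Python 3.11


def astype_literal_only(dtype: LiteralString) -> str:
    """Hypothetical helper that only admits string literals."""
    return dtype


ok = astype_literal_only("int32")      # accepted: a string literal
typo = astype_literal_only("foobar")   # also accepted: still a string literal
```

What the discussion asks for is closer to "str, but not a literal", which would need something like the Intersection and Not constructs mentioned above.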
> @gandhis1 As an example for dynamic astype(string): I have some data-pipelines for reading csv-files where I store config files with the target schema (e.g. column dtypes).

That's an interesting example, but that's a case where the argument you are using is dynamic, and you're trying to check it in a static typing context. If your config files had an incorrect string "foobar" not covered by the stubs, your code would fail at runtime.
IMHO, the goal of the static type checks is to catch things before runtime. So we have two alternatives:
1. Do not include str (as it is right now in the stubs) as a valid argument. The disadvantage is that you can't use a dynamic string as an argument, as in your use case. The advantage is that it covers only the valid strings that are statically declared.
2. Include str as a valid argument. The disadvantage is that we don't help people who use .astype("ing") instead of .astype("int") prior to runtime. The advantage is your use case.
I would argue that people use .astype("some-string") more often than they pass a dynamic string as the argument. That's why I would prefer (1) above.
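Under alternative (1), one way a config-driven pipeline could still pass dynamic strings is to validate them once at the boundary and narrow them to the literal type. The sketch below is only an illustration: `IntDtypeName` and `checked_dtype` are hypothetical names, and the alias is far smaller than the real ones in the stubs.

```python
from typing import Literal, cast, get_args

# Hypothetical, trimmed alias standing in for the stubs' literal unions.
IntDtypeName = Literal["int", "int32", "int64"]


def checked_dtype(name: str) -> IntDtypeName:
    """Validate a runtime string and narrow it to the literal type."""
    allowed = get_args(IntDtypeName)  # ("int", "int32", "int64")
    if name not in allowed:
        raise ValueError(f"unsupported dtype {name!r}; expected one of {allowed}")
    return cast(IntDtypeName, name)


dtype_from_config = checked_dtype("int32")  # passes validation
```

A typo such as "ing" then fails early with a clear message instead of reaching astype.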
@randolf-scholz tests are failing
ObjectDtypeArg: TypeAlias = (
# Builtin object type and its string alias
type[object] # noqa: Y030
| Literal["object"]
# Numpy object type and its string alias
# https://numpy.org/doc/stable/reference/arrays.scalars.html#numpy.object_
| type[np.object_]
| Literal["O"] # NOTE: "object_" not assigned
)
Tests are still failing.
I also think we should add the following test:
if TYPE_CHECKING_INVALID_USAGE:
    s.astype("foobar")  # type: ignore[call-arg] # pyright: ignore[reportGeneralTypeIssues]
This will check that we are not allowing arbitrary strings.
@Dr-Irv I removed the Apart from that, since I added Also as a side question: why do
I'm still seeing
Let's handle in a separate PR. I think if we do that there is a possibility of other errors getting in.
That's a historical artifact. Open to a PR that would change that.
This is because
Tests are failing on Windows because of the int32 type. Still need to remove str from AsTypeArg.
pandas-stubs/_typing.pyi (Outdated)
    | type[object]
    | ObjectDtypeArg
    | DtypeObj
    | str
This str still has to be removed.
tests/test_series.py (Outdated)
    # int32
    check(assert_type(s.astype(np.intc), "pd.Series[int]"), pd.Series, np.intc)
    check(assert_type(s.astype("intc"), "pd.Series[int]"), pd.Series, np.intc)
    check(assert_type(s.astype("int32"), "pd.Series[int]"), pd.Series, np.intc)
I think this should be np.int32 as the last param to check so Windows will pass.
Hm, I'm confused now: Reading https://numpy.org/doc/stable/reference/arrays.scalars.html it sounds like intc should be the platform independent type. https://numpy.org/doc/stable/reference/arrays.scalars.html#sized-aliases says
> Along with their (mostly) C-derived names, the integer, float, and complex data-types are also available using a bit-width convention so that an array of the right size can always be ensured.
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series([1,2,3])
>>> s.dtype
dtype('int64')
>>> s32 = s.astype("int32")
>>> s32.dtype
dtype('int32')
>>> type(s32[0])
<class 'numpy.int32'>
>>> isinstance(s32[0], np.intc)
False
The above is on Windows. When you use s.astype("int32"), then the resulting int is not an instance of np.intc. On Linux and Mac, it is. Not sure I understand why there is a difference, but it is what it is!
I refactored the tests to use pytest.mark.parametrize, because I expect more Windows tests to fail and need to find out which. Also added "<M8[...]" and "<m8[...]" type codes for timestamp/timedelta.
What happens on Windows if you try astype(np.intc)?
> What happens on Windows if you try astype(np.intc)?
>>> sc = s.astype(np.intc)
>>> sc.dtype
dtype('int32')
>>> type(sc[0])
<class 'numpy.intc'>
So I guess on Windows s.astype("int32") and s.astype(np.intc) store objects of different types inside the series.
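One possible explanation, which this thread does not confirm, is that np.intc corresponds to the C int type while fixed-width names such as int32 are aliases for whichever C-named type has that width. On 64-bit Windows both C int and C long are typically 32 bits wide, so there are two distinct 32-bit candidates; on 64-bit Linux and macOS, long is typically 64 bits. The standard-library sketch below only reports the C integer widths on the current platform and makes no claim about how NumPy picks its aliases.

```python
import ctypes
import sys

# Width in bits of the C integer types on this platform.
sizes = {
    "int": ctypes.sizeof(ctypes.c_int) * 8,
    "long": ctypes.sizeof(ctypes.c_long) * 8,
    "long long": ctypes.sizeof(ctypes.c_longlong) * 8,
}

# True where C int and C long share a width (typical for Windows).
int_and_long_same_width = sizes["int"] == sizes["long"]

print(sys.platform, sizes, int_and_long_same_width)
```

If the assumption holds, a platform where `int_and_long_same_width` is true is one where the expected scalar type in the tests may need to differ.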
It looks like the Windows tests are failing. With the new parametrize setup, you can get a complete list of what is failing. You may need to test the OS to get the values appropriate for Windows.
There is another thing that came up: pandas supports NumPy's extended type codes: There are hundreds of possible combinations, and for some data types an unlimited number of literal strings are allowed. For example The question is: how many of these should be hard-coded as literals, to avoid false positives as currently
I say none. The goal of the pandas stubs is to cover the most common use cases. We try to strike a balance between being too narrow and too wide. IMHO, having If someone actually uses these more esoteric strings with pandas, then they can open an issue and we can add them one by one.
I found |
Some things were changed inadvertently. There are some tests you are skipping that don't need to be skipped on Windows. Still have some failures in CI for Windows.
There are cases where the results are different for Windows and Linux, so I think you will have to just do a different check call for those few cases based on platform.
Tests pass now
Thanks @randolf-scholz . Nice PR
- Closes error: Argument 1 to "astype" of "DataFrame" has incompatible type "Literal['int32[pyarrow]']"; #733
- Closes error: Argument 1 to "astype" of "Series" has incompatible type "dtype[generic] | ExtensionDtype"; expected "type[object] | ExtensionDtype" [arg-type] #747
- Tests added: Please use assert_type() to assert the type of any return value

- pandas nullable UInt data types were missing
- numpy type code literals
- numpy alternative literals (half, short, double, etc.)
- pyarrow literals
- added DtypeObj | _str to final astype overload (resulting in Series[Any])
- added few tests
- fixed random bug: def __getattr__(self, name: str) -> S1 accidentally using str instead of _str.