Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TYP: remove NaTType as possible result of Timestamp and Timedelta constructor #46171

Merged
merged 20 commits into from
Mar 8, 2022

Conversation

Dr-Irv
Copy link
Contributor

@Dr-Irv Dr-Irv commented Feb 27, 2022

While working on the Microsoft type stubs, I discovered this. With the current code, if you run pyright on the following with default settings for basic type checking (not using what we have in pyproject.toml):

import pandas as pd
print("version ", pd.__version__)
ts:pd.Timestamp =  pd.Timestamp("2022-02-27")
print(ts)

you get this:

c:\Code\pandas_dev\pandas\tstest.py
  c:\Code\pandas_dev\pandas\tstest.py:3:20 - error: Expression of type "Timestamp | NaTType" cannot be assigned to declared type "Timestamp"
    Type "Timestamp | NaTType" cannot be assigned to type "Timestamp"
      "NaTType" is incompatible with "Timestamp" (reportGeneralTypeIssues)

mypy doesn't pick this up. The real issue here is that you generally expect a Timestamp constructor to return a Timestamp, so by returning the Union, we are then forcing end users to cast the result of the constructor, which is pretty inconvenient.

So even though the constructor could return a NaTType, we shouldn't force users to code for that due to static type checking.

This PR just takes out the NaTType as a return type for the constructors.

@Dr-Irv Dr-Irv added Timedelta Timedelta data type Timestamp pd.Timestamp and associated methods Typing type annotations, mypy/pyright type checking labels Feb 27, 2022
@jbrockmendel
Copy link
Member

mypy doesn't pick this up

Aren't the "type: ignore[misc]"s that this removes there specifically because mypy does pick it up?

So even though the constructor could return a NaTType, we shouldn't force users to code for that due to static type checking.

How big a deal is this for users? I'm uncomfortable with intentionally making the annotations less accurate.

The ideal solution is upstream in the type checkers accurately reflecting the capabilites of __new__, right?

cc @simonjayhawkins

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 27, 2022

How big a deal is this for users? I'm uncomfortable with intentionally making the annotations less accurate.

IMHO, this is a big deal. Otherwise, the code above would have to be written as:

from typing import cast
import pandas as pd
print("version ", pd.__version__)
ts:pd.Timestamp =  cast(pd.Timestamp, pd.Timestamp("2022-02-27"))
print(ts)

There's a tradeoff here to consider between how much the static type checker picks up versus what errors would get picked up in runtime.

Right now, pyright integration with VS Code is so good that just using it with defaults picks up lots of potential code errors, and we don't want to ask users to code around it.

The ideal solution is upstream in the type checkers accurately reflecting the capabilites of __new__, right?

Actually, pyright does pick up the __new__ part correctly, which is why it sees the return type of the constructor as Timestamp | NaTType, which then forces the end user to do a cast. mypy on the other hand does the reverse - it doesn't see that __new__() is returning the Union type and then the user doesn't have to change their code in my example.

@twoertwein
Copy link
Member

twoertwein commented Feb 28, 2022

How about using overloads to either return Timestamp or NaTType? Probably cannot catch all cases when NaTType might be returned but at least the obvious ones.

Edit:

Or can NaTType inherit from both Timedelta and Timestamp?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 28, 2022

I have another solution that seems to be working locally. Idea is as follows:

  1. Change Timestamp to _Timestamp and Timedelta to _Timedelta
  2. Have Timestamp inherit from both _Timestamp and NaTType
  3. Have Timedelta inherit from both _Timedelta and NaTType
  4. For the new Timestamp and Timedelta, define methods appropriately via @overload that do the right thing based on whether the type of the class is _Timestamp/_Timedelta or NaTType

I can push that if people want to take a look, but it's a more substantive change.

Alternatively, I could create a second PR with that approach. Comments welcome.

@jbrockmendel
Copy link
Member

Have Timedelta inherit from both _Timedelta and NaTType

Would need to do some more gymnastics to avoid having Timedelta inherit from datetime.

-0.5 on the original. Making our annotations less accurate for the sake of allowing users to make their annotations less accurate just seems counter-productive.

#24983 might be a better long-term solution.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 28, 2022

Have Timedelta inherit from both _Timedelta and NaTType

Would need to do some more gymnastics to avoid having Timedelta inherit from datetime.

-0.5 on the original. Making our annotations less accurate for the sake of allowing users to make their annotations less accurate just seems counter-productive.

#24983 might be a better long-term solution.

So I have something that works - it only affects typing, and essentially is treating NaTType as a mixin from a typing perspective.

Question is whether I put that up as a separate PR or push it here.

@twoertwein
Copy link
Member

I cannot comment on whether the type annotations should purposefully be different from the implementation. How do the microsoft/virtus lab stubs handle it?

I understand that from an end-user perspective, the current situation is as bad as it is with read_csv (DataFrame or TextFileReader). Except that for read_csv we can more easily create overloads.

Since mypy does anyways not recognize that __new__ can return a Union, I don't see too much value in keeping the Union. At the same time, I still want to keep the Union since it can return different types (and pyright picks it up).

If Timestamp/Timedelta were very simply (take only one argument), I wouldn't mind the following as a middle ground:

# cases when it is guanteed to be NaTType
# Literal[float("NaN")] is probably not valid
def __new__(value: None | Literal[float("NaN")] | ...) -> NaTType: ...

# might be NaTType, but in most cases is Timestamp
def __new__(value: BaseOffset | ...) -> Timestamp: ...

# in case the argument is a union of the above
def __new__(value: BaseOffset | ... | None | ...) -> Timestamp | NaTType: ...

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 28, 2022

How do the microsoft/virtus lab stubs handle it?

Well, I'm working on the microsoft ones now. I copied over what was in pandas, and discovered this issue, and have submitted a PR there to do what I propose here.

As for Virtus Labs, their stubs do not indicate NATType as a valid result of the constructor of Timestamp or Timedelta .

If Timestamp/Timedelta were very simply (take only one argument), I wouldn't mind the following as a middle ground:

We could do something similar with the current __new__() stubs.

Here's a question - I think the only way pd.NaT could be returned as a result of calling Timestamp or Timedelta is if you passed something like None, np.nan, etc. Other overflows, improper strings, etc. raise an exception. Can someone confirm? So maybe your solution - to type it differently based on the argument is a way to go.

@jbrockmendel
Copy link
Member

I think the only way pd.NaT could be returned as a result of calling Timestamp or Timedelta is if you passed something like None, np.nan, etc. Other overflows, improper strings, etc. raise an exception. Can someone confirm?

A near-complete list:

None, np.nan, NaT, np.datetime64("NaT"), "NaT", "nat", "NAT", "nan", "NaN", "NAN", NaT.value

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Feb 28, 2022

If Timestamp/Timedelta were very simply (take only one argument), I wouldn't mind the following as a middle ground:

# cases when it is guanteed to be NaTType
# Literal[float("NaN")] is probably not valid
def __new__(value: None | Literal[float("NaN")] | ...) -> NaTType: ...

# might be NaTType, but in most cases is Timestamp
def __new__(value: BaseOffset | ...) -> Timestamp: ...

I played around with this, and I don't think this is possible. You can't have different __new__ signatures return different types and make mypy work. The type recognition system in mypy won't return two different types based on the constructor. It's still a constructor for the primary type.

I think to make things good for the users, we have to go with what is in this PR.

Side note: I tried something like this, and pyright gets it right, but mypy doesn't (think of A as _Timestamp, B as NaTType, and C as Timestamp :

from __future__ import annotations
from typing import overload


class A:
    pass


class B:
    pass


class C(A, B):
    @overload
    def __new__(cls, x: int) -> A:
        ...

    @overload
    def __new__(cls, x: str) -> B:
        ...

    def __new__(cls, x: int | str) -> C | A | B:
        if isinstance(x, int):
            return super(A, cls).__new__(cls)
        elif isinstance(x, str):
            return super(B, cls).__new__(cls)
        else:
            raise Exception("bad argument")


y = C(2)
z = C("hey")

reveal_type(y)
reveal_type(z)

pyright (with defaults) reports:

new2.py:34:13 - information: Type of "y" is "A"
new2.py:35:13 - information: Type of "z" is "B"

mypy (with defaults) reports:

new2.py:15: error: Incompatible return type for "__new__" (returns "A", but must return a subtype of "C")
new2.py:19: error: Incompatible return type for "__new__" (returns "B", but must return a subtype of "C")
new2.py:22: error: "__new__" must return a class instance (got "Union[C, A, B]")
new2.py:34: note: Revealed type is "new2.C"
new2.py:35: note: Revealed type is "new2.C"
Found 3 errors in 1 file (checked 1 source file)

@twoertwein
Copy link
Member

twoertwein commented Mar 1, 2022

While I would like to avoid mypy ignores comments, I think your approach is working well:

  • The current implementation already has a mypy ignore comment to account for the missing functionality of mypy. Would just need to add one for each overload.
  • It still works perfectly for pyright :)
  • People using mypy will not notice any change
  • People using pyright get the "most likely" type based on the overloads
  • edit: and as long as mypy doesn't add this functionality, it does not degrade type checking for the pandas CI

From my perspective it looks good. The question is what do we prefer: type annotations being easy to use ("most likely" overloads) or being as strict as possible (always returning the union). Hopefully, the implementation will change in the future so that we don't need to make these trade-offs.

@@ -56,7 +54,7 @@ class Timestamp(datetime):
tzinfo: _tzinfo | None = ...,
*,
fold: int | None = ...,
) -> _DatetimeT | NaTType: ...
) -> _DatetimeT: ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a comment (maybe ref this issue) about the reasoning here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in next commit

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls rebase as well

@@ -1767,16 +1767,13 @@ def _format_datetime64_dateonly(
nat_rep: str = "NaT",
date_format: str | None = None,
) -> str:
if x is NaT:
if x is NaT or isinstance(x, NaTType):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is NaTType leaking here? (is this is a typing issue? e.g. maybe cast?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to just do if isinstance(x, NaTType) here - the typing makes it clear. Will be in next commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checking for both NaT and NaTType is redundant. the existing check should be marginally more performant than an isintance check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue here is that x is of type NaTType | Timestamp, so if you just do x is NaT then mypy can't conclude that x is Timestamp when the condition is False, but if you do isinstance(x, NaTType), it can make that conclusion when the condition is False.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im aware. i dont think that justifies making the check redundant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's why I just changed it to if isinstance(x, NaTType) in e79d253

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as commented elsewhere, -1 on this change.

@@ -139,7 +140,7 @@
"""


def _parse_date(datestr: str) -> datetime:
def _parse_date(datestr: str) -> datetime | NaTType:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we have a type that is a datetime scalar | NatType already?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably not bc ATM NaTType subclasses datetime

@@ -1286,7 +1291,7 @@ def test_resample_consistency():
tm.assert_series_equal(s10_2, rl)


dates1 = [
dates1: List[Union[datetime, NaTType]] = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh this, should be a first class type (the Union)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create a type in next commit

@jreback jreback added this to the 1.5 milestone Mar 7, 2022
@jreback
Copy link
Contributor

jreback commented Mar 7, 2022

@jbrockmendel ok here?

@jbrockmendel
Copy link
Member

I remain -0.5, not gonna make a fuss about it.

@jreback
Copy link
Contributor

jreback commented Mar 8, 2022

I remain -0.5, not gonna make a fuss about it.

thanks! point noted.

@jreback jreback merged commit 622c92d into pandas-dev:main Mar 8, 2022
@twoertwein
Copy link
Member

Given the current two-tier plan for typing (copy public API from MS stubs and then potentially later merging them with the inline annotations)
image
it might have been cleaner to have kept NaTType as part of the inline annotations, as they fulfil a slightly different purpose compared to the public type stubs.

[Personally, I hope that once the MS stubs are copied over, it doesn't take too long to merge a majority of them - might be optimistic.]

@@ -1767,16 +1767,13 @@ def _format_datetime64_dateonly(
nat_rep: str = "NaT",
date_format: str | None = None,
) -> str:
if x is NaT:
if isinstance(x, NaTType):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 on code changes to satisfy the type checker. We know that static type checker have issues with singletons and imo that was adequately noted in the removed comment adjacent to the ignore statement.

It may be nbd is some cases, but am always -1 on principle.

) -> _DatetimeT: ...
# GH 46171
# While Timestamp can return pd.NaT, having the constructor return
# a Union with NaTType makes things awkward for users of pandas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be inclined to return Any in the public api if there are genuine concerns rather that have the incorrect return type.

@simonjayhawkins
Copy link
Member

So even though the constructor could return a NaTType, we shouldn't force users to code for that due to static type checking.

I disagree, users type checking their code could be doing it to ensure more robust code, which is a different requirement to intellisense and other linting productivity tools.

The types should be precise (or less precise such as Any if we have issues to resolve) but not wrong.

@simonjayhawkins
Copy link
Member

given @twoertwein and @jbrockmendel comments, I think this one should be reverted.

@simonjayhawkins
Copy link
Member

From my perspective it looks good. The question is what do we prefer: type annotations being easy to use ("most likely" overloads) or being as strict as possible (always returning the union). Hopefully, the implementation will change in the future so that we don't need to make these trade-offs.

are you aware of other projects that use a "most likely" approach?

@twoertwein
Copy link
Member

are you aware of other projects that use a "most likely" approach?

Unfortunately, no. But I would be happy to make a compromise to avoid having two different type annotations (internal usage vs. public usage). While I'm not a fan of two type annotations, I agree that we should start with the MS stubs to have a smooth transition from a user-perspective.

@simonjayhawkins
Copy link
Member

I agree that we should start with the MS stubs to have a smooth transition from a user-perspective.

yep. a reason that i'm not too actively involved in typing issues at the moment. letting it play out for now.

So if this PR is prep for that then OK, but I would like to also see the impact on our current inline types and code changes to be kept to a minimum.

@burnpanck
Copy link
Contributor

As I noted in the linked issue #24983, I believe removing that type annotation is a bad idea. In particular, the fact that NaTType is not a subclass of Timestamp and/or Timedelta is not just an inconvenience for the static typing of user code, but also a risk; whenever a NaTType is being spit out, user code expecting either a Timedelta or Timestamp might break. Thus, the perceived inconvenience to the user is actually just the type checker doing its job: Alerting users from potential dangers through wrong assumptions. Sweeping that alert under the rug defeats the purpose of type annotations.

@jbrockmendel
Copy link
Member

@mroeschke should we consider reverting for 1.5?

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Aug 16, 2022

@mroeschke should we consider reverting for 1.5?

I wouldn't revert the whole PR. You could just revert the places where __new__() used to return Timestamp | NaT and now returns Timestamp due to this PR.

pandas-stubs always returns Timestamp because otherwise you'd always have to cast the result when using the constructor.

Having said that, it would be far better if NaT was subclassed in some way.

@mroeschke
Copy link
Member

It would be good to have the more correct annotation back for 1.5.

I am not fully caught up what partial changes are needed to restore the old annotation, so if someone could provide a PR to correct that, it would be appreciated. Otherwise, if I am not able to get to it/figure it out I'll just fully revert for 1.5.

@mroeschke mroeschke mentioned this pull request Aug 16, 2022
@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Aug 16, 2022

It would be good to have the more correct annotation back for 1.5.

I am not fully caught up what partial changes are needed to restore the old annotation, so if someone could provide a PR to correct that, it would be appreciated. Otherwise, if I am not able to get to it/figure it out I'll just fully revert for 1.5.

I will try to revert just that part.

@Dr-Irv
Copy link
Contributor Author

Dr-Irv commented Aug 16, 2022

A couple of notes:

  1. The report here that the types were inaccurate was correct, but could not have been discovered without using development code, because the change to have typing not return NaTType was not released in pandas 1.4.x
  2. Since doing this PR, we decided to have a separate pandas-stubs repo for public stubs, so the discussion above about type correctness for the public stubs belongs there (and will likely be hotly debated, if the debate ever starts)

I created a PR #48112 to revert just this typing change in the constructors, but if people are using pandas-stubs, the issue would still persist there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Timedelta Timedelta data type Timestamp pd.Timestamp and associated methods Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants