Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add more accurate typing for DbApiHook.run method #31846

Merged
merged 14 commits into from
Jul 18, 2023

Conversation

dwreeves
Copy link
Contributor

@dwreeves dwreeves commented Jun 11, 2023

Overview

The DbApiHook.run method has incredibly complex typing, which is currently reduced to just the following:

    def run(
        self,
        sql: str | Iterable[str],
        autocommit: bool = False,
        parameters: Iterable | Mapping | None = None,
        handler: Callable | None = None,
        split_statements: bool = True,
        return_last: bool = True,
    ) -> Any | list[Any] | None:

This is incomplete typing for a few reasons:

  • Some of the bools in the kwargs determine the return type.
  • Whether the handler is None determines whether the return type is None.
  • The return type of the handler is generic and determines the return type of the run method.

Addressing the first bullet point would be a bit of work, and would add a lot of complicated typing to each provider hook.

But I figured addressing the 2nd and 3rd bullet points was easy and takes just 2 overloads. This alone helps vastly improve the typing of of the DbApiHook.run method with a minimal amount of added annotation complexity.

Misc.

  • I do not anticipate that this should cause any backwards compatibility concerns except for users who meet all the following criteria:
    • (1) they are relying on the DbiApiHook directly, (2) they are using mypy or another type checker, and (3) they had type checking issues that were not being properly flagged before due to the less precise typing.
    • This may also cause users' IDEs to get upset at them if they have a function that looks like this as a handler, even though it would technically work. This is probably fine to flag to users, even if it works, since the only reason it works is by ignoring the kwargs entirely.
    # This works as a handler because `handler(cursor)` is a valid input, and `foo` is ignored.
    def handler(cursor, foo=None):
  • I did not open an issue because the issues template for feature requests says:

    Features should be small improvements that do not dramatically change Airflow assumptions. Note, that in this case you do not even need to create an issue if you have a code change ready to submit! You can open Pull Request immediately instead.

@dwreeves dwreeves force-pushed the improve-typing-for-dbapihook branch from 1331739 to 985d29b Compare June 11, 2023 17:47
@dwreeves dwreeves force-pushed the improve-typing-for-dbapihook branch from aa86823 to a20554a Compare June 12, 2023 18:16
Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts and a question

sql: str | Iterable[str],
autocommit: bool = False,
parameters: Iterable | Mapping | None = None,
handler: Callable[[Any], T] = None, # type: ignore[assignment]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
handler: Callable[[Any], T] = None, # type: ignore[assignment]
handler: Callable[[Any], T] | None = None,

I don’t think handler is required for this signature…?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused and I'm curious what you mean.

For the Databricks hook's run, we still have in the method these lines of code:

        if handler is None:
            return None

So, it seems it should still be overloaded. And the # type: ignore[assignment] is needed if we assign defaults to the overloads instead of ....

What am I missing?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is it the other way around, for the T | list[T] variant, handler must be set to a callable right? I think in that case I’d perfer we simply remove the argument default from the overload signature altogether. Maybe also make it keyword-only (not technically perfect but practically that’s how the argument should be specified anyway).

Copy link
Contributor Author

@dwreeves dwreeves Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So is it the other way around, for the T | list[T] variant, handler must be set to a callable right?

Yes.

I think in that case I’d perfer we simply remove the argument default from the overload signature altogether. Maybe also make it keyword-only (not technically perfect but practically that’s how the argument should be specified anyway).

The problem with doing that is Mypy doesn't allow for it. The mypy-providers returns 18 errors if I do that:

Found 18 errors in 16 files (checked 993 source files)

Here is what I replaced each run typing with (+ Snowflake one has extra kwarg).

    @overload
    def run(
        self,
        sql: str | Iterable[str],
        *,
        autocommit: bool,
        parameters: Iterable | Mapping[str, Any] | None,
        handler: None,
        split_statements: bool,
        return_last: bool,
    ) -> None:
        ...

    @overload
    def run(
        self,
        sql: str | Iterable[str],
        *,
        autocommit: bool,
        parameters: Iterable | Mapping[str, Any] | None,
        handler: Callable[[Any], T],
        split_statements: bool,
        return_last: bool,
    ) -> T | list[T]:
        ...

    def run(
        self,
        sql: str | Iterable[str],
        autocommit: bool = False,
        parameters: Iterable | Mapping[str, Any] | None = None,
        handler: Callable[[Any], T] | None = None,
        split_statements: bool = False,
        return_last: bool = True,
    ) -> T | list[T] | None:

Similarly, Leaving as positional args (i.e. excluding the *) also returns 18 errors.

I believe the only way to avoid Mypy errors are the following 2 options:

  1. To do what I did originally, and set = ... as the kwargs. Although I could be wrong, I do not believe this breaks any IDEs or static type checkers. And Mypy is cool with it, and it doesn't involve using # type: ignore.
  2. Do what I have in the current PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. PyCharm will use ... here for the overloaded type. I don't see that as wrong per se, but I suppose it depends on what user experience you want.

image image

Comment on lines 164 to 165
return_last: bool = True,
) -> T | list[T]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be further split into Literal[True] returning T and Literal[False] returning list[T]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're correct. I decided initially that I wanted to take baby steps with this PR, but we can go all the way if you'd like!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do whatever you feel comfortable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm gonna be real. I tried breaking out T and list[T] return types based on Literals but I couldn't get Mypy to agree with me that there were no conflicts. I will just leave that to someone else in the future who is more knowledgeable than I am about Mypy. 😅

airflow/providers/databricks/hooks/databricks_sql.py Outdated Show resolved Hide resolved
@dwreeves
Copy link
Contributor Author

The scope of the PR increased a little bit here. Hope that isn't too bad.

  • I replaced all parameters: Iterable | Mapping | Nones with parameters: Iterable | Mapping[str, Any] | Nones.
  • I also type-annotated most (but not all-- some I wasn't sure whether they hooked into a SQLAlchemy connection) parameters kwargs that were un-annotated that I could find.
  • I removed two superfluous run() methods for DbApiHook subclasses. These were just doing return super().run(...) with no other changes. I could not find any reason for these methods not just default to the subclassed version. I am aware that provider package version conflicts can sometimes necessitate doing things like this, but that didn't seem to make sense here. (Let me know if I'm missing something.) The last major changes to the run() methods occur in this PR: df00436 and from what I can tell:
    • The Trino hook used to be a lot of custom code that was replaced with return super().run(...).
    • The Presto hook used to have a super().run() to warn users about a deprecated kwarg (hql) that no longer exists.

@dwreeves dwreeves requested a review from uranusjr June 13, 2023 17:10
def run(
self,
sql: str | Iterable[str],
autocommit: bool = False,
parameters: Iterable | Mapping | None = None,
handler: Callable | None = None,
parameters: Iterable | Mapping[str, Any] | None = None,
Copy link
Contributor Author

@dwreeves dwreeves Jun 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the one package where I'm unsure... 🤔 I'm actually thinking, the old annotation was incorrect and it never allowed for Iterable types:

  • Note that the original annotation was Iterable | Mapping | None. So, Airflow has said for a while that it allows for an iterable.
  • pyexasol isn't using a SQLAlchemy connection object, as far as I can tell.
  • When I look at pyexasol's documentation, and I look at the API https://github.com/exasol/pyexasol/blob/master/docs/REFERENCE.md I don't see any ability to take in a non-mapping iterable type as a valid input. The reason the other connections take in an Iterable is because they all wrap SQLAlchemy. But if you look carefully, you'll notice pyexasol is completely foregoing SQLAlchemy.

It's possible the safest thing to do here would be to not touch this file's parameters type annotations at all, and just leave as Iterable | Mapping | None for the run() method + don't touch anything else.

That said, we may as well not let this little bit of digging go to waste. Someone who is knowledgeable on pyexasol should confirm whether I am correct that the Iterable type should not be here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Annotations in providers can be inaccurate since not all of them are very often used. I’d say we can just change the annotation and see if anyone complains.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no, mypy yelled at me for violating the Liskov substitution principle when I swap to Mapping[str, Any] | None, of course... 🤦 Ahh. I'm just going to keep as-is. I don't want to mess around too much with this. I hope that is OK.

@dwreeves
Copy link
Contributor Author

dwreeves commented Jun 18, 2023

@uranusjr Coming back to / checking in on this.


I think the main outstanding thing with this PR is in regards of what to do with default arguments in overloads.

  • (1) Excluding the defaults entirely (something you suggested) causes MyPy to fail due to incompatible typing.
  • The current approach includes (2) overloaded defaults that match the actual defaults, as per your initial request. It requires adding some # type: ignores though.
  • Another approach that MyPy will pass is to use (3) ellipses in overloaded defaults. This does not require any # type: ignores.

So right now we're on 2, originally I started with 3, and the latest communication I received was to switch to 1 but this causes MyPy to fail without extensive use of # type: ignores.

Due to 1 not really working, I think we should stick with either 2 or 3. Let me know which of them you'd like.


An updated summary of everything else would be:

  • I'm not sure if I can add additional overloads. I could not figure out how to overload defaults with the Literal[True]s without MyPy describing conflicts. I imagine this is my fault for not understanding something. In any case, I think the currently proposed changes are a vast improvement over the status quo, and perhaps I or someone else can figure out the additional typing at a later date.
  • I also removed some superfluous .run() methods in a few provider packages. I did this to keep the code changes simpler overall, as my would have necessitated adding a ton of overloads to those files. Looking at the code and the code history of the provider package releases (for trino, presto, and common-sql), I do not foresee any problems with this change (also, the tests pass). But it's not literally zero risk either.

And that's about it.

Let me know what you want to do. 😄

Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is about as good as we can get. This kind of one or many depending on flag kind of thing is by design difficult to type (pro tip: don’t design API like this).

@uranusjr
Copy link
Member

(Forgot to respond to the comment)

default arguments in overloads

Using ... is probably the best we can do, and slightly better than ignore comments. Not perfect, but unless we explode the overloads (cross-product with and without argument variants) it’s the best we can do.

@dwreeves
Copy link
Contributor Author

@uranusjr should be all set.

@dwreeves dwreeves force-pushed the improve-typing-for-dbapihook branch from 6a77dae to d03e1bc Compare June 20, 2023 01:05
@dwreeves
Copy link
Contributor Author

dwreeves commented Jul 3, 2023

Fixed merge conflicts and merged changes that have passing CI.

Copy link
Contributor

@dstandish dstandish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think i saw anything that would change behavior so looks ok to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants