Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include a static type checker? #514

Closed
XzzX opened this issue Nov 28, 2024 · 21 comments · Fixed by #533
Closed

Include a static type checker? #514

XzzX opened this issue Nov 28, 2024 · 21 comments · Fixed by #533
Labels
enhancement New feature or request

Comments

@XzzX
Copy link
Contributor

XzzX commented Nov 28, 2024

Since pyiron_workflow uses type hints we should include a static type checker like (https://mypy.readthedocs.io/en/stable/index.html)[mypy] in the CI.

@XzzX XzzX added the enhancement New feature or request label Nov 28, 2024
@liamhuber
Copy link
Member

This is exclusively for static typing though, in which case we still need our existing type checking? Then for me including it vs recommending it is just a matter of how big the extra set of dependencies is. I didn't look at the whole tree, but the dependency list for the main package is non-trivial, so I'm neither closed to this nor immediately convinced.

@XzzX
Copy link
Contributor Author

XzzX commented Nov 28, 2024

This is for our code base of pyiron_workflow not the the workflow graphs. For the workflow graphs we have to do it ourselves. The dependencies do not really matter as it is only used in the CI during development.

@XzzX
Copy link
Contributor Author

XzzX commented Nov 28, 2024

For example:
pyiron_workflow/nodes/composite.py:390: error: Item "str" of "Node | str" has no attribute "inputs" [union-attr]
It tells you, if the type is str, the statement xxx.inputs will fail, because str does not have an attribute inputs.

@liamhuber
Copy link
Member

Not a great example -- this is a false positive. Check right above this in line 355.

@XzzX
Copy link
Contributor Author

XzzX commented Nov 28, 2024

No, it is not. It is a static type checker. You will get runtime type checking by the python interpreter.

The question is: Do you want the type hints to be a auto-completion helper for the user/IDE or a typing system?

@liamhuber
Copy link
Member

No, it is not. It is a static type checker. You will get runtime type checking by the python interpreter.

Ah, sorry, I had misunderstood the strictness of "static". It's not merely that it doesn't guarantee runtime behaviour, but it isn't parsing any of the transformation operations. Totally fair.

The question is: Do you want the type hints to be a auto-completion helper for the user/IDE or a typing system?

I mean, yes, of course, these are obviously pros. My only question is at what cost of cons? For instance,

def replace_child(
    self, owned_node: Node | str, replacement: Node | type[Node]
) -> Node:
    if isinstance(owned_node, str):
        owned_node = self.children[owned_node]
    ... # Proceed as though `owned_node` is a `Node` _in the scope of this function_

Is perfectly readable. Perhaps it's lack of imagination on my part, but if I am understanding correctly then to get the static linter happy with this would require something like

def _make_sure_its_a_node(self, node: Node | str) -> Node:
    if isinstance(node, str):
        node = self.children[node]
    return node

def replace_child(
    self, owned_node: Node | str, replacement: Node | type[Node]
) -> Node:
    self._replace_child(
        self._make_sure_its_a_node(owned_node),
        replacement
    )

def _replace_child(
    self, owned_node: Node, replacement: Node | type[Node]
) -> Node:
    ...

And in general this is the pattern whenever we have type overloading for user convenience. I'm just not convinced that the pros of maintaining IDE hints inside the scope of the function is worth the con of extra misdirection.

With that said, mypy (or similar) is still a nice idea. It may highlight places where we can get the static typing benefits at zero additional cost. I'm just not currently convinced it's something we should be strictly enforcing, i.e. I don't want to see the CI having ❌ because mypy complains.

@niklassiemer
Copy link
Member

Fair enough, I would also not want to make the code essentially complexer by this. Does mypy allow/parse special comments like # mypy var:type to make it aware of a overloaded variable converted to a specific type internally? Then it could stay the same code :)

@niklassiemer
Copy link
Member

I just had a quick look at the documentation
There is # type: ignore to ignore an error from a specific line.
Otherwise one can use type-narrowing-expressions

If I understand this correctly both of the following snippets should work fine:

def replace_child(
    self, owned_node: Node | str, replacement: Node | type[Node]
) -> Node:
    if isinstance(owned_node, str):
        owned_node = self.children[owned_node]
    assert isinstance(owned_node, Node)
    ... # Proceed as though `owned_node` is a `Node` _in the scope of this function_
def replace_child(
    self, owned_node: Node | str, replacement: Node | type[Node]
) -> Node:
    if isinstance(owned_node, str):
        owned_node_used: Node = self.children[owned_node]
    else:
        owned_node_used: Node = owned_node
    ... # Proceed as though `owned_node_used` is a `Node` _in the scope of this function_

@liamhuber
Copy link
Member

Nice! I find the second one a little clunky, but the assert version is pretty slick

@XzzX
Copy link
Contributor Author

XzzX commented Nov 29, 2024

Frankly speaking, I am not too familiar with mypy and what possibilities it offers. I have to read the documentation and see. I just know large projects that want to have strong typing use it.

Generally, I think everything not enforced and automatically checked is not existent. It might work for the first year and first few thousand lines of code, but in the long run it will get unreliable. It is a decision of general principle. Do we want to put the effort in and have the advantage of a strongly typed language, or are we ok with hints that will be correct most of the time but might also be incorrect.

Since my background is in strongly typed languages I favour this approach. :)

@XzzX
Copy link
Contributor Author

XzzX commented Nov 29, 2024

I should have spent more time in preparing the example. In fact, mypy is smarter than I thought and takes into account runtime type narrowing, like @niklassiemer already pointed out. The following code is correct and does not trigger any warnings:

from typing import List

def func(a: List[int]|int):
    if isinstance(a, int):
        a = [a]
    a.append(5)

@XzzX
Copy link
Contributor Author

XzzX commented Nov 29, 2024

Not a great example -- this is a false positive. Check right above this in line 355.

This is a type violation.

        if isinstance(owned_node, str):
            owned_node = self.children[owned_node]

returns Semantic not Node. And Semantic is not a Node. But owned_node: Node | str.
def children(self) -> bidict[str:Semantic]:

The real problem is:
pyiron_workflow/mixin/semantics.py:191: error: "bidict" expects 2 type arguments, but 1 given [type-arg]
pyiron_workflow/mixin/semantics.py:191: error: Invalid type comment or annotation [valid-type]
bidict[str:Semantic] is an incorrect type bidict[str, Semantic] would be correct. And then, the assignment is flagged as incorrect.

I think this is a very good example illustrating my previous point of hints being only hints that cannot be trusted if they are not checked and enforced.

@liamhuber
Copy link
Member

The real problem is: ...

Indeed, that's a typo and mypy would catch it. I think the more important problem here is rather:

And Semantic is not a Node

But then, I think the solution from @niklassiemer with assert isinstance(owned_node, Node) is the best way to go. Because as is, Composite.children is directly inherited from SemanticParent.children where it is hinted as Semantic. Since Node is Semantic, the type hint is not wrong, but Composite narrows this in practice without explicitly re-defining the hint for .children. It then comes down to this:

Generally, I think everything not enforced and automatically checked is not existent. It might work for the first year and first few thousand lines of code, but in the long run it will get unreliable. It is a decision of general principle. Do we want to put the effort in and have the advantage of a strongly typed language, or are we ok with hints that will be correct most of the time but might also be incorrect.

I totally agree that if it's not in the CI it's not reliably there. But honestly it's not reliably there to begin with -- a decent amount of stuff has no type hint to start with. In some places, like adding an assert(thing, isinstance(SomeType), I find the fix very palatable. In other places, e.g. in Composite(SemanticParent).children where we narrow the scope in SemanticParent.children from Sematic to Node(Semantic), it seems to me like we probably need to have a bunch of extra boiler plate like

@property
def Composite.children(self) -> bidict[str, Node]:
    return super().children

And honestly I'm just not sure I want to commit to imposing a bunch of boiler plate on the python code. I definitely see the appeal and benefit of having the types be testably reliable, but, oof, I have flat zero desire to go around writing code like the snippet immediately above. Also, like you, I have very little (none in my case) experience with mypy, so I don't feel I have a super clear picture of exactly how steep the price to pay will be, and thus am not ready to come down super hard on either side here.

@liamhuber
Copy link
Member

liamhuber commented Nov 30, 2024

Letting it percolate more, and with the examples from @niklassiemer suggesting to us that mypy may let us accomplish it without making the code insanely ugly, I'm moderately in favour of the change (as long as we can continue to find reasonably aesthetic routes accomplish it).

@XzzX, in addition to ruff, would you be interested in getting a mypy suite running in pyiron/actions behind a flag? We can turn it on in a PR here and discuss the results. It is also probably easier to use from the get-go, so I imagine the impact is even better in greenfield projects. (It doesn't have to be you that does this, I'm just picking on you because you opened the issue. @mbruns91 and @samwaseda have both played in the centralized actions so maybe they are interested in doing/watching this.)

@jan-janssen
Copy link
Member

jan-janssen commented Nov 30, 2024

I did a first test run of mypy in conda_subprocess, pyfileindex and pyiron_dataclasses:

I think it is an helpful addition and a great step to improve our code quality. Still it might take some time to update all repositories to have strict type hints - currently the other packages like atomistics, executorlib, pysqa and pylammpsmpi all have more than 100 type errors each.

@XzzX
Copy link
Contributor Author

XzzX commented Dec 2, 2024

Letting it percolate more, and with the examples from @niklassiemer suggesting to us that mypy may let us accomplish it without making the code insanely ugly, I'm moderately in favour of the change (as long as we can continue to find reasonably aesthetic routes accomplish it).

Well, you cannot blame mypy for doing its job. It enforces a strong type guarantee whenever type hints are present. That's its job. If your code is incorrect from this perspective you can either not use mypy (we could run it selectively on just a subset of files), you could change your code to provide a strong type guarantee or you can find a multitude of workarounds in between.

assert isinstance(owned_node, Node) is just a silence mypy error. It delegates the type check back to runtime. If this line does not catch the error then the illegal access in the following lines will. Apart from a different runtime error message we have not gained anything.

In my opinion the correct solution would be to make the code strongly typed (assuming we want it to be like that). If the current class hierarchy is to be preserved, it means making children a generic container
https://docs.python.org/3/library/typing.html#generics
and providing the correct type at inheritance level.
Composite(SemanticParent[Node])

Of course, this introduces more complexity and burden for those not familiar with strongly typed languages.

@XzzX
Copy link
Contributor Author

XzzX commented Dec 2, 2024

Another remark about unnotated stuff:

def foo(a: int, b) -> int:
    return a+b

Is correct. If no type is specified, no check will happen. If you use the result of foo type checking will continue since the return type is known.

@XzzX
Copy link
Contributor Author

XzzX commented Dec 2, 2024

In other places, e.g. in Composite(SemanticParent).children where we narrow the scope in SemanticParent.children from Sematic to Node(Semantic), it seems to me like we probably need to have a bunch of extra boiler plate like

@property
def Composite.children(self) -> bidict[str, Node]:
    return super().children

This is wrong and not an option. Consider the following:

class A(Semantic): #<- this is not a Node
    pass

def add(container: SemanticParent):
    a = A(...)
    container.children.append(a) #<- valid according to type definitions

foo: Composite = ...
add(foo) #<- incorrect, due to redefition of children

The redefinition of Composite.children is incompatible with Composite being derived from SemanticParent!

@XzzX
Copy link
Contributor Author

XzzX commented Dec 2, 2024

This is a good benchmark. If the typical developer immediately sees the problem pointed out we can enforce strong typing. If not, and the typical developer thinks this is rather strange and complicated, we should not do it.

@liamhuber
Copy link
Member

Another remark about unnotated stuff:

def foo(a: int, b) -> int:
    return a+b

Is correct. If no type is specified, no check will happen. If you use the result of foo type checking will continue since the return type is known.

Yes, this is exactly how the runtime checking works too -- if one side of a connection has no type hint at all, the type checking is simply skipped.

@liamhuber
Copy link
Member

This is wrong and not an option. Consider the following:

...

The redefinition of Composite.children is incompatible with Composite being derived from SemanticParent!

Yes, this is a conclusive example that my first idea for how to handle the type narrowing is wrong.

In my opinion the correct solution would be to make the code strongly typed (assuming we want it to be like that). If the current class hierarchy is to be preserved, it means making children a generic container
https://docs.python.org/3/library/typing.html#generics
and providing the correct type at inheritance level.
Composite(SemanticParent[Node])

TBH I have never used the generics or type vars. I played around with them a little bit this afternoon and see how to make the children bidict dynamicalIy typeable in children of SemanticParent. Extremely lightly sketched:

...
from typing import Generic, TypeVar, Dict
...

T = TypeVar('T', bound='Semantic') 

class SemanticParent(Generic[T]):
    _children: bidict[str, T]
    ...

class Composite(..., SemanticParent[Node], ...):
    ...

I'm not in love with it, but it's relatively compact and clear. Thanks for the pointer to Generic.

Of course, this introduces more complexity and burden for those not familiar with strongly typed languages.

Realistically, this is not a trivial point. I'm not "blaming" mypy, but rather if we go this route (and perhaps to decide whether we should go this route), it will be important to minimize the difference to "plain" python and to introduce as few new concepts (e.g. Generic, TypeVar) as possible.

I absolutely see the merit of having the type hints be provably correct. I hope the typing here is overall not too bad, but there are 100% going to be more type narrowing problems than you uncovered in this example, and it would be great to resolve them. It is not super high on my priority list, but I fully agree it is a positive change.

The catch is that while it makes the code base easier to learn and more reliable (a clear net win), it only does so for people who are already familiar with these tools/ideas. It requires this extra knowledge, which presents a barrier to getting other developers who don't already have this knowledge to maintain and expand the project at all. So I see some intrinsic tension between making things easier for a hypothetical "ideal" dev, and our actual real team. At present, I would be surprised if anyone on the team saw Composite(SemanticParent[Node], HasParent[SemanticParent], Node, ABC): and was immediately able to understand what's happening -- before today I certainly couldn't!

So I like the idea, and I'd like to move forward with it, but I don't think it's a good idea without at least moderate buy-in from the rest of the team. Are others willing to learn about generics, type vars, and willing to spend the time making sure stuff passes mypy? @niklassiemer, you have a permanent position, are you ok committing to figuring out the typing stuff sometime in a hypothetical future where you take over maintenance? For me, that would be the absolute bare-minimum level of buy-in.

Getting people to use pyiron_workflow has been pretty smooth and easy. Getting people to look inside it and modify it has been like pulling teeth, and the bus factor is effectively 1. I'm very nervous about adding any extra barriers to the subset of people who would be most likely to improve that factor.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants