Replies: 8 comments 65 replies
-
Thanks for reaching out. This discussion is timely because a draft PEP 681 is about to be posted for this functionality. We have a small window of time to complete this PEP because Python 3.11 is locked (in terms of new functionality) in May. @debonte is the primary author, and he's shepherding it through the feedback and review process. It would probably be best to move this discussion to the python/typing discussion forum so the broader Python typing community benefits from it and has the opportunity to weigh in.
I presume that the behavior you describe above is specific to sqlalchemy 2.x. I don't see these classes/types in the sqlalchemy 1.x code base. Am I correct in understanding that sqlalchemy 2.x is still under development and that there is willingness to make some breaking changes from sqlalchemy 1.x? If so, I encourage you to consider switching to something that more closely matches the behavior of the stdlib `dataclass`.
As for your proposed solution, I see a few issues:
I wonder if it would make sense to apply the transform in the opposite direction. In other words, have the user specify the field type without the descriptor class, but then apply the descriptor class as part of the transform. This has the benefit of hiding some of the implementation details that users probably don't care about. The transform could be specified as part of the `dataclass_transform` decorator.
You asked whether `@dataclass` treats `Optional` fields as optional parameters. Here is its behavior:
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Model:
    a: int                   # No default value; an `int` argument must be passed to __init__
    b: Optional[int]         # No default value; an `int` or `None` argument must be passed to __init__
    c: int = 3               # Default value provided, so it can be omitted when calling __init__
    d: Optional[int] = None  # Default value provided, so it can be omitted when calling __init__

reveal_type(Model.__init__)
# Type of "Model.__init__" is "(self, a: int, b: int | None, c: int = 3, d: int | None = None) -> None"
```
-
Thanks for your feedback, @zzzeek. And sorry for the delayed response.
Yes, this is possible:
```python
from typing import Any, Literal, overload

@overload
def mapped_column(*, primary_key: Literal[True], init: Literal[False] = False) -> Any: ...
@overload
def mapped_column(*, primary_key: bool = False, init: bool = True) -> Any: ...
```
I was unable to find a way to configure overloads such that the type of the field would influence which overload we choose. @erictraut, is there some way to make this work with bidirectional type inference and generics, maybe? For the first issue I tried having the above overload return … Btw, is bidirectional type inference standardized, or is it a Pyright-specific thing?
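For readers following along, "bidirectional type inference" refers to the expected type of an expression (for example, a declared variable or field type) influencing how the expression itself is inferred. A small illustration that assumes nothing about SQLAlchemy:
```python
from typing import Optional

# With no expected type, [] would be inferred as an empty list of unknown
# element type. The declared annotation supplies an expected type, which
# flows "backwards" into the inference of the right-hand side.
values: list[Optional[int]] = []

# The analogous question above is whether a field's declared type
# (e.g. Mapped[int] vs. plain int) could flow into the call on the
# right-hand side and select a different mapped_column() overload.
```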
We could accept a field type transform as a parameter to the `dataclass_transform` decorator. Specifying the transform as part of the field descriptor would give more flexibility, though. Maybe as a new "transform" field descriptor parameter? Something like this:
```python
# region Library code

class Mapped(Generic[_T], util.TypingOnly): ...

# Library author provides a function (with no implementation?) to perform
# the desired field type transformation
def transform(input: Type[_T]) -> Mapped[_T]: ...

# The function is specified as the default value of a new "transform" field
# descriptor parameter
def mapped_column(*, init: bool = True, transform: Callable[[Type], Type] = transform) -> Mapped: ...

# endregion

# region User's code

@sqlalchemy_dataclass
class MyClass:
    # The type checker sees the default value of the transform parameter on
    # mapped_column and treats foo as if its type were Mapped[int]. At runtime
    # it is just int.
    foo: int = mapped_column()

# The synthesized __init__ method parameter's type would be int both during
# type checking and at runtime:
#
# def __init__(self, foo: int):
#     ...

# endregion
```
Questions:
-
Wouldn't … ? Btw, assuming this worked somehow, since …
Good point.
I think there was a logical leap you didn't explicitly write out here. Are you saying that requiring the field descriptor to be a class (rather than a function) is OK in this case because the …
-
Thanks for the fast turnaround. I'll try to have a look tomorrow.
- mike
On Mon, Feb 14, 2022, at 11:15 PM, Eric Traut wrote:
Here's the commit <471fba1> that adds provisional support for this new feature in pyright. I typically publish a new version of pyright on Tuesday evening in preparation for the weekly Wednesday release of Pylance. So watch for pyright 1.1.222 in the next 24 hours or so.
-
Hi Erik -
Great, I hope to be able to try this, hopefully today, although it's very difficult to find time. I know this is very time sensitive, so I'll try to hit it today.
- mike
On Fri, Feb 18, 2022, at 2:13 AM, Erik De Bonte wrote:
@zzzeek <https://github.com/zzzeek>, the new `transform_descriptor_types` <debonte/peps#3> `dataclass_transform` parameter is now available in Pyright 1.1.222 <https://github.com/microsoft/pyright/releases/tag/1.1.222> and Pylance 2022.2.3 <https://github.com/microsoft/pylance-release/releases/tag/2022.2.3>.
Give it a try and let us know how it goes!
-
Hi Erik -
The first one surprises me in that it wasn't doing that already; the second change I don't really understand.
For the first change, we have users put the descriptor type explicitly on the fields which are "mapped". Suppose HasDataClass includes a metaclass passed into dataclass_transform():
```python
class MyClass(HasDataClass):
    id: Mapped[int]
    name: Mapped[str]
```
The Mapped class is a descriptor, with `__get__` and `__set__` methods. It was my understanding that the type checker already considers MyClass.id and MyClass.name to be descriptors, so that at the instance level, MyClass().id and MyClass().name would operate the same as they would if the @dataclass_transform weren't present. In our case, Mapped[_T] will always return a _T at the instance level in any case, so perhaps we just didn't notice this. But I would say yes, if the class has a descriptor type like I have above at the class level, `__get__` and `__set__` should be honored at the instance attribute access level in all cases.
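For reference, here is a minimal sketch of the descriptor shape being described; it is an illustrative stand-in, not SQLAlchemy's actual Mapped implementation:
```python
from typing import Any, Generic, Optional, TypeVar, overload

_T = TypeVar("_T")

class Mapped(Generic[_T]):
    # Illustrative descriptor: class-level access returns the descriptor
    # itself, instance-level access returns the stored value.

    def __set_name__(self, owner: type, name: str) -> None:
        self._name = name

    @overload
    def __get__(self, instance: None, owner: Any) -> "Mapped[_T]": ...
    @overload
    def __get__(self, instance: object, owner: Any) -> _T: ...
    def __get__(self, instance: Optional[object], owner: Any) -> Any:
        if instance is None:
            return self                        # MyClass.id -> Mapped[int]
        return instance.__dict__[self._name]   # MyClass().id -> int

    def __set__(self, instance: object, value: _T) -> None:
        instance.__dict__[self._name] = value
```
With this shape, a type checker that honors the descriptor protocol gives MyClass().id the type int while MyClass.id remains a Mapped[int], which is the instance-level versus class-level behavior described above.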
For the second case, that one doesn't make much sense to me, both in terms of what it would do as well as why it would want to do... something.
If I make a class like this:
```python
class MyClass(HasDataClass):
    x: int
    y: int
    id: Mapped[int]
    name: Mapped[str]
```
I would expect that MyClass().id and MyClass().name have descriptor behavior. I would expect that MyClass().x and MyClass().y do not; they should be treated as they normally would be on any dataclass, meaning they can be passed to the constructor and will be present as instance variables. I don't see what introducing ClassVar into the type would accomplish. If the user wants x and y to be class variables, they should specify ClassVar, as would always be the case, as below, where "y" is a ClassVar:
```python
class MyClass(HasDataClass):
    x: int
    y: ClassVar[int]
    id: Mapped[int]
    name: Mapped[str]
```
If someone specifies the above, that's what they should get. As has been mentioned in the past, we are trying to replicate the "dataclasses" pattern here as closely as possible; I was going to sell this new feature as "as close as you can get to dataclasses", including support for generating __repr__(), __eq__() and all the rest just like dataclasses do (I'm using the dataclasses API directly to generate the methods). This way I can point users to how dataclasses work when there's confusion, and I don't have to teach a new API that is slightly different in its expected behavior for the parts that look exactly the same as they do on a standard dataclass.
You can see some tests for my current API at https://gerrit.sqlalchemy.org/c/sqlalchemy/sqlalchemy/+/3597/10/test/orm/declarative/test_dc_transforms.py. The test called test_integrated_dc exercises the above ideas with a class like this:
```python
class A(dc_decl_base):
    __tablename__ = "a"
    ctrl_one: str
    id: Mapped[int] = mapped_column(primary_key=True, init=False)
    data: Mapped[str]
    some_field: int = dataclasses.field(default=5)
    some_none_field: Optional[str] = None
```
Above, if someone writes a class like that, I think it's clear they are expecting ctrl_one, some_field and some_none_field to do the same thing they would do on a normal dataclass, that is, to be part of the __init__ method as well as the other methods like __repr__, __eq__, etc. One thing I've observed about users of dataclasses: they are very enthusiastic about being explicit where there is a choice. I'm not sure what the bigger rationale for assuming ClassVar would be, but let me know.
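For illustration, the __init__ that expectation implies would look roughly like the following sketch; this is hand-written to mirror the description above, not generated output:
```python
from typing import Optional

# Hypothetical signature a type checker would synthesize for class A above,
# assuming Mapped[X] fields contribute X-typed parameters and that the
# init=False field (id) is omitted:
def __init__(
    self,
    ctrl_one: str,
    data: str,
    some_field: int = 5,
    some_none_field: Optional[str] = None,
) -> None: ...
```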
- mike
On Fri, Mar 4, 2022, at 6:05 PM, Erik De Bonte wrote:
@zzzeek <https://github.com/zzzeek>, we're considering two changes based on a recent email from Carl Meyer on the Typing SIG mailing list (archive message FC3RIWDWQZTMXPDJT7UM6TKT5BBSXF3V):
1. When `transform_descriptor_types` is `True`, in addition to changing the `__init__` parameter types for descriptor fields, type checkers would understand that the descriptors' `__get__` and `__set__` methods will be used when those fields are accessed. So they would expect that getting the value of such a field would return the `__get__` method's return type. And similarly, when setting the field, a value compatible with `__set__` should be provided.
2. When `transform_descriptor_types` is `True`, *all* fields would be treated as class variables, not just descriptor fields. If we made this change, we would rename `transform_descriptor_types` to something else.
What do you think about these two proposals? Do they feel right to you? Would they improve things for SQLAlchemy at all?
-
OK (sorry for email, I'm only on my phone today), so the parameter would mean "all fields would be treated as class variables" (not ClassVar). OK.
In that case I don't think I understand what this particular change means. Say I have a normal dataclass with three fields and no descriptor types on it, but it can also make use of this flag in some way. What is the change incurred by having the flag set or not set, if no descriptors are present?
On Sat, Mar 5, 2022, at 12:00 PM, Eric Traut wrote:
@edebonte, I don't think it's necessary to document the first point. That's how descriptors always work, and dataclass_transform doesn't propose to make any change to the way descriptors work. So I think it would be confusing to say anything about `__get__` and `__set__` in the dataclass_transform PEP (other than the fact that the `__set__` value is used when synthesizing the `__init__` parameter for that field).
@zzzeek <https://github.com/zzzeek>, on the second point, there's an important distinction between `ClassVar` and a "class variable". We have three classes of variables to consider:
1. `ClassVar`: A class variable that is explicitly declared as such and cannot be overridden by a local variable without triggering a type checker error.
2. A normal "class variable": Can be (and often is) overwritten by an instance variable of the same name, unless it's a descriptor in which case the descriptor's `__set__` logic handles any assignments to the variable through a member access expression.
3. An instance variable only: Stored in the dictionary private to the instance. Cannot be accessed through the class. Descriptors do not work as expected if they are stored as instance variables.
```python
from typing import ClassVar

class MyDescriptor:
    # Minimal stand-in descriptor so the example runs; not part of the original snippet.
    def __get__(self, instance, owner):
        return 42
    def __set__(self, instance, value):
        pass

class MyClass:
    v1: ClassVar[int] = 0  # "ClassVar"
    v2: int = 1            # "class variable"
    v4 = MyDescriptor()

    def __init__(self):
        self.v3: int = 2   # "instance variable"

m = MyClass()

print(MyClass.v1)  # 0
print(MyClass.v2)  # 1
# print(MyClass.v3)  # Runtime error

print(m.v1)  # 0
print(m.v2)  # 1
print(m.v3)  # 2

# m.v1 = 1  # Type error - cannot overwrite ClassVar with instance variable
m.v2 = 10
m.v3 = 11

print(m.v1)  # 0
print(m.v2)  # 10
print(m.v3)  # 11
print(MyClass.v1)  # 0
print(MyClass.v2)  # 1
```
So I think Erik's proposal matches all of the desired behaviors you've outlined above. The tricky part is how to explain it all in a way that's clear in the PEP.
-
Post-mortem: mypy 1.1.1 is released and they did descriptors exactly wrong, the way I was afraid they would, which is why I really wanted this language to be in the specification. See python/mypy#14868.
-
So I just realized that dataclass transforms are available now using a magic name, `__dataclass_transform__`, so we could in theory use this now if it were compatible with our use case. However, the way things work in SQLAlchemy, it's not compatible right now. A SQLAlchemy mapped class defines behaviors not just at the instance level, but also at the class level, using Python descriptors. That is, a mapped class is not fully typed if it looks like this:
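A minimal sketch of that shape; the specific columns and Base setup here are illustrative assumptions, not taken from the original post:
```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class MyClass(Base):
    __tablename__ = "my_table"

    # Plain annotations: the type checker sees ordinary int/str attributes
    # and learns nothing about the class-level descriptor behavior.
    id: int = Column(Integer, primary_key=True)
    name: str = Column(String)
```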
That of course is a form we've made "work" using the mypy plugin, but it's not actually correct. The correct form is:
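Again as an illustrative sketch, with the same assumed columns but Mapped annotations (the Mapped import location is assumed; it is defined in orm/base.py per the link below):
```python
from sqlalchemy import Column, Integer, String
from sqlalchemy.orm import Mapped, declarative_base  # Mapped import location assumed

Base = declarative_base()

class MyClass(Base):
    __tablename__ = "my_table"

    # Mapped[...] declares the class-level descriptor behavior while still
    # conveying the instance-level value type.
    id: Mapped[int] = Column(Integer, primary_key=True)
    name: Mapped[str] = Column(String)
```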
Where `Mapped` is a descriptor that supplies all the correct behaviors for the mapped class, including SQL expressions at the class level like `MyClass.id > 5`. The `Mapped` type is a regular descriptor without too much going on, except that it has class-level behaviors defined: https://github.com/sqlalchemy/sqlalchemy/blob/f24a34140f6007cada900a8ae5ed03fe40ce2631/lib/sqlalchemy/orm/base.py#L608
So out of the gate, for dataclass_transforms to be useful to us, we'd need some way to have it extract the `_T` of the attributes, that is, to produce the effective `__init__` method below:
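Something along these lines; this is a sketch of the signature being described (keyword-only parameters and Optional defaults reflect the description that follows), not SQLAlchemy's actual generated code:
```python
from typing import Optional

# Every mapped attribute becomes an optional keyword argument defaulting to
# None, with the parameter type extracted from Mapped[_T]:
def __init__(self, *, id: Optional[int] = None, name: Optional[str] = None) -> None: ...
```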
Note also that SQLAlchemy's `__init__` method considers all attributes to be optional up front as well. Unlike dataclasses, SQLAlchemy's objects are highly dynamic: we assume some attributes come from database-generated defaults later on, other attributes are assigned at runtime but not within the constructor, etc.
At the moment it does not appear that we'd need to make use of the specially named parameters on the `field()` construct, which is good because that form is also not quite how our APIs look anyway; the ORM mapping process doesn't create any `__init__`-level default values, everything defaults to `None`, and everything is an optional keyword argument. (Ah, here I'd also ask: does dataclass transform consider `Optional[typ]` to mean that it's an optional keyword argument, as opposed to "the parameter is required but can be `None`"? That's another part.) The high degree of specificity on those parameter names for `field()`, including that these can't be attributes on the object or anything like that, makes me worry a bit, so hopefully if this proposal can be accommodated I won't need to get into that.
So here I propose at least one way our altered "synthesis" of `__init__` parameters can be made to work for us in a way that shouldn't get in anyone's way, since the whole purpose of the function is to model "runtime synthesis": the annotated type given in the body of the class also has a method that can be decorated / magically named, a name like `__dataclass_field_transform__`; for us it would look like the sketch at the end of this comment.
So when pyright looks at the fields for `__dataclass_transform__` in order to determine how it would "synthesize" an `__init__` method, it would also look at `__dataclass_field_transform__` on the declared type of each field to see what type of object is actually expected within the synthesized `__init__` method. There are probably other ways to do this too, but that's the basic thing we'd need. Let me know what you think.