-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal: Fast, Efficient Unions #7016
Comments
I would agree that tagged unions, as implemented as structs with named cases, seems like the "easier" approach and I think would be a low cost abstraction, which I think is critical for common DUs like I think there's also room for reference-based DUs. If they had a similar public sealed interface Foo permits Bar, Baz { ... } You get no syntax shorthand but the compiler could make assumptions as to whether or not your matches are exhaustive. |
Seeing enum class Foo
{
Bar,
Baz(string a),
} can be translated to abstract sealed class Foo
{
public sealed class Bar : Foo { }
public sealed class Baz(string a) : Foo { public string A { get; } = a; }
} |
I don't really like that syntax, on |
Yes, it is redundant, but it's also explicit. I'm not suggesting that be the syntax that C# adopts, but it does offer a way to describe a closed hierarchy, which I think is suitable for a discriminated union of related types. |
Sorry if I missed it but I wondered what the semantics of 'default' are for enum structs. If such a value makes its way into my system, what constructs will match it? Will it just be the first case of my enumeration with default values for all the case fields, for example? |
I think there's a whole conversation that would have to happen there. In F# I think it's the first case. That may be safe in some contexts, e.g. |
Maybe put the enum struct Option<T>
{
None = 0,
Some(T value) = 1,
} and in this case |
Or existing patterns around private constructors and nested classes can be recognized by the compiler. enum class Foo
{
Bar,
Baz(string a),
} isn't much shorter than abstract record Foo
{
private Foo() {}
public sealed record Bar : Foo;
public sealed record Baz(string a) : Foo;
} |
Declaring a private constructor on abstract type seems like a pretty decent gesture for I intend this to be a closed set. One place where this could fall over is if the type is from another assembly. I could update versions and now there's a new case at runtime which I didn't account for. Of course this seems like a risk with all possible representations. What I think is missing is a gesture to say how that flow is supposed to work when new cases are added over time. When should the exhaustiveness warning be issued or not. |
That's true for DUs in general. I believe adding new cases is considered a breaking change. I think the only way to avoid that would be to indicate that the DU is non-exhaustive and that consumers should be forced to consider the unknown case. Otherwise, the compiler will end up throwing |
We don't ask the user to handle null when the switch input is a non-nullable reference type. #nullable enable
M(null!); // will throw at runtime
void M(object obj)
{
Console.Write(obj switch { object obj1 => obj1.ToString() });
} I think that kind of behavior is what we should have as a default for DUs, even from other assemblies. // OtherAssembly
enum struct MyUnion
{
Case1, Case2(int item)
}
// My code
void M(MyUnion u)
{
Console.Write(u switch { Case1 => "a", Case2(_) => "b" }); // no warning.
} From the semantic/diagnostics point of view, act like it is exhaustive, but still insert that Then, if we judge there is sufficient need for DUs which are not treated as exhaustive, introduce a mechanism for marking them as such, and give warnings for those if "anything else" is not handled. |
This is a great question which I don't really have a satisfying answer for. I suspect the answer will be "the first case with default values for each of the types" but there's no good reason why that's the case, it just is the case. An answer which I would prefer is, "you can't take the default value of enum structs", but obviously that ship sailed long ago. |
I suspect that will be less error-prone than having a tag of public enum Option<T> {
None = 0, // safe default
Some(T value) // implied 1
}
public enum Either<TL, TR> {
Left(TL value) = 1,
Right(TR value) // implied 2
} |
I would have expected |
I actually like this. Maybe all tag values start at |
Maybe by default? I think there are cases ( Would be nice to get the perspective from the F# folks here. |
|
Agreed, and I think the C# compiler could warn here as well. I'd be more interested in cases where you end up with |
You're right, if you do |
Yep, same is true with custom struct DUs, the first case has a tag of 0: open System
[<Struct>]
type Multicase =
| Case1 of Case1 : int
| Case2
let x = Unchecked.defaultof<Multicase>
match x with
| Case1 i -> printf "%d" i
| Case2 -> printf "none" This prints |
I would be disappointed if a mechanism were introduced for preventing "unsafe" |
Yeah, that's fair. Tbh, I think the basic approach of, "it's the first one in the case of |
Non-defaultable value types would help a lot with designing proper struct tagged unions ( |
In #4628 I suggested using the syntax This is because you can already write |
I also am wondering if 'or' gives us an opportunity to unify these language concepts. What concerned me was: doesn't that mean if I say |
Unfortunately it does have a meaning, but not what you would expect. The following compiles: public class C {
public void M(X x) {
if (x is (int or string) y) {}
}
}
public class X {
public void Deconstruct(out object z) { z = null; }
} It deconstructs X, and then matches it against the pattern It's exceedingly rare to have a type which deconstructs to a single variable, so I think it would be reasonable to either make a breaking change here, or check for the deconstruct, and otherwise have the intuitive meaning. |
What I meant was for |
@RikkiGibson |
Chalk this up as another source of friction with pure unions. |
My feeling is that having to explicitly declare all the unions you use can be annoying if you just want to return this, that or the other thing, and the user checks which one it was. Yet the amount of orthogonality problems we sidestep by not blessing a particular OneOf type in the runtime seems very significant. (Maybe there's a term more fitting than orthogonality here. Don't know.) I'm pretty convinced that even if you implement special boxing behavior for a OneOf type, you will still get into trouble when it comes to checking conversions involving 'object' or type parameters, expectations around subtyping relationships when one pure-union type is convertible to another, etc... I don't recall if Andy discussed these problems here or just in meeting. Basically, it's a case where making it so the cliff is easy to spot may be more valuable than pushing the cliff out as far as possible. If you have to get into some total corner case before the seams of pure-unions start to show, you might be worse off than saying: hey, these are just structs, they box and compare like structs, operate accordingly. |
Maybe this has already been discussed, but wouldn't generic type parameters being unions further complicate matters? As an example: public T | A M<T>(T u)
{ ... } In this case, the union would effectively be public void P(B | C u) {
var x = M<B | C>(u);
} In this case, Would we have to require that unions can't be used as generic type parameters? This seems like it could significantly reduce the usefulness. For example, then you couldn't iterate an |
Can you expand on why you think this would be a problem? I don't see it as any different from supporting generic types within generic type arguments today. I expect that nesting unions will be a very normal thing to do and don't see why it would pose any issues. Sure, it might be slightly slower than a flattened union, but that statement is true about any degree of indirection that exists in the language and runtime today. |
@HaloFour @brantburnett public void M0(A | (B | C) u) // So OneOf<A, OneOf<B, C>>
{ }
public void M1(A | B | C u) // So OneOf<A, B, C>
{ }
M0(new B())
M1(new B()) Will the compiler be smart enough to support the calls to both |
I see two difficulties off the top of my head, there may be more I haven't considered yet.
These difficulties are solvable, of course, I just wanted to call them out. |
I personally don't see why the compiler should treat the type as |
As a consumer it logically makes sense to me. Something that can be A or B or C should map to any way you can build A or B or C. We can choose in the design to make how you go about reaching that combination be relevant, but I don't know that it necessarily should be relevant. Even the proposal above states "The type-level combinator is commutative and associative". I'd also point out that, unless I'm misunderstanding something, this is the way that TypeScript effectively handles unions. I recognize that TypeScript is, from a design standpoint, a very different beast. However, I do think that things like this are worth considering. |
I guess this is why I support the conclusion of this proposal that the team should focus on tagged unions. I'm of the opinion that they solve all of these use cases, albeit in a less clever way for the pure union scenarios. But if the pure union approach is very expensive it's likely not worth the effort and if the seams are visible then you just end up with weird language warts. The functional languages I am familiar with do anything like this, and nesting ADTs is very normal, no different than nesting any generic type arguments. |
Logically (as in, I have no idea whether this a good idea or not implementation-wise) I think it makes most sense to only support |
@Richiban What about IMO the type union should be treated as a commutative and associative operation by the compiler. |
I mentioned this on the main du issue #113 before I saw this proposal, but what about re-using records for each case instead of creating something new. Your tagged union would become union struct S
{
record struct Case1(A a) : S;
record struct Case2(A a, B b) : S;
} or a variant without the structs with just classes to. record struct S {
readonly byte Tag;
record struct Case1(A a) : S(1);
record struct Case2(A a, B b) : S(2);
} and the class version wouldn't even need the Tag. The syntax is slightly more verbose since you need to repeat |
It's basically the same thing with different syntax. But it implies that struct inheritance is more generally a thing and that is not the case, so it doesn't really make sense to use a form of syntax that implies that it is. |
That is, of course, the most desirable outcome. It's what we have in Typescript, where You're right though, that with generics it would be possible to create a |
Note that TS gets thsi because there is zero actual runtime impact of any of this, and it's ok if the values are inaccurate. In TS this is all erased and in JS it's all just 'object'. Such an approach isn't really viable for us/the-runtime as we're a strongly, statically typed system where this needs to be correct throughout the layers :) |
Absolutely; I understand that |
perhaps OneOf<T,U> where T not OneOf, to force a disambiguation? (not sure if where x is not a Y is a generic constraint that is possible) |
I'm actually convinced that if we want anonymous unions, erasure is the best way to implement it. That said, the fact that erasure is the best way to implement may be a reason not to implement anonymous unions. |
The erased anonymous unions proposal in F# also requires the types to be disjunct. I think there was a good reason we would also want to do that. I just don't remember it 😓. Maybe it's that if we don't have that requirement, we can't reliably test which case of the union it is and do things like dispatch to the right case of a I think tagged unions are the way forward. Hoping we get to hear more from the working group about it. |
If we do implement anonymous unions, I think if we take a narrow set of features (mostly the In particular, I think an elimination form like x switch {
A a => ...
B b => ...
}; And in the current language if If anonymous unions end up the same way, that seems fine. However, having a feature like tagged unions that guarantees disjunction seems separately valuable beyond what we have in the language today, so there are general reasons to prefer that general direction. Especially in the context of adding a language feature, where you would hope to add to the general capabilities of the language. |
When I was deaing with enum in F#, the compiler already warned me after I matched all the cases, there might be other values given to me outside of what I defined in the enum. Of course I would be getting other values, my enum is based on int so I can just convert an int to enum without the value being the cases of enum I defined, it's acceptable. When we are making distinctions, we are doing the work of abstracting infinite state machine into finite state machine, a byte has 256 values but we don't usually have 256 distinctions in our union |
Awesome proposal. Although I believe |
I really wish this gets implemented at some point. This would add so much value to the language. |
This comment was marked as off-topic.
This comment was marked as off-topic.
@Xyncgas Please keep your posts on topic. The topic here is about C# unions, not a question of which language someone wants to use. We want to consider if adding unions to C# is a good thing for our language and our ecosystem. And, if it's a good thing, what should the design be. |
I’m going to close this out since I think the original proposal isn’t clear given the edit at the bottom. I still think the best thing to compromise in the struct system is size. In particular I would have two types of unions: class and struct. Class would be implemented using inheritance and would be space efficient, at the cost of using reference types and GC. Structs would waste space but incur no GC overhead and have very fast code throughout. Anonymous unions (the |
@agocke Could you maybe link to https://github.com/dotnet/csharplang/blob/main/proposals/TypeUnions.md in your first post here so that people can quickly find the proposal that is currently being worked on? (I'm not saying that this other proposal has the same goals as yours. But it's at least related.) |
https://github.com/sera-net/Sera.Union Propose gc changes to support object headers on the stack for overlapping struct and reference types |
I understand the point of wanting fast efficient DUs. However, there are use cases that are beyond the scope of native DUs. C# interoperates with a number of different languages and runtime environments which provide their own native implementations of DUs. Any implementation that is done should support a "fast track" DU as well as an "abstract" DU. This steps around the "it should be a struct"/"use a class" argument neatly: using the native ones that the language gives you, but provide a mechanism for allowing a non-native implementation that will play well with the syntax. That way, if a type representing a non-native DU (1) meets a particular interface and (2) statically advertises an association of the possible cases and types this should be manageable. There are a number of ways to do this. The precedent for doing fast track/abstract already exists in C#: compare the output of |
Fast, Efficient Unions
Many types of unions, including discriminated unions, have been proposed for C#. This proposal, unlike others that came before, is focused on building a new, efficient primitive for C# and .NET. As of now, C# and .NET have lacked a mechanism for expressing any kind of type union, meaning an "or" relationship between types.
While developers might reasonably prefer a wide variety of designs for union types depending on their domain, as a language and type system it should be a goal to build broad, useful primitives on which many separate complex structures could be built.
To that end, any union primitive we construct should flexible and as efficient in both space and computation.
Given this constraint, we know that we want a struct-based solution. Introducing allocations into a previously allocation-free code path is rarely acceptable, while introducing structs into reference-type code paths is generally fine. There are also two union designs which stand out as good candidates.
The first is what we might call a "pure" union -- a combination of multiple types in an "or" relationship. For this we could introduce a pure "or" type-level combinator, which for this proposal we will call
|
and use in combination with type variables in the form(A|B|C)
.The second design is what is usually described as a "discriminated" or "tagged" union. In this design we would define a new type declaration form to introduce the union, and a series of cases which delimit precisely the set of types that comprise the state of the union when in that case. For instance,
It should be noted that, although only the second form is described as a "tagged" union, in fact the "pure" union is tagged as well, since all values carry type tags in the .NET runtime and can be queried through type tests. The usefulness and usability of each tagging system is covered in more detail later.
Semantics
To discuss the different designs usefully, we must assign semantics to the different designs. These semantics could be debated, but I will do my best to define what I see as the most "generous" capabilities in each design, without regard for implementation. When implementation is covered later, we will note problems and limitations.
To start we may note the similarities in both union designs. Both provide the ability to represent multiple values of different types in a single value and single type specification.
Pure unions
In the case of pure unions, the inline
|
combinator provides the ability to represent anonymous unions directly in type syntax. This is paired with a trivial introduction form -- a value of pure union type may be created simply by assigning a value of one of its cases to it. This makes pure unions particularly good at ad hoc modeling. In any type location a pure union could be provided, and in general, no additional code is required to create values of that type, either in the method return position, method argument position, or local variable assignment position.On the other hand, while pure unions are simple for ad hoc modeling, they may not be as effective for larger scale programs. While it is useful that the
|
combinator does not require a type declaration, it can be problematic that it also doesn't allow for one. Consider that C# variable names are rarely pithy single characters likeA
orB
and in fact are usually quite long.A|B
looks good in examples but,isn't quite as attractive or readable. The inability to declare unions using the pure union syntax is made more complicated by the fact that C# doesn't have a particularly good way of declaring new type aliases. The best answer is the
using
directive, which cannot be public, cannot represent open generics, can only appear at the top of a file, and often requires specifying the fully qualified type name of any alias targets. While these limitations could be changed or lessened, these changes are not currently proposed -- partially because C# has historically not preferred this style of programming over direct declarations.Aside from declaration simplicity, the union
|
combinator is also conceptually simple and familiar. Many C# programmers are already familiar with the|
value combinator and its extension to the type level is mostly natural. The type-level combinator is commutative and associative, just like its value form. Subtyping also follows the same simple set theory rules that hold true of the set "union" combinator, which many developers will be familiar with, e.g.A|B
is a subtype ofA|B|C
, as isC|A
.Lastly, the elimination form is also simple and familiar -- dynamic type checking through pattern matching. To deconstruct a union
A|B
we can use simple pattern matching to check for each of the cases, i.e.or with a switch expression
In the second case, we are even automatically provided exhaustiveness checking, since all cases of the union have been handled in the switch.
Unfortunately, the simplicity defined above does come with some significant downsides. In particular, the flexibility of
|
, including allowing generic type parameters as cases, prevents unions from guaranteeing disjointedness. That is, for any union of the formA|B
, there is no guarantee thatA
andB
do not have a subtype or identity relationship. For instance, if the methodM
in the previous example were instantiated asM<object, string>
, the union ofA|B
would be substituted asobject|string
. This creates potential problems for uses of the elimination forms, as caseB
would never be executed in this situation. While the switch expression in the second example would continue to check for exhaustiveness in this case, it would not check for subsumption.Tagged unions
In contrast with pure unions, tagged unions provide only a declaration form, not an inline form. The benefits and drawbacks are almost the mirror image of the pure union cases described above. A tagged union is declared via the syntax
enum struct
, deliberately reminiscent of simple C#enum
s. Inside the body of theenum struct
is a series of cases. Each case begins with a case name and is followed by a comma-separated list of types, similar to a tuple type:Unlike C# enums, the case names don't directly represent values, instead they are simply names used in the introduction and elimination forms. For the introduction form, each case name corresponds to a public static method on the
enum struct
type. This method is defined to have a parameter list with parameter types and arities that match the list following the case name in theenum struct
declaration.For instance, in the above example a valid expression would be
S.Case1(a)
, wherea
is a variable of typeA
. The result of this expression would be a value of typeS
.To use the values passed into case methods on
S
, there is a corresponding elimination form in pattern matching to retrieve the values, similar to deconstruction. The case name once again appears as a new language construct, also nested under the enum struct name scope, which we might simply call a "variant." Variants may appear only as patterns, in the same position as a type pattern. Following the variant, there may be a deconstruction pattern which matches the arity and types of the original parameter list to the enum struct case. For instance,or
Like pure unions, the above elimination forms guarantee exhaustiveness. Unlike pure unions, however, tagged unions are guaranteed to be disjoint and therefore they also guarantee subsumption. Also unlike pure unions, they do not have a "transparent" form, which means that they cannot be eliminated without using the new, variant-case elimination form.
What may be most notable about the above formulation is that, while pure unions may be conceptually familiar, tagged unions should be particularly familiar to C# programmers, as they behave quite similarly to traditional C#
enum
s (thus the choice of similar syntax). Like enums, tagged unions have only a declaration form. Like enums, they have a series of named cases. Like enums these cases can be switched on. The primary difference is that each enum struct case carries values of arbitrary different types, while all traditional enums are guaranteed to have a single value of one shared underlying primitive type.Implementations
Pure unions
For pure unions, there is one obvious encoding to the current CLR type system. For any union type
(T1|T2|T3|...Tn)
, this could be translated to an instantiation of a predefined struct typeOneOf<T1, T2, T3, ... Tn>
. This fits well because:(A|B)
syntax is a type syntax, not a declaration syntax. Therefore you would expect the same type to repeated often throughout signatures. If each term were to create a different type declaration, this could incur large compile- and run-time size costs.A|B
type syntax corresponded to a different CLR type, then conversion between variables could be costly. Using one type makes conversions immediate.That said, our optimizations around conversions are limited. In particular, because CLR structs lack variance capabilities, we cannot cheaply implement the subtyping relationship that we described in the semantics above. Instead we would be forced to emit a series of type checks for each case, converting each one individually to the new type. This is a small, but non-zero cost. We would have to provide the same trick for case reordering, as the CLR would see
OneOf<A, B>
andOneOf<B, A>
as different types, despite our desire thatA|B
andB|A
represent the same type.This is particularly unfortunate in places where the CLR forces nested type equivalence, like in assigning
List<OneOf<A, B>>
toList<OneOf<B, A>>
. Here such a conversion would be particularly expensive, likely forcing the allocation of new backing memory.Tagged unions
Tagged unions would preferably be implemented using standard struct types, with some runtime optimizations. Every enum struct would need to contain a tag, to enforce our guarantee of disjointedness. It would also need to contain fields for each type in each type list in each case. Lastly it would also need to contain methods for creating instances of the enum struct. A declaration like
may look like
and the elimination form
x is Case2(A a, B b)
would look likeIn general, most operations should be simple operations on the larger struct, since cases cannot be represented directly, only held inside a value of the parent type, or deconstructed to constituents. Improvements on basic structs would likely be left to the runtime.
Space efficiency
First, let's consider the space efficiency of possible implementations of the options. For all unions, space efficiency is defined relative to a tuple type with each of the cases, since tuples can represent all possible values of all possible union cases simultaneously.
For pure unions, any space optimization comes from runtime optimization. Since each type parameter in
OneOf
may be an incompatible overlapping type, the runtime will have to be responsible for special-casing theOneOf
type to provide better space utilization.For tagged unions, this same optimization strategy could be employed, or the C# compiler could participate. Since certain types and members would be known to be overlap-compatible, the compiler could emit
FieldOffset
as appropriate to optimize space. Space usage of tagged unions may be uniformly higher since they may be forced to always include an extra tag field.Computation efficiency
As mentioned previously, pure unions may be forced to do more individual type checks and assignment to provide conversion semantics.
Tagged unions likely will not need any such overhead. Most overhead will likely come from checking the tag. If tag checking is cheaper than type checking, tagged unions may be slightly more efficient, or vice versa.
Conclusion
Overall, it seems that both proposals have significant advantages and disadvantages.
However, tagged unions appear to initially fit better with C#. The syntax and semantic forms align more closely with the way C# has been designed and how it is widely used. As an analogy, we might note that tuples are to pure unions as structs/classes are to tagged unions. While tuples are a valuable language feature with unique advantages, their usage pales in comparison to structs and classes. C# is simply optimized for the kind of programming that favors structs and classes.
Indeed, one of the biggest weakness of pure unions appears to be that many of the "elegant" parts of their semantics would be difficult to efficiently implement in C#. With these disadvantages, the field is tilted even more towards tagged unions.
However, in the same way that tuples and structs/classes are both part of C# and provide unique advantages, it might be worthwhile to consider adding both designs.
If we must add only one, or prioritize one, then tagged unions should be preferred.
Edit:
I talked with the runtime team on potential optimizations for any union implementation we came up with. Unfortunately, I missed that there's a fundamental tearing problem with object and value overlap. Consider,
This type represents a union of
int
andstring
. Ideally we would like to have the minimal possible space by overlappingValue1
andValue2
, sinceBuiltInUnion
will only use one of them at a time. But note that this is a multi-word-sized type. Even ifValue1
andValue2
overlap, the tag exceeds word size, which means that it is not possible to atomically update this type during a copy. So ifBuiltInUnion
appears in a location shared by two threads, copying could produce tearing such that the tag is updated and the value isn't, or vice versa, which would produce memory unsafety and potentially crash the runtime.There isn't a simple fix here, just a set of tradeoffs. Those tradeoffs are:
(4) would mean making enums ref structs, which would remove the data race, but would be incredibly restrictive. Probably not a general solution. (5) would compromise memory safety, which is unacceptable in my opinion.
That leaves size, allocation, or RW time. Fixing the race through synchronization would mean including what is effectively a spin lock, which increases copy time by ~150x. Allocation has the standard tradeoffs. Size might be the best thing to compromise on.
The text was updated successfully, but these errors were encountered: