Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: Fast, Efficient Unions #7016

Closed
1 of 4 tasks
agocke opened this issue Mar 1, 2023 · 99 comments
Closed
1 of 4 tasks

Proposal: Fast, Efficient Unions #7016

agocke opened this issue Mar 1, 2023 · 99 comments

Comments

@agocke
Copy link
Member

agocke commented Mar 1, 2023

Fast, Efficient Unions

  • Proposed
  • Prototype: Not Started
  • Implementation: Not Started
  • Specification: Not Started

Many types of unions, including discriminated unions, have been proposed for C#. This proposal, unlike others that came before, is focused on building a new, efficient primitive for C# and .NET. As of now, C# and .NET have lacked a mechanism for expressing any kind of type union, meaning an "or" relationship between types.

While developers might reasonably prefer a wide variety of designs for union types depending on their domain, as a language and type system it should be a goal to build broad, useful primitives on which many separate complex structures could be built.

To that end, any union primitive we construct should flexible and as efficient in both space and computation.

Given this constraint, we know that we want a struct-based solution. Introducing allocations into a previously allocation-free code path is rarely acceptable, while introducing structs into reference-type code paths is generally fine. There are also two union designs which stand out as good candidates.

The first is what we might call a "pure" union -- a combination of multiple types in an "or" relationship. For this we could introduce a pure "or" type-level combinator, which for this proposal we will call | and use in combination with type variables in the form (A|B|C).

The second design is what is usually described as a "discriminated" or "tagged" union. In this design we would define a new type declaration form to introduce the union, and a series of cases which delimit precisely the set of types that comprise the state of the union when in that case. For instance,

enum struct E
{
    Case1(A),
    Case2(B),
    Case3(C),
}

It should be noted that, although only the second form is described as a "tagged" union, in fact the "pure" union is tagged as well, since all values carry type tags in the .NET runtime and can be queried through type tests. The usefulness and usability of each tagging system is covered in more detail later.

Semantics

To discuss the different designs usefully, we must assign semantics to the different designs. These semantics could be debated, but I will do my best to define what I see as the most "generous" capabilities in each design, without regard for implementation. When implementation is covered later, we will note problems and limitations.

To start we may note the similarities in both union designs. Both provide the ability to represent multiple values of different types in a single value and single type specification.

Pure unions

In the case of pure unions, the inline | combinator provides the ability to represent anonymous unions directly in type syntax. This is paired with a trivial introduction form -- a value of pure union type may be created simply by assigning a value of one of its cases to it. This makes pure unions particularly good at ad hoc modeling. In any type location a pure union could be provided, and in general, no additional code is required to create values of that type, either in the method return position, method argument position, or local variable assignment position.

On the other hand, while pure unions are simple for ad hoc modeling, they may not be as effective for larger scale programs. While it is useful that the | combinator does not require a type declaration, it can be problematic that it also doesn't allow for one. Consider that C# variable names are rarely pithy single characters like A or B and in fact are usually quite long. A|B looks good in examples but,

string|List<string>|Dictionary<string, string> M(string|List<string>|Dictionary<string, string> u)
{ ... }

isn't quite as attractive or readable. The inability to declare unions using the pure union syntax is made more complicated by the fact that C# doesn't have a particularly good way of declaring new type aliases. The best answer is the using directive, which cannot be public, cannot represent open generics, can only appear at the top of a file, and often requires specifying the fully qualified type name of any alias targets. While these limitations could be changed or lessened, these changes are not currently proposed -- partially because C# has historically not preferred this style of programming over direct declarations.

Aside from declaration simplicity, the union | combinator is also conceptually simple and familiar. Many C# programmers are already familiar with the | value combinator and its extension to the type level is mostly natural. The type-level combinator is commutative and associative, just like its value form. Subtyping also follows the same simple set theory rules that hold true of the set "union" combinator, which many developers will be familiar with, e.g. A|B is a subtype of A|B|C, as is C|A.

Lastly, the elimination form is also simple and familiar -- dynamic type checking through pattern matching. To deconstruct a union A|B we can use simple pattern matching to check for each of the cases, i.e.

void M<A, B>(A|B union)
{
    if (union is A a)
    {
        ...
    }
    if (union is B b)
    {
        ...
    }
}

or with a switch expression

void M<A, B>(A|B union)
{
    var x = union switch {
        A a => ...,
        B b => ...
    };
}

In the second case, we are even automatically provided exhaustiveness checking, since all cases of the union have been handled in the switch.

Unfortunately, the simplicity defined above does come with some significant downsides. In particular, the flexibility of |, including allowing generic type parameters as cases, prevents unions from guaranteeing disjointedness. That is, for any union of the form A|B, there is no guarantee that A and B do not have a subtype or identity relationship. For instance, if the method M in the previous example were instantiated as M<object, string>, the union of A|B would be substituted as object|string. This creates potential problems for uses of the elimination forms, as case B would never be executed in this situation. While the switch expression in the second example would continue to check for exhaustiveness in this case, it would not check for subsumption.

Tagged unions

In contrast with pure unions, tagged unions provide only a declaration form, not an inline form. The benefits and drawbacks are almost the mirror image of the pure union cases described above. A tagged union is declared via the syntax enum struct, deliberately reminiscent of simple C# enums. Inside the body of the enum struct is a series of cases. Each case begins with a case name and is followed by a comma-separated list of types, similar to a tuple type:

enum struct S
{
    Case1(A),
    Case2(A, B)
}

Unlike C# enums, the case names don't directly represent values, instead they are simply names used in the introduction and elimination forms. For the introduction form, each case name corresponds to a public static method on the enum struct type. This method is defined to have a parameter list with parameter types and arities that match the list following the case name in the enum struct declaration.

For instance, in the above example a valid expression would be S.Case1(a), where a is a variable of type A. The result of this expression would be a value of type S.

To use the values passed into case methods on S, there is a corresponding elimination form in pattern matching to retrieve the values, similar to deconstruction. The case name once again appears as a new language construct, also nested under the enum struct name scope, which we might simply call a "variant." Variants may appear only as patterns, in the same position as a type pattern. Following the variant, there may be a deconstruction pattern which matches the arity and types of the original parameter list to the enum struct case. For instance,

if (x is S.Case1(A a))
{

}
if (x is S.Case2(A a, B b))
{

}

or

var y = x switch {
    S.Case1(A a) => ...
    S.Case2(A a, B b) => ...
}

Like pure unions, the above elimination forms guarantee exhaustiveness. Unlike pure unions, however, tagged unions are guaranteed to be disjoint and therefore they also guarantee subsumption. Also unlike pure unions, they do not have a "transparent" form, which means that they cannot be eliminated without using the new, variant-case elimination form.

What may be most notable about the above formulation is that, while pure unions may be conceptually familiar, tagged unions should be particularly familiar to C# programmers, as they behave quite similarly to traditional C# enums (thus the choice of similar syntax). Like enums, tagged unions have only a declaration form. Like enums, they have a series of named cases. Like enums these cases can be switched on. The primary difference is that each enum struct case carries values of arbitrary different types, while all traditional enums are guaranteed to have a single value of one shared underlying primitive type.

Implementations

Pure unions

For pure unions, there is one obvious encoding to the current CLR type system. For any union type (T1|T2|T3|...Tn), this could be translated to an instantiation of a predefined struct type OneOf<T1, T2, T3, ... Tn>. This fits well because:

  • Sharing a type can remove duplicate declarations. The (A|B) syntax is a type syntax, not a declaration syntax. Therefore you would expect the same type to repeated often throughout signatures. If each term were to create a different type declaration, this could incur large compile- and run-time size costs.
  • Sharing a type makes conversions simpler. If each A|B type syntax corresponded to a different CLR type, then conversion between variables could be costly. Using one type makes conversions immediate.

That said, our optimizations around conversions are limited. In particular, because CLR structs lack variance capabilities, we cannot cheaply implement the subtyping relationship that we described in the semantics above. Instead we would be forced to emit a series of type checks for each case, converting each one individually to the new type. This is a small, but non-zero cost. We would have to provide the same trick for case reordering, as the CLR would see OneOf<A, B> and OneOf<B, A> as different types, despite our desire that A|B and B|A represent the same type.

This is particularly unfortunate in places where the CLR forces nested type equivalence, like in assigning List<OneOf<A, B>> to List<OneOf<B, A>>. Here such a conversion would be particularly expensive, likely forcing the allocation of new backing memory.

Tagged unions

Tagged unions would preferably be implemented using standard struct types, with some runtime optimizations. Every enum struct would need to contain a tag, to enforce our guarantee of disjointedness. It would also need to contain fields for each type in each type list in each case. Lastly it would also need to contain methods for creating instances of the enum struct. A declaration like

enum struct S
{
    Case1(A),
    Case2(A, B)
}

may look like

readonly struct S
{
    public readonly byte Tag;
    public readonly A <>Case1_1;
    public readonly A <>Case2_1;
    public readonly B <>Case2_2;

    public static S Case1(A a) => new S(1, a, default, default);
    public static S Case2(A a, B b) => (2, default, a, b);

    private S(byte tag, A case1_1, A case2_1, B case2_2)
    {
        Tag = tag;
        <>Case1_1 = case1_1;
        <>Case2_1 = case2_1;
        <>Case2_2 = case2_2;
    }
}

and the elimination form x is Case2(A a, B b) would look like

x.Tag == 2 && (x.<>Case2_1, x.<>Case2_2 is (A a, B b))

In general, most operations should be simple operations on the larger struct, since cases cannot be represented directly, only held inside a value of the parent type, or deconstructed to constituents. Improvements on basic structs would likely be left to the runtime.

Space efficiency

First, let's consider the space efficiency of possible implementations of the options. For all unions, space efficiency is defined relative to a tuple type with each of the cases, since tuples can represent all possible values of all possible union cases simultaneously.

For pure unions, any space optimization comes from runtime optimization. Since each type parameter in OneOf may be an incompatible overlapping type, the runtime will have to be responsible for special-casing the OneOf type to provide better space utilization.

For tagged unions, this same optimization strategy could be employed, or the C# compiler could participate. Since certain types and members would be known to be overlap-compatible, the compiler could emit FieldOffset as appropriate to optimize space. Space usage of tagged unions may be uniformly higher since they may be forced to always include an extra tag field.

Computation efficiency

As mentioned previously, pure unions may be forced to do more individual type checks and assignment to provide conversion semantics.

Tagged unions likely will not need any such overhead. Most overhead will likely come from checking the tag. If tag checking is cheaper than type checking, tagged unions may be slightly more efficient, or vice versa.

Conclusion

Overall, it seems that both proposals have significant advantages and disadvantages.

However, tagged unions appear to initially fit better with C#. The syntax and semantic forms align more closely with the way C# has been designed and how it is widely used. As an analogy, we might note that tuples are to pure unions as structs/classes are to tagged unions. While tuples are a valuable language feature with unique advantages, their usage pales in comparison to structs and classes. C# is simply optimized for the kind of programming that favors structs and classes.

Indeed, one of the biggest weakness of pure unions appears to be that many of the "elegant" parts of their semantics would be difficult to efficiently implement in C#. With these disadvantages, the field is tilted even more towards tagged unions.

However, in the same way that tuples and structs/classes are both part of C# and provide unique advantages, it might be worthwhile to consider adding both designs.

If we must add only one, or prioritize one, then tagged unions should be preferred.

Edit:

I talked with the runtime team on potential optimizations for any union implementation we came up with. Unfortunately, I missed that there's a fundamental tearing problem with object and value overlap. Consider,

struct BuiltInUnion
{
   byte Tag;
   int Value1;
   string Value2;
}

This type represents a union of int and string. Ideally we would like to have the minimal possible space by overlapping Value1 and Value2, since BuiltInUnion will only use one of them at a time. But note that this is a multi-word-sized type. Even if Value1 and Value2 overlap, the tag exceeds word size, which means that it is not possible to atomically update this type during a copy. So if BuiltInUnion appears in a location shared by two threads, copying could produce tearing such that the tag is updated and the value isn't, or vice versa, which would produce memory unsafety and potentially crash the runtime.

There isn't a simple fix here, just a set of tradeoffs. Those tradeoffs are:

  1. Size
  2. RW time
  3. Allocation
  4. Ref struct
  5. Safety

(4) would mean making enums ref structs, which would remove the data race, but would be incredibly restrictive. Probably not a general solution. (5) would compromise memory safety, which is unacceptable in my opinion.

That leaves size, allocation, or RW time. Fixing the race through synchronization would mean including what is effectively a spin lock, which increases copy time by ~150x. Allocation has the standard tradeoffs. Size might be the best thing to compromise on.

@HaloFour
Copy link
Contributor

HaloFour commented Mar 1, 2023

I would agree that tagged unions, as implemented as structs with named cases, seems like the "easier" approach and I think would be a low cost abstraction, which I think is critical for common DUs like Option<T>/Result<T>.

I think there's also room for reference-based DUs. If they had a similar enum class syntax they could be implemented like DUs in F#, as an abstract class with nested case classes. Or C# could go the Java route and offer a way to describe what types are permitted to implement or extend from an interface or abstract class:

public sealed interface Foo permits Bar, Baz { ... }

You get no syntax shorthand but the compiler could make assumptions as to whether or not your matches are exhaustive.

@FaustVX
Copy link

FaustVX commented Mar 1, 2023

Seeing enum struct means enum class is also valid (at least to me)
And like @HaloFour said, enum class could mean

enum class Foo
{
  Bar,
  Baz(string a),
}

can be translated to

abstract sealed class Foo
{
  public sealed class Bar : Foo { }
  public sealed class Baz(string a) : Foo { public string A { get; } = a; }
}

@FaustVX
Copy link

FaustVX commented Mar 1, 2023

@HaloFour

Or C# could go the Java route and offer a way to describe what types are permitted to implement or extend from an interface or abstract class:

public sealed interface Foo permits Bar, Baz { ... }

You get no syntax shorthand but the compiler could make assumptions as to whether or not your matches are exhaustive.

I don't really like that syntax, on Foo you specify Bar and Baz.
And on Bar and Baz you specify they implements Foo
It seems to be redundant.

@HaloFour
Copy link
Contributor

HaloFour commented Mar 1, 2023

@FaustVX

I don't really like that syntax, on Foo you specify Bar and Baz.
And on Bar and Baz you specify they implements Foo
It seems to be redundant.

Yes, it is redundant, but it's also explicit. I'm not suggesting that be the syntax that C# adopts, but it does offer a way to describe a closed hierarchy, which I think is suitable for a discriminated union of related types.

@RikkiGibson
Copy link
Contributor

Sorry if I missed it but I wondered what the semantics of 'default' are for enum structs. If such a value makes its way into my system, what constructs will match it? Will it just be the first case of my enumeration with default values for all the case fields, for example?

@HaloFour
Copy link
Contributor

HaloFour commented Mar 1, 2023

@RikkiGibson

Sorry if I missed it but I wondered what the semantics of 'default' are for enum structs. If such a value makes its way into my system, what constructs will match it? Will it just be the first case of my enumeration with default values for all the case fields, for example?

I think there's a whole conversation that would have to happen there. In F# I think it's the first case. That may be safe in some contexts, e.g. Option<T>.None, but probably less so in others. But it's probably difficult (if not impossible) to avoid default so consideration should be made as to how it is handled.

@FaustVX
Copy link

FaustVX commented Mar 1, 2023

@RikkiGibson

Sorry if I missed it but I wondered what the semantics of 'default' are for enum structs. If such a value makes its way into my system, what constructs will match it? Will it just be the first case of my enumeration with default values for all the case fields, for example?

Maybe put the Tag value to 0 which signify an invalid state and any other defined values will be 1 or greater.
And maybe allow for

enum struct Option<T>
{
    None = 0,
    Some(T value) = 1,
}

and in this case default(Option<int>) == None

@orthoxerox
Copy link

I think there's also room for reference-based DUs. If they had a similar enum class syntax they could be implemented like DUs in F#, as an abstract class with nested case classes.

Or existing patterns around private constructors and nested classes can be recognized by the compiler.

enum class Foo
{
  Bar,
  Baz(string a),
}

isn't much shorter than

abstract record Foo
{
  private Foo() {}
  
  public sealed record Bar : Foo;
  public sealed record Baz(string a) : Foo;
}

@RikkiGibson
Copy link
Contributor

RikkiGibson commented Mar 1, 2023

Declaring a private constructor on abstract type seems like a pretty decent gesture for I intend this to be a closed set.

One place where this could fall over is if the type is from another assembly. I could update versions and now there's a new case at runtime which I didn't account for.

Of course this seems like a risk with all possible representations. What I think is missing is a gesture to say how that flow is supposed to work when new cases are added over time. When should the exhaustiveness warning be issued or not.

@HaloFour
Copy link
Contributor

HaloFour commented Mar 1, 2023

@RikkiGibson

One place where this could fall over is if the type is from another assembly. I could update versions and now there's a new case at runtime which I didn't account for.

That's true for DUs in general. I believe adding new cases is considered a breaking change. I think the only way to avoid that would be to indicate that the DU is non-exhaustive and that consumers should be forced to consider the unknown case. Otherwise, the compiler will end up throwing SwitchExpressionException exceptions.

@RikkiGibson
Copy link
Contributor

RikkiGibson commented Mar 1, 2023

We don't ask the user to handle null when the switch input is a non-nullable reference type.

#nullable enable
M(null!); // will throw at runtime

void M(object obj)
{
    Console.Write(obj switch { object obj1 => obj1.ToString() });
}

I think that kind of behavior is what we should have as a default for DUs, even from other assemblies.

// OtherAssembly
enum struct MyUnion
{
    Case1, Case2(int item)
}

// My code
void M(MyUnion u)
{
    Console.Write(u switch { Case1 => "a", Case2(_) => "b" }); // no warning.
}

From the semantic/diagnostics point of view, act like it is exhaustive, but still insert that throw branch if they don't use a pattern that matches "anything else" such as _.

Then, if we judge there is sufficient need for DUs which are not treated as exhaustive, introduce a mechanism for marking them as such, and give warnings for those if "anything else" is not handled.

@agocke
Copy link
Member Author

agocke commented Mar 1, 2023

Sorry if I missed it but I wondered what the semantics of 'default' are for enum structs. If such a value makes its way into my system, what constructs will match it? Will it just be the first case of my enumeration with default values for all the case fields, for example?

This is a great question which I don't really have a satisfying answer for. I suspect the answer will be "the first case with default values for each of the types" but there's no good reason why that's the case, it just is the case.

An answer which I would prefer is, "you can't take the default value of enum structs", but obviously that ship sailed long ago.

@HaloFour
Copy link
Contributor

HaloFour commented Mar 1, 2023

@agocke

I suspect the answer will be "the first case with default values for each of the types" but there's no good reason why that's the case, it just is the case.

I suspect that will be less error-prone than having a tag of 0/default indicate an invalid state which will likely result in exceptions thrown at runtime. I think it will be suggested to designers of DUs that they should order their tags so that it's "safer" for the first tag to be the default state. But maybe if the DU can explicitly specify the tag for each case, as suggested by @FaustVX , then that would enable DU authors to opt-in to having default be an invalid state:

public enum Option<T> {
    None = 0, // safe default
    Some(T value) // implied 1
}

public enum Either<TL, TR> {
    Left(TL value) = 1,
    Right(TR value) // implied 2
}

@yaakov-h
Copy link
Member

yaakov-h commented Mar 1, 2023

I suspect the answer will be "the first case with default values for each of the types" but there's no good reason why that's the case, it just is the case.

I would have expected default(SomeUnionOfSeveralThings) to be none of the things in the union, but I can also see immediately that this won't appeal to Option<T> or similar types.

@agocke
Copy link
Member Author

agocke commented Mar 1, 2023

I suspect the answer will be "the first case with default values for each of the types" but there's no good reason why that's the case, it just is the case.

I would have expected default(SomeUnionOfSeveralThings) to be none of the things in the union, but I can also see immediately that this won't appeal to Option<T> or similar types.

I actually like this. Maybe all tag values start at 1, and if you get 0 then your program blows up at the first switch expression.

@HaloFour
Copy link
Contributor

HaloFour commented Mar 1, 2023

@agocke

I actually like this. Maybe all tag values start at 1, and if you get 0 then your program blows up at the first switch expression.

Maybe by default? I think there are cases (Option<T> specifically) where assuming the first case is safe.

Would be nice to get the perspective from the F# folks here.

@agocke
Copy link
Member Author

agocke commented Mar 1, 2023

default is always unsafe in F#. You should never use it. If you know you're returning an Option, just use None.

@HaloFour
Copy link
Contributor

HaloFour commented Mar 1, 2023

@agocke

default is always unsafe in F#. You should never use it. If you know you're returning an Option, just use None.

Agreed, and I think the C# compiler could warn here as well. I'd be more interested in cases where you end up with default from the runtime where neither language can prevent it from happening.

@agocke
Copy link
Member Author

agocke commented Mar 1, 2023

@HaloFour
Copy link
Contributor

HaloFour commented Mar 2, 2023

@agocke

Yep, same is true with custom struct DUs, the first case has a tag of 0:

open System

[<Struct>]
type Multicase =
    | Case1 of Case1 : int
    | Case2
    
let x = Unchecked.defaultof<Multicase>
match x with
| Case1 i -> printf "%d" i
| Case2 -> printf "none"

This prints 0

@RikkiGibson
Copy link
Contributor

I would be disappointed if a mechanism were introduced for preventing "unsafe" default enum struct values, where regular old struct wouldn't be able to participate.

@agocke
Copy link
Member Author

agocke commented Mar 2, 2023

Yeah, that's fair. Tbh, I think the basic approach of, "it's the first one in the case of default" is both precise enough that things will work out, but also esoteric enough that people will generally avoid using default for enum structs. That's a good place to be.

@orthoxerox
Copy link

Non-defaultable value types would help a lot with designing proper struct tagged unions (Either<L, R> in language-ext has three tags to avoid mistakenly getting a default L or R from a default value of Either). However, I wouldn't want to see the DUs delayed by the massive amount of work needed to retrofit non-defaultable value types into C#. I'd rather get Option<T> and Result<T> sooner and deal with Result.Failure.Message being null. Or even better, it being a non-nullable property over the real nullable error message field that ??s the value with "someone tried to pass you a default value". Will the tags on DUs support properties or custom deconstructors?

@YairHalberstadt
Copy link
Contributor

In #4628 I suggested using the syntax A or B instead of A|B for union types.

This is because you can already write if (x is A or B) today, and it would be natural to assume that once union types exist you should be able to write if (x is (A or B) aOrB). This works well if the union type is indeed named A or B.

@RikkiGibson
Copy link
Contributor

I also am wondering if 'or' gives us an opportunity to unify these language concepts.

What concerned me was: doesn't that mean if I say if (x is (A or B) y), the type is different before and after this feature. I thought that pattern was already allowed--not at my PC to check.

@YairHalberstadt
Copy link
Contributor

Unfortunately it does have a meaning, but not what you would expect. The following compiles:

public class C {
    public void M(X x) {
        if (x is (int or string) y) {}
    }
}

public class X {
    public void Deconstruct(out object z) { z = null; }
}

It deconstructs X, and then matches it against the pattern int or string.

It's exceedingly rare to have a type which deconstructs to a single variable, so I think it would be reasonable to either make a breaking change here, or check for the deconstruct, and otherwise have the intuitive meaning.

@RikkiGibson
Copy link
Contributor

What I meant was for y in the example if (x is (A or B) y) changes from being whatever it was before (maybe the type of x or a common type of A and B?), to actually being a value of a union type.

@333fred
Copy link
Member

333fred commented Mar 3, 2023

@RikkiGibson y is X, because the input type is X, which has a Deconstruct method. It was deconstructed, the result was checked for int or string, then y was created with the type being deconstructed, or X.

@agocke
Copy link
Member Author

agocke commented Mar 3, 2023

Unfortunately it does have a meaning, but not what you would expect.

It deconstructs X, and then matches it against the pattern int or string.

Chalk this up as another source of friction with pure unions.

@RikkiGibson
Copy link
Contributor

My feeling is that having to explicitly declare all the unions you use can be annoying if you just want to return this, that or the other thing, and the user checks which one it was. Yet the amount of orthogonality problems we sidestep by not blessing a particular OneOf type in the runtime seems very significant. (Maybe there's a term more fitting than orthogonality here. Don't know.)

I'm pretty convinced that even if you implement special boxing behavior for a OneOf type, you will still get into trouble when it comes to checking conversions involving 'object' or type parameters, expectations around subtyping relationships when one pure-union type is convertible to another, etc... I don't recall if Andy discussed these problems here or just in meeting.

Basically, it's a case where making it so the cliff is easy to spot may be more valuable than pushing the cliff out as far as possible. If you have to get into some total corner case before the seams of pure-unions start to show, you might be worse off than saying: hey, these are just structs, they box and compare like structs, operate accordingly.

@brantburnett
Copy link

Maybe this has already been discussed, but wouldn't generic type parameters being unions further complicate matters? As an example:

public T | A M<T>(T u)
{ ... }

In this case, the union would effectively be OneOf<T, A>. However, what if the caller passes another union as T?

public void P(B | C u) {
    var x = M<B | C>(u);
}

In this case, x would seem to be OneOf<OneOf<B, C>, A> rather than OneOf<B, C, A>. This could create performance issues, complicate copying between different OneOf variants, etc.

Would we have to require that unions can't be used as generic type parameters? This seems like it could significantly reduce the usefulness. For example, then you couldn't iterate an IEnumerable<A | B> with LINQ. Another less drastic option would be to simply disallow combining generic type parameters with other types in a union, but this also limits usefulness.

@HaloFour
Copy link
Contributor

@brantburnett

In this case, x would seem to be OneOf<OneOf<B, C>, A> rather than OneOf<B, C, A>. This could create performance issues, complicate copying between different OneOf variants, etc.

Can you expand on why you think this would be a problem? I don't see it as any different from supporting generic types within generic type arguments today. I expect that nesting unions will be a very normal thing to do and don't see why it would pose any issues. Sure, it might be slightly slower than a flattened union, but that statement is true about any degree of indirection that exists in the language and runtime today.

@FaustVX
Copy link

FaustVX commented Apr 13, 2023

@HaloFour @brantburnett
What will happen if we have

public void M0(A | (B | C) u) // So OneOf<A, OneOf<B, C>>
{ }
public void M1(A | B | C u) // So OneOf<A, B, C>
{ }

M0(new B())
M1(new B())

Will the compiler be smart enough to support the calls to both M0 and M1 ?

@brantburnett
Copy link

In this case, x would seem to be OneOf<OneOf<B, C>, A> rather than OneOf<B, C, A>. This could create performance issues, complicate copying between different OneOf variants, etc.

Can you expand on why you think this would be a problem? I don't see it as any different from supporting generic types within generic type arguments today. I expect that nesting unions will be a very normal thing to do and don't see why it would pose any issues. Sure, it might be slightly slower than a flattened union, but that statement is true about any degree of indirection that exists in the language and runtime today.

I see two difficulties off the top of my head, there may be more I haven't considered yet.

  1. This would be very complex for the compiler to reason about when pattern matching x based on type. As a consumer of M, I logically feel like the type of x is B | C | A and should be able to pattern match across all three without knowing they are nested. If so, the compiler would need to handle unnesting itself. In complex libraries and use cases this could quickly become several levels deep.
  2. Similarly, what if the consumer wishes to pass x as a parameter to another method with a simple type of A | B | C represented as OneOf<A, B, C>? The proposal already covers the fact that some complex copy logic would be required, with a performance cost, to copy from OneOf<B, C, A> where the type parameters are in a different order. Also being capable of copying from OneOf<OneOf<B, C>, A> to OneOf<A, B, C> is even more complex for the compiler to reason about and may have a greater performance cost.

These difficulties are solvable, of course, I just wanted to call them out.

@HaloFour
Copy link
Contributor

@brantburnett

I personally don't see why the compiler should treat the type as (A | B | C), the correct type is (A | (B | C)). Trying to have the compiler flatten the into a single union sounds not only unnecessary but incorrect. But maybe that's all the more reason to not to try to have the compiler silently support OneOf<...> as a type union. This issue doesn't exist in tagged unions because OneOf<T1, T2> can only have two cases, it doesn't matter whether one of those cases is another union that can have other cases. I honestly don't know of a language that tries to treat unions in the way you've described.

@brantburnett
Copy link

I personally don't see why the compiler should treat the type as (A | B | C), the correct type is (A | (B | C)). Trying to have the compiler flatten the into a single union sounds not only unnecessary but incorrect. But maybe that's all the more reason to not to try to have the compiler silently support OneOf<...> as a type union. This issue doesn't exist in tagged unions because OneOf<T1, T2> can only have two cases, it doesn't matter whether one of those cases is another union that can have other cases. I honestly don't know of a language that tries to treat unions in the way you've described.

As a consumer it logically makes sense to me. Something that can be A or B or C should map to any way you can build A or B or C. We can choose in the design to make how you go about reaching that combination be relevant, but I don't know that it necessarily should be relevant. Even the proposal above states "The type-level combinator is commutative and associative". I'd also point out that, unless I'm misunderstanding something, this is the way that TypeScript effectively handles unions. I recognize that TypeScript is, from a design standpoint, a very different beast. However, I do think that things like this are worth considering.

@HaloFour
Copy link
Contributor

As a consumer it logically makes sense to me. Something that can be A or B or C should map to any way you can build A or B or C. We can choose in the design to make how you go about reaching that combination be relevant, but I don't know that it necessarily should be relevant.

I guess this is why I support the conclusion of this proposal that the team should focus on tagged unions. I'm of the opinion that they solve all of these use cases, albeit in a less clever way for the pure union scenarios. But if the pure union approach is very expensive it's likely not worth the effort and if the seams are visible then you just end up with weird language warts. The functional languages I am familiar with do anything like this, and nesting ADTs is very normal, no different than nesting any generic type arguments.

@Richiban
Copy link

@HaloFour @brantburnett What will happen if we have

public void M0(A | (B | C) u) // So OneOf<A, OneOf<B, C>>
{ }
public void M1(A | B | C u) // So OneOf<A, B, C>
{ }

M0(new B())
M1(new B())

Will the compiler be smart enough to support the calls to both M0 and M1 ?

Logically (as in, I have no idea whether this a good idea or not implementation-wise) I think it makes most sense to only support OneOf<T, U> with arity 2, that way A | B | C always means OneOf<A, OneOf<B, C>>.

@orthoxerox
Copy link

@Richiban What about OneOf<OneOf<A, B>, C>>? Is it the same type as OneOf<A, OneOf<B, C>>? If they are distinct types, which overloaded method should the compiler pick if I pass B as the argument?

IMO the type union should be treated as a commutative and associative operation by the compiler.

@wuzzeb
Copy link

wuzzeb commented Apr 14, 2023

I mentioned this on the main du issue #113 before I saw this proposal, but what about re-using records for each case instead of creating something new. Your tagged union would become

union struct S
{
    record struct Case1(A a) : S;
    record struct Case2(A a, B b) : S;
}

or a variant without the structs with just classes to.
The struct version would compile to

record struct S {
   readonly byte Tag;
   record struct Case1(A a) : S(1);
   record struct Case2(A a, B b) : S(2);
}

and the class version wouldn't even need the Tag. x is S.Case1 would check the Tag in the struct version and just do the normal thing in the class version.

The syntax is slightly more verbose since you need to repeat : S on each case and it would support inheritance of cases which is kinda weird, but it leverages lots of existing ideas like records and nested classes/structs.

@HaloFour
Copy link
Contributor

@wuzzeb

but what about re-using records for each case instead of creating something new. Your tagged union would become

It's basically the same thing with different syntax. But it implies that struct inheritance is more generally a thing and that is not the case, so it doesn't really make sense to use a form of syntax that implies that it is.

@Richiban
Copy link

@orthoxerox

IMO the type union should be treated as a commutative and associative operation by the compiler.

That is, of course, the most desirable outcome. It's what we have in Typescript, where A | B == B | A , A | A == A, (A | B) | C == A | (B | C) etc, but since I doubt the CLR will be getting some kind of genuine, built-in type union support we have to make do with some kind of lowering, and I just thought that reducing the options was a good idea.

You're right though, that with generics it would be possible to create a OneOf<OneOf<A, B>, C> so the problem doesn't go away entirely.

@CyrusNajmabadi
Copy link
Member

It's what we have in Typescript

Note that TS gets thsi because there is zero actual runtime impact of any of this, and it's ok if the values are inaccurate. In TS this is all erased and in JS it's all just 'object'. Such an approach isn't really viable for us/the-runtime as we're a strongly, statically typed system where this needs to be correct throughout the layers :)

@Richiban
Copy link

Absolutely; I understand that object is basically how Scala accomplishes this functionality and it's not what we want for C#, I'm sure.

@AartBluestoke
Copy link
Contributor

@Richiban What about OneOf<OneOf<A, B>, C>>? Is it the same type as OneOf<A, OneOf<B, C>>? If they are distinct types, which overloaded method should the compiler pick if I pass B as the argument?

IMO the type union should be treated as a commutative and associative operation by the compiler.

perhaps OneOf<T,U> where T not OneOf, to force a disambiguation? (not sure if where x is not a Y is a generic constraint that is possible)

@agocke
Copy link
Member Author

agocke commented Apr 17, 2023

Absolutely; I understand that object is basically how Scala accomplishes this functionality and it's not what we want for C#, I'm sure.

I'm actually convinced that if we want anonymous unions, erasure is the best way to implement it. That said, the fact that erasure is the best way to implement may be a reason not to implement anonymous unions.

@RikkiGibson
Copy link
Contributor

The erased anonymous unions proposal in F# also requires the types to be disjunct. I think there was a good reason we would also want to do that. I just don't remember it 😓. Maybe it's that if we don't have that requirement, we can't reliably test which case of the union it is and do things like dispatch to the right case of a switch. I think the disjunctness requirement could feel arbitrary and cumbersome for users.

I think tagged unions are the way forward. Hoping we get to hear more from the working group about it.

@agocke
Copy link
Member Author

agocke commented Apr 17, 2023

If we do implement anonymous unions, I think if we take a narrow set of features (mostly the | type combinator), then the semantics are somewhat sensible, if limited.

In particular, I think an elimination form like is (A | B) is a bad idea specifically because it runs into the problems specified around subsumption. On the other hand, if we end up with the same type-test elimination forms that already exist in C# then you would have to do

x switch {
A a => ...
B b => ...
};

And in the current language if A and B are generic parameters that are somehow not statically related, this will not produce any warnings about subsumption, or any checking. It's simply the case that you may end up with the test for A subsuming the test for B.

If anonymous unions end up the same way, that seems fine.

However, having a feature like tagged unions that guarantees disjunction seems separately valuable beyond what we have in the language today, so there are general reasons to prefer that general direction. Especially in the context of adding a language feature, where you would hope to add to the general capabilities of the language.

@Xyncgas
Copy link

Xyncgas commented Jun 12, 2023

@agocke

default is always unsafe in F#. You should never use it. If you know you're returning an Option, just use None.

Agreed, and I think the C# compiler could warn here as well. I'd be more interested in cases where you end up with default from the runtime where neither language can prevent it from happening.

When I was deaing with enum in F#, the compiler already warned me after I matched all the cases, there might be other values given to me outside of what I defined in the enum. Of course I would be getting other values, my enum is based on int so I can just convert an int to enum without the value being the cases of enum I defined, it's acceptable.

When we are making distinctions, we are doing the work of abstracting infinite state machine into finite state machine, a byte has 256 values but we don't usually have 256 distinctions in our union

@unknownburrito
Copy link

unknownburrito commented Jun 29, 2023

Awesome proposal. Although I believe record enum would be a more expressive name, especially because of the similar syntax with the parentheses. enum struct seems off because all enums are structs at the end of the day, whereas enum class is simply inaccurate since it will be lowered into a struct most likely. The way I see it, each enum value would behave as a record holding some arbitrary data (or nothing), and would have no identity.

@iuribrindeiro
Copy link

I really wish this gets implemented at some point. This would add so much value to the language.

@Xyncgas

This comment was marked as off-topic.

@CyrusNajmabadi
Copy link
Member

@Xyncgas Please keep your posts on topic. The topic here is about C# unions, not a question of which language someone wants to use. We want to consider if adding unions to C# is a good thing for our language and our ecosystem. And, if it's a good thing, what should the design be.

@agocke
Copy link
Member Author

agocke commented Aug 8, 2024

I’m going to close this out since I think the original proposal isn’t clear given the edit at the bottom.

I still think the best thing to compromise in the struct system is size. In particular I would have two types of unions: class and struct. Class would be implemented using inheritance and would be space efficient, at the cost of using reference types and GC. Structs would waste space but incur no GC overhead and have very fast code throughout.

Anonymous unions (the | operator) is probably best implemented by erasure to object. I still don’t know if it pulls its weight in the language.

@agocke agocke closed this as not planned Won't fix, can't repro, duplicate, stale Aug 8, 2024
@cremor
Copy link

cremor commented Aug 8, 2024

@agocke Could you maybe link to https://github.com/dotnet/csharplang/blob/main/proposals/TypeUnions.md in your first post here so that people can quickly find the proposal that is currently being worked on?

(I'm not saying that this other proposal has the same goals as yours. But it's at least related.)

@2A5F
Copy link

2A5F commented Aug 8, 2024

https://github.com/sera-net/Sera.Union

Propose gc changes to support object headers on the stack for overlapping struct and reference types

@stephen-hawley
Copy link

I understand the point of wanting fast efficient DUs. However, there are use cases that are beyond the scope of native DUs. C# interoperates with a number of different languages and runtime environments which provide their own native implementations of DUs. Any implementation that is done should support a "fast track" DU as well as an "abstract" DU. This steps around the "it should be a struct"/"use a class" argument neatly: using the native ones that the language gives you, but provide a mechanism for allowing a non-native implementation that will play well with the syntax.

That way, if a type representing a non-native DU (1) meets a particular interface and (2) statically advertises an association of the possible cases and types this should be manageable. There are a number of ways to do this.

The precedent for doing fast track/abstract already exists in C#: compare the output of foreach (var s in someStrings) { } when someStrings is string[] vs when someStrings is IEnumerable<string>. The high level goal of "iterate over every element in this collection" is met while the low level goal of "make it efficient if you can" is also met. Best of both worlds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests