-
Notifications
You must be signed in to change notification settings - Fork 1k
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simple encoding of unions in C# #7544
Comments
It feels like there is overlap here between this and extensions. What's teh difference between a declared union: public union Result : (string | int)
{
public bool IsEmpty => this is null or "" or 0;
} And a similar extension: public explicit extension Result for (string | int)
{
public bool IsEmpty => this is null or "" or 0;
} ? |
Another area this doesn't seem to handle well woudl be something like |
The allocations suggested here I think would immediately disqualify this approach for any common monads like |
This is true. But this is still a particularly bitter pill when you have a union of only struct-types. Having I wonder if, like generics, there is a middle ground where there is a single |
I still feel that a simple implementation of tagged unions covers the majority of use cases. Trying to implement type unions as something native, without deep runtime support for it, feels like it would be a very leaky abstraction where it's easy to run into the warts. Type unions can just as easily be emulated with tagged unions, making it obvious to the developer exactly what type they're working with rather than trying to hide it behind a lot of chicanery (and hidden allocations). |
There's a lot of overlap. The extension version would probably lead to two nested wrapper types, but that might not be a real issue. Mostly I wanted to highlight that declared unions were easily part of the model, without implicating another large feature. 😊 |
Do you mean when I think there's a set of questions around how type unions interact with generics. Do they nest? Do they merge? I considered this to be somewhat orthogonal to the representation, so I left it out. |
Gotcha. Primarily i was approaching this from the perspective of "could we avoid having one of these if we have hte other?" So, if extensions cover this case, then it would be nice (for me) if we only did the extension route, keeping things simpler in the user facing model :)
Yes.
WFM. |
One goal of this proposal is to get away from "immediately disqualify" and get all options on the table. If we don't like the allocation profile of a given proposal, then that's part of the discussion. Maybe we can find other solutions short of discarding the whole thing out of hand. E.g., if memory serves me right, F# special cases In general, I really think F# is an important data point here. Discriminated unions are the main abstraction mechanism in the language, they use classes - and hence allocation - nearly everywhere, and yet the world doesn't come crashing down on them. I'm not saying allocation doesn't really matter. It does! But other things matter too. With a proposal that optimizes for something else (expressiveness and simplicity), we can get a sense of the cost of a better allocation profile. |
I would just note that F# 4.1+ added struct tuples, struct records, and more to the point, struct unions. (Struct tuples were partially motivated by C# interop, but performance was also cited as a primary motivation.) So even in F#, allocations were enough of a concern to warrant fairly major language changes. |
Understood, and as an exploration then it's definitely appropriate to consider all options (pun intended). For tagged unions I personally think it's worthwhile to have both reference- and struct-based versions, just like F# does. I'd agree that an allocating version is probably fine in most cases. For a DU that wraps a value in a context, my opinion is that an allocation would make it too expensive. |
Thanks, Mads, this is an excellent middle ground proposal. |
If the value that you are switching over is a boxed union (typed as object) can you ever match any of the constituent types? How would the generated code get at the value? Or is boxing of these unions disallowed or maybe special cased with runtime magic like with |
If you are going to use records as implementations of tags w/ values for tagged union on top of type unions, and the value of the union is stored as a reference always, why not just use record classes instead of record structs and avoid boxing? There is still an allocation, but it avoids other potential tearing issues when you use the constituent types by themselves. |
That is an excellent question! I think it applies to any approach that encodes anonymous unions as structs. I don't think the answer can be that boxing is disallowed, because in generic contexts we may not know that the thing we are boxing is a union. One answer is that you can match it against a union type. If you use a union type in a type pattern, the compiler could generate code to bridge the gap between this union type and the one it actually is, if they are structurally compatible. That last bit might be tricky to establish. Another answer is that all type tests will check to see, as a fallback, if the value is a union, and if so, dig out its contained value and to the test against that one. This is a place where having a shared representation across all unions can really help, actually! In general, there seems to be a slew of open design questions around what to do when we don't know that we're dealing with a union, e.g., because of boxing or generics. I just steered clear of all of those! 😄 |
Good point! I think you're right in general. I used struct records partly because I was thinking ahead to the "Middle ground?" alternative, where if they come in as structs there's a chance that we can stash them without allocation. |
Could we take a page from Nullable and have the runtime have special handling for the boxing side of this? That way you don't ever see the wrapper when boxed, only when projecting back into the struct-union form. |
Wouldn't that make all type checks more expensive? IMO, trying to dig a type union out of a box only sounds like another reason to not do type unions, at least not without deep runtime support that makes this entirely transparent (and a zero-cost abstraction) to the language. |
That's a neat idea. When boxing, the runtime simply removes the wrapper and produces the already boxed contents! This seems most realistic with the simple wrapped-object-reference scheme though, otherwise the runtime would need to know some gnarly details about how to unpack the contents of the union struct. |
Yes, probably. |
It seems like it would potentially be viable with the hybrid model as well. Runtime would def have to know about the special encoding there, but that doesn't seem so bad. This would at least make small-unmanaged-structs and reference types non-boxing, which would def be a win for me. |
Apparently the current metadata is blocking us to have an efficient implementation of all those kinds of modern features. Another approach would be introducing metadata v3.0 to provide all the necessary runtime support for union types, extensions, const generics, variadic generics, HKT and etc. Then we will be able to have a high-performance no-workaround-needed straight-forward efficient implementation of those features. |
An out-of-topic thing is that roslyn and BCL are opting-out some language features to save allocations, namely interface-based LINQ.
To be explicit, it allows efficient implementation. The layout of union of structs still need to be defined somewhere, but it can be decided by the runtime layout algorithm. |
Is it possible to solve the performance problem by lowering to something like PS. Off the top of my head, this is probably difficult from the perspective of GC tracking... |
FWIW, I think Swift has a concept of embedding the payloads of enum cases when they are small, and the storage can be shared, and using a polymorphic reference type or something when they are large. |
I've been thinking about this off and on for while and have broadly come to the conclusion that we need to decide on semantics and declaration forms before we deep dive into encoding. The main issue at hand is this: what we're calling "type unions," in set theory, are just called "unions." What we're calling "tagged unions," in set theory, are called "disjoint unions". These are not the same set operations. You can, in some cases, construct one from the other. But I don't think that this is an argument for why we should prefer one over the other, how they should play in the language, or what semantics will mesh best with C#. I see unions as being useful in small, simple cases, like we use tuples today. I see disjoint unions as being useful as top-level data types, like we use classes and structs today. I'm not sure either is or should be mutually exclusive. And if I had to choose one, I would probably prefer disjoint unions, for much the same reason that I would probably prefer classes over tuples if I could only pick one. |
I needed to build something like this to handle variant parameters, and came up with a fairly compact codewise solution for my common case which were smaller structs (e.g. The downside was to get good performance for nullable required some fairly ugly hacks, as I wanted to essentially treat it like Obviously, it would be great that instead of having to call Here is an older verison of my solution:: |
To elaborate on #7544 (comment) I think there are a couple problems that crop up in the encoding that lead to important semantic discussions. In particular, @mattwar raised the very good question, "how does this work if the union is boxed"? One answer is, the runtime could do the indirection. But I think that's putting the cart before the horse. My inference of why we would need the runtime support is to keep the But I think there are other questions in this area. The question of whether or not we should support value sum types, even if they are second class, like F#, is still relevant. But I also think generic definitions aren't obvious in this system. Consider
This seems perfectly reasonable as a sum type. On the other hand, consider:
This is still OK, but starts to raise some questions. For instance, the declaration form is now mandatory. There's no place on the anonymous form ( But then, it's worth noting that the sum type form doesn't have such a symmetry either -- it always requires the declaration form and has no anonymous form. There's also some weirdness about the
Is In both of these instances, I think you run into cases where fusing the syntax starts to push towards semantics that either have difficult-to-resolve conflicts, or just outright don't make sense. So once again, this starts to feel a little like tuples. In the anonymous form you have simple expressions, but if you want a more complex data type, you use a record.
We probably could do something like that if we went back in time and reworked these from the start, but we just didn't choose that path. If we decide that we like the abstract semantics where sum types can have generic parameters but unions can't, we can have that by separating the declaration forms. And if we want to require sum types to have top-level declarations, but don't want that for unions we can have that as well. And if we want to support value sum types but not value unions, we can do that later. Combining the forms into one super-rich declaration form, and then either ending up with weird semantics, or choosing semantics we wouldn't otherwise like, feels like pressure to unify into a neat implementation package. If we just split the concepts, we can better align with desired semantics and pick the appropriate representation for each of them. |
Moving this here from the other thread.
Thinking out loud.
(1.a.) is the sum of how data gets on to the heap - while I could be wrong, the actual ruleset couldn't be that much more complicated. (1.b.) is an optional opportunistic optimization. (2.a.) This could cause the amount of machine code to explode but, to contradict myself, it would only occur for methods where enum structs are used in different way. (2.b.) A pointer to the start of a variable-length struct. The first word would contain a tag, which would be What really sucks about my idea is that arrays would necessarily contain pointers, I considered this workaround but it doesn't actually work: struct ReadonlyContainer<T> {
public readonly T Value;
}
void Foo() {
var arr = new ReadonlyContainer<MyEnum>[...];
} As for how to represent them in C#, I think that this "accidental" approach using FieldOffset is fraught with surprising behavior. How about something like this: [TaggedStructLayout] // Considered LayoutKind.Tagged, but that would likely break some existing code somewhere
// Only valid on readonly structs
readonly struct MyEnum {
[TaggedStructMember(TaggedStructMemberKind.Tag)]
readonly byte _tag;
[TaggedStructMember(TaggedStructMemberKind.Union)]
readonly int _a;
[TaggedStructMember(TaggedStructMemberKind.Union)]
readonly String _b;
[TaggedStructMember(TaggedStructMemberKind.Union)]
readonly decimal _c;
// Unused by enums, but no reason why this couldn't exist
[TaggedStructMember(TaggedStructMemberKind.Value)]
readonly decimal _e;
// TaggedStructMemberKind.Accessor set by this attr ctor
[TaggedStructMember(0, nameof(_a))]
public int A => _a;
[TaggedStructMember(1, nameof(_b), nameof(_c))]
public (string, decimal) B => (_b, _c);
// Bonus feature for users building these by hand/source-generator:
// Error: `_a` does not appear in the list of tagged fields for this accessor.
[TaggedStructMember(2, nameof(_b))]
public int C => _a;
} That could ship in the compiler today with no runtime support, and the runtime could add support for it later. |
I think that looking at StringValues might be useful to understand a real-world use case for a union.
|
Two years ago, I experimented with building some kind of Rust-like Result and Option types in C#. Actually, based on Beefs types: https://www.beeflang.org/docs/language-guide/errors/ There are limitations in C# syntax that makes my specialized approach cumbersome. And you would want a general concept of a union - which is what we are discussing here. Anyway, I have documented my findings here: https://github.com/marinasundstrom/OptionAndResultConcept/ |
Is there a reason why unions can't be represented in the same way that unions are in languages like Rust or C [1]? That would mean each member would overlap, and the size of the storage would be equal to the size of the largest member. Then ( in the same way as Rust ), a discriminant would be used to determine which member was currently active. This would also allow for similar optimisations to what Rust does, e.g a tagged union
could have the same representation as This could work for both value and reference types, and would avoid unnecessary boxing as:
would essentially be storing two integers, one Anonymous unions could be treated the same way, just without the explicit names for the different cases. |
@needlesslygrim unfortunately, it's not possible at the moment (if ever) because references have to be tracked by GC. |
Ah, that's a shame, but does that mean pure value unions ( i.e. unions composed of only value types ) could still be represented in this way? |
Isn't it preferable to name each of those cases? Or the intention is to literally enable unions of primitives with implicit conversion etc? That is certainly useful in JS but maybe |
I think a syntax that's more-or-less identical to tuples [given a parenthesis-based approach] would be ideal The following is valid for tuples (string AgeAsString, int Age) = ...; // Named tuple (much more useful than un-named)
(string, int) = ...; // un-named tuples (`Item1` etc..) The following should be valid for unions: (string AgeAsString | int Age) x = ...; // Named union
(string | int) x = ...; // Un-named union
if (x is string) {...} That's not to say that this should be the only syntax, but I think this consistency is an upside. I don't see how to use a union of different cases with the same type with this syntax however, so.. ye.. not ideal. |
They could be. |
How could you pass |
Great feature! What do you think about using existing keywords, Then, from user's point of view, it will become "allowed" to have nested subtypes of a sealed type. Look at Kotlin examples, they seem intuitive and clean in my opinion: So in C# it could look like public sealed record Result
{
public record Success(Outcome outcome): Result()
public record Failure(Exception e): Result()
}
public string Render(Result result) => result switch
{
Success s => s.outcome.Items.JoinToString(", "),
Failure f => $"Operation failed: {f.e}",
// no _ => throw needed, all cases are handled!
} |
@PiN73 what if I want to reuse the same The approach you describe forces the child to be exclusive to that named union, which is more restrictive. |
This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Simple encoding of unions in C#
Unions are a feature where the runtime representation matters a lot to the trade-off between expressiveness and performance. Here's a simple way to encode type unions that are very expressive, and tagged unions that are type unions instead of an alternative to them. This approach trades off performance compared to others, notably around allocation, in exchange for simplicity and expressiveness.
It might just be the right approach for us! But regardless, it may serve as a useful endpoint of the spectrum, and we can use it to measure any comparative loss of expressiveness, elegance and unification incurred by lower-allocation alternatives.
I won't dwell on syntax. I'll adopt some suggestive syntax, but not discuss it. The focus is on the underlying structures. I lean heavily on the great work of the Discriminated Unions Working Group.
Type unions
A "type union" corresponds to a union of a set of types, as in
union (string | int | bool)
. A value belongs to the union if and only if it belongs to (at least) one of the constituent types - in this casestring
,int
orbool
.Type unions can take the form of anonymous unions in e.g. method signatures, but can also grow up to become declared unions when circumstances make it preferable, e.g. because of verbosity, repeated use or the need to add additional members - similar to how tuples can grow up to be declared as struct records.
A declared union type could look something like:
Declared unions can't have additional state, but function members can operate on the value of
this
itself.Runtime representation
A union type lowers to a struct type which wraps the single
object
field__value
containing the value of the union. The struct declaration will contain some mix of members and metadata to tell the compiler what the constituent types are, and enable it to implement union-specific semantics. In the case of declared unions, any declared function members are lowered to make use of the underlying__value
instead ofthis
. E.g.:The exact format of the metadata is TBD - the
[Union]
attribute is just for illustrative purposes.Anonymous unions lower to a declaration with a compiler-generated name (say
__Union1
), and the compiler is free to reuse declarations for equivalent union types across the same assembly.If any of the constituent types are value types, those values will be boxed into the
__value
field. There is no need to separately record the type of the contents of__value
.If all the constituent types of a union have a shared base class, we might be able to use that instead of
object
as the type of__value
.This simple layout means that all unions have the same representation in memory, so that they are easily interchangeable, and so that generated types for anonymous unions are optimally shareable.
Also by being a single reference field they are small enough to behave atomically and not be subject to "tearing"; i.e., inconsistent state caused by multiple threads mutating simultaneously.
Finally, by using a reference field for the value, declared union types can be recursive, containing other instances of the same type.
Type testing
The usual mode of consumption for a type union is through a type test, e.g. a type pattern, to see what type its value actually has.
The generated code for this is trivial; just redirect the type test to the underlying
__value
:The runtime already knows how to test the type of the underlying member. No need to invent a new encoding.
Implicit conversions
This runtime representation allows us to offer extensive structural conversions between unions, whether declared or anonymous.
Where two union types
U1
andU2
are deemed equivalent by the compiler, there is an implicit conversion between them in both directions. (This isn't quite an identity conversion, because unfortunately it won't apply to arrays or generic types over union types).Where
U1
is deemed a subtype ofU2
by the compiler, an implicit conversion exists fromU1
toU2
.In addition, an implicit conversion exists from each constituent type to the union type, and from the union type to any shared base type of its constituent types.
The runtime implementation of these implicit conversions is trivial, since it can just wrap the
__value
of the source in the target type:The declaration of
u2
might in fact lower like this instead:Since the compiler would have realized that
union (C | D)
andunion (D | C)
are equivalent anonymous unions and might have reused the same lowered declaration for both.The declaration of
b
might look like this instead:Since the compiler would have realized that
B
is a shared base class ofC
andD
, and might therefore have declared the__value
field of__Union1
to be of typeB
instead ofobject
, thus obviating the explicit cast.Tagged unions
The general idea of tagged unions (or "discriminated unions") is that each "variant" or "case" of a union has a tag, as well as some contents (zero, one or more values of given types). A given tag can then be matched with a pattern that is syntactically identical to a type pattern:
What if this isn't like a type pattern, but is a type pattern? Consider the following declarations:
Assuming
result
is of typeResult
, this makes the above switch expression work exactly as expected!The proposal then is that we unify tagged unions into type unions by saying that a tag is simply a named wrapper type for zero, one or more values. You can deconstruct them immediately upon matching, but you don't have to. As types they are already an existing concept and first class citizens in the language. No new machinery or semantics needed.
Of course, a union type would be free to mix tagged and untagged constituent types, and it might well make sense to do so.
This is not to say that there shouldn't be convenient syntax directly supporting tagged unions, though perhaps it, too, should be more integrated into the syntax of type unions. The main point is that, whatever the syntax, tags, semantically speaking, are just types.
Tag type declarations could certainly be nested within type union declarations, so that they're proprietary to the given union type and don't "pollute" the top-level type space. If there's a dedicated syntax for tagged unions, maybe it generates the tag types nested inside of the union type by default. This might be paired with a feature like #2926 to keep the pattern matching against nested types elegant.
Performance
The main performance downside of this approach to unions is the frequent need for allocation. Value types need to be boxed, and "tag types" would also need to be allocated to go into the reference field of a union type. This seems particularly glaring when talking about a small union of a few small value types.
There are probably some mitigations possible without deviating from the format. For instance, common boxed values (
true
,false
,-1
,0
,1
,2
,...) could be cached and reused. But fundamentally, most value types would be boxed most of the time.Non-allocating alternatives
On the other hand, non-allocating alternatives have performance downsides too. Trying to represent a union value as a struct that has a field for each possible type gets very space inefficient. While some packing is probably possible, it is complicated, as it must respect the runtime's need to keep references and non-reference data separate. So unions of many and/or large value types could become quite wasteful (lots of zeros in memory!), expensive to copy around and complicated to pack and unpack to underlying values.
On top of that, every union type would have a unique layout. Even if two of them are structurally related, any interop between them, such as implicit conversion or equality comparison, would be quite complicated and expensive, or maybe outright prohibited.
Recursive types
We should note that F# represents unions as classes by default. While different in detail from what's proposed here, it leads to a similar amount of allocation. In F#, discriminated unions are used - indeed usually preferred - as an alternative to class hierarchies to describe your data model. While you can explicitly ask for struct unions in F#, those are quite limited in comparison, partly because they cannot be recursive.
If you're doing serious data modeling, chances are you're going to need recursive types - where a value can directly or indirectly contain another value of its same type. In order to represent that, you need something to be a reference type: Value types cannot contain themselves! If union types don't support that directly, you're going to have to go out of your way to explicitly introduce reference types elsewhere. Your code gets more complicated, and the end tally of allocations will end up much the same.
Middle ground?
It's possible that we could pick another struct representation that is still uniform across all union types while avoiding a large portion of allocations. For instance, we could standardize on a scheme similar to https://github.com/JeremyKuhne/ValuePrototype, which in addition to a reference field keeps a certain number of bits on the side for storing small enough values without boxing. Perhaps we could expand the scheme to also efficiently represent "tag types" without extra allocation in many cases.
Compared to the single reference approach, such a scheme would make all union values bigger (perhaps twice as big), taking up more inline space and costing more to copy, but at least not in a way that grows proportionally to the number of constituent types involved. In addition, they would no longer be safe against tearing. But the approach would maintain the all-important property that the same value is represented by the same bits, regardless of which union type it is stored in. As such, it could still straightforwardly support many of the semantic benefits.
Conclusion
I think the described approach compares favorably to others in terms of expressiveness and inherent simplicity. As such it will be useful to measure against as we explore other approaches with different performance characteristics.
The text was updated successfully, but these errors were encountered: