Extend System.Guid with a new creation API for v7 #103658
Comments
@tannergooding shouldn't it be cased On another note, what about having the method take an enum parameter specifying the version (instead of having all the Could then have something like: var guid = Guid.NewGuid(GuidVersion.Version7); Which would look much cleaner with: of course: Guid guid = NewGuid(Version7); |
I'll let API review decide if I got it "wrong" or not. I prefer the look of
This doesn't work because there are unique parameters/overloads per version. i.e. |
Hello @tannergooding, a few questions:
var t1 = DateTimeOffset.FromUnixTimeMilliseconds(0x010203040506).DateTime; // 11/02/2005 20:02:37
var t2 = DateTimeOffset.FromUnixTimeMilliseconds(0x020203040505).DateTime; // 16/12/2039 15:56:25
var g1 = Create(t1);
var g2 = Create(t2);
Console.WriteLine(g1); // 03040506-0102-0000-0000-000000000000
Console.WriteLine(g2); // 03040505-0202-0000-0000-000000000000
Console.WriteLine(g1 < g2); // False
Console.ReadKey();
static Guid Create(DateTime dateTime)
{
var dto = new DateTimeOffset(dateTime); // handle Local and Unspecified
var ms = (ulong)dto.ToUnixTimeMilliseconds();
ms &= (1ul << 48) - 1; // keep only the 48 bits that unix_ts_ms occupies
Span<byte> data = stackalloc byte[16];
BinaryPrimitives.WriteUInt64LittleEndian(data, ms);
return new Guid(data);
} |
That's largely an implementation detail and is not really relevant to the API proposal.
There is no such thing as "proper" sorting/comparison, that is there is no formal definition of how to compare UUIDs. The way .NET does it is to treat it effectively as an unsigned 128-bit integer represented in hex form. The output string is already in big endian format. The algorithm you've used to create the
The internal layout used by
Given a 48-bit Unix Timestamp of
_a = (int)(msSinceUnixEpoch >> 16); // store the most significant 4-bytes of the timestamp
_b = (short)(msSinceUnixEpoch); // store the least significant 2-bytes of the timestamp
// Fill c, d, e, f, g, h, i, j, and k with random bits
_c = (short)((_c & ~0xF000) | 0x7000); // Set the version to 7
_d = (byte)((_d & ~0xC0) | 0x80); // Set the variant to 2
This ensures that the Unix timestamps are already effectively sorted. The exception is for the random data that exists for two UUIDs created in the same "Unix tick". The spec allows for these random bits to contain better structured data, however that is extended functionality and can be provided at a later point in time if/when there is enough ask for it. I'd guess that we'd likely do that via an additional overload that looks something like
It can be discussed in API review, but isn't really necessary to get the core functionality supported. It is also likely a far stretch from what the typical use-case would be. The actual implementation is likely going to be effectively: |
Thanks for the explanation, you will perform some additional operation to make string, sorting and binary big-endian representation consistent.
var t1 = DateTimeOffset.FromUnixTimeMilliseconds(0x010203040506).DateTime; // 11/02/2005 20:02:37
var t2 = DateTimeOffset.FromUnixTimeMilliseconds(0x020203040505).DateTime; // 16/12/2039 15:56:25
var g1 = Create(t1);
var g2 = Create(t2);
Console.WriteLine(g1); // 01020304-0506-7000-8000-000000000000
Console.WriteLine(g2); // 02020304-0505-7000-8000-000000000000
Console.WriteLine(g1 < g2); // true
Console.WriteLine(Convert.ToHexString(g1.ToByteArray(true))); // 01020304_0506_7000_8000000000000000
Console.WriteLine(Convert.ToHexString(g2.ToByteArray(true))); // 02020304_0505_7000_8000000000000000
Console.ReadKey();
static Guid Create(DateTime dateTime)
{
var dto = new DateTimeOffset(dateTime); // handle Local and Unspecified
var msSinceUnixEpoch = (ulong)dto.ToUnixTimeMilliseconds();
var a = (int)(msSinceUnixEpoch >> 16); // store the most significant 4-bytes of the timestamp
var b = unchecked((short)(msSinceUnixEpoch)); // store the least significant 2-bytes of the timestamp
var c = (short)0x7000; // Set the version to 7
var d = (byte)0x80; // Set the variant to 2
return new Guid(a, b, c, d, 0, 0, 0, 0, 0, 0, 0);
} |
There's not really anything "additional" to do here.
Which is to say, the underlying storage format doesn't matter except for when it applies to serialization/deserialization. The actual value stored is what matters and is what is used in the context of doing operations such as You can use any storage format you'd like, provided that the underlying operations understand how to interpret it as the actual value. Different platforms then use different formats typically based on what is the most efficient or convenient (hence why most CPUs natively use little-endian and why networking typically use big-endian). .NET happens to use a format for |
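To make the distinction between the value and its serialized form concrete, here is a minimal sketch. It assumes .NET 8 or later, where `Guid.ToByteArray(bool bigEndian)` and the `Guid(ReadOnlySpan<byte>, bool bigEndian)` constructor are available; the sample value is the one from the example above and the variable names are only illustrative.

```csharp
using System;

Guid value = Guid.Parse("01020304-0506-7000-8000-000000000000");

// Default layout: the first three fields (a, b, c) are written little-endian.
byte[] mixedEndian = value.ToByteArray();
// Big-endian layout: bytes appear in the same order as the hex string.
byte[] networkOrder = value.ToByteArray(bigEndian: true);

Console.WriteLine(Convert.ToHexString(mixedEndian));  // 04030201060500708000000000000000
Console.WriteLine(Convert.ToHexString(networkOrder)); // 01020304050670008000000000000000

// Reading the bytes back with the matching flag restores the same value.
Guid fromNetworkOrder = new Guid(networkOrder, bigEndian: true);
Guid fromMixedEndian = new Guid(mixedEndian);
Console.WriteLine(fromNetworkOrder == value && fromMixedEndian == value); // True
```

The value compares equal either way; only the byte sequence handed to a store differs, which is why the writer and the reader have to agree on the flag.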
|
There is no such thing as Guidv7, v2, v8 or any other v-something. There's
Since there's no And now let's look at how this (doesn't) work in the real world. RFC 9562, 6.13. DBMS and Database Considerations
Using a specialized data type provided by a specific RDBMS in conjunction with a particular way of generating Uuid can increase data access speed. This is precisely the reason for Uuidv7's existence. Uuidv7 stores the number of milliseconds that have passed since the start of Unix time in the first 48 bits ( An important point is that the time-based part is stored in big-endian. Since RDBMS typically indexes binary data from left to right, this method of generation ensures monotonically increasing values, just like an integer counter. This allows for maintaining low levels of index fragmentation, fast search, and constant insertion time.
And now, having finished with the introductory part, let's dive into the peculiarities of how popular RDBMS and their .NET drivers work with Welcome to hell.
Uuidv7
Let's write a simple function for generating Uuidv7, which will use the fields
static string GenerateUuidV7()
{
Span<byte> uuidv7 = stackalloc byte[16];
ulong unixTimeTicks = (ulong)DateTimeOffset.UtcNow.Subtract(DateTimeOffset.UnixEpoch).Ticks;
ulong unixTsMs = (unixTimeTicks & 0x0FFFFFFFFFFFF000) << 4;
ulong unixTsMsVer = unixTsMs | 0b0111UL << 12;
ulong randA = unixTimeTicks & 0x0000000000000FFF;
// merge "unix_ts_ms", "ver" and "rand_a"
ulong hi = unixTsMsVer | randA;
BinaryPrimitives.WriteUInt64BigEndian(uuidv7, hi);
// fill "rand_b" and "var"
RandomNumberGenerator.Fill(uuidv7[8..]);
// set "var"
byte varOctet = uuidv7[8];
varOctet = (byte)(varOctet & 0b00111111);
varOctet = (byte)(varOctet | 0b10000000); // set only the variant bits "10"; the six random bits stay intact
uuidv7[8] = varOctet;
return Convert.ToHexString(uuidv7);
}
The hexadecimal representation is used here as the base because for Guid, only the string representation is considered valid.
PostgreSQL
God bless the developers of PostgreSQL and Npgsql.
MySQL
It can only use Okay, let's test how it works in practice.
docker run --name mysql -e MYSQL_ROOT_PASSWORD=root -p 3306:3306 --cpus=2 --memory=1G --rm -it mysql:latest
Somehow connect to the server and create a database and its schema there.
CREATE DATABASE `dotnet`;
And after selecting our newly created database:
CREATE TABLE `uuids`
(
`uuid` BINARY(16) NOT NULL PRIMARY KEY,
`order` BIGINT NOT NULL
);
Let's write a simple program that generates a Uuidv7, inserts its value along with an ordinal number (for validation of sorting). This happens because the MySQL driver has a connection string parameter with a default value of If we set the values to However, if you execute
SELECT * FROM uuids ORDER BY uuid ASC;
You will notice that for I would like to remind you that we are passing absolutely correct Uuidv7 (according to the specification) values as parameters, using
Okay, let's generate Uuidv7 and insert them into the database, using various GUID Format values, and construct graphs for visualizing the process. After all, everyone loves graphs. For this, I wrote a pair of small programs:
And here are the results by insertion time: And here is the deviation: The denser the points are to the left part and the higher the values there, the more such values resemble a monotonically increasing sequence. Notably, when using I ran BenchmarkDotNet and saw the following picture. BenchmarkDotNet v0.13.12, Windows 11 (10.0.22631.3737/23H2/2023Update/SunValley3)
AMD Ryzen 9 7950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK 8.0.302
[Host] : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job-FTFTEX : .NET 8.0.6 (8.0.624.26715), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Server=True
| Method | Mean | Error | StdDev | Gen0 | Allocated |
|--------------- |---------:|---------:|---------:|-------:|----------:|
| GenerateUuidV7 | 63.96 ns | 0.654 ns | 0.612 ns | 0.0001 | 88 B |
This means that sometimes 2 Uuids are generated within a 100-nanosecond interval (one Tick), and the random part of the second one outpaces the first. Thus, However, such visualization does not allow for a clear assessment of the situation for It is seen that when using In the case of
Microsoft SQL Server
The final boss of the next Doom will be the Let's start by generating a Uuidv7 with a correct text representation in the form of We get the following graph: We see a significant slowdown in insertion over time. We observed a similar pattern with MySQL. And here is the deviation: Let's check the state of our index:
SELECT * FROM sys.dm_db_index_physical_stats(db_id('dotnet'), object_id('uuids'), NULL, NULL, NULL);
and we see that From here we dive into obscure technologies.
But this is insufficient. This is where poor engineering decisions come into play, in the form of Guid's internal layout. Now we multiply obscure technologies by poor engineering decisions. This is done as follows (the code is written as simply as possible so that anyone can understand what's going on):
string ReorderUuid(string uuid)
{
var src = Convert.FromHexString(uuid);
var dst = new byte[16];
// reorder for SQL SERVER Sort order
dst[0] = src[12];
dst[1] = src[13];
dst[2] = src[14];
dst[3] = src[15];
dst[4] = src[10];
dst[5] = src[11];
dst[6] = src[8];
dst[7] = src[9];
dst[8] = src[6];
dst[9] = src[7];
dst[10] = src[0];
dst[11] = src[1];
dst[12] = src[2];
dst[13] = src[3];
dst[14] = src[4];
dst[15] = src[5];
// reorder for guid internal layout
var tmp0 = dst[0];
var tmp1 = dst[1];
var tmp2 = dst[2];
var tmp3 = dst[3];
dst[0] = tmp3;
dst[1] = tmp2;
dst[2] = tmp1;
dst[3] = tmp0;
var tmp4 = dst[4];
var tmp5 = dst[5];
dst[4] = tmp5;
dst[5] = tmp4;
var tmp6 = dst[6];
var tmp7 = dst[7];
dst[6] = tmp7;
dst[7] = tmp6;
return Convert.ToHexString(dst);
}
And if we construct a Guid in the following way We get the following picture for insertion time: And for deviation: The maximum deviation is 1 (the reason is the same - generating 2 values in 1 Tick).
BINARY(16)
Despite the specification explicitly requiring the use of a specialized data type for storing UUIDs in the database, not everyone in the real world follows these recommendations. Or simply, a project could have started when there were neither recommendations nor .NET (let alone Core, even Framework). And databases may simply not have had a specialized type for working with UUIDs. This is where Explicit conversion to a byte array on the calling side is required, or the parameter value from In case of Only such combination of APIs will ensure a correct roundtrip of values and a monotonically increasing sequence of values at the database level.
Summary
In the case of a Guid as a container for Uuidv7 and using a specialized type on the database side:
In the case of a Guid as a container for Uuidv7 and using
Do we need this?
@tannergooding, given everything listed above, I have a question: whose problems will this API solve, and how exactly? At the moment, it appears to be a feature only for PostgreSQL and MySQL users (under certain conditions). IMHO: it should not be added. It's a minefield. This will definitely do more harm than good. |
The multitude of users, both internal and external, that have asked for such an API to exist and which are currently using 3rd party packages that are doing effectively what this API will be providing. Since last year, #88290 was opened explicitly asking for this on There are a multitude of NuGet packages that provide this as well, most of which (particularly I am not interested in getting to another elongated discussion around the pros vs cons of using GUIDs are not a Microsoft specific concept, they are an alternative name for UUIDs and that is explicitly covered in the official RFC 9562 in the very first sentence
The RFC further discusses layout and how a GUID's underlying value is represented, and that this is distinct from the concept of saving it to a binary format. The basic quote is as below, but I have given consistent, in-depth analysis and additional citations as replies in other threads where this has been asked:
Citing worst case performance characteristics is likewise not the correct basis for deciding whether or not a feature is suitable. We do not provide worst case or naive implementations of core APIs. We provide optimized routines that efficiently handle the data and which try to do a lesser number of operations where possible. Serializing to a |
This is a good example because this library actually creates "GUID-like" UUIDv7 by accepting the database where it would be used: Why is that? It's because GUID fails to represent UUID in a single format, and each database needs its own representation of UUIDv7 within a GUID to function correctly. |
This is not true. Guid is a structure that appeared for COM/OLE. In its original form, with its layout and all subsequent advantages and disadvantages, it was invented by Microsoft and is used in the Microsoft technology stack. Literally all other languages use Uuid, which from the perspective of the public API is either 16 bytes in big endian, or a string, the format of which is defined in the specification.
I am 100% sure that the .NET runtime team can implement the fastest and most optimized Uuidv7 generation algorithm in the world. But I did not raise the issue of generation performance. I highlighted the topic of what happens to such a Uuid AFTER generation. When it is used as an identifier in a database that will live there for 50 years. What happens next - when it starts living its own life? Uuidv7 is needed in order to NOT fragment indexes in databases. This is why it exists. And from the perspective of PostgreSQL or MySQL (under certain conditions), a Uuidv7, packed into a Guid, will be written to the database in such a way that it won't cause index fragmentation. In case of using Microsoft SQL Server and writing such a Uuidv7, packed into a Guid, into a column of type When changing the connection string in MySQL, data begins to be written to the database in a not optimal way. This leads to the loss of all benefits from such a generation algorithm. I would characterize the benefit of such an algorithm for MySQL as positive under a certain driver configuration. For PostgreSQL, everything is fine. So we have a situation, where the proposed API will generate Uuidv7, which when written to databases will have the following characteristics:
As @ImoutoChan rightly noted, For the API to generate Uuidv7, packed into a Guid to make sense, it is necessary to specify for which database it is generated. But this is not something that can be added to the BCL. So if the Microsoft platform can't make an algorithm that would work equally well for both the Microsoft-developed database and all other databases - then maybe it's not worth adding such a method at all and leave the implementations to the community? It seems to have been doing pretty well all these years. |
A UUID of This is no different than the value Binary serialization and deserialization is fully independent of the value represented at runtime in the named type. If you have a destination that requires the data to be stored in a particular binary format, you should explicitly use the APIs which ensure that the data is serialized as that format and ensure the inverse APIs are used when loading from that format. The underlying storage format used by the type at runtime is fully independent of the value represented by the UUID. It is not safe to rely on and is not something that is observable to the user outside of At this point, you appear to be explicitly ignoring how the code actually works, the considerations that actually exist, and what the official UUID specification actually calls out. That is not productive, it does not assist anyone, and it is blatantly pushing the discussion in a direction that makes it appear as though this is not a viable solution when there is no actual difference in how It is purely a consideration of serialization. |
Okay. Let's imagine a situation where such an API was added. Let's go through a User Story. I, as an ordinary .NET developer, install the .NET 9 SDK, see that a new method And my friend, who uses this API for generating Guids and writing them to PostgreSQL, where values are stored in a column of the At the same time, we both use specialized data types (as required by the specification) and pass the Guid as a query parameter without any preliminary conversion.
await using var cmd = connection.CreateCommand();
cmd.CommandText = "INSERT INTO someTable (id, payload) VALUES (@id, @payload);";
cmd.Parameters.AddWithValue("id", Guid.NewGuidv7());
cmd.Parameters.AddWithValue("payload", payload);
await cmd.ExecuteNonQueryAsync();
So it turns out that when I use Microsoft technology (.NET) with a recently added API in combination with a database driver developed by Microsoft and an RDBMS developed by Microsoft - I don't get the absence of fragmentation. But when I use an OpenSource database (PostgreSQL) with an OpenSource driver (Npgsql) for this database, I do get it. |
The problem would be caused by failure to properly serialize the data as big endian and therefore not storing the UUIDv7 that was generated, but rather a different GUID instead It is not an issue with the |
You seem like you're not reading what I'm writing. |
If I had a separate data type for
dst[0] = src[12];
dst[1] = src[13];
dst[2] = src[14];
dst[3] = src[15];
dst[4] = src[10];
dst[5] = src[11];
dst[6] = src[8];
dst[7] = src[9];
dst[8] = src[6];
dst[9] = src[7];
dst[10] = src[0];
dst[11] = src[1];
dst[12] = src[2];
dst[13] = src[3];
dst[14] = src[4];
dst[15] = src[5];
at the driver level. Likewise, the MySQL and PostgreSQL driver teams could easily adapt such a type. |
The same statement holds true in the inverse, I had merely misunderstood which of the two had the problem. It is fundamentally the fault of the developer for not serializing/deserializing in the format expected by the database. If the database expects little endian format, you must use The actual underlying storage format used by the |
The problem isn't about the binary representation. It's about how the existing ecosystem of RDBMS drivers works with the Guid data type. This already exists and is already "somehow" working. And you can't change it without breaking a huge amount of code. Due to how it works with Guid - different database drivers require different workarounds:
Therefore, the presence of such an API might create a misconception in the minds of developers. With the current proposed implementation, this will be a feature for PostgreSQL and MySQL (remember about the parameter). But those who use Microsoft SQL Server need to know that their database and its driver require mandatory reshuffling to get a "Uuidv7 that doesn't ruin indexes" (remember it needs 2 reshuffles, one to compensate for ordering at the database level, another to compensate for using the COM-intended Guid API at the database driver level. Yes, they can be collapsed into one, but anyway). Because the Guid that the new API will generate can't be provided to the driver in its unchanged form - there will be index fragmentation. |
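As noted above, the two reshuffles can be collapsed into one. A minimal sketch of that, obtained purely by composing the two steps of the `ReorderUuid` function shown earlier (the name `ReorderUuidOnePass` and the index table are illustrative, not part of any driver or of the BCL):

```csharp
using System;

// Composition of the two steps in ReorderUuid: output[i] = input[combinedOrder[i]].
int[] combinedOrder = { 15, 14, 13, 12, 11, 10, 9, 8, 6, 7, 0, 1, 2, 3, 4, 5 };

string ReorderUuidOnePass(string uuid)
{
    byte[] src = Convert.FromHexString(uuid);
    byte[] dst = new byte[16];
    for (int i = 0; i < dst.Length; i++)
    {
        dst[i] = src[combinedOrder[i]];
    }
    return Convert.ToHexString(dst);
}

// Each input byte equals its own index, so the output shows the permutation directly.
Console.WriteLine(ReorderUuidOnePass("000102030405060708090A0B0C0D0E0F"));
// 0F0E0D0C0B0A09080607000102030405
```

This only removes the intermediate copy; whether the resulting order suits a given database and driver still depends on the behavior described in the earlier comment.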
I expect synergy between Microsoft products. With the current state of affairs, there will be none. If we go down the path of problem-solving, there are two ways. Either introduce a new data type or fix the Microsoft SQL Server driver. We will put aside the first option for reasons we both know. The second can be implemented in two ways - either through a breaking change in the driver, or a feature toggle. A breaking change is not an option, so let's consider the second one. It can be implemented in different ways - through environment variables or connection string parameters. And it seems that in this case, it becomes a problem of the database driver. But if you add the proposed API BEFORE the Microsoft SQL Server driver has support for "alternative Guid handling mode", it turns out that you will roll out a feature that does not synergize with your own database. As a developer, I expect that such a line of code: cmd.Parameters.AddWithValue("id", Guid.NewGuidv7()); will generate identifiers for me that, when written to a database into a column of the Therefore, I suggest that you, as the author, create an issue in the Microsoft SQL Server driver repository, and discuss the possibility of adding support for "native Uuid" at the database driver level through some alternative mode toggle mechanism, or in some other way. |
Developers already can create a valid This API proposal changes nothing with that regard, it simply gives developers an easier way to generate a v7 UUID that will serialize as expected if stored using
.NET is producing correct UUIDs that serialize as expected when using the relevant APIs such as If there is a scenario that you believe is not covered by the downstream component, then you as the interested party should be the one to file the feature request and to correctly articulate the problem space you believe exists and to optionally provide input as to how you believe it should be resolved. |
@vanbukin Have you filed any issues for your complaints about the Microsoft.Data.SqlClient performance? I regularly contribute performance improvements but if no-one tells me about them how would I know what needs improving to help your codebase? |
@Wraith2 There's no problem with the driver's performance itself. The driver simply takes Guid as input, does something with it, and forms bytes, which are sent to Microsoft SQL Server over the TDS protocol. The problem is that when a Guidv7 formed through the proposed API gets into a database in a column of the To make
The irony of the situation is that Microsoft develops both .NET itself, the proposed API, the driver, and the database. cmd.Parameters.AddWithValue("id", Guid.NewGuidv7()); will write data, the index of which will be completely fragmented right from the start. |
That's right. And the BCL can only provide one single way to do this, in order to meet the RFC. Yet, this option will not work with your own products right now. It's absurd.
The driver developed by Microsoft for its own database is not doing this right now. And it doesn't even have any options to change this behavior. You propose to create an API that won't work as expected. The very generation of Uuidv7, without regard to how it will be indexed by the database, is devoid of meaning. Because Uuidv7 was created to be optimized when inserted into the database. It doesn't matter who's to blame - the API of Guid, the driver developers, the creators of the TDS protocol, or the database developers. As a consumer, what matters to me is that it doesn't work as it should. And this will only happen when I use the combination of Microsoft .NET together with Microsoft SQL Server. However, if I take PostgreSQL and its driver, I will have no problems. So what am I paying for? |
cmd.Parameters.AddWithValue("id", Guid.NewGuidv7()); This line of code will give me proper indexes in PostgreSQL. I have to write my own function to rearrange the internals of the Guid and call it before each insertion. Yes, you will be following the RFC. But there is zero synergy between your products. |
Hi @tannergooding, thanks for sharing so much juicy info. Once all the details are settled (are they already?), I think a blog post on devblogs with a recap of all of this would be great! |
@tannergooding Really looking forward to using this in November! Thanks for your work on this. I'm curious what your thoughts are regarding Ulid, which implements the monotonic random counter as per the Ulid spec. The data structure between UUIDv7 and Ulid seems identical apart from a few bits reserved for the version. I also wonder what your thoughts are when it comes to involving MAC addresses in the generation of these types of unique identifiers, like MassTransit does with their NewId. |
@jscarle Here are some thoughts on the use of MAC addresses. A MAC address takes 48 bits. This eats off so much of the random portion that you lose the unpredictable property - IDs can become too guessable, especially once you have observed one single ID. You'd also want to be very sure that a MAC address (or especially if you'd opt to use a partial MAC address) is unique, because as soon as you get a "collision" on those, you have two instances that continue to have a much increased chance of generating colliding IDs for their entire lifetime. As with UUIDv4, the very large random portion of UUIDv7 is supposed to make the use of any "instance identifier" unnecessary. |
As per https://datatracker.ietf.org/doc/rfc9562/
At a glance the only difference between it and v7 is that the 6 bits otherwise required for the variant/version are no longer reserved and so there doesn't appear to be a significant benefit to it and using the more broadly standardized thing likely outweighs any minor benefit you might otherwise get from the 6 extra bits of randomness. -- UUIDv7 then of course also has the option to allow non-random data, instead of random data if the domain needs it which is not something ULID appears to allow. |
@glen-84 I'll post some results at the bottom, but here is the approach to calculate this:
Those are some fine numbers. There's still one issue. If we generate a handful of invoice lines for an invoice and insert them into a database, they'll have the same millisecond component followed by a random component, and end up getting shuffled! That's arguably astonishing. One way to solve it is to use smaller pseudorandom increments to the previous ID, for example by a 57-bit number. The calculations get very interesting with this approach:
I suspect that both calculations offer some level of comfort. 😊 |
Given that this is closed, this may not be the right place, but... Is it possible to parse the timestamp back from a ULID? in Guid v7? Will there be a |
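Independently of whether a dedicated API exists for this, the timestamp can be read back by hand, because the first 48 bits of a v7 value are the big-endian Unix millisecond count. A minimal sketch, assuming .NET 8 or later for `Guid.TryWriteBytes(Span<byte>, bool bigEndian, out int)`; the helper name `ReadV7Timestamp` is hypothetical:

```csharp
using System;

static DateTimeOffset ReadV7Timestamp(Guid value)
{
    Span<byte> bytes = stackalloc byte[16];
    value.TryWriteBytes(bytes, bigEndian: true, out _);

    // The version lives in the high nibble of byte 6.
    if ((bytes[6] >> 4) != 7)
    {
        throw new ArgumentException("Not a version 7 UUID.", nameof(value));
    }

    // Bytes 0..5 hold unix_ts_ms, most significant byte first.
    long unixMs = 0;
    for (int i = 0; i < 6; i++)
    {
        unixMs = (unixMs << 8) | bytes[i];
    }
    return DateTimeOffset.FromUnixTimeMilliseconds(unixMs);
}

Guid sample = Guid.Parse("01020304-0506-7000-8000-000000000000");
Console.WriteLine(ReadV7Timestamp(sample).ToUnixTimeMilliseconds().ToString("X12")); // 010203040506
```

The recovered value is only as precise as the generator made it: milliseconds, with no sub-millisecond information unless the generator stored some in rand_a.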
Just to confirm, have @Timovzl’s considerations been taken into account? I believe he mentioned some crucial V7 features necessary for actual |
AFAIK there are no such v7 features for distributed systems. For distributed systems, the best choice (to prevent collisions) is version 4 (all random). v7 is about getting better order/locality by replacing some of the random data with time, and this will increase the chance of collisions in distributed systems; there is nothing to be done about that. Pick your poison :-) The .NET v7 implementation, AFAIK, uses a simple approach: it takes a v4 Guid, replaces the first 6 bytes with the current Unix epoch milliseconds, and sets the version to 7. Plain and simple. It does not use rand_a to create a local sub-sequence of time, it does not try to be monotonic, and it does not make sure that a newly generated v7 Guid is "greater" than the previously generated one. But all these things are about improving local monotonicity. For distributed systems, it is best to have as much random data as possible (use rand_a for random data). Long story short: the current .NET implementation is the best for distributed systems, but it is the worst for local monotonicity. We cannot have both and need to choose, and IMO .NET made the right choice (although it may change in the future). My 5 cents. |
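The approach described in this comment can be sketched in a few lines. This is an illustration of the description only, not the actual BCL source; it assumes .NET 8 or later for the big-endian `Guid` overloads, and `CreateV7Sketch` is a hypothetical name.

```csharp
using System;

static Guid CreateV7Sketch(DateTimeOffset timestamp)
{
    Span<byte> bytes = stackalloc byte[16];
    // Start from a random v4 value, serialized in big-endian (string) order.
    Guid.NewGuid().TryWriteBytes(bytes, bigEndian: true, out _);

    // Overwrite the first 6 bytes with the 48-bit Unix millisecond timestamp.
    long unixMs = timestamp.ToUnixTimeMilliseconds();
    for (int i = 0; i < 6; i++)
    {
        bytes[i] = (byte)(unixMs >> (40 - 8 * i));
    }

    bytes[6] = (byte)((bytes[6] & 0x0F) | 0x70); // version 7, rand_a bits stay random
    bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80); // variant "10" (v4 already has it)
    return new Guid(bytes, bigEndian: true);
}

Guid earlier = CreateV7Sketch(DateTimeOffset.FromUnixTimeMilliseconds(0x010203040506));
Guid later = CreateV7Sketch(DateTimeOffset.FromUnixTimeMilliseconds(0x010203040507));
Console.WriteLine(earlier);                      // starts with 01020304-0506-7
Console.WriteLine(earlier.CompareTo(later) < 0); // True
```

Two values created with the same millisecond differ only in random bits, so their relative order is arbitrary, which is the batch-ordering concern raised in the following comments.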
That was bad wording on my part - I stand corrected. I'm still curious if @Timovzl's feedback was considered all the same (reference: #103658 (comment)).
That is why I'm not sure about the use of this new implementation in an actual system (let's leave out the distributed part). @Timovzl says:
As it stands, unless the above scenario is considered, I for one would have no use for this new implementation - at least, none come to mind right now but I'm open to suggestions. |
Monotonicity can be solved locally in the node, but then it would need to be implemented by the OS/kernel itself, as Guid v1/v6 generators typically are. The .NET implementation could attempt to make it monotonic, though it would only be monotonic locally in that application/process. And even if monotonic within a node, distributed monotonicity cannot be guaranteed. For batch generation, the worst-case order of .NET v7 Guids is the same as for v4 Guids. But for all other uses, locally and distributed, .NET v7 Guids will be ordered as they were generated (good for locality/database indexes). Maybe we can suggest a new method, eg. |
Could you clarify why you're saying that UUIDv7 can't be used if it's not monotonic? As far as I know it's by design that UUIDv7 is k-sorted and not monotonic, and that seems to be enough to work well with B-tree primary key indexes in databases, which is the primary use case. In distributed systems where collisions are a concern you'll likely want to use Snowflake or variants of Snowflake, which are also usually k-sorted and not monotonic. |
@Timovzl wrote:
I'd also wager to say that any reliable Transactional Outbox implementation, which is my use case, won't be able to use this v7 GUID, since keeping the insertion order of messages in any batch is paramount. Batches can be reordered, that wouldn't be a problem.
I'm going to have to use something else then :) Perhaps what @Timovzl implemented or https://github.com/Cysharp/Ulid are better suited for my needs.
That could be useful. |
As per the original proposal, there is room to expose additional
Please feel free to open an API proposal covering these, I imagine the full overload might look something like (possibly with better parameter names or different parameter order):
public static Guid CreateVersion7(DateTime timestamp, ulong counter, bool submillisecondResolution); |
Personally I am not missing such overloads (to specify the counter etc.) but maybe others do. But back to my suggestion :-) I would suggest something like (slightly adjusted from before):
Maybe a better name could be CreateVersion7MonotonicSequence, a bit long though. The implementer (.NET) can then choose how to make it monotonic: increase the timestamp after rand_a rolls over, stall until the next millisecond when rand_a rolls over, generate new random v7 Guids until one randomly becomes greater than the previous, etc. Personally, I like this suggestion best (from the RFC): I have made a proof-of-concept implementation for this API, in case of interest :-) |
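One of the options listed here (increase the timestamp after rand_a rollover) might look roughly like the sketch below. It is not the proof of concept mentioned in the comment. The assumptions are mine: rand_a is used as a 12-bit counter that restarts at zero each millisecond, state is per process and guarded by a lock, and `NextMonotonic` is a hypothetical name. It needs .NET 8 or later for the big-endian `Guid` constructor.

```csharp
using System;
using System.Security.Cryptography;

long lastMs = -1;     // timestamp used by the previous ID
int counter = 0;      // 12-bit counter stored in rand_a
object gate = new();  // serializes generation within this process

Guid NextMonotonic(long unixMs)
{
    lock (gate)
    {
        if (unixMs > lastMs) { lastMs = unixMs; counter = 0; }
        else if (++counter > 0x0FFF) { lastMs++; counter = 0; } // rollover: borrow the next millisecond

        byte[] bytes = new byte[16];
        RandomNumberGenerator.Fill(bytes.AsSpan(8));            // rand_b stays random
        for (int i = 0; i < 6; i++)
        {
            bytes[i] = (byte)(lastMs >> (40 - 8 * i));          // unix_ts_ms, big-endian
        }
        bytes[6] = (byte)(0x70 | (counter >> 8));               // version 7 + counter high bits
        bytes[7] = (byte)counter;                               // counter low bits
        bytes[8] = (byte)((bytes[8] & 0x3F) | 0x80);            // variant "10"
        return new Guid(bytes, bigEndian: true);
    }
}

long now = DateTimeOffset.UtcNow.ToUnixTimeMilliseconds();
Guid first = NextMonotonic(now);
Guid second = NextMonotonic(now); // same millisecond, still sorts after `first`
Console.WriteLine(first.CompareTo(second) < 0); // True
```

The trade-off matches the discussion: ordering within the process becomes strict, at the cost of 12 fewer random bits and of timestamps that can run slightly ahead of the clock during large bursts.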
It is my experience that without the following two properties, the usefulness and pit of success of a UUIDv7 implementation is severely limited:
Sadly, providing a counter value is not a pit of success when it comes to point 1, and it also sacrifices point 2.
Requiring the user to do special things for batches is a huge sacrifice, and also not a pit of success. For example, consider the very valid use case of a DDD entity that self-validates in the constructor and is to be complete and reliable after construction. It makes sense for such an entity to populate its (readonly) ID property with a UUIDv7 during construction. If such entities are constructed in batch, the intuition is for them to get stored in creation order. It is far from ideal to have to (A) remember to do something special to get the intuitive result and (B) pollute the constructor with technical details to achieve it. I have spent a lot of time on this topic, and my conclusions (as implemented) are:
|
Wonderfully articulated, @Timovzl. I appreciate your input and apologize for tagging you multiple times, but I felt it was crucial to urgently reconsider your points. Additionally, your analysis on collision probabilities in a distributed system was quite convincing. The calculations you shared in another post were impressive. Is it just me, or is there a possibility to enjoy the best of both worlds here? 😃 |
There is a balance to exposing APIs, and that includes covering the relevant set that allows users to roll their own appropriate implementation without adding every potential overload or consideration. Developers are responsible for writing code to handle and use a type in a manner that works best for their application. It is therefore not the responsibility of the BCL to expose "everything", but rather enough that allows developers to achieve success for common needs by providing foundational building blocks. We also have the consideration of dependencies, and The right foundational building block is therefore something like: public static Guid CreateVersion7(DateTimeOffset timestamp, ulong counter, bool submillisecondResolution); This API "works" because it simply gives access to the 3 "keys" that represent the state of the UUIDv7, in line with the RFC allowances:
That is, you get:
Such an API therefore serves as a building block and helps trivialize building your own "create many GUIDs" API. It keeps all the control in the hands of the developer while removing some of the worst complexity in the typical case. What remains in complexity is then domain or scenario specific requirements, such as synchronizing and providing the base timestamp and ensuring that the counter changes are within the domain requirements (whether that be If the BCL were to provide some "create many GUIDs" API, it would end up being a fairly naive implementation that looked something like (haven't validated this entirely for correctness, mostly meant to be a rough example):
ulong previousUnixTs = _unixTs;
ulong unixTs = GetUnixTs(DateTimeOffset.UtcNow, _submillisecondResolution);
ulong counter = _counter;
if (unixTs == previousUnixTs)
{
    counter += _increment;
    if (counter > 0x3FFFFFFF_FFFFFFFF)
    {
        counter -= 0x3FFFFFFF_FFFFFFFF;
    }
}
else
{
    counter = (ulong)Random.Shared.NextInt64(0x3FFFFFFF_FFFFFFFF);
    _increment = (ulong)Random.Shared.NextInt64(0x3FFFFFFF_FFFFFFFF);
}
Unsafe.SkipInit(out Guid result);
Unsafe.AsRef(in result._a) = (int)(unixTs >> 28);
Unsafe.AsRef(in result._b) = (short)(unixTs >> 12);
Unsafe.AsRef(in result._c) = (short)(unixTs & 0xFFF) | Version7Value;
Unsafe.As<byte, ulong>(ref Unsafe.AsRef(in result._d)) = counter | (Variant10xxValue << 56);
_current = result;
_counter = counter;
_unixTs = unixTs; |
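For anyone who wants to experiment before such an overload exists, the layout can be approximated with APIs that already ship. The following is a minimal sketch, not the BCL implementation: `Uuid7Sketch.Create` and its `randA` parameter are hypothetical names, and it assumes .NET 8 or later for the `Guid(ReadOnlySpan<byte>, bool bigEndian)` constructor. It writes the fields in big-endian order explicitly, so the result does not depend on the machine's endianness.

```csharp
using System;
using System.Buffers.Binary;

static class Uuid7Sketch
{
    // Hypothetical helper (not a BCL API): packs a 48-bit Unix millisecond
    // timestamp, a 12-bit rand_a value, and a 62-bit counter into the
    // UUIDv7 layout described by RFC 9562.
    public static Guid Create(DateTimeOffset timestamp, ushort randA, ulong counter)
    {
        // Keep only the low 48 bits of the millisecond timestamp.
        ulong ms = (ulong)timestamp.ToUnixTimeMilliseconds() & 0xFFFF_FFFF_FFFFUL;

        // High 8 bytes: 48-bit timestamp, 4-bit version (7), 12-bit rand_a.
        ulong hi = (ms << 16) | 0x7000UL | (randA & 0x0FFFUL);

        // Low 8 bytes: 2-bit variant (binary 10), 62-bit counter.
        ulong lo = 0x8000_0000_0000_0000UL | (counter & 0x3FFF_FFFF_FFFF_FFFFUL);

        Span<byte> bytes = stackalloc byte[16];
        BinaryPrimitives.WriteUInt64BigEndian(bytes, hi);
        BinaryPrimitives.WriteUInt64BigEndian(bytes.Slice(8), lo);

        // bigEndian: true makes the byte order match the string form.
        return new Guid(bytes, bigEndian: true);
    }
}
```

With this sketch, `Uuid7Sketch.Create(DateTimeOffset.FromUnixTimeMilliseconds(0x010203040506), 0xABC, 1)` yields `01020304-0506-7abc-8000-000000000001`, and two values created with the same timestamp and `randA` compare in counter order via `Guid.CompareTo`. How the counter is seeded, incremented, and synchronized is left to the caller, which is the point of the building-block approach described above.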
I really like that overload, as it would allow anyone to roll their own implementation while remaining a Guid. |
@tannergooding Would it be possible to grant a bit more control over all the 74 random bits, rather than just the last 62 of them? For example, when doing 57-bit random increments (for which the collision resistance is arguably decent), if we have only 62 bits of space, that leaves only 62-57 = 5 bits to flow into. That would limit the generation rate to an insufficient 2^5 = 32 IDs per millisecond. Is there a strong reason for the first 12 bits of randomness to be treated differently from the last 62? I'm guessing the reasons are ease of formulation (especially with regards to each of M1, M2, and M3), parameter simplicity ( |
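The headroom arithmetic in the comment above can be spelled out in a few lines. This only illustrates the comment's own reasoning (increments of a fixed bit width, counter starting near zero); `HeadroomSketch` is a hypothetical name, not part of any proposal.

```csharp
using System;

static class HeadroomSketch
{
    // Number of maximum-size increments that fit in the available space:
    // 2^(availableBits - incrementBits).
    public static ulong IdsPerTimestamp(int availableBits, int incrementBits)
        => 1UL << (availableBits - incrementBits);
}
```

`HeadroomSketch.IdsPerTimestamp(62, 57)` gives 32, matching the figure in the comment, while `HeadroomSketch.IdsPerTimestamp(74, 57)` gives 131072 if all 74 random bits were available to the counter.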
To answer my own question from the RFC's perspective (section 5.7):
|
If you really want full control, then there is already the The primary reason to differentiate the two from the perspective of the public constructor is that the RFC has a specific definition of those bits. They are either The RFC itself calls out an appropriate way to use the 62 random bits in a way that guarantees uniqueness while still keeping it random/unpredictable. Namely per timestamp (whether that is 48 or 60-bit timestamp) you generate a 62-bit random seed that serves as the basis. Then you pick a singular 62-bit random increment that is used for all UUIDs generated within that timestamp. When the timestamp changes, you then change the random seed that serves as the basis and the random increment used for that timestamp window. Such a setup ensures it is sufficiently random (62-bits) to avoid predictability while also ensuring it is unique (2^62 unique values can be generated in that timestamp window) for a non-zero increment. Of course, in practice you need much fewer than 62-bits because the 48-bit timestamp gives millisecond accuracy and the 60-bit timestamp gives you around 245 nanosecond accuracy. Even with the fastest computers today, you really only need up to |
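To make the "around 245 nanosecond" figure concrete: 12 extra timestamp bits split each millisecond into 4096 steps, so one step is 1,000,000 ns / 4096 ≈ 244.14 ns. The sketch below shows one way the 12-bit sub-millisecond field could be derived from a `DateTimeOffset`, under the assumption that the fraction of the current millisecond is simply scaled to 12 bits; `SubMillisecondSketch` and `Fraction12` are hypothetical names, and this is not necessarily how the BCL would compute it.

```csharp
using System;

static class SubMillisecondSketch
{
    // Duration covered by one step of a 12-bit sub-millisecond field.
    public const double NanosecondsPerStep = 1_000_000.0 / 4096; // about 244.14 ns

    // Scales the position within the current millisecond (measured in
    // 100 ns ticks, so 0..9999) to a 12-bit value (0..4095).
    public static ushort Fraction12(DateTimeOffset timestamp)
    {
        long ticksIntoMs = timestamp.UtcTicks % TimeSpan.TicksPerMillisecond;
        return (ushort)(ticksIntoMs * 4096 / TimeSpan.TicksPerMillisecond);
    }
}
```

For a timestamp that is exactly half a millisecond past a millisecond boundary (5000 ticks), `Fraction12` returns 2048. Note that `DateTimeOffset` itself only has 100 ns resolution, and the system clock is often coarser than that.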
I would (also) suggest a more generic overload, like: |
It really depends on the scenario and what considerations are important. A consumer of this method would simply need to ensure that |
@tannergooding You have shared a lot of very useful information. For better discoverability it would be great to have it in the remarks section of this API and perhaps a separate blog post with the details. Many thanks! |
Rationale
The UUID specification (https://datatracker.ietf.org/doc/rfc9562) defines several different UUID versions which can be created and which allow developers to produce and consume UUIDs that have a particular structure.
As such, we should expose helpers to allow creating such UUIDs. For .NET 9, the set of potential versions is detailed below, but only UUIDv7 is proposed for the time being.
As per the UUID spec, GUID is a valid alternative name and so the name of the new APIs remains NewGuid* for consistency with our existing APIs:
API Proposal