Update Trace specification to use W3C Trace Context Level 2; add Random flag #3924

jmacd · 2024-03-06T00:45:00Z

Part of #1826.
Part of #1413.

Changes

The W3C Trace Context Level 2 specification adds a new Random flag intended for use with Probability sampling. The Sampling SIG has produced an OTEP describing how it will be used and how probability sampling can be encoded in the tracestate field. See https://github.com/open-telemetry/oteps/blob/main/text/trace/0235-sampling-threshold-in-trace-state.md

Prototype semantic conventions: open-telemetry/semantic-conventions#793
Prototype collector tail sampler: open-telemetry/opentelemetry-collector-contrib#24811
Prototype sampler logic: open-telemetry/opentelemetry-collector-contrib#29720

Not adding a new spec-compliance matrix for this, we already have "Conforms to the W3C TraceContext spec" however this clarifies what that means. Some SDKs are out of compliance with this specification, but (IMO) not intentionally, so we should not create a new row of the matrix.

Related issues Complete the TraceIdRatio specification #1826, Decision on new encoding for sampling "selectivity" #3602
Related OTEP 235
Links to the prototypes (when adding or changing features)
CHANGELOG.md file updated for non-trivial changes
spec-compliance-matrix.md updated if necessary

jmacd · 2024-03-06T00:57:17Z

As an example of what this is hoping to clarify, see open-telemetry/opentelemetry-go#5030

kalyanaj · 2024-03-07T04:26:10Z

specification/context/api-propagators.md

+- TraceState (string)
+
+Propagators MUST NOT assume that bits 2-7 (6 most significant bits)
+will be zero, as they are reserved for future use and are expected to


The W3C TraceContext Level 2 spec says that these must be set to zero: https://www.w3.org/TR/trace-context-2/#other-flags. It will be good to clarify this here.

This conflicts with https://www.w3.org/TR/trace-context-2/#a-traceparent-is-received

The vendor will only parse the trace-flags values supported by this version of this specification and ignore all other values. If parsing fails, the vendor creates a new traceparent header and deletes the tracestate. Vendors will set all unparsed / unknown trace-flags to 0 on outgoing requests.

I believe, then, my entire PR is invalid.

This makes me question whether W3C specification is for the best, because it will be many ages before the new flag becomes useful. Perhaps we should not think of intermediate processors such as the trace SDK or a sampling processor to be "vendors" in this sense. If the W3C specification says not to propagate and respect future-defined flags, it could be maybe 5 years before we're able to use the new randomness flag. If that's what we want, I'll close this PR.

Thanks Josh. I checked up now the W3C TraceContext Level 1 spec (which is already in recommendation state since 2021) and it has the same wording for the unknown flags (https://www.w3.org/TR/trace-context/#other-flags). Agree this makes things harder for mixed-mode scenarios (where a participant is on an older version that doesn't understand the newer flag)....Adding @dyladan , @SergeyKanzhelev , @basti1302 in case they have more context behind that recommendation or inputs on how to address this...

In my understanding the handling of unknown flags currently mandated by the spec is motivated as follows: Propagating an unknown flag value downstream as 1 because it was 1 in the header incoming from upstream effectively makes the participant claim a fact without knowing what it actually claims.

I'll try to explain this with a concrete example. Let's assume for the sake of the argument that the spec had started without the sampled flag, and we would introduce the sampled flag now in version 2. The sampled flag is effectively a fact about the actions of the specific participant that sends the header downstream, it is not a fact about the trace as a whole. Its semantic when sending a 1 downstream is "I have sampled this request".

Imagine the following scenario:

Service A (which already supports the newer version and understands the sampled flag) sends sampled=1 to service B.

Service B implements the older version of the spec and it does not know about the sampled flag.

Service B decides to not sample this request (ultimately, each participant can make its own sampling decision).

If the rule would be "propagate unknown flags as received", service B would propagate sampled=1 downstream, although it did not sample. So service B would effectively claim something that is not true.

I think the underlying problem in this discussion (and I'm only realizing this now as well) is the following: There are different kinds of flags.

We have flags that represent a fact about the actions of the most recent participant (like sampled).

There are also flags that represent facts about the trace as a whole, and which can be fully determined by the participant starting the trace -- like the new randomness flag.

Ideally, we would have different propagation behaviors for these two kinds of flags -- "set unknown flags to 0" for (1.), and "propagate as-is" for (2.). However, such a distinction is currently not possible. When the current rule (set unknown flags to 0 when sending the header downstream) was put in place, there was only the sampled flag. Maybe the assumption back then was that other flags would also be of type (1.), although that is pure speculation on my part (I wasn't around back then).

Some other remarks:

Perhaps we should not think of intermediate processors such as the trace SDK or a sampling processor to be "vendors" in this sense.

I disagree with this. "Vendor" might not be the best term (and we have replaced it in more recent versions of the spec with "participant" I think), but every piece of software interacting with the traceparent header must follow the spec if it claims to be compliant, we cannot arbitrarily claim "oh, I'm not a vendor, I'll just do something different".

If the W3C specification says not to propagate and respect future-defined flags, it could be maybe 5 years before we're able to use the new randomness flag. If that's what we want, I'll close this PR.

I fully understand your frustration with this. Indeed, it would take a while to get this spec level implemented across the board, and it will be less useful if only a subset of SDKs have implemented it. Ultimately, backwards compatibility is a difficult and complex topic :-(

In my opinion, there is basically a tradeoff between speed of adoption of new flags and correctness of propagated flags, and the spec choose to value the latter over the former.

FWIW, I very much disagree that this is a defect in level 1. :-)

I was going by the criteria of mixed mode environments (which is the common case in a DT scenario). Ideally when a new capability (say a new flag) is introduced, we want whoever understands that newer capability to benefit from it. But in the current model, that's not possible, even if there's one older participant in the mix.

If it doesn't work in mixed mode environments, how do you see such new capabilities being adopted? Currently, the only way this can be used is for ALL the participants to upgrade to Level 2 of the spec - which may not be practical in environments with large number of participants.

Your earlier response on why the Level 1 spec might have modelled it that way (with the information available 3-4 years back) makes sense. And now we have the benefit of hindsight. I think what Josh/I are saying is the above f/w compat is a strong requirement, without which we cannot introduce new capabilities and expect adoption.

So, similar to what OTel is doing, maybe more bits are needed to express this and not just a single bit...

Happy to discuss more (either here and/or in the next W3C DT meeting).

In my understanding the handling of unknown flags currently mandated by the spec is motivated as follows: Propagating an unknown flag value downstream as 1 because it was 1 in the header incoming from upstream effectively makes the participant claim a fact without knowing what it actually claims.

But it feels even worse when a participant changes a flag from 1 to 0, without even knowing what it means.

I think the underlying problem in this discussion (and I'm only realizing this now as well) is the following: There are different kinds of flags.

We have flags that represent a fact about the actions of the most recent participant (like sampled).

There are also flags that represent facts about the trace as a whole, and which can be fully determined by the participant starting the trace -- like the new randomness flag.

Ideally, we would have different propagation behaviors for these two kinds of flags -- "set unknown flags to 0" for (1.), and "propagate as-is" for (2.). However, such a distinction is currently not possible.

Indeed, so perhaps just to open a path for moving forward we can consider the following. We have 8 bits - how about splitting them into the above categories today, let's say 4+4 - before these bits are defined. This would still be a breaking change for the specs, of course.

If I understand @jmacd's argument correctly, it is that when the random bit is 0, it is ambiguous if this means "not random" or if this means "old version of spec." The same ambiguity does not exist for the 1 state because the spec requires setting unknown flags to 0. Adding a second bit set to 1 would clarify that this is indeed generated by a level-2 generator but the random bit was explicitly not set. We discussed at one point adding a second bit for this purpose, but decided that the current situation where 0 is ambiguous is ok because "not random" and "maybe random" should be treated the same way.

I agree with @basti1302 that propagating unknown claims is potentially hazardous. I don't have anything to add to his arguments above.

specification/context/api-propagators.md

specification/trace/api.md

specification/trace/sdk.md

github-actions · 2024-03-15T03:16:56Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

…ication into jmacd/tracestate

jmacd · 2024-03-20T15:54:44Z

@kalyanaj I've updated this PR to reflect the reality of the W3C trace context specification regarding unknown flags. The PR itself is much less important, as a result, but still worth applying.

specification/trace/sdk.md

…ication into jmacd/tracestate

specification/trace/sdk.md

specification/context/api-propagators.md

github-actions · 2024-04-13T03:15:36Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

Co-authored-by: Tyler Yahn <[email protected]>

jmacd · 2024-04-16T23:02:45Z

I may not re-open this; it serves very little purpose in its current form. It shows we need work on #1929.

Clarify how Trace Flags and Tracestate are handled by Trace SDKs

e7585f3

jmacd requested review from a team March 6, 2024 00:45

github-actions bot assigned tigrannajaryan Mar 6, 2024

chlog

a45f6e1

This was referenced Mar 6, 2024

Stabilize trace context in non-OTLP formats and define casing #3909

Merged

Propagate 8 bits of W3C trace context flags open-telemetry/opentelemetry-go#5030

Closed

lint

cbce7a5

kalyanaj reviewed Mar 7, 2024

View reviewed changes

github-actions bot added the Stale label Mar 15, 2024

jmacd removed the Stale label Mar 18, 2024

jmacd added 4 commits March 20, 2024 08:34

chlog

ed61a86

Merge branch 'main' of github.com:open-telemetry/opentelemetry-specif…

f466ab1

…ication into jmacd/tracestate

Bits must be set to zero

3b4d276

update w3c trace context links to level 2

6120aac

jmacd requested a review from a team March 20, 2024 15:52

tigrannajaryan reviewed Mar 21, 2024

View reviewed changes

specification/trace/sdk.md Outdated Show resolved Hide resolved

dyladan approved these changes Mar 26, 2024

View reviewed changes

dyladan self-requested a review March 26, 2024 12:50

jmacd added 3 commits March 27, 2024 11:55

Merge branch 'main' of github.com:open-telemetry/opentelemetry-specif…

c6d47fa

…ication into jmacd/tracestate

add spec complicance

9d96396

remove flags

677b3a9

jmacd commented Mar 27, 2024

View reviewed changes

specification/trace/sdk.md Outdated Show resolved Hide resolved

MrAlias reviewed Apr 5, 2024

View reviewed changes

specification/context/api-propagators.md Outdated Show resolved Hide resolved

github-actions bot added the Stale label Apr 13, 2024

Update specification/context/api-propagators.md

4737732

Co-authored-by: Tyler Yahn <[email protected]>

MrAlias approved these changes Apr 16, 2024

View reviewed changes

jmacd added 2 commits April 16, 2024 15:31

Remove mention of Span Tracestate from SDK spec

e4c0349

revert spec-matrix

850c2d8

jmacd changed the title ~~Clarify how Trace Flags and Tracestate are handled by Trace SDKs~~ Update Trace specification to use W3C Trace Context Level 2; add Random flag Apr 16, 2024

jmacd closed this Apr 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update Trace specification to use W3C Trace Context Level 2; add Random flag #3924

Update Trace specification to use W3C Trace Context Level 2; add Random flag #3924

jmacd commented Mar 6, 2024 •

edited

Loading

jmacd commented Mar 6, 2024

kalyanaj Mar 7, 2024

XSAM Mar 15, 2024

jmacd Mar 18, 2024

kalyanaj Mar 18, 2024

basti1302 Mar 19, 2024 •

edited

Loading

basti1302 Mar 20, 2024

kalyanaj Mar 20, 2024 •

edited

Loading

PeterF778 Mar 22, 2024

PeterF778 Mar 22, 2024

dyladan Mar 26, 2024

github-actions bot commented Mar 15, 2024

jmacd commented Mar 20, 2024

github-actions bot commented Apr 13, 2024

jmacd commented Apr 16, 2024

Update Trace specification to use W3C Trace Context Level 2; add Random flag #3924

Update Trace specification to use W3C Trace Context Level 2; add Random flag #3924

Conversation

jmacd commented Mar 6, 2024 • edited Loading

Changes

jmacd commented Mar 6, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

basti1302 Mar 19, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kalyanaj Mar 20, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented Mar 15, 2024

jmacd commented Mar 20, 2024

github-actions bot commented Apr 13, 2024

jmacd commented Apr 16, 2024

jmacd commented Mar 6, 2024 •

edited

Loading

basti1302 Mar 19, 2024 •

edited

Loading

kalyanaj Mar 20, 2024 •

edited

Loading