-
Notifications
You must be signed in to change notification settings - Fork 904
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update Trace specification to use W3C Trace Context Level 2; add Random flag #3924
Conversation
As an example of what this is hoping to clarify, see open-telemetry/opentelemetry-go#5030 |
- TraceState (string) | ||
|
||
Propagators MUST NOT assume that bits 2-7 (6 most significant bits) | ||
will be zero, as they are reserved for future use and are expected to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The W3C TraceContext Level 2 spec says that these must be set to zero: https://www.w3.org/TR/trace-context-2/#other-flags. It will be good to clarify this here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This conflicts with https://www.w3.org/TR/trace-context-2/#a-traceparent-is-received
The vendor will only parse the trace-flags values supported by this version of this specification and ignore all other values. If parsing fails, the vendor creates a new traceparent header and deletes the tracestate. Vendors will set all unparsed / unknown trace-flags to 0 on outgoing requests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe, then, my entire PR is invalid.
This makes me question whether W3C specification is for the best, because it will be many ages before the new flag becomes useful. Perhaps we should not think of intermediate processors such as the trace SDK or a sampling processor to be "vendors" in this sense. If the W3C specification says not to propagate and respect future-defined flags, it could be maybe 5 years before we're able to use the new randomness flag. If that's what we want, I'll close this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Josh. I checked up now the W3C TraceContext Level 1 spec (which is already in recommendation state since 2021) and it has the same wording for the unknown flags (https://www.w3.org/TR/trace-context/#other-flags). Agree this makes things harder for mixed-mode scenarios (where a participant is on an older version that doesn't understand the newer flag)....Adding @dyladan , @SergeyKanzhelev , @basti1302 in case they have more context behind that recommendation or inputs on how to address this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my understanding the handling of unknown flags currently mandated by the spec is motivated as follows: Propagating an unknown flag value downstream as 1
because it was 1
in the header incoming from upstream effectively makes the participant claim a fact without knowing what it actually claims.
I'll try to explain this with a concrete example. Let's assume for the sake of the argument that the spec had started without the sampled flag, and we would introduce the sampled flag now in version 2. The sampled flag is effectively a fact about the actions of the specific participant that sends the header downstream, it is not a fact about the trace as a whole. Its semantic when sending a 1
downstream is "I have sampled this request".
Imagine the following scenario:
- Service A (which already supports the newer version and understands the sampled flag) sends
sampled=1
to service B. - Service B implements the older version of the spec and it does not know about the sampled flag.
- Service B decides to not sample this request (ultimately, each participant can make its own sampling decision).
- If the rule would be "propagate unknown flags as received", service B would propagate
sampled=1
downstream, although it did not sample. So service B would effectively claim something that is not true.
I think the underlying problem in this discussion (and I'm only realizing this now as well) is the following: There are different kinds of flags.
- We have flags that represent a fact about the actions of the most recent participant (like sampled).
- There are also flags that represent facts about the trace as a whole, and which can be fully determined by the participant starting the trace -- like the new randomness flag.
Ideally, we would have different propagation behaviors for these two kinds of flags -- "set unknown flags to 0" for (1.), and "propagate as-is" for (2.). However, such a distinction is currently not possible. When the current rule (set unknown flags to 0 when sending the header downstream) was put in place, there was only the sampled flag. Maybe the assumption back then was that other flags would also be of type (1.), although that is pure speculation on my part (I wasn't around back then).
Some other remarks:
Perhaps we should not think of intermediate processors such as the trace SDK or a sampling processor to be "vendors" in this sense.
I disagree with this. "Vendor" might not be the best term (and we have replaced it in more recent versions of the spec with "participant" I think), but every piece of software interacting with the traceparent
header must follow the spec if it claims to be compliant, we cannot arbitrarily claim "oh, I'm not a vendor, I'll just do something different".
If the W3C specification says not to propagate and respect future-defined flags, it could be maybe 5 years before we're able to use the new randomness flag. If that's what we want, I'll close this PR.
I fully understand your frustration with this. Indeed, it would take a while to get this spec level implemented across the board, and it will be less useful if only a subset of SDKs have implemented it. Ultimately, backwards compatibility is a difficult and complex topic :-(
In my opinion, there is basically a tradeoff between speed of adoption of new flags and correctness of propagated flags, and the spec choose to value the latter over the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I very much disagree that this is a defect in level 1. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was going by the criteria of mixed mode environments (which is the common case in a DT scenario). Ideally when a new capability (say a new flag) is introduced, we want whoever understands that newer capability to benefit from it. But in the current model, that's not possible, even if there's one older participant in the mix.
If it doesn't work in mixed mode environments, how do you see such new capabilities being adopted? Currently, the only way this can be used is for ALL the participants to upgrade to Level 2 of the spec - which may not be practical in environments with large number of participants.
Your earlier response on why the Level 1 spec might have modelled it that way (with the information available 3-4 years back) makes sense. And now we have the benefit of hindsight. I think what Josh/I are saying is the above f/w compat is a strong requirement, without which we cannot introduce new capabilities and expect adoption.
So, similar to what OTel is doing, maybe more bits are needed to express this and not just a single bit...
Happy to discuss more (either here and/or in the next W3C DT meeting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my understanding the handling of unknown flags currently mandated by the spec is motivated as follows: Propagating an unknown flag value downstream as
1
because it was1
in the header incoming from upstream effectively makes the participant claim a fact without knowing what it actually claims.
But it feels even worse when a participant changes a flag from 1 to 0, without even knowing what it means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the underlying problem in this discussion (and I'm only realizing this now as well) is the following: There are different kinds of flags.
- We have flags that represent a fact about the actions of the most recent participant (like sampled).
- There are also flags that represent facts about the trace as a whole, and which can be fully determined by the participant starting the trace -- like the new randomness flag.
Ideally, we would have different propagation behaviors for these two kinds of flags -- "set unknown flags to 0" for (1.), and "propagate as-is" for (2.). However, such a distinction is currently not possible.
Indeed, so perhaps just to open a path for moving forward we can consider the following. We have 8 bits - how about splitting them into the above categories today, let's say 4+4 - before these bits are defined. This would still be a breaking change for the specs, of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand @jmacd's argument correctly, it is that when the random bit is 0
, it is ambiguous if this means "not random" or if this means "old version of spec." The same ambiguity does not exist for the 1
state because the spec requires setting unknown flags to 0
. Adding a second bit set to 1
would clarify that this is indeed generated by a level-2 generator but the random bit was explicitly not set. We discussed at one point adding a second bit for this purpose, but decided that the current situation where 0
is ambiguous is ok because "not random" and "maybe random" should be treated the same way.
I agree with @basti1302 that propagating unknown claims is potentially hazardous. I don't have anything to add to his arguments above.
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
@kalyanaj I've updated this PR to reflect the reality of the W3C trace context specification regarding unknown flags. The PR itself is much less important, as a result, but still worth applying. |
…ication into jmacd/tracestate
This PR was marked stale due to lack of activity. It will be closed in 7 days. |
Co-authored-by: Tyler Yahn <[email protected]>
I may not re-open this; it serves very little purpose in its current form. It shows we need work on #1929. |
Part of #1826.
Part of #1413.
Changes
The W3C Trace Context Level 2 specification adds a new Random flag intended for use with Probability sampling. The Sampling SIG has produced an OTEP describing how it will be used and how probability sampling can be encoded in the tracestate field. See https://github.com/open-telemetry/oteps/blob/main/text/trace/0235-sampling-threshold-in-trace-state.md
Prototype semantic conventions: open-telemetry/semantic-conventions#793
Prototype collector tail sampler: open-telemetry/opentelemetry-collector-contrib#24811
Prototype sampler logic: open-telemetry/opentelemetry-collector-contrib#29720
Not adding a new spec-compliance matrix for this, we already have "Conforms to the W3C TraceContext spec" however this clarifies what that means. Some SDKs are out of compliance with this specification, but (IMO) not intentionally, so we should not create a new row of the matrix.
CHANGELOG.md
file updated for non-trivial changesspec-compliance-matrix.md
updated if necessary