-
Notifications
You must be signed in to change notification settings - Fork 714
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds CorrelationFields in support of in and between process propagation #577
Conversation
cc @felixbarny @wu-sheng @marcingrzejszczak @devinsba @llinder @jcchavezs @basvanbeek @shakuzen @nicmunroe @jkschneider (and any others) please check me on commentary etc. Actual implementation will follow shortly as homework (the hard part) is done |
} | ||
}; | ||
|
||
static final class Real extends CorrelationFields { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PS I am starting with LHM eventhough it has more memory cost than array-offset indexing.
/** | ||
* Retrieves a correlation key by name or returns null if not found. | ||
* | ||
* @throws IllegalArgumentException if the input name is not valid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is out-of-date. throwing is bad as it can break users (who are likely to ignore this). Current impl just silently drops on problems.
} | ||
|
||
private static boolean log(String message) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO add logger?
37096a5
to
189e2fc
Compare
moved the huge rationale section to where the other rationale are in the brave README. Added |
* Census tags are types, encouraging prevalidation and use as constants. | ||
* Sharing mutable data across threads can lead to inconsistent reads of fields. | ||
* For example, child threads affect the same state their parent reads. | ||
* Immutable tag contexts eliminate this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth mentioning that immutable tags eliminate this, at the cost of having to manage scopes for the TagContext.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point.. author died in the middle of this point i think
* For example, child threads affect the same state their parent reads. | ||
* Immutable tag contexts eliminate this | ||
* A tracer api is more complex than a tagger api, and has an ordering concern. | ||
* To contribute tags, you need to have tracing start before metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not entirely sure I get the point of bringing up metrics here. Ideally only a subset of span tags (the ones ensured to be low cardinality + http.template
) and none of these seem to be related to correlation fields. What's the link I'm missing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah wording was a bit crap. (tags are the census name for correlation fields).
The jist is that if the correlation api is separate, it can scope fields before or even if tracing doesn't exist. One use case is to use the same correlation field in metrics and tracing (and further down the line, such as country-code)
To have metrics re-use correlation fields from tracing, for example, it would likely need to start after tracing started (or at the same time). Similar problem exists in logging. If you want logs to contain a correlation field, that field should be scoped as early as possible. Otherwise, you see request handling without and MDC value until sometime later when tracing kicks in.
The idea is that a tagging api could be made at the head of a request, and be simpler or at least less dangerously, ensuring that whether it is tracing metrics or logging (or something else), they all read exactly the same values, with no ordering interdependency apart from tagger.
What I'm calling out here is that if we assign that responsibility to brave (for convenience because parsing, inject extract propagation etc is already handled), there is at least an interdependency problem, which is most easily sorted if trace starts before other subsystems on a request, or at least as early as possible.
Now.. if I did clear up anything (mildly possible), could you help me revise it into a shorter bullet or two? :)
Hi guys, is it possible to have the option of setting a custom header by using |
@aldex32 yeah I'd expect ExtraFieldPropagation to use this feature for mutability |
@aldex32 I'm going to address the mutability of |
@aldex32 Actually, there's a limitation in the current Propagation design where it can only affect traces that originated remotely. That impacts this, too.. will raise something independently |
#579 needs to be explored prior to finishing this design. Will work on that |
Had some chats with folks like @ivantopo and I feel I want to think this through a bit more. I generally don't like how this feels like it needs to hook together (want to see if there's another way to avoid special hooks in the tracer or not). Meanwhile I've rejigged things in support of prefixed keys (and mutation) of the existing extra field propagation code. I'm fairly happy with this, as the logic is completely optional to those not pushing extra fields. Going to cut that below before picking this up again. #581 |
I've pushed some interim improvements to extra field propagation in brave 4.13.3 to buy time here. Will revisit shortly |
did a huge revamp on the description. updates around the code will follow next |
Just sharing our use case that I'm hoping this can solve: we have an additional ID that we would like to propagate throughout the trace and add to all services' logs for log correlation based on that ID. |
Just sharing our use case that I'm hoping this can solve: we have an
additional ID that we would like to propagate throughout the trace and add
to all services' logs for log correlation based on that ID.
Is that ID constant for all downstream requests? Also, does that need to be
sent in a specific header, or does it matter as long as in-process
correlation occurs?
|
The service that starts the trace (or possibly an uninstrumented service before the trace starts) should discern and set the ID. It shouldn't be changed.
Things aren't quite set in stone yet, but correlation is the primary goal. As alluded to above, the service initially setting this ID may be uninstrumented and therefore a well-known header is probably the best way to receive this in an instrumented app, was my line of thinking. If everything were instrumented, I guess it wouldn't matter how it is propagated. |
@SergeyKanzhelev ps if you like please review the PR description eventhough the code has drifted a bit This is in response to notes in the census issue that came out of the same discussion census-instrumentation/opencensus-specs#45 |
feedback from @mpetazzoni included that if this feature didn't require a header-per-field encoding (whitelist), it would be easier to use
|
68bc17a
to
501f355
Compare
I've rejigged the code so that it is compilable now, but it isn't right. I'll try to peck at it over a flight |
here are some design thoughts: An implementation of correlation context could be "accumulate and publish later". For example, data is placed into the correlation fields object, but only made visible when the span is re-scoped. There are generally 3 scopes of data which one might correlate with: process scoped: ex current amazon instance ID To address the scoping of data, we could consider options such as a flag per field, indicating whether it is to be published downstream or not, or if it is to expire with the span. Note that in the Correlation-Context spec, there's been discussion of a ttl flag (interpreted remotely), which for example the value 1 means drop at the next hop. |
deprioritizing this as during the tracing workshop, there are especially security concerns about adding user data to outgoing requests by default. Basically, what some call "baggage" is not only dangerous to the size of production requests, but also a potentially large security risk (imagine accidentally propagating data to a third party) We should be careful when implementing it to make 100% optional and something that potentially re-validates per hop whether to send the correlation keys out or not. In the mean time, users can use the whitelist approach with ExtraFieldPropagation. We may still want to continue with in-process propagation efforts, but at any rate be very careful with remote propagation (cc @garyd203 @SirTyro) https://docs.google.com/document/d/1YJ0oriH62Ixk3W37z31X0SeuXpgXpXBetiz8EWxG6NU/edit# |
we'll want to make sure this addresses by api "re-scoping" for example, when someone adds a new correlation field and want it to be immediately visible. Currently, someone needs to call This recently bumped again spring-cloud/spring-cloud-sleuth#927 raised by @dhoffi |
501f355
to
1fe4a25
Compare
still waiting as prefer to see what census decides before continuing this: census-instrumentation/opencensus-specs#200 |
Most notably, there is an api for filtering proposed. This is something that's currently possible in brave anyway as the "getter' concept can internally implement filtering for any key (trace id or a tag) OTOH has discussed in this PR, the fields are likely to be concatenated into a single header.. propagation decision logic makes sense. Also, while this we are interested in what census does, Kamon has already rolled out an impressive feature (hats off @ivantopo @dpsoft) kamon-io/Kamon#552 What we do here will consider all angles in other words (cc also long time friend @nicmunroe) |
This lets you use the extra field propagation mechanism to propagate in, but not out of process. This can be used to correlate properties with logs in a way that doesn't add network overhead. This design was chosen as it incurs very little in-process overhead and has the least burden on the code base. It can be used as a stop-gap until a more fully featured correlation context feature is developed. Fixes #929 See also #577
This lets you use the extra field propagation mechanism to propagate in, but not out of process. This can be used to correlate properties with logs in a way that doesn't add network overhead. This design was chosen as it incurs very little in-process overhead and has the least burden on the code base. It can be used as a stop-gap until a more fully featured correlation context feature is developed. Fixes #929 See also #577
fixed by #1130 |
Overview
This change aims to support correlation fields which may be request-scoped (and propagated downstream), or locally scoped (propagated in process, but not to the next). Example use cases include request-scoped fields like current game or country code, or local fields, such as the current operation name. Where cross-process propagation is desired, the ideal encoding is the w3c Correlation Context.
Background on existing work in Brave
The first propagation mechanics of brave included propagation of the TraceContext in and to the next process. This object's duty is to place spans in the correct position in a potentially distributed operation (via parent IDs). Secondary duties include consistent sampling decisions, and correlation with other contexts such as logging. The data is mostly the same as the trace-context header being defined in w3c.
Brave has two roles in propagation: making data visible in-process, and injecting and extracting data out of process.
The CurrentTraceContext plugin synchronizes a
TraceContext
struct with implicitly visible state, such as a thread local or a logging context. Users can then callTracer.getCurrentSpan
for example, or read the logging context variabletraceId
. This in-process propagation is critical for correlation between systems, as often libraries have no other means to coordinate.The Propagation plugin connects traces across message boundaries by injecting or extracting the trace context from headers. For example, an incoming trace from another process is extracted from headers, the tracer creates a child, and later instrumentation use the above
CurrentTraceContext
to synchronize the current trace ID with whatever contexts that need to see it.Notably due to demand created by OpenTracing and Sleuth, we have had requests to push arbitrary data along with the trace-context (implicitly via the above apis described). We've also had a number of non-arbitrary requests to propagate extra data. For example, domain-specific data like the current playstation game, or extra properties needed for Amazon X-Ray.
Towards that end, we added an extension to our trace context object called "extra". Extra is a list of opaque data plugins can use to ensure their propagation needs are met.
We also added a default implementation that handles the common use-case of sending a mix of system and user fields. Here's an example:
Setup your tracing instance with allowed fields:
Later, you can call below to affect the country code of the current trace context
Why a change is needed
The current code is ok, but limited in implementation which makes it difficult to integrate with metrics contexts and scopes smaller than trace.
Using
ExtraFieldPropagation
, correlation with other fields involve the following trade-offs:ExtraFieldPropagation
(and OpenTracing) propagating something in-process also means adding overhead out of processExtraFieldPropagation
is harder to manage systematically as it uses a separate header per fieldExtraFieldPropagation
coupling with propagation fields implies a whitelist approachExtraFieldPropagation
as it isn't a great user-level apiFor example, if you propagate the span name, to export changes to it for log correlation, you'll need to customize the log correlation plugins and also end up with another header on the wire that isn't useful on the other side.
On the other hand, with a
CorrelationFields
type, we can decouple the usage of propagated tags, their policy and their encoding scheme (defaulting to the correlation-context header). LikeExtraFieldPropagation
, this could be added to B3, allowing a clean transition. In fact,ExtraFieldPropagation
could be an implementation. At the end of the day, we end up with flexibly scoped propagation code with coherent usage patterns for context synchronization.Scoping constraints
An understood limitation is that since this api exists in the scope of Brave (a tracing api), Correlation fields cannot be added before a trace context has been created. For example, an integration with metrics context could imply ordering metrics after tracing, eager provision of an unstarted trace context, or an integration to copy currently scoped metrics tags into the a newly provisioned
CorrelationFields
object.See also
A lot of this work is biased towards the emerging w3c Correlation Context specification.
This also has feature overlap (and integration possibilities) with Baggage (both the original and OpenTracing) and OpenCensus Tags. Note that feature overlap isn't 100% between any of these.