Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds CorrelationFields in support of in and between process propagation #577

Closed
wants to merge 1 commit into from

Conversation

codefromthecrypt
Copy link
Member

@codefromthecrypt codefromthecrypt commented Jan 16, 2018

Overview

This change aims to support correlation fields which may be request-scoped (and propagated downstream), or locally scoped (propagated in process, but not to the next). Example use cases include request-scoped fields like current game or country code, or local fields, such as the current operation name. Where cross-process propagation is desired, the ideal encoding is the w3c Correlation Context.

Background on existing work in Brave

The first propagation mechanics of brave included propagation of the TraceContext in and to the next process. This object's duty is to place spans in the correct position in a potentially distributed operation (via parent IDs). Secondary duties include consistent sampling decisions, and correlation with other contexts such as logging. The data is mostly the same as the trace-context header being defined in w3c.

Brave has two roles in propagation: making data visible in-process, and injecting and extracting data out of process.

The CurrentTraceContext plugin synchronizes a TraceContext struct with implicitly visible state, such as a thread local or a logging context. Users can then call Tracer.getCurrentSpan for example, or read the logging context variable traceId. This in-process propagation is critical for correlation between systems, as often libraries have no other means to coordinate.

The Propagation plugin connects traces across message boundaries by injecting or extracting the trace context from headers. For example, an incoming trace from another process is extracted from headers, the tracer creates a child, and later instrumentation use the above CurrentTraceContext to synchronize the current trace ID with whatever contexts that need to see it.

Notably due to demand created by OpenTracing and Sleuth, we have had requests to push arbitrary data along with the trace-context (implicitly via the above apis described). We've also had a number of non-arbitrary requests to propagate extra data. For example, domain-specific data like the current playstation game, or extra properties needed for Amazon X-Ray.

Towards that end, we added an extension to our trace context object called "extra". Extra is a list of opaque data plugins can use to ensure their propagation needs are met.

We also added a default implementation that handles the common use-case of sending a mix of system and user fields. Here's an example:

Setup your tracing instance with allowed fields:

tracingBuilder.propagationFactory(
  ExtraFieldPropagation.newFactoryBuilder(B3Propagation.FACTORY)
                       .addField("x-vcap-request-id")
                       .addPrefixedFields("baggage-", Arrays.asList("country-code", "user-id"))
                       .build()
);

Later, you can call below to affect the country code of the current trace context

ExtraFieldPropagation.set("country-code", "FO");
String countryCode = ExtraFieldPropagation.get("country-code");

Why a change is needed

The current code is ok, but limited in implementation which makes it difficult to integrate with metrics contexts and scopes smaller than trace.

Using ExtraFieldPropagation, correlation with other fields involve the following trade-offs:

  • In ExtraFieldPropagation (and OpenTracing) propagating something in-process also means adding overhead out of process
  • ExtraFieldPropagation is harder to manage systematically as it uses a separate header per field
  • ExtraFieldPropagation coupling with propagation fields implies a whitelist approach
  • Users would want to wrap ExtraFieldPropagation as it isn't a great user-level api

For example, if you propagate the span name, to export changes to it for log correlation, you'll need to customize the log correlation plugins and also end up with another header on the wire that isn't useful on the other side.

On the other hand, with a CorrelationFields type, we can decouple the usage of propagated tags, their policy and their encoding scheme (defaulting to the correlation-context header). Like ExtraFieldPropagation, this could be added to B3, allowing a clean transition. In fact, ExtraFieldPropagation could be an implementation. At the end of the day, we end up with flexibly scoped propagation code with coherent usage patterns for context synchronization.

Scoping constraints

An understood limitation is that since this api exists in the scope of Brave (a tracing api), Correlation fields cannot be added before a trace context has been created. For example, an integration with metrics context could imply ordering metrics after tracing, eager provision of an unstarted trace context, or an integration to copy currently scoped metrics tags into the a newly provisioned CorrelationFields object.

See also

A lot of this work is biased towards the emerging w3c Correlation Context specification.

This also has feature overlap (and integration possibilities) with Baggage (both the original and OpenTracing) and OpenCensus Tags. Note that feature overlap isn't 100% between any of these.

@codefromthecrypt
Copy link
Member Author

cc @felixbarny @wu-sheng @marcingrzejszczak @devinsba @llinder @jcchavezs @basvanbeek @shakuzen @nicmunroe @jkschneider (and any others) please check me on commentary etc.

Actual implementation will follow shortly as homework (the hard part) is done

}
};

static final class Real extends CorrelationFields {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PS I am starting with LHM eventhough it has more memory cost than array-offset indexing.

/**
* Retrieves a correlation key by name or returns null if not found.
*
* @throws IllegalArgumentException if the input name is not valid
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is out-of-date. throwing is bad as it can break users (who are likely to ignore this). Current impl just silently drops on problems.

}

private static boolean log(String message) {
return false;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO add logger?

@codefromthecrypt
Copy link
Member Author

moved the huge rationale section to where the other rationale are in the brave README. Added CurrentCorrelationFields which makes this DI friendly

* Census tags are types, encouraging prevalidation and use as constants.
* Sharing mutable data across threads can lead to inconsistent reads of fields.
* For example, child threads affect the same state their parent reads.
* Immutable tag contexts eliminate this

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth mentioning that immutable tags eliminate this, at the cost of having to manage scopes for the TagContext.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point.. author died in the middle of this point i think

* For example, child threads affect the same state their parent reads.
* Immutable tag contexts eliminate this
* A tracer api is more complex than a tagger api, and has an ordering concern.
* To contribute tags, you need to have tracing start before metrics.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure I get the point of bringing up metrics here. Ideally only a subset of span tags (the ones ensured to be low cardinality + http.template) and none of these seem to be related to correlation fields. What's the link I'm missing here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah wording was a bit crap. (tags are the census name for correlation fields).

The jist is that if the correlation api is separate, it can scope fields before or even if tracing doesn't exist. One use case is to use the same correlation field in metrics and tracing (and further down the line, such as country-code)

To have metrics re-use correlation fields from tracing, for example, it would likely need to start after tracing started (or at the same time). Similar problem exists in logging. If you want logs to contain a correlation field, that field should be scoped as early as possible. Otherwise, you see request handling without and MDC value until sometime later when tracing kicks in.

The idea is that a tagging api could be made at the head of a request, and be simpler or at least less dangerously, ensuring that whether it is tracing metrics or logging (or something else), they all read exactly the same values, with no ordering interdependency apart from tagger.

What I'm calling out here is that if we assign that responsibility to brave (for convenience because parsing, inject extract propagation etc is already handled), there is at least an interdependency problem, which is most easily sorted if trace starts before other subsystems on a request, or at least as early as possible.

Now.. if I did clear up anything (mildly possible), could you help me revise it into a shorter bullet or two? :)

@aldex32
Copy link
Contributor

aldex32 commented Jan 16, 2018

Hi guys,

is it possible to have the option of setting a custom header by using ExtraFieldPropagation?

@codefromthecrypt
Copy link
Member Author

@aldex32 yeah I'd expect ExtraFieldPropagation to use this feature for mutability

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Jan 17, 2018

@aldex32 I'm going to address the mutability of ExtraFieldPropagation independently so that there's less time pressure on this PR. I have some discomfort on local changes, particularly around extracting a context which are not a concern with ExtraFieldPropagation

@codefromthecrypt
Copy link
Member Author

@aldex32 Actually, there's a limitation in the current Propagation design where it can only affect traces that originated remotely. That impacts this, too.. will raise something independently

@codefromthecrypt
Copy link
Member Author

#579 needs to be explored prior to finishing this design. Will work on that

@codefromthecrypt
Copy link
Member Author

Had some chats with folks like @ivantopo and I feel I want to think this through a bit more. I generally don't like how this feels like it needs to hook together (want to see if there's another way to avoid special hooks in the tracer or not).

Meanwhile I've rejigged things in support of prefixed keys (and mutation) of the existing extra field propagation code. I'm fairly happy with this, as the logic is completely optional to those not pushing extra fields. Going to cut that below before picking this up again. #581

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Jan 18, 2018

I've pushed some interim improvements to extra field propagation in brave 4.13.3 to buy time here. Will revisit shortly

@codefromthecrypt codefromthecrypt changed the title WIP Adds CorrelationFields in support of OpenTracing and Sleuth migration Adds CorrelationFields in support of in and between process propagation Jan 22, 2018
@codefromthecrypt
Copy link
Member Author

did a huge revamp on the description. updates around the code will follow next

@shakuzen
Copy link
Member

Just sharing our use case that I'm hoping this can solve: we have an additional ID that we would like to propagate throughout the trace and add to all services' logs for log correlation based on that ID.

@codefromthecrypt
Copy link
Member Author

codefromthecrypt commented Jan 22, 2018 via email

@shakuzen
Copy link
Member

Is that ID constant for all downstream requests?

The service that starts the trace (or possibly an uninstrumented service before the trace starts) should discern and set the ID. It shouldn't be changed.

Also, does that need to be sent in a specific header, or does it matter as long as in-process
correlation occurs?

Things aren't quite set in stone yet, but correlation is the primary goal. As alluded to above, the service initially setting this ID may be uninstrumented and therefore a well-known header is probably the best way to receive this in an instrumented app, was my line of thinking. If everything were instrumented, I guess it wouldn't matter how it is propagated.

@codefromthecrypt
Copy link
Member Author

@SergeyKanzhelev ps if you like please review the PR description eventhough the code has drifted a bit This is in response to notes in the census issue that came out of the same discussion census-instrumentation/opencensus-specs#45

@codefromthecrypt
Copy link
Member Author

feedback from @mpetazzoni included that if this feature didn't require a header-per-field encoding (whitelist), it would be easier to use

In the context of a broader code base where I define the tracer in one place (and then available via DI), it's not really convenient to define that whitelist of fields. Really depends on the service, use-case, etc.

@codefromthecrypt
Copy link
Member Author

I've rejigged the code so that it is compilable now, but it isn't right. I'll try to peck at it over a flight

@codefromthecrypt
Copy link
Member Author

here are some design thoughts:

An implementation of correlation context could be "accumulate and publish later". For example, data is placed into the correlation fields object, but only made visible when the span is re-scoped.

There are generally 3 scopes of data which one might correlate with:

process scoped: ex current amazon instance ID
request scoped: ex origin user id, country code
span scoped: ex span name

To address the scoping of data, we could consider options such as a flag per field, indicating whether it is to be published downstream or not, or if it is to expire with the span. Note that in the Correlation-Context spec, there's been discussion of a ttl flag (interpreted remotely), which for example the value 1 means drop at the next hop.

@codefromthecrypt
Copy link
Member Author

deprioritizing this as during the tracing workshop, there are especially security concerns about adding user data to outgoing requests by default. Basically, what some call "baggage" is not only dangerous to the size of production requests, but also a potentially large security risk (imagine accidentally propagating data to a third party)

We should be careful when implementing it to make 100% optional and something that potentially re-validates per hop whether to send the correlation keys out or not.

In the mean time, users can use the whitelist approach with ExtraFieldPropagation. We may still want to continue with in-process propagation efforts, but at any rate be very careful with remote propagation (cc @garyd203 @SirTyro)

https://docs.google.com/document/d/1YJ0oriH62Ixk3W37z31X0SeuXpgXpXBetiz8EWxG6NU/edit#

@codefromthecrypt
Copy link
Member Author

we'll want to make sure this addresses by api "re-scoping" for example, when someone adds a new correlation field and want it to be immediately visible. Currently, someone needs to call Tracer.withSpan or CurrentTraceContext.newScope to re-scope data (so that it is visible in log or other contexts).

This recently bumped again spring-cloud/spring-cloud-sleuth#927 raised by @dhoffi

@codefromthecrypt
Copy link
Member Author

still waiting as prefer to see what census decides before continuing this: census-instrumentation/opencensus-specs#200

@codefromthecrypt
Copy link
Member Author

Most notably, there is an api for filtering proposed. This is something that's currently possible in brave anyway as the "getter' concept can internally implement filtering for any key (trace id or a tag) OTOH has discussed in this PR, the fields are likely to be concatenated into a single header.. propagation decision logic makes sense.

Also, while this we are interested in what census does, Kamon has already rolled out an impressive feature (hats off @ivantopo @dpsoft) kamon-io/Kamon#552

What we do here will consider all angles in other words (cc also long time friend @nicmunroe)

codefromthecrypt pushed a commit that referenced this pull request Jun 26, 2019
This lets you use the extra field propagation mechanism to propagate in,
but not out of process. This can be used to correlate properties with
logs in a way that doesn't add network overhead.

This design was chosen as it incurs very little in-process overhead and
has the least burden on the code base. It can be used as a stop-gap
until a more fully featured correlation context feature is developed.

Fixes #929
See also #577
codefromthecrypt pushed a commit that referenced this pull request Jul 2, 2019
This lets you use the extra field propagation mechanism to propagate in,
but not out of process. This can be used to correlate properties with
logs in a way that doesn't add network overhead.

This design was chosen as it incurs very little in-process overhead and
has the least burden on the code base. It can be used as a stop-gap
until a more fully featured correlation context feature is developed.

Fixes #929
See also #577
@codefromthecrypt
Copy link
Member Author

fixed by #1130

@codefromthecrypt codefromthecrypt deleted the correlation-fields branch April 2, 2020 08:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants