Equivalent of "logLevel" for spans #3205
This is very interesting to me when thinking about PHP's auto-instrumentation. When developing auto-instrumentation implementations, we currently lack a mechanism to optionally create spans for certain methods which are possibly interesting but also potentially frequent, eg open-telemetry/opentelemetry-demo#659 or a database query's fetch operation. I've been thinking over the past couple of days about whether to invent a variable to control this, something like an OTEL_TRACE_VERBOSITY environment variable. I don't mind the idea of a priority or importance as a span attribute which could be used to ignore a span (provided that it doesn't break parent/child relationships, e.g. if an ignored span happened to have a child span that was not ignored?)
Generally, it's a good idea to have a priority or level. One of the problems with this approach is deciding for the user: for some use cases the same span generated from the same call might have high priority, but low priority for others; in other words, priority might depend on the context. @brettmc recently I've been thinking more and more about configurability per specific call. Having said that, I think that some calls should be mandatory.
I found another use-case that could fit in here... Using auto-instrumentation to create spans for some low-level i/o functions (eg writing to a file/stream) can have a side-effect of creating unwanted spans when doing things like emitting a log after an http request is processed. Because the logging happens outside of the http request processing, these spans become new root/parent spans, which is slightly annoying.
As a work-around for this, I discovered that with a bit of hackery I could create and activate a valid, non-recording span around the post-request code, which has the effect of collecting and silencing a bunch of annoying i/o spans. Applying the concept of span attributes here could achieve the same result - in this case, I'm imagining creating a span at debug/silly/off/ignore level, and leaving it to the sampler to decide whether to ignore them or not.
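For comparison, here is a rough Go sketch of the same silencing trick, assuming a ParentBased sampler is configured; doPostRequestWork and the placeholder IDs are hypothetical, not part of any API:
package main

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

// doPostRequestWork stands in for the post-request code that would otherwise
// emit noisy i/o spans (hypothetical helper).
func doPostRequestWork(ctx context.Context) {}

func main() {
    // Build a valid but *unsampled* span context and activate it. Under a
    // ParentBased sampler, spans started from this context are non-recording,
    // which silences the noisy children without breaking parent/child links.
    sc := trace.NewSpanContext(trace.SpanContextConfig{
        TraceID:    trace.TraceID{0x01}, // placeholder IDs for the sketch
        SpanID:     trace.SpanID{0x01},
        TraceFlags: 0, // sampled flag deliberately not set
    })
    silencedCtx := trace.ContextWithSpanContext(context.Background(), sc)
    doPostRequestWork(silencedCtx)
}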
@brettmc @pdelewski thanks for sharing your thoughts on this!
This is probably the hard issue that needs to be solved, but solving it will eventually give the true benefit of a decent solution. I think there is already some implicit priority within spans (root spans are important; spans of SpanKind Client, Server, Consumer, Producer are also important since they represent cross-service communication; spans with SpanKind Internal might be skipped more often, etc.), but as @brettmc outlined there are also corner cases like that example of i/o functions, or some end-users complaining that auto-instrumentation creates too many spans, etc.
Sampling that works both on the broader outcome of a trace (tail-based sampling) and on individual spans (is this span or sub-span interesting?) would be an awesome way to cut down noise. The idea of introducing a priority value to spans would be a neat way to get there.
I would also be very interested in this functionality. I've been working on a project and basically using tracing instead of logging to debug and understand performance. But my app is a batch job that processes 5000+ items per request (each iteration over an item consisting of multiple steps and tasks). While developing I've added nested spans under a request down to a pretty low level, but in production that would create extremely verbose traces. With just two instrumented spans I'm generating over 10000 spans for a single request trace. Using goroutines I even hit the default batch limit and spans were being dropped; I increased the limit and solved that specific problem. But at this point I have started commenting out span code in select functions for prod deployment. I could probably wrap my spans in a flag. An extremely simple example is a function I was gathering spans from during dev: there's probably little value to that span in production, but it really helped me visualize the steps in my process and debug as I was iterating on the app.
Supporting a Tracing Level would be extremely useful to allow controlling how detailed a trace is. It would be preferable not to remove instrumentation just because it is too noisy for production. I feel like this pattern is well established in traditional logging. I would love to see it extended to tracing. <edit: removed fictitious code example by ChatGPT>
Looking closer at opentelemetry-go, it looks like I might be able to add the functionality I want by writing a custom SpanProcessor that would drop a span if it has a specific level attribute. Using this as an inspiration: https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/span_processor_filter_example_test.go#L51 I might create something like this. Pseudo-code, just trying to capture the concept:
I'll experiment with this and let you know if it works as hoped.
I was able to implement the functionality I was looking for in Go by creating a custom SpanProcessor. I can specify a level (eg INFO) and spans below that level (eg DEBUG) can remain instrumented but are not exported. I basically followed the pattern used in the filter_example I linked above. This seems a little involved and I don't really like having to explicitly code the batch exporter into the span processing chain. This is not a clearly documented configuration and it breaks if an exporter is added as an option to NewTracerProvider(), ie all spans get exported. Maybe there is a cleaner approach, but TBH the otel libraries and docs are a bit challenging to grok. I'm open to suggestions for improving this. It would also be nicer if the attribute were a string, eg "trace_level": "DEBUG"; an int isn't as clear when reviewing spans. I could probably create a map[int]string of level names/numbers and update the attribute in the filter. Anyways, I'm happy I can toggle spans without removing or commenting out code.
// Begin TraceLevel Implementation
// Assumed imports for this snippet (omitted in the original): "context", "regexp",
// "go.opentelemetry.io/otel/attribute", "go.opentelemetry.io/otel/sdk/resource",
// and the SDK trace package "go.opentelemetry.io/otel/sdk/trace" (referenced below as trace).
// The ownerEmail example at the end also uses trace.WithAttributes from the API package
// "go.opentelemetry.io/otel/trace"; in a single file the two packages would need distinct aliases.

// Define some consts for convenience.
const (
    traceLevelKey   = "trace_level"
    traceLevelTRACE = 0
    traceLevelDEBUG = 1
    traceLevelINFO  = 2
    traceLevelWARN  = 3
    traceLevelERROR = 4
    traceLevelFATAL = 5
)

// This could be set from a config or env var
var traceLevel = traceLevelINFO

// The struct handling the filtering logic
type TraceLevelFilter struct {
    // Next is the next SpanProcessor in the chain.
    Next trace.SpanProcessor
    // Level is an int similar to a numeric LogLevel
    Level int
}

func (f TraceLevelFilter) OnStart(parent context.Context, s trace.ReadWriteSpan) {
    f.Next.OnStart(parent, s)
}
func (f TraceLevelFilter) Shutdown(ctx context.Context) error { return f.Next.Shutdown(ctx) }
func (f TraceLevelFilter) ForceFlush(ctx context.Context) error {
    return f.Next.ForceFlush(ctx)
}
// The actual filtering logic
func (f TraceLevelFilter) OnEnd(s trace.ReadOnlySpan) {
    for _, a := range s.Attributes() {
        if a.Key == traceLevelKey {
            if int(a.Value.AsInt64()) < f.Level {
                // Drop spans below the requested level
                return
            }
            break
        }
    }
    // Continue to the next SpanProcessor (spans without a trace_level attribute are kept)
    f.Next.OnEnd(s)
}
// setupTraceProvider sets up all the necessary pieces and should be called early in main()
func setupTraceProvider() *trace.TracerProvider {
    // exporter is assumed to be created elsewhere (e.g. an OTLP or stdout exporter);
    // it was not shown in the original snippet.
    filter := TraceLevelFilter{
        Next:  trace.NewBatchSpanProcessor(exporter, trace.WithMaxQueueSize(10000)),
        Level: traceLevel,
    }
    tp := trace.NewTracerProvider(
        trace.WithResource(resource.Default()),
        trace.WithSpanProcessor(filter),
    )
    return tp
}
// End TraceLevel implementation code
// Example usage of traceLevel attribute
// ownerEmail is a noisy function. Tracing is instrumented but spans are only exported when filter is set to traceLevelDEBUG or lower.
func ownerEmail(ctx context.Context, owner AppOwner) string {
    var userEmail string
    // Start a child span from the parent context. This span should only be exported at debug level.
    _, span := tracer.Start(ctx, "ownerUser",
        trace.WithAttributes(
            attribute.String("owner", owner.DisplayName),
            attribute.Int(traceLevelKey, traceLevelDEBUG), // <- Setting the level via an attribute
        ))
    defer span.End()
    // search for email in owner fields
    m := regexp.MustCompile(`^([a-zA-Z0-9_\-\.]+)@example.com`)
    // For now we are just checking UserPrincipalName
    userEmail = m.FindString(owner.UserPrincipalName)
    span.SetAttributes(attribute.String("userEmail", userEmail))
    return userEmail
}
This feature would be great as it would bring consistency to the observability APIs, providing the same verbosity concept for developers when capturing logs and traces. The Jenkins OpenTelemetry Plugin has been happily using this pattern a lot for 12+ months:
Tracer tracer = logger.isLoggable(Level.FINE) ? this.tracer : TracerProvider.noop().get("noop");
Span span = tracer.spanBuilder("a-span-for-detailed-troubleshooting").startSpan();
...
See examples here: https://github.com/jenkinsci/opentelemetry-plugin/blob/opentelemetry-2.11.0/src/main/java/io/jenkins/plugins/opentelemetry/job/log/OverallLog.java
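A rough Go sketch of the same pattern, assuming a verboseTracingEnabled configuration flag (hypothetical, e.g. derived from an env var):
package example

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/trace"
    "go.opentelemetry.io/otel/trace/noop"
)

// verboseTracingEnabled is an assumed application-level flag, not an
// OpenTelemetry API; it could be read from an env var or config file.
var verboseTracingEnabled = false

func tracedOperation(ctx context.Context) {
    // Pick a real or no-op tracer based on verbosity, mirroring the
    // logger.isLoggable(Level.FINE) check in the Jenkins plugin.
    var tracer trace.Tracer
    if verboseTracingEnabled {
        tracer = otel.Tracer("my-instrumentation")
    } else {
        tracer = noop.NewTracerProvider().Tracer("noop")
    }
    _, span := tracer.Start(ctx, "a-span-for-detailed-troubleshooting")
    defer span.End()
    // ... detailed troubleshooting work ...
}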
As far as I understand, the approaches above all introduce broken traces, where a child span ends up referencing a parent that was dropped. Suppose a simple trace like A -> B -> C: if you drop B, then C points at a parent that no longer exists. The appropriate place to make these choices is a Sampler, at span creation time. The issue here imo is a combination of noisy instrumentation (which should be addressed at the specific instrumentation level, not at the broader api/sdk level), and a lack of understanding of how to write custom samplers. I don't think we should add cruft to the span or trace data model to work around this. Instead, users should open issues with the instrumentation they find noisy, and more education ought to occur wrt how to write and compose custom samplers.
@ericmustin You are correct that my approach results in orphaned spans if a parent is dropped but a child is not. But I have run into that problem for a variety of other reasons as well. To generate useful traces it's on me to architect my span relationships, pass context, and add attributes correctly. At this point I don't have an unimportant parent with important children. In my use case I'm excluding spans from the lowest child up, as a means of controlling the depth of visibility. It's possible that as I work with more complex traces this approach will break down. I will also look into Samplers as an alternative, so spans get filtered at creation time. I will add that one of the reasons the TraceLevel is relevant is because I'm using spans where I'd traditionally use log messages and LogLevels. Perhaps this is not the intent of tracing, but spans provide significantly greater value as they capture timing, express relationships, and can have tags, status and events. It remains to be seen if this is a sustainable usage pattern.
What? OTel Go maintainer here and this is not at all how things work. Is this supposed to be a description of how you would like things to work? If so, it needs to be clarified. |
@Aneurysm9 I think you missed the preamble:
Yes, this is a problem you introduce with that. My assumption was that there might be a way to resolve this to
I read through Sampling and indeed I can see how this might be accomplished (i.e. use SpanKind
The initial premise of this issue was that there are examples in community-built automatic instrumentations (e.g. PHP) which are noisy, and that some users might like to see that noise while others might not be interested in it. Sure, the PHP auto-instrumentation can come up with a one-off solution, but I assume the same can be relevant for other auto-instrumentations.
Yes, that's the case for me
To be clear, I'm not saying that Samplers alone today can provide the functionality you're hoping to provide (complete traces with arbitrary internal chunks removed). I'm saying that "log levels" would generate fundamentally "broken" traces (ie, incomplete traces with arbitrary internal chunks removed), which I feel is an anti-pattern and should not be encouraged via specification. It doesn't help that there are huge blobs of totally imagined chatgpt code being thrown around in this issue 🙃 (would you mind editing your comment and removing that bit please @hmm01i? TIA)
Rather than pushing forward this anti-pattern, I think the first option should be to open issues with the "noisy" instrumentation and improve those instrumentations, such as by adding configuration to toggle on/off the "debatably" useless spans, or changing certain nested spans into span events on their parents, etc. Here's an example of how we've added such configuration in Ruby. If that isn't enough, I think the second option should be leaning into Sampler configuration to control cost. Right now the appropriate way to manage noisiness is via a Sampler, which guarantees at least "complete" traces, and also ensures the reduced overhead of "non-recording spans", although it may not offer the flexibility to perfectly paper over bad/noisy auto-instrumentation.
If neither of those solutions is practical, then I think the correct approach to address this at a specification level is something like what @brettmc suggests here wrt verbosity, but on a dynamic, per-trace basis. This would also require modifications to the sdk tracer to handle re-parentage, the introduction of new sdk samplers that respect specific baggage keys that dictate verbosity, and an agreed definition of what the appropriate flavors of verbosity are and how that translates to specific span kinds (and potentially combinations of span kinds and span attributes). I'm not sure whether this would all be additive changes to the specification, and the burden on individual language SIGs would be pretty high to implement all this functionality, so I'm hesitant to encourage a proposal along those lines when it's not clear the first two options I mentioned have been exhausted. So I will, again, politely encourage exhausting the first two options (improve instrumentation and sample more). Admittedly we leverage an internal implementation of verbosity at the moment that takes the approach I've outlined above; I can inquire about whether that's something we want to OSS (I am not the author of this work).
The levels (or any other solution) do not lead to broken traces; they are first of all just an annotation on a span like any other attribute. Like any other attribute they can be used to make sampling decisions (they could also be used in a backend to filter spans based on that attribute, similar to any other attribute). The question is: are they valuable or is there maybe even something better?
Agreed with that; while I appreciate the time you spent researching that @hmm01i, I think it can be misleading for someone reading this issue.
I disagree with that, there is nothing to fix: what is "noise" and what is "not noise" lies in the eye of the end-user. The examples from the PHP auto-instrumentation shared above are a good example: sometimes the extra spans from the framework are helpful, sometimes not; it's not even always the same for the same person. It all depends on what you want to accomplish. Furthermore, the person doing the implementation of the tracing might not be the same person consuming the telemetry, so at implementation time it might not even be clear what is "noise" and what is relevant. The configuration of that is so far a language-specific out-of-spec solution, so part of the answer I am looking for could be a standardized configuration for that. @brettmc suggested something similar as well (i.e. introducing a variable like OTEL_TRACE_VERBOSITY).
If I have 100 traces with 100 spans each, I can reduce the amount of data either by sampling the number of traces or by reducing the number of spans. Right now I can only do sampling of traces, which seems incomplete to me.
I have come to realize that my issue title was not chosen wisely because the "logLevel" is just part of the solution, not the problem. What I am looking for is exactly that per-span sampling capability. Would "Per Trace Sampling" or "span-based sampling" be a more appropriate title?
I don't think it is required to do all of that:
I agree that this might add significant additional functionality, but I neither see improved instrumentation nor "sample more" as the solution to this problem. Just to add that, "out of scope" or "not right now" is not the answer I am looking for, but I anticipate that this might be the case.
That would be interesting to see indeed.
While I think it would make sense to append an attribute signifying that a span has had some type of verbosity sampling applied to it, at the time of the sampling decision (using a Sampler), what I understand is that you're proposing we add attributes signifying choices a sampler could/should make out of process. I disagree with this approach, mainly because I don't think it's possible to reliably do re-parentage out of process, at least via any of the existing processors in the otel-collector (maybe I am wrong here?)
In the same way that a user may want verbosity to mean only keeping producer/consumer and client/server spans, some users may want to only keep client spans that are of a certain type, for example only those that contain
Sure, fair enough, I'll follow up with the Ruby SIG
Span verbosity came up a lot in discussions around open-telemetry/oteps#172. One finding there was that custom samplers do not allow suppressing spans, as they still create a new span and context, just without recording it. So in theory verbosity can be implemented with samplers if verbosity is an attribute on the span which is set at start time and therefore visible to samplers, but we'd need another sampling result like 'suppressed' and define what it means (no new context, no attributes attached, etc).
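For illustration, a minimal Go sketch of such an attribute-based sampler (the attribute key, level values, and type names here are assumptions, not semantic conventions); as noted above it can only drop or record, it cannot truly suppress span/context creation:
package example

import (
    "go.opentelemetry.io/otel/attribute"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    "go.opentelemetry.io/otel/trace"
)

// VerbositySampler drops spans whose "trace.verbosity" attribute (an assumed,
// non-standard key that instrumentation would have to set at span start time)
// is below MinLevel. Spans without the attribute are sampled normally.
type VerbositySampler struct {
    MinLevel int64
}

func (s VerbositySampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
    psc := trace.SpanContextFromContext(p.ParentContext)
    for _, a := range p.Attributes {
        if a.Key == attribute.Key("trace.verbosity") && a.Value.AsInt64() < s.MinLevel {
            // Dropping here means children under a ParentBased sampler are
            // dropped too; the span and its context are still created.
            return sdktrace.SamplingResult{Decision: sdktrace.Drop, Tracestate: psc.TraceState()}
        }
    }
    return sdktrace.SamplingResult{Decision: sdktrace.RecordAndSample, Tracestate: psc.TraceState()}
}

func (s VerbositySampler) Description() string { return "VerbositySampler" }
It could then be installed with something like sdktrace.WithSampler(sdktrace.ParentBased(VerbositySampler{MinLevel: 2})), so that children of dropped spans are also dropped.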
@ericmustin I've removed the ChatGPT code as requested. I can see how it would be confusing. I have a couple of additional thoughts on this whole thing.
A question related to 3: with the high value of traces and the ability to add events to a span, what is the value of traditional logging? In my most recent project I've basically replaced most of my logging with detailed spans. This may be an unconventional use of tracing, but traces provide so much more value than logs that I could easily see myself almost completely replacing logging with tracing in the future. (I suppose the main difference would be that logs are generally more concise and human readable when sent to stdout.)
Let me share my thoughts. The whole discussion regarding this topic started |
Following up on this after a while, hoping to reignite the discussion: From what I read throughout the comments, there are different situations where a span might be removed from a trace:
For all of these, the main reason for doing this is to reduce the amount of data being generated and preferably remove noisy spans, all before the data hits the backend and is stored somewhere. I believe that this way of reducing data can be an extremely valuable addition to trace-based sampling (dropping the whole trace), but I owe you some data to back up this assumption. There are now different potential solutions for that, which have been proposed throughout the discussion:
The challenging ones are those which decide at runtime whether a span is recorded (3?, 4, 5, 6, 7), because by removing a span, relationships could be broken:
For (1) there is probably a requirement to climb up the tree and change the parent of a child span to its parent's parent. (2) is either a non-issue (if a trace is dropped the attributes are also gone) or there is maybe an opportunity to "bubble them up" while changing the parent. This could for example be relevant if a span representing a network connection (HTTP) has some sub-spans for DNS, TLS and other connection-level interactions that might not be relevant all the time. For (3) and (4) I am wondering if/how this is solved with trace-based sampling today? What I mean: if a trace is dropped by a sampler, what happens to potential span links or logs/metrics that point to that trace?
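To illustrate the climbing described for (1), here is a rough sketch over a simplified in-memory span tree; the types and the pass itself are hypothetical, not an existing SDK or collector processor:
package example

// exportedSpan is a simplified, hypothetical stand-in for a collected span;
// it is not an OpenTelemetry type.
type exportedSpan struct {
    SpanID   string
    ParentID string // empty for root spans
    Dropped  bool   // marked unimportant by some verbosity rule
}

// reparent removes dropped spans and points each surviving child at its
// nearest non-dropped ancestor, so the tree stays connected.
func reparent(spans []exportedSpan) []exportedSpan {
    byID := make(map[string]exportedSpan, len(spans))
    for _, s := range spans {
        byID[s.SpanID] = s
    }
    kept := make([]exportedSpan, 0, len(spans))
    for _, s := range spans {
        if s.Dropped {
            continue
        }
        parent := s.ParentID
        // Climb up the tree past dropped ancestors.
        for parent != "" && byID[parent].Dropped {
            parent = byID[parent].ParentID
        }
        s.ParentID = parent
        kept = append(kept, s)
    }
    return kept
}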
I've played around with custom samplers a bit and one thing that seems unfortunate is that
@adriangb The sampler intentionally does not receive the span ID, see #621. But I wonder if the arguments from there are still usable, after all an unsampled span also receives its own span ID, doesn't it? CC @carlosalberto
Thanks for linking me to that! It seems to me like the reasoning was that there was no use case, so why not remove it? As you say, the span ID is generated anyway, so even if generating it were expensive, the fact that the Sampler doesn't receive it changes nothing. If so, I agree this use case seems like a strong argument to add it back. And to be clear, by "this use case" I don't mean any specific implementation but rather giving users the ability to implement a custom Sampler along the lines of what's been discussed here, in #3725 and other places.
@adriangb span attributes are already available for making sampling decisions (and the spec mentions that span attributes should be set early so that they are available to custom samplers). I might be missing something, but the main use-case for having the trace id available is ratio-based sampling. If you also tried to do ratio-based sampling based on a span id, and based on that you dropped some spans within a trace, then that would lead to broken traces, wouldn't it? (which is one of the potential issues identified up-thread here)
Yes, they are, but you can only do all-or-nothing sampling: you don't have any source of randomness that you can sample a ratio from. The reason to make the span ID available is to allow span-ID ratio-based sampling based on an attribute.
If you do it in an exporter, yes. If you do it in a Sampler and return a non-recording span, then all child spans are also not sampled; thus you prune an entire subtree of spans and the trace is not broken.
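To illustrate the point, a hypothetical helper showing how a span ID could serve as the randomness source for per-span ratio sampling; the Go SDK does not currently pass the span ID to ShouldSample, so this cannot be wired into a Sampler today:
package example

import (
    "encoding/binary"
    "math"

    "go.opentelemetry.io/otel/trace"
)

// shouldKeep keeps roughly `ratio` of spans by comparing the span ID,
// interpreted as an unsigned 64-bit integer, against a threshold (similar to
// how TraceIDRatioBased derives its decision from the trace ID).
func shouldKeep(spanID trace.SpanID, ratio float64) bool {
    if ratio >= 1 {
        return true
    }
    if ratio <= 0 {
        return false
    }
    x := binary.BigEndian.Uint64(spanID[:])
    return x < uint64(ratio*float64(math.MaxUint64))
}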
Hi folks,
I opened https://github.com/open-telemetry/opentelemetry-python/pull/3574 as a proposal to address these feature requests. I'd love some feedback on it.
I was told to discuss that PR here further. Does anyone have feedback on it? As far as I can tell it allows resolving all of the use cases presented here with minimal code changes.
Thanks for creating a prototype @adriangb, I am happy to take a look next week when normal operations continue :)
Hi @svrnm did you have a chance to look? Thanks!
Yes, I did. I think the feature I was asking for is different from what you are looking for: your proposal focuses on having the span_id available while sampling, while my initial ask was around "sampling" individual spans (and potentially even keeping their child spans) based on different criteria. So I can't say much about the span_id being available during the shouldSample method, beyond the issues @Oberon00 pointed out already.
The point of having the span id available is that it gives you something random to sample off of. Consider that you could write (or eventually OTel could ship) a sampler that samples "debug"-level spans at 0% and anything above at 100%. Currently, you can write a sampler that has access to attributes and could use that to sample individual spans, but it's all or nothing; it can't statistically sample spans. This approach is compatible with that feature request but is more general, e.g. allowing a sample rate to be set for a specific span.
Keeping child spans is going to be a problem for any approach because, as others have mentioned, there's no easy way to avoid "broken" traces with gaps in the span tree.
I am fully aware of that, but this is the subject of this issue.
Perhaps that got lost in the discussion. Several issues point to this issue and, for better or worse, it is the most popular issue for something like a log level for spans, which certainly doesn't require that child spans be kept, and from a re-skim of the title and OP I can't tell that this issue is strictly about that either. Would it not be useful to you at all if a non-sampled span (say
You are right, the initial issue description does not indicate that; for me it was an implicit part of the original problem statement because of the initial example from PHP (see screenshots here: open-telemetry/opentelemetry-demo#659): the verbose span sub-tree has 14 spans, the not-so-verbose one has 3, but they are not exclusively at the top of that sub-tree (POST /getquote is, but not {closure} and calculate-quote), so even if they had some concept of which ones are important and which ones are not, you could not simply drop an unimportant span and then decide that all of its children are unimportant as well.
I do see how that could be useful, but I also see how it will be more difficult to get across the finish line. Also I don't want to take over your issue. Do you want to edit the title and description to make the use case clearer, or should we open a new issue?
@svrnm do you want to edit this issue to make it clearer in one direction or another? And if you want to keep the "child spans when parents are not sampled" aspect of this, please let me know so we can move the rest of the discussion to a different issue. Thanks.
Done, see the updated description of this issue.
What are you trying to achieve?
Note: Maybe this has been discussed before and I didn't find the conclusion/reason behind it, so please point me to it if so.
A very common property of logs is the level (info, warn, debug, error, ...). This level is useful for plenty of applications, so I was wondering if an equivalent would be valuable for spans:
What did you expect to see?
This could be addressed either by updating the Span Status to allow more values, or by adding an additional attribute or semantic convention. Based on the fact that the tracing API is stable, I assume mapping this to a semantic convention would be best.
Additional Context
There might be alternative approaches to this, like setting a priority or importance on a span. I just wanted to get some feedback on whether this might be a valuable addition to spans & traces.
Updated 2024/01/26
Based on the discussions a key point of this issue is the need for being able to drop a partial span tree, e.g.
Span 1 (important) -> Span 2 (not important) -> Span 3 (not important) -> Span 4 (important)
would become
Span 1 -> Span 4.
As stated throughout the discussions, this is not a trivial thing to do, but my hope is that there are potential ways to accomplish the same and that this issue can provide a place to discuss them.