Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ForceFlush to the TracerProvider Trace API specification #2944

Open
richardmcmillen opened this issue Nov 14, 2022 · 6 comments
Open

Add ForceFlush to the TracerProvider Trace API specification #2944

richardmcmillen opened this issue Nov 14, 2022 · 6 comments
Assignees
Labels
area:sdk Related to the SDK [label deprecated] triaged-needmoreinfo [label deprecated] The issue is triaged - the OTel community needs more information to decide spec:trace Related to the specification/trace directory

Comments

@richardmcmillen
Copy link

richardmcmillen commented Nov 14, 2022

What are you trying to achieve?

I expected to see that the OpenTelemetry Trace API specify that TracerProvider should implement ForceFlush

My proposal is to specify that the Trace API TracerProvider should implement ForceFlush as a no-op which always reports success.

Additional context.

The Trace SDK TracerProvider specification contain sections on Shutdown and ForceFlush

Neither of these are specified in the API.

An issue that was created as a result of some of the above proposals and subsequent PR: #1457

As an instrumentation author, writing instrumentation for libraries. I know when a library will perform work and then exit, with this knowledge I would like to ensure spans are exported by using ForceFlush. However, when this is only implemented in the SDK I can not write my instrumentation code purely against the API.

Here is a discussion I had for opentelemetry-ruby: open-telemetry/opentelemetry-ruby#1401

My understanding that as instrumentation authors we should be depending on only the API, I would consider ForceFlush a method that instrumentation authors would want to perform.

@richardmcmillen richardmcmillen added the spec:trace Related to the specification/trace directory label Nov 14, 2022
@ahayworth
Copy link
Contributor

ahayworth commented Nov 14, 2022

For context, the discussion linked above ☝️ talks about how two of our ruby instrumentation libraries inadvertently rely upon an SDK tracer method (rather than solely using API methods). One such example is here

The motivation for that reliance in the instrumentation is because in some contexts (background jobs, task runners, etc) - the language runtime quickly exits after finishing work and may not live long enough to flush out spans batched for export. The instrumentation wanted to provide a convenience for users to ensure that spans would be exported.

We have options to clean this up on our end (including some that preserve the functionality via ✨ metaprogramming ✨), but we did wonder why these weren't part of the API to begin with.

@reyang
Copy link
Member

reyang commented Nov 18, 2022

Could you use something like this? https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Extensions#autoflushactivityprocessor

@reyang
Copy link
Member

reyang commented Nov 18, 2022

I can understand why folks would want to have ForceFlush in the API, similar to many I/O APIs that provide flush().

I would vote for not having this because the negative impact to the ecosystem:

  1. ForceFlush allows some timeout, it is impossible for instrumentation libraries to know the application developers' intention regarding timeout.
  2. Imagine a developer using dozens of instrumentation libraries and all of them use ForceFlush somewhere, it'll be super inefficient and hard to troubleshoot random latency/contention.
  3. What if ForceFlush failed when the instrumentation library called it? Most libraries will have no idea since the application is the one which decides the expected behavior in most cases.

@reyang reyang added the [label deprecated] triaged-needmoreinfo [label deprecated] The issue is triaged - the OTel community needs more information to decide label Nov 18, 2022
@richardmcmillen
Copy link
Author

richardmcmillen commented Nov 19, 2022

Thanks for the response to the issue @reyang, I’ll have a go at responding to your commented below.

ForceFlush allows some timeout, it is impossible for instrumentation libraries to know the application developers' intention regarding timeout.

I think what you are saying here is that instrumentation libraries can’t assume timeout values. To this point I would suggest that instrumentation library authors should allow this timeout to be configurable. When the instrumentation is installed there would be some mechanism to say, timeout after 5 seconds.

Imagine a developer using dozens of instrumentation libraries and all of them use ForceFlush somewhere, it'll be super inefficient and hard to troubleshoot random latency/contention.

To a large extent I think we trust instrumentation authors to do the right thing. I don’t think ForceFlush will be commonly called in instrumentation libraries. But if an instrumentation author knows the code they are tracing will exit its process, possibly stopping batched spans from being exported, I think it is reasonable to give them the affordance to flush spans. And not put the burden on the end users have to wire this up.

Instrumentation authors could do all sorts of chaotic things that are or are not in the specification. Some guidance on when to make calls to force flush could help with this concern. If the author is in any doubt it would definitely be advisable for instrumentation authors to make the force flushing optional.

For your specific concern, in what circumstance do you think multiple instrumentation authors would be calling ForceFlush? Perhaps I am missing how common this would be.

What if ForceFlush failed when the instrumentation library called it? Most libraries will have no idea since the application is the one which decides the expected behavior in most cases.

I am not quite sure what you are getting at here. Could you expand on this a little? If I was to guess I think you mean, if it is deep in instrumentation code that the force flush fails, the end user applicatoin might have a hard time knowing that this happened. You are concerned about how instrumentation code should behave when a force flush fails, you think this should be a an end user responsibility? Is this correct?

Again, thanks for the discussion. I am not totally across the philosophy of the specification, API vs SDK, so please take all my comments as purely an attempt to improve my understanding of your concerns.

@reyang
Copy link
Member

reyang commented Nov 21, 2022

Short version, I feel ForceFlush is more like an SDK feature, I suggest that we get more evidence before considering it in the API specification, so we know that 1) there is a high demand 2) the outcome is greater than the problem it'll introduce.

@richardmcmillen I suggest that you reach out to other community members and users to collect the requirements and see how important this is.

@ahayworth
Copy link
Contributor

Could you use something like this? https://github.com/open-telemetry/opentelemetry-dotnet-contrib/tree/main/src/OpenTelemetry.Extensions#autoflushactivityprocessor

Maybe? Flushing in general still requires SDK-only methods, so even if we used some kind of processor to accomplish it we'd still call SDK-only methods.

That said, it's a great idea overall - I could see this kind of processor being useful for folks. We might implement it regardless.

[all of your other comments]
I suggest that we get more evidence before considering it in the API specification, so we know that 1) there is a high demand 2) the outcome is greater than the problem it'll introduce.

Your concerns make sense to me; and I agree that we should collect more information to see if anyone else even needs this. It's possible that in trying to make life easy for users, we might have actually gone too far 😆

(And in any case, I think we can work around it in the ruby repo with magic for now, so we can wait to see what the broader community thinks).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:sdk Related to the SDK [label deprecated] triaged-needmoreinfo [label deprecated] The issue is triaged - the OTel community needs more information to decide spec:trace Related to the specification/trace directory
Projects
None yet
Development

No branches or pull requests

5 participants