Infinite trace generator #761

Open
hamiltop opened this issue Mar 14, 2022 · 12 comments
Labels
A-log Area: Issues related to logs A-trace Area: issues related to tracing release:required-for-stable Must be resolved before GA release, or nice to have before GA.

Comments

@hamiltop

When I use the opentelemetry_datadog library with reqwest as the client, each batch of spans will generate an additional 2 spans. These spans look like this:

[[{'service': 'test', 'name': 'opentelemetry-datadog', 'resource': 'encode_headers', 'trace_id': 9313706354471130378, 'span_id': 2181458161362850175, 'parent_id': 0, 'start': 1647298645595811251, 'duration': 156890, 'error': 0, 'meta': {'busy_ns': '66163', 'code.namespace': 'hyper::proto::h1::role', 'code.filepath': '/home/hamiltop/.cargo/registry/src/github.aaakk.us.kg-1ecc6299db9ec823/hyper-0.14.17/src/proto/h1/role.rs', 'code.lineno': '1061', 'idle_ns': '90678'}}], [{'service': 'test', 'name': 'opentelemetry-datadog', 'resource': 'parse_headers', 'trace_id': 11785853278508097401, 'span_id': 9996945950193117364, 'parent_id': 0, 'start': 1647298645596617264, 'duration': 159966, 'error': 0, 'meta': {'code.namespace': 'hyper::proto::h1::role', 'code.lineno': '1183', 'busy_ns': '119852', 'code.filepath': '/home/hamiltop/.cargo/registry/src/github.aaakk.us.kg-1ecc6299db9ec823/hyper-0.14.17/src/proto/h1/role.rs', 'idle_ns': '36137'}}]]

I believe these come from the HTTP request to Datadog that exports the original spans. The request to send these 2 spans then generates 2 new spans, which generate 2 new spans, and so on.

To reproduce:

  1. Run this script as a mock Datadog agent: https://gist.github.com/kongo2002/a4d560dca89e24bf27c8d59dc63f4b1e (it echoes all incoming traces to stdout for easy inspection).
  2. Run this code:
use opentelemetry_datadog::{new_pipeline, ApiVersion};
use tracing::trace_span;
use tracing_subscriber::layer::SubscriberExt;

#[tokio::main]
async fn main() {
    // Install the Datadog export pipeline; spans are batched and sent on the
    // current-thread Tokio runtime.
    let tracer = new_pipeline()
        .with_agent_endpoint("http://localhost:8126")
        .with_service_name("test")
        .with_version(ApiVersion::Version03)
        .install_batch(opentelemetry::runtime::TokioCurrentThread)
        .unwrap();

    // Bridge `tracing` spans into OpenTelemetry and install the subscriber globally.
    let telemetry = tracing_opentelemetry::layer().with_tracer(tracer);
    let subscriber = tracing_subscriber::Registry::default().with(telemetry);
    tracing::subscriber::set_global_default(subscriber)
        .expect("Failed to install `tracing` subscriber.");

    // Emit a single span, then wait long enough for the batch exporter to run
    // (and keep re-exporting the spans generated about its own HTTP requests).
    trace_span!({ "test" }, "test span").in_scope(|| {
        println!("Hello, world!");
    });

    tokio::time::sleep(tokio::time::Duration::from_secs(60)).await;

    opentelemetry::global::shutdown_tracer_provider();
}
  3. Watch the output of the script from step 1 to see the encode_headers/parse_headers traces generated over and over again.

I would expect to have some way to disable tracing of the tracer. Is this a bug? Do I just need to configure trace levels more precisely?

@TommyCpp TommyCpp added the A-trace Area: issues related to tracing label Mar 16, 2022
@jherico

jherico commented May 26, 2022

I think the answer here is that you're seeing a lot of spam in the output because you haven't included any filtering in the registry stack. The default behavior is to include everything, even items that are tagged with trace (here meaning the logging level that's one step more verbose than debug, not to be confused with service traces).

In this case, the hyper crate used for transport itself uses the tracing facade and can produce events. This is roughly equivalent to configuring Java logging for the Apache HTTP package to spit out logs for all headers.

A typical application will have some filtering mechanism baked into the registry and would normally pick INFO or WARN as the default logging level rather than TRACE, only increasing logging as needed for specific packages. See below for an example that respects the RUST_LOG environment variable (the convention used by the env_logger crate), or, if it is not set, defaults to excluding all output below the INFO level.

    let subscriber = tracing_subscriber::Registry::default()
        .with(telemetry)
        .with(
            tracing_subscriber::EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| tracing_subscriber::EnvFilter::new("info")),
        );
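
A narrower variant of the same idea (a sketch; the target names hyper, h2, and tower are a guess at the exporter's HTTP stack, so adjust them to whatever client your exporter actually pulls in) keeps the application at a verbose level while muting only the export path. Note that this also hides those crates' events when your own code uses them, a trade-off discussed further down:

    // Application stays at TRACE; the exporter's HTTP stack is muted entirely.
    let filter = tracing_subscriber::EnvFilter::try_from_default_env()
        .unwrap_or_else(|_| tracing_subscriber::EnvFilter::new("trace,hyper=off,h2=off,tower=off"));
    let subscriber = tracing_subscriber::Registry::default()
        .with(telemetry)
        .with(filter);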

@hdost hdost added this to the Tracing API And SDK Stable milestone Mar 2, 2023
@cijothomas
Member

Just found this issue before I opened #1171. The same solution should work for both, I think.

@cijothomas cijothomas added the A-log Area: Issues related to logs label Oct 19, 2023
@hdost hdost added the release:required-for-stable Must be resolved before GA release, or nice to have before GA. label Oct 24, 2023
@hdost
Contributor

hdost commented Feb 21, 2024

Sounds like this might be related to #473

@lalitb
Member

lalitb commented Mar 4, 2024

There was an attempt to fix this as part of #1330, if someone wants to take a further look; otherwise, I will get back to this issue shortly. The fix was to suppress the logs invoked in the context of the (async) task doing the export.

@NickLarsenNZ

@jherico, filtering is one way (in fact, it's what I'm having to do right now, e.g. h2=off to stop this problem occurring), but that means I miss out on h2 traces from my own application.

Hopefully something like #1330 can fix this properly?

@lalitb
Member

lalitb commented May 21, 2024

Trying to document a few options we discussed in various forums to address this issue:

  1. Create a Tracing Filter: Use a filter to drop all events (logs/traces) emitted by the crates used in specific exporters in the opentelemetry pipeline. This solution is straightforward but may not be ideal, as it will also filter out events from these crates when they are used within the application.

  2. Use a Suppression Flag with Thread-Local Storage: Add a suppression flag in opentelemetry-sdk as thread-local storage. Define a struct that wraps the future and ensures this flag is set before polling the future. We could use the existing WithContext struct for this, as it already handles OtelContext. However, this flag would only be set within the first async task invoked by the exporter; ensuring that any async task created in external crates also sets this flag would be challenging. This approach was attempted in PR [WIP] Suppression of nested logs from dependencies #1330 (a rough sketch of the flag-and-wrapper idea follows this list).

  3. Single-Threaded Export with Suppression Flag: Similar to the second option, but additionally ensure the exporter invokes all async tasks on a single thread, by explicitly using the Tokio current_thread runtime or async_std::task::block_on. This confines changes to opentelemetry-rust but may impact performance due to single-threaded execution.

  4. Use Tokio Task-Local Storage: Implement the suppression flag using Tokio's task-local storage. This approach still requires scoping the variable when calling async tasks, necessitating cooperation from other crates to scope the flag properly. The only minimal advantage over options 2 and 3 is that no explicit Future wrapper is needed, and it adds a hard dependency on the Tokio runtime, unless async-std has a similar feature.

  5. Utilize Tracing Context and Span Extensions: Store the suppression flag as a span extension in the Context provided by the tracing crate. External crates must ensure they copy this flag from the currently active span to any newly created spans. This approach leverages the existing context propagation supported by most tracing-instrumented crates.
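
For concreteness, a minimal sketch of option 2, assuming a thread-local flag plus a future wrapper built with the pin_project_lite crate; the names (SUPPRESS_TELEMETRY, Suppressed, suppress, is_suppressed) are illustrative and not existing opentelemetry-rust APIs. As noted above, the flag only covers polls of the wrapped future itself, not tasks the HTTP client spawns internally.

use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

thread_local! {
    // Set while the exporter's future is being polled on this thread.
    static SUPPRESS_TELEMETRY: Cell<bool> = Cell::new(false);
}

// A subscriber (or the SDK itself) could call this to decide whether to drop telemetry.
pub fn is_suppressed() -> bool {
    SUPPRESS_TELEMETRY.with(|flag| flag.get())
}

pin_project_lite::pin_project! {
    pub struct Suppressed<F> {
        #[pin]
        inner: F,
    }
}

// Wrap the exporter's export future so every poll of it runs with the flag set.
pub fn suppress<F: Future>(inner: F) -> Suppressed<F> {
    Suppressed { inner }
}

impl<F: Future> Future for Suppressed<F> {
    type Output = F::Output;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let this = self.project();
        // Raise the flag for the duration of this poll and restore the previous
        // value afterwards so nested wrappers behave correctly.
        let previous = SUPPRESS_TELEMETRY.with(|flag| flag.replace(true));
        let result = this.inner.poll(cx);
        SUPPRESS_TELEMETRY.with(|flag| flag.set(previous));
        result
    }
}

The exporter would then drive its network call as suppress(export_future), with export_future standing in for whatever future performs the export, and the subscriber side would drop anything emitted while is_suppressed() returns true.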

Would like to discuss this further in the community meeting.

@cijothomas
Member

From 5/21 Community Call:

  1. We need to see how other tracing layers/subscribers are solving this problem (or have they solved it?)
  2. What do we need to ensure this is done? Let's write down exactly what the authors of hyper, etc. must do to make sure [WIP] Suppression of nested logs from dependencies #1330 works.
  3. If 2 is not feasible (for example, if hyper doesn't want to depend on opentelemetry), what is the alternative? Could we see if tracing can expose something to help us?
  4. Proposal to Adopt Tokio Tracing as the OTel Tracing API #1689 will influence our decision OR be influenced by our decision!

NickLarsenNZ added a commit to stackabletech/operator-rs that referenced this issue May 24, 2024
github-merge-queue bot pushed a commit to stackabletech/operator-rs that referenced this issue May 24, 2024
* fix(stackable-telemetry): disable export of h2 events via OTLP

It causes infinite cascading events to be emitted. See open-telemetry/opentelemetry-rust#761

* chore(stackable-telemetry): update changelog
@mladedav

mladedav commented Jun 2, 2024

This feels like it boils down to propagating a piece of information through a stack of spawned tasks. Or having access to parent tasks inside a spawned task to get information like this. Something like looking up ppid of a process to check that it was not spawned by the exporter task/process.

I don't know about anything like that in tokio, but it might be useful in other cases, so it might be worth following up there.

But otherwise, I don't think it's currently possible to propagate information through tokio::spawn and std::thread::spawn which the exporter can call whenever and however it wants.
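
To make that gap concrete, here is a small sketch of the task-local variant (option 4 in the list above); the names are hypothetical, and the comments spell out where propagation breaks down.

tokio::task_local! {
    // Hypothetical suppression flag carried as a Tokio task-local.
    static SUPPRESSED: bool;
}

async fn export_batch() {
    SUPPRESSED
        .scope(true, async {
            // Telemetry emitted directly inside this scope can be dropped by a
            // subscriber that checks SUPPRESSED.try_with(|v| *v).unwrap_or(false).
            //
            // However, if the HTTP client calls tokio::spawn internally, the
            // spawned task starts outside this scope: the task-local is not
            // inherited, so spans from that task are not suppressed.
        })
        .await;
}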

@lalitb
Member

lalitb commented Jun 2, 2024

This feels like it boils down to propagating a piece of information through a stack of spawned tasks. Or having access to parent tasks inside a spawned task to get information like this. Something like looking up ppid of a process to check that it was not spawned by the exporter task/process.

Yes @mladedav, this is something we are trying to achieve. I think the problem is not specific to tracing-opentelemetry; in fact, any tracing subscriber may want to suppress the internal traces generated through the chain of spawned tasks.

Most of the OpenTelemetry language implementations provide their own tracing library for instrumentation, and they add this piece of information (as a suppression flag) into the context. All of the instrumented libraries then propagate the context into spawned tasks, and the flag can eventually be checked in the subscriber and the telemetry dropped, as is done in opentelemetry-js.

In the case of Rust, the libraries are instrumented using tokio-rs tracing; do you think a similar flag could be added to the tracing context and used by the subscribers? Sorry if this doesn't make sense, I haven't yet gone through this library in depth, but I can try to dig further if this is the right direction.

I don't know about anything like that in tokio, but it might be useful in other cases, so it might be worth following up there.

opentelemetry-rust provides an abstraction over the async runtimes, so any such mechanism, if available, should exist across all the async implementations. The Tokio runtime provides an option (as mentioned in option 3 of the suggestions listed above), but it has a performance overhead, and I'm not sure whether something similar exists in async-std.

@mladedav

mladedav commented Jun 2, 2024

Other subscribers I've worked with usually don't need a spawned task or don't produce any further tracing information. See the tracing-tree example I've linked in the other discussion.

I think the tasks can be assigned a slightly different subscriber that would suppress the otel layer, but it would ideally leave other layers intact. I'm not sure if that's possible, maybe the task would have to just have its tracing completely disabled.

But then the libraries also need to propagate the subscriber if they spawn further tasks. I'm not exactly sure about the details of this, though; I vaguely remember someone advising not to do that, but I don't remember why.

I know that otel is runtime agnostic, but some kind of local storage preserving some kind of structured-concurrency ownership concept might be interesting to all of the runtimes. Not sure though; maybe it's a bad idea, maybe there would be technical limitations.
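
One way to get that "suppress only the otel layer" behaviour with today's tracing-subscriber is a per-layer filter that consults a runtime flag. A rough sketch, assuming the hypothetical is_suppressed() helper from the earlier thread-local sketch (stubbed out here so the snippet stands alone):

use tracing_subscriber::{filter::dynamic_filter_fn, layer::SubscriberExt, Layer, Registry};

// Hypothetical helper from the earlier thread-local sketch, stubbed out here.
fn is_suppressed() -> bool {
    false
}

// Attach the filter only to the OTel layer, so other layers (fmt, etc.) still
// receive the events emitted while the exporter is running.
fn subscriber_with_suppressed_otel<L>(telemetry: L) -> impl tracing::Subscriber
where
    L: Layer<Registry> + Send + Sync + 'static,
{
    let not_exporting = dynamic_filter_fn(|_metadata, _cx| !is_suppressed());
    Registry::default().with(telemetry.with_filter(not_exporting))
}

This still depends on the flag actually being set around everything the exporter runs, which is the propagation problem discussed above.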

@cijothomas
Member

Other subscribers I've worked with usually don't need a spawned task or don't produce any further tracing information

Any subscriber which needs to make a gRPC/HTTP call will face this. For example: https://github.com/krojew/tracing-elastic-apm/blob/master/src/apm_client.rs#L103-L104

(Not just gRPC/HTTP calls, but pretty much any API that is already instrumented with tracing; gRPC/HTTP is what shows up now because those crates are instrumented.)

@mladedav

mladedav commented Jun 3, 2024

I didn't mean that the problem isn't valid or that no one has tried to solve it before; it's just that I don't have personal experience with it.

By the way, the example you linked also seems to go the way of spawning a single-threaded runtime, but I think it should be possible to use the standard multi-threaded one with multiple threads and on_thread_start, where you set NoSubscriber as the current thread's subscriber.

It might still be a bit wasteful since you're starting new threads for the runtime, but at least it seems we're not constrained to a single thread.
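
A rough sketch of that suggestion, assuming the exporter is driven from a dedicated runtime (wiring the exporter onto it is left out); NoSubscriber, set_default, and on_thread_start are existing APIs, and mem::forget keeps the guard alive so the no-op subscriber stays installed for the worker thread's lifetime:

use tracing::subscriber::NoSubscriber;

// Build a runtime whose worker threads drop all `tracing` output, so spans
// emitted by hyper/h2 while exporting never reach the OTel layer.
fn exporter_runtime() -> std::io::Result<tokio::runtime::Runtime> {
    tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .enable_all()
        .on_thread_start(|| {
            // Install a no-op subscriber as this thread's default and leak the
            // guard so it stays installed until the thread exits.
            let guard = tracing::subscriber::set_default(NoSubscriber::default());
            std::mem::forget(guard);
        })
        .build()
}

As noted, this trades a couple of extra threads for not being pinned to a single-threaded runtime.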
