Shutting down the global tracer hangs #868

Open
dnaka91 opened this issue Aug 28, 2022 · 14 comments
Labels: A-trace (Area: issues related to tracing) · question (Further information is requested) · release:required-for-stable (Must be resolved before GA release, or nice to have before GA) · shutdown&runtime

Comments

@dnaka91

dnaka91 commented Aug 28, 2022

I need to be able to swap out the global tracer instance at runtime, so that tracing can be enabled/disabled (or the endpoint URL changed) without restarting the application.

I'm using opentelemetry-otlp for trace exporting, as the opentelemetry-jaeger crate is just too broken. When I try to shut down the global tracer, or swap it out, it just blocks on the internal Mutex forever.

I've seen several closed issues about this topic, but it seems this problem is not fully resolved yet, as it still hangs for me.

@TommyCpp TommyCpp added question Further information is requested A-trace Area: issues related to tracing labels Aug 28, 2022
@TommyCpp
Contributor

Could you provide more information on

  • which runtime are you using?
  • are you using BatchSpanProcessor?

Also, could you expand on what you are trying to achieve? If you want to adjust how many spans are sent to the backend, a Sampler may be a better choice than swapping out the global tracer instance.
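
For illustration, a minimal sketch of the sampler approach (untested, assuming a recent opentelemetry_sdk; the sampled_config helper exists only for this example):

use opentelemetry_sdk::trace::{Config, Sampler};

// Control how many spans reach the exporter via the sampler on the trace
// config, instead of swapping the global tracer at runtime.
fn sampled_config(ratio: f64) -> Config {
    Config::default()
        // Respect the parent's sampling decision; otherwise keep `ratio` of
        // root traces. Sampler::AlwaysOff would drop everything.
        .with_sampler(Sampler::ParentBased(Box::new(Sampler::TraceIdRatioBased(ratio))))
}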

@dnaka91
Author

dnaka91 commented Aug 28, 2022

Thanks for the quick reply. And sorry if I struck a bit of a spicy tone in the ticket description; I've sadly run into several pain points recently while using the opentelemetry crates.

which runtime are you using?

Tokio runtime.

are you using BatchSpanProcessor?

Yes.

The Sampler wouldn't really help me here. The application is a rather simple web server that works out of the box with zero configuration. Several global options can be activated and customized later, which will soon include a tracing collector. That means the admin can change the endpoint at runtime, and in that case the URL must either be updated or the whole tracing client disabled (in case the URL is empty).

If I just left the global tracer set and, say, filtered out all traces instead, it would still keep connecting to the tracing collector, right? Therefore, I'd like to fully shut it down when the tracing URL is removed.

@TommyCpp
Contributor

TommyCpp commented Sep 2, 2022

have run into several pain points recently while using the opentelemetry crates, sadly.

Sorry to hear that. I will try to help as much as I can.

Regarding your problem, I think the Tracer is not the right thing to replace. Note that a Tracer is just a "span creator": it doesn't hold any information about where or how to export. Instead, a Tracer holds a weak pointer to a TracerProvider. We do have a global TracerProvider instance, and you can replace it using the set_tracer_provider method.

I will see if I can write an example over the weekend but feel free to try it yourself.
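
In the meantime, a rough sketch of that idea (untested; assumes your opentelemetry version exposes opentelemetry::trace::noop::NoopTracerProvider, and disable_exporting is just a name for this example):

use opentelemetry::global;
use opentelemetry::trace::noop::NoopTracerProvider;

fn disable_exporting() {
    // Spans created after this point go nowhere. The previous provider is
    // returned; dropping it may still flush and shut down its exporter.
    let _previous = global::set_tracer_provider(NoopTracerProvider::new());
}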

@dnaka91
Author

dnaka91 commented Sep 2, 2022

That's what I'm using, I think. I'm not using opentelemetry directly but through the tracing-opentelemetry crate, so there are a few more layers of indirection.

For initialization I use the following code, which returns a tracer wrapped in a tracing Layer for use in the tracing stack. Internally this calls set_tracer_provider:

fn init_tracing<S>(otlp_endpoint: String) -> Result<impl Layer<S>>
where
    for<'span> S: Subscriber + LookupSpan<'span>,
{
    opentelemetry::global::set_error_handler(|error| {
        error!(target: "opentelemetry", ?error);
    })?;

    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .with_trace_config(trace::config().with_resource(Resource::new([
            resource::SERVICE_NAME.string(env!("CARGO_CRATE_NAME")),
            resource::SERVICE_VERSION.string(env!("CARGO_PKG_VERSION")),
        ])))
        .install_batch(runtime::Tokio)?;

    Ok(tracing_opentelemetry::layer().with_tracer(tracer))
}

Then, to disable it again, I use the following code, which hangs in shutdown_tracer_provider:

fn disable_tracing() -> Result<()> {
    opentelemetry::global::set_error_handler(|_| {})?;
    opentelemetry::global::shutdown_tracer_provider();
    Ok(())
}

This is all pulled together in the whole tracing setup, where I use a reload layer to dynamically enable/disable the telemetry endpoint. The TracingToggle instance is later wrapped in an Arc and shared as web app state:

pub struct TracingToggle {
    enable: Box<dyn Fn(String) -> Result<()> + Send + Sync + 'static>,
    disable: Box<dyn Fn() -> Result<()> + Send + Sync + 'static>,
}

#[allow(clippy::missing_errors_doc)]
impl TracingToggle {
    pub fn enable(&self, otlp_endpoint: String) -> Result<()> {
        (self.enable)(otlp_endpoint)
    }

    pub fn disable(&self) -> Result<()> {
        (self.disable)()
    }
}

fn init_logging(otlp_endpoint: Option<String>) -> Result<TracingToggle> {
    let opentelemetry = otlp_endpoint.map(init_tracing).transpose()?;
    let (opentelemetry, handle) = reload::Layer::new(opentelemetry);
    let handle2 = handle.clone();

    let enable = move |endpoint: String| {
        let layer = init_tracing(endpoint)?;
        handle2.reload(Some(layer))?;
        anyhow::Ok(())
    };

    let disable = move || {
        disable_tracing()?;
        handle.reload(None)?;
        anyhow::Ok(())
    };

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(opentelemetry)
        .with(
            Targets::new()
                .with_target(env!("CARGO_PKG_NAME"), Level::TRACE)
                .with_target("tower_http", Level::TRACE)
                .with_default(Level::INFO),
        )
        .init();

    Ok(TracingToggle {
        enable: Box::new(enable),
        disable: Box::new(disable),
    })
}

@TommyCpp
Contributor

Sorry for the late response. Based on your comment, I imagine you change the tracer provider on a spawned task. That can cause issues, as shutdown_tracer_provider/set_tracer_provider involve blocking operations, so they may need to run on a thread that allows blocking, i.e. via spawn_blocking.

The reason it blocks is that, per the spec, when a tracer provider shuts down it should call the shutdown function on all span processors, which in turn invokes the forceFlush method to export any remaining spans.
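
A minimal sketch of the spawn_blocking suggestion (not library code, just illustrating the call):

async fn shutdown_tracing() {
    // Run the blocking shutdown on Tokio's dedicated blocking pool so it
    // cannot stall the async worker threads.
    let _ = tokio::task::spawn_blocking(|| {
        // Blocks until all span processors have flushed and shut down.
        opentelemetry::global::shutdown_tracer_provider();
    })
    .await;
}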

@hdost hdost added this to the Tracing API And SDK Stable milestone Mar 2, 2023
@hdost hdost added the release:required-for-stable Must be resolved before GA release, or nice to have before GA. label Mar 2, 2023
@blueforesticarus

I have a similar issue, where I am calling global::shutdown_tracer_provider() from a panic hook, though I don't know why it would hang there.

ololobus added a commit to neondatabase/neon that referenced this issue Apr 11, 2023
Shutting down OTEL tracing provider may hang for quite some time,
see, for example:
- open-telemetry/opentelemetry-rust#868
- and our problems with staging neondatabase/cloud#3707 (comment)

Yet, we want computes to shut down fast enough, as we may need a new one
for the same timeline ASAP. So wait no longer than 2s for the shutdown to
complete, then just error out and exit the main thread.

Related to neondatabase/cloud#3707
ololobus added a commit to neondatabase/neon that referenced this issue Apr 11, 2023
@dnaka91
Author

dnaka91 commented May 23, 2023

So this is still an issue for me. For my own small projects, I moved away from OpenTelemetry and built my own OTLP collector that doesn't need as much memory (OpenTelemetry seems to cache a lot, and overall it caused a multiple of the memory usage compared to before I added it).

But in larger projects this is still an issue. Even a simple test case that sets up the tracer and immediately shuts it down hangs.

If it needs to run on a separate thread, IMO that should be done internally, as such an essential operation should not fail or hang by default.

@jsoverson

I think this is the same issue I'm having.

I'm experiencing a problem when using a crate similar to test-log. Logging/telemetry gets initialized for a test, the test is run, then the logging guard is dropped. If the test fails with a panic, it can cause the span's export thread to hang because it blocks on a future that never progresses. That means the processor's shutdown can't complete, which causes the global shutdown to hang when the processor's Drop impl gets triggered.

I'm not sure what the best solution is, but replacing the block_on call here with a spawn_local() followed by run_until_stalled() on a LocalPool prevents the thread from locking up.
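
For reference, a standalone sketch of that pattern with futures' LocalPool (not the crate's actual code; the spawned future is just a stand-in for the processor's shutdown):

use futures::executor::LocalPool;
use futures::task::LocalSpawnExt;

fn drive_until_stalled() {
    let mut pool = LocalPool::new();
    pool.spawner()
        .spawn_local(async {
            // stand-in for the span processor's shutdown future
        })
        .expect("failed to spawn");

    // Polls the spawned futures until none of them can make further progress,
    // then returns instead of hanging the way block_on does when the future
    // never completes.
    pool.run_until_stalled();
}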

@fnichol

fnichol commented Feb 2, 2024

I'm seeing that the GLOBAL_TRACER_PROVIDER is using a std::sync::RwLock to manage access:

RwLock::new(GlobalTracerProvider::new(
    trace::noop::NoopTracerProvider::new(),
))

I was a bit surprised to see an RwLock here, as I believe outstanding readers can block a request to take the write lock. The RwLock::write() docs say:

This function will not return while other writers or other readers currently have access to the lock.

I'm wondering if RwLock might be a less-than-ideal lock type here, at least at shutdown time (I'm assuming it's very likely that there are outstanding RwLockReadGuards in the system)?
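
A tiny standalone illustration of that concern using std's RwLock (nothing OpenTelemetry-specific):

use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(0u32);
    let _reader = lock.read().unwrap(); // an outstanding read guard
    // A blocking write() would now wait until the reader is dropped, which is
    // the shutdown symptom; try_write() shows the contention without hanging.
    assert!(lock.try_write().is_err());
}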

@ShaneMurphy2

Yeah, no matter what I do I can't seem to get a call to global::shutdown_tracer_provider to complete.

@hdost
Contributor

hdost commented Feb 21, 2024

Seems related to #1143.

stintel added a commit to toverainc/ai-router that referenced this issue Feb 21, 2024
When an otlp_endpoint is configured in the config file, ai-router hangs
on shutdown. Spawning shutdown_tracer_provider in Tokio runtime thread
pool for running blocking functions seems to solve this.

See open-telemetry/opentelemetry-rust#868
ThomAub added a commit to ThomAub/tracing-opentelemetry-instrumentation-sdk that referenced this issue Jun 9, 2024
davidB pushed a commit to davidB/tracing-opentelemetry-instrumentation-sdk that referenced this issue Jun 30, 2024
@howardjohn
Contributor

Simple reproducer

use opentelemetry::global;
use opentelemetry::trace::Tracer;
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::Resource;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let result = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_trace_config(
            opentelemetry_sdk::trace::Config::default().with_resource(Resource::new(vec![])),
        )
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://127.0.0.1:4317"),
        )
        // Batch span processor driven by the (current-thread) Tokio runtime.
        .install_batch(opentelemetry_sdk::runtime::Tokio);
    let r = result.unwrap();
    global::set_tracer_provider(r.clone());

    let tracer = global::tracer("my_tracer");
    tracer.in_span("doing_work", |_cx| {
        tracing::error!("hello world");
    });
    println!("shutdown start");
    // Hangs here: the blocking shutdown occupies the only runtime thread, so
    // the batch processor task never gets to process the shutdown message.
    global::shutdown_tracer_provider();
    println!("shutdown done");
    global::set_tracer_provider(r);
    println!("shutdown start");
    global::shutdown_tracer_provider();
    println!("shutdown done");
}

Since we are blocking on shutdown(), we never call process_message to actually continue processing the shutdown

@TommyCpp
Contributor

TommyCpp commented Sep 3, 2024

Since we are blocking on shutdown(), we never call process_message to actually continue processing the shutdown

You are using the Tokio runtime (opentelemetry_sdk::runtime::Tokio), which doesn't work with Tokio's current-thread scheduler. If the current-thread scheduler is used, you should use the TokioCurrentThread runtime instead.
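
An untested sketch of that change against the reproducer above (assumes the opentelemetry_sdk rt-tokio-current-thread feature is enabled):

use opentelemetry::global;
use opentelemetry_otlp::WithExportConfig;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let provider = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://127.0.0.1:4317"),
        )
        // TokioCurrentThread drives the batch processor on its own background
        // thread, so shutdown no longer depends on the blocked scheduler.
        .install_batch(opentelemetry_sdk::runtime::TokioCurrentThread)
        .unwrap();
    global::set_tracer_provider(provider);

    // ... create spans as in the reproducer ...

    println!("shutdown start");
    global::shutdown_tracer_provider();
    println!("shutdown done");
}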

@xoac

xoac commented Oct 10, 2024

Maybe it would be possible for open-telemetry to pick the runtime automatically using this function:
https://docs.rs/tokio/latest/tokio/runtime/struct.Handle.html#method.runtime_flavor
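
For example, a hypothetical helper along those lines (not an existing API; requires a reasonably recent Tokio and panics if called outside a runtime):

use tokio::runtime::{Handle, RuntimeFlavor};

// Detect the ambient scheduler so the SDK could pick between the Tokio and
// TokioCurrentThread OpenTelemetry runtimes automatically.
fn ambient_runtime_is_current_thread() -> bool {
    matches!(Handle::current().runtime_flavor(), RuntimeFlavor::CurrentThread)
}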
