Shutting down the global tracer hangs #868

Open
dnaka91 opened this issue Aug 28, 2022 · 14 comments
Labels: A-trace (Area: issues related to tracing) · question (Further information is requested) · release:required-for-stable (Must be resolved before GA release, or nice to have before GA) · shutdown&runtime

Comments

@dnaka91

dnaka91 commented Aug 28, 2022

I need to be able to swap out the global tracer instance at runtime, so that tracing can be enabled/disabled (or the endpoint URL changed) without restarting the application.

I'm using opentelemetry-otlp for trace exporting, as the opentelemetry-jaeger crate is just too broken. When I try to shut down the global tracer, or swap it out, it just blocks on the internal Mutex forever.

I've seen several closed issues about this topic, but it seems this problem is not fully resolved yet, as it still hangs for me.

@TommyCpp TommyCpp added question Further information is requested A-trace Area: issues related to tracing labels Aug 28, 2022
@TommyCpp
Contributor

Could you provide more information on

  • which runtime are you using?
  • are you using BatchSpanProcessor?

Also, could you expand on what you are trying to achieve? If you want to adjust how many spans are sent to the backend, a Sampler may be a better choice than swapping out the global tracer instance.
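
For illustration, a minimal sketch of the sampler approach (untested, assuming a recent opentelemetry_sdk; the sampled_config helper exists only for this example):

use opentelemetry_sdk::trace::{Config, Sampler};

// Control how many spans reach the exporter via the sampler on the trace
// config, instead of swapping the global tracer at runtime.
fn sampled_config(ratio: f64) -> Config {
    Config::default()
        // Respect the parent's sampling decision; otherwise keep `ratio` of
        // root traces. Sampler::AlwaysOff would drop everything.
        .with_sampler(Sampler::ParentBased(Box::new(Sampler::TraceIdRatioBased(ratio))))
}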

@dnaka91
Author

dnaka91 commented Aug 28, 2022

Thanks for the quick reply. And sorry if I struck a bit of a spicy tone in the ticket description; I've sadly run into several pain points recently while using the opentelemetry crates.

which runtime are you using?

Tokio runtime.

are you using BatchSpanProcessor?

Yes.

The Sampler wouldn't really help me here. The application is a rather simple web server that works out of the box with zero configuration. Several global options can be activated and customized later, which will soon include a tracing collector. That means the admin can change the endpoint at runtime, and in that case the URL must either be updated or the whole tracing client disabled (in case the URL is empty).

If I just left the global tracer set and, say, filtered out all traces instead, it would still keep connecting to the tracing collector, right? Therefore, I'd like to fully shut it down when the tracing URL is removed.

@TommyCpp
Contributor

TommyCpp commented Sep 2, 2022

have run into several pain points recently while using the opentelemetry crates, sadly.

Sorry to hear that. I will try to help as much as I can.

Regarding your problem, I think the Tracer is not the right thing to replace. Note that a Tracer is just a "span creator": it doesn't hold any information about where or how to export. Instead, a Tracer holds a weak pointer to a TracerProvider. We do have a global TracerProvider instance, and you can replace it using the set_tracer_provider method.

I will see if I can write an example over the weekend but feel free to try it yourself.
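
In the meantime, a rough sketch of that idea (untested; assumes your opentelemetry version exposes opentelemetry::trace::noop::NoopTracerProvider, and disable_exporting is just a name for this example):

use opentelemetry::global;
use opentelemetry::trace::noop::NoopTracerProvider;

fn disable_exporting() {
    // Spans created after this point go nowhere. The previous provider is
    // returned; dropping it may still flush and shut down its exporter.
    let _previous = global::set_tracer_provider(NoopTracerProvider::new());
}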

@dnaka91
Author

dnaka91 commented Sep 2, 2022

That's what I'm using, I think. I'm not using opentelemetry directly but through the tracing-opentelemetry crate, so there are a few more layers of indirection.

For initialization I use the following code, which returns a tracer wrapped in a tracing Layer for use in the tracing stack. Internally this calls set_tracer_provider:

fn init_tracing<S>(otlp_endpoint: String) -> Result<impl Layer<S>>
where
    for<'span> S: Subscriber + LookupSpan<'span>,
{
    opentelemetry::global::set_error_handler(|error| {
        error!(target: "opentelemetry", ?error);
    })?;

    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .with_trace_config(trace::config().with_resource(Resource::new([
            resource::SERVICE_NAME.string(env!("CARGO_CRATE_NAME")),
            resource::SERVICE_VERSION.string(env!("CARGO_PKG_VERSION")),
        ])))
        .install_batch(runtime::Tokio)?;

    Ok(tracing_opentelemetry::layer().with_tracer(tracer))
}

Then, to disable it again, I use the following code, which hangs in shutdown_tracer_provider:

fn disable_tracing() -> Result<()> {
    opentelemetry::global::set_error_handler(|_| {})?;
    opentelemetry::global::shutdown_tracer_provider();
    Ok(())
}

This is all pulled together in the whole tracing setup, where I use a reload layer to dynamically enable/disable the telemetry endpoint. The TracingToggle instance is later wrapped in an Arc and shared as web app state:

pub struct TracingToggle {
    enable: Box<dyn Fn(String) -> Result<()> + Send + Sync + 'static>,
    disable: Box<dyn Fn() -> Result<()> + Send + Sync + 'static>,
}

#[allow(clippy::missing_errors_doc)]
impl TracingToggle {
    pub fn enable(&self, otlp_endpoint: String) -> Result<()> {
        (self.enable)(otlp_endpoint)
    }

    pub fn disable(&self) -> Result<()> {
        (self.disable)()
    }
}

fn init_logging(otlp_endpoint: Option<String>) -> Result<TracingToggle> {
    let opentelemetry = otlp_endpoint.map(init_tracing).transpose()?;
    let (opentelemetry, handle) = reload::Layer::new(opentelemetry);
    let handle2 = handle.clone();

    let enable = move |endpoint: String| {
        let layer = init_tracing(endpoint)?;
        handle2.reload(Some(layer))?;
        anyhow::Ok(())
    };

    let disable = move || {
        disable_tracing()?;
        handle.reload(None)?;
        anyhow::Ok(())
    };

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(opentelemetry)
        .with(
            Targets::new()
                .with_target(env!("CARGO_PKG_NAME"), Level::TRACE)
                .with_target("tower_http", Level::TRACE)
                .with_default(Level::INFO),
        )
        .init();

    Ok(TracingToggle {
        enable: Box::new(enable),
        disable: Box::new(disable),
    })
}

@TommyCpp
Contributor

Sorry for the late response. Based on your comment, I imagine you change the tracer provider on a spawned task. That can cause issues, as shutdown_tracer_provider/set_tracer_provider involve blocking operations, so they may need to run on a thread that allows blocking, i.e. via spawn_blocking.

The reason it blocks is that, per the spec, when a tracer provider shuts down it should call the shutdown function on all span processors, which in turn invokes the forceFlush method to export any remaining spans.
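
A minimal sketch of the spawn_blocking suggestion (not library code, just illustrating the call):

async fn shutdown_tracing() {
    // Run the blocking shutdown on Tokio's dedicated blocking pool so it
    // cannot stall the async worker threads.
    let _ = tokio::task::spawn_blocking(|| {
        // Blocks until all span processors have flushed and shut down.
        opentelemetry::global::shutdown_tracer_provider();
    })
    .await;
}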

@hdost hdost added this to the Tracing API And SDK Stable milestone Mar 2, 2023
@hdost hdost added the release:required-for-stable Must be resolved before GA release, or nice to have before GA. label Mar 2, 2023
@blueforesticarus

I have a similar issue, where I am calling global::shutdown_tracer_provider() from a panic hook, though I don't know why it would hang there.

ololobus added a commit to neondatabase/neon that referenced this issue Apr 11, 2023
Shutting down OTEL tracing provider may hang for quite some time,
see, for example:
- open-telemetry/opentelemetry-rust#868
- and our problems with staging neondatabase/cloud#3707 (comment)

Yet, we want computes to shut down fast enough, as we may need a new one
for the same timeline ASAP. So wait no longer than 2s for the shutdown to
complete, then just error out and exit the main thread.

Related to neondatabase/cloud#3707
ololobus added a commit to neondatabase/neon that referenced this issue Apr 11, 2023
@dnaka91
Author

dnaka91 commented May 23, 2023

So this is still an issue for me. For my own small projects, I moved away from OpenTelemetry and built my own OTLP collector that doesn't need as much memory (OpenTelemetry seems to cache a lot, and overall it caused a multiple of the memory usage compared to before I added it).

But in larger projects this is still an issue. Even a simple test case that sets up the tracer and immediately shuts it down hangs.

If it needs to run on a separate thread, IMO that should be done internally, as such an essential operation should not fail or hang by default.

@jsoverson

I think this is the same issue I'm having.

I'm experiencing a problem when using a crate similar to test-log. Logging/telemetry gets initialized for a test, the test is run, then the logging guard is dropped. If the test fails with a panic, it can cause the span's export thread to hang because it blocks on a future that never progresses. That means the processor's shutdown can't complete, which causes the global shutdown to hang when the processor's Drop impl gets triggered.

I'm not sure what the best solution is, but replacing the block_on call here with a spawn_local() followed by run_until_stalled() on a LocalPool prevents the thread from locking up.
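
For reference, a standalone sketch of that pattern with futures' LocalPool (not the crate's actual code; the spawned future is just a stand-in for the processor's shutdown):

use futures::executor::LocalPool;
use futures::task::LocalSpawnExt;

fn drive_until_stalled() {
    let mut pool = LocalPool::new();
    pool.spawner()
        .spawn_local(async {
            // stand-in for the span processor's shutdown future
        })
        .expect("failed to spawn");

    // Polls the spawned futures until none of them can make further progress,
    // then returns instead of hanging the way block_on does when the future
    // never completes.
    pool.run_until_stalled();
}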

@fnichol

fnichol commented Feb 2, 2024

I'm seeing that the GLOBAL_TRACER_PROVIDER is using a std::sync::RwLock to manage access:

RwLock::new(GlobalTracerProvider::new(
    trace::noop::NoopTracerProvider::new(),
))

I was a bit surprised to see an RwLock here, as I believe outstanding readers can block a request to take the write lock. The RwLock::write() docs say:

This function will not return while other writers or other readers currently have access to the lock.

I'm wondering if RwLock might be a less-than-ideal lock type here, at least at shutdown time (I'm assuming it's very likely that there are outstanding RwLockReadGuards in the system)?
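
A tiny standalone illustration of that concern using std's RwLock (nothing OpenTelemetry-specific):

use std::sync::RwLock;

fn main() {
    let lock = RwLock::new(0u32);
    let _reader = lock.read().unwrap(); // an outstanding read guard
    // A blocking write() would now wait until the reader is dropped, which is
    // the shutdown symptom; try_write() shows the contention without hanging.
    assert!(lock.try_write().is_err());
}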

@ShaneMurphy2

Yeah, no matter what I do I can't seem to get a call to global::shutdown_tracer_provider to complete.

@hdost
Contributor

hdost commented Feb 21, 2024

Seems related to #1143.

stintel added a commit to toverainc/ai-router that referenced this issue Feb 21, 2024
When an otlp_endpoint is configured in the config file, ai-router hangs
on shutdown. Spawning shutdown_tracer_provider in Tokio runtime thread
pool for running blocking functions seems to solve this.

See open-telemetry/opentelemetry-rust#868
ThomAub added a commit to ThomAub/tracing-opentelemetry-instrumentation-sdk that referenced this issue Jun 9, 2024
davidB pushed a commit to davidB/tracing-opentelemetry-instrumentation-sdk that referenced this issue Jun 30, 2024
@howardjohn
Contributor

Simple reproducer

use opentelemetry::global;
use opentelemetry::trace::Tracer;
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::Resource;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let result = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_trace_config(
            opentelemetry_sdk::trace::Config::default().with_resource(Resource::new(vec![])),
        )
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://127.0.0.1:4317"),
        )
        // Batch span processor driven by the (current-thread) Tokio runtime.
        .install_batch(opentelemetry_sdk::runtime::Tokio);
    let r = result.unwrap();
    global::set_tracer_provider(r.clone());

    let tracer = global::tracer("my_tracer");
    tracer.in_span("doing_work", |_cx| {
        tracing::error!("hello world");
    });
    println!("shutdown start");
    // Hangs here: the blocking shutdown occupies the only runtime thread, so
    // the batch processor task never gets to process the shutdown message.
    global::shutdown_tracer_provider();
    println!("shutdown done");
    global::set_tracer_provider(r);
    println!("shutdown start");
    global::shutdown_tracer_provider();
    println!("shutdown done");
}

Since we are blocking on shutdown(), we never call process_message to actually continue processing the shutdown

@TommyCpp
Contributor

TommyCpp commented Sep 3, 2024

Since we are blocking on shutdown(), we never call process_message to actually continue processing the shutdown

You are using the Tokio runtime (opentelemetry_sdk::runtime::Tokio), which doesn't work with Tokio's current-thread scheduler. If the current-thread scheduler is used, you should use the TokioCurrentThread runtime instead.
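
An untested sketch of that change against the reproducer above (assumes the opentelemetry_sdk rt-tokio-current-thread feature is enabled):

use opentelemetry::global;
use opentelemetry_otlp::WithExportConfig;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let provider = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://127.0.0.1:4317"),
        )
        // TokioCurrentThread drives the batch processor on its own background
        // thread, so shutdown no longer depends on the blocked scheduler.
        .install_batch(opentelemetry_sdk::runtime::TokioCurrentThread)
        .unwrap();
    global::set_tracer_provider(provider);

    // ... create spans as in the reproducer ...

    println!("shutdown start");
    global::shutdown_tracer_provider();
    println!("shutdown done");
}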

@xoac

xoac commented Oct 10, 2024

Maybe it would be possible for open-telemetry to pick the runtime automatically using this function:
https://docs.rs/tokio/latest/tokio/runtime/struct.Handle.html#method.runtime_flavor
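
For example, a hypothetical helper along those lines (not an existing API; requires a reasonably recent Tokio and panics if called outside a runtime):

use tokio::runtime::{Handle, RuntimeFlavor};

// Detect the ambient scheduler so the SDK could pick between the Tokio and
// TokioCurrentThread OpenTelemetry runtimes automatically.
fn ambient_runtime_is_current_thread() -> bool {
    matches!(Handle::current().runtime_flavor(), RuntimeFlavor::CurrentThread)
}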
