Shutting down the global tracer hangs #868
Comments
Could you provide more information on …

Also, could you expand on what you are looking to achieve? If you want to adjust how many spans to send to the backend …
Thanks for the quick reply. And sorry if I picked a bit of a spicy tone in the ticket description; I've run into several pain points recently while using the opentelemetry crates, sadly.
Tokio runtime.
Yes. The sample wouldn't really help me here. The application is a rather simple web server that works out of the box with zero configuration. Several global options can then be activated and customized later, which will soon include a tracing collector. That means the admin can change the endpoint at runtime, and in that case the URL must either be updated or the whole tracing client disabled (in case the URL is empty). If I just left the global tracer set and, say, filtered out all traces instead, it'd still continue to connect to the tracing collector, right? Therefore, I'd like to fully shut it down in case the tracing URL is removed.
Sorry to hear that. I will try to help as much as I can. Regarding your problem, I think the …

I will see if I can write an example over the weekend, but feel free to try it yourself.
That's the thing I'm using, I think. Not using …

For initialization I use the following code, which returns me a tracer that is wrapped in a tracing layer:

```rust
fn init_tracing<S>(otlp_endpoint: String) -> Result<impl Layer<S>>
where
    for<'span> S: Subscriber + LookupSpan<'span>,
{
    opentelemetry::global::set_error_handler(|error| {
        error!(target: "opentelemetry", ?error);
    })?;

    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint(otlp_endpoint),
        )
        .with_trace_config(trace::config().with_resource(Resource::new([
            resource::SERVICE_NAME.string(env!("CARGO_CRATE_NAME")),
            resource::SERVICE_VERSION.string(env!("CARGO_PKG_VERSION")),
        ])))
        .install_batch(runtime::Tokio)?;

    Ok(tracing_opentelemetry::layer().with_tracer(tracer))
}
```

Then, to disable it again I use the following code, where it hangs in the …

```rust
fn disable_tracing() -> Result<()> {
    opentelemetry::global::set_error_handler(|_| {})?;
    opentelemetry::global::shutdown_tracer_provider();
    Ok(())
}
```

This is all pulled together in the whole tracing setup, where I use a reload layer to dynamically enable/disable the telemetry endpoint. The …

```rust
pub struct TracingToggle {
    enable: Box<dyn Fn(String) -> Result<()> + Send + Sync + 'static>,
    disable: Box<dyn Fn() -> Result<()> + Send + Sync + 'static>,
}

#[allow(clippy::missing_errors_doc)]
impl TracingToggle {
    pub fn enable(&self, otlp_endpoint: String) -> Result<()> {
        (self.enable)(otlp_endpoint)
    }

    pub fn disable(&self) -> Result<()> {
        (self.disable)()
    }
}

fn init_logging(otlp_endpoint: Option<String>) -> Result<TracingToggle> {
    let opentelemetry = otlp_endpoint.map(init_tracing).transpose()?;
    let (opentelemetry, handle) = reload::Layer::new(opentelemetry);

    let handle2 = handle.clone();
    let enable = move |endpoint: String| {
        let layer = init_tracing(endpoint)?;
        handle2.reload(Some(layer))?;
        anyhow::Ok(())
    };
    let disable = move || {
        disable_tracing()?;
        handle.reload(None)?;
        anyhow::Ok(())
    };

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer())
        .with(opentelemetry)
        .with(
            Targets::new()
                .with_target(env!("CARGO_PKG_NAME"), Level::TRACE)
                .with_target("tower_http", Level::TRACE)
                .with_default(Level::INFO),
        )
        .init();

    Ok(TracingToggle {
        enable: Box::new(enable),
        disable: Box::new(disable),
    })
}
```
Sorry for the late response. Based on your comment I imagine you change the tracer provider on a spawned task. It may cause issues, as …

The reason why it's blocking is that, per the spec, when a tracer provider shuts down it should call …
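For what it's worth, one way to keep that blocking call away from the async workers is to build the runtime by hand and only shut the provider down after `block_on` returns, while the runtime is still alive. A minimal sketch, assuming the `opentelemetry::global::shutdown_tracer_provider` API used throughout this thread; the structure is illustrative, not a drop-in fix:

```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build the Tokio runtime by hand instead of using #[tokio::main], so the
    // blocking shutdown call below runs outside of any async worker thread.
    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()?;

    runtime.block_on(async {
        // install_batch(runtime::Tokio), set the global provider, run the app ...
    });

    // The runtime is still alive here, so its worker threads can drive the
    // batch span processor's task while this plain thread blocks on shutdown.
    opentelemetry::global::shutdown_tracer_provider();

    Ok(())
}
```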
I have a similar issue, where I am calling …
Shutting down the OTEL tracing provider may hang for quite some time, see, for example:
- open-telemetry/opentelemetry-rust#868
- and our problems with staging neondatabase/cloud#3707 (comment)

Yet, we want computes to shut down fast enough, as we may need a new one for the same timeline ASAP. So wait no longer than 2s for the shutdown to complete, then just error out and exit the main thread.

Related to neondatabase/cloud#3707
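A minimal sketch of that bounded-shutdown idea, using only a standard-library thread and channel; the helper name, the 2-second limit, and the logging are illustrative, not the actual neon code:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Run the potentially hanging shutdown on its own OS thread and give up after
// a fixed deadline instead of blocking process exit indefinitely.
fn shutdown_with_timeout(timeout: Duration) {
    let (tx, rx) = mpsc::channel();

    thread::spawn(move || {
        opentelemetry::global::shutdown_tracer_provider();
        // The receiver may already be gone if we timed out; ignore the error.
        let _ = tx.send(());
    });

    match rx.recv_timeout(timeout) {
        Ok(()) => eprintln!("tracer provider shut down cleanly"),
        Err(_) => eprintln!("tracer provider shutdown timed out, exiting anyway"),
    }
}

fn main() {
    // ... application runs, tracer provider installed and used elsewhere ...
    shutdown_with_timeout(Duration::from_secs(2));
}
```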
So this is still an issue for me. For my own small projects, I moved away from OpenTelemetry and built my own OTLP collector that doesn't need as much memory (OpenTelemetry seems to cache a lot and overall caused a multiple of the memory usage compared to before I added it). But in larger projects this is still an issue. Even a simple test case that sets up the tracer and immediately shuts it down hangs. If it needs to be run on a separate thread, IMO that should be done internally, as such an essential operation should not fail or hang by default.
I think this is the same issue I'm having. I'm experiencing a problem when using a crate similar to test-log. Logging/telemetry gets initialized for a test, the test is run, then the logging guard is dropped. If the test fails with a panic, it can cause the span's export thread to hang because it blocks on a future that never progresses. That means the processor's …

I'm not sure what the best solution is, but by replacing the …
I'm seeing that the … (opentelemetry-rust/opentelemetry/src/global/trace.rs, lines 356 to 358 in dd4c13b)
I was a bit surprised to see an RwLock here, as I believe there's potential for outstanding readers to block a request to take the write lock. In the …
I'm wondering if RwLock might be a less-than-ideal lock type, at least when it comes to shutdown time (I'm assuming it's very likely that there are outstanding …)
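As a small illustration of that concern with a plain `std::sync::RwLock` (not the actual global tracer code), a single long-lived read guard is enough to park a writer:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(0u32));

    // A reader thread holds its read guard for a long time (think: a tracer
    // handle kept alive somewhere in the application).
    let reader = Arc::clone(&lock);
    thread::spawn(move || {
        let _guard = reader.read().unwrap();
        thread::sleep(Duration::from_secs(60));
    });

    // Give the reader a moment to acquire the lock first.
    thread::sleep(Duration::from_millis(100));

    // The writer (think: shutdown swapping out the provider) now blocks until
    // every outstanding read guard has been dropped.
    println!("trying to write...");
    let mut guard = lock.write().unwrap();
    *guard += 1;
    println!("write acquired"); // only printed after ~60 seconds
}
```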
Yeah, no matter what I do, I can't seem to get a call to …
Seems related: #1143
When an otlp_endpoint is configured in the config file, ai-router hangs on shutdown. Spawning shutdown_tracer_provider on the Tokio runtime's thread pool for blocking functions seems to solve this. See open-telemetry/opentelemetry-rust#868
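A sketch of that workaround, assuming a Tokio runtime and the same global shutdown API; the surrounding `main` is illustrative, not ai-router's actual code:

```rust
#[tokio::main(flavor = "multi_thread")]
async fn main() {
    // ... install_batch(runtime::Tokio), set the global provider, run the app ...

    // Move the blocking shutdown call off the async worker threads and onto
    // Tokio's blocking thread pool, so the runtime stays free to drive the
    // batch span processor's task that shutdown is waiting on.
    tokio::task::spawn_blocking(|| {
        opentelemetry::global::shutdown_tracer_provider();
    })
    .await
    .expect("shutdown task panicked");
}
```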
See discussion: open-telemetry/opentelemetry-rust#868, and inspiration here: neondatabase/neon#3982
Simple reproducer:

```rust
extern crate core;

use opentelemetry::global;
use opentelemetry::trace::Tracer;
use opentelemetry_otlp::WithExportConfig;
use opentelemetry_sdk::Resource;
use std::time::Duration;

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let result = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_trace_config(
            opentelemetry_sdk::trace::Config::default().with_resource(Resource::new(vec![])),
        )
        .with_exporter(
            opentelemetry_otlp::new_exporter()
                .tonic()
                .with_endpoint("http://127.0.0.1:4317"),
        )
        .install_batch(opentelemetry_sdk::runtime::Tokio);
    let r = result.unwrap();
    global::set_tracer_provider(r.clone());

    let tracer = global::tracer("my_tracer");
    tracer.in_span("doing_work", |_cx| {
        tracing::error!("hello world");
    });

    println!("shutdown start");
    global::shutdown_tracer_provider();
    println!("shutdown done");

    global::set_tracer_provider(r);
    println!("shutdown start");
    global::shutdown_tracer_provider();
    println!("shutdown done");
}
```

Since we are blocking on shutdown(), we never call process_message to actually continue processing the shutdown.
You are using …
Maybe it would be possible for opentelemetry to decide on the runtime using this function: …
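The linked function is elided above; assuming it refers to Tokio's runtime-flavor introspection, a sketch of what such a check could look like (a hypothetical helper, not an existing opentelemetry API):

```rust
use tokio::runtime::{Handle, RuntimeFlavor};

// Hypothetical check a library could run before deciding how to shut down:
// detect whether we are inside a Tokio runtime and, if so, which flavor.
fn describe_runtime() -> &'static str {
    match Handle::try_current() {
        Ok(handle) => match handle.runtime_flavor() {
            RuntimeFlavor::CurrentThread => "current-thread Tokio runtime",
            RuntimeFlavor::MultiThread => "multi-thread Tokio runtime",
            _ => "other Tokio runtime flavor",
        },
        Err(_) => "no Tokio runtime on this thread",
    }
}

#[tokio::main(flavor = "current_thread")]
async fn main() {
    println!("{}", describe_runtime());
}
```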
I need to be able to swap out the global tracer instance at runtime, because it should be possible to enable or disable it (or change the endpoint URL) without restarting the application.
I'm using opentelemetry-otlp for trace exporting, as the opentelemetry-jaeger crate is just too broken. When trying to shut down the global tracer or swap it, it just blocks on the internal Mutex forever.
I've seen several closed issues about this topic, but it seems this problem is not fully resolved yet, as it still hangs for me.