Document generic approach for span status (code + description) and exception event when instrumented code throws #1536

lmolkova · 2024-10-31T04:15:53Z

The common approach seems to be:

Do nothing if the exception is handled by the instrumented library (retries, etc.).
For exceptions not handled by the client library:
- set span status to error
- set span status description to the exception message
- set the error.type attribute on spans/metrics based on the exception type or a more specific low-cardinality error code
- DO NOT record an exception event (by default) unless recording a local root span or when the instrumentation knows that user code didn't handle the exception. Reason: exceptions are huge and expensive to record. Users decide if/how to record them when they catch them. If the exception stays unhandled, the server instrumentation will record it.
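The list above could be sketched roughly as follows. This is a minimal illustration, not the OpenTelemetry API: `SketchSpan`, `exception_type_name`, `record_failure` and `run_instrumented` are hypothetical names made up for this example, and using the bare class name for builtin exceptions is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class SketchSpan:
    """Hypothetical stand-in for a span; this is not the OpenTelemetry API."""

    status_code: str = "UNSET"
    status_description: Optional[str] = None
    attributes: Dict[str, Any] = field(default_factory=dict)
    events: List[Dict[str, Any]] = field(default_factory=list)


def exception_type_name(exc: BaseException) -> str:
    """Return the class name, module-qualified unless it is a builtin (assumption)."""
    cls = type(exc)
    if cls.__module__ == "builtins":
        return cls.__qualname__
    return f"{cls.__module__}.{cls.__qualname__}"


def record_failure(
    span: SketchSpan, exc: BaseException, record_exception_event: bool = False
) -> None:
    """Apply the status and error.type steps; the exception event is opt-in."""
    span.status_code = "ERROR"
    span.status_description = str(exc)
    span.attributes["error.type"] = exception_type_name(exc)
    if record_exception_event:
        # Off by default: left to the user or to the local root span's instrumentation.
        span.events.append(
            {
                "name": "exception",
                "exception.type": exception_type_name(exc),
                "exception.message": str(exc),
            }
        )


def run_instrumented(span: SketchSpan, operation: Callable[[], Any]) -> Any:
    """Run the operation; on an unhandled exception mark the span and re-raise."""
    try:
        return operation()
    except Exception as exc:
        record_failure(span, exc)
        raise
```

Exceptions that the library handles internally (for example inside a retry loop) never reach the `except` branch of `run_instrumented`, so the span is left untouched in that case.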

We should document it and link from DB/messaging/gen-ai and other conventions.

alanwest · 2024-10-31T16:55:24Z

Another aspect of this to consider is whether information should be captured from the outermost exception or innermost in the case of nested exceptions.

In capturing error.type and span status description, my intuition is that the innermost exception type/message is the most useful because it most closely describes the root of the problem.

On the other hand, when a user opts in to recording exception events, my intuition is that the outermost exception may be the most useful because the stack trace captured usually contains information from all nested exceptions.

Interestingly, the DB semantic conventions seem to suggest capturing details from the innermost exception:

[9]: The error.type SHOULD match the db.response.status_code returned by the database or the client library, or the canonical name of exception that occurred. When using canonical exception type name, instrumentation SHOULD do the best effort to report the most relevant type. For example, if the original exception is wrapped into a generic one, the original exception SHOULD be preferred. Instrumentations SHOULD document how error.type is populated.

The HTTP semantic conventions seem less opinionated and do not have a similar statement.
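To make the innermost/outermost distinction concrete, here is a small Python sketch. `innermost_exception` is a hypothetical helper written for this comment; it follows Python's own exception chaining attributes (`__cause__`, `__context__`, `__suppress_context__`), and other languages would need their own equivalent (for example `getCause()` or `InnerException`).

```python
def innermost_exception(exc: BaseException) -> BaseException:
    """Follow the exception chain down to the exception that started it."""
    seen = set()
    current = exc
    while True:
        seen.add(id(current))
        # Prefer the explicit cause ("raise ... from ..."); fall back to the
        # implicit context unless it was suppressed with "from None".
        nested = current.__cause__
        if nested is None and not current.__suppress_context__:
            nested = current.__context__
        # Stop at the end of the chain, or if the chain loops back on itself.
        if nested is None or id(nested) in seen:
            return current
        current = nested


class DriverError(Exception):
    """Hypothetical generic wrapper raised by a client library."""


try:
    try:
        raise ConnectionResetError("socket closed")
    except ConnectionResetError as inner:
        raise DriverError("query failed") from inner
except DriverError as outer:
    root = innermost_exception(outer)
    # Innermost exception feeds error.type and the status description.
    error_type = type(root).__qualname__
    description = str(root)
```

With this split, `error_type` and `description` come from the `ConnectionResetError`, while an opt-in exception event would still be recorded from the outer `DriverError`, whose traceback carries the whole chain.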

trask · 2024-11-05T03:35:47Z

DO NOT record exception event (by default) unless recording SERVER/CONSUMER span when the instrumentation knows that user code didn't handle the exception.

a slight variation on this is to replace SERVER/CONSUMER span above with "local root" span

trask · 2024-11-06T16:33:42Z

since it appears that github discussions don't get backreferenced, linking here: open-telemetry/opentelemetry-java-instrumentation#12125

cheempz · 2024-11-19T00:42:33Z

Tacking on a related question: in the issue description it seems setting span status to error and setting span error.type attribute should both happen when an operation errors. I'm wondering if this is implicit in the HTTP span semconv (that span status of error must also set error.type attribute, and vice versa), or if it's purposely left open for instrumentation libraries / end users to decide whether one or both of these are set. Thanks!

trask · 2024-11-19T02:30:50Z

hi @cheempz, check out this part of https://github.com/open-telemetry/semantic-conventions/blob/main/docs/http/http-spans.md:

If the request fails with an error before response status code was sent or received, error.type SHOULD be set to exception type (its fully-qualified class name, if applicable) or a component-specific low cardinality error identifier.

If response status code was sent or received and status indicates an error according to HTTP span status definition, error.type SHOULD be set to the status code number (represented as a string), an exception type (if thrown) or a component-specific error identifier.
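As a rough sketch of how those two paragraphs could translate into code: `http_error_type` below is a hypothetical helper, not part of any OpenTelemetry library. It assumes the HTTP span status definition treats 4xx and 5xx as errors on client spans and only 5xx as errors on server spans, it prefers the status code over the exception type when both are available (the quoted text does not prescribe an order), and it uses the bare class name instead of a fully-qualified one for brevity.

```python
from typing import Optional


def http_error_type(
    status_code: Optional[int],
    exc: Optional[BaseException] = None,
    server_span: bool = False,
) -> Optional[str]:
    """Pick an error.type value for an HTTP span, or None when there is no error."""
    if status_code is not None:
        # Assumed reading of the HTTP span status definition:
        # client spans treat 4xx and 5xx as errors, server spans only 5xx.
        error_threshold = 500 if server_span else 400
        if status_code >= error_threshold:
            return str(status_code)
    if exc is not None:
        # Request failed before a status code was sent or received,
        # or an exception was thrown despite a non-error status.
        return type(exc).__qualname__
    return None
```

For example, `http_error_type(None, TimeoutError())` gives `"TimeoutError"`, and `http_error_type(404, server_span=True)` gives `None` because a 4xx response is not an error from the server's point of view under the assumption above.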

cheempz · 2024-11-19T18:24:22Z

Thanks @trask! If I'm reading correctly, it means a span with error status should also set error.type. Does it also mean, conversely, that if error.type is set the span status should also be set to error? What we're trying to figure out is this: when analyzing a trace, can we simply look at span status error as the indicator that an operation had an error, then use the span error.type attribute and span exception event as additional information about the error? Or could any of these three (span status, span attribute, span event) occur independently and indicate an error? I know it's ultimately in the end user's hands if they want to set/override any of these, just wondering if there are semconv guidelines around the intended use of these elements.

trask · 2024-11-19T20:23:43Z

can we simply look at span status error as the indicator that an operation had an error

👍 (I hope this is answered here: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/api.md#set-status)

lmolkova · 2024-11-19T23:08:33Z

Related: #1560 (comment)

Quoting @trask here for the context

Feedback from the spec SIG was that this sounds like a good default behavior, but we should make it configurable.

Some options for making it configurable:

(A) handle this all in the SDK:

- new configuration added to TracerConfig
- RecordException behavior could depend on this new configuration in TracerConfig

but I don't think we can change the default behavior of the SDK.

(B) Another option could be to introduce "trace advice" (similar to metric advice) that would allow a given instrumentation the ability to opt-in to this new behavior.

(C) Another option could be:

- new configuration added to TracerConfig
- instead of changing the (existing) behavior of RecordException, ask instrumentations to check this configuration value before calling RecordException

(D) Another option could be:

- Introduce a standard "database" instrumentation configuration property (similar to the standard "http" configuration properties that we already have: https://github.com/open-telemetry/opentelemetry-configuration/blob/e1f89ef021f3c8e000d05e6b6e53bc4a2372fe52/examples/kitchen-sink.yaml#L396-L400)
- ask instrumentations to check this configuration value before calling RecordException
- we may also want to be able to override this at individual instrumentation level somehow

(E) Or even:

- Introduce a global instrumentation configuration property that covers all instrumentations
- ask instrumentations to check this configuration value before calling RecordException
- we may also want to be able to override this at individual instrumentation level somehow
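For illustration only, option (E) with a per-instrumentation override might look something like the sketch below. Everything here is hypothetical: `ExceptionRecordingConfig`, its fields, and `maybe_record_exception` are invented for this example and do not correspond to an existing OpenTelemetry configuration property.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ExceptionRecordingConfig:
    """Hypothetical global setting with per-instrumentation overrides."""

    record_exception_events: bool = False
    overrides: Dict[str, bool] = field(default_factory=dict)

    def should_record(self, instrumentation_name: str) -> bool:
        # An explicit override wins; otherwise fall back to the global default.
        return self.overrides.get(instrumentation_name, self.record_exception_events)


def maybe_record_exception(
    config: ExceptionRecordingConfig,
    instrumentation_name: str,
    events: List[Dict[str, Any]],
    exc: BaseException,
) -> bool:
    """Check the configuration before recording, as options (C) to (E) ask."""
    if not config.should_record(instrumentation_name):
        return False
    # Stand-in for a RecordException call on a real span.
    events.append({"name": "exception", "exception.message": str(exc)})
    return True
```

The design point is that the existing RecordException behavior stays unchanged and the decision sits in the instrumentation, which is what distinguishes (C), (D) and (E) from option (A).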

lmolkova · 2024-11-19T23:27:54Z

Ultimately I don't think that we should record exceptions on spans - we should use log-based events for it since people may want to record errors when spans are sampled out or when tracing is disabled. Plus logs have severity and other future benefits such as body which could accommodate structured stack traces and exception chains.

If we think about logs there are at least two options that make sense:

- whoever threw the exception logs that exception - this prevents everyone who rethrows (wraps/aggregates) it from re-logging the stack trace over and over again. In this case we log every exception once and it has the right context. But we don't yet know if it will be handled, so we might end up logging a lot of noise.
- the catch-all handler in the application logs it - this would allow recording only unhandled exceptions, but they would (at best) have the root span as the context.

There are options in between (do both, or log every time an exception is created or suppressed by another one).

So I think we need some sort of an exception logging strategy enum and it should be on the LoggerConfig.
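A very rough sketch of what such an enum could look like. The names `ExceptionLoggingStrategy`, `LogSite` and `should_log` are hypothetical and only mirror the two options and the "do both" variant described above; nothing like this exists on LoggerConfig today as far as this thread establishes.

```python
from enum import Enum


class ExceptionLoggingStrategy(Enum):
    """Hypothetical strategies mirroring the options discussed above."""

    AT_THROW_SITE = "at_throw_site"    # whoever threw the exception logs it once
    UNHANDLED_ONLY = "unhandled_only"  # only the application's catch-all handler logs
    BOTH = "both"                      # log at both places


class LogSite(Enum):
    """Where in the code the logging decision is being made."""

    THROW = "throw"
    CATCH_ALL = "catch_all"


def should_log(strategy: ExceptionLoggingStrategy, site: LogSite) -> bool:
    """Decide whether code at the given site should log the exception."""
    if strategy is ExceptionLoggingStrategy.BOTH:
        return True
    if strategy is ExceptionLoggingStrategy.AT_THROW_SITE:
        return site is LogSite.THROW
    return site is LogSite.CATCH_ALL
```

A library that wraps and rethrows would not call `should_log` with `LogSite.THROW` for the wrapped exception, which is what avoids logging the same stack trace repeatedly.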

lmolkova · 2024-11-19T23:54:26Z

One more thing to consider: almost every library is instrumented natively with logs and records exceptions already. If we have OTel integration for the underlying logger, there is no benefit in re-recording lib exceptions in OTel instrumentations.

We should still provide guidance for green-field native instrumentations on when to log exceptions.

lmolkova added messaging-stability-blocker area:db area:messaging area:gen-ai labels Oct 31, 2024

lmolkova added this to Database Client Semantic Conventions and Spec: Messaging Semantics Oct 31, 2024

github-project-automation bot moved this to V1 - Stable Semantics in Spec: Messaging Semantics Oct 31, 2024

lmolkova moved this to Todo in Database Client Semantic Conventions Oct 31, 2024

lmolkova mentioned this issue Oct 31, 2024

[SqlClient] Add error.type and db.response.status_code attributes open-telemetry/opentelemetry-dotnet-contrib#2262

Merged

lmolkova self-assigned this Nov 7, 2024

lmolkova mentioned this issue Nov 8, 2024

Add section on span status for databases #1560

Merged

3 tasks

lmolkova mentioned this issue Dec 10, 2024

OTEP: Recording exceptions as log based events open-telemetry/opentelemetry-specification#4333

Open

5 tasks

lmolkova mentioned this issue Dec 28, 2024

Add common guidance on recording errors on spans and metrics, clarify DB conventions #1716

Merged

3 tasks

lmolkova mentioned this issue Jan 16, 2025

Deprecate exception.escaped attribute, update exception example open-telemetry/opentelemetry-specification#4368

Merged

5 tasks

lmolkova closed this as completed in #1716 Jan 16, 2025

github-project-automation bot moved this from Todo to Done in Database Client Semantic Conventions Jan 16, 2025
