-
Notifications
You must be signed in to change notification settings - Fork 903
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Semantic conventions for when to record exceptions #521
Conversation
|
||
An error event created by an API call SHOULD be associated with the current span from which the API call is made. | ||
Errors and exceptions SHOULD be recorded any time an unhandled exception or error leaves the boundary of a span. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can only happen one time per span though? Then we can use span attributes instead of events as currently suggested in most PRs. I would agree with that approach 👍 We can add semantic conventions for unhandled errors and handled errors separately though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not arguing, but why storing attributes in a span is better than storing same attributes in an event? It looks the same for me except that
- there can be multiple events (errors/logs)
- event has a time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is simpler/faster to process. If you want to know if your span has an error on the backend, you can check mySpan.hasAttribute('error.whatever')
instead of any(e.name == 'error' for e in mySpan.events)
.
EDIT: On second thought, you could also have an hasEventNamed('error')
method on your backend span, but that would still most likely have a more complex implementation than a hash map lookup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it's true we would not run into double errors when we auto instrument, a user can still call an API to record an error. Error attributes don't handle if the user calls the API twice on the same span or if they call it once when we auto instrumented an error on the same span.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel this tension very strongly; my personal preference has been one error recorded per span, but in go we went in the direction of having separate (potentially multiple) spanevents recorded, one per error. I still do feel that in general it is a useful optimization to flag the spans that have spanevent children indicating error, so that they're more easily searchable (rather than requiring parent HAVING child.error = true
type search syntax)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still do feel that in general it is a useful optimization to flag the spans that have spanevent children indicating error, so that they're more easily searchable (rather than requiring
parent HAVING child.error = true
type search syntax)
@lizthegrey Do you mean marking parents of a failed child as failed?
This won't be doable since a child can end (and thus fail) after the parent has ended in async scenarious and at that time the parent might have been exported already (or is at least no longer accessible to the instrumentation). This would be an optimization the backend would have to take care of when ingesting and analyzing the tree of spans.
Furthermore, a parent can also gracefully handle failures, e.g., by retrying the failed child operation or falling back (to alternative/cached data, for example), and thus succeed as a whole despite failed child spans.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, it's possible that a parent span can succeed despite a child span failing; however, our current schema is using SpanEvents (not child Spans, and not storing the information on the span itself) to record error statuses. I guess that's what the overall status of the span in terms of gRPC code is for though.
|
||
## Recording an Error | ||
|
||
Errors and exceptions SHOULD be recorded any time an unhandled exception or error leaves the boundary of a span. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about nested spans and error propagating? E.g.
span1
span2
span3 <- error happens here and propagated up to span1
Should the same error be recorded 3 times in span3, span2, and span1? It seems wasteful, but it is also not clear how to prevent/deduplicate it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then the user will know if/where it was handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is worth including in the PR if this is your intention.
But:
- it is not true in my case - errors tend to be recorded several times, but propagation stops at instrumentation boundaries (it is fixable for sure), not when error is handled
- it is expensive and slightly annoying - especially if/when user stores backtraces
E.g. in Go call depth of 10-20 frames is pretty common and errors are usually handled at the very top level - duplicating error information with backtraces 10-20 times is a no go...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In e.g. Java the following situation is quite common:
- span3 throws an exception, let us call it "low level error". E.g. that database call has failed
- span2 catches it and wraps it into another exception object. This can be some sort of "business process" exception, e.g. "failed to update inventory"
- span1 catches that and wraps it again into yet another exception object. This time more user-oriented one, e.g. "checkout failed".
One can argue that all three exceptions should be recorded and propagated to the backend. And all three spans should be marked as failed.
typo in title |
I think this PR is superseded by #697 and propose to close it. @open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers @open-telemetry/technical-committee |
Actually, this is still something we have to spec: We now have the recordException API but no guidance on |
@murphpdx Are you interested in refreshing this PR? There is now |
|
||
## Recording an Error | ||
|
||
Errors and exceptions SHOULD be recorded any time an unhandled exception or error leaves the boundary of a span. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I agree mostly with this spec, I would prefer to also specify:
Errors and exceptions MUST NOT be recorded unless an unhandled exception or error leaves the boundary of a span.
If we want to support any such situations, we would need to add something to distinguish them from unhandled (in the above sense) execptions, probably a simplified variant of what is suggested in #747.
I took the liberty to change the PR title. I hope it is more discoverable that way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving this because I agree with the content (although editorial changes are required before merging this, namely moving the content to the existing recordException API and the exception semantic convention)
@@ -0,0 +1,10 @@ | |||
# Semantic conventions for errors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now we have recordException
API, thus this should probably talk only about exceptions, and not errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If @murphpdx is no longer working on this PR, I could create an updated version.
Handing this off to John Watson. |
who is handing it off to @Oberon00 |
|
41efc60
to
c6ecad8
Compare
Sorry, I accidentally pushed to this PR's branch which caused GH to re-request reviews in the name of @murphpdx it seems. |
What are we gonna do with the EasyCLA check anyway? If I base my PR on this one, the check will keep failing. I could remove the commit from @murphpdx and replace it with a Co-authored-by and hope that this is fine. |
I think Amanda would be fine with that. |
#697 is now merged. Do we still need this PR? |
This PR is not meant to compete with the the following PR's:
#427
#432
This addition is only intended to spec out when to instrument errors. Java currently uses this behavior.