Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Semantic conventions for when to record exceptions #521

Closed
wants to merge 3 commits into from

Conversation

murphpdx
Copy link

This PR is not meant to compete with the the following PR's:
#427
#432

This addition is only intended to spec out when to instrument errors. Java currently uses this behavior.


An error event created by an API call SHOULD be associated with the current span from which the API call is made.
Errors and exceptions SHOULD be recorded any time an unhandled exception or error leaves the boundary of a span.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can only happen one time per span though? Then we can use span attributes instead of events as currently suggested in most PRs. I would agree with that approach 👍 We can add semantic conventions for unhandled errors and handled errors separately though.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not arguing, but why storing attributes in a span is better than storing same attributes in an event? It looks the same for me except that

  • there can be multiple events (errors/logs)
  • event has a time

Copy link
Member

@Oberon00 Oberon00 Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is simpler/faster to process. If you want to know if your span has an error on the backend, you can check mySpan.hasAttribute('error.whatever') instead of any(e.name == 'error' for e in mySpan.events).
EDIT: On second thought, you could also have an hasEventNamed('error') method on your backend span, but that would still most likely have a more complex implementation than a hash map lookup.

Copy link
Author

@murphpdx murphpdx Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While it's true we would not run into double errors when we auto instrument, a user can still call an API to record an error. Error attributes don't handle if the user calls the API twice on the same span or if they call it once when we auto instrumented an error on the same span.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel this tension very strongly; my personal preference has been one error recorded per span, but in go we went in the direction of having separate (potentially multiple) spanevents recorded, one per error. I still do feel that in general it is a useful optimization to flag the spans that have spanevent children indicating error, so that they're more easily searchable (rather than requiring parent HAVING child.error = true type search syntax)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still do feel that in general it is a useful optimization to flag the spans that have spanevent children indicating error, so that they're more easily searchable (rather than requiring parent HAVING child.error = true type search syntax)

@lizthegrey Do you mean marking parents of a failed child as failed?
This won't be doable since a child can end (and thus fail) after the parent has ended in async scenarious and at that time the parent might have been exported already (or is at least no longer accessible to the instrumentation). This would be an optimization the backend would have to take care of when ingesting and analyzing the tree of spans.
Furthermore, a parent can also gracefully handle failures, e.g., by retrying the failed child operation or falling back (to alternative/cached data, for example), and thus succeed as a whole despite failed child spans.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, it's possible that a parent span can succeed despite a child span failing; however, our current schema is using SpanEvents (not child Spans, and not storing the information on the span itself) to record error statuses. I guess that's what the overall status of the span in terms of gRPC code is for though.


## Recording an Error

Errors and exceptions SHOULD be recorded any time an unhandled exception or error leaves the boundary of a span.
Copy link

@vmihailenco vmihailenco Mar 19, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about nested spans and error propagating? E.g.

span1
    span2
        span3 <- error happens here and propagated up to span1

Should the same error be recorded 3 times in span3, span2, and span1? It seems wasteful, but it is also not clear how to prevent/deduplicate it...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the user will know if/where it was handled.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is worth including in the PR if this is your intention.

But:

  • it is not true in my case - errors tend to be recorded several times, but propagation stops at instrumentation boundaries (it is fixable for sure), not when error is handled
  • it is expensive and slightly annoying - especially if/when user stores backtraces

E.g. in Go call depth of 10-20 frames is pretty common and errors are usually handled at the very top level - duplicating error information with backtraces 10-20 times is a no go...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In e.g. Java the following situation is quite common:

  • span3 throws an exception, let us call it "low level error". E.g. that database call has failed
  • span2 catches it and wraps it into another exception object. This can be some sort of "business process" exception, e.g. "failed to update inventory"
  • span1 catches that and wraps it again into yet another exception object. This time more user-oriented one, e.g. "checkout failed".

One can argue that all three exceptions should be recorded and propagated to the backend. And all three spans should be marked as failed.

@dyladan
Copy link
Member

dyladan commented Mar 19, 2020

typo in title

@murphpdx murphpdx changed the title Semantic conversions for errors Semantic conventions for errors Mar 19, 2020
@arminru arminru added the area:error-reporting Related to error reporting label May 12, 2020
@carlosalberto carlosalberto added the area:semantic-conventions Related to semantic conventions label Jun 19, 2020
@reyang reyang added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jul 6, 2020
@iNikem
Copy link
Contributor

iNikem commented Jul 30, 2020

I think this PR is superseded by #697 and propose to close it. @open-telemetry/specs-approvers @open-telemetry/specs-trace-approvers @open-telemetry/technical-committee

@Oberon00
Copy link
Member

Oberon00 commented Jul 30, 2020

Actually, this is still something we have to spec: We now have the recordException API but no guidance on how when to record exceptions. It's more similar to my #747.

@iNikem
Copy link
Contributor

iNikem commented Jul 30, 2020

@murphpdx Are you interested in refreshing this PR? There is now specification/trace/semantic_conventions/exceptions.md to hold this proposal. I think its content may remain the same.


## Recording an Error

Errors and exceptions SHOULD be recorded any time an unhandled exception or error leaves the boundary of a span.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree mostly with this spec, I would prefer to also specify:

Errors and exceptions MUST NOT be recorded unless an unhandled exception or error leaves the boundary of a span.

If we want to support any such situations, we would need to add something to distinguish them from unhandled (in the above sense) execptions, probably a simplified variant of what is suggested in #747.

@Oberon00 Oberon00 changed the title Semantic conventions for errors Semantic conventions for when to record exceptions Jul 31, 2020
@Oberon00
Copy link
Member

I took the liberty to change the PR title. I hope it is more discoverable that way.

Copy link
Member

@Oberon00 Oberon00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving this because I agree with the content (although editorial changes are required before merging this, namely moving the content to the existing recordException API and the exception semantic convention)

@@ -0,0 +1,10 @@
# Semantic conventions for errors
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we have recordException API, thus this should probably talk only about exceptions, and not errors.

Copy link
Member

@Oberon00 Oberon00 Aug 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If @murphpdx is no longer working on this PR, I could create an updated version.

@murphpdx
Copy link
Author

murphpdx commented Aug 5, 2020

Handing this off to John Watson.

@jkwatson
Copy link
Contributor

jkwatson commented Aug 5, 2020

Handing this off to John Watson.

who is handing it off to @Oberon00

@Oberon00 Oberon00 self-assigned this Aug 6, 2020
@murphpdx murphpdx requested a review from a team August 6, 2020 07:05
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Aug 6, 2020

CLA Check

@murphpdx murphpdx requested a review from a team August 6, 2020 07:06
@Oberon00
Copy link
Member

Oberon00 commented Aug 6, 2020

Sorry, I accidentally pushed to this PR's branch which caused GH to re-request reviews in the name of @murphpdx it seems.

@Oberon00
Copy link
Member

Oberon00 commented Aug 6, 2020

What are we gonna do with the EasyCLA check anyway? If I base my PR on this one, the check will keep failing. I could remove the commit from @murphpdx and replace it with a Co-authored-by and hope that this is fine.

@jkwatson
Copy link
Contributor

jkwatson commented Aug 6, 2020

What are we gonna do with the EasyCLA check anyway? If I base my PR on this one, the check will keep failing. I could remove the commit from @murphpdx and replace it with a Co-authored-by and hope that this is fine.

I think Amanda would be fine with that.

@tigrannajaryan
Copy link
Member

#697 is now merged. Do we still need this PR?

@arminru
Copy link
Member

arminru commented Aug 12, 2020

Superseded by #761. Thanks @murphpdx for the initial proposal!

@arminru arminru closed this Aug 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:error-reporting Related to error reporting area:semantic-conventions Related to semantic conventions release:required-for-ga Must be resolved before GA release, or nice to have before GA
Projects
None yet
Development

Successfully merging this pull request may close these issues.