Error stack traces are insufficient for root-cause analysis #11

Closed
axw opened this issue Nov 7, 2018 · 5 comments

@axw
Member

axw commented Nov 7, 2018

In the Go agent, as in all the other agents, we include the stack trace for errors, and for spans that exceed a configurable minimum duration. This can be useful for identifying the cause (or "culprit") of an error.

One issue with relying on stack traces for identifying the culprit is the assumption that the culprit exists in the call stack. For languages like Go, this may often not be the case. For example, see elastic/apm-agent-go#258: gocql (Cassandra client for Go) provides observers which notify of errors, but the observer is not run in the same goroutine as the original calling code.
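To make the problem concrete, here is a minimal Go sketch. It does not use gocql; the `client` type, its `worker` goroutine, and `handleRequest` are all hypothetical stand-ins for a driver that runs an observer callback on its own goroutine. The stack captured inside the observer contains the driver's worker frame but not the application function that issued the query.

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// client is a hypothetical stand-in for a driver that executes queries on
// its own worker goroutine and reports failures to an observer callback.
type client struct {
	queries chan string
	done    chan struct{}
	observe func(query string)
}

func newClient(observe func(string)) *client {
	c := &client{
		queries: make(chan string),
		done:    make(chan struct{}),
		observe: observe,
	}
	go c.worker() // the observer will run on this goroutine
	return c
}

func (c *client) worker() {
	for q := range c.queries {
		c.observe(q) // pretend every query fails and is reported
		c.done <- struct{}{}
	}
}

// Query hands the query to the worker and waits for it to finish.
func (c *client) Query(q string) {
	c.queries <- q
	<-c.done
}

// currentStack returns the stack trace of the calling goroutine only.
func currentStack() string {
	buf := make([]byte, 8192)
	n := runtime.Stack(buf, false)
	return string(buf[:n])
}

// handleRequest is the application code we would like to see as the culprit.
//
//go:noinline
func handleRequest(c *client) {
	c.Query("SELECT 1")
}

// observedStack returns the stack trace as seen from inside the observer.
func observedStack() string {
	var stack string
	c := newClient(func(string) { stack = currentStack() })
	handleRequest(c)
	close(c.queries)
	return stack
}

// mentions reports whether the stack trace text contains the given name.
func mentions(stack, name string) bool {
	return strings.Contains(stack, name)
}

func main() {
	stack := observedStack()
	fmt.Println("worker in stack:       ", mentions(stack, "worker"))
	fmt.Println("handleRequest in stack:", mentions(stack, "handleRequest"))
}
```

An agent that captures the stack at the point where the error is observed would therefore attribute it to the driver's goroutine, not to `handleRequest`.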

This issue is probably more prevalent in Go programs due to the ease of concurrency, but it is not confined to Go. Other languages might distribute work via a thread-pool; an error in one of those threads will similarly have a different stack to the original caller.

We should investigate additional methods of identifying the cause/culprit of an error. A couple of ideas come to mind:

  • on the transaction pages, show errors in the transaction timeline (e.g. red markers along a span, or along the top like 'marks' for a transaction.)
  • on the error page, show the transaction timeline with an indication of where/when the error occurred
@felixbarny
Member

> on the transaction pages, show errors in the transaction timeline (e.g. red markers along a span, or along the top like 'marks' for a transaction.)

Big +1 on that! Right now there is no indication of whether a span in the timeline has an error. It would make it so much easier to analyze errors if we had that :)

@eyalkoren
Contributor

This is a great idea! We should definitely include error info in the transaction views, and the timeline is a good place for that. The transaction list can also be used, for example by coloring a transaction in which errors have occurred often enough to cross some predefined threshold (span errors should be attributed to their parent transactions). In the transaction sample view, we could highlight the error span in the table. The service map would be a good place as well: coloring the problematic node/service for DT root cause analysis.

One more thought: we can make more use not only of our error data to indicate a problem, but also of the durations. We can flag latency issues in the same way by using predefined thresholds (it is tempting to say that we can employ ML for that, but in reality it is very difficult to get that right).

@felixbarny
Member

I would also propose to add an error flag to spans and transactions. This would be set to true if there was at least one error captured or if the HTTP status code >= 400.
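As a rough illustration of that rule, here is a small Go sketch. The `Event` type and its field names are hypothetical and are not taken from the agent or the intake schema; it only shows the proposed condition.

```go
package main

import "fmt"

// Event is a hypothetical, simplified shape shared by spans and transactions.
type Event struct {
	Name           string
	HTTPStatusCode int // 0 when the event is not an HTTP request
	CapturedErrors int // number of errors captured while the event was active
}

// HasError implements the proposed rule: the flag is true if at least one
// error was captured, or if the HTTP status code is >= 400.
func (e Event) HasError() bool {
	return e.CapturedErrors > 0 || e.HTTPStatusCode >= 400
}

func main() {
	events := []Event{
		{Name: "GET /ok", HTTPStatusCode: 200},
		{Name: "GET /missing", HTTPStatusCode: 404},
		{Name: "SELECT FROM users", CapturedErrors: 1},
	}
	for _, e := range events {
		fmt.Printf("%-20s error=%t\n", e.Name, e.HasError())
	}
}
```

Because the flag does not depend on the shape of a protocol-specific result string, the same check applies to HTTP and non-HTTP events alike.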

This enables us to add an error rate row in the transaction overview column. Users can also create custom visualizations to graph the error rate of spans.
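With a boolean flag, the error rate becomes a simple aggregation. The following Go sketch groups spans by name and computes the fraction that are flagged; the `Span` type and its `Error` field are hypothetical, and in practice this would more likely be an aggregation in the datastore than application code.

```go
package main

import "fmt"

// Span is a hypothetical, minimal span carrying the proposed error flag.
type Span struct {
	Name  string
	Error bool
}

// errorRates returns, for each span name, the fraction of spans with the
// error flag set.
func errorRates(spans []Span) map[string]float64 {
	total := make(map[string]int)
	failed := make(map[string]int)
	for _, s := range spans {
		total[s.Name]++
		if s.Error {
			failed[s.Name]++
		}
	}
	rates := make(map[string]float64, len(total))
	for name, n := range total {
		rates[name] = float64(failed[name]) / float64(n)
	}
	return rates
}

func main() {
	spans := []Span{
		{Name: "SELECT", Error: false},
		{Name: "SELECT", Error: true},
		{Name: "GET example.com", Error: false},
	}
	for name, rate := range errorRates(spans) {
		fmt.Printf("%s: %.0f%% errors\n", name, rate*100)
	}
}
```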

We currently have a result property (for example HTTP 5xx), but that does not easily let you query for erroneous transactions/spans in a protocol-independent way.

If a span has its error flag set to true, its corresponding transaction is not necessarily erroneous as well, as the business logic might take compensating actions.

Transactions and spans which have the error flag set can be color coded (for example red border) or marked with ❗️

@watson
Contributor

watson commented Dec 10, 2018

This is going to be super useful 😃 This was actually the reason why we added transactionId to the errors over a year ago, but priorities changed, so unfortunately it never got implemented in the UI. There is still an open issue on the error->transaction link in the Kibana repo: elastic/kibana#21919

@graphaelli
Member

Closing this out now that transaction view links to error view and vice versa.
