Error stack traces are insufficient for root-cause analysis #11

axw · 2018-11-07T09:04:12Z

In the Go agent, as in all the others, we include the stack trace for errors, and for spans that exceed a configurable minimum duration. This can be useful for identifying the cause (or "culprit") of an error.

One issue with relying on stack traces for identifying the culprit is the assumption that the culprit exists in the call stack. For languages like Go, this may often not be the case. For example, see elastic/apm-agent-go#258: gocql (Cassandra client for Go) provides observers which notify of errors, but the observer is not run in the same goroutine as the original calling code.
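A minimal sketch of the problem, using only the Go standard library (it does not use gocql or the agent; `captureStack`, `handleRequestSync`, and `handleRequestAsync` are hypothetical names chosen for illustration). A stack captured inside a separate goroutine should not contain the function that started it:

```go
package main

import (
	"fmt"
	"runtime"
	"strings"
)

// captureStack returns the function names on the current goroutine's stack,
// starting with captureStack itself.
func captureStack() []string {
	pcs := make([]uintptr, 32)
	n := runtime.Callers(1, pcs)
	frames := runtime.CallersFrames(pcs[:n])
	var names []string
	for {
		frame, more := frames.Next()
		names = append(names, frame.Function)
		if !more {
			break
		}
	}
	return names
}

// handleRequestSync captures the stack in the caller's own goroutine.
func handleRequestSync() []string {
	return captureStack()
}

// handleRequestAsync captures the stack in a new goroutine, standing in for
// an observer that is notified outside the original calling goroutine.
func handleRequestAsync() []string {
	result := make(chan []string)
	go func() {
		result <- captureStack()
	}()
	return <-result
}

// containsFunc reports whether a function with exactly this unqualified name
// is on the stack. A closure such as "main.handleRequestAsync.func1" does not
// count as "handleRequestAsync".
func containsFunc(stack []string, name string) bool {
	for _, fn := range stack {
		if strings.HasSuffix(fn, "."+name) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("sync stack contains caller: ",
		containsFunc(handleRequestSync(), "handleRequestSync"))
	fmt.Println("async stack contains caller:",
		containsFunc(handleRequestAsync(), "handleRequestAsync"))
}
```

In the asynchronous case the closure's name still hints at where it was defined, but the frames of the code that called `handleRequestAsync` are absent. Those frames are what a culprit search would need.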

This issue is probably more prevalent in Go programs due to the ease of concurrency, but it is not confined to Go. Other languages might distribute work via a thread-pool; an error in one of those threads will similarly have a different stack to the original caller.

We should investigate additional methods of identifying the cause/culprit of an error. A couple of ideas come to mind:

- on the transaction pages, show errors in the transaction timeline (e.g. red markers along a span, or along the top like 'marks' for a transaction)
- on the error page, show the transaction timeline with an indication of where/when the error occurred

felixbarny · 2018-11-09T10:31:04Z

> on the transaction pages, show errors in the transaction timeline (e.g. red markers along a span, or along the top like 'marks' for a transaction.)

Big +1 on that! Right now there is no indication about whether a span in the timeline has an error. It would make it so much easier to analyze errors if we had that :)

eyalkoren · 2018-12-10T06:31:53Z

This is a great idea! We should definitely include error info in the transaction views, and the timeline is a good place for that. The transaction list can also be used, for example to color a transaction in which errors have occurred often enough to cross some predefined threshold (span errors should be attributed to their parent transactions). In the transaction sample view, we can highlight the error span in the table. The service map will be a good place for this as well: coloring the problematic node/service would help with root-cause analysis in distributed tracing (DT).
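The threshold idea could be sketched roughly as follows. Everything here is an assumption for illustration: the `ErrorEvent` type, its fields, and `flaggedTransactions` are hypothetical and not part of any agent or server API. Errors raised in spans are counted against their parent transaction.

```go
package main

import "fmt"

// ErrorEvent is a hypothetical record of one captured error.
type ErrorEvent struct {
	TransactionName string // parent transaction the error is attributed to
	SpanID          string // empty if the error was raised outside any span
}

// flaggedTransactions returns the transaction names whose error count,
// including errors raised in their spans, reaches the threshold.
func flaggedTransactions(errs []ErrorEvent, threshold int) map[string]bool {
	counts := make(map[string]int)
	for _, e := range errs {
		// Span errors are attributed to the parent transaction.
		counts[e.TransactionName]++
	}
	flagged := make(map[string]bool)
	for name, n := range counts {
		if n >= threshold {
			flagged[name] = true
		}
	}
	return flagged
}

func main() {
	errs := []ErrorEvent{
		{TransactionName: "GET /orders", SpanID: "span-1"},
		{TransactionName: "GET /orders"},
		{TransactionName: "GET /users", SpanID: "span-2"},
	}
	fmt.Println(flaggedTransactions(errs, 2))
}
```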

One more thought: we can make more use not only of our error data to indicate a problem, but also of durations. We can flag latency issues in the same way, using predefined thresholds (it is tempting to say that we can employ ML for that, but in reality it is very difficult to get that right).

felixbarny · 2018-12-10T07:57:42Z

I would also propose adding an error flag to spans and transactions. This would be set to true if at least one error was captured or if the HTTP status code is >= 400.

This enables us to add an error rate row in the transaction overview column. Users can also create custom visualizations to graph the error rate of spans.
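A rough sketch of the proposed rule and the error rate it would enable. The `Event` type, its fields, and the function names are hypothetical and do not reflect an existing agent or intake schema:

```go
package main

import "fmt"

// Event is a hypothetical stand-in for a span or a transaction.
type Event struct {
	CapturedErrors int // number of errors captured during the event
	HTTPStatusCode int // 0 when the event is not an HTTP request
}

// HasError applies the proposed rule: at least one captured error,
// or an HTTP status code of 400 or above.
func (e Event) HasError() bool {
	return e.CapturedErrors > 0 || e.HTTPStatusCode >= 400
}

// errorRate returns the fraction of events with the error flag set.
// It returns 0 for an empty slice.
func errorRate(events []Event) float64 {
	if len(events) == 0 {
		return 0
	}
	flagged := 0
	for _, e := range events {
		if e.HasError() {
			flagged++
		}
	}
	return float64(flagged) / float64(len(events))
}

func main() {
	events := []Event{
		{HTTPStatusCode: 200},
		{HTTPStatusCode: 404},
		{CapturedErrors: 1}, // non-HTTP event with a captured error
		{},                  // non-HTTP event without errors
	}
	fmt.Printf("error rate: %.2f\n", errorRate(events))
}
```

Because the flag is a plain boolean, the same query works whether the event came from HTTP or from another protocol.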

We currently have a result property (for example HTTP 5xx), but that does not easily let you query for erroneous transactions/spans in a protocol-independent way.

If a span has its error flag set to true, its corresponding transaction is not necessarily erroneous as well, as the business logic might take compensating actions.

Transactions and spans which have the error flag set can be color coded (for example red border) or marked with ❗️

watson · 2018-12-10T08:34:07Z

This is going to be super useful 😃 This was actually the reason why we added transactionId to the errors over a year ago, but priorities changed so it never got implemented in the UI unfortunately. But there still is an open issue on the error->transaction link in the Kibana repo: elastic/kibana#21919

graphaelli · 2019-05-24T18:04:15Z

Closing this out now that the transaction view links to the error view and vice versa.

jalvz mentioned this issue Jan 15, 2019

Revisit error fields and grouping logic elastic/apm-server#1769

Closed

graphaelli closed this as completed May 24, 2019
