Error stack traces are insufficient for root-cause analysis #11
Comments
Big +1 on that! Right now there is no indication of whether a span in the timeline has an error. It would make it so much easier to analyze errors if we had that :)
This is a great idea! We should definitely include error info in the transaction views, and the timeline is a good place for that. The transaction list can also be used, for example to color a transaction in which an error has occurred often enough to cross some predefined threshold (span errors should be attributed to their parent transactions), and in the transaction sample view we could highlight the error span in the table. The service map would be a good place for this as well, for example by coloring the affected service. One more thought: not only can we make more use of our error data to indicate a problem, the same goes for durations. We can do the same for latency issues by using predefined thresholds (it is tempting to say we could employ ML for that, but in reality it is very difficult to get right).
I would also propose adding an error flag to spans. This enables us to add an error rate row in the transaction overview, and users could also create custom visualizations to graph the error rate of spans. We currently have a result property on transactions (for example HTTP 2xx), but nothing equivalent for span errors. Transactions and spans which have the error flag set to true can be color coded (for example with a red border) or marked with ❗️
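For illustration, here is a minimal sketch of the idea. The `Span` type, its field names, and `errorRate` below are hypothetical and not the actual APM data model; the proposed addition is the `Error` flag:

```go
package main

import "fmt"

// Span is a hypothetical, simplified span representation; the proposed
// addition in this comment is the Error flag.
type Span struct {
	TransactionID string
	Name          string
	Error         bool // true if the span ended with an error
}

// errorRate returns the fraction of spans flagged as errors, which could
// back an "error rate" row in the transaction overview or a custom
// visualization.
func errorRate(spans []Span) float64 {
	if len(spans) == 0 {
		return 0
	}
	errs := 0
	for _, s := range spans {
		if s.Error {
			errs++
		}
	}
	return float64(errs) / float64(len(spans))
}

func main() {
	spans := []Span{
		{TransactionID: "abc123", Name: "SELECT FROM users", Error: false},
		{TransactionID: "abc123", Name: "POST /charge", Error: true},
	}
	fmt.Printf("span error rate: %.2f\n", errorRate(spans))
}
```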
This is going to be super useful 😃 This was actually the reason why we added
Closing this out now that the transaction view links to the error view and vice versa.
In the Go agent, like all the others, we include a stack trace for errors, and for spans that exceed a configurable minimum duration. This can be useful for identifying the cause (or "culprit") of an error.
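For context, here is a minimal sketch (not a complete program) of how this looks from application code, assuming the `go.elastic.co/apm` Go agent package; the minimum span duration for attaching stack traces is controlled by the `ELASTIC_APM_SPAN_FRAMES_MIN_DURATION` setting:

```go
package demo

import (
	"context"
	"errors"

	"go.elastic.co/apm"
)

func queryUsers(ctx context.Context) {
	// Spans longer than the configured minimum duration
	// (ELASTIC_APM_SPAN_FRAMES_MIN_DURATION) have a stack trace attached.
	span, ctx := apm.StartSpan(ctx, "SELECT FROM users", "db.postgresql.query")
	defer span.End()

	if err := runQuery(ctx); err != nil {
		// CaptureError records the error together with a stack trace taken
		// at this call site, which is assumed to contain the culprit.
		apm.CaptureError(ctx, err).Send()
	}
}

func runQuery(ctx context.Context) error {
	return errors.New("connection refused") // placeholder for a real query
}
```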
One issue with relying on stack traces to identify the culprit is the assumption that the culprit exists in the call stack. For languages like Go, this is often not the case. For example, see elastic/apm-agent-go#258: gocql (a Cassandra client for Go) provides observers which are notified of errors, but the observer is not run in the same goroutine as the original calling code.
This issue is probably more prevalent in Go programs due to the ease of concurrency, but it is not confined to Go. Other languages might distribute work via a thread pool; an error raised in one of those threads will similarly have a different stack from the original caller.
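To make this concrete, here is a small self-contained Go sketch (no APM agent involved) showing that a stack trace captured where the error is observed, in the library's goroutine, does not contain the application code that actually issued the work:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// result mimics what a client library's observer (e.g. gocql's QueryObserver)
// would report: an error plus the stack captured at the point of observation.
type result struct {
	err   error
	stack []byte
}

// doQuery runs in the library's own goroutine, so the stack captured here
// shows doQuery and goroutine plumbing, not the caller that issued the query.
func doQuery(results chan<- result) {
	results <- result{
		err:   fmt.Errorf("query failed"),
		stack: debug.Stack(),
	}
}

func main() {
	results := make(chan result, 1)
	go doQuery(results) // work handed off to another goroutine
	r := <-results

	// main (the logical culprit) does not appear in the printed stack,
	// which is why stack traces alone can be insufficient for root cause.
	fmt.Printf("error: %v\nstack captured in observer goroutine:\n%s", r.err, r.stack)
}
```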
We should investigate additional methods of identifying the cause/culprit of an error. A couple of ideas come to mind: