Adjust error instrumentation for easier aggregation #5426
Comments
It turns out that the example errors above that caused a lot of duplication issues went away with shortcodes (#5619). However, digging into the long tail of errors, there are still cases of duplication or empty errors that can be fixed, so I will use this issue to log some of those cases.
Some errors are duplicated because they have a secondary message attached, though they boil down to the same event. Examples:
In some of these cases, we have two occurrences because we join the "performance" and "error" event tables in Kusto to track errors. These should be dealt with via better queries. Examples to fix by hand:
Edit: I updated the original description to this effect too. There's a design (#6676) I'm socializing around adding a new field
Now that Will be something like opening specific issues for the couple of areas that are a priority to reinstrument, and not tracking the long tail but rather depending on training the team to fix them up as needed/convenient. At that point I'll close this issue.
Moving this to "Next" milestone. For August, let's just focus on #6754 (Updating our own instrumentation) which will cover many of these cases. Then we can see where the noise is and what area to look at next. Also, if specific cases emerge from live site issues from the Office Fluid team then we can take those on a case-by-case basis.
Closing this issue, its utility has dried up now that |
Work Item
Describe the outcome you expect
As it is, we have many many cases where error messages contain variable info, whether from our own system or a dependency, which dilutes error counts and can hide top errors.
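To make the dilution effect concrete, here is a minimal sketch. The sample messages are invented for illustration and are not taken from real telemetry; `countBy` and `toPattern` are hypothetical helpers, and replacing digit runs is only a crude stand-in for spotting variable data.

```typescript
// Invented sample messages; real telemetry will look different.
const rawMessages: string[] = [
  "Failed to fetch blob 1042 after 3 retries",
  "Failed to fetch blob 2210 after 5 retries",
  "Failed to fetch blob 3178 after 3 retries",
  "Socket closed",
  "Socket closed",
];

// Count how many items fall under each key.
function countBy(items: string[], keyOf: (item: string) => string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const item of items) {
    const key = keyOf(item);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Crude normalization (an assumption for this sketch): treat every run of digits as variable data.
function toPattern(message: string): string {
  return message.replace(/\d+/g, "<n>");
}

// Grouping by the raw message splits the fetch failure into three buckets of one,
// so "Socket closed" looks like the top error.
const rawCounts = countBy(rawMessages, (message) => message);

// Grouping by the static pattern shows the fetch failure is actually the most frequent.
const patternCounts = countBy(rawMessages, toPattern);

console.log(rawCounts.size, patternCounts.size);
```

In this sample the raw grouping yields four buckets and the pattern grouping yields two, which is the kind of collapse a static string per code point would give without any after-the-fact normalization.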
Assert/Error messages should be "static" for a particular code point, so that we can properly aggregate errors to recognize top errors. See #6676 design, we'll use `errorCode` as a distinct field from `message`, containing static strings only.
Approach
Query our Microsoft internal data to look at distinct values of `Data_error` and find patterns where a similar message has variations that preclude aggregation. Meanwhile we can also proactively update our own instrumentation to use error code (e.g. #6754).
There are going to be three main buckets of reasons from my experience so far:
- ... `message` (`Data_error`) field.
- ... `undefined` or an exception arising from Scriptor or another DataStore. In these cases, error code would be added via `normalizeError` with `errorCodeIfNone`, so consider if there is a choke point where we should be normalizing.
Open questions
How to handle error messages like this from dependencies? Related to "Redact error messages from dependencies we don't control to avoid PII leaks" (#4908).
`message` will always be uncontrolled chaos. That's why we're introducing `errorCode`.
Acceptance criteria
Querying production logs and grouping by `errorType` and `errorCode` should yield a relatively scoped set of buckets containing no variable data in those two fields, with not too much noise under the `none` buckets.
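As a rough illustration of the acceptance criteria, the sketch below groups events by `errorType` and `errorCode` after passing them through a fallback-code step. Everything here is an assumption made for the example: `normalizeForLogging` is a hypothetical stand-in and does not reproduce the real `normalizeError` signature, and the `LoggedError` shape, the error type and code values, and the sample events are all invented.

```typescript
// Hypothetical shape of a logged error event (an assumption, not the real telemetry schema).
interface LoggedError {
  errorType: string;
  errorCode: string; // static string only, or "none" when nothing was supplied
  message: string; // free-form text that may contain variable data
}

// Hypothetical helper: fills in a static fallback code when the error carries none.
// This is NOT the real normalizeError API, only a sketch of the idea.
function normalizeForLogging(error: unknown, errorCodeIfNone?: string): LoggedError {
  const candidate = (typeof error === "object" && error !== null ? error : {}) as Partial<LoggedError>;
  return {
    errorType: typeof candidate.errorType === "string" ? candidate.errorType : "genericError",
    errorCode: typeof candidate.errorCode === "string" ? candidate.errorCode : (errorCodeIfNone ?? "none"),
    message: typeof candidate.message === "string" ? candidate.message : String(error),
  };
}

// Invented sample events: two dependency exceptions caught at one choke point,
// one bare `undefined`, and one error that already carries its own code.
const events: LoggedError[] = [
  normalizeForLogging(new Error("Cannot read property 'x' of undefined"), "dataStoreLoadFailed"),
  normalizeForLogging(new Error("Cannot read property 'y' of undefined"), "dataStoreLoadFailed"),
  normalizeForLogging(undefined),
  normalizeForLogging({ errorType: "dataCorruptionError", errorCode: "summaryMissing", message: "Summary 42 missing" }),
];

// Group by the two static fields only; `message` is deliberately ignored.
function bucketCounts(items: LoggedError[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { errorType, errorCode } of items) {
    const key = `${errorType}/${errorCode}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

const buckets = bucketCounts(events);
console.log([...buckets.entries()]);
```

The two dependency exceptions differ in `message` but land in one bucket because the fallback code was supplied where they were caught; the event normalized without a fallback ends up under `none`, which is the noise the criteria aim to keep small.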