Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust error instrumentation for easier aggregation #5426

Closed
markfields opened this issue Mar 9, 2021 · 8 comments
Closed

Adjust error instrumentation for easier aggregation #5426

markfields opened this issue Mar 9, 2021 · 8 comments
Assignees
Milestone

Comments

@markfields
Copy link
Member

markfields commented Mar 9, 2021

Work Item

Describe the outcome you expect
As it is, we have many many cases where error messages contain variable info, whether from our own system or a dependency, which dilutes error counts and can hide top errors. Assert/Error messages should be "static" for a particular code point, so that we can properly aggregate errors to recognize top errors. See #6676 design, we'll use errorCode as a distinct field from message, containing static strings only.

Approach
Query our Microsoft internal data to look at distinct values of Data_error and find patterns where a similar message has variations that preclude aggregation. Meanwhile we can also proactively update our own instrumentation to use error code (e.g #6754)

There are going to be three main buckets of reasons from my experience so far:

  • Our own instrumentation including variables that should be pulled to telemetry props (hopefully Update error classes to implement IFluidErrorBase #6754 addresses these)
  • Our own instrumentation copying the message from a response payload from the server, e.g. Socket IO errors. In these cases, we probably want to attempt to model the server response (e.g. JSON structure or certain expected cases based on message substrings), and produce error codes based on the model, leaving the full message in the message (Data_error) field.
  • Unexpected errors arising from our own code or other client code we call that aren't instrumented with error code (e.g. dereferencing undefined or an exception arising from Scriptor or another DataStore. In these cases, error code would be added via normalizeError with errorCodeIfNone, so consider if there is a choke point where we should be normalizing.

Open questions

Acceptance criteria
Querying production logs and grouping by errorType and errorCode should yield a relatively scoped set of buckets containing no variable data in those two fields, with not too much noise under the none buckets.

@markfields markfields added this to the Next milestone Mar 9, 2021
@markfields markfields modified the milestones: Next, April 2021 Mar 27, 2021
@ksbrar
Copy link
Contributor

ksbrar commented Apr 21, 2021

It turns out that the example errors above that caused a lot of duplication issues went away with shortcodes (#5619).

However, digging into the long tail of errors there are still cases of duplication or empty errors that can be fixed, so I will use this issue to log some of those cases.

@ksbrar ksbrar modified the milestones: April 2021, May 2021 Apr 30, 2021
@ksbrar
Copy link
Contributor

ksbrar commented May 18, 2021

Some errors are duplicated because they have a secondary message attached, though they boil down to the same event. Examples:

  • "Error 403" vs "Error 403 (Forbidden)"
  • "Error 404" vs "Error 404 (Not Found)"

In some of these cases, we have two occurrences because of the fact that we join the "performance" and "error" event tables in Kusto to track errors. These should be dealt with via better queries.

Examples to fix by hand:
*

@ksbrar ksbrar modified the milestones: May 2021, June 2021 May 28, 2021
@markfields markfields modified the milestones: June 2021, July 2021 Jun 21, 2021
@markfields
Copy link
Member Author

markfields commented Jun 25, 2021

I've written a tool that boils our 30k+ unique error messages down to about 100 canonical forms (e.g. prefix, or fuzzy match pattern string). The top-hitter was from us logging the message field off the socket.io response object in case of an error. I just fixed that in PR #6557

See this query

image

@markfields
Copy link
Member Author

markfields commented Jul 1, 2021

edit: I updated the original description to this effect too

There's a design #6676 I'm socializing around adding a new field errorCode, kept separate from the original error message. In cases where we originate an error (e.g. asserts or anywhere else we new up an error object not to wrap another), we will be sure to use error code, and likewise have a simple/static error message.

@markfields markfields changed the title Dedupe error messages containing variable information Adjust error instrumentation for better aggregation Jul 8, 2021
@markfields markfields changed the title Adjust error instrumentation for better aggregation Adjust error instrumentation for easier aggregation Jul 8, 2021
@vladsud vladsud added the design-required This issue requires design thought label Jul 8, 2021
@markfields markfields assigned markfields and unassigned ksbrar Jul 8, 2021
@markfields
Copy link
Member Author

@vladsud once I get a bit more consensus on the direction in #6676 I'll remove the design-required label here.

@markfields
Copy link
Member Author

markfields commented Aug 3, 2021

Now that IFluidErrorBase is in, with fluidErrorCode, we can start this work in earnest. I'm about to head offline for 2 months for paternity leave, but before I go I hope to land with a concrete plan for this issue.

Will be something like opening specific issues for the couple areas that are a priority to reinstrument, and not tracking the long tail but rather depending on training the team to fix them up as needed/convenient. At that point I'll close this issue.

@markfields markfields modified the milestones: August 2021, Next Aug 9, 2021
@markfields
Copy link
Member Author

Moving this to "Next" milestone. For August, let's just focus on #6754 (Updating our own instrumentation) which will cover many of these cases. Then we can see where the noise is and what area to look at next. Also, if specific cases emerge from live site issues from the Office Fluid team then we can take those on a case-by-case basis.

@anthony-murphy anthony-murphy modified the milestones: Next, October 2021 Oct 4, 2021
@markfields
Copy link
Member Author

Closing this issue, its utility has dried up now that IFluidErrorBase and fluidErrorCode are in and it's just a matter of iterating on cleaning up our errors signal release after release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants