
Improve Dev WF by making PR results actionable #5963

Closed
Tracked by #8828
markwilkie opened this issue Aug 14, 2020 · 10 comments
Assignees: markwilkie
Labels: area-eng-services · dev-workflow (Provides some benefit to dev workflow) · Epic

@markwilkie
Member

markwilkie commented Aug 14, 2020

Motivation and Business Impact

The north star of this epic is to improve the developer workflow by focusing on making the PR results accurate and actionable. 'Red' should mean that the dev can (and should) fix something, and 'Green' should mean that the change is good.

Today, it can be tough to figure out what the actual root problem is, compounded by periodic infrastructure outages and/or transient issues that are outside of the dev's control. When "bad actor" tests are added to the mix, it becomes clear why we as devs are frustrated with seemingly "never" getting a green PR.

Jared Parsons wrote a great doc that much of this thinking came from: Resiliency in our infrastructure.docx

To follow the conversations, check out our V-Team's Teams channel, which is where you should be able to find most of the context.

Business Objectives

  • The dev knows what action to take regarding the failures in their PR.
  • PR results are resilient to transient issues
  • Clear visibility into which failures need attention at a more macro level (allows leads to reason about where investments need to be made)
  • Known Issues/outages do not result in unactionable PR results
  • Improve MSPoll values around engineering, specifically engr107 and engr108 (survey values no longer available)

Functional Deliverables

  • Configurable retries for tests to allow passing PRs even with flaky tests
  • Configurable retries for builds to allow passing PRs even with intermittent infrastructure issues (a hypothetical sketch of such a retry configuration follows this list)
  • Failing PRs indicate what is failing to the user
  • Every build includes a best effort highlighting of most relevant failures in a single place, with deep links
  • Customers have a way to query test results to determine which tests are behaving abnormally (e.g. failing more often than usual)
  • Ongoing outages or known issues that affect the PR are made known to the dev (e.g. including a link to the outage/known issue)
  • Stretch goal: When a persistent outage is resolved, PRs are updated
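
Both retry deliverables assume a repo-owned configuration file that tells the infrastructure which failures are safe to retry (see the "Retry Tests" and "Retry Build" milestones below). As a purely illustrative sketch, the property names, rule shapes, and values here are assumptions for discussion, not a shipped schema, such a JSON file might look like this:

```json
{
  "$comment": "Hypothetical shape for discussion - not the shipped schema",
  "version": 1,
  "localRerunCount": 2,
  "retryOnRules": [
    { "testName": { "contains": "Networking.Http" } },
    { "failureMessage": { "contains": "Connection reset by peer" } }
  ],
  "failOnRules": [],
  "quarantineRules": [
    { "testAssembly": { "wildcard": "*Flaky.Tests.dll" } }
  ]
}
```

The intent is that a retry matching one of these rules keeps the PR green while still being recorded as a retry, so reporting can surface how often the rules fire.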

Deliverables for Engineering

  • Ensure monitoring and alerting coverage on all services created for this epic and address any outstanding alerts from monitoring before closing the epic.
  • Comprehensive documentation (including architectural diagrams where necessary) for engineering hand-off.

Metrics for Success

Where applicable, metrics should be sliceable by repo.

User Feedback

  • Customer feedback from the sentiment tracker is tracked via GitHub issues, and the customer agrees with the next steps.
  • Stretch Goal: NPS-style question on Customer Survey (best effort to get feedback from customers on this) and/or Sentiment Feedback widget to gauge customer satisfaction with the feature. A goal of 7 or higher would be favorable.
    • Results from the March 2023 .NET Engineering Services Satisfaction Survey
      • Build Analysis (Unique build/test failures): 8
      • Automatic Build Retries: 9.25
      • Automatic Test Retries: 8
      • Known Issues: 8.33

Usage

  • Less than 10% of Pull Requests are "merged on red"
    • Link to data, see "Merge on Red Trend"
    • Between Jan 2023 and March 2023: max of 28%, min of 16%
  • Build Result Analysis should be viewed at least 60% of the time on failing Pull Requests
    • Link to data
    • Between Nov 2022 and March 2023: max of 62%, min of 27%

One-Pagers:


Notes

Networking notes on keeping CI failures low


Milestones

  • Infra
    • Set up test repos and AzDO instances that can plug into our webhooks
    • Finish POC, make sure we have something working
    • Determine success metrics.
    • CLI tools for testing/running things locally
  • Retry Tests
    • Setup: Migrate Helix scripts out of Arcade and into Helix. Deliverable: customer functionality shouldn't change. (This work turned out not to be necessary.)
    • MVP: Re-run a work item if an error configured in the customer's configuration file (JSON) is detected. DataMigrationProcessor will need to understand that a retry occurred (reporting will rely on this).
    • V2: none at this moment
  • Retry Build
    • Setup: Need to verify that the existing infra can support this functionality.
    • MVP: Re-run the build if an error configured in the customer's configuration file (JSON) is detected. Reporting will also need to know about this, and we need to let the customer know that we are doing it.
    • V2: none at this moment
  • Failure Guessing
    • MVP: Wall-of-text of what AzDO spits out at us with links to the logs/console log page; maybe a link to the mock-up?
    • V2: Implement mock-ups to be real data that is published to the Build Result Analysis Summary checks page.
    • V3: Iterate based on customer feedback (direct feedback; sentiment tracking data)
    • V4??: "Notifications"?? Comments on PR? Teams notifications?
  • Outages/Known Issues
    • November 2021 EOM: Include widespread issues in build analysis tab
    • January 2022 EOM: Known issues matching PoC
    • February 2022: Build analysis identifies build failures affected by known issues
    • March 2022: Reporting shows impact of a known issue
    • March 2022: Contributors can submit new issues using Build Analysis
  • Quarantine Tests
    • MVP: To be determined
  • Test Reporting
    • October 2021 EOM: Team approval on PR to turn POC into a Service Fabric service.
    • November 2021 EOM: Test Reporting service stood up in staging.
      • Service in SF (with multiple instances) to ingest the shunted passing data into the AzureDevOpsTestsSummary table in Kusto (should run hourly)
      • Service to do calculations on the aggregated data in AzureDevOpsTestsSummary and AzureDevOpsTests tables (should run daily)
      • Create new service account user for AzDO for Dev WF
    • December 2021 EOM: Test Reporting service stood up in production.
      • Must match set up in Staging
      • Ensure any new users or secrets are usable in Production
    • January 2022 EOM: Queries available for customers to use
      • Chi-squared service calculating reliable data (a rough sketch of this kind of check follows the milestones list)
      • Queries in Kusto for reports available for customers to use
    • June 2022 EOM: Test Results Clean-up
      • Fix Chi-squared service (and remove no truncation band-aid)
      • Decommission TestResults table
  • UX Improvements and customer-facing work (mid-July 2022)
    • Address customer feedback
    • Documentation for customers
    • Evangelism of product to .NET org
    • Create GitHub app/bot specific for Build Analysis
    • UX user study with product devs
  • Reporting for Repo Owners (replaced by the Known Issues epic)
    • MVP: Be able to report on flaky tests; quarantined tests; any test that didn't pass for any reason. Likely a dashboard in Power BI.
    • V2: Make the results more actionable for the repo owners. (e.g. click a button and it opens an issue to investigate X test).
    • V3: none at this moment
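
The "Chi-squared service calculating reliable data" milestone above is about flagging tests whose recent pass/fail behavior deviates from their historical baseline (the "behaving abnormally" deliverable). Below is a minimal sketch of that kind of check, assuming a 2x2 contingency table of recent vs. historical runs and pass vs. fail counts; the function name, thresholds, and example numbers are hypothetical, and this is not the actual service implementation:

```python
from scipy.stats import chi2_contingency

def is_behaving_abnormally(recent_pass: int, recent_fail: int,
                           hist_pass: int, hist_fail: int,
                           alpha: float = 0.01) -> bool:
    """Flag a test whose recent pass/fail ratio differs significantly
    from its historical ratio (e.g. it is failing more often than usual).

    Contingency table:
                  pass   fail
        recent     a      b
        history    c      d
    """
    table = [[recent_pass, recent_fail],
             [hist_pass, hist_fail]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    # A small p-value means the recent window behaves differently from
    # the historical baseline, so the test is worth a closer look.
    return p_value < alpha

# Hypothetical example: 3 failures in the last 20 runs vs. 10 in the last 6000.
print(is_behaving_abnormally(recent_pass=17, recent_fail=3,
                             hist_pass=5990, hist_fail=10))
```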

Achievements

Milestone 1

  • Build Result Analysis Summary GitHub check available on Pull Requests (currently available in Arcade Services)

(screenshot: Build Result Analysis Summary check on a pull request)

  • Immediately viewable reports on tests and build failures, with deep links to Azure DevOps console logs and test results.
  • Sentiment tracking for feedback from users. Feedback goes directly to dnceng, and we'll review it and use it to help drive design and prioritization.

(screenshot: sentiment tracking feedback widget)

Recently Triaged Issues

All issues in this section should be triaged by the v-team into one of their business objectives or features.

@markwilkie markwilkie self-assigned this Aug 14, 2020
@markwilkie markwilkie added the dev-workflow (Provides some benefit to dev workflow) label Aug 14, 2020
@tkapin
Member

tkapin commented Aug 28, 2020

@rokonec - please work with @ChadNedzlek to refine the Phase 1 business priorities into more detailed requirements so we can come up with a design proposal.

@ChadNedzlek
Member

Changed the title to give "build" its own epic. Tests and builds are very different beasts when it comes to tackling them, because the quantity and quality of data are vastly different.

@ChadNedzlek ChadNedzlek changed the title from "Improve Dev WF by increasing build and testing resiliency" to "Improve Dev WF by increasing testing resiliency" on Oct 9, 2020
@markwilkie
Member Author

After our v-team discussion yesterday, where it proved difficult to make forward progress, I'll lend a hand to see if the discussion can be "jump started". What I'll present should be thought of as "starter fluid", not the actual "fuel" - meaning the right approach is probably still "out there", at least to a degree.

Perhaps this proposed approach can be applied to the data we have to see how it does (or doesn't) bring value.

With that said, here are my "starter" thoughts (a rough code sketch of this decision logic follows the list):

Action:
(1) Retry the test on the same machine
Criteria:

  • Test failed
  • The last test run was successful

Action:
(2) Retry the test on a different machine
Criteria:

  • Test failed
  • The last test run failed when run on the same machine (1)
  • The test before last was successful

Action:
(3) Fail the test
Criteria:

  • Test failed
  • One of the following is true:
    • Actions (1) and (2) were failures
    • The test has not been marked as chronically failing or as flaky (as those are dealt with separately)
    • The last (n) tests were failures
    • Specific test annotations (e.g. threshold, etc.) result in marking the test as a failure

Action:
(4) Mark the test as chronically failing, but not necessarily flaky
Criteria:

  • Test failed
  • The test has failed for the last (n) runs

Action:
(5) Mark the test as flaky
Criteria:

  • Test failed
  • The pattern shows the last (n) runs had (y) failures (demonstrating a pattern of pass/fail)

Action:
(6) Skip the test (don't run it at all)
Criteria:

  • Tests are marked as chronically failing (4) or as flaky (5)
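
Read literally, the rules above amount to a small decision function over a test's recent history. Here is a minimal sketch of one possible reading, assuming the current run has just failed; the function name, parameters, and the (n)/(y) thresholds are hypothetical and only meant to make the discussion concrete, not to propose an implementation:

```python
from enum import Enum

class Action(Enum):
    RETRY_SAME_MACHINE = 1       # (1)
    RETRY_DIFFERENT_MACHINE = 2  # (2)
    FAIL = 3                     # (3)
    MARK_CHRONIC = 4             # (4)
    MARK_FLAKY = 5               # (5)
    SKIP = 6                     # (6)

def next_action(history: list[bool], same_machine_retry_failed: bool,
                is_chronic: bool, is_flaky: bool,
                n: int = 5, y: int = 2) -> Action:
    """Pick the next action for a test that has just failed.

    `history` holds prior results (True = pass), newest last.
    `n` and `y` map to the (n)/(y) placeholders in the criteria above.
    """
    # (6) Skip: the test is already marked chronically failing or flaky,
    #     so it shouldn't be run at all next time.
    if is_chronic or is_flaky:
        return Action.SKIP
    # (4) Chronically failing: the last n runs were all failures.
    if len(history) >= n and not any(history[-n:]):
        return Action.MARK_CHRONIC
    # (5) Flaky: at least y failures scattered across the last n runs.
    if len(history) >= n and history[-n:].count(False) >= y:
        return Action.MARK_FLAKY
    # (1) First failure right after a passing run: retry on the same machine.
    if not same_machine_retry_failed and history and history[-1]:
        return Action.RETRY_SAME_MACHINE
    # (2) The same-machine retry also failed, but the run before these
    #     in-build failures passed: retry on a different machine.
    if same_machine_retry_failed and history and history[-1]:
        return Action.RETRY_DIFFERENT_MACHINE
    # (3) Otherwise, fail the test.
    return Action.FAIL
```

Nothing here pins down the thresholds; it just makes it easier to reason about the ordering and interaction of the rules.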

@ChadNedzlek
Member

There are some interesting investigations we can take that can help distill this starter fluid. :-). Like what does it look like to run a test on the same machine... a different machine... how can we report that in a useful way to Azure Pipelines/something else so that we have the data we need to "do the right thing".

I have a feeling that maybe for non-PR builds, we run enough of them that maybe just always looking at the last N builds might be statistically good enough without having to retry a test within a single build (sort of treating the N+1 as, more or less, an implicit retry of the N build). There were 6000 executions of every test in the last two weeks for runtime, for example... that's a ton of data that a 6001st test probably won't add much to. :-) That might save resources/complication, and help with reporting, since Azure Pipelines has some pretty good reporting around the analytics for a branch. When it gets down to the wire of shipping, maybe that means we need to investigate 2-3 random failures for the build we want to ship... but we'd probably want to do that anyway, even if we deemed them "flaky and irrelevant".

@markwilkie
Member Author

I like the thinking Chad. :) Let's chat again and see what the right next steps are.

Cheers

@sunandabalu
Member

(Quoting @ChadNedzlek's comment above about relying on the last N non-PR builds rather than retrying within a single build.)

Makes sense for non-PR builds to compare against previous builds in the same branch. For PR builds, the mechanism to identify whether a test is chronically failing or flaky would require some sort of retry and bubbling up that info for reporting. We need to ensure the reporting structure can accommodate both, or have a way to demarcate PR vs. non-PR if that is more convenient.

@sunandabalu
Member

Action:
(6) Skip test (don't run it at all)
Criteria:

  • Tests are marked as chronically failing (4) or as flaky (5)

Tests that are consistently failing need to be looked at, and either turned off or marked as flaky so we can skip them. Just blindly skipping a chronically failing test would mean sweeping it under the rug :)

@ChadNedzlek
Member

I'm not sure where the epic about getting everyone on a shared testing infrastructure went... but any "retry" logic we write won't help anyone until that epic is complete. Right now, everyone has implemented their own test execution framework, so any work would have to be hand-written into every single repository, which isn't a good use of our time (since it would all get deleted when that other epic started up anyway).

We can make it work in Arcade, assuming that will be the template for other teams' test runs, since it's supposed to be the shared infrastructure place. Or maybe we need to bump that other epic up a bit so that we've got a shared execution place that every repo uses where we can put the retry stuff.

@ChadNedzlek
Member

Found it. It was epic #5132. But that lost the "centralize the testing infrastructure" part of it in the title, and I'm not sure if it's focused on that or if we need another epic? Or to do that here?

@missymessa
Member

WE'RE DONE!! Closing!!
