
Improve Dev WF by making PR results actionable #5963

Closed
Tracked by #8828
markwilkie opened this issue Aug 14, 2020 · 10 comments
Assignees: markwilkie
Labels: area-eng-services · dev-workflow (Provides some benefit to dev workflow) · Epic

@markwilkie
Member

markwilkie commented Aug 14, 2020

Motivation and Business Impact

The north star of this epic is to improve the developer workflow by focusing on making the PR results accurate and actionable. 'Red' should mean that the dev can (and should) fix something, and 'Green' should mean that the change is good.

Today, it can be tough to figure out what the actual root problem is, compounded by periodic infrastructure outages and/or transient issues that are outside of the dev's control. When "bad actor" tests are added to the mix, it becomes clear why we as devs are frustrated with seemingly "never" getting a green PR.

Jared Parsons wrote a great doc that much of this thinking came from: Resiliency in our infrastructure.docx

To follow the conversations, check out our V-Team's Teams channel, which is where you should be able to find most of the context.

Business Objectives

  • The dev knows what action to take regarding the failures in their PR.
  • PR results are resilient to transient issues
  • Clear visibility into which failures need attention at a more macro level (allows leads to reason about where investments need to be made)
  • Known Issues/outages do not result in unactionable PR results
  • Improve MSPoll values around engineering, specifically engr107 and engr108 (survey values no longer available)

Functional Deliverables

  • Configurable retries for tests to allow passing PRs even with flaky tests
  • Configurable retries for builds to allow passing PRs even with intermittent infrastructure issues (a hypothetical sketch of such a retry configuration follows this list)
  • Failing PRs indicate what is failing to the user
  • Every build includes a best effort highlighting of most relevant failures in a single place, with deep links
  • Customers have a way to query test results to determine which tests are behaving abnormally (e.g. failing more often than usual)
  • Ongoing outages or known issues that affect the PR are made known to the dev (e.g. including a link to the outage/known issue)
  • Stretch goal: When a persistent outage is resolved, PRs are updated
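
Both retry deliverables assume a repo-owned configuration file that tells the infrastructure which failures are safe to retry (see the "Retry Tests" and "Retry Build" milestones below). As a purely illustrative sketch, the property names, rule shapes, and values here are assumptions for discussion, not a shipped schema, such a JSON file might look like this:

```json
{
  "$comment": "Hypothetical shape for discussion - not the shipped schema",
  "version": 1,
  "localRerunCount": 2,
  "retryOnRules": [
    { "testName": { "contains": "Networking.Http" } },
    { "failureMessage": { "contains": "Connection reset by peer" } }
  ],
  "failOnRules": [],
  "quarantineRules": [
    { "testAssembly": { "wildcard": "*Flaky.Tests.dll" } }
  ]
}
```

The intent is that a retry matching one of these rules keeps the PR green while still being recorded as a retry, so reporting can surface how often the rules fire.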

Deliverables for Engineering

  • Ensure monitoring and alerting coverage on all services created for this epic and address any outstanding alerts from monitoring before closing the epic.
  • Comprehensive documentation (including architectural diagrams where necessary) for engineering hand-off.

Metrics for Success

Where applicable, metrics should be sliceable by repo.

User Feedback

  • Customer feedback from the sentiment tracker is tracked via GitHub issues, and the customer agrees with the next steps.
  • Stretch Goal: NPS-style question on Customer Survey (best effort to get feedback from customers on this) and/or Sentiment Feedback widget to gauge customer satisfaction with the feature. A goal of 7 or higher would be favorable.
    • Results from the March 2023 .NET Engineering Services Satisfaction Survey
      • Build Analysis (Unique build/test failures): 8
      • Automatic Build Retries: 9.25
      • Automatic Test Retries: 8
      • Known Issues: 8.33

Usage

  • Less than 10% of Pull Requests are "merged on red"
    • Link to data, see "Merge on Red Trend"
    • Between Jan 2023 and March 2023: max of 28%, min of 16%
  • Build Result Analysis should be viewed at least 60% of the time on failing Pull Requests
    • Link to data
    • Between Nov 2022 and March 2023: max of 62%, min of 27%

One-Pagers:


Notes

Networking notes on keeping CI failures low


Milestones

  • Infra
    • Set up test repos and AzDO instances that can plug into our webhooks
    • Finish POC, make sure we have something working
    • Determine success metrics.
    • CLI tools for testing/running things locally
  • Retry Tests
    • Setup: Migrate Helix scripts out of Arcade and into Helix. Deliverable: customer functionality shouldn't change. (This work turned out not to be necessary.)
    • MVP: Re-run a work item if an error configured in the customer's configuration file (JSON) is detected. DataMigrationProcessor will need to understand that a retry occurred (reporting will rely on this).
    • V2: none at this moment
  • Retry Build
    • Setup: Need to verify that the existing infra can support this functionality.
    • MVP: Re-run the build if an error configured in the customer's configuration file (JSON) is detected. Reporting will also need to know about this, and we need to let the customer know that we are doing it.
    • V2: none at this moment
  • Failure Guessing
    • MVP: Wall-of-text of what AzDO spits out at us with links to the logs/console log page; maybe a link to the mock-up?
    • V2: Implement mock-ups to be real data that is published to the Build Result Analysis Summary checks page.
    • V3: Iterate based on customer feedback (direct feedback; sentiment tracking data)
    • V4??: "Notifications"?? Comments on PR? Teams notifications?
  • Outages/Known Issues
    • November 2021 EOM: Include widespread issues in build analysis tab
    • January 2022 EOM: Known issues matching PoC
    • February 2022: Build analysis identifies build failures affected by known issues
    • March 2022: Reporting shows impact of a known issue
    • March 2022: Contributors can submit new issues using Build Analysis
  • Quarantine Tests
    • MVP: To be determined
  • Test Reporting
    • October 2021 EOM: Team approval on PR to turn POC into a Service Fabric service.
    • November 2021 EOM: Test Reporting service stood up in staging.
      • Service in SF (with multiple instances) to ingest the shunted passing data into the AzureDevOpsTestsSummary table in Kusto (should run hourly)
      • Service to do calculations on the aggregated data in AzureDevOpsTestsSummary and AzureDevOpsTests tables (should run daily)
      • Create new service account user for AzDO for Dev WF
    • December 2021 EOM: Test Reporting service stood up in production.
      • Must match set up in Staging
      • Ensure any new users or secrets are usable in Production
    • January 2022 EOM: Queries available for customers to use
      • Chi-squared service calculating reliable data (a rough sketch of this kind of check follows the milestones list)
      • Queries in Kusto for reports available for customers to use
    • June 2022 EOM: Test Results Clean-up
      • Fix Chi-squared service (and remove no truncation band-aid)
      • Decommission TestResults table
  • UX Improvements and customer-facing work (mid-July 2022)
    • Address customer feedback
    • Documentation for customers
    • Evangelism of product to .NET org
    • Create GitHub app/bot specific for Build Analysis
    • UX user study with product devs
  • Reporting for Repo Owners (replaced by the Known Issues epic)
    • MVP: Be able to report on flaky tests; quarantined tests; any test that didn't pass for any reason. Likely a dashboard in Power BI.
    • V2: Make the results more actionable for the repo owners. (e.g. click a button and it opens an issue to investigate X test).
    • V3: none at this moment
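
The "Chi-squared service calculating reliable data" milestone above is about flagging tests whose recent pass/fail behavior deviates from their historical baseline (the "behaving abnormally" deliverable). Below is a minimal sketch of that kind of check, assuming a 2x2 contingency table of recent vs. historical runs and pass vs. fail counts; the function name, thresholds, and example numbers are hypothetical, and this is not the actual service implementation:

```python
from scipy.stats import chi2_contingency

def is_behaving_abnormally(recent_pass: int, recent_fail: int,
                           hist_pass: int, hist_fail: int,
                           alpha: float = 0.01) -> bool:
    """Flag a test whose recent pass/fail ratio differs significantly
    from its historical ratio (e.g. it is failing more often than usual).

    Contingency table:
                  pass   fail
        recent     a      b
        history    c      d
    """
    table = [[recent_pass, recent_fail],
             [hist_pass, hist_fail]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    # A small p-value means the recent window behaves differently from
    # the historical baseline, so the test is worth a closer look.
    return p_value < alpha

# Hypothetical example: 3 failures in the last 20 runs vs. 10 in the last 6000.
print(is_behaving_abnormally(recent_pass=17, recent_fail=3,
                             hist_pass=5990, hist_fail=10))
```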

Achievements

Milestone 1

  • Build Result Analysis Summary GitHub check available on Pull Requests (currently available in Arcade Services)

(screenshot: Build Result Analysis Summary check on a pull request)

  • Immediately viewable reports on tests and build failures, with deep links to Azure DevOps console logs and test results.
  • Sentiment tracking for feedback from users. Feedback goes directly to dnceng, and we'll review it and use it to help drive design and prioritization.

(screenshot: sentiment tracking feedback widget)

Recently Triaged Issues

All issues in this section should be triaged by the v-team into one of their business objectives or features.

@markwilkie markwilkie self-assigned this Aug 14, 2020
@markwilkie markwilkie added the dev-workflow (Provides some benefit to dev workflow) label Aug 14, 2020
@tkapin
Member

tkapin commented Aug 28, 2020

@rokonec - please work with @ChadNedzlek to refine the Phase 1 business priorities into more detailed requirements so we can come up with a design proposal.

@ChadNedzlek
Member

Changed the title to give "build" its own epic. Tests and builds are very different beasts when it comes to tackling them, because the quantity and quality of data are vastly different.

@ChadNedzlek ChadNedzlek changed the title from "Improve Dev WF by increasing build and testing resiliency" to "Improve Dev WF by increasing testing resiliency" on Oct 9, 2020
@markwilkie
Member Author

After our v-team discussion yesterday, where it proved difficult to make forward progress, I'll lend a hand to see if the discussion can be "jump started". What I'll present should be thought of as "starter fluid", not the actual "fuel" - meaning the right approach is probably still "out there", at least to a degree.

Perhaps this proposed approach can be applied to the data we have to see how it does (or doesn't) bring value.

With that said, here are my "starter" thoughts (a rough code sketch of this decision logic follows the list):

Action:
(1) Retry the test on the same machine
Criteria:

  • Test failed
  • The last test run was successful

Action:
(2) Retry the test on a different machine
Criteria:

  • Test failed
  • The last test run failed when run on the same machine (1)
  • The test before last was successful

Action:
(3) Fail the test
Criteria:

  • Test failed
  • One of the following is true:
    • Actions (1) and (2) were failures
    • The test has not been marked as chronically failing or as flaky (as those are dealt with separately)
    • The last (n) tests were failures
    • Specific test annotations (e.g. threshold, etc.) result in marking the test as a failure

Action:
(4) Mark the test as chronically failing, but not necessarily flaky
Criteria:

  • Test failed
  • The test has failed for the last (n) runs

Action:
(5) Mark the test as flaky
Criteria:

  • Test failed
  • The pattern shows the last (n) runs had (y) failures (demonstrating a pattern of pass/fail)

Action:
(6) Skip the test (don't run it at all)
Criteria:

  • Tests are marked as chronically failing (4) or as flaky (5)
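
Read literally, the rules above amount to a small decision function over a test's recent history. Here is a minimal sketch of one possible reading, assuming the current run has just failed; the function name, parameters, and the (n)/(y) thresholds are hypothetical and only meant to make the discussion concrete, not to propose an implementation:

```python
from enum import Enum

class Action(Enum):
    RETRY_SAME_MACHINE = 1       # (1)
    RETRY_DIFFERENT_MACHINE = 2  # (2)
    FAIL = 3                     # (3)
    MARK_CHRONIC = 4             # (4)
    MARK_FLAKY = 5               # (5)
    SKIP = 6                     # (6)

def next_action(history: list[bool], same_machine_retry_failed: bool,
                is_chronic: bool, is_flaky: bool,
                n: int = 5, y: int = 2) -> Action:
    """Pick the next action for a test that has just failed.

    `history` holds prior results (True = pass), newest last.
    `n` and `y` map to the (n)/(y) placeholders in the criteria above.
    """
    # (6) Skip: the test is already marked chronically failing or flaky,
    #     so it shouldn't be run at all next time.
    if is_chronic or is_flaky:
        return Action.SKIP
    # (4) Chronically failing: the last n runs were all failures.
    if len(history) >= n and not any(history[-n:]):
        return Action.MARK_CHRONIC
    # (5) Flaky: at least y failures scattered across the last n runs.
    if len(history) >= n and history[-n:].count(False) >= y:
        return Action.MARK_FLAKY
    # (1) First failure right after a passing run: retry on the same machine.
    if not same_machine_retry_failed and history and history[-1]:
        return Action.RETRY_SAME_MACHINE
    # (2) The same-machine retry also failed, but the run before these
    #     in-build failures passed: retry on a different machine.
    if same_machine_retry_failed and history and history[-1]:
        return Action.RETRY_DIFFERENT_MACHINE
    # (3) Otherwise, fail the test.
    return Action.FAIL
```

Nothing here pins down the thresholds; it just makes it easier to reason about the ordering and interaction of the rules.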

@ChadNedzlek
Member

There are some interesting investigations we can take that can help distill this starter fluid. :-). Like what does it look like to run a test on the same machine... a different machine... how can we report that in a useful way to Azure Pipelines/something else so that we have the data we need to "do the right thing".

I have a feeling that maybe for non-PR builds, we run enough of them that maybe just always looking at the last N builds might be statistically good enough without having to retry a test within a single build (sort of treating the N+1 as, more or less, an implicit retry of the N build). There were 6000 executions of every test in the last two weeks for runtime, for example... that's a ton of data that a 6001st test probably won't add much to. :-) That might save resources/complication, and help with reporting, since Azure Pipelines has some pretty good reporting around the analytics for a branch. When it gets down to the wire of shipping, maybe that means we need to investigate 2-3 random failures for the build we want to ship... but we'd probably want to do that anyway, even if we deemed them "flaky and irrelevant".

@markwilkie
Member Author

I like the thinking Chad. :) Let's chat again and see what the right next steps are.

Cheers

@sunandabalu
Member

(Quoting @ChadNedzlek's comment above about relying on the last N non-PR builds rather than retrying within a single build.)

Makes sense for non-PR builds to compare against previous builds in the same branch. For PR builds, the mechanism to identify whether a test is chronically failing or flaky would require some sort of retry and bubbling up that info for reporting. We need to ensure the reporting structure can accommodate both, or have a way to demarcate PR vs. non-PR if that is more convenient.

@sunandabalu
Member

Action:
(6) Skip test (don't run it at all)
Criteria:

  • Tests are marked as chronically failing (4) or as flaky (5)

Tests that are consistently failing need to be looked at, and either turned off or marked as flaky so we can skip them. Just blindly skipping a chronically failing test would mean sweeping it under the rug :)

@ChadNedzlek
Member

I'm not sure where the epic about getting everyone on a shared testing infrastructure went... but any "retry" logic we write won't help anyone until that epic is complete. Right now, everyone has implemented their own test execution framework, so any work would have to be hand-written into every single repository, which isn't a good use of our time (since it would all get deleted when that other epic started up anyway).

We can make it work in Arcade, assuming that will be the template for other teams' test runs, since it's supposed to be the shared infrastructure place. Or maybe we need to bump that other epic up a bit so that we've got a shared execution place that every repo uses where we can put the retry stuff.

@ChadNedzlek
Member

Found it. It was epic #5132. But that lost the "centralize the testing infrastructure" part of it in the title, and I'm not sure if it's focused on that or if we need another epic? Or to do that here?

@missymessa
Member

WE'RE DONE!! Closing!!
