Update flaky test script to print more details and detect flaky exceptions and timeouts #14731

Merged
merged 8 commits into from
Jan 27, 2021

Conversation

rmiller14
Contributor

Commit Message:
Update the flaky test script to print more details and detect flaky, unexpected test errors like exceptions and timeouts, with the goal of making the notifications more actionable.

Signed-off-by: Randy Miller [email protected]

Additional Description:
The flaky test script, ci/flaky_test/process_xml.py, is executed on every CI run and posts a notification to the "test-flaky" Slack channel if there are any flaky test failures. Those notifications are less useful than they could be, for several reasons:

  1. Direct links to the CI run, the related commit, and the related PR are not included.
  2. There's no indication of which stage or job experienced the flake.
  3. The notifications are not uniformly formatted, so they can be a bit hard to read.
  4. Some notifications do not include any information about the flake(s).

The goal of this PR is to make these flaky test notifications more actionable by addressing the 4 items above. Below is what a notification would look like once this PR is merged. The last 2 flakes are not captured at all by the current script, because they are unexpected test "errors" (e.g., exceptions or timeouts) rather than test "failures" (e.g., a failed test assertion).

Target:         bazel.release
Stage:          Windows release
Pull request:   https://github.com/envoyproxy/envoy/pull/14665
Commit:         https://github.com/envoyproxy/envoy/commit/f1184f2d74d052942f7484beecf98d7cfde137e0
CI results:     https://dev.azure.com/cncf/envoy/_build/results?buildId=63454

Origin:         https://github.com/rmiller14/envoy
Upstream:       https://github.com/envoyproxy/envoy
Latest ref:     heads/flaky_test_script

Last commit:
        commit f1184f2d74d052942f7484beecf98d7cfde137e0
        Author: Randy Miller <[email protected]>
        Date:   Fri Jan 15 00:58:51 2021 -0800

            Update flaky test script to print more actionable details as well as detect flaky, unexpected test errors like exceptions and timeouts.

            Signed-off-by: Randy Miller <[email protected]>

---------------------------------------------------------------------------------------------------

Test flake details:
- Test suite:   IpVersions/DnsImplTest
- Test case:    LocalLookup/IPv4
- Log path:     C:/_eb/_bazel_LocalAdmin/sonr4fdz/external/envoy/bazel-testlogs/test/common/network/dns_impl_test/test_attempts/attempt_1.log
- Details:
        test/common/network/dns_impl_test.cc:609
        Expected equality of these values:
          nullptr
            Which is: NULL
          resolveWithExpectations("localhost", DnsLookupFamily::V4Only, DnsResolver::ResolutionStatus::Success, {"127.0.0.1"}, {"::1"}, absl::nullopt)
            Which is: 0000017500F82190
        Stack trace:
          00007FF69688B586: (unknown)
          00007FF6968A4468: (unknown)
          00007FF6968A462D: (unknown)
          00007FF6968A508D: (unknown)
        ... Google Test internal frames ...

---------------------------------------------------------------------------------------------------

Test flake details:
- Test suite:   ThriftConnManagerIntegrationTest
- Test case:    IDLException/HeaderCompact
- Log path:     C:/_eb/_bazel_LocalAdmin/sonr4fdz/external/envoy/bazel-testlogs/test/extensions/filters/network/thrift_proxy/integration_test/shard_1_of_4/test_attempts/attempt_1.log
- Error:        Exited with error code 3 (No such process)
- Relevant snippet:
        Traceback (most recent call last):
          File "\\?\C:\Windows\TEMP\Bazel.runfiles_3v_lc0rq\runfiles\envoy\test\extensions\filters\network\thrift_proxy\driver\server.py", line 232, in <module>
            main(cfg)
          File "\\?\C:\Windows\TEMP\Bazel.runfiles_3v_lc0rq\runfiles\envoy\test\extensions\filters\network\thrift_proxy\driver\server.py", line 175, in main
            server.serve()
          File "\\?\C:\Windows\TEMP\Bazel.runfiles_3v_lc0rq\runfiles\thrift_pip3_pypi__thrift_0_13_0\thrift\server\TServer.py", line 121, in serve
            self.serverTransport.listen()
          File "\\?\C:\Windows\TEMP\Bazel.runfiles_3v_lc0rq\runfiles\thrift_pip3_pypi__thrift_0_13_0\thrift\transport\TSocket.py", line 208, in listen
            self.handle.bind(res[4])
        OSError: [WinError 10013] An attempt was made to access a socket in a way forbidden by its access permissions
        Could not connect to any of [('127.0.0.1', 50670)]
        Unhandled Thrift Exception: Could not connect to any of [('127.0.0.1', 50670)]
        C:/envoy/test/extensions/filters/network/thrift_proxy/driver/generate_fixture.sh: line 1: kill: (1819) - No such process
        Failed bash -c "PYTHONPATH=$(dirname C:/envoy/test/extensions/filters/network/thrift_proxy/driver/generate_fixture.sh) C:/envoy/test/extensions/filters/network/thrift_proxy/driver/generate_fixture.sh idl-exception header compact -H x-header-1=x-value-1,x-header-2=0.6,x-header-3=150,x-header-4=user_id:10,x-header-5=garbage_asdf -T C:/_eb/execroot/envoy/_tmp/2540819d34883b5a5d1e62d549fbcdeb execute "
        [2021-01-15 11:35:27.958][5624][critical][assert] [test/test_common/environment.cc:414] assert failure: false.

---------------------------------------------------------------------------------------------------

Test flake details:
- Test suite:   TcpProxyIntegrationTest
- Test case:    TestCloseOnHealthFailure/IPv6_OriginalConnPool
- Log path:     C:/_eb/_bazel_LocalAdmin/sonr4fdz/external/envoy/bazel-testlogs/test/integration/tcp_proxy_integration_test/shard_1_of_2/test_attempts/attempt_1.log
- Error:        Exited with error code 142 (Unknown error)
- Note:         This error is likely a timeout (test duration == 300, a well known timeout value).
- Last 1 line(s):
        [ RUN      ] TcpProxyIntegrationTestParams/TcpProxyIntegrationTest.TestCloseOnHealthFailure/IPv6_OriginalConnPool

---------------------------------------------------------------------------------------------------
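The timeout note in the last flake above rests on a simple heuristic: a test whose duration matches a well-known timeout value probably timed out. A minimal sketch of that idea follows. It is not the code from process_xml.py; the function names, the tolerance parameter, and the set of values are assumptions (only 300 appears in the notification; 60, 900, and 3600 are Bazel's other default test timeouts).

```python
# Hypothetical sketch of the timeout heuristic; names and values are assumptions,
# not taken from ci/flaky_test/process_xml.py.
# 300 is the value quoted in the notification; the rest are Bazel's default
# timeouts for the short, long, and eternal test sizes.
WELL_KNOWN_TIMEOUTS = (60, 300, 900, 3600)


def looks_like_timeout(duration_seconds, tolerance=0.0):
    """Return True if the duration matches a well-known timeout value."""
    return any(abs(duration_seconds - t) <= tolerance for t in WELL_KNOWN_TIMEOUTS)


def timeout_note(duration_seconds):
    """Build a note line similar to the one in the notification, or None."""
    if not looks_like_timeout(duration_seconds):
        return None
    return ("This error is likely a timeout (test duration == {}, "
            "a well known timeout value).".format(int(duration_seconds)))


print(timeout_note(300))
print(timeout_note(12.5))
```

An exact match (tolerance of 0) mirrors the "test duration == 300" wording; a small tolerance could be passed if reported durations turn out not to be whole seconds.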

Risk Level: N/A for code/test, low for the flaky test script due to the amount of churn.

Testing: Ran locally many, many times. For a portion of those runs, I treated normal failures as flakes to get better coverage on the parsing helpers. Not sure how to test the changes to bazel.yml though.
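For readers unfamiliar with the "failure" versus "error" distinction in the description, here is a minimal sketch of how the two could be told apart in JUnit-style test XML. This is not the parsing code from the PR: the sample document, the collect_flakes helper, and the assumption that each flake appears as a failure or error child of a testcase element are all illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical JUnit-style input; the real test XML produced in CI may be shaped differently.
SAMPLE_XML = """<testsuites>
  <testsuite name="IpVersions/DnsImplTest" tests="1" failures="1" errors="0">
    <testcase name="LocalLookup/IPv4" time="0.1">
      <failure message="Expected equality of these values"/>
    </testcase>
  </testsuite>
  <testsuite name="TcpProxyIntegrationTest" tests="1" failures="0" errors="1">
    <testcase name="TestCloseOnHealthFailure/IPv6_OriginalConnPool" time="300">
      <error message="exited with error code 142"/>
    </testcase>
  </testsuite>
</testsuites>"""


def collect_flakes(xml_text):
    """Return one dict per test case that has a <failure> or <error> child."""
    root = ET.fromstring(xml_text)
    flakes = []
    for suite in root.iter("testsuite"):
        for case in suite.iter("testcase"):
            for kind in ("failure", "error"):
                node = case.find(kind)
                if node is not None:
                    flakes.append({
                        "suite": suite.get("name"),
                        "case": case.get("name"),
                        "kind": kind,
                        "message": node.get("message", ""),
                    })
    return flakes


for flake in collect_flakes(SAMPLE_XML):
    print("{kind}: {suite} / {case}: {message}".format(**flake))
```

A script that only looks for failure elements would report the first test case and miss the second, which is the gap the description says this PR closes.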

…detect flaky, unexpected test errors like exceptions and timeouts.

Signed-off-by: Randy Miller <[email protected]>
Member

@sunjayBhatia sunjayBhatia left a comment

Big fan of this change, looks like it gives much more relevant content in the flake report 👍🏽

Might drive by and do some more comments on the implementation details later but as an overview looks good

Review thread on .azure-pipelines/bazel.yml (outdated, resolved)
Member

@lizan lizan left a comment

This is really awesome. Thanks!

A bonus point would be to log those with AZP commands: https://github.com/microsoft/azure-pipelines-tasks/blob/master/docs/authoring/commands.md
and set CI status to SucceededWithIssues. That can be in a follow up PR.
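As a rough illustration of this suggestion, Azure Pipelines logging commands are specially formatted lines written to stdout. The sketch below formats a warning per flake and a final SucceededWithIssues result. The helper names and the sample message are hypothetical and not part of this PR; only the `##vso[task.logissue ...]` and `##vso[task.complete ...]` command syntax comes from the linked documentation.

```python
# Hypothetical helpers; not part of ci/flaky_test/process_xml.py.
def azp_warning(message):
    """Format a message as an Azure Pipelines warning logging command."""
    # Logging commands are single-line, so collapse any whitespace/newlines.
    one_line = " ".join(message.split())
    return "##vso[task.logissue type=warning]{}".format(one_line)


def azp_succeeded_with_issues(message=""):
    """Format the command that marks the current task as SucceededWithIssues."""
    return "##vso[task.complete result=SucceededWithIssues;]{}".format(message)


# Example: emit one warning for a flake, then downgrade the task result.
print(azp_warning("Flaky test: TcpProxyIntegrationTest\nlikely a timeout"))
print(azp_succeeded_with_issues("Flaky tests detected"))
```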

Review thread on .azure-pipelines/bazel.yml (outdated, resolved)
@lizan lizan added the waiting label Jan 19, 2021
@rmiller14
Contributor Author

This is really awesome. Thanks!

A bonus point would be to log those with AZP commands: https://github.com/microsoft/azure-pipelines-tasks/blob/master/docs/authoring/commands.md
and set CI status to SucceededWithIssues. That can be in a follow up PR.

That sounds cool. I'll do this in a follow-up PR.

@rmiller14
Contributor Author

There seems to be a bug in shell_utils.sh, so I'll fix it in the next commit.

Review thread on ci/flaky_test/process_xml.py (outdated, resolved)
Review thread on ci/flaky_test/process_xml.py (outdated, resolved)
Signed-off-by: Randy Miller <[email protected]>
Signed-off-by: Randy Miller <[email protected]>
Signed-off-by: Randy Miller <[email protected]>
lizan
lizan previously approved these changes Jan 20, 2021
@rmiller14
Contributor Author

The script isn't compatible with Python 3.6. Updating.

… available (even for Windows)

Signed-off-by: Randy Miller <[email protected]>
@rmiller14
Contributor Author

The script isn't compatible with Python 3.6. Updating.

Fixed and tested.

@rmiller14
Contributor Author

@lizan CI is passing now. There aren't any major changes to the script since you last reviewed.

@rmiller14
Contributor Author

Ping.

@lizan lizan merged commit 268b94b into envoyproxy:main Jan 27, 2021