Fix flaky Affinity Assistant test #6925

QuanZhang-William · 2023-07-12T20:16:27Z

Changes

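Prior to this commit, the TestAffinityAssistant_PerWorkspace integration test validated the lifecycle status of the PipelineRun's Affinity Assistant StatefulSet when the PipelineRun was created and when it completed. However, the StatefulSet is not created (or deleted) immediately after the PipelineRun is created (or completes) because of API latency, which made the test flaky (see [example](https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/6921/pull-tekton-pipeline-integration-tests/1679203026977427456)).

This commit removes the StatefulSet status check from the integration test. This functionality is covered in the [unit test](https://github.com/tektoncd/pipeline/blob/b769b5620300d7fb6d6638083124d03636caa503/pkg/reconciler/pipelinerun/affinity_assistant_test.go#L188).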

/kind flake

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Has Docs if any changes are user facing, including updates to minimum requirements e.g. Kubernetes version bumps
Has Tests included if any functionality added or changed
Follows the commit message standard
Meets the Tekton contributor standards (including functionality, content, code)
Has a kind label. You can add one by adding a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
Release notes block below has been updated with any user facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings). See some examples of good release notes.
Release notes contains the string "action required" if the change requires additional action from users switching to the new release

Release Notes

NONE

QuanZhang-William · 2023-07-12T20:22:18Z

/kind flake

test/affinity_assistant_test.go

tekton-robot · 2023-07-12T20:23:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lbernick

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [lbernick]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Prior to this commit, the `TestAffinityAssistant_PerWorkspace` integration test validates the lifecycle status of the Affinity Assistant `StatefulSet` of the `pipelinerun` when it is created and completed. However, the `StatefulSet` cannot be created (and deleted) immediately after the `pipelinerun` is created (and completed) due to API latency, which makes the test flaky (see [example](https://prow.tekton.dev/view/gs/tekton-prow/pr-logs/pull/tektoncd_pipeline/6921/pull-tekton-pipeline-integration-tests/1679203026977427456)). This commit removes the `StatefulSet` status check in the integration test. This functionality is covered in the [unit test](https://github.com/tektoncd/pipeline/blob/b769b5620300d7fb6d6638083124d03636caa503/pkg/reconciler/pipelinerun/affinity_assistant_test.go#L188).

/kind bug

QuanZhang-William · 2023-07-12T20:32:14Z

/remove-kind bug

afrittoli · 2023-07-12T21:47:13Z

@QuanZhang-William thanks for this PR!
Would you mind filling the PR template?
The changes in the PR should go under the "Changes" section, and the submitter check-list should be ticked as appropriate. Thank you!

QuanZhang-William · 2023-07-13T00:01:45Z

> @QuanZhang-William thanks for this PR! Would you mind filling the PR template? The changes in the PR should go under the "Changes" section, and the submitter check-list should be ticked as appropriate. Thank you!

Ahh, sorry @afrittoli, I sent this PR in a rush because CI could be blocked by this test... I have updated the PR message, PTAL.

afrittoli · 2023-07-13T14:18:12Z

Thanks for updating the PR.

> Changes
>
> Prior to this commit, the TestAffinityAssistant_PerWorkspace integration test validates the lifecycle status of the Affinity Assistant StatefulSet of the PipelineRun when it is created and completed. However, the StatefulSet cannot be created (and deleted) immediately after the PipelineRun is created (and completed) due to API latency, which makes the test flaky (see example).
>
> This commit removes StatefulSet status check in the integration test.

It removes more than that: it also removes the check that the pipeline terminates successfully.
It seems to me that what is left in the test doesn't verify anything associated with the affinity assistant.

> This functionality is covered in the unit test.

The unit test covers the functions createOrUpdateAffinityAssistantsAndPVCs and cleanupAffinityAssistants, but it does not verify that those functions are invoked when a pipeline run is started and deleted, so it does not provide coverage for the removed section of the test.

E2E tests that create resources are always at risk that those resources may take longer to be created and deleted, depending on the status of the cluster and node where the test is running. I think the solution for that is to add a bit of tolerance in the test, in case of a slow node. That does not guarantee that the test will pass 100% of the time, but on a bad or very slow node the tests would most likely fail or time out anyway.

QuanZhang-William · 2023-07-13T14:35:50Z

> Thanks for updating the PR.
>
> > Changes
> >
> > Prior to this commit, the TestAffinityAssistant_PerWorkspace integration test validates the lifecycle status of the Affinity Assistant StatefulSet of the PipelineRun when it is created and completed. However, the StatefulSet cannot be created (and deleted) immediately after the PipelineRun is created (and completed) due to API latency, which makes the test flaky (see example).
> >
> > This commit removes StatefulSet status check in the integration test.
>
> It removes more than that: it also removes the check that the pipeline terminates successfully. It seems to me that what is left in the test doesn't verify anything associated with the affinity assistant.

This E2E test verifies 2 things: 1) pods sharing the same PVC are scheduled to the same node, so it validates the final result of Affinity Assistant; 2) the PVC is in Bound status when the PipelineRun is completed (we missed this check, so the PVCs were changed to terminating status in #6741). Point 2) is the original motivation for adding this E2E test.

These 2 points cannot be covered by UT.
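
As a rough illustration of check 1), the same-node assertion boils down to comparing node names across the pods that share a PVC. The sketch below is not the actual test code: `pod` and `sameNode` are hypothetical, and `pod` is a pared-down stand-in for the fields a real test would read from a Kubernetes Pod (its name and `Spec.NodeName`).

```go
package main

import "fmt"

// pod is a simplified, hypothetical stand-in for a Kubernetes Pod that
// keeps only the fields this check needs.
type pod struct {
	Name     string
	NodeName string
}

// sameNode returns an error unless every pod in the slice is scheduled
// to the same node as the first one.
func sameNode(pods []pod) error {
	if len(pods) == 0 {
		return fmt.Errorf("no pods to check")
	}
	want := pods[0].NodeName
	for _, p := range pods[1:] {
		if p.NodeName != want {
			return fmt.Errorf("pod %q is on node %q, but pod %q is on node %q",
				p.Name, p.NodeName, pods[0].Name, want)
		}
	}
	return nil
}
```

The check only looks at the final scheduling result, so it does not depend on when the StatefulSet was created or deleted.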

> > This functionality is covered in the unit test.
>
> The unit test covers the functions createOrUpdateAffinityAssistantsAndPVCs and cleanupAffinityAssistants, but it does not verify that those functions are invoked when a pipeline run is started and deleted, so it does not provide coverage for the removed section of the test.

I think the end-to-end usage of Affinity Assistant is covered in the pipelinerun test?
https://github.com/tektoncd/pipeline/blob/b769b5620300d7fb6d6638083124d03636caa503/pkg/reconciler/pipelinerun/pipelinerun_test.go#L3968C1-L3968C67

> E2E tests that create resources are always at risk that those resources may take longer to be created and deleted, depending on the status of the cluster and node where the test is running. I think the solution for that is to add a bit of tolerance in the test, in case of a slow node. That does not guarantee that the test will pass 100% of the time, but on a bad or very slow node the tests would most likely fail or time out anyway.

I'm open to adding some tolerance to the test if we are doing it today, but I feel the lifecycle status of the StatefulSet is already covered by UT. This E2E test just adds more coverage on top of it.

afrittoli · 2023-07-14T13:52:48Z

> This E2E test verifies 2 things: 1) pods sharing the same PVC are scheduled to the same node, so it validates the final result of Affinity Assistant; 2) the PVC is in Bound status when the PipelineRun is completed (we missed this check, so the PVCs were changed to terminating status in #6741). Point 2) is the original motivation for adding this E2E test.

Oh, right, I missed the last bit of the test.
The fact that Pods are scheduled on the same node during one run does not necessarily prove that the affinity assistant is working, but over multiple runs it probably does, so it's a good check to have.
The fact that the PVC is Bound when the PipelineRun is completed is also a good check to have, thanks.

> I'm open to adding some tolerance to the test if we are doing it today, but I feel the lifecycle status of the StatefulSet is already covered by UT. This E2E test just adds more coverage on top of it.

I think that the only missing part is the cleanup, which is not covered elsewhere afaik.
The cleanup function itself is tested, but I don't think we test elsewhere that the cleanup function is invoked when the pipelinerun is done. That specific bit could be covered by a unit test like TestReconcileWithAffinityAssistantStatefulSet, but starting from a "Done" pipeline run and running Reconcile.

afrittoli · 2023-07-14T14:55:01Z

> > This E2E test verifies 2 things: 1) pods sharing the same PVC are scheduled to the same node, so it validates the final result of Affinity Assistant; 2) the PVC is in Bound status when the PipelineRun is completed (we missed this check, so the PVCs were changed to terminating status in #6741). Point 2) is the original motivation for adding this E2E test.
>
> Oh, right, I missed the last bit of the test. The fact that Pods are scheduled on the same node during one run does not necessarily prove that the affinity assistant is working, but over multiple runs it probably does, so it's a good check to have. The fact that the PVC is Bound when the PipelineRun is completed is also a good check to have, thanks.
>
> > I'm open to adding some tolerance to the test if we are doing it today, but I feel the lifecycle status of the StatefulSet is already covered by UT. This E2E test just adds more coverage on top of it.
>
> I think that the only missing part is the cleanup, which is not covered elsewhere afaik. The cleanup function itself is tested, but I don't think we test elsewhere that the cleanup function is invoked when the pipelinerun is done. That specific bit could be covered by a unit test like TestReconcileWithAffinityAssistantStatefulSet, but starting from a "Done" pipeline run and running Reconcile.

It'd be ok to add the new unit test in a separate PR.

/lgtm

QuanZhang-William · 2023-07-14T14:57:24Z

Synced with @afrittoli offline. We will merge the PR as is to unblock CI, and I will prioritize adding the discussed UT coverage in a separate PR.

QuanZhang-William · 2023-07-20T19:07:50Z

> > This E2E test verifies 2 things: 1) pods sharing the same PVC are scheduled to the same node, so it validates the final result of Affinity Assistant; 2) the PVC is in Bound status when the PipelineRun is completed (we missed this check, so the PVCs were changed to terminating status in #6741). Point 2) is the original motivation for adding this E2E test.
>
> Oh, right, I missed the last bit of the test. The fact that Pods are scheduled on the same node during one run does not necessarily prove that the affinity assistant is working, but over multiple runs it probably does, so it's a good check to have. The fact that the PVC is Bound when the PipelineRun is completed is also a good check to have, thanks.
>
> > I'm open to adding some tolerance to the test if we are doing it today, but I feel the lifecycle status of the StatefulSet is already covered by UT. This E2E test just adds more coverage on top of it.
>
> I think that the only missing part is the cleanup, which is not covered elsewhere afaik. The cleanup function itself is tested, but I don't think we test elsewhere that the cleanup function is invoked when the pipelinerun is done. That specific bit could be covered by a unit test like TestReconcileWithAffinityAssistantStatefulSet, but starting from a "Done" pipeline run and running Reconcile.

While exploring the tests, I found that we do have a UT, TestReconcileOnCompletedPipelineRun, which tests that the cleanup function is called at pipelinerun completion time!

tekton-robot added the release-note-none, kind/bug, and size/S labels Jul 12, 2023

tekton-robot requested review from dibyom and lbernick July 12, 2023 20:16

tekton-robot added the kind/flake label Jul 12, 2023

lbernick approved these changes Jul 12, 2023

test/affinity_assistant_test.go

tekton-robot added the approved label Jul 12, 2023

QuanZhang-William force-pushed the affinity-assistant-flaky-test branch from c4e3b9f to 8d25a18 July 12, 2023 20:25

tekton-robot added the size/M label and removed the size/S label Jul 12, 2023

tekton-robot removed the kind/bug label Jul 12, 2023

tekton-robot assigned afrittoli Jul 14, 2023

tekton-robot added the lgtm label Jul 14, 2023

tekton-robot merged commit 9c249d6 into tektoncd:main Jul 14, 2023

QuanZhang-William mentioned this pull request Jul 25, 2023

TEP-0135: Coscheduling PipelineRun pods Implementation #6740

Closed
