Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestReconcileMachinePhases is flaky #8477

Closed
killianmuldoon opened this issue Apr 5, 2023 · 16 comments · Fixed by #8484 or #8576
Closed

TestReconcileMachinePhases is flaky #8477

killianmuldoon opened this issue Apr 5, 2023 · 16 comments · Fixed by #8484 or #8576
Assignees
Labels
area/testing Issues or PRs related to testing kind/flake Categorizes issue or PR as related to a flaky test. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@killianmuldoon
Copy link
Contributor

Which jobs are flaking?

TestReconcileMachinePhases/Should_set_Provisioned_when_there_is_a_ProviderID_and_there_is_no_Node

Which tests are flaking?

Part of the unit test run. Flaking on main and 1.4.

Since when has it been flaking?

#8044 moved these tests to be based on envtest instead of the fake client. Since then there has been a number of flakes.

Testgrid link

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-test-release-1-4/1643034564504850432

Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
/area testing

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. area/testing Issues or PRs related to testing needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 5, 2023
@killianmuldoon killianmuldoon changed the title TestReconcileMachinePhases/Should_set_Provisioned_when_there_is_a_ProviderID_and_there_is_no_Node TestReconcileMachinePhases is flaky Apr 5, 2023
@killianmuldoon
Copy link
Contributor Author

/triage accepted

@chrischdi - did you want to take this one on?

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 5, 2023
@chrischdi
Copy link
Member

chrischdi commented Apr 5, 2023

/assign

I'm digging into it.

I think sometimes it could be the case that if:

  • not everything got yet created by the test
  • and the controller already reconciles before that
  • that it will never reach the target tested state

It's for me locally reproducible via following patch

diff --git a/internal/controllers/machine/machine_controller_phases_test.go b/internal/controllers/machine/machine_controller_phases_test.go
index fbfb18bb9..bbe2c33e4 100644
--- a/internal/controllers/machine/machine_controller_phases_test.go
+++ b/internal/controllers/machine/machine_controller_phases_test.go
@@ -507,7 +507,7 @@ func TestReconcileMachinePhases(t *testing.T) {
                g.Expect(env.Create(ctx, bootstrapConfig)).To(Succeed())
                g.Expect(env.Create(ctx, infraMachine)).To(Succeed())
                g.Expect(env.Create(ctx, machine)).To(Succeed())
-
+               time.Sleep(time.Second)
                modifiedMachine := machine.DeepCopy()
                // Set NodeRef to nil.
                machine.Status.NodeRef = nil

Which then results in reconciliation before the statuses got set.

Different ideas to remediate / fix:

  • set the paused annotation at the Machine and remove it at the very end after all is set.
  • debug into the controller why in the "previously reconciled" state, it does not get to the expected result
  • change the creation / status patch order to create + patch the Machine object as last (after the dependent objects got created + status patched)

@chrischdi
Copy link
Member

chrischdi commented Apr 5, 2023

To add:

With this patch, the test does not even finish after 30s (current timeout is 10s), so it may also be a bigger bug.

@killianmuldoon
Copy link
Contributor Author

I don't like the idea of adding pause - it seems to be papering over cracks here 😄. I tried

change the creation / status patch order to create + patch the Machine object as last (after the dependent objects got created + status patched)

And I'm still getting a failure - does it work for you?

@chrischdi
Copy link
Member

I think so yes, but not yet at 100%.

@chrischdi
Copy link
Member

(but only with set the paused annotation at the Machine and remove it at the very end after all is set. as change AFAIK)

@killianmuldoon
Copy link
Contributor Author

It looks like the logic for setting LastUpdated is flaky in the controller.

@killianmuldoon
Copy link
Contributor Author

@chrischdi Just took another look - I think the issue here is in the test - there shouldn't be a patch for the machine at all with the lastUpdated field. i.e.

		modifiedMachine.Status.LastUpdated = &lastUpdated
		g.Expect(env.Status().Patch(ctx, modifiedMachine, client.MergeFrom(machine))).To(Succeed())

should be removed IMO.

@chrischdi
Copy link
Member

chrischdi commented Apr 5, 2023

Yes maybe, I think there are two cases for this field

  • nil -> setting it
  • previous value -> newer value

I don't know which of both (or both) are possible in the runtime scenario 🤔

Edit:

  • the first one is maybe a case for clusterctl move
  • the second one may be the case for provisioning to provisioned or similar.

@killianmuldoon
Copy link
Contributor Author

Does clusterctl move set this field? I can't see where that happens

@chrischdi
Copy link
Member

But to recap, the issue of the test could be the following scenario:

  1. Machine object gets created

g.Expect(env.Create(ctx, machine)).To(Succeed())

  1. Meanwhile the object gets reconciled

  2. We patch the .status.lastUpdated field after it was reconciled

lastUpdated := metav1.NewTime(time.Now().Add(-10 * time.Second))
machine.Status.LastUpdated = &lastUpdated
g.Expect(env.Status().Patch(ctx, modifiedMachine, client.MergeFrom(machine))).To(Succeed())

@killianmuldoon
Copy link
Contributor Author

We patch the .status.lastUpdated field after it was reconciled

And I don't think this patch is necessary or really useful - we just need to check that the field is set and that the lastupdated field in the machine is after the timestamp before the reconcile, no need to have it be updated on the machine IMO.

@chrischdi
Copy link
Member

I agree for this test case.

But in general for these envtest tests, if we rely on some tests that the status is already in some state:

  • we could set the Paused annotation at the object, and remove it once the test is set up and we wait for reconciliation

@chrischdi
Copy link
Member

chrischdi commented Apr 11, 2023

/reopen

I checked the dashboard again:

There's one more occurence happening at the test (since we merged the fix)

Should_set_`Running`_when_bootstrap_and_infra_is_ready

/ https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_phases_test.go#L243 , which we should fix too.

@k8s-ci-robot k8s-ci-robot reopened this Apr 11, 2023
@k8s-ci-robot
Copy link
Contributor

@chrischdi: Reopened this issue.

In response to this:

/reopen

I checked the dashboard again:

There's one more occurence happening at the test Should_set_Running_when_bootstrap_and_infra_is_ready / https://github.com/kubernetes-sigs/cluster-api/blob/main/internal/controllers/machine/machine_controller_phases_test.go#L243 , which we should fix too.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@chrischdi
Copy link
Member

Triage link: 655deea70b0ff4b80d8b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/testing Issues or PRs related to testing kind/flake Categorizes issue or PR as related to a flaky test. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
3 participants