Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

locksmith: set max semaphore to 3 #543

Merged
merged 1 commit into from
Jul 29, 2024
Merged

Conversation

tormath1
Copy link
Contributor

@tormath1 tormath1 commented Jul 29, 2024

With recent upgrade from Go 1.21 to Go 1.22 we noticed that this test has never run as designed.

Previously this test was rebooting only one machine while it's supposed to coordinate the reboot of 3 machines.

It was an issue with the "for" loop:

Previously, the variables declared by a “for” loop were created
once and updated by each iteration. In Go 1.22, each iteration
of the loop creates new variables, to avoid accidental sharing bugs.

Now we set the maximum of token to 3, this is not recommended in production as it might create a downtime but here it's ok for testing purposes.

Race condition should not occur as there is a 5 minutes delay between the command and the actual reboot.

Testing done

Locally on QEMU.

3 semaphore: http://jenkins.infra.kinvolk.io:8080/job/container/job/test/25061/

With recent upgrade from Go 1.21 to Go 1.22 we noticed that this test has
never run as designed.

Previously this test was rebooting only one machine while it's supposed
to coordinate the reboot of 3 machines.

It was an issue with the "for" loop:

> Previously, the variables declared by a “for” loop were created
> once and updated by each iteration. In Go 1.22, each iteration
> of the loop creates new variables, to avoid accidental sharing bugs.

Now we set the maximum of token to 3, this is not recommended in
production as it might create a downtime but here it's ok for testing
purposes.

Race condition should not occur as there is a 5 minutes delay between
the command and the actual reboot.

Signed-off-by: Mathieu Tortuyaux <[email protected]>
@jepio
Copy link
Member

jepio commented Jul 29, 2024

@tormath1 i think the goal was to get the machines to reboot sequentially, otherwise we're not really testing locksmith.

Is the test failing because of the 5 minute delay? Why is there 5 minute delay for the reboot? If it's because the user is logged in on the console, is it possible to use ignition to disable flatcar.autologin so that the test works better?

@tormath1
Copy link
Contributor Author

@tormath1 i think the goal was to get the machines to reboot sequentially, otherwise we're not really testing locksmith.

Is the test failing because of the 5 minute delay? Why is there 5 minute delay for the reboot? If it's because the user is logged in on the console, is it possible to use ignition to disable flatcar.autologin so that the test works better?

Then if the goal is to reboot sequentially, the test will at least take 25 minutes:

  1. First machine reboot (+ 5min)
  2. The two others wait for token by checking with an exponential duration (max 5minutes)
  3. First machine has rebooted
  4. Best case scenario the second machine can acquire a semaphore directly (+ 5min)
  5. Second machine has rebooted
  6. Best case scenario the third machine can acquire a semaphore directly (+ 5min)
  7. Everything has rebooted: 15 minutes with best case scenario - 25 minutes worth case scenario

The reboot delay is here: https://github.com/flatcar/locksmith/blob/5b2275ec726a7f70902d2da0c782bbad405e3ef5/locksmithctl/daemon.go#L169-L174 (even if no one is connected it waits 5 minutes)

The delay between etcd check: https://github.com/flatcar/locksmith/blob/5b2275ec726a7f70902d2da0c782bbad405e3ef5/locksmithctl/daemon.go#L193-L195

@jepio
Copy link
Member

jepio commented Jul 29, 2024

Hmm, i see your point - would max-semaphore 2 be viable?

@tormath1
Copy link
Contributor Author

Hmm, i see your point - would max-semaphore 2 be viable?

I read too quickly the locksmith code - indeed, if no one is connected it should reboot directly - we can try with 2 semaphore.

@tormath1
Copy link
Contributor Author

tormath1 commented Jul 29, 2024

@jepio
3 semaphore: works fine (in 7 minutes) http://jenkins.infra.kinvolk.io:8080/job/container/job/test/25061/
2 semaphore: fails with 15minutes of timeout (http://jenkins.infra.kinvolk.io:8080/job/container/job/test/25062/console) finally succeeds but flaky

@tormath1 tormath1 force-pushed the tormath1/locksmith-go122 branch from ffeac0b to 8299ca8 Compare July 29, 2024 14:51
@tormath1 tormath1 self-assigned this Jul 29, 2024
@tormath1 tormath1 marked this pull request as ready for review July 29, 2024 14:52
@tormath1 tormath1 requested a review from a team July 29, 2024 14:52
@tormath1 tormath1 merged commit 9bbcbda into flatcar-master Jul 29, 2024
4 checks passed
@tormath1 tormath1 deleted the tormath1/locksmith-go122 branch July 29, 2024 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants