Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[test] Add e2e downgrade automatic cancellation test #19399

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

henrybear327
Copy link
Contributor

@henrybear327 henrybear327 commented Feb 12, 2025

Verify that the downgrade can be cancelled automatically when the downgrade is completed (using no inflight downgrade job as the indicator)

Please see #19365 (comment)

Reference: #17976

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

@henrybear327 henrybear327 self-assigned this Feb 12, 2025
@henrybear327 henrybear327 changed the title Add e2e downgrade automatic cancellation test [test] Add e2e downgrade automatic cancellation test Feb 12, 2025
@henrybear327 henrybear327 force-pushed the e2e/downgrade_auto_cancel branch 2 times, most recently from 995bf3e to 60e8a40 Compare February 12, 2025 12:10
Copy link

codecov bot commented Feb 12, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.90%. Comparing base (14cf669) to head (24d1fc1).

Additional details and impacted files

see 20 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19399      +/-   ##
==========================================
- Coverage   68.95%   68.90%   -0.06%     
==========================================
  Files         420      420              
  Lines       35753    35753              
==========================================
- Hits        24655    24636      -19     
- Misses       9677     9690      +13     
- Partials     1421     1427       +6     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 14cf669...24d1fc1. Read the comment docs.

@henrybear327 henrybear327 force-pushed the e2e/downgrade_auto_cancel branch from 60e8a40 to 11357be Compare February 12, 2025 13:07
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahrtr
Copy link
Member

ahrtr commented Feb 12, 2025

@henrybear327 can you please rebase this PR ? I just merged #19398

@henrybear327 henrybear327 force-pushed the e2e/downgrade_auto_cancel branch from 11357be to c0f2170 Compare February 12, 2025 15:59
@henrybear327
Copy link
Contributor Author

@henrybear327 can you please rebase this PR ? I just merged #19398

Saw and done as you posted!

time.Sleep(etcdserver.HealthInterval)
}

e2e.DowngradeAutoCancelCheck(t, epc)
Copy link
Member

@ahrtr ahrtr Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be flaky, it depends on when the downgrade job gets triggered. A simple way is just to monitor/wait for the log entry "the cluster has been downgraded"

m.lg.Info("the cluster has been downgraded", zap.String("cluster-version", targetVersion))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's better to

  • monitor/wait for the log entry "the cluster has been downgraded"
  • call downgradeCancellation to verify that it's indeed cancelled

Right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first item is required, the second one is nice to have (optional).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! Will impl. in a bit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, when you see the log message "the cluster has been downgraded", the leader hasn't send out the downgradeCancel request yet. So when you try to send downgradeCancel yourself in e2e.DowngradeAutoCancelCheck, it may succeed. I think it's easy to verify if you add a sleep in between.

But in practice, it's unlikely because there is a 5s (etcdserver.HealthInterval) sleep after the downgrade completion.

So either just fix comment https://github.com/etcd-io/etcd/pull/19399/files#r1957283593 or just remove e2e.DowngradeAutoCancelCheck (preferred)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest add a TODO item to followup once the downgrade query API is implemented. #19439 (comment)

@henrybear327 henrybear327 force-pushed the e2e/downgrade_auto_cancel branch from c0f2170 to 43e04f9 Compare February 12, 2025 16:14
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please resolve the comments

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: henrybear327
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahrtr ahrtr mentioned this pull request Feb 13, 2025
15 tasks
@henrybear327 henrybear327 force-pushed the e2e/downgrade_auto_cancel branch from 43e04f9 to a18de65 Compare February 13, 2025 20:35
@henrybear327 henrybear327 force-pushed the e2e/downgrade_auto_cancel branch 2 times, most recently from 7baf245 to f39e84e Compare February 15, 2025 01:36
@henrybear327 henrybear327 force-pushed the e2e/downgrade_auto_cancel branch from f39e84e to 121fa72 Compare February 15, 2025 01:52
@henrybear327 henrybear327 requested a review from ahrtr February 15, 2025 01:54
if opString == "downgrading" && len(membersToChange) == len(clus.Procs) {
lg.Info("Waiting for downgrade completion log line")
leader := clus.WaitLeader(t)
_, err := clus.Procs[leader].Logs().ExpectWithContext(context.TODO(), expect.ExpectedResponse{Value: "the cluster has been downgraded"})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please set a timeout, i.e. 30s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done as requested!

Do you see a risk with this approach of monitoring the leader's log @ahrtr? :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only concern is it will fail if the leader somehow changes.

So suggest to add a retry mechanism (i.e. up to 3 times), and let's wait 15s instead of 30s in each retry.

Verify that the downgrade can be cancelled
automatically when the downgrade is completed
(using `no inflight downgrade job`` as the indicator)

Please see: etcd-io#19365 (comment)
Reference: etcd-io#17976

Signed-off-by: Chun-Hung Tseng <[email protected]>
@henrybear327 henrybear327 force-pushed the e2e/downgrade_auto_cancel branch from 121fa72 to 24d1fc1 Compare February 15, 2025 10:58
@@ -207,6 +207,77 @@ func testDowngradeUpgrade(t *testing.T, numberOfMembersToDowngrade int, clusterS
assert.Equal(t, beforeMembers.Members, afterMembers.Members)
}

func TestAutomaticDowngradeCancellationAfterCompletingDowngradingInClusterOf3(t *testing.T) {
Copy link
Member

@ahrtr ahrtr Feb 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name is too long and wordy

Suggested change
func TestAutomaticDowngradeCancellationAfterCompletingDowngradingInClusterOf3(t *testing.T) {
func TestDowngradeAutoCancelAfterCompletion(t *testing.T) {

testutils.ExecuteWithTimeout(t, 1*time.Minute, func() {
t.Log("etcdctl downgrade cancel")
err = c.DowngradeCancel(context.TODO())
require.Errorf(t, err, "no inflight downgrade job")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
require.Errorf(t, err, "no inflight downgrade job")
require.ErrorContains(t, err, "no inflight downgrade job")

@henrybear327
Copy link
Contributor Author

Is this PR still relevant as we are doing #19439?

@ahrtr
Copy link
Member

ahrtr commented Feb 18, 2025

Is this PR still relevant as we are doing #19439?

Yes, I think so. The auto downgrade cancel should still work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

4 participants