tests/framework/e2e/cluster.go: revert back to sequential cluster stop to reduce e2e test run time #15637
Conversation
I'm not sure that we want to make force killing the default behavior for all tests. Ensuring that the process exits correctly and gracefully is also an important thing to test. It doesn't matter for robustness tests, because their goal is not to test the etcd binary but the API; e2e tests should explicitly test the etcd binary. I would rather go in the direction of finding the set of tests that take too long and checking the reason. If the reason is etcd shutdown taking long, then we can selectively use force kill. However, I don't expect most e2e tests to have the same problem as robustness tests, since by design most of them are not as disruptive as robustness tests are.
(force-pushed from 1898654 to 6c25fac)
@serathius sorry for the disruption and the scary PR. I published it prematurely. In the latest version of the PR, I make it an opt-in behavior only in the common test framework e2e tests.
Agreed this tenet applies to some e2e test cases. For the detailed analysis, could you please take a look at #15497 (comment) in the original issue? The summary is: the duration of test cases using PeerTLS or PeerAutoTLS with a cluster size of 3 is unexpectedly high.
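For illustration, here is a minimal sketch of what such an opt-in force-stop knob could look like in the e2e cluster config. The type, field, and option names (`EtcdProcessClusterConfig`, `ForceStop`, `WithForceStop`) are hypothetical stand-ins, not the framework's actual API:

```go
// Hypothetical sketch of an opt-in force-stop knob for the e2e cluster
// config; names are illustrative only.
package e2e

// EtcdProcessClusterConfig holds the subset of cluster settings relevant
// to this sketch.
type EtcdProcessClusterConfig struct {
	ClusterSize int
	// ForceStop, when true, kills member processes instead of waiting
	// for a graceful shutdown. It is off by default so that regular e2e
	// tests still exercise the graceful exit path.
	ForceStop bool
}

// EPClusterOption mutates a cluster config before the cluster is started.
type EPClusterOption func(*EtcdProcessClusterConfig)

// WithForceStop opts a single test into forceful cleanup.
func WithForceStop(force bool) EPClusterOption {
	return func(c *EtcdProcessClusterConfig) { c.ForceStop = force }
}
```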
(force-pushed from 6c25fac to 960b4ea)
Don't worry about publishing the PR; I wanted to give my feedback early to make sure the direction is correct. I commented on the issue: #15497 (comment)
(force-pushed from 960b4ea to 166bd13)
The issue analysis is posted in #15497 (comment). Could you please take a look and see if this PR makes sense to you? It greatly reduces the e2e test runtime and improves every contributor's dev experience. Thanks!!
Great analysis in the linked issue @chaochn47.
I think there is perhaps a follow-up PR to change which tests use a 3-node cluster versus a single node, to further improve execution times. It's a good idea to keep the changes separate and land them in small chunks.
I'm taking the analysis in the linked issue at face value, as I am still learning this code and don't have much project history yet. With that in mind, the code itself looks OK to me, and I think it will be a good outcome if this merges; however, maintainers may have more nuanced feedback.
LGTM. Great work!
I think that if a test case doesn't verify graceful shutdown, it should use force cleanup.
As a follow-up based on @fuweid's comment, could you add tests for graceful shutdown for all etcd cluster configurations? Those would be the only tests that don't use forceful shutdown with 3 members.
Thanks for the review @fuweid @serathius @jmhbnz !
@serathius I am not sure what the expected graceful shutdown behavior is. Could you please elaborate a little more? As a first step, I will create another PR that includes the following:
It would effectively be the same as the current test coverage, and we can then improve incrementally based on the expected graceful shutdown behavior. What do you think?
Last time I looked, the leader will try to transfer leadership before exiting, so we should have a test for that behavior:
Maybe such tests already exist; I haven't checked.
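A minimal sketch of such an assertion, assuming a hypothetical cluster helper interface (the real e2e framework exposes richer types than this):

```go
package e2etest

import "testing"

// testCluster is a hypothetical stand-in for the e2e cluster helper,
// kept small so the sketch is self-contained.
type testCluster interface {
	Leader() int                 // index of the current leader member
	Stop(i int) error            // send SIGTERM to member i and wait for it to exit
	WaitLeader(t *testing.T) int // index of the leader after handoff/re-election
}

// assertLeadershipHandoff gracefully stops the current leader and checks
// that leadership has moved to a different member afterwards, i.e. the
// exiting leader actually transferred leadership instead of just dying.
func assertLeadershipHandoff(t *testing.T, clus testCluster) {
	oldLeader := clus.Leader()
	if err := clus.Stop(oldLeader); err != nil {
		t.Fatalf("graceful stop of leader %d failed: %v", oldLeader, err)
	}
	if newLeader := clus.WaitLeader(t); newLeader == oldLeader {
		t.Fatalf("expected leadership to move away from member %d", oldLeader)
	}
}
```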
I don't understand the proposal. Can you describe what you want to change?
Thanks!! That's very helpful.
Sorry for the confusion. The proposal is to create a 3-member cluster, send a SIGTERM signal to each process, and wait for each etcd process to shut down gracefully (transferring leadership) and exit. It's effectively the same as the current test coverage. Do you think we are on the same page now?
It would be better to have some assertions in the tests instead of just testing that the process can be killed. Please see my suggestions in the comments above on which graceful shutdown scenarios we should test.
I see. Thanks!! That is very helpful.
I agree that the long delay (when stopping the last member) is caused by TransferLeadership. Great finding @chaochn47. The reason is that our e2e test stops all members concurrently. When the last member tries to transfer leadership to another member, that member is also in the process of stopping, so the operation always times out (7s by default). In production, users usually don't stop all members concurrently; instead they stop members one by one (e.g. during a rolling upgrade or update).

So I think we should support both sequential/synchronous and concurrent/asynchronous stopping of a multi-member cluster in our e2e test, and default to sequential/synchronous. Of course, we should apply a timeout when stopping each member, but it should usually be fast enough, and we shouldn't see any performance reduction compared to the existing concurrent stopping. We can intentionally add a couple of cases to test concurrent stopping.

In the meantime, I agree with @fuweid that we should also support cleaning up zombie processes at the end of each e2e test case. It can be done separately.
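To illustrate the two stop modes being discussed, here is a rough sketch; the `member` interface and helper names are hypothetical, not the e2e framework's API:

```go
package e2etest

import (
	"errors"
	"fmt"
	"time"
)

// member is a hypothetical stand-in for an etcd process handle in the e2e
// framework; only the Stop behavior matters for this sketch.
type member interface {
	Stop() error // send SIGTERM and wait for the process to exit
}

// stopSequentially stops members one by one, the way operators usually do
// (e.g. rolling restarts), applying a per-member timeout so a stuck
// shutdown cannot hang the whole test.
func stopSequentially(members []member, perMemberTimeout time.Duration) error {
	for i, m := range members {
		errc := make(chan error, 1)
		go func() { errc <- m.Stop() }()
		select {
		case err := <-errc:
			if err != nil {
				return fmt.Errorf("stop member %d: %w", i, err)
			}
		case <-time.After(perMemberTimeout):
			return fmt.Errorf("stop member %d: timed out after %v", i, perMemberTimeout)
		}
	}
	return nil
}

// stopConcurrently stops all members at once; with 3 voting members this
// is where the leader's ~7s leadership-transfer timeout shows up, so it
// should be opt-in rather than the default.
func stopConcurrently(members []member) error {
	errc := make(chan error, len(members))
	for _, m := range members {
		go func(m member) { errc <- m.Stop() }(m)
	}
	var errs []error
	for range members {
		if err := <-errc; err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```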
Forcibly stopping a cluster isn't best practice; it should only be a cleanup method of last resort.
Please try to follow #15637 (comment) to stop cluster members sequentially.
Suggested plan/steps:
- stop cluster members sequentially (we might also need to remove each member from the cluster);
- add test cases that test concurrent stopping;
- support cleaning up zombie processes at the end of each e2e test (a rough sketch follows this list).
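As a rough illustration of the last item, a test could register a cleanup hook that force-kills any etcd processes still alive at the end of the case; the helper below is hypothetical, not the framework's actual cleanup code:

```go
package e2etest

import (
	"os/exec"
	"syscall"
	"testing"
)

// registerZombieCleanup force-kills any spawned etcd processes that are
// still running when the test case ends, so a failed graceful stop cannot
// leak processes into later tests.
func registerZombieCleanup(t *testing.T, cmds []*exec.Cmd) {
	t.Cleanup(func() {
		for _, cmd := range cmds {
			if cmd.Process == nil {
				continue // process was never started
			}
			// Signal 0 only checks whether the process is still alive.
			if err := cmd.Process.Signal(syscall.Signal(0)); err != nil {
				continue // already exited
			}
			t.Logf("killing leftover etcd process pid=%d", cmd.Process.Pid)
			_ = cmd.Process.Kill()
			_, _ = cmd.Process.Wait()
		}
	})
}
```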
Thanks @ahrtr !! I just realized the members of the cluster are stopped concurrently by looking at the test log. My memory was that it was sequential, so I traced back why it was changed. #15242 is the culprit, confirmed by the following chart: you can see that after that PR, the e2e test duration increased from 25m to 43m :( https://github.com/etcd-io/etcd/actions/workflows/e2e.yaml?query=created%3A%3C2023-02-04

I would suggest we make concurrent stopping an opt-in behavior used only by the robustness tests, and by default stop each member process sequentially. WDYT? @ahrtr @fuweid @serathius Sorry for the back and forth, but I think it's the right thing to do.
(force-pushed from 166bd13 to 941c4af: "… to reduce e2e test run time", Signed-off-by: Chao Chen <[email protected]>)
Firstly, please change it to sequential stopping and double-confirm whether we can reduce the overall e2e duration. I see you already did it, thx. Even if we stop each member sequentially, we do not remove each member from the cluster (we just stop the process); so when stopping the last member, it will still try to transfer the leadership, because there are still multiple voting members. Based on your input, it seems that sequential stopping doesn't have such a performance penalty, which conflicts with my understanding. I haven't had enough time to dig into it. Please try to get this sorted out. thx
I just did a quick verification and confirmed that sequential stopping doesn't have such a performance issue. Please help me sort out the above confusion. thx
TriggerLeadershipTransfer needs to pick a transferee that has been connected the longest. But during a sequential shutdown, the last member finds that its peers are no longer active. See etcd/server/etcdserver/server.go, lines 1226 to 1229, at 6519a15.

That's my theory for why it takes so little time to stop the cluster. I am confirming it by adding logs...
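Paraphrasing that theory as a sketch (this illustrates the selection idea only, it is not the actual code at the referenced server.go lines): the leader picks the longest-active peer as transferee, and if no peer is active it gives up immediately instead of waiting out the ~7s request timeout.

```go
package e2etest

import (
	"errors"
	"time"
)

// peerStatus is a simplified stand-in for the peer activity information the
// server keeps; hypothetical, not the real etcd type.
type peerStatus struct {
	id          uint64
	active      bool
	activeSince time.Time
}

// longestActiveTransferee picks the peer that has been active the longest,
// mirroring the "longest connected" transferee choice described above. When
// the cluster is stopped sequentially, the last member sees no active peers,
// so the transfer is skipped almost immediately rather than timing out.
func longestActiveTransferee(peers []peerStatus) (uint64, error) {
	var (
		best  uint64
		found bool
		since time.Time
	)
	for _, p := range peers {
		if !p.active {
			continue
		}
		if !found || p.activeSince.Before(since) {
			best, since, found = p.id, p.activeSince, true
		}
	}
	if !found {
		return 0, errors.New("no active peer to transfer leadership to")
	}
	return best, nil
}
```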
It seems that my patch made things worse. Sorry about that :(. I am verifying it locally as well. Thanks for pointing it out!
FYI, if the second stopped member is the leader, it might also lead to some delay (e.g. about 7s in this case) under sequential stopping. I'll leave it to you to sort out.
No worries. We should never blame anyone for a technical issue.
I tried it locally:

```
➜ tests git:(main) ETCD_VERIFY=all taskset -c 0,1,2 go test --tags=e2e -timeout=30m go.etcd.io/etcd/tests/v3/common
ok      go.etcd.io/etcd/tests/v3/common 1196.205s
➜ tests git:(main) git fetch upstream pull/15637/head:review-15637
remote: Enumerating objects: 311, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 115 (delta 98), reused 105 (delta 88), pack-reused 0
Receiving objects: 100% (115/115), 36.09 KiB | 1.16 MiB/s, done.
Resolving deltas: 100% (98/98), completed with 30 local objects.
From https://github.com/etcd-io/etcd
 * [new ref]             refs/pull/15637/head -> review-15637
➜ tests git:(main) git checkout review-15637
Switched to branch 'review-15637'
➜ tests git:(review-15637) ETCD_VERIFY=all taskset -c 0,1,2 go test --tags=e2e -timeout=30m go.etcd.io/etcd/tests/v3/common
go: downloading golang.org/x/sys v0.7.0
go: downloading golang.org/x/net v0.9.0
go: downloading github.com/jonboulle/clockwork v0.4.0
go: downloading github.com/spf13/cobra v1.7.0
go: downloading golang.org/x/text v0.9.0
ok      go.etcd.io/etcd/tests/v3/common 713.899s
```

I used taskset to simulate the GitHub Actions VM, because the Linux VM uses 3 cores. With concurrent stop, it seems things get worse when the etcd leader is waiting for another member to become leader...
I think the PR is good enough to reduce the common e2e test from 29mins down to 15mins. WDYT? It's essentially just a git revert.

IMHO, the second stopped member should not even try to transfer leadership if a member is not active and it is itself shutting down. Can I follow this up in a separate PR, to keep to the single-responsibility principle? @ahrtr

Those are good test behaviors and assertions. Can I follow these up in a separate PR as well, @serathius? I raised issue #15702 to track the above and assigned it to myself.
OK. Please get all the confusion sorted out in a separate session/PR.
LGTM
LGTM
Thanks.
Fix: #15497
Before:
It is 28.9mins, which is close to the 30mins timeout.
https://github.com/etcd-io/etcd/actions/runs/4602707842/jobs/8131973032?pr=15580
After:
15mins
Note: this change is meant to optimize the 'go.etcd.io/etcd/tests/v3/common' test latency, not all of the e2e tests.

~~After force stop on a 3-node cluster created in PeerTLS / PeerAutoTLS, it is reduced to 19.9mins.~~
https://github.com/etcd-io/etcd/actions/runs/4663910234/jobs/8255664805?pr=15637

TODO: If we fix the issue mentioned in #15497 (comment), I am sure it will drop much further, close to the old e2e test runtime duration.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.