tests/framework/e2e/cluster.go: revert back to sequential cluster stop to reduce e2e test run time #15637
Conversation
I'm not sure that we want to make force killing the default behavior for all tests. Ensuring that the process exits correctly and gracefully is also an important thing to test. It doesn't matter for robustness tests, because their goal is not to test the etcd binary but the API; e2e tests should explicitly test the etcd binary. I would rather go in the direction of finding the set of tests that take too long and checking the reason. If the reason is etcd shutdown taking long, then we can selectively use force kill. However, I don't expect most e2e tests to have the same problem as robustness tests, since by design most of them are not as disruptive as robustness tests are.
(force-pushed from 1898654 to 6c25fac)
@serathius sorry for the disruption and the scary PR. I published it prematurely. In the latest version of the PR, I make it an opt-in behavior only in the common test framework e2e tests.
Agreed this tenet applies to some e2e test cases. For the detailed analysis, could you please take a look at #15497 (comment) in the original issue? The summary is: the duration of test cases using PeerTLS or PeerAutoTLS with a cluster size of 3 is unexpectedly high.
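For illustration, here is a minimal sketch of what such an opt-in force-stop knob could look like in the e2e cluster config. The type, field, and option names (`EtcdProcessClusterConfig`, `ForceStop`, `WithForceStop`) are hypothetical stand-ins, not the framework's actual API:

```go
// Hypothetical sketch of an opt-in force-stop knob for the e2e cluster
// config; names are illustrative only.
package e2e

// EtcdProcessClusterConfig holds the subset of cluster settings relevant
// to this sketch.
type EtcdProcessClusterConfig struct {
	ClusterSize int
	// ForceStop, when true, kills member processes instead of waiting
	// for a graceful shutdown. It is off by default so that regular e2e
	// tests still exercise the graceful exit path.
	ForceStop bool
}

// EPClusterOption mutates a cluster config before the cluster is started.
type EPClusterOption func(*EtcdProcessClusterConfig)

// WithForceStop opts a single test into forceful cleanup.
func WithForceStop(force bool) EPClusterOption {
	return func(c *EtcdProcessClusterConfig) { c.ForceStop = force }
}
```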
(force-pushed from 6c25fac to 960b4ea)
Don't worry about publishing the PR; I wanted to give my feedback early to make sure the direction is correct. I commented on the issue: #15497 (comment)
(force-pushed from 960b4ea to 166bd13)
The issue analysis is posted in #15497 (comment). Could you please take a look and see if this PR makes sense to you? It greatly reduces the e2e test runtime and improves every contributor's dev experience. Thanks!!
Great analysis in the linked issue @chaochn47.
I think there is perhaps a follow-up PR to change which tests use a 3-node cluster versus a single node, to further improve execution times. It's a good idea to keep the changes separate and land them in small chunks.
I'm taking the analysis in the linked issue at face value, as I am still learning this code and don't have much project history yet. With that in mind, the code itself looks OK to me, and I think it will be a good outcome if this merges; however, maintainers may have more nuanced feedback.
LGTM. Great work!
I think that if a test case doesn't verify graceful shutdown, it should use force cleanup.
As a follow-up based on @fuweid's comment, could you add tests for graceful shutdown for all etcd cluster configurations? Those would be the only tests that don't use forceful shutdown with 3 members.
Thanks for the review @fuweid @serathius @jmhbnz !
@serathius I am not sure what the expected graceful shutdown behavior is. Could you please elaborate a little more? As a first step, I will create another PR that includes the following:
It would effectively be the same as the current test coverage, and we can then improve incrementally based on the expected graceful shutdown behavior. What do you think?
Last time I looked, the leader will try to transfer leadership before exiting, so we should have a test for that behavior:
Maybe such tests already exist; I haven't checked.
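A minimal sketch of such an assertion, assuming a hypothetical cluster helper interface (the real e2e framework exposes richer types than this):

```go
package e2etest

import "testing"

// testCluster is a hypothetical stand-in for the e2e cluster helper,
// kept small so the sketch is self-contained.
type testCluster interface {
	Leader() int                 // index of the current leader member
	Stop(i int) error            // send SIGTERM to member i and wait for it to exit
	WaitLeader(t *testing.T) int // index of the leader after handoff/re-election
}

// assertLeadershipHandoff gracefully stops the current leader and checks
// that leadership has moved to a different member afterwards, i.e. the
// exiting leader actually transferred leadership instead of just dying.
func assertLeadershipHandoff(t *testing.T, clus testCluster) {
	oldLeader := clus.Leader()
	if err := clus.Stop(oldLeader); err != nil {
		t.Fatalf("graceful stop of leader %d failed: %v", oldLeader, err)
	}
	if newLeader := clus.WaitLeader(t); newLeader == oldLeader {
		t.Fatalf("expected leadership to move away from member %d", oldLeader)
	}
}
```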
I don't understand the proposal. Can you describe what you want to change?
Thanks!! That's very helpful.
Sorry for the confusion. The proposal is to create a 3-member cluster, send a SIGTERM signal to each process, and wait for each etcd process to shut down gracefully (transferring leadership) and exit. It's effectively the same as the current test coverage. Do you think we are on the same page now?
It would be better to have some assertions in the tests instead of just testing that the process can be killed. Please see my suggestions in the comments above on which graceful shutdown scenarios we should test.
I see. Thanks!! That is very helpful.
I agree that the long delay (when stopping the last member) is caused by TransferLeadership. Great finding @chaochn47. The reason is that our e2e test stops all members concurrently. When the last member tries to transfer leadership to another member, that member is also in the process of stopping, so the operation always times out (7s by default). In production, users usually don't stop all members concurrently; instead they stop members one by one (e.g. during a rolling upgrade or update).

So I think we should support both sequential/synchronous and concurrent/asynchronous stopping of a multi-member cluster in our e2e test, and default to sequential/synchronous. Of course, we should apply a timeout when stopping each member, but it should usually be fast enough, and we shouldn't see any performance reduction compared to the existing concurrent stopping. We can intentionally add a couple of cases to test concurrent stopping.

In the meantime, I agree with @fuweid that we should also support cleaning up zombie processes at the end of each e2e test case. It can be done separately.
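To illustrate the two stop modes being discussed, here is a rough sketch; the `member` interface and helper names are hypothetical, not the e2e framework's API:

```go
package e2etest

import (
	"errors"
	"fmt"
	"time"
)

// member is a hypothetical stand-in for an etcd process handle in the e2e
// framework; only the Stop behavior matters for this sketch.
type member interface {
	Stop() error // send SIGTERM and wait for the process to exit
}

// stopSequentially stops members one by one, the way operators usually do
// (e.g. rolling restarts), applying a per-member timeout so a stuck
// shutdown cannot hang the whole test.
func stopSequentially(members []member, perMemberTimeout time.Duration) error {
	for i, m := range members {
		errc := make(chan error, 1)
		go func() { errc <- m.Stop() }()
		select {
		case err := <-errc:
			if err != nil {
				return fmt.Errorf("stop member %d: %w", i, err)
			}
		case <-time.After(perMemberTimeout):
			return fmt.Errorf("stop member %d: timed out after %v", i, perMemberTimeout)
		}
	}
	return nil
}

// stopConcurrently stops all members at once; with 3 voting members this
// is where the leader's ~7s leadership-transfer timeout shows up, so it
// should be opt-in rather than the default.
func stopConcurrently(members []member) error {
	errc := make(chan error, len(members))
	for _, m := range members {
		go func(m member) { errc <- m.Stop() }(m)
	}
	var errs []error
	for range members {
		if err := <-errc; err != nil {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}
```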
Forcibly stopping a cluster isn't best practice; it should only be a cleanup method of last resort.
Please try to follow #15637 (comment) to stop cluster members sequentially.
Suggested plan/steps:
- stop cluster members sequentially (we might also need to remove each member from the cluster);
- add test cases that test concurrent stopping;
- support cleaning up zombie processes at the end of each e2e test (a rough sketch follows this list).
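As a rough illustration of the last item, a test could register a cleanup hook that force-kills any etcd processes still alive at the end of the case; the helper below is hypothetical, not the framework's actual cleanup code:

```go
package e2etest

import (
	"os/exec"
	"syscall"
	"testing"
)

// registerZombieCleanup force-kills any spawned etcd processes that are
// still running when the test case ends, so a failed graceful stop cannot
// leak processes into later tests.
func registerZombieCleanup(t *testing.T, cmds []*exec.Cmd) {
	t.Cleanup(func() {
		for _, cmd := range cmds {
			if cmd.Process == nil {
				continue // process was never started
			}
			// Signal 0 only checks whether the process is still alive.
			if err := cmd.Process.Signal(syscall.Signal(0)); err != nil {
				continue // already exited
			}
			t.Logf("killing leftover etcd process pid=%d", cmd.Process.Pid)
			_ = cmd.Process.Kill()
			_, _ = cmd.Process.Wait()
		}
	})
}
```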
Thanks @ahrtr !! I just realized the members of the cluster are stopped concurrently by looking at the test log. My memory was that it was sequential, so I traced back why it was changed. #15242 is the culprit, confirmed by the following chart: you can see that after that PR, the e2e test duration increased from 25m to 43m :( https://github.com/etcd-io/etcd/actions/workflows/e2e.yaml?query=created%3A%3C2023-02-04

I would suggest we make concurrent stopping an opt-in behavior used only by the robustness tests, and by default stop each member process sequentially. WDYT? @ahrtr @fuweid @serathius Sorry for the back and forth, but I think it's the right thing to do.
(force-pushed from 166bd13 to 941c4af: "… to reduce e2e test run time", Signed-off-by: Chao Chen <[email protected]>)
Firstly, please change it to sequential stopping and double-confirm whether we can reduce the overall e2e duration. I see you already did it, thx. Even if we stop each member sequentially, we do not remove each member from the cluster (we just stop the process); so when stopping the last member, it will still try to transfer the leadership, because there are still multiple voting members. Based on your input, it seems that sequential stopping doesn't have such a performance penalty, which conflicts with my understanding. I haven't had enough time to dig into it. Please try to get this sorted out. thx
I just did a quick verification and confirmed that sequential stopping doesn't have such a performance issue. Please help me sort out the above confusion. thx
TriggerLeadershipTransfer needs to pick a transferee that has been connected the longest. But during a sequential shutdown, the last member finds that its peers are no longer active. See etcd/server/etcdserver/server.go, lines 1226 to 1229, at 6519a15.

That's my theory for why it takes so little time to stop the cluster. I am confirming it by adding logs...
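Paraphrasing that theory as a sketch (this illustrates the selection idea only, it is not the actual code at the referenced server.go lines): the leader picks the longest-active peer as transferee, and if no peer is active it gives up immediately instead of waiting out the ~7s request timeout.

```go
package e2etest

import (
	"errors"
	"time"
)

// peerStatus is a simplified stand-in for the peer activity information the
// server keeps; hypothetical, not the real etcd type.
type peerStatus struct {
	id          uint64
	active      bool
	activeSince time.Time
}

// longestActiveTransferee picks the peer that has been active the longest,
// mirroring the "longest connected" transferee choice described above. When
// the cluster is stopped sequentially, the last member sees no active peers,
// so the transfer is skipped almost immediately rather than timing out.
func longestActiveTransferee(peers []peerStatus) (uint64, error) {
	var (
		best  uint64
		found bool
		since time.Time
	)
	for _, p := range peers {
		if !p.active {
			continue
		}
		if !found || p.activeSince.Before(since) {
			best, since, found = p.id, p.activeSince, true
		}
	}
	if !found {
		return 0, errors.New("no active peer to transfer leadership to")
	}
	return best, nil
}
```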
It seems that my patch made things worse. Sorry about that :(. I am verifying it locally as well. Thanks for pointing it out!
FYI, if the second stopped member is the leader, it might also lead to some delay (e.g. about 7s in this case) under sequential stopping. I'll leave it to you to sort out.
No worries. We should never blame anyone for a technical issue.
I tried it locally:

```
➜ tests git:(main) ETCD_VERIFY=all taskset -c 0,1,2 go test --tags=e2e -timeout=30m go.etcd.io/etcd/tests/v3/common
ok      go.etcd.io/etcd/tests/v3/common 1196.205s
➜ tests git:(main) git fetch upstream pull/15637/head:review-15637
remote: Enumerating objects: 311, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 115 (delta 98), reused 105 (delta 88), pack-reused 0
Receiving objects: 100% (115/115), 36.09 KiB | 1.16 MiB/s, done.
Resolving deltas: 100% (98/98), completed with 30 local objects.
From https://github.com/etcd-io/etcd
 * [new ref]             refs/pull/15637/head -> review-15637
➜ tests git:(main) git checkout review-15637
Switched to branch 'review-15637'
➜ tests git:(review-15637) ETCD_VERIFY=all taskset -c 0,1,2 go test --tags=e2e -timeout=30m go.etcd.io/etcd/tests/v3/common
go: downloading golang.org/x/sys v0.7.0
go: downloading golang.org/x/net v0.9.0
go: downloading github.com/jonboulle/clockwork v0.4.0
go: downloading github.com/spf13/cobra v1.7.0
go: downloading golang.org/x/text v0.9.0
ok      go.etcd.io/etcd/tests/v3/common 713.899s
```

I used taskset to simulate the GitHub Actions VM, because the Linux VM uses 3 cores. With concurrent stop, it seems things get worse when the etcd leader is waiting for another member to become leader...
I think the PR is good enough to reduce the common e2e test from 29mins down to 15mins. WDYT? It's essentially just a git revert.

IMHO, the second stopped member should not even try to transfer leadership if a member is not active and it is itself shutting down. Can I follow this up in a separate PR, to keep to the single-responsibility principle? @ahrtr

Those are good test behaviors and assertions. Can I follow these up in a separate PR as well, @serathius? I raised issue #15702 to track the above and assigned it to myself.
OK. Please get all the confusion sorted out in a separate session/PR.
LGTM
LGTM
Thanks.
Fix: #15497
Before:
It is 28.9mins, which is close to the 30mins timeout.
https://github.com/etcd-io/etcd/actions/runs/4602707842/jobs/8131973032?pr=15580
After:
15mins
Note: this change is meant to optimize the 'go.etcd.io/etcd/tests/v3/common' test latency, not all of the e2e tests.

~~After force stop on a 3-node cluster created in PeerTLS / PeerAutoTLS, it is reduced to 19.9mins.~~
https://github.com/etcd-io/etcd/actions/runs/4663910234/jobs/8255664805?pr=15637

TODO: If we fix the issue mentioned in #15497 (comment), I am sure it will drop much further, close to the old e2e test runtime duration.

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.