
tests/framwork/e2e/cluster.go: revert back to sequential cluster stop to reduce e2e test run time #15637

Merged: 1 commit merged into etcd-io:main on Apr 12, 2023

Conversation

@chaochn47 (Member) commented on Apr 4, 2023:

Fix: #15497

Before:

It takes 28.9 minutes, which is close to the 30-minute timeout.

% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/common' '--tags=e2e' '-timeout=30m')
ok  	go.etcd.io/etcd/tests/v3/common	1735.430s

https://github.com/etcd-io/etcd/actions/runs/4602707842/jobs/8131973032?pr=15580

After:

It takes 15 minutes.

% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/common' '--tags=e2e' '-timeout=30m')
ok  	go.etcd.io/etcd/tests/v3/common	902.535s

Note: this change is meant to optimize 'go.etcd.io/etcd/tests/v3/common' test latency, not all the e2e tests.

After force stop on the 3-node clusters created with PeerTLS / PeerAutoTLS, it is reduced to 19.9 minutes.

% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/common' '--tags=e2e' '-timeout=30m')
ok  	go.etcd.io/etcd/tests/v3/common	1199.148s

https://github.com/etcd-io/etcd/actions/runs/4663910234/jobs/8255664805?pr=15637


TODO:
If we fix the issue mentioned in #15497 (comment), I am sure it will drop much further, close to the old e2e test runtime:

% (cd tests && 'env' 'ETCD_VERIFY=all' 'go' 'test' 'go.etcd.io/etcd/tests/v3/e2e' '-timeout=30m')
ok  	go.etcd.io/etcd/tests/v3/e2e	690.138s


@serathius (Member) commented:

I'm not sure that we want to make force killing the default behavior for all tests. Ensuring that the process exits correctly and gracefully is also an important thing to test. It's not important for robustness tests, because their goal is not to test the etcd binary but the API. E2e tests should explicitly test the etcd binary.

I would rather go in the direction of finding the set of tests that take too long and checking the reason. If the reason is etcd shutdown taking long, then we can selectively use force kill. However, I don't expect most e2e tests to have the same problem as robustness tests since, by design, most of them are not as disruptive as robustness tests are.

@chaochn47 force-pushed the reduce_e2e_test_runtime branch from 1898654 to 6c25fac on April 4, 2023 08:00
@chaochn47 (Member, Author) commented on Apr 4, 2023:

@serathius sorry for the disruption and the scary PR. I published it prematurely.

In the latest version of the PR, I made it an opt-in behavior used only by the common test framework's e2e tests.

Ensuring that the process exits correctly and gracefully is also an important thing to test. E2e tests should explicitly test the etcd binary.

Agreed, this tenet applies to some e2e test cases.
But I don't think it applies to the common test cases, since IMHO they were designed to run against both integration and e2e.

For the detailed analysis, could you please take a look at this #15497 (comment) in the original issue? The summary is: "the duration of test cases that include PeerTLS or PeerAutoTLS with a cluster size of 3 is unexpectedly high."
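
For illustration only, a minimal, self-contained sketch of what such an opt-in force-stop knob could look like; the Cluster type and ForceStop field here are hypothetical and are not the actual etcd test-framework API:

package cluster

import (
    "os"
    "syscall"
)

type Cluster struct {
    Procs     []*os.Process
    ForceStop bool // opt-in: skip graceful shutdown and kill the processes outright
}

// Close stops every member. By default it asks each process to exit
// gracefully (SIGTERM); with ForceStop it kills them immediately (SIGKILL).
func (c *Cluster) Close() error {
    sig := syscall.SIGTERM
    if c.ForceStop {
        sig = syscall.SIGKILL
    }
    var firstErr error
    for _, p := range c.Procs {
        if err := p.Signal(sig); err != nil && firstErr == nil {
            firstErr = err
        }
    }
    return firstErr
}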

@chaochn47 force-pushed the reduce_e2e_test_runtime branch from 6c25fac to 960b4ea on April 4, 2023 08:29
@serathius (Member) commented on Apr 4, 2023:

Don't worry about publishing the PR; I wanted to give my feedback early to make sure the direction is correct. Commented on the issue: #15497 (comment).

@chaochn47 changed the title from "tests/framework/e2e: kill etcd cluster processes after test finished" to "tests/common: kill etcd cluster processes after test finished only when peerTLS turned on" on Apr 11, 2023
@chaochn47 force-pushed the reduce_e2e_test_runtime branch from 960b4ea to 166bd13 on April 11, 2023 04:35
@chaochn47 marked this pull request as ready for review on April 11, 2023 05:15
@chaochn47 (Member, Author) commented:

The issue analysis is posted in #15497 (comment).

Could you please take a look and see if this PR makes sense to you? It greatly reduces the e2e test runtime and improves each contributor's dev experience. Thanks!

@serathius @ahrtr @jmhbnz

@jmhbnz (Member) left a comment:

Great analysis in the linked issue, @chaochn47.

I think there is perhaps a follow-up PR to also change which tests use a 3-node cluster versus 1, to further improve execution times. It's a good idea to keep the changes separate and land them in small chunks.

I'm taking the analysis in the linked issue at face value, as I am still learning this code and don't have much project history yet. With that in mind, the code itself looks OK to me, and I think it will be a good outcome if this merges; however, maintainers may have more nuanced feedback.

@fuweid (Member) left a comment:

LGTM. Great work!

I think if a test case doesn't verify graceful shutdown, it should use force cleanup.

@serathius (Member) commented on Apr 11, 2023:

Follow-up based on @fuweid's comment: could you add tests for graceful shutdown for all etcd cluster configurations? Those would be the only tests that don't use forceful shutdown with 3 members.

@chaochn47 (Member, Author) commented:

Thanks for the review @fuweid @serathius @jmhbnz !

could you add tests for graceful shutdown for all etcd cluster configurations? Those would be the only tests that don't use forceful shutdown with 3 members.

Ensuring that the process exits correctly and gracefully is also an important thing to test

@serathius I am not sure what the expected graceful shutdown behavior is. Could you please elaborate a little more?

As a first step, I will create another PR that includes the following:

For a cluster configuration enumeration:
			clus := testRunner.NewCluster(ctx, t, config.WithClusterConfig(tc.config))
			defer clus.Close()

It's effectively the same as current test coverage.

And then improve incrementally based on the expected graceful shutdown behavior. What do you think?

@serathius (Member) commented on Apr 11, 2023:

@serathius I am not sure what the expected graceful shutdown behavior is. Could you please elaborate a little more?

Last time I looked, the leader will try to pass leadership before exiting. So we should have tests for that behavior:

  • Test that in a 3-member cluster the leader tries to pass leadership when shut down
  • Test that in a 3-member cluster the leader still shuts down if it cannot pass leadership.
  • Test that in a 1-member cluster the leadership transfer is skipped

Maybe such tests already exist; I haven't looked.
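
For illustration, a minimal sketch of the first scenario above, written against a hypothetical clusterAPI interface; none of these names are the real etcd test-framework API:

package shutdowncheck

import "fmt"

// clusterAPI is a hypothetical, minimal view of a running test cluster;
// it is not the real etcd test-framework interface.
type clusterAPI interface {
    Leader() (memberID uint64, err error) // returns the current leader's member ID
    StopGracefully(memberID uint64) error // SIGTERM the member and wait for it to exit
}

// verifyLeaderHandoff stops the current leader gracefully and checks that
// leadership ends up on a different member afterwards (scenario 1 above).
func verifyLeaderHandoff(c clusterAPI) error {
    oldLeader, err := c.Leader()
    if err != nil {
        return err
    }
    if err := c.StopGracefully(oldLeader); err != nil {
        return fmt.Errorf("graceful stop of leader %x failed: %w", oldLeader, err)
    }
    newLeader, err := c.Leader()
    if err != nil {
        return fmt.Errorf("no leader after shutdown: %w", err)
    }
    if newLeader == oldLeader {
        return fmt.Errorf("leadership still reported on stopped member %x", oldLeader)
    }
    return nil
}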

As a first step, I will create another PR that includes the following:

For a cluster configuration enumeration:
			clus := testRunner.NewCluster(ctx, t, config.WithClusterConfig(tc.config))
			defer clus.Close()

It's effectively the same as current test coverage.

And then improve incrementally based on the expected graceful shutdown behavior. What do you think?

I don't understand the proposal. Can you describe what you want to change?

@chaochn47 (Member, Author) commented on Apr 11, 2023:

Last time I looked, the leader will try to pass leadership before exiting. So we should have tests for that behavior

Thanks!! That's very helpful.

I don't understand the proposal. Can you describe what you want to change?

Sorry for the confusion.

The proposal is to create a 3-member cluster, send a SIGTERM signal to each process, and wait for each etcd process to gracefully shut down (transfer leadership) and exit. It's effectively the same as the current test coverage in tests/common. However, instead of testing this across N test cases, it is reduced to a single test.

Do you think we are on the same page now?
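
For illustration, a self-contained sketch of the per-member step of this proposal (send SIGTERM, then wait for exit with a timeout), using plain os/exec; stopGracefully is a hypothetical helper and "sleep 60" stands in for an etcd member process:

package main

import (
    "fmt"
    "os/exec"
    "syscall"
    "time"
)

// stopGracefully signals cmd with SIGTERM and waits up to timeout for it to exit.
func stopGracefully(cmd *exec.Cmd, timeout time.Duration) error {
    if err := cmd.Process.Signal(syscall.SIGTERM); err != nil {
        return err
    }
    done := make(chan error, 1)
    go func() { done <- cmd.Wait() }()
    select {
    case err := <-done:
        return err // process exited (err may be nil or reflect its exit status)
    case <-time.After(timeout):
        return fmt.Errorf("process %d did not exit within %v", cmd.Process.Pid, timeout)
    }
}

func main() {
    cmd := exec.Command("sleep", "60") // stand-in for an etcd member process
    if err := cmd.Start(); err != nil {
        panic(err)
    }
    fmt.Println(stopGracefully(cmd, 5*time.Second))
}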

@serathius (Member) commented:

It would be better to have some assertions in the tests instead of just testing that the process can be killed. Please see my suggestions in the above comments on which graceful shutdown scenarios we should test.

@chaochn47 (Member, Author) commented:

I see. Thanks!! That is very helpful.

@ahrtr (Member) commented on Apr 12, 2023:

I agree that the long delay (when stopping the last member) is caused by TransferLeadership. Great finding, @chaochn47. The reason is that our e2e test stops all members concurrently; when the last member tries to transfer the leadership to another member, that member is also in the process of stopping, so the operation always times out (7s by default).

In production, users usually won't stop all members concurrently; instead they stop members one by one (e.g. during a rolling upgrade or update). So I think we should support both sequential/synchronous and concurrent/asynchronous stopping of a multi-member cluster in our e2e tests, defaulting to sequential/synchronous. Of course, we should apply a timeout when stopping each member, but it should usually be fast enough, and we shouldn't see any performance regression compared to the existing concurrent stopping.

We can intentionally add a couple of cases to test concurrent stopping.

In the meantime, I agree with @fuweid that we should also support cleaning up zombie processes at the end of each e2e test case. That can be done separately.
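
For illustration, a rough self-contained sketch of the sequential-by-default idea with an opt-in concurrent mode; the member interface and stopCluster function are placeholders, not the actual framework code:

package stopper

import (
    "context"
    "sync"
    "time"
)

// member stands in for an e2e etcd process wrapper.
type member interface {
    Stop(ctx context.Context) error
}

const perMemberStopTimeout = 10 * time.Second

func stopCluster(members []member, concurrent bool) error {
    stopOne := func(m member) error {
        ctx, cancel := context.WithTimeout(context.Background(), perMemberStopTimeout)
        defer cancel()
        return m.Stop(ctx)
    }

    if !concurrent {
        // Default: mirror how operators roll a cluster, one member at a time.
        for _, m := range members {
            if err := stopOne(m); err != nil {
                return err
            }
        }
        return nil
    }

    // Opt-in concurrent stop, kept for tests that explicitly want it
    // (e.g. robustness-style scenarios).
    var wg sync.WaitGroup
    errs := make(chan error, len(members))
    for _, m := range members {
        wg.Add(1)
        go func(m member) {
            defer wg.Done()
            errs <- stopOne(m)
        }(m)
    }
    wg.Wait()
    close(errs)
    for err := range errs {
        if err != nil {
            return err
        }
    }
    return nil
}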

@ahrtr (Member) left a comment:

Forcibly stopping a cluster isn't a best practice; it should only be a cleanup method of last resort.

Please try to follow #15637 (comment) to stop cluster members sequentially.

Suggested plan/steps:

  1. stop cluster members sequentially (we might also need to remove each member from the cluster);
  2. add test cases to test concurrent stopping;
  3. support cleaning up zombie processes at the end of each e2e test.

@chaochn47 (Member, Author) commented on Apr 12, 2023:

Thanks @ahrtr!! Looking at the test log, I just realized the cluster members are stopped concurrently. My memory was that it was sequential, so I traced back why it was changed.

#15242 is the culprit, as confirmed by the following diagram. You can see that after that PR, the e2e test duration increased from 25m to 43m :(

[image: chart of e2e workflow durations from the GitHub Actions history]

https://github.com/etcd-io/etcd/actions/workflows/e2e.yaml?query=created%3A%3C2023-02-04

I would suggest we make the concurrent stopping behavior opt-in (used only by the robustness tests) and by default stop each member process sequentially. WDYT? @ahrtr @fuweid @serathius

Sorry for the back and forth, but I think it's the right thing to do.

… to reduce e2e test run time

Signed-off-by: Chao Chen <[email protected]>
@chaochn47 force-pushed the reduce_e2e_test_runtime branch from 166bd13 to 941c4af on April 12, 2023 05:08
@chaochn47 changed the title from "tests/common: kill etcd cluster processes after test finished only when peerTLS turned on" to "tests/framwork/e2e/cluster.go: revert back to sequential cluster stop to reduce e2e test run time" on Apr 12, 2023
@ahrtr (Member) commented on Apr 12, 2023:

I would suggest we make the concurrent stopping behavior opt-in (used only by the robustness tests) and by default stop each member process sequentially. WDYT? @ahrtr @fuweid @serathius

First, please change it to sequential stopping and double-check whether we can reduce the overall e2e duration. I see you already did it, thx.

Even if we stop each member sequentially, we do not remove each member from the cluster (we just stop the process); so when stopping the last member, it will still try to transfer the leadership, because there are still multiple voting members. Based on your input, it seems that sequential stopping doesn't have such a performance penalty, which conflicts with my understanding. I haven't had enough time to dig into it. Please try to get this sorted out. thx

@ahrtr (Member) commented on Apr 12, 2023:

I just did a quick verification and confirmed that sequential stopping doesn't have such a performance issue. Please help me sort out the above confusion. thx

$ go test -v -run "^TestMemberAdd/NotLearner/DisableStrictReconfigCheck/NoWaitForQuorum/PeerTLS$" --tags e2e
=== RUN   TestMemberAdd
    before.go:36: Changing working directory to: /var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAdd2848703705/001
=== RUN   TestMemberAdd/NotLearner/DisableStrictReconfigCheck/NoWaitForQuorum/PeerTLS
    logger.go:130: 2023-04-12T13:19:17.487+0800	INFO	starting server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2"}
    logger.go:130: 2023-04-12T13:19:17.487+0800	INFO	starting server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1"}
    logger.go:130: 2023-04-12T13:19:17.487+0800	INFO	starting server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0"}
    logger.go:130: 2023-04-12T13:19:17.488+0800	INFO	spawning process	{"args": ["/Users/wachao/go/src/github.com/ahrtr/etcd/bin/etcd", "--name", "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2", "--listen-client-urls", "http://localhost:20010", "--advertise-client-urls", "http://localhost:20010", "--listen-peer-urls", "https://localhost:20011", "--initial-advertise-peer-urls", "https://localhost:20011", "--initial-cluster-token", "new", "--data-dir", "/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS3515497785/003", "--snapshot-count", "10000", "--strict-reconfig-check=false", "--peer-cert-file", "/Users/wachao/go/src/github.com/ahrtr/etcd/tests/fixtures/server.crt", "--peer-key-file", "/Users/wachao/go/src/github.com/ahrtr/etcd/tests/fixtures/server.key.insecure", "--peer-trusted-ca-file", "/Users/wachao/go/src/github.com/ahrtr/etcd/tests/fixtures/ca.crt", "--initial-cluster", "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0=https://localhost:20001,TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1=https://localhost:20006,TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2=https://localhost:20011", "--initial-cluster-state", "new"], "working-dir": "/private/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAdd2848703705/001", "name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2", "environment-variables": ["PATH=/Users/wachao/software/go/bin:/Users/wachao/software/google-cloud-sdk/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Users/wachao/software/go/bin:/Users/wachao/go/gopath/bin:/Users/wachao/software/umlet:/Users/wachao/software/jdk/Contents/Home/bin:/Users/wachao/software:/Users/wachao/software/etcd:/Users/wachao/software/goreman:/Users/wachao/onedrive/vmware/tkgi/utils:/Users/wachao/Library/Python/3.9/bin:/Users/wachao/software/protoc/bin:/Users/wachao/software/jmeter/bin:/Users/wachao/software/golangci:/Users/wachao/onedrive/open_source/etcd/utils:/Users/wachao/software:/Users/wachao/software/trivy", "ETCD_UNSUPPORTED_ARCH=amd64", "ETCD_VERIFY=all"]}
    logger.go:130: 2023-04-12T13:19:17.488+0800	INFO	spawning process	{"args": ["/Users/wachao/go/src/github.com/ahrtr/etcd/bin/etcd", "--name", "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0", "--listen-client-urls", "http://localhost:20000", "--advertise-client-urls", "http://localhost:20000", "--listen-peer-urls", "https://localhost:20001", "--initial-advertise-peer-urls", "https://localhost:20001", "--initial-cluster-token", "new", "--data-dir", "/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS3515497785/001", "--snapshot-count", "10000", "--strict-reconfig-check=false", "--peer-cert-file", "/Users/wachao/go/src/github.com/ahrtr/etcd/tests/fixtures/server.crt", "--peer-key-file", "/Users/wachao/go/src/github.com/ahrtr/etcd/tests/fixtures/server.key.insecure", "--peer-trusted-ca-file", "/Users/wachao/go/src/github.com/ahrtr/etcd/tests/fixtures/ca.crt", "--initial-cluster", "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0=https://localhost:20001,TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1=https://localhost:20006,TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2=https://localhost:20011", "--initial-cluster-state", "new"], "working-dir": "/private/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAdd2848703705/001", "name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0", "environment-variables": ["PATH=/Users/wachao/software/go/bin:/Users/wachao/software/google-cloud-sdk/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Users/wachao/software/go/bin:/Users/wachao/go/gopath/bin:/Users/wachao/software/umlet:/Users/wachao/software/jdk/Contents/Home/bin:/Users/wachao/software:/Users/wachao/software/etcd:/Users/wachao/software/goreman:/Users/wachao/onedrive/vmware/tkgi/utils:/Users/wachao/Library/Python/3.9/bin:/Users/wachao/software/protoc/bin:/Users/wachao/software/jmeter/bin:/Users/wachao/software/golangci:/Users/wachao/onedrive/open_source/etcd/utils:/Users/wachao/software:/Users/wachao/software/trivy", "ETCD_UNSUPPORTED_ARCH=amd64", "ETCD_VERIFY=all"]}
    logger.go:130: 2023-04-12T13:19:17.488+0800	INFO	spawning process	{"args": ["/Users/wachao/go/src/github.com/ahrtr/etcd/bin/etcd", "--name", "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1", "--listen-client-urls", "http://localhost:20005", "--advertise-client-urls", "http://localhost:20005", "--listen-peer-urls", "https://localhost:20006", "--initial-advertise-peer-urls", "https://localhost:20006", "--initial-cluster-token", "new", "--data-dir", "/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS3515497785/002", "--snapshot-count", "10000", "--strict-reconfig-check=false", "--peer-cert-file", "/Users/wachao/go/src/github.com/ahrtr/etcd/tests/fixtures/server.crt", "--peer-key-file", "/Users/wachao/go/src/github.com/ahrtr/etcd/tests/fixtures/server.key.insecure", "--peer-trusted-ca-file", "/Users/wachao/go/src/github.com/ahrtr/etcd/tests/fixtures/ca.crt", "--initial-cluster", "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0=https://localhost:20001,TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1=https://localhost:20006,TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2=https://localhost:20011", "--initial-cluster-state", "new"], "working-dir": "/private/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAdd2848703705/001", "name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1", "environment-variables": ["PATH=/Users/wachao/software/go/bin:/Users/wachao/software/google-cloud-sdk/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin:/Users/wachao/software/go/bin:/Users/wachao/go/gopath/bin:/Users/wachao/software/umlet:/Users/wachao/software/jdk/Contents/Home/bin:/Users/wachao/software:/Users/wachao/software/etcd:/Users/wachao/software/goreman:/Users/wachao/onedrive/vmware/tkgi/utils:/Users/wachao/Library/Python/3.9/bin:/Users/wachao/software/protoc/bin:/Users/wachao/software/jmeter/bin:/Users/wachao/software/golangci:/Users/wachao/onedrive/open_source/etcd/utils:/Users/wachao/software:/Users/wachao/software/trivy", "ETCD_UNSUPPORTED_ARCH=amd64", "ETCD_VERIFY=all"]}
    logger.go:130: 2023-04-12T13:19:20.909+0800	INFO	started server.	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2", "pid": 85623}
    logger.go:130: 2023-04-12T13:19:20.919+0800	INFO	started server.	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0", "pid": 85624}
    logger.go:130: 2023-04-12T13:19:20.919+0800	INFO	started server.	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1", "pid": 85625}
    logger.go:130: 2023-04-12T13:19:21.359+0800	INFO	closing test cluster...
    logger.go:130: 2023-04-12T13:19:21.359+0800	INFO	stopping server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0"}
    logger.go:130: 2023-04-12T13:19:21.585+0800	INFO	stopped server.	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0"}
    logger.go:130: 2023-04-12T13:19:21.585+0800	INFO	stopping server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1"}
    logger.go:130: 2023-04-12T13:19:21.703+0800	INFO	stopped server.	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1"}
    logger.go:130: 2023-04-12T13:19:21.703+0800	INFO	stopping server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2"}
    logger.go:130: 2023-04-12T13:19:21.804+0800	INFO	stopped server.	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2"}
    logger.go:130: 2023-04-12T13:19:21.804+0800	INFO	closing server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0"}
    logger.go:130: 2023-04-12T13:19:21.804+0800	INFO	stopping server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-0"}
    logger.go:130: 2023-04-12T13:19:21.804+0800	INFO	removing directory	{"data-dir": "/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS3515497785/001"}
    logger.go:130: 2023-04-12T13:19:21.807+0800	INFO	closing server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1"}
    logger.go:130: 2023-04-12T13:19:21.807+0800	INFO	stopping server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1"}
    logger.go:130: 2023-04-12T13:19:21.807+0800	INFO	removing directory	{"data-dir": "/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS3515497785/002"}
    logger.go:130: 2023-04-12T13:19:21.808+0800	INFO	closing server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2"}
    logger.go:130: 2023-04-12T13:19:21.808+0800	INFO	stopping server...	{"name": "TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-2"}
    logger.go:130: 2023-04-12T13:19:21.808+0800	INFO	removing directory	{"data-dir": "/var/folders/n9/qhsds19d75x16c351t6tzhfc0000gp/T/TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS3515497785/003"}
    logger.go:130: 2023-04-12T13:19:21.810+0800	INFO	closed test cluster.
--- PASS: TestMemberAdd (4.33s)
    --- PASS: TestMemberAdd/NotLearner/DisableStrictReconfigCheck/NoWaitForQuorum/PeerTLS (4.32s)
PASS
ok  	go.etcd.io/etcd/tests/v3/common	7.322s

@chaochn47 (Member, Author) commented on Apr 12, 2023:

etcd (TestKVGetPeerAutoTLS-test-2) (2007481): 
{
    "level": "warn",
    "ts": "2023-04-12T05:19:09.148099Z",
    "caller": "etcdserver/server.go:1256",
    "msg": "leadership transfer failed",
    "local-member-id": "c61f33e98087dec2",
    "error": "etcdserver: unhealthy cluster"
}

TriggerLeadershipTransfer needs to pick the transferee that has been connected the longest. But during a sequential shutdown, the last member will find that its peers are no longer active.

transferee, ok := longestConnected(s.r.transport, s.cluster.VotingMemberIDs())
if !ok {
    return errors.ErrUnhealthy
}

That's my theory for why it takes only a short time to stop the cluster. I am confirming it by adding logs...
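
For illustration, a simplified, self-contained model of that theory (this is not the upstream etcd code): once every peer is already stopped, no transferee can be selected, so the shutdown path fails fast with an "unhealthy cluster" error instead of waiting out the 7-second transfer timeout.

package main

import (
    "errors"
    "fmt"
    "time"
)

var errUnhealthy = errors.New("etcdserver: unhealthy cluster")

// pickTransferee mimics a longest-connected selection: activeSince maps
// member ID -> time the peer connection became active (a zero time means
// the peer is not active, e.g. it was already stopped).
func pickTransferee(activeSince map[uint64]time.Time) (uint64, error) {
    var (
        oldest time.Time
        id     uint64
        found  bool
    )
    for member, since := range activeSince {
        if since.IsZero() {
            continue // peer already stopped, cannot be a transferee
        }
        if !found || since.Before(oldest) {
            oldest, id, found = since, member, true
        }
    }
    if !found {
        return 0, errUnhealthy // matches the "leadership transfer failed" warning above
    }
    return id, nil
}

func main() {
    // During a sequential shutdown the last member sees no active peers:
    _, err := pickTransferee(map[uint64]time.Time{1: {}, 2: {}})
    fmt.Println(err) // etcdserver: unhealthy cluster -> fast return, no 7s wait
}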

@fuweid (Member) commented on Apr 12, 2023:

#15242 is the culprit, as confirmed by the following diagram. You can see that after that PR, the e2e test duration increased from 25m to 43m :(

It seems that my patch made things worse. Sorry about that :(. I am confirming it locally as well. Thanks for pointing it out!

@ahrtr (Member) commented on Apr 12, 2023:

FYI, if the second member to be stopped is the leader, it might also lead to some delay (e.g. about 7s in this case) under sequential stopping. I'll leave it to you to sort out.

/Users/wachao/go/src/github.com/ahrtr/etcd/bin/etcd (TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1) (85679): {"level":"info","ts":"2023-04-12T13:21:47.724134+0800","caller":"etcdserver/server.go:1178","msg":"leadership transfer starting","local-member-id":"d70245c72180528a","current-leader-member-id":"d70245c72180528a","transferee-member-id":"c61f33e98087dec2"}
...
/Users/wachao/go/src/github.com/ahrtr/etcd/bin/etcd (TestMemberAddNotLearnerDisableStrictReconfigCheckNoWaitForQuorumPeerTLS-test-1) (85679): {"level":"warn","ts":"2023-04-12T13:21:54.724839+0800","caller":"etcdserver/server.go:1257","msg":"leadership transfer failed","local-member-id":"d70245c72180528a","error":"etcdserver: request timed out, leader transfer took too long"}

It seems that my patch made things worse. Sorry about that :(.

No worries. We should never blame anyone for a technical issue.

@chaochn47 (Member, Author) commented on Apr 12, 2023:

No worries. We should never blame anyone for a technical issue.

+1 https://devops.com/how-sre-creates-a-blameless-culture/

@fuweid (Member) commented on Apr 12, 2023:

Tried it locally:

➜  tests git:(main) ETCD_VERIFY=all taskset -c 0,1,2 go test --tags=e2e -timeout=30m go.etcd.io/etcd/tests/v3/common
ok      go.etcd.io/etcd/tests/v3/common 1196.205s

➜  tests git:(main) git fetch upstream pull/15637/head:review-15637
remote: Enumerating objects: 311, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 115 (delta 98), reused 105 (delta 88), pack-reused 0
Receiving objects: 100% (115/115), 36.09 KiB | 1.16 MiB/s, done.
Resolving deltas: 100% (98/98), completed with 30 local objects.
From https://github.com/etcd-io/etcd
 * [new ref]             refs/pull/15637/head -> review-15637
➜  tests git:(main) git checkout review-15637
Switched to branch 'review-15637'
➜  tests git:(review-15637) ETCD_VERIFY=all taskset -c 0,1,2 go test --tags=e2e -timeout=30m go.etcd.io/etcd/tests/v3/common
go: downloading golang.org/x/sys v0.7.0
go: downloading golang.org/x/net v0.9.0
go: downloading github.com/jonboulle/clockwork v0.4.0
go: downloading github.com/spf13/cobra v1.7.0
go: downloading golang.org/x/text v0.9.0
ok      go.etcd.io/etcd/tests/v3/common 713.899s

I tried using taskset to simulate the GitHub Actions VM, because the Linux VM uses 3 cores.

When we use concurrent stop, it seems it gets worse when the etcd leader is waiting for another member to become leader...

@chaochn47 (Member, Author) commented on Apr 12, 2023:

I think the PR is good enough to reduce the common e2e test from 29 minutes down to 15 minutes. WDYT? It's essentially just a git revert.

FYI, if the second member to be stopped is the leader, it might also lead to some delay (e.g. about 7s in this case) under sequential stopping. I'll leave it to you to sort out.

IMHO, the second stopped member should not even try to transfer leadership if one member is already inactive and it is itself shutting down. Can I follow this up in a separate PR, to follow the single-responsibility principle? @ahrtr

Last time I looked, the leader will try to pass leadership before exiting. So we should have tests for that behavior:

  1. Test that in a 3-member cluster the leader tries to pass leadership when shut down
  2. Test that in a 3-member cluster the leader still shuts down if it cannot pass leadership.
  3. Test that in a 1-member cluster the leadership transfer is skipped

Those are good test behaviors and assertions.

Can I follow this up in a separate PR, to follow the single-responsibility principle? @serathius

Raised issue #15702 to track the above and assigned it to myself.

@ahrtr (Member) commented on Apr 12, 2023:

I think the PR is good enough to reduce the common e2e test from 29 minutes down to 15 minutes. WDYT?

OK. Please get all the confusion sorted out in a separate session/PR.

@fuweid (Member) left a comment:

LGTM

@ahrtr (Member) left a comment:

LGTM

Thanks.

@ahrtr merged commit 817aba5 into etcd-io:main on Apr 12, 2023
@chaochn47 (Member, Author) commented:

Thanks @ahrtr. I will get back to #15702 tomorrow morning.

@chaochn47 deleted the reduce_e2e_test_runtime branch on April 12, 2023 06:51