Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: Migrate member add tests to common framework #14281

Merged
merged 1 commit into from
Sep 6, 2022

Conversation

clement2026
Copy link
Contributor

@clement2026 clement2026 commented Jul 27, 2022

Conctext #13637

func TestCtlV3MemberAddPeerTLS(t *testing.T) {
testCtl(t, memberAddTest, withCfg(*e2e.NewConfigPeerTLS()))
}
func TestCtlV3MemberAdd(t *testing.T) { testCtl(t, memberAddTest) }
func TestCtlV3MemberAddForLearner(t *testing.T) { testCtl(t, memberAddForLearnerTest) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this also covered?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying implementation of TestCtlV3MemberAdd and TestCtlV3MemberAddForLearner use member add and member add --learner command.

I thought it would be valuable to keep them in the e2e testing, to cover the real command usage scenarios.

Do we need to keep them?

Copy link
Member

@serathius serathius Jul 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if I understood you. New framework runs e2e test, but runs etcdctl only with --format=json, so we never real command usage secnario etcdctl member add, but flavor that uses json?.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I've got the point.

I thought our common framework doesn't run e2e test. Now I know it does.

I'm going to remove these redundant functions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought our common framework doesn't run e2e test. Now I know it does.

Common framework is an interface for testing that covers both e2e and integration. Tests in tests/common that use common framework will be automatically run as both e2e and integration tests. Write a test once and double the benefit.

Instead of having two separate frameworks and redundant tests, we want to just have one framework with shared tests. For now we are migrating tests/e2e directory, however if you want you can look into tests/integration directory and check if some test scenarios are already covered and can be removed.

It's a good point that we don't test exact output of etcdctl member add, however that's much lower on our priority than testing e2e process.

@codecov-commenter
Copy link

codecov-commenter commented Jul 27, 2022

Codecov Report

Merging #14281 (d05c894) into main (012fc51) will decrease coverage by 0.27%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #14281      +/-   ##
==========================================
- Coverage   75.49%   75.22%   -0.28%     
==========================================
  Files         457      457              
  Lines       37084    37084              
==========================================
- Hits        27996    27895     -101     
- Misses       7341     7427      +86     
- Partials     1747     1762      +15     
Flag Coverage Δ
all 75.22% <ø> (-0.28%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/auth/simple_token.go 80.00% <0.00%> (-8.47%) ⬇️
client/pkg/v3/tlsutil/tlsutil.go 83.33% <0.00%> (-8.34%) ⬇️
client/pkg/v3/fileutil/purge.go 66.03% <0.00%> (-7.55%) ⬇️
client/v3/concurrency/mutex.go 61.64% <0.00%> (-5.48%) ⬇️
raft/rafttest/node.go 95.00% <0.00%> (-5.00%) ⬇️
server/storage/mvcc/watchable_store.go 84.42% <0.00%> (-4.72%) ⬇️
server/etcdserver/api/v3rpc/interceptor.go 73.43% <0.00%> (-4.17%) ⬇️
server/proxy/grpcproxy/watch.go 92.48% <0.00%> (-4.05%) ⬇️
server/etcdserver/txn/util.go 75.47% <0.00%> (-3.78%) ⬇️
client/pkg/v3/testutil/recorder.go 76.27% <0.00%> (-3.39%) ⬇️
... and 21 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@serathius
Copy link
Member

serathius commented Jul 27, 2022

Found errror:

member_test.go:94: could not add member, err: match not found. Set EXPECT_DEBUG for more info Err: read /dev/ptmx: input/output error, last lines:

Let me know if you need help with debugging.

@clement2026
Copy link
Contributor Author

This clue shows that a member of a cluster starts serving client before it's connected to all peers.

/home/runner/work/etcd/etcd/bin/etcd-22812: {"level":"warn","ts":"2022-07-27T08:46:33.653Z","caller":"etcdserver/server.go:1351","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"a8f3458387986d","requested-member-add":"{ID:557cfa84711678b8 RaftAttributes:{PeerURLs:[http://240.0.0.0:65535/] IsLearner:false} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}
[14847](https://github.com/etcd-io/etcd/runs/7536399620?check_suite_focus=true#step:5:14848)

To add or to remove a member from a cluster, a client need to wait a little bit. I've reproduced it on my computer.

Xnip2022-07-27_19-40-35

To correct the test cases, I'm gonna keep trying MemberAdd() for a few seconds.

@serathius
Copy link
Member

serathius commented Jul 27, 2022

Hmm, maybe we should use clus.WaitLeader()?

@serathius
Copy link
Member

One potential problem is adding members to 1 node cluster. In that situation we grow from 1 -> 2 member cluster. This means that quorum is now 2 member, however in tests we really don't add working member (just a backhole url). This means that 1 node cluster will become unavailable.

On the other hand 3 node cluster grows to 4 members, so the quorum stays the same (2 members). Please try removing test cases that use cluster of size 1.

@clement2026
Copy link
Contributor Author

Hmm, maybe we should use clus.WaitLeader()?

Wow, this is an amazing function, it looks like what I need. I'll give it try.

@clement2026
Copy link
Contributor Author

One potential problem is adding members to 1 node cluster. In that situation we grow from 1 -> 2 member cluster. This means that quorum is now 2 member, however in tests we really don't add working member (just a backhole url). This means that 1 node cluster will become unavailable.

On the other hand 3 node cluster grows to 4 members, so the quorum stays the same (2 members). Please try removing test cases that use cluster of size 1.

sg.

@serathius
Copy link
Member

Hmm, maybe we should use clus.WaitLeader()?

Wow, this is an amazing function, it looks like what I need. I'll give it try.

I think cluster create should already call it so I would look into second suggestion with skipping test cases where cluster size is equal 1.

@clement2026
Copy link
Contributor Author

As I went through the code base, I found that clus.WaitLeader() is called in many integration test cases, not in any e2e test case or the process of cluster creation.
Neither in old e2e test code nor in common framework e2e code, an equivalent function to clus.WaitLeader() was found. 😟

@clement2026
Copy link
Contributor Author

Hmm, maybe we should use clus.WaitLeader()?

Wow, this is an amazing function, it looks like what I need. I'll give it try.

I think cluster create should already call it so I would look into second suggestion with skipping test cases where cluster size is equal 1.

Skipping test cases where cluster size is equal 1 is not likely going to make the tests work. The failed test cases at the moment are the ones with cluster size > 1.

When cluster size=1, the single node accepts MemberAdd request immediately and responses OK, ignoring the fact that the newly added backhole peer url would make the cluster unavailable.

@clement2026 clement2026 force-pushed the migrate-member-add-tests branch from cf7ca3a to bcb1923 Compare July 28, 2022 14:32
@serathius
Copy link
Member

I would recommend to rebase as there was a change that greatly reduces test flakiness #14283

@clement2026 clement2026 force-pushed the migrate-member-add-tests branch from bcb1923 to 76c57f8 Compare July 28, 2022 16:01
@clement2026
Copy link
Contributor Author

Maybe we need to implement a clus.WaitLeader() function in e2e cluster, and expose it in Cluster interface. What do you think?

type Cluster interface {
	Members() []Member
        // HERE
	WaitLeader() int
	Client() Client
	Close() error
}

type Member interface {
	Client() Client
	Start() error
	Stop()
}

type Client interface {
	Put(key, value string, opts config.PutOptions) error
	Get(key string, opts config.GetOptions) (*clientv3.GetResponse, error)
	Delete(key string, opts config.DeleteOptions) (*clientv3.DeleteResponse, error)
	Compact(rev int64, opts config.CompactOption) (*clientv3.CompactResponse, error)
	Status() ([]*clientv3.StatusResponse, error)
	HashKV(rev int64) ([]*clientv3.HashKVResponse, error)
	Health() error
	Defragment(opts config.DefragOption) error
	AlarmList() (*clientv3.AlarmResponse, error)
	AlarmDisarm(alarmMember *clientv3.AlarmMember) (*clientv3.AlarmResponse, error)

@clement2026 clement2026 force-pushed the migrate-member-add-tests branch 2 times, most recently from 47c487a to d2ed176 Compare August 2, 2022 07:25
@ahrtr
Copy link
Member

ahrtr commented Aug 8, 2022

Do you need to wait for #14304 to be merged and then update this PR?

@clement2026
Copy link
Contributor Author

Do you need to wait for #14304 to be merged and then update this PR?

Yes. This PR is blocked by #14304. Should we label this PR or close it at the moment?

@ahrtr
Copy link
Member

ahrtr commented Aug 9, 2022

Do you need to wait for #14304 to be merged and then update this PR?

Yes. This PR is blocked by #14304. Should we label this PR or close it at the moment?

Once #14304 is merged, you can rebase this PR.

var addResp *clientv3.MemberAddResponse
var err error
if tc.learner {
addResp, err = cc.MemberAddAsLearner("newmember", []string{blackHolePeerUrl})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to use a real peerURL something like http://localhost:xxx instead of http://240.0.0.0:65535?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http://240.0.0.0:65535 is used to avoid "Peer URLs already exists" error. If http://localhost:xxx happens to be the peer-url of a started etcd instance, member add operation would fail due to that error. I considered to use http://localhost:xxx with a unused random port such as 2777 and 15981, but I'm slightly worried about port conflict in the future.

@clement2026
Copy link
Contributor Author

clement2026 commented Aug 18, 2022

The E2E / test (linux-386-e2e) (pull_request) check failed due to an unhealthy cluster error

/home/runner/work/etcd/etcd/bin/etcd-22730: {"level":"warn","ts":"2022-08-14T11:45:32.476Z","caller":"etcdserver/server.go:1351","msg":"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum","local-member-id":"d70245c72180528a","requested-member-add":"{ID:ccf9af529e52be7e RaftAttributes:{PeerURLs:[http://240.0.0.0:65535/] IsLearner:false} Attributes:{Name: ClientURLs:[]}}","error":"etcdserver: unhealthy cluster"}
[14842](https://github.com/etcd-io/etcd/runs/7826140582?check_suite_focus=true#step:5:14843)

MemberAdd function worked fine in integration test, but failed in e2e test. That's because integration cluster has --strict-reconfig-check disabled by default and e2e cluster has `--strict-reconfig-check' enabled by default.

Therefore, in e2e cluster, MemberAdd function has to wait 5 seconds before e2e cluster isConnectedFullySince, or an unhealthy cluster error would occur.

if !isConnectedFullySince(s.r.transport, time.Now().Add(-HealthInterval), s.MemberId(), s.cluster.VotingMembers()) {
lg.Warn(
"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum",
zap.String("local-member-id", s.MemberId().String()),
zap.String("requested-member-add", fmt.Sprintf("%+v", memb)),
zap.Error(errors.ErrUnhealthy),
)
return errors.ErrUnhealthy
}

A PR #14360 was opened to discuss this.

@ahrtr
Copy link
Member

ahrtr commented Sep 5, 2022

Please resolve the conflict.

@clement2026 clement2026 force-pushed the migrate-member-add-tests branch from d05c894 to 8b4b70a Compare September 5, 2022 12:04
}
}

func checkAddResp(t *testing.T, addResp *clientv3.MemberAddResponse, err error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checks error and validates response. Could we maybe split this:

  • rename checkAddResp to validateMemberAdd and remove err argument.
  • for error checking we can use require.NoError(err).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to the suggestions! I'd love to do so , as they make the code clearer.

@clement2026 clement2026 force-pushed the migrate-member-add-tests branch from 8b4b70a to 34d9b03 Compare September 5, 2022 12:36
cc := clus.Client()

testutils.ExecuteUntil(ctx, t, func() {
clus.WaitLeader(t)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need WaitLeader here? I haven't seen it used in other tests as I would expect that NewCluster returns me cluster after leader is established.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It turns out we don't need WaitLeader here. I'll remove them.

The first time I saw etcdserver: unhealthy cluster error after calling MemberAdd, I thought it was because all etcd members did not agree on the same leader. Therefore, I added WaitLeader function to the Cluster interface. But the etcdserver: unhealthy cluster error did not disappear. Finally I found it was caused by --strict-reconfig-check.

Comment on lines 94 to 95
if nc.config.ClusterSize == 1 {
checkAddResp(t, addResp, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect reverse result, adding invalid (localhost:123 will never respond) member should fail in single node cluster as you cannot establish quorum. From 2 members only 1 is alive, but quorum is 2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, adding an invalid member succeeds in the original e2e test. I also tried this in my terminal, it succeeded too. Is this a long-standing bug?🤣

}
}

func TestMemberAdd_SleetHealthInterval(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func TestMemberAdd_SleetHealthInterval(t *testing.T) {
func TestMemberAdd_SleepHealthInterval(t *testing.T) {

}
}

func TestMemberAdd_SleetHealthInterval(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see similar test removed in this PR. Where this test has come from?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a scenario not covered before:

  • In the first 5 seconds(HealthInterval) after the creation of a cluster(with --strict-reconfig-check enabled), any call to MemberAdd and MemberRemove should return an etcdserver: unhealthy cluster error.
  • As the first 5 seconds(HealthInterval) has passed, call to these functions should be successful.

This behaviour is controlled by the isConnectedFullySince function:

if !isConnectedFullySince(s.r.transport, time.Now().Add(-HealthInterval), s.MemberId(), s.cluster.VotingMembers()) {
lg.Warn(
"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum",
zap.String("local-member-id", s.MemberId().String()),
zap.String("requested-member-add", fmt.Sprintf("%+v", memb)),
zap.Error(errors.ErrUnhealthy),
)
return errors.ErrUnhealthy
}

Copy link
Member

@serathius serathius Sep 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, please rename/rewrite test so the reason for the behavior is known. For example TestMemberAdd_BeforeConnectedToAllPeers or TestMemberAdd_WaitForQuorum.

clus.WaitLeader(t)
var addResp *clientv3.MemberAddResponse
var err error
time.Sleep(etcdserver.HealthInterval)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why adding sleep here makes a difference?

@serathius
Copy link
Member

Overall note, instead of having 3 different test could merge them into one table test. For me the scenario in all those tests is almost the same:

  • Create cluster
  • Call MemberAdd
  • Check whether it was successful or not

We could just add expectError, strictReconfigCheck and waitForQuorum as fields on test case struct.

@clement2026
Copy link
Contributor Author

Overall note, instead of having 3 different test could merge them into one table test. For me the scenario in all those tests is almost the same:

  • Create cluster
  • Call MemberAdd
  • Check whether it was successful or not

We could just add expectError, strictReconfigCheck and waitForQuorum as fields on test case struct.

I was worried about making a single test function too large and complex. I'll try and see if I can handle this.

@clement2026 clement2026 force-pushed the migrate-member-add-tests branch from ae04eab to 236f836 Compare September 5, 2022 18:57
@clement2026 clement2026 force-pushed the migrate-member-add-tests branch from 236f836 to fcc076f Compare September 6, 2022 03:07
@clement2026
Copy link
Contributor Author

Overall note, instead of having 3 different test could merge them into one table test. For me the scenario in all those tests is almost the same:

  • Create cluster
  • Call MemberAdd
  • Check whether it was successful or not

We could just add expectError, strictReconfigCheck and waitForQuorum as fields on test case struct.

I followed your suggestion and it works great! I've force pushed. Please take a look.

@serathius serathius merged commit f0c95ca into etcd-io:main Sep 6, 2022
@clement2026 clement2026 deleted the migrate-member-add-tests branch September 6, 2022 10:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants