Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: unify --strict-reconfig-check config of common framework #14360

Conversation

clement2026
Copy link
Contributor

Purpose

This PR is opened to discuss the different policies of `--strict-reconfig-check' between e2e test and integration test.

What went wrong?

As described in #14281 (comment), the MemberAdd function works fine in integration test but encounters an unhealthy cluster error in e2e test.

Why?

Integration cluster has `--strict-reconfig-check' disabled by default:

m.StrictReconfigCheck = mcfg.StrictReconfigCheck

in the meantime, e2e cluster has `--strict-reconfig-check' enabled by default:

if cfg.NoStrictReconfig {
args = append(args, "--strict-reconfig-check=false")
}

therefore, in e2e cluster, functions like MemberAdd, MemberRemove and TransferLeadership would encounter an unhealthy cluster error before members isConnectedFullySince, which will take 5 seconds:

if !isConnectedFullySince(s.r.transport, time.Now().Add(-HealthInterval), s.MemberId(), s.cluster.VotingMembers()) {
lg.Warn(
"rejecting member add request; local member has not been connected to all peers, reconfigure breaks active quorum",
zap.String("local-member-id", s.MemberId().String()),
zap.String("requested-member-add", fmt.Sprintf("%+v", memb)),
zap.Error(errors.ErrUnhealthy),
)
return errors.ErrUnhealthy
}

Solution

I've come up with some solutions. But I'm not sure if they have more hidden risks.

  1. MemberAdd function could wait 5 seconds before all members isConnectedFullySince. Apparently it'll make our code difficult to maintain.
  2. Expose --strict-reconfig-check to common framework, in order to give test cases control of whether to enable/disable --strict-reconfig-check in the process of cluster creation.
  3. Enable or disable --strict-reconfig-check both in e2e cluster and integration cluster.

I tend to go with the 3rd solution. Though the 2nd solution may be required to do some specific tests in the future, we can skip it now.

What have I done?

I made some changes to disable --strict-reconfig-check in e2e cluster of common framework. At the same time, I kept it enabled as it was in the old framework, to avoid breaking existing test cases.

@clement2026 clement2026 force-pushed the common-test-of-e2e-disable-strict-reconfig-check-by-default branch from af77669 to 1c84082 Compare August 18, 2022 10:57
@codecov-commenter
Copy link

Codecov Report

Merging #14360 (1c84082) into main (27ffd7e) will decrease coverage by 0.07%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main   #14360      +/-   ##
==========================================
- Coverage   75.26%   75.19%   -0.08%     
==========================================
  Files         457      457              
  Lines       37140    37140              
==========================================
- Hits        27953    27926      -27     
- Misses       7428     7444      +16     
- Partials     1759     1770      +11     
Flag Coverage Δ
all 75.19% <ø> (-0.08%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/auth/simple_token.go 80.00% <0.00%> (-8.47%) ⬇️
client/pkg/v3/fileutil/lock_linux.go 72.22% <0.00%> (-8.34%) ⬇️
client/v3/experimental/recipes/queue.go 58.62% <0.00%> (-6.90%) ⬇️
client/v3/concurrency/mutex.go 61.64% <0.00%> (-5.48%) ⬇️
server/storage/wal/file_pipeline.go 90.69% <0.00%> (-4.66%) ⬇️
server/etcdserver/api/v3rpc/member.go 93.54% <0.00%> (-3.23%) ⬇️
client/v3/leasing/txn.go 88.09% <0.00%> (-3.18%) ⬇️
client/v3/maintenance.go 59.74% <0.00%> (-1.26%) ⬇️
client/v3/client.go 80.12% <0.00%> (-0.95%) ⬇️
client/v3/retry_interceptor.go 64.54% <0.00%> (-0.91%) ⬇️
... and 12 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@serathius serathius self-requested a review August 18, 2022 16:50
@ahrtr
Copy link
Member

ahrtr commented Aug 18, 2022

Good catch and thanks for raising this discussion.

Comments:

  1. We should expose a configuration item for the --strict-reconfig-check in the common config;
  2. Since the item is enabled by default in production environment, so we should enable it by default as well in our test; and we can add a couple of common case to intentionally disable it.
  3. If above change breaks some existing cases, then we need to investigate whether it's expected. If yes, then we can disable the item for the specific cases, otherwise fix the issue.

@clement2026
Copy link
Contributor Author

Thanks for your analysis!
I agree with you on enabling it by default. I'm gonna do it and see what happens.

Copy link
Member

@serathius serathius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, leaving for @ahrtr to approve and merge

@ahrtr
Copy link
Member

ahrtr commented Aug 22, 2022

Thanks for your analysis! I agree with you on enabling it by default. I'm gonna do it and see what happens.

@clarkfw Are you going to do this in this PR or in a separate PR?

@clement2026
Copy link
Contributor Author

I'm going to do it in this PR, still working on it.

@clement2026 clement2026 force-pushed the common-test-of-e2e-disable-strict-reconfig-check-by-default branch 3 times, most recently from f71afd0 to d44bb5d Compare August 25, 2022 07:07
@clement2026
Copy link
Contributor Author

clement2026 commented Aug 25, 2022

A new commit was force pushed. It enables --strict-reconfig-check in both e2e and integration cluster. It also exposes --strict-reconfig-check to ClusterConfig.

The checks are expected to fail. The integration testing on my own computer shows that 15 test cases failed due to this commit. Most of them use MemberAdd or MemberRemove function.

I'm gonna push another commit to fix or discuss these failed test cases. Before that, I would like to know if changes in this commit are appropriate @ahrtr .

image

Comment on lines +194 to 191
func withDisableStrictReconfig() ctlOption {
return func(cx *ctlCtx) { cx.disableStrictReconfigCheck = true }
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func withDisableStrictReconfig() ctlOption {
return func(cx *ctlCtx) { cx.disableStrictReconfigCheck = true }
}
func withStrictReconfig() ctlOption {
return func(cx *ctlCtx) { cx.disableStrictReconfigCheck = true }
}

@ahrtr
Copy link
Member

ahrtr commented Aug 25, 2022

@clarkfw yes, overall the change looks good to me. Please continue to finish the task. Thanks.

I see that the e2e configuration is a bit mess. There are three ways to set the configuration.

  1. Users(test cases) can set a configuration object/struct directly;
  2. Users(test cases) can set some fields via the ctlCtx. For example, set the initialCorruptCheck in ctlCtx, then pass it to cfg.InitialCorruptCheck;
  3. Users(test cases) can set some fields to EtcdProcessClusterConfig directly, such as withMaxConcurrentStreams or CorruptCheckTime

I would suggest to remove all EtcdProcessClusterConfig related fields from ctlCtx, and keep using # 1 and # 3 above. ctlCtx already has a field cfg e2e.EtcdProcessClusterConfig, so it doesn't make sense to define duplicated items (e.g. initialCorruptCheck, quotaBackendBytes) inside ctlCtx. WDYT? @serathius @spzala @clarkfw

FYI. I just submitted a PR to cleanup this. #14389

@serathius
Copy link
Member

+1 to config cleanup (not needed to be addressed in this PR). For me whole usage ctlCtx is an anti-pattern of how tests should be written, so it would be even better if we just get rid of it :P

@clement2026 clement2026 force-pushed the common-test-of-e2e-disable-strict-reconfig-check-by-default branch 7 times, most recently from 1eced78 to e9a26a9 Compare August 31, 2022 14:21
@clement2026 clement2026 force-pushed the common-test-of-e2e-disable-strict-reconfig-check-by-default branch from e9a26a9 to 93a17b9 Compare August 31, 2022 14:21
@clement2026
Copy link
Contributor Author

With --strict-reconfig-check enabled by default, some test cases fail because they call MemberAdd or MemberRemove function.
In this commit(93a17b9), I disabled --strict-reconfig-check for the failed test cases.

@clement2026
Copy link
Contributor Author

There are still some check failures. I'm working on it.

@clement2026 clement2026 force-pushed the common-test-of-e2e-disable-strict-reconfig-check-by-default branch from 93a17b9 to 7f7ccd8 Compare August 31, 2022 16:40
@clement2026 clement2026 force-pushed the common-test-of-e2e-disable-strict-reconfig-check-by-default branch from 7f7ccd8 to 3f3149b Compare September 3, 2022 07:57
@clement2026
Copy link
Contributor Author

Thank God, all checks have passed.
Please take a look when you guys are available🤓

PeerTLS TLSConfig
ClientTLS TLSConfig
QuotaBackendBytes int64
DisableStrictReconfigCheck bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name DisableStrictReconfigCheck seems to be anti-pattern because users (test cases) need to use double negative to enable it, not disabled == enabled. It's also a tradeoff, please see #14389 (comment) .

You just follow the same naming as the existing e2e configuration NoStrictReconfig, so it isn't your fault. So we can keep it as it's for now, and feel free to change it to StrictReconfigCheck so as to be consistent with the definition in the production code in a separate PR.

@ahrtr
Copy link
Member

ahrtr commented Sep 3, 2022

Overall looks good to me. I see that you disabled the StrictReconfigCheck for some cases, did you intentionally disable them to increase the test coverage or fix the test failures?

@clement2026
Copy link
Contributor Author

clement2026 commented Sep 4, 2022

Overall looks good to me. I see that you disabled the StrictReconfigCheck for some cases, did you intentionally disable them to increase the test coverage or fix the test failures?

Just to fix the test failures. After enabling StrictReconfigCheck these tests failed due to calling MemberAdd or MemberRemove function. I disabled StrictReconfigCheck for theses tests to make them run as they did before.

I'm also working on another thread(#14281) that migrates MemberAdd and MemberRemove functions from e2e tests to common framework. Both enabling and disabling StrictReconfigCheck will be covered in that thread.

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you @clarkfw

@ahrtr ahrtr merged commit 4d57eb8 into etcd-io:main Sep 4, 2022
@clement2026 clement2026 deleted the common-test-of-e2e-disable-strict-reconfig-check-by-default branch September 5, 2022 06:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants