Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate cluster attributes to use v3 backend #11427

Merged
merged 5 commits into from
Dec 9, 2019

Conversation

YoyinZyc
Copy link
Contributor

@YoyinZyc YoyinZyc commented Dec 6, 2019

Related #11380
Things did in this pr:

  1. add two raft internal types ClusterVersionSet ClusterMemberAttrSet
  2. use v3 server internal raft type to update cluster version.
  3. add publishV3 to publish member attributes through v3 backend. Keeping using v2 publish to avoid incompatible during upgrade.
  4. recover clusterVersion from backend instead of v2Store

cc @jingyih

jingyih and others added 4 commits December 5, 2019 16:25
Added ClusterVersionSetRequest for setting cluster version via v3 apply.

Added ClusterMemberAttrSetRequest for setting clsuter member attributes
via v3 apply.
@codecov-io
Copy link

codecov-io commented Dec 6, 2019

Codecov Report

Merging #11427 into master will decrease coverage by 0.02%.
The diff coverage is 26.25%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #11427      +/-   ##
==========================================
- Coverage   64.63%    64.6%   -0.03%     
==========================================
  Files         403      403              
  Lines       37998    38070      +72     
==========================================
+ Hits        24560    24596      +36     
- Misses      11802    11843      +41     
+ Partials     1636     1631       -5
Impacted Files Coverage Δ
etcdserver/apply_v2.go 86.15% <ø> (-3.4%) ⬇️
etcdserver/apply.go 77.63% <41.17%> (-0.99%) ⬇️
etcdserver/api/membership/cluster.go 72.77% <57.89%> (+0.35%) ⬆️
etcdserver/server.go 66.68% <6.81%> (-1.86%) ⬇️
pkg/transport/timeout_conn.go 80% <0%> (-20%) ⬇️
proxy/grpcproxy/register.go 69.44% <0%> (-11.12%) ⬇️
etcdserver/api/v3rpc/lease.go 69.31% <0%> (-5.69%) ⬇️
pkg/logutil/zap_grpc.go 47.61% <0%> (-4.77%) ⬇️
proxy/grpcproxy/watcher.go 89.79% <0%> (-4.09%) ⬇️
proxy/grpcproxy/watch.go 89.94% <0%> (-2.96%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1f8764b...b657b78. Read the comment docs.

if resp.Kvs[0].CreateRevision != 7 {
t.Fatalf("resp.Kvs[0].CreateRevision expected 7, got %d", resp.Kvs[0].CreateRevision)

if resp.Kvs[0].CreateRevision <= rev {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is CreateRevision == rev + 1 ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

@@ -85,11 +83,22 @@ func TestCtlV3Migrate(t *testing.T) {
if err != nil {
t.Fatal(err)
}
rev := resp.Header.Revision
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to revAfterMigrate for better readability?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

// information is the JSON representation of this server's member struct, updated
// with the static clientURLs of the server.
// The function keeps attempting to register until it succeeds,
// or its server is stopped.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a TODO to replace publish() in etcd 3.6?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@jingyih
Copy link
Contributor

jingyih commented Dec 6, 2019

To clarify, the intention is to add new internal raft type ClusterVersionSet and ClusterMemberAttrSet in v3.5 (so that v3.5 servers understand these types), but they will only be exercised in next minor version release v3.6.

@jingyih
Copy link
Contributor

jingyih commented Dec 6, 2019

@gyuho Could you help take a look?

The main motivation of this PR is that currently cluster version is recovered from v2 store during server restart. Because v2 store is in memory, cluster version information after server restart could be very stale. During cluster downgrade, when server starts and checks the compatibility of its own server version against cluster version, the check could fail if the cluster version info is very stale. This could be very confusing to user. As an example: #11362 (comment)

So the plan is to use v3 request and v3 backend for all cluster information updates (except members).

@@ -893,7 +893,7 @@ func TestKVLargeRequests(t *testing.T) {
expectError error
}{
{
maxRequestBytesServer: 1,
maxRequestBytesServer: 256,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we change this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically, if we keep the previous test data, the test cluster will fail to start. It is because the cluster will make several v3 raft requests like cluster version set during start. Old code will use v2 for starting server which will bypass the check.

if len(data) > int(s.Cfg.MaxRequestBytes) {
return nil, ErrRequestTooLarge
}

Changing a larger maxRequestBytesServer will not affect the test behavior as long as maxRequestBytesServer is smaller then valueSize.

for {
select {
case <-s.stopping:
if lg := s.getLogger(); lg != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we update getLogger to return zap.NewNop(), so we don't need all these manual nil checks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Will fix other getLogger in a separate pr.

return

default:
if lg := s.getLogger(); lg != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. Can we store variable once at top (before for loop), and use the same logger?

@@ -58,6 +59,9 @@ message InternalRaftRequest {
AuthRoleGetRequest auth_role_get = 1202;
AuthRoleGrantPermissionRequest auth_role_grant_permission = 1203;
AuthRoleRevokePermissionRequest auth_role_revoke_permission = 1204;

membershippb.ClusterVersionSetRequest cluster_version_set = 1300;
membershippb.ClusterMemberAttrSetRequest cluster_member_attr_set = 1301;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are unknown fields for older etcd? What happens to the older etcd if it receives this request?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now I see. These will be ignored in old etcd, and it instead uses v2 API.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these two fields made v3.5 to v3.4 downgrade fail.

Can we make it an experimental API in v3.4 and add these two fields to make downgrade from v3.5 to v3.4 possible?

panic: not implemented

goroutine 214 [running]:
go.etcd.io/etcd/etcdserver.(*applierV3backend).Apply(0xc000238468, 0xc0002b2460, 0x0)
        /home/chaochn/workplace/EKS-etcd/src/EKS-etcd/etcdserver/apply.go:173 +0xf85
go.etcd.io/etcd/etcdserver.(*authApplierV3).Apply(0xc000242000, 0xc0002b2460, 0x0)
        /home/chaochn/workplace/EKS-etcd/src/EKS-etcd/etcdserver/apply_auth.go:60 +0xd2
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntryNormal(0xc000290600, 0xc0002d14d8)
        /home/chaochn/workplace/EKS-etcd/src/EKS-etcd/etcdserver/server.go:2230 +0x1fb
go.etcd.io/etcd/etcdserver.(*EtcdServer).apply(0xc000290600, 0xc000302a80, 0xa, 0xc, 0xc0001c8000, 0x0, 0xc000046000, 0xc00033f640)
        /home/chaochn/workplace/EKS-etcd/src/EKS-etcd/etcdserver/server.go:2144 +0x579
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyEntries(0xc000290600, 0xc0001c8000, 0xc00002c000)
        /home/chaochn/workplace/EKS-etcd/src/EKS-etcd/etcdserver/server.go:1396 +0xe5
go.etcd.io/etcd/etcdserver.(*EtcdServer).applyAll(0xc000290600, 0xc0001c8000, 0xc00002c000)
        /home/chaochn/workplace/EKS-etcd/src/EKS-etcd/etcdserver/server.go:1120 +0x88
go.etcd.io/etcd/etcdserver.(*EtcdServer).run.func8(0x11b6e30, 0xc0005da040)
        /home/chaochn/workplace/EKS-etcd/src/EKS-etcd/etcdserver/server.go:1065 +0x3c
go.etcd.io/etcd/pkg/schedule.(*fifo).run(0xc0003c0180)
        /home/chaochn/workplace/EKS-etcd/src/EKS-etcd/pkg/schedule/schedule.go:157 +0xf3
created by go.etcd.io/etcd/pkg/schedule.NewFIFOScheduler
        /home/chaochn/workplace/EKS-etcd/src/EKS-etcd/pkg/schedule/schedule.go:70 +0x13b

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know which one failed:

  • ClusterVersionSetRequest or ClusterMemberAttrSetRequest

ClusterMemberAttrSetRequest is only called by publishV3, that is not yet called in v3.5. so I assume not this one.

ClusterVersionSetRequest seems to be called from:

func (s *EtcdServer) monitorVersions() {

@hexfusion @wenjiaswe @ychen11 shell we postpone enabling monitorVersions() till etcd-3.6 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I confirmed in v3.4 we haven't yet implemented ClusterVersionSet API just now. And the raftRequest MustUnmarshal just slipped this through, which is surprising to me.

case r.ClusterVersionSet != nil:
    a.s.applyV3Internal.ClusterVersionSet(r.ClusterVersionSet)
default:
    panic("not implemented")

so do we want to make a equivalent monitorVersionsV3() in etcd-3.5? @ptabor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shell we postpone enabling monitorVersions() till etcd-3.6 ?

Yes, this breaks upgrade case, which is not acceptable for 3.5 + 3.5 + 3.4.

Thanks for the catch @chaochn47.

@chaochn47 Can you help making this optional?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FTR: #12987

@@ -290,6 +290,7 @@ func (cfg *etcdProcessClusterConfig) etcdServerProcessConfigs() []*etcdServerPro
for i := range etcdCfgs {
etcdCfgs[i].initialCluster = strings.Join(initialCluster, ",")
etcdCfgs[i].args = append(etcdCfgs[i].args, initialClusterArgs...)
etcdCfgs[i].args = append(etcdCfgs[i].args, "--logger=zap")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jingyih Should we make zap default in a separate PR, so we don't make this unrelated change?

@@ -289,6 +289,7 @@ func testCtlV2Backup(t *testing.T, snapCount int, v3 bool) {

if v3 {
// v3 must lock the db to backup, so stop process
time.Sleep(time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need comment on why

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed in current pr since we keep using v2 publish(). It seems like the cluster needs more time publish using publishV3 which makes TestCtlV2BackupV3 a flaky test. Adding a sleep is to address this.

@@ -148,6 +150,7 @@ func TestEtcdPeerCNAuth(t *testing.T) {
"--listen-peer-urls", fmt.Sprintf("https://127.0.0.1:%d,https://127.0.0.1:%d", etcdProcessBasePort+i, etcdProcessBasePort+len(peers)+i),
"--initial-advertise-peer-urls", fmt.Sprintf("https://127.0.0.1:%d", etcdProcessBasePort+i),
"--initial-cluster", ic,
"--logger=zap",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, let's make zap default in a separate PR

if lg != nil {
lg.Panic(
"unexpected number of keys when getting cluster version from backend",
zap.Int("number fo keys", len(keys)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we make zap field as json tag as in Go? since it's structured logging, it should be logged as something like "number-of-keys" or "key"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@@ -1898,7 +1898,7 @@ func TestV3LargeRequests(t *testing.T) {
expectError error
}{
// don't set to 0. use 0 as the default.
{1, 1024, rpctypes.ErrGRPCRequestTooLarge},
{256, 1024, rpctypes.ErrGRPCRequestTooLarge},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we explain why we need this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

explained above.
#11427 (comment)

@gyuho
Copy link
Contributor

gyuho commented Dec 8, 2019

So the plan is to use v3 request and v3 backend for all cluster information updates

Sounds good. Overall approach looks good.

if c.be != nil {
c.version = clusterVersionFromBackend(c.lg, c.be)
} else {
c.version = clusterVersionFromStore(c.lg, c.v2store)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need fallback on v2 version data if v3 version is nil (upgraded etcd process)?

c.version = clusterVersionFromStore(c.lg, c.v2store)
if c.be != nil {
        v3Ver := clusterVersionFromBackend(c.lg, c.be)
	if v3Ver != nil { c.version = v3Ver }
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about it. When will v3 version data is nil while v2 is not? If v3 backend is available, the v3 version will not be nil except starting a new server? Setting version will store new version info into both v2 store and v3 backend.

if c.v2store != nil {
mustSaveClusterVersionToStore(c.v2store, ver)
}
if c.be != nil {
mustSaveClusterVersionToBackend(c.be, ver)
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that answers my question. Thanks.

@YoyinZyc YoyinZyc force-pushed the migration-cluster-attr branch from 5b86d3d to b657b78 Compare December 9, 2019 19:09
ClientUrls: s.attributes.ClientURLs,
},
}
lg := zap.NewNop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant, we still use the zap logger, but a separate function getLogger should handle the previous if-statements. Can we use the logger from EtcdServer in this PR, and iterate more to clean up getLogger method in the separate PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got you. Will update getLogger() in another PR.

@YoyinZyc YoyinZyc force-pushed the migration-cluster-attr branch from b657b78 to 7784ca8 Compare December 9, 2019 21:09
Copy link
Contributor

@gyuho gyuho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, great job!

Please update CHANGELOG accordingly.

@gyuho gyuho merged commit 9f81002 into etcd-io:master Dec 9, 2019
@jingyih
Copy link
Contributor

jingyih commented Dec 9, 2019

Thanks for taking the time to review! @gyuho
Good work! @YoyinZyc

YoyinZyc added a commit to YoyinZyc/etcd that referenced this pull request Dec 9, 2019
YoyinZyc added a commit to YoyinZyc/etcd that referenced this pull request Dec 11, 2019
YoyinZyc added a commit to YoyinZyc/etcd that referenced this pull request Dec 11, 2019
jingyih added a commit that referenced this pull request Dec 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

6 participants