kill -9 on a non-leader PD instance causes both cdc instances to restart. #7067

Closed
3AceShowHand opened this issue Sep 14, 2022 · 3 comments · Fixed by #7069
Labels
area/ticdc Issues or PRs related to TiCDC. availability service availability component/pd Issues related to pd severity/minor type/bug The issue is confirmed as a bug.

Comments

@3AceShowHand
Contributor

3AceShowHand commented Sep 14, 2022

What did you do?

A cluster with 2 PD instances and 2 TiCDC instances.

Ran kill -9 on the non-leader PD instance.

What did you expect to see?

TiCDC should keep working normally.

What did you see instead?

Both cdc nodes restarted.

[2022/09/14 18:47:45.182 +08:00] [WARN] [client.go:97] ["etcd RPC failed"] [RPC=Txn] [error="rpc error: code = Unavailable desc = error reading from server: read tcp 10.244.3.5:49780->10.244.9.155:2379: read: connection reset by peer"]
[2022/09/14 18:47:45.182 +08:00] [INFO] [client.go:235] ["WatchWithChan exited"] [role=owner]
[2022/09/14 18:47:47.354 +08:00] [WARN] [server.go:302] ["etcd health check error"] [endpoint=http://lingjin-1.jinling.tispace:2379] [error="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused"] [errorVerbose="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).Healthy\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:224\ngithub.com/pingcap/tiflow/cdc/server.(*server).etcdHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:301\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func2\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:324\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

Upstream TiKV version (execute tikv-server --version):

(paste TiKV version here)

TiCDC version (execute cdc version):

master
@3AceShowHand
Contributor Author

We killed the PD instance at 11:37:31. The cdc node lost its etcd session at 11:37:43 and then restarted.

[2022/09/15 11:37:32.647 +08:00] [WARN] [server.go:302] ["etcd health check error"] [endpoint=http://lingjin-1.jinling.tispace:2379] [error="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused"] [errorVerbose="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).Healthy\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:224\ngithub.com/pingcap/tiflow/cdc/server.(*server).etcdHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:301\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func2\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:324\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:35.648 +08:00] [WARN] [server.go:302] ["etcd health check error"] [endpoint=http://lingjin-1.jinling.tispace:2379] [error="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused"] [errorVerbose="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).Healthy\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:224\ngithub.com/pingcap/tiflow/cdc/server.(*server).etcdHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:301\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func2\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:324\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:36.667 +08:00] [ERROR] [pd.go:236] ["updateTS error"] [txnScope=global] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*client).processTSORequests\n\tgithub.com/tikv/pd/[email protected]/client.go:1084\ngithub.com/tikv/pd/client.(*client).handleDispatcher\n\tgithub.com/tikv/pd/[email protected]/client.go:840\nruntime.goexit\n\truntime/asm_amd64.s:1571\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/[email protected]/client.go:1300\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/[email protected]/client.go:1320\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS\n\tgithub.com/tikv/client-go/[email protected]/util/pd_interceptor.go:81\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:143\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\tsync/map.go:347\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nruntime.goexit\n\truntime/asm_amd64.s:1571\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:148\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\tsync/map.go:347\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:36.784 +08:00] [WARN] [clock.go:99] ["get time from pd failed, will use local time as pd time"]
[2022/09/15 11:37:37.694 +08:00] [WARN] [clock.go:99] ["get time from pd failed, will use local time as pd time"]
[2022/09/15 11:37:38.613 +08:00] [WARN] [clock.go:99] ["get time from pd failed, will use local time as pd time"]
[2022/09/15 11:37:38.647 +08:00] [WARN] [server.go:302] ["etcd health check error"] [endpoint=http://lingjin-1.jinling.tispace:2379] [error="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused"] [errorVerbose="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).Healthy\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:224\ngithub.com/pingcap/tiflow/cdc/server.(*server).etcdHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:301\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func2\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:324\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:38.667 +08:00] [ERROR] [pd.go:236] ["updateTS error"] [txnScope=global] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*client).processTSORequests\n\tgithub.com/tikv/pd/[email protected]/client.go:1084\ngithub.com/tikv/pd/client.(*client).handleDispatcher\n\tgithub.com/tikv/pd/[email protected]/client.go:840\nruntime.goexit\n\truntime/asm_amd64.s:1571\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/[email protected]/client.go:1300\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/[email protected]/client.go:1320\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS\n\tgithub.com/tikv/client-go/[email protected]/util/pd_interceptor.go:81\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:143\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\tsync/map.go:347\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nruntime.goexit\n\truntime/asm_amd64.s:1571\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:148\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\tsync/map.go:347\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:39.525 +08:00] [WARN] [clock.go:99] ["get time from pd failed, will use local time as pd time"]
[2022/09/15 11:37:40.438 +08:00] [WARN] [clock.go:99] ["get time from pd failed, will use local time as pd time"]
[2022/09/15 11:37:40.667 +08:00] [ERROR] [pd.go:236] ["updateTS error"] [txnScope=global] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*client).processTSORequests\n\tgithub.com/tikv/pd/[email protected]/client.go:1084\ngithub.com/tikv/pd/client.(*client).handleDispatcher\n\tgithub.com/tikv/pd/[email protected]/client.go:840\nruntime.goexit\n\truntime/asm_amd64.s:1571\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/[email protected]/client.go:1300\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/[email protected]/client.go:1320\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS\n\tgithub.com/tikv/client-go/[email protected]/util/pd_interceptor.go:81\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:143\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\tsync/map.go:347\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nruntime.goexit\n\truntime/asm_amd64.s:1571\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:148\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\tsync/map.go:347\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:41.351 +08:00] [WARN] [clock.go:99] ["get time from pd failed, will use local time as pd time"]
[2022/09/15 11:37:41.647 +08:00] [WARN] [server.go:302] ["etcd health check error"] [endpoint=http://lingjin-1.jinling.tispace:2379] [error="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused"] [errorVerbose="Get \"http://lingjin-1.jinling.tispace:2379/pd/api/v1/health/\": dial tcp 10.244.9.155:2379: connect: connection refused\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.Trace\n\tgithub.com/pingcap/[email protected]/juju_adaptor.go:15\ngithub.com/pingcap/tiflow/pkg/pdutil.(*pdAPIClient).Healthy\n\tgithub.com/pingcap/tiflow/pkg/pdutil/api_client.go:224\ngithub.com/pingcap/tiflow/cdc/server.(*server).etcdHealthChecker\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:301\ngithub.com/pingcap/tiflow/cdc/server.(*server).run.func2\n\tgithub.com/pingcap/tiflow/cdc/server/server.go:324\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:42.265 +08:00] [WARN] [clock.go:99] ["get time from pd failed, will use local time as pd time"]
[2022/09/15 11:37:42.278 +08:00] [ERROR] [kv.go:243] ["fail to load safepoint from pd"] [error="rpc error: code = Unknown desc = context deadline exceeded"] [errorVerbose="rpc error: code = Unknown desc = context deadline exceeded\ngithub.com/tikv/client-go/v2/tikv.(*EtcdSafePointKV).Get\n\tgithub.com/tikv/client-go/[email protected]/tikv/safepoint.go:148\ngithub.com/tikv/client-go/v2/tikv.loadSafePoint\n\tgithub.com/tikv/client-go/[email protected]/tikv/safepoint.go:183\ngithub.com/tikv/client-go/v2/tikv.(*KVStore).runSafePointChecker\n\tgithub.com/tikv/client-go/[email protected]/tikv/kv.go:236\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:42.667 +08:00] [ERROR] [pd.go:236] ["updateTS error"] [txnScope=global] [error="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"] [errorVerbose="rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*client).processTSORequests\n\tgithub.com/tikv/pd/[email protected]/client.go:1084\ngithub.com/tikv/pd/client.(*client).handleDispatcher\n\tgithub.com/tikv/pd/[email protected]/client.go:840\nruntime.goexit\n\truntime/asm_amd64.s:1571\ngithub.com/tikv/pd/client.(*tsoRequest).Wait\n\tgithub.com/tikv/pd/[email protected]/client.go:1300\ngithub.com/tikv/pd/client.(*client).GetTS\n\tgithub.com/tikv/pd/[email protected]/client.go:1320\ngithub.com/tikv/client-go/v2/util.InterceptedPDClient.GetTS\n\tgithub.com/tikv/client-go/[email protected]/util/pd_interceptor.go:81\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:143\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\tsync/map.go:347\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nruntime.goexit\n\truntime/asm_amd64.s:1571\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).getTimestamp\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:148\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS.func1\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:234\nsync.(*Map).Range\n\tsync/map.go:347\ngithub.com/tikv/client-go/v2/oracle/oracles.(*pdOracle).updateTS\n\tgithub.com/tikv/client-go/[email protected]/oracle/oracles/pd.go:232\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:43.179 +08:00] [WARN] [clock.go:99] ["get time from pd failed, will use local time as pd time"]
[2022/09/15 11:37:43.180 +08:00] [INFO] [client.go:235] ["WatchWithChan exited"] [role=processor]
[2022/09/15 11:37:43.180 +08:00] [WARN] [capture.go:499] ["session is disconnected"] [error="[CDC:ErrEtcdSessionDone]the etcd session is done"] [errorVerbose="[CDC:ErrEtcdSessionDone]the etcd session is done\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/orchestrator.(*EtcdWorker).Run\n\tgithub.com/pingcap/tiflow/pkg/orchestrator/etcd_worker.go:186\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:490\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]
[2022/09/15 11:37:43.180 +08:00] [INFO] [capture.go:334] ["processor routine exited"] [captureID=558666f3-922c-4d2d-8b92-f40c40e8009a] [error="[CDC:ErrCaptureSuicide]capture suicide"] [errorVerbose="[CDC:ErrCaptureSuicide]capture suicide\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/[email protected]/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/[email protected]/normalize.go:164\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).runEtcdWorker\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:500\ngithub.com/pingcap/tiflow/cdc/capture.(*captureImpl).run.func4\n\tgithub.com/pingcap/tiflow/cdc/capture/capture.go:333\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/[email protected]/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1571"]

@3AceShowHand
Contributor Author

With only 2 PD instances in the cluster, one PD member going offline leaves the leader without support from a majority. The PD leader therefore resigned, and the re-election took more than 20 seconds.
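The majority arithmetic behind this can be sketched as follows. This is a minimal illustration, not code from PD or TiCDC; the helper names quorum and faultTolerance are hypothetical.

```go
package main

import "fmt"

// quorum returns how many members must agree in a majority-based
// cluster of n members. (Hypothetical helper, for illustration only.)
func quorum(n int) int {
	return n/2 + 1
}

// faultTolerance returns how many members can be lost while the
// remaining members still form a majority.
func faultTolerance(n int) int {
	return n - quorum(n)
}

func main() {
	for _, n := range []int{2, 3} {
		fmt.Printf("members=%d quorum=%d tolerated failures=%d\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

For 2 members the quorum is 2, so losing either member leaves the cluster without a majority.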

During this period the PD cluster, and therefore the etcd cluster, was unavailable, which caused the TiCDC etcd session to disconnect.

@3AceShowHand
Contributor Author

With 3 PD instances, running kill -9 on one non-leader PD instance caused the cdc leader to switch.


ti-chi-bot pushed a commit that referenced this issue Sep 16, 2022