redo log changefeed stuck after network chaos injection #6859

Closed
fubinzh opened this issue Aug 23, 2022 · 3 comments · Fixed by #6882
Labels
affects-5.1, affects-5.2, affects-5.3, affects-5.4, affects-6.0, affects-6.1, affects-6.2, area/ticdc, severity/major, type/bug

Comments


fubinzh commented Aug 23, 2022

What did you do?

  1. Start a redo log changefeed that stores its redo log on a testing MinIO server.
  2. Run a sysbench workload for 5 minutes.
  3. Run "cdc redo apply" against the downstream. (I didn't stop the changefeed before running redo apply.)
  4. Inject network chaos between tikv <-> minio and between cdc <-> minio for 10 minutes, then recover from the chaos injection.
  5. About 2 hours later, use "cdc cli changefeed list" to check the changefeed status.

What did you expect to see?

Changefeed status should be normal

What did you see instead?

The "cdc cli changefeed list" command hangs.

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

Release Version: v6.2.0-alpha
Edition: Community
Git Commit Hash: 8b5b724d8a932239303a1d0ba547323eb0e5161b
Git Branch: heads/refs/tags/v6.2.0-alpha
UTC Build Time: 2022-08-19 11:28:57
GoVersion: go1.18.5
Race Enabled: false
TiKV Min Version: 6.2.0-alpha
Check Table Before Drop: false

Upstream TiKV version (execute tikv-server --version):

/ # /tikv-server -V
TiKV
Release Version:   6.2.0-alpha
Edition:           Community
Git Commit Hash:   58fa80e0de0d43d473dc456081a4d2b08939e0aa
Git Commit Branch: heads/refs/tags/v6.2.0-alpha
UTC Build Time:    2022-08-19 11:03:24
Rust Version:      rustc 1.64.0-nightly (0f4bcadb4 2022-07-30)
Enable Features:   pprof-fp jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure
Profile:           dist_release

TiCDC version (execute cdc version):

bash-5.1# /cdc version
Release Version: v6.2.0-alpha
Git Commit Hash: 750f0e4ddf97bb4d9feba66799de93113af04405
Git Branch: heads/refs/tags/v6.2.0-alpha
UTC Build Time: 2022-08-22 11:03:29
Go Version: go version go1.18.5 linux/amd64
Failpoint Build: false
fubinzh added the area/ticdc and type/bug labels on Aug 23, 2022

Tammyxia commented Aug 24, 2022

Reproduced this issue on the cdc hotfix build v6.1.0-20220817 while testing NFS umount/mount chaos on the redo storage for 10 hours. I'm not sure whether the root cause is exactly the same, but it seems both cases can be debugged at the same time.
Release Version: v6.1.0-20220817
Git Commit Hash: 7662896
Git Branch: heads/refs/tags/v6.1.0-20220817
UTC Build Time: 2022-08-17 06:01:06
Go Version: go version go1.18.5 linux/amd64
Failpoint Build: false

Author

fubinzh commented Aug 25, 2022

/severity Major

Contributor

zhaoxinyu commented Aug 25, 2022

This issue is caused by the changefeed being initialized multiple times when network chaos is injected.
The direct cause is that redo.NewManager fails and c.initialized remains false, even though ddl_puller and its goroutines have already been created.

mgr, err := redo.NewManager(stdCtx, c.state.Info.Config.Consistent, redoManagerOpts)
c.redoManager = mgr
if err != nil {
	return err
}

On the next tick, ddl_puller and its goroutines are created again. Because c.cancel is reassigned to the newest cancel function, it can cancel only the most recently created context; the contexts created by earlier attempts can never be cancelled.

cancelCtx, cancel := cdcContext.WithCancel(ctx)
c.cancel = cancel

As a result, c.releaseResource() blocks forever at the following line, which in turn blocks the owner forever.

c.wg.Wait()
