Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TiCDC owner gets stuck when creating new kafka producer #3352

Closed
Tracked by #2971
amyangfei opened this issue Nov 9, 2021 · 2 comments · Fixed by #3598
Closed
Tracked by #2971

TiCDC owner gets stuck when creating new kafka producer #3352

amyangfei opened this issue Nov 9, 2021 · 2 comments · Fixed by #3598
Assignees
Labels

Comments

@amyangfei
Copy link
Contributor

amyangfei commented Nov 9, 2021

What did you do?

  1. Setup a ticdc cluster, replicating data to Kafka
  2. Sending pause changefeed or remove changefeed doesn't work

What did you expect to see?

No response

What did you see instead?

owner gets stuck

image

ticdc goroutins:

goroutines_222.log
goroutines_222_second.log

ticdc log:

cdc.log

metrics:
metrics.tar.gz

Two owner newChangeFeed, looks like some dead lock.

goroutine 1300 [semacquire]:
sync.runtime_SemacquireMutex(0xc00957a3d4, 0x0, 0x1)
	runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc00957a3d0)
	sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	sync/mutex.go:81
github.com/Shopify/sarama.(*Broker).send(0xc00957a380, 0x32d75c0, 0xc005f18bd0, 0xc301, 0x0, 0x0, 0x0)
	github.com/Shopify/[email protected]/broker.go:716 +0x6e2
github.com/Shopify/sarama.(*Broker).sendAndReceive(0xc00957a380, 0x32d75c0, 0xc005f18bd0, 0x32d7620, 0xc0058157c0, 0x2942400, 0x1)
	github.com/Shopify/[email protected]/broker.go:765 +0x84
github.com/Shopify/sarama.(*Broker).GetMetadata(0xc00957a380, 0xc005f18bd0, 0x0, 0x0, 0x1)
	github.com/Shopify/[email protected]/broker.go:283 +0x70
github.com/Shopify/sarama.(*client).tryRefreshMetadata(0xc0093b42d0, 0x0, 0x0, 0x0, 0x64, 0x0, 0x0, 0x0, 0x22cd430, 0x0)
	github.com/Shopify/[email protected]/client.go:880 +0x4d1
github.com/Shopify/sarama.(*client).tryRefreshMetadata.func2(0x3249ba0, 0xc000c5c3b0, 0x1, 0x1)
	github.com/Shopify/[email protected]/client.go:859 +0x1d5
github.com/Shopify/sarama.(*client).tryRefreshMetadata(0xc0093b42d0, 0x0, 0x0, 0x0, 0x65, 0x0, 0x0, 0x0, 0x22cd430, 0x0)
	github.com/Shopify/[email protected]/client.go:927 +0xc51
....
github.com/Shopify/sarama.(*client).tryRefreshMetadata(0xc0093b42d0, 0x0, 0x0, 0x0, 0x77, 0x0, 0x0, 0x0, 0x22cd430, 0x0)
	github.com/Shopify/[email protected]/client.go:927 +0xc51
github.com/Shopify/sarama.(*client).tryRefreshMetadata.func2(0x3249ba0, 0xc000c5c3b0, 0x1, 0x1)
	github.com/Shopify/[email protected]/client.go:859 +0x1d5
github.com/Shopify/sarama.(*client).tryRefreshMetadata(0xc0093b42d0, 0x0, 0x0, 0x0, 0x78, 0x0, 0x0, 0x0, 0x0, 0x0)
	github.com/Shopify/[email protected]/client.go:927 +0xc51
github.com/Shopify/sarama.(*client).RefreshMetadata(0xc0093b42d0, 0x0, 0x0, 0x0, 0x7f2b94f9d62c, 0x12)
	github.com/Shopify/[email protected]/client.go:473 +0xbe
github.com/Shopify/sarama.NewClient(0xc0022aa750, 0x3, 0x3, 0xc0092d7c00, 0x2ad5040, 0x2, 0x3, 0xc0022aa750)
	github.com/Shopify/[email protected]/client.go:167 +0x331
github.com/Shopify/sarama.NewAsyncProducer(0xc0022aa750, 0x3, 0x3, 0xc0092d7c00, 0x0, 0x2, 0xc0022aa750, 0x3)
	github.com/Shopify/[email protected]/async_producer.go:126 +0x4d
github.com/pingcap/ticdc/cdc/sink/producer/kafka.NewKafkaSaramaProducer(0x32b7e80, 0xc004b6e7c0, 0xc000242b08, 0x38, 0xc000242b41, 0x15, 0x100000000, 0xc000242b65, 0x5, 0x4000000, ...)
	github.com/pingcap/ticdc@/cdc/sink/producer/kafka/kafka.go:328 +0x35a
github.com/pingcap/ticdc/cdc/sink.newKafkaSaramaSink(0x32b7e80, 0xc004b6e7c0, 0xc004876480, 0xc005165720, 0xc005d31b30, 0xc005d31b00, 0xc0093b0540, 0xc000c6a388, 0x0, 0xc000c6a340)
	github.com/pingcap/ticdc@/cdc/sink/mq.go:475 +0x44b
github.com/pingcap/ticdc/cdc/sink.init.1.func3(0x32b7e80, 0xc004b6e7c0, 0xc005ff39fa, 0xc, 0xc004876480, 0xc005165720, 0xc005d31b30, 0xc005d31b00, 0xc0093b0540, 0x27fe8e0, ...)
	github.com/pingcap/ticdc@/cdc/sink/sink.go:81 +0x7a
github.com/pingcap/ticdc/cdc/sink.NewSink(0x32b7e80, 0xc004b6e7c0, 0xc005ff39fa, 0xc, 0xc000242b00, 0xae, 0xc005165720, 0xc005d31b30, 0xc005d31b00, 0xc0093b0540, ...)
	github.com/pingcap/ticdc@/cdc/sink/sink.go:113 +0x1ef
github.com/pingcap/ticdc/cdc.(*Owner).newChangeFeed(0xc0039f2140, 0x32b7e80, 0xc004b6e7c0, 0xc005ff39fa, 0xc, 0xc005585bc0, 0xc005d31ad0, 0xc00463a300, 0x5f4063fe7d00000, 0x0, ...)
	github.com/pingcap/ticdc@/cdc/owner.go:414 +0x21bb
github.com/pingcap/ticdc/cdc.(*Owner).loadChangeFeeds(0xc0039f2140, 0x32b7e80, 0xc0054b6800, 0x0, 0x0)
	github.com/pingcap/ticdc@/cdc/owner.go:615 +0xe1c
github.com/pingcap/ticdc/cdc.(*Owner).run(0xc0039f2140, 0x32b7e80, 0xc0054b6800, 0x0, 0x0)
	github.com/pingcap/ticdc@/cdc/owner.go:1309 +0xe8
github.com/pingcap/ticdc/cdc.(*Owner).Run(0xc0039f2140, 0x32b7e80, 0xc0056960c0, 0x1dcd6500, 0x0, 0x0)
	github.com/pingcap/ticdc@/cdc/owner.go:1204 +0x3af
github.com/pingcap/ticdc/cdc.(*Server).campaignOwnerLoop(0xc000c8e660, 0x32b7e80, 0xc0056960c0, 0xc004936e40, 0x16b5000000000008)
	github.com/pingcap/ticdc@/cdc/server.go:172 +0x540
github.com/pingcap/ticdc/cdc.(*Server).run.func1(0xc000500f68, 0xc004d08240)
	github.com/pingcap/ticdc@/cdc/server.go:255 +0x3c
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc005435d40, 0xc002f0c500)
	golang.org/x/[email protected]/errgroup/errgroup.go:57 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	golang.org/x/[email protected]/errgroup/errgroup.go:54 +0x66
goroutine 227 [semacquire]:
sync.runtime_SemacquireMutex(0xc003de2054, 0x0, 0x1)
	runtime/sema.go:71 +0x47
sync.(*Mutex).lockSlow(0xc003de2050)
	sync/mutex.go:138 +0xfc
sync.(*Mutex).Lock(...)
	sync/mutex.go:81
github.com/Shopify/sarama.(*Broker).send(0xc003de2000, 0x32d75c0, 0xc0043b9080, 0xc301, 0x0, 0x0, 0x0)
	github.com/Shopify/[email protected]/broker.go:716 +0x6e2
github.com/Shopify/sarama.(*Broker).sendAndReceive(0xc003de2000, 0x32d75c0, 0xc0043b9080, 0x32d7620, 0xc0045f7590, 0x2942400, 0x1)
	github.com/Shopify/[email protected]/broker.go:765 +0x84
github.com/Shopify/sarama.(*Broker).GetMetadata(0xc003de2000, 0xc0043b9080, 0x0, 0x0, 0x1)
	github.com/Shopify/[email protected]/broker.go:283 +0x70
github.com/Shopify/sarama.(*client).tryRefreshMetadata(0xc002e81b00, 0x0, 0x0, 0x0, 0x5f, 0x0, 0x0, 0x0, 0x22cd430, 0x0)
	github.com/Shopify/[email protected]/client.go:880 +0x4d1
...
github.com/Shopify/sarama.(*client).RefreshMetadata(0xc002e81b00, 0x0, 0x0, 0x0, 0x7f2b94f9d02c, 0x12)
	github.com/Shopify/[email protected]/client.go:473 +0xbe
github.com/Shopify/sarama.NewClient(0xc00387ebd0, 0x3, 0x3, 0xc003007880, 0x2ad5040, 0x2, 0x3, 0xc00387ebd0)
	github.com/Shopify/[email protected]/client.go:167 +0x331
github.com/Shopify/sarama.NewAsyncProducer(0xc00387ebd0, 0x3, 0x3, 0xc003007880, 0x0, 0x2, 0xc00387ebd0, 0x3)
	github.com/Shopify/[email protected]/async_producer.go:126 +0x4d
github.com/pingcap/ticdc/cdc/sink/producer/kafka.NewKafkaSaramaProducer(0x32b7e80, 0xc000e7ebc0, 0xc000242638, 0x38, 0xc000242671, 0x15, 0x100000000, 0xc000242695, 0x5, 0x4000000, ...)
	github.com/pingcap/ticdc@/cdc/sink/producer/kafka/kafka.go:328 +0x35a
github.com/pingcap/ticdc/cdc/sink.newKafkaSaramaSink(0x32b7e80, 0xc000e7ebc0, 0xc0013fcc00, 0xc000818be0, 0xc000c550e0, 0xc000c54e10, 0xc003883bc0, 0xc000c6a388, 0x0, 0xc000c6a340)
	github.com/pingcap/ticdc@/cdc/sink/mq.go:475 +0x44b
github.com/pingcap/ticdc/cdc/sink.init.1.func3(0x32b7e80, 0xc000e7ebc0, 0xc00016e8ba, 0xc, 0xc0013fcc00, 0xc000818be0, 0xc000c550e0, 0xc000c54e10, 0xc003883bc0, 0x27fe8e0, ...)
	github.com/pingcap/ticdc@/cdc/sink/sink.go:81 +0x7a
github.com/pingcap/ticdc/cdc/sink.NewSink(0x32b7e80, 0xc000e7ebc0, 0xc00016e8ba, 0xc, 0xc000242630, 0xae, 0xc000818be0, 0xc000c550e0, 0xc000c54e10, 0xc003883bc0, ...)
	github.com/pingcap/ticdc@/cdc/sink/sink.go:113 +0x1ef
github.com/pingcap/ticdc/cdc.(*Owner).newChangeFeed(0xc0006783c0, 0x32b7e80, 0xc000e7ebc0, 0xc00016e8ba, 0xc, 0xc0008790b0, 0xc000879860, 0xc0012d6180, 0x5f4063fe7d00000, 0x0, ...)
	github.com/pingcap/ticdc@/cdc/owner.go:414 +0x21bb
github.com/pingcap/ticdc/cdc.(*Owner).loadChangeFeeds(0xc0006783c0, 0x32b7f40, 0xc000d9d7a0, 0x0, 0x0)
	github.com/pingcap/ticdc@/cdc/owner.go:615 +0xe1c
github.com/pingcap/ticdc/cdc.(*Owner).watchCapture(0xc0006783c0, 0x32b7e80, 0xc0004087c0, 0x1, 0x0)
	github.com/pingcap/ticdc@/cdc/owner.go:1497 +0xdc
github.com/pingcap/ticdc/cdc.(*Owner).startCaptureWatcher.func1(0x32b7e80, 0xc0004087c0, 0xc0006783c0)
	github.com/pingcap/ticdc@/cdc/owner.go:1609 +0xd2
created by github.com/pingcap/ticdc/cdc.(*Owner).startCaptureWatcher
	github.com/pingcap/ticdc@/cdc/owner.go:1598 +0x7d

Versions of the cluster

Upstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

TiCDC version (execute cdc version):

v5.0.1

@amyangfei amyangfei added type/bug The issue is confirmed as a bug. area/ticdc Issues or PRs related to TiCDC. labels Nov 9, 2021
@liuzix liuzix self-assigned this Nov 9, 2021
@liuzix
Copy link
Contributor

liuzix commented Nov 9, 2021

The root cause is that the initialization of a sarama.(*Broker) blocks the owner goroutine from doing everything else.

goroutine 11818 [IO wait]:
internal/poll.runtime_pollWait(0x7f2b92a2df50, 0x77, 0xc003caf500)
	runtime/netpoll.go:184 +0x55
internal/poll.(*pollDesc).wait(0xc000ff5518, 0x77, 0x32b7f00, 0xc0091e1f80, 0xc000ff5500)
	internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitWrite(...)
	internal/poll/fd_poll_runtime.go:96
internal/poll.(*FD).WaitWrite(...)
	internal/poll/fd_unix.go:498
net.(*netFD).connect(0xc000ff5500, 0x32b7f00, 0xc0091e1f80, 0x0, 0x0, 0x324d2a0, 0xc004856ac0, 0x0, 0x0, 0x0, ...)
	net/fd_unix.go:152 +0x29d
net.(*netFD).dial(0xc000ff5500, 0x32b7f00, 0xc0091e1f80, 0x32dcba0, 0x0, 0x32dcba0, 0xc005f18c30, 0x0, 0x1, 0xc0055af858)
	net/sock_posix.go:149 +0xff
net.socket(0x32b7f00, 0xc0091e1f80, 0x2c1593d, 0x3, 0x2, 0x1, 0x0, 0x0, 0x32dcba0, 0x0, ...)
	net/sock_posix.go:70 +0x1c0
net.internetSocket(0x32b7f00, 0xc0091e1f80, 0x2c1593d, 0x3, 0x32dcba0, 0x0, 0x32dcba0, 0xc005f18c30, 0x1, 0x0, ...)
	net/ipsock_posix.go:141 +0x141
net.(*sysDialer).doDialTCP(0xc000ff5480, 0x32b7f00, 0xc0091e1f80, 0x0, 0xc005f18c30, 0x27c9020, 0x50f1538, 0x0)
	net/tcpsock_posix.go:65 +0xc2
net.(*sysDialer).dialTCP(0xc000ff5480, 0x32b7f00, 0xc0091e1f80, 0x0, 0xc005f18c30, 0xc002a4d5a0, 0x10, 0x10)
	net/tcpsock_posix.go:61 +0xd7
net.(*sysDialer).dialSingle(0xc000ff5480, 0x32b7f00, 0xc0091e1f80, 0x327d680, 0xc005f18c30, 0x0, 0x0, 0x0, 0x0)
	net/dial.go:578 +0x36e
net.(*sysDialer).dialSerial(0xc000ff5480, 0x32b7f00, 0xc0091e1f80, 0xc002a4d5a0, 0x1, 0x1, 0x0, 0x0, 0x0, 0x0)
	net/dial.go:546 +0x221
net.(*Dialer).DialContext(0xc0091e15c0, 0x32b7ec0, 0xc0000e8000, 0x2c1593d, 0x3, 0xc000242b2e, 0x12, 0x0, 0x0, 0x0, ...)
	net/dial.go:424 +0x666
net.(*Dialer).Dial(0xc0091e15c0, 0x2c1593d, 0x3, 0xc000242b2e, 0x12, 0xc003f04ec0, 0x4052da, 0xc0012fcf00, 0xc0012d5020)
	net/dial.go:347 +0x75
github.com/Shopify/sarama.(*Broker).Open.func1()
	github.com/Shopify/[email protected]/broker.go:158 +0xe2
github.com/Shopify/sarama.withRecover(0xc006269680)
	github.com/Shopify/[email protected]/utils.go:43 +0x4d
created by github.com/Shopify/sarama.(*Broker).Open
	github.com/Shopify/[email protected]/broker.go:154 +0x120

TODO

  • Add a mechanism to create sinks asynchronously, which will print warnings in case of timeouts (note that creating a sarama client does not seem to be cancellable by a context).
  • Investigate the semantics of sarama timeouts, and test whether it behaves as expected
  • Find a way to gracefully cancel the producer if the user wants to remove (or pause) the changefeed but the producer is still initializing.

@amyangfei
Copy link
Contributor Author

This issue doesn't block v5.3.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants