ccl/sqlproxyccl: TestDirectoryConnect failed #105402

cockroach-teamcity · 2023-06-23T06:41:37Z

ccl/sqlproxyccl.TestDirectoryConnect failed with artifacts on release-23.1 @ 624d9dea7f1296af60d16bbb45cc9d1259b3a1be:

Fatal error:

panic: test timed out after 4m55s

Stack:

goroutine 42056 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x8e
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x32

Log preceding fatal error

* created by github.com/cockroachdb/pebble.(*tableCacheShard).init
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:314 +0xef
* 
* goroutine 34864 [chan receive, 2 minutes]:
* github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop.func1({0x7439140, 0xc00ac98510})
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:324 +0x9f
* runtime/pprof.Do({0x74390d0?, 0xc00022a000?}, {{0xc000166a00?, 0xc0050fb180?, 0xc0050fb1d0?}}, 0xc00207c7a8)
* 	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
* github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop(0x0?)
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:322 +0x58
* created by github.com/cockroachdb/pebble.(*tableCacheShard).init
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:314 +0xef
* 
* goroutine 34746 [select]:
* github.com/cockroachdb/cockroach/pkg/util/admission.initWorkQueue.func2()
* 	github.com/cockroachdb/cockroach/pkg/util/admission/work_queue.go:388 +0x86
* created by github.com/cockroachdb/cockroach/pkg/util/admission.initWorkQueue
* 	github.com/cockroachdb/cockroach/pkg/util/admission/work_queue.go:385 +0x33f
* 
* goroutine 34741 [select, 2 minutes]:
* github.com/cockroachdb/cockroach/pkg/util/log.(*fileSink).gcDaemon(0xc00772b2b0, {0x7439098, 0xc001cd79c0})
* 	github.com/cockroachdb/cockroach/pkg/util/log/file_log_gc.go:25 +0x7f
* created by github.com/cockroachdb/cockroach/pkg/util/log.ApplyConfig
* 	github.com/cockroachdb/cockroach/pkg/util/log/flags.go:306 +0x10aa
* 
* goroutine 34737 [sleep]:
* time.Sleep(0x4c4b40)
* 	GOROOT/src/runtime/time.go:195 +0x135
* github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Quiesce(0xc00790b080, {0x74390d0?, 0xc00022a000})
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:584 +0x1a5
* github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc00790b080, {0x74390d0, 0xc00022a000})
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:526 +0x11d
* github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.TestDirectoryConnect(0xc006f49380)
* 	github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/proxy_handler_test.go:1447 +0xc8f
* testing.tRunner(0xc006f49380, 0x6204ef8)
* 	GOROOT/src/testing/testing.go:1446 +0x10b
* created by testing.(*T).Run
* 	GOROOT/src/testing/testing.go:1493 +0x35f
* 
* goroutine 34794 [chan receive, 2 minutes]:
* github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop.func1({0x7439140, 0xc008f172f0})
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:324 +0x9f
* runtime/pprof.Do({0x74390d0?, 0xc00022a000?}, {{0xc000166a00?, 0xc003d2d810?, 0xc006d235b0?}}, 0xc0015c97a8)
* 	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
* github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop(0xc0084aa720?)
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:322 +0x58
* created by github.com/cockroachdb/pebble.(*tableCacheShard).init
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:314 +0xef
* 
*

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations @cockroachdb/server _{This test on roachdash | Improve this report!

Jira issue: CRDB-29014}

The text was updated successfully, but these errors were encountered:

jeffswenson · 2023-07-07T21:35:14Z

This was fixed on master by #101864.

jeffswenson · 2023-07-10T14:26:58Z

The change I thought fixed this was already back ported before this flake.

jeffswenson · 2023-07-10T21:13:54Z

WIP work on deflaking the test is contained in #106549. I stumbled on another bug that shows up once every 1k attempts under --stress.

The main symptoms are:

The test times out with zero failed connection attempts (the connection is stuck).
The sql proxy is blocked on this line

cockroach/pkg/ccl/sqlproxyccl/backend_dialer.go

Line 60 in 2eac0f0

if _, err = io.ReadFull(conn, response); err != nil {

.
There is no stack in the sql server that suggests the sql server accepted the connection.

I suspect the root cause is the tenant directory cache contains a stale entry for the tenant and something (probably the new sql server's http server but could be something outside the test harness) is listening on the port. This causes the tcp dial to succeed but connection setup gets stuck during the negotiation phase because the recipient does not understand the postgres tls negotiation protocol. There is a time out inside the sql proxy for dialing a connection, but there is no time out for negotiating ssl.

Previously, if a sql server did not respond to the TLS handshake, the sql proxy would wait forever. This could happen in production if a sql server is overloaded. It can also cause test flakes if a port is reused by something that does not understand the pgwire protocol. Release Note: None Fixes: cockroachdb#106554 Part of: cockroachdb#105402

This contains fixes to two sources of flakes in TestDirectoryConnect: - sqlproxy http draining is now tied into the stopper. This avoids a source of goroutine leaks. - The sql server is gracefully drained to work around cockroachdb#106537. When combined with cockroachdb#106599, I was able to run the test for 25K interations under stress with no flakes. Fixes: cockroachdb#105402

106549: sqlproxyccl: deflake TestDirectoryConnect r=JeffSwenson a=JeffSwenson This contains fixes to two sources of flakes in TestDirectoryConnect: - sqlproxy http draining is now tied into the stopper. This avoids a source of goroutine leaks. - The sql server is gracefully drained to work around #106537. When combined with #106599, I was able to run the test for 25K interations under stress with no flakes. Fixes: #105402 Co-authored-by: Jeff <[email protected]>

Previously, if a sql server did not respond to the TLS handshake, the sql proxy would wait forever. This could happen in production if a sql server is overloaded. It can also cause test flakes if a port is reused by something that does not understand the pgwire protocol. Release Note: None Fixes: cockroachdb#106554 Part of: cockroachdb#105402

106599: sqlproxyccl: handle black hole sql servers r=JeffSwenson a=JeffSwenson Previously, if a sql server did not respond to the TLS handshake, the sql proxy would wait forever. This could happen in production if a sql server is overloaded. It can also cause test flakes if a port is reused by something that does not understand the pgwire protocol. Release Note: None Fixes: #106554 Part of: #105402 Co-authored-by: Jeff <[email protected]>

cockroach-teamcity added this to the 23.1 milestone Jun 23, 2023

rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 23, 2023

rafiss assigned jeffswenson Jun 23, 2023

rafiss added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) T-serverless labels Jun 23, 2023

rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 23, 2023

jeffswenson mentioned this issue Jul 10, 2023

Sql Server Fails to Start with DistSQL Error #106537

Closed

jeffswenson mentioned this issue Jul 11, 2023

sqlproxyccl: handle black hole sql servers #106599

Merged

jeffswenson mentioned this issue Jul 11, 2023

sqlproxyccl: deflake TestDirectoryConnect #106549

Merged

craig bot closed this as completed in 55745a8 Jul 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ccl/sqlproxyccl: TestDirectoryConnect failed #105402

ccl/sqlproxyccl: TestDirectoryConnect failed #105402

cockroach-teamcity commented Jun 23, 2023 •

edited by cockroach-jira-scripts

Loading

jeffswenson commented Jul 7, 2023

jeffswenson commented Jul 10, 2023

jeffswenson commented Jul 10, 2023 •

edited

Loading

ccl/sqlproxyccl: TestDirectoryConnect failed #105402

ccl/sqlproxyccl: TestDirectoryConnect failed #105402

Comments

cockroach-teamcity commented Jun 23, 2023 • edited by cockroach-jira-scripts Loading

jeffswenson commented Jul 7, 2023

jeffswenson commented Jul 10, 2023

jeffswenson commented Jul 10, 2023 • edited Loading

cockroach-teamcity commented Jun 23, 2023 •

edited by cockroach-jira-scripts

Loading

jeffswenson commented Jul 10, 2023 •

edited

Loading