Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/sqlproxyccl: TestDirectoryConnect failed #105402

Closed
cockroach-teamcity opened this issue Jun 23, 2023 · 3 comments · Fixed by #106549
Closed

ccl/sqlproxyccl: TestDirectoryConnect failed #105402

cockroach-teamcity opened this issue Jun 23, 2023 · 3 comments · Fixed by #106549
Assignees
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-serverless
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Jun 23, 2023

ccl/sqlproxyccl.TestDirectoryConnect failed with artifacts on release-23.1 @ 624d9dea7f1296af60d16bbb45cc9d1259b3a1be:

Fatal error:

panic: test timed out after 4m55s

Stack:

goroutine 42056 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x8e
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x32
Log preceding fatal error

* created by github.com/cockroachdb/pebble.(*tableCacheShard).init
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:314 +0xef
* 
* goroutine 34864 [chan receive, 2 minutes]:
* github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop.func1({0x7439140, 0xc00ac98510})
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:324 +0x9f
* runtime/pprof.Do({0x74390d0?, 0xc00022a000?}, {{0xc000166a00?, 0xc0050fb180?, 0xc0050fb1d0?}}, 0xc00207c7a8)
* 	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
* github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop(0x0?)
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:322 +0x58
* created by github.com/cockroachdb/pebble.(*tableCacheShard).init
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:314 +0xef
* 
* goroutine 34746 [select]:
* github.com/cockroachdb/cockroach/pkg/util/admission.initWorkQueue.func2()
* 	github.com/cockroachdb/cockroach/pkg/util/admission/work_queue.go:388 +0x86
* created by github.com/cockroachdb/cockroach/pkg/util/admission.initWorkQueue
* 	github.com/cockroachdb/cockroach/pkg/util/admission/work_queue.go:385 +0x33f
* 
* goroutine 34741 [select, 2 minutes]:
* github.com/cockroachdb/cockroach/pkg/util/log.(*fileSink).gcDaemon(0xc00772b2b0, {0x7439098, 0xc001cd79c0})
* 	github.com/cockroachdb/cockroach/pkg/util/log/file_log_gc.go:25 +0x7f
* created by github.com/cockroachdb/cockroach/pkg/util/log.ApplyConfig
* 	github.com/cockroachdb/cockroach/pkg/util/log/flags.go:306 +0x10aa
* 
* goroutine 34737 [sleep]:
* time.Sleep(0x4c4b40)
* 	GOROOT/src/runtime/time.go:195 +0x135
* github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Quiesce(0xc00790b080, {0x74390d0?, 0xc00022a000})
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:584 +0x1a5
* github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Stop(0xc00790b080, {0x74390d0, 0xc00022a000})
* 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:526 +0x11d
* github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl.TestDirectoryConnect(0xc006f49380)
* 	github.com/cockroachdb/cockroach/pkg/ccl/sqlproxyccl/proxy_handler_test.go:1447 +0xc8f
* testing.tRunner(0xc006f49380, 0x6204ef8)
* 	GOROOT/src/testing/testing.go:1446 +0x10b
* created by testing.(*T).Run
* 	GOROOT/src/testing/testing.go:1493 +0x35f
* 
* goroutine 34794 [chan receive, 2 minutes]:
* github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop.func1({0x7439140, 0xc008f172f0})
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:324 +0x9f
* runtime/pprof.Do({0x74390d0?, 0xc00022a000?}, {{0xc000166a00?, 0xc003d2d810?, 0xc006d235b0?}}, 0xc0015c97a8)
* 	GOROOT/src/runtime/pprof/runtime.go:40 +0xa3
* github.com/cockroachdb/pebble.(*tableCacheShard).releaseLoop(0xc0084aa720?)
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:322 +0x58
* created by github.com/cockroachdb/pebble.(*tableCacheShard).init
* 	github.com/cockroachdb/pebble/external/com_github_cockroachdb_pebble/table_cache.go:314 +0xef
* 
*

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/sql-foundations @cockroachdb/server

This test on roachdash | Improve this report!

Jira issue: CRDB-29014

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Jun 23, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jun 23, 2023
@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 23, 2023
@rafiss rafiss added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) T-serverless labels Jun 23, 2023
@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jun 23, 2023
@jeffswenson
Copy link
Collaborator

This was fixed on master by #101864.

@jeffswenson
Copy link
Collaborator

The change I thought fixed this was already back ported before this flake.

@jeffswenson
Copy link
Collaborator

jeffswenson commented Jul 10, 2023

WIP work on deflaking the test is contained in #106549. I stumbled on another bug that shows up once every 1k attempts under --stress.

The main symptoms are:

  1. The test times out with zero failed connection attempts (the connection is stuck).
  2. The sql proxy is blocked on this line
    if _, err = io.ReadFull(conn, response); err != nil {
    .
  3. There is no stack in the sql server that suggests the sql server accepted the connection.

I suspect the root cause is the tenant directory cache contains a stale entry for the tenant and something (probably the new sql server's http server but could be something outside the test harness) is listening on the port. This causes the tcp dial to succeed but connection setup gets stuck during the negotiation phase because the recipient does not understand the postgres tls negotiation protocol. There is a time out inside the sql proxy for dialing a connection, but there is no time out for negotiating ssl.

jeffswenson added a commit to jeffswenson/cockroach that referenced this issue Jul 11, 2023
Previously, if a sql server did not respond to the TLS handshake, the
sql proxy would wait forever. This could happen in production if a sql
server is overloaded. It can also cause test flakes if a port is reused
by something that does not understand the pgwire protocol.

Release Note: None
Fixes: cockroachdb#106554
Part of: cockroachdb#105402
jeffswenson added a commit to jeffswenson/cockroach that referenced this issue Jul 11, 2023
This contains fixes to two sources of flakes in TestDirectoryConnect:
- sqlproxy http draining is now tied into the stopper. This avoids a
	source of goroutine leaks.
- The sql server is gracefully drained to work around cockroachdb#106537.

When combined with cockroachdb#106599, I was able to run the test for 25K
interations under stress with no flakes.

Fixes: cockroachdb#105402
craig bot pushed a commit that referenced this issue Jul 11, 2023
106549: sqlproxyccl: deflake TestDirectoryConnect r=JeffSwenson a=JeffSwenson

This contains fixes to two sources of flakes in TestDirectoryConnect:
- sqlproxy http draining is now tied into the stopper. This avoids a
	source of goroutine leaks.
- The sql server is gracefully drained to work around #106537.

When combined with #106599, I was able to run the test for 25K
interations under stress with no flakes.

Fixes: #105402

Co-authored-by: Jeff <[email protected]>
@craig craig bot closed this as completed in 55745a8 Jul 11, 2023
jeffswenson added a commit to jeffswenson/cockroach that referenced this issue Jul 11, 2023
Previously, if a sql server did not respond to the TLS handshake, the
sql proxy would wait forever. This could happen in production if a sql
server is overloaded. It can also cause test flakes if a port is reused
by something that does not understand the pgwire protocol.

Release Note: None
Fixes: cockroachdb#106554
Part of: cockroachdb#105402
craig bot pushed a commit that referenced this issue Jul 12, 2023
106599: sqlproxyccl: handle black hole sql servers r=JeffSwenson a=JeffSwenson

Previously, if a sql server did not respond to the TLS handshake, the sql proxy would wait forever. This could happen in production if a sql server is overloaded. It can also cause test flakes if a port is reused by something that does not understand the pgwire protocol.

Release Note: None
Fixes: #106554
Part of: #105402

Co-authored-by: Jeff <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-serverless
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants