ccl/sqlproxyccl: TestDirectoryConnect failed #105402
Comments
This was fixed on master by #101864.
The change I thought fixed this had already been backported before this flake.
Work in progress on deflaking the test is in #106549. While working on it, I stumbled on another bug that shows up once every 1k attempts under --stress. The main symptoms are:
I suspect the root cause is that the tenant directory cache contains a stale entry for the tenant, and something else is now listening on that port (probably the new SQL server's HTTP server, but it could be something outside the test harness). The TCP dial therefore succeeds, but connection setup gets stuck during the negotiation phase because the recipient does not understand the Postgres TLS negotiation protocol. The SQL proxy has a timeout for dialing a connection, but no timeout for negotiating SSL.
Previously, if a sql server did not respond to the TLS handshake, the sql proxy would wait forever. This could happen in production if a sql server is overloaded. It can also cause test flakes if a port is reused by something that does not understand the pgwire protocol.
Release Note: None
Fixes: cockroachdb#106554
Part of: cockroachdb#105402
This contains fixes to two sources of flakes in TestDirectoryConnect:
- sqlproxy http draining is now tied into the stopper. This avoids a source of goroutine leaks.
- The sql server is gracefully drained to work around cockroachdb#106537.
When combined with cockroachdb#106599, I was able to run the test for 25K iterations under stress with no flakes.
Fixes: cockroachdb#105402
106549: sqlproxyccl: deflake TestDirectoryConnect r=JeffSwenson a=JeffSwenson
This contains fixes to two sources of flakes in TestDirectoryConnect:
- sqlproxy http draining is now tied into the stopper. This avoids a source of goroutine leaks.
- The sql server is gracefully drained to work around #106537.
When combined with #106599, I was able to run the test for 25K iterations under stress with no flakes.
Fixes: #105402
Co-authored-by: Jeff <[email protected]>
106599: sqlproxyccl: handle black hole sql servers r=JeffSwenson a=JeffSwenson
Previously, if a sql server did not respond to the TLS handshake, the sql proxy would wait forever. This could happen in production if a sql server is overloaded. It can also cause test flakes if a port is reused by something that does not understand the pgwire protocol.
Release Note: None
Fixes: #106554
Part of: #105402
Co-authored-by: Jeff <[email protected]>
ccl/sqlproxyccl.TestDirectoryConnect failed with artifacts on release-23.1 @ 624d9dea7f1296af60d16bbb45cc9d1259b3a1be:
Fatal error:
Stack:
Log preceding fatal error
Help
See also: How To Investigate a Go Test Failure (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-29014