
pgerror: add "initial connection heartbeat failed" to SQL retryable #106547

Closed · wants to merge 1 commit

Conversation

yuzefovich
Member

This commit adds the `initial connection heartbeat failed` error to the list of SQL retryable errors. `IsSQLRetryableError` is used in three places:

  • rerunning distributed query as local
  • IsPermanentSchemaChangeError
  • validateUniqueConstraint

and in all cases I think it makes sense to consider failing to establish a connection as a retryable error.

This commit includes the test from Jeff that exposed the gap in the retry-as-local mechanism.

Fixes: #106537.

Release note: None

@yuzefovich yuzefovich requested review from jeffswenson and a team July 10, 2023 20:43
@yuzefovich yuzefovich requested review from a team as code owners July 10, 2023 20:43
@yuzefovich yuzefovich requested a review from cucaroach July 10, 2023 20:43

@knz
Contributor

knz commented Jul 10, 2023

What behavior does this give us if a node is partitioned away from the cluster with a network error? Will it keep retrying in SQL until the partition is resolved? Or do we want to stop considering network errors as retryable after a timeout?

@yuzefovich
Member Author

What behavior does this give us if a node is partitioned away from the cluster with a network error?

This will depend on the caller of IsSQLRetryableError:

  • retry-as-local mechanism will run the query locally, and if it happens to hit the partitioned node down at the KV layer, then that error will be propagated to the client and won't be retried again
  • in IsPermanentSchemaChangeError, before calling IsSQLRetryableError we call grpcutil.IsClosedConnection, which I think will return true because the error uses the Unavailable code, so in this case this change should be a no-op
  • in validateUniqueConstraint the retry mechanism of up to 5 attempts will kick in. This mechanism was introduced somewhat recently in sql: retry validation of unique constraint #88307, and it was made partially redundant by the recently introduced retry-as-local mechanism. Still, it seems ok to keep validateUniqueConstraint's own mechanism but perhaps reduce the number of retry attempts to 2 or 3. Thoughts on this one?

@jeffswenson
Collaborator

There are a couple other cases that might show up that we should handle:

  1. The port is in use by an RPC server for the same tenant, but with a new instance id. This generates an error that complains about the recipient having the wrong node ID.
  2. The port is in use by another tenant. This would generate an ssl certificate error.

I think the error allow list is always going to be fragile. Is there an error code we can check for that indicates the error is a problem with the query and not the transport layer?

@jeffswenson
Collaborator

Alternatively, we should have the RPC layer tag error messages that indicate a communication error/protocol error so that we don't have to guess higher up in the sql layer.

@yuzefovich
Member Author

Is there an error code we can check for that indicates the error is a problem with the query and not the transport layer?

At least not that I know of, perhaps @knz knows.

Alternatively, we should have the RPC layer tag error messages that indicate a communication error/protocol error so that we don't have to guess higher up in the sql layer.

Also curious what @knz thinks about this idea.

@knz
Contributor

knz commented Jul 11, 2023

There are multiple kinds of network errors. If the error occurs after a connection is already established, I believe gRPC gives us an error object that says as much.

But if the error happens during the connection handshake, we have a larger diversity of cases due to circuit breaker, heartbeat and whatnot.

Maybe a question for @aliher1911 (or @tbg): do we have a stable and exhaustive error characterization for rpc.Context?

@tbg
Member

tbg commented Jul 18, 2023

Maybe a question for @aliher1911 (or @tbg): do we have a stable and exhaustive error characterization for rpc.Context?

No, the best we have is

```go
// RequestDidNotStart returns true if the given RPC error means that the request
// definitely could not have started on the remote server.
func RequestDidNotStart(err error) bool {
	// NB: gRPC doesn't provide a way to distinguish unambiguous failures, but
	// InitialHeartbeatFailedError serves mostly the same purpose. See also
	// https://github.com/grpc/grpc-go/issues/1443.
	return errors.HasType(err, (*netutil.InitialHeartbeatFailedError)(nil)) ||
		errors.Is(err, circuit.ErrBreakerOpen) ||
		IsConnectionRejected(err) ||
		IsWaitingForInit(err)
}
```

Which errors to expect to bubble up to SQL is a bit of a mess. This comment links a few issues that I know of. Basically there is no official contract and also no testing that verifies which errors do occur. In some sense, our perf tests are the best testing we have, because they don't tolerate errors and will fail [1] when something does bubble up.

Footnotes

[1] https://github.com/cockroachdb/cockroach/issues/106474#issuecomment-1640382354

@rafiss rafiss removed the request for review from a team July 25, 2023 20:10
@yuzefovich
Member Author

Thanks everyone for the input. Over in #108271 we saw another scenario that Jeff previously mentioned

The port is in use by an RPC server for the same tenant, but with a new instance id. This generates an error that complains about the recipient having the wrong node ID.

I'll close this PR and will make a bit more targeted "fix" ("fix" is in quotes because it'll still rely on the "retry-as-local" mechanism).

@yuzefovich yuzefovich closed this Aug 7, 2023
@yuzefovich yuzefovich deleted the sql-retryable branch August 7, 2023 22:00
Successfully merging this pull request may close these issues.

Sql Server Fails to Start with DistSQL Error