pgerror: add "initial connection heartbeat failed" to SQL retryable #106547
This commit adds the `initial connection heartbeat failed` error to the list of SQL retryable errors. `IsSQLRetryableError` is used in three places:
- rerunning distributed query as local
- `IsPermanentSchemaChangeError`
- `validateUniqueConstraint`

and in all cases I think it makes sense to consider failing to establish a connection as a retryable error.

This commit includes the test from Jeff that exposed the gap in the retry-as-local mechanism.

Release note: None
What behavior does this give us if a node is partitioned away from the cluster with a network error? Will it keep retrying in SQL until the partition is resolved? Or do we want to stop considering network errors as retriable after a timeout?
This will depend on the caller of
|
There are a couple other cases that might show up that we should handle:
I think the error allow list is always going to be fragile. Is there an error code we can check for that indicates the error is a problem with the query and not the transport layer?
Alternatively, we should have the RPC layer tag error messages that indicate a communication error/protocol error so that we don't have to guess higher up in the sql layer.
At least not that I know of, perhaps @knz knows.
Also curious what @knz thinks about this idea.
There are multiple kinds of network errors. If the error occurs after a connection is already established, I believe grpc gives us an error object that says as much. But if the error happens during the connection handshake, we have a larger diversity of cases due to the circuit breaker, heartbeat and whatnot. Maybe a question for @aliher1911 (or @tbg): do we have a stable and exhaustive error characterization for rpc.Context?
No, the best we have is `cockroach/pkg/util/grpcutil/grpc_util.go`, lines 153 to 163 in 24537db.
Which errors to expect to bubble up to SQL is a bit of a mess. This comment links a few issues that I know of. Basically there is no official contract and also no testing that verifies which errors do occur. In some sense, our perf tests are the best testing we have because they don't tolerate errors and will fail when something does bubble up.
Thanks everyone for the input. Over in #108271 we saw another scenario that Jeff previously mentioned
I'll close this PR and will make a more targeted "fix" ("fix" is in quotes because it'll still rely on the "retry-as-local" mechanism).
Fixes: #106537.