
pgerror: add "initial connection heartbeat failed" to SQL retryable #106547

Closed · wants to merge 1 commit

Conversation

yuzefovich
Member

This commit adds the `initial connection heartbeat failed` error to the list of SQL retryable errors. `IsSQLRetryableError` is used in three places:

  • rerunning distributed query as local
  • IsPermanentSchemaChangeError
  • validateUniqueConstraint

and in all cases I think it makes sense to consider failing to establish a connection as a retryable error.

This commit includes the test from Jeff that exposed the gap in the retry-as-local mechanism.

Fixes: #106537.

Release note: None

@yuzefovich yuzefovich requested review from jeffswenson and a team July 10, 2023 20:43
@yuzefovich yuzefovich requested review from a team as code owners July 10, 2023 20:43
@yuzefovich yuzefovich requested a review from cucaroach July 10, 2023 20:43

@knz
Contributor

knz commented Jul 10, 2023

What behavior does this give us if a node is partitioned away from the cluster with a network error? Will it keep retrying in SQL until the partition is resolved? Or do we want to stop considering network errors as retryable after a timeout?

@yuzefovich
Member Author

What behavior does this give us if a node is partitioned away from the cluster with a network error?

This will depend on the caller of IsSQLRetryableError:

  • retry-as-local mechanism will run the query locally, and if it happens to hit the partitioned node down at the KV layer, then that error will be propagated to the client and won't be retried again
  • in IsPermanentSchemaChangeError, before calling IsSQLRetryableError we call grpcutil.IsClosedConnection, which I think will return true because the error uses the Unavailable code, so in this case this change should be a no-op
  • in validateUniqueConstraint the retry mechanism of up to 5 attempts will kick in. This mechanism was introduced somewhat recently in sql: retry validation of unique constraint #88307, and it was made partially redundant by the recently introduced retry-as-local mechanism. Still, it seems ok to keep validateUniqueConstraint's own mechanism but perhaps reduce the number of retry attempts to 2 or 3. Thoughts on this one?

@jeffswenson
Collaborator

There are a couple other cases that might show up that we should handle:

  1. The port is in use by an RPC server for the same tenant, but with a new instance id. This generates an error that complains about the recipient having the wrong node ID.
  2. The port is in use by another tenant. This would generate an ssl certificate error.

I think the error allow list is always going to be fragile. Is there an error code we can check for that indicates the error is a problem with the query and not the transport layer?

@jeffswenson
Collaborator

Alternatively, we should have the RPC layer tag error messages that indicate a communication error/protocol error so that we don't have to guess higher up in the sql layer.

@yuzefovich
Member Author

Is there an error code we can check for that indicates the error is a problem with the query and not the transport layer?

At least not that I know of, perhaps @knz knows.

Alternatively, we should have the RPC layer tag error messages that indicate a communication error/protocol error so that we don't have to guess higher up in the sql layer.

Also curious what @knz thinks about this idea.

@knz
Contributor

knz commented Jul 11, 2023

There are multiple kinds of network errors. If the error occurs after a connection is already established, I believe gRPC gives us an error object that says as much.

But if the error happens during the connection handshake, we have a larger diversity of cases due to circuit breaker, heartbeat and whatnot.

Maybe a question for @aliher1911 (or @tbg): do we have a stable and exhaustive error characterization for rpc.Context?

@tbg
Member

tbg commented Jul 18, 2023

Maybe a question for @aliher1911 (or @tbg): do we have a stable and exhaustive error characterization for rpc.Context?

No, the best we have is

```go
// RequestDidNotStart returns true if the given RPC error means that the request
// definitely could not have started on the remote server.
func RequestDidNotStart(err error) bool {
	// NB: gRPC doesn't provide a way to distinguish unambiguous failures, but
	// InitialHeartbeatFailedError serves mostly the same purpose. See also
	// https://github.com/grpc/grpc-go/issues/1443.
	return errors.HasType(err, (*netutil.InitialHeartbeatFailedError)(nil)) ||
		errors.Is(err, circuit.ErrBreakerOpen) ||
		IsConnectionRejected(err) ||
		IsWaitingForInit(err)
}
```

Which errors to expect to bubble up to SQL is a bit of a mess. This comment links a few issues that I know of. Basically there is no official contract and also no testing that verifies which errors do occur. In some sense, our perf tests are the best testing we have, because they don't tolerate errors and will fail [1] when something does bubble up.

Footnotes

[1] https://github.com/cockroachdb/cockroach/issues/106474#issuecomment-1640382354

@rafiss rafiss removed the request for review from a team July 25, 2023 20:10
@yuzefovich
Member Author

Thanks everyone for the input. Over in #108271 we saw another scenario that Jeff previously mentioned

The port is in use by an RPC server for the same tenant, but with a new instance id. This generates an error that complains about the recipient having the wrong node ID.

I'll close this PR and will make a bit more targeted "fix" ("fix" is in quotes because it'll still rely on the "retry-as-local" mechanism).

@yuzefovich yuzefovich closed this Aug 7, 2023
@yuzefovich yuzefovich deleted the sql-retryable branch August 7, 2023 22:00
Successfully merging this pull request may close these issues.

Sql Server Fails to Start with DistSQL Error