sql/pgwire: implement pgwire query cancellation #67501

Merged (3 commits) on Feb 10, 2022

Conversation

@rafiss (Collaborator) commented Jul 12, 2021

closes #41335

See individual commits for more details.

This adds a BackendKeyData struct that identifies a session and the SQLInstance
running that session. Clients send it back to cancel a request, and the request is
handled by a new CancelQueryByBackendKeyData endpoint in the status server. The
cancellation is forwarded to the correct node, and a per-node rate limit prevents
DoS attacks.

Also, refer to the RFC: #75870

This PR only adds support for non-multitenant clusters. Refer to the RFC for additional
work needed for serverless/multitenant support.
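
For context, this is the standard pgwire cancel flow: the client opens a brand-new connection and sends a 16-byte CancelRequest message containing the magic code 80877102 followed by the 8 bytes of BackendKeyData it received at session start. A minimal client-side sketch (illustrative only; the package, function name, and address are placeholders, not code from this PR):

package pgcancelclient // hypothetical package, not part of this PR

import (
	"encoding/binary"
	"net"
)

// SendCancelRequest opens a fresh connection and writes a pgwire CancelRequest:
// int32 length (16), int32 code 80877102, then the 8 bytes of BackendKeyData
// (process ID and secret key). The server closes the connection without replying.
func SendCancelRequest(addr string, processID, secretKey uint32) error {
	conn, err := net.Dial("tcp", addr) // e.g. "localhost:26257" (placeholder)
	if err != nil {
		return err
	}
	defer conn.Close()

	buf := make([]byte, 16)
	binary.BigEndian.PutUint32(buf[0:4], 16)       // message length, including itself
	binary.BigEndian.PutUint32(buf[4:8], 80877102) // CancelRequest code
	binary.BigEndian.PutUint32(buf[8:12], processID)
	binary.BigEndian.PutUint32(buf[12:16], secretKey)
	_, err = conn.Write(buf)
	return err
}

In practice the driver does this for you; it is shown here only to make the BackendKeyData round trip concrete.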

@rafiss rafiss requested review from jordanlewis, knz, otan and a team July 12, 2021 17:46
@rafiss rafiss requested a review from a team as a code owner July 12, 2021 17:46
@cockroach-teamcity (Member): This change is Reviewable

@@ -1185,6 +1209,14 @@ service Status {
};
}

// PGWireCancelQuery cancels a SQL query given its pgwire "secret ID".
rpc PGWireCancelQuery(PGWireCancelQueryRequest) returns (PGWireCancelQueryResponse) {
option (google.api.http) = {
@otan (Contributor) commented Jul 12, 2021

i got some pushback on making this available as a http endpoint and to remove it in a different PR, wonder if you'll be told the same

@rafiss (Collaborator, Author) replied:

Ah interesting, let us see what people think.. I put it here since the other CancelQuery endpoint is in here, but open to hearing alternatives.

// Response returned by target query's gateway node for a pgwire cancel request.
message PGWireCancelQueryResponse {
// Whether the cancellation request succeeded and the query was canceled.
bool canceled = 1;
@otan (Contributor) commented Jul 12, 2021

super nit: i think it's more common to have something called success, but up to you.
would it be more sane to have this have no response field but a grpc error if this fails?

@rafiss (Collaborator, Author) replied:

Makes sense to me, but I was matching the response format of the non-pgwire CancelQueryResponse. I lean in favor of matching that.

@otan (Contributor) replied:

mm, up to you.

// Cancellation has no effect on ongoing query.
if _, err := db.QueryContext(cancelCtx, "select pg_sleep(0)"); err != nil {
// Cancellation should stop the query.
if _, err := db.QueryContext(cancelCtx, "select pg_sleep(1)"); err == nil {
A contributor commented:

i would sleep longer (30s?) in case of, e.g., race.

@rafiss (Collaborator, Author) replied:

sounds good

resp, err = s.execCfg.SQLStatusServer.PGWireCancelQuery(ctx, req)
if err == nil && len(resp.Error) > 0 {
err = fmt.Errorf(resp.Error)
}
telemetry.Inc(sqltelemetry.CancelRequestCounter)
_ = conn.Close()
A contributor commented:

hmm, why do we close the current connection? wouldn't this render the session that issued the cancel unusable in subsequent requests?

@rafiss (Collaborator, Author) replied:

well the session that issued the cancel is not really a SQL session -- it's an entirely new connection that's created just to send the cancel. so there isn't anything left for the server to do after it handles the pgwire-cancel.

also, seems like lib/pq actually relies on the server closing the connection. see https://github.com/lib/pq/blob/9e747ca50601fcb6c958dd89f4cb8aea3e067767/conn_go18.go#L169-L172

A contributor commented:

maybe worth a comment

@blathers-crl blathers-crl bot requested a review from otan July 13, 2021 04:12
@otan (Contributor) left a comment:

LGTM, probably want a review from the server team too

resp, err = s.execCfg.SQLStatusServer.PGWireCancelQuery(ctx, req)
if err == nil && len(resp.Error) > 0 {
err = fmt.Errorf(resp.Error)
}
telemetry.Inc(sqltelemetry.CancelRequestCounter)
_ = conn.Close()
A contributor commented:

maybe worth a comment

@knz (Contributor) left a comment:

All right, after looking at this we need to have two discussions.

Discussion about the encoding and the secret entropy

I am not fundamentally opposed to the approach but the design discussion should be reported on in the commit message (I am mildly surprised there was no RFC). For example, it could include answers to the following questions:

  • If a node has 10000 sessions open, there's only 19 bits of entropy to protect us against duplicate IDs or an attacker guessing a valid session ID by chance. Are we comfortable with this?
  • The chance that a session's ID gets allocated twice increases over time. For long-lived sessions (which is not unreasonable for SQL), the chance of duplicates increases with the age of the session. What does the math say?
  • What does pg do? Are we doing something similar? (If not, why? What was the trade-off?)
  • What's the probability of a duplicate ID as a function of the number of concurrent sessions and their age?
  • What happens if there are, in fact, duplicate IDs, and a cancel is issued? How would an operator notice that multiple sessions were cancelled?
  • What is the % chance for an attacker to guess a session ID as a function of the number of open sessions?

Also I observe the solution described here uses a full 32-bit value to describe the node ID. In practice, it's somewhat unlikely we ever need all 32 bits. I wonder if we could get some mileage by letting the two values overlap somehow, and then doing a fan-out across multiple nodes to "find the session" if there is an ambiguity?

Alternatively, we can use a variable encoding. For example, we could use a scheme looking as follows:

  • designate one bit in the "node ID part" (e.g. bit 31) to indicate whether the field contains a "large node ID" or a "small node ID". Based on our deployment experience, a node ID is "small" if it is smaller than 1024 (i.e. 10 bits).
  • based on that idea, use the field as follows: if bit 31 is unset (small node ID), then use bits 30-11 as additional secret bits. Otherwise, consider the node ID large and keep the secret at 32 bits.
  • when processing a cancellation request, look at bit 31 to decide how to compute the node ID from the first 32 bits.

Why this works and in fact makes for better security:

  • in the common case of small node IDs, we get more secret bits. This is good! It lowers the chance of collisions and random guesses by increasing the entropy to 51 bits.
  • in the less common case of large node IDs, it is also/still good: in that case, the node IDs are large and so an attacker must also guess the node ID accurately (we get extra entropy from the node IDs themselves).

(We could even devise an even better scheme using variable-length encoding for the node ID and use all remaining bits as secret, so there's no sharp cut off when node IDs become larger than 1024).

Anyway, one thing to take from the convo above is that in the implementation the data type used to represent a node+secret combination should be abstracted into a separate Go type, with methods to retrieve the node ID and secret part, and (conceptually, in the API) treat the secret part as an int64 regardless of implementation. This will ensure the implementation is not painted into a corner.

Also, the file where the data type is implemented should be the place where the scheme is documented inside the source code (as a comment). Then the other places that implement the cancellation should forward the reader to that central place where the thing is documented. Given the sensitive nature of this code, we don't want multiple locations in the code making assumptions about the implementation and so all the comments should forward to one source of truth.

Finally, the specific choices and the ideas for the scheme should be included in a release note with label "security update".

Discussion about rate limiting

The overall idea to use a semaphore with a delay in case of failed cancellation is good. I recommend explaining this in a release note marked with "security update" too.

However, the following comment reveals something problematic:

	// The request is forwarded to the appropriate node. We implement the rate
	// limit in the node to which the request is forwarded. This way, if an
	// attacker spams all the nodes in the cluster with requests that all go to
	// the same node, the per-node rate limit will prevent them from having
	// too many guesses.

It's important to have a semaphore at the destination; however, the mechanism you're introducing here creates a DDoS attack surface that did not exist before: it becomes possible for an attacker to flood the network with these requests. It's even a bit worse than that, since they will flood the inter-node gRPC connections that are also used for KV traffic and cause general cluster instability.

I think we need two semaphores: one at the source, per node, to protect against what I wrote, and one at the destination, for the reason you wrote above.
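
A rough sketch of that two-semaphore shape (illustrative only; the limiter type, the limits, and the handler names are invented here and are not the PR's actual code):

package cancelguard // hypothetical package

import "time"

// limiter is a small counting semaphore built on a buffered channel.
type limiter chan struct{}

func newLimiter(n int) limiter { return make(limiter, n) }

func (l limiter) tryAcquire() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

func (l limiter) release() { <-l }

var (
	// sourceLimit bounds cancel requests accepted from SQL clients on this node.
	sourceLimit = newLimiter(256)
	// destLimit bounds cancel requests processed for sessions owned by this node,
	// even when many other nodes forward requests here.
	destLimit = newLimiter(256)
)

// handleClientCancel runs on the node the client connected to.
func handleClientCancel(forward func()) {
	if !sourceLimit.tryAcquire() {
		return // drop excess requests at the edge instead of flooding the RPC layer
	}
	defer sourceLimit.release()
	forward() // RPC to the node that owns the session
}

// handleForwardedCancel runs on the node that owns the target session.
func handleForwardedCancel(cancel func() bool) {
	if !destLimit.tryAcquire() {
		return // too many in-flight attempts; ignore
	}
	defer destLimit.release()
	if !cancel() {
		// A failed attempt (likely a bad secret) holds its slot for a second,
		// which throttles brute-force guessing of cancellation keys.
		time.Sleep(time.Second)
	}
}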

Reviewed 4 of 4 files at r1, 7 of 7 files at r2, 8 of 9 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @otan, and @rafiss)


pkg/server/serverpb/status.proto, line 649 at r2 (raw file):

  // node_id is a string so that "local" can be used to specify that no
  // forwarding is necessary.
  string node_id = 1;

I don't think requesters will ever use "local" so you can keep an int here (specifically, an int32 and use gogo.casttype to roachpb.NodeID).


pkg/server/serverpb/status.proto, line 652 at r2 (raw file):

  // "Secret" ID of query to be canceled. This comes from the BackendKeyData
  // generated during the pgwire protocol flow.
  uint32 secret_id = 2 [(gogoproto.customname) = "SecretID" ];

see discussion at top: probably worth implementing this as uint64 and keep the specifics encapsulated where the node/secret pair type is implemented in SQL.


pkg/server/serverpb/status.proto, line 1214 at r2 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

Ah interesting, let us see what people think.. I put it here since the other CancelQuery endpoint is in here, but open to hearing alternatives.

Yes indeed, this functionality does not warrant an HTTP endpoint.


pkg/sql/pgwire/conn.go, line 722 at r1 (raw file):

	// connection. The cancel code and secret are both int32's. The cancel code
	// is supposed to be the connection's PID, but since we don't have one of
	// those, we use the node ID and a random 32-bit ID.

Maybe change this comment into a link to another file elsewhere where the full scheme is documented.

@knz (Contributor) commented Jul 13, 2021

Also, I forgot the most important discussion of them all

How is this supposed to work with CockroachCloud?

Especially serverless clusters, which will hopefully constitute the majority of our CC users.

Reminder, serverless uses a "SQL proxy" which takes connections from customers on a single hostname (cockroachcloud.com? I think) and then routes the requests to the appropriate back-end SQL tenant depending on the identity of the CC account, as per the SQL authentication.

Obviously, the current username/password-based routing simply cannot work with the pg cancellation protocol. What to do instead?

Here are the parameters of the problem:

  • the pg cancel protocol is unauthenticated, so there's nothing besides a 64-bit value to determine where to send the cancellation request.
  • although we can tolerate duplicate IDs inside one customer's cluster, so that a cancellation request can (very infrequently) result in cancellation of two or more sessions, it is not acceptable for a cancellation request to cancel queries across multiple CC customers.
  • each SQL proxy instance can route requests for hundreds of thousands of customers, so if we use bits from the cancellation ID as routing identifiers, that correspondingly reduces entropy available for the secret.
  • in CC there is not just 1 SQL proxy for all of CC; we'll have a fleet of SQL proxies, at least one per region, and each proxy able to route requests for many customers. So we can't simply create a 64-bit ID per session in RAM in the SQL proxy, because the values need to be globally unique across separate processes.

There's a perhaps naive approach to side-step these questions, by asking: "Would it be OK for the cancellation protocol to only work for on-prem and dedicated clusters?"

To which there are two answers:

  • (I believe) our goal is to ensure that CC customers who migrate from on-prem to CC, or from CC dedicated to serverless, should not see their UX and application features degrade in that migration. An approach to cancellation that only works for on-prem/dedicated would fail here.

  • we are going to slowly introduce multi-tenancy for on-prem customers too. Some of our larger accounts have requested this. Once these customers use multiple SQL tenants, they will ask us "how does pg query cancellation work in this architecture?" and we need to have an answer for that.

What to do about it?

I think we could perhaps make do by not forwarding the pg cancel request at the SQL proxy, and handling cancellation inside the SQL proxy instead. In that view, the cancellation ID would identify a session inside a SQL proxy, not inside a SQL tenant. (We'd use the pair of SQL proxy ID and session ID inside the proxy as the cancellation key.)

Then when a SQL proxy receives a valid cancel request, it could either send a proper CANCEL SESSION statement to the back-end SQL tenant, or perhaps even more simply close the back-end TCP connection.
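
A very rough sketch of that proxy-local idea (purely illustrative; none of these names exist in the PR or in sqlproxy): the proxy mints the cancel key itself, remembers which backend connection it belongs to, and on a matching CancelRequest simply closes that backend connection (or, more gently, issues CANCEL SESSION to the tenant).

package proxycancel // hypothetical package

import (
	"crypto/rand"
	"encoding/binary"
	"net"
	"sync"
)

type cancelKey uint64

// registry maps the cancel keys handed out to clients to the proxy's backend
// connections, so a CancelRequest never has to leave the proxy.
type registry struct {
	mu    sync.Mutex
	conns map[cancelKey]net.Conn
}

// register generates a random key for a new backend connection and returns it
// so the proxy can forward it to the client as the session's BackendKeyData.
func (r *registry) register(backend net.Conn) (cancelKey, error) {
	var b [8]byte
	if _, err := rand.Read(b[:]); err != nil {
		return 0, err
	}
	key := cancelKey(binary.BigEndian.Uint64(b[:]))
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.conns == nil {
		r.conns = make(map[cancelKey]net.Conn)
	}
	r.conns[key] = backend
	return key, nil
}

// handleCancel is called when the proxy receives a pgwire CancelRequest.
// Closing the backend connection is the crudest form of cancellation; sending
// CANCEL SESSION to the tenant would be the gentler alternative.
func (r *registry) handleCancel(key cancelKey) {
	r.mu.Lock()
	conn, ok := r.conns[key]
	delete(r.conns, key)
	r.mu.Unlock()
	if ok {
		_ = conn.Close()
	}
}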

@knz (Contributor) commented Jul 13, 2021

In the hypothetical case where cancellation for CC can indeed be handled inside the SQL proxy without participation from CockroachDB, I wonder if we should not implement that first -- it would have a higher impact on our business.

@knz (Contributor) commented Jul 13, 2021

Another thing, in case this is meant to work db-side for CC serverless clusters.

How does this work with session migration? Reminder, we're aiming to hold idle SQL sessions open in the proxy while shutting down the SQL pod. When the session becomes active again, a new SQL pod is started.

However the new SQL pod likely gets a new instance ID!

I think in serverless clusters we can't really assume that the SQL instance ID is stable across the lifetime of the SQL session. Andy K might be able to inform further.

How would cancellation work in that case?

@knz (Contributor) commented Jul 13, 2021

And then one more question, if we evolve our product offering to recommend a CRDB SDK to develop clients, could we perhaps implement cancellation in the SDK with a CANCEL SESSION statement?

@andy-kimball (Contributor) commented:

Could we use the IP address of the caller to increase security? For example, we could have the requirement that only the IP address that originally established the authenticated session is allowed to cancel active queries in that session. While something like this would be essential to do for Serverless clusters, I'd think that'd be a good check even for Dedicated clusters.

Also, for Serverless, could we use the IPv4 address of the SQL instance instead of the NodeID of the CRDB node as the BackendKeyData.ProcessID field? This would allow the proxy to efficiently route the request to the right SQL instance without any coordination necessary between proxies. Checking the source IP address in the SQL instance would then mitigate the danger of an unauthorized caller cancelling queries. And while IPv4 addresses can be reused in K8s, the BackendKeyData.SecretKey would minimize the possibility that the wrong queries are cancelled. The downside of using IPv4 is that this would not work with IPv6 networks. But perhaps that's a problem for another day?

@andy-kimball (Contributor) commented:

I also agree with Raphael that this feature could use an RFC, given the complexity of the architectural considerations in play here.

@rafiss (Collaborator, Author) commented Jul 13, 2021

I will work on providing answers to these questions. They're good to ask.

Specifically, for "Discussion about the encoding and the secret entropy", the decisions I made were based on the conclusions of the large thread here: #34520 (comment)

But I agree I should summarize and present those conclusions in these commit messages. I didn't write an RFC since, in my mind, the discussion had already happened, but it looks like there's still enough demand for an RFC (especially given that Serverless was not considered in the previous discussion).

I did consider adding a second semaphore, but kept going back and forth on that. Your comment assures me that it's a good idea to add.

Good points about CC-Serverless. The approach I've implemented will not work with that, or if we do session migration in other places. I'm not advocating for keeping it broken, but I will also just point out that the pgwire cancel protocol is "best effort." Specifically, from the PG docs:

The cancellation signal might or might not have any effect — for example, if it arrives after the backend has finished processing the query, then it will have no effect. If the cancellation is effective, it results in the current command being terminated early with an error message.

The upshot of all this is that for reasons of both security and efficiency, the frontend has no direct way to tell whether a cancel request has succeeded. It must continue to wait for the backend to respond to the query. Issuing a cancel simply improves the odds that the current query will finish soon, and improves the odds that it will fail with an error message instead of succeeding.

And also for some context, the big motivation for us to pick up this work is that it is the top unimplemented feature in our telemetry. cc @vy-ton in case there is more info about the priorities and for awareness on the rest of the discussion

@knz (Contributor) commented Jul 13, 2021

I will also just point out that the pgwire cancel protocol is "best effort."

I'd like to semi-seriously ask: if pg considers this feature best effort, what is the difference between doing something that's only going to work for a minority of clusters and correctly doing absolutely nothing in response to a cancellation request?

Is it perhaps easier to ... remove the telemetry counter altogether?

@andy-kimball (Contributor) commented:

Another even simpler idea: just use the two fields in BackendKeyData as a 64-bit secret. Then broadcast that secret + source IP address to all CRDB nodes or SQL pods, just as we do with CANCEL QUERY today. Whichever nodes/pods match that secret + source IP address will cancel active queries in the matching session. If no nodes/pods match, then we'd back off for 1 second as in the PR to mitigate DOS threats.

This method would work for both Dedicated and Serverless clusters, be relatively easy to implement, would be reasonably secure, and should be no worse than CANCEL QUERY today in terms of performance.

@knz (Contributor) commented Jul 13, 2021

Andy, you're forgetting that these requests are not authenticated. In serverless, that means a proxy would need to broadcast to all tenants too.
That seems expensive?

@rafiss (Collaborator, Author) commented Jul 13, 2021

The existing CANCEL QUERY sql command doesn't broadcast to all nodes -- it forwards to one specific node (for dedicated clusters):

cockroach/pkg/server/status.go, lines 2213 to 2222 at fdb86a2:

if !local {
	// This request needs to be forwarded to another node.
	ctx = propagateGatewayMetadata(ctx)
	ctx = s.AnnotateCtx(ctx)
	status, err := s.dialNode(ctx, nodeID)
	if err != nil {
		return nil, err
	}
	return status.CancelQuery(ctx, req)
}

For tenants, CANCEL QUERY currently does not get broadcasted at all (also see the comment on the tenantStatusServer type above this):

output.Canceled, err = t.sessionRegistry.CancelQuery(request.QueryID)

This latter point makes me feel that this PR is not the place for us to make pgwire cancel fully work for CC-Serverless. (I certainly do support achieving a design that will let it work well in the future though.)

@rafiss (Collaborator, Author) commented Jul 13, 2021

In serverless, that means a proxy would need to broadcast to all tenants too.

Could sqlproxy also use the source IP of the cancel request to decide which tenant to broadcast to?

@andy-kimball (Contributor) commented Jul 13, 2021

Ah, I'd thought someone said we broadcast when canceling - I was mistaken. And Raphael's right that it's way too expensive to broadcast to every tenant anyway. The problem with using the source IP to decide upon the tenant is that every SQL proxy won't necessarily have knowledge of the mapping from source IP to tenant(s), because not every source IP would be connected through every proxy. We'd have to coordinate among SQL proxies to distribute the information, which would introduce complexity and fragility.

One thing that every proxy does have: a complete directory of the TenantID => TenantPodList for every tenant with active pods (though there's a short pub-sub delay to be fully fresh). We could take a hash of the pod address (IP/host + port) to fit even IPv6 addresses within the 32-bit BackendKeyData.ProcessID field. When the proxy receives a CancelRequest message, it'd do these steps:

  1. Do a lookup in the proxy directory for the hash value, in order to get the TenantID + Pod address to connect to.
  2. Send a cancel request message to any pods that match the hash. This would be an internal message extended with the original source IP address of the caller (we don't want to use the source IP address of the proxy).
  3. Each pod uses the secret + source IP address to verify the message and to cancel queries in the matching session.

One downside of this design is that there are race conditions where the proxy directory is stale. But that seems a small downside, since callers would mostly cancel longer-running queries, and since CancelRequest is "best effort" anyway. We could also back off for 1-2 seconds and try the directory lookup again before giving up, since we probably want to do that anyway for DOS threat mitigation (as in this PR).
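
As a toy illustration of the hashing in step 1 (an assumption, not a settled design; the real proxy might pick a different hash), a 32-bit FNV-1a digest of the pod address fits in the BackendKeyData.ProcessID field regardless of whether the address is IPv4 or IPv6:

package main

import (
	"fmt"
	"hash/fnv"
)

// podAddrHash maps a pod address (host + port) to a 32-bit value that fits in
// BackendKeyData.ProcessID. A proxy could compare this hash against its
// TenantID => pod directory to find candidate pods for a CancelRequest.
func podAddrHash(addr string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(addr))
	return h.Sum32()
}

func main() {
	fmt.Printf("%#x\n", podAddrHash("[2001:db8::1]:26257")) // hypothetical pod address
}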

@vy-ton (Contributor) commented Jul 28, 2021

I'd like to semi-seriously ask: if pg considers this feature best effort, what is the difference between doing something that's only going to work for a minority of clusters and correctly doing absolutely nothing in response to a cancellation request?

Is it perhaps easier to ... remove the telemetry counter altogether?

Telemetry does indicate that this unsupported feature contributes to user confusion (todo docs issue). We also see TS repeatedly answering this question.

Is it accurate to say that we do want to support CancelRequest across all CockroachDB deployments, so the options are:

  1. Do not merge this yet, wait and design the multi-tenant solution
  2. Merge, acknowledging that this only works for self-hosted/CC Dedicated. Feature parity with CC Serverless remains an issue, as Raphael noted, and would have to be addressed later on.

@andy-kimball (Contributor) commented:

I'd be worried that the design for Serverless might change how we want to design for Dedicated. In that case, it'd be a mistake to merge this now, since it'd mean possibly painting ourselves into a corner later on when we generalize the design. If we've done enough thinking and have written down how this design can be extended in the future to handle Serverless, then I think that's sufficient due-diligence to merge this PR.

@rafiss rafiss requested a review from a team February 8, 2022 22:41
@knz (Contributor) left a comment:

Thanks this looks nice now.

I think it's pretty important to have separate metrics:

  • cancel count (I think we track this already)
  • ignored cancel queries due to semaphore (we'll need this to troubleshoot cancel issues under load)
  • cancel queries processed successfully

Reviewed 1 of 20 files at r5, 4 of 23 files at r11, 4 of 6 files at r12, 14 of 15 files at r13.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, @otan, and @rafiss)


pkg/server/tenant_status.go, line 253 at r13 (raw file):

	resp = &serverpb.CancelQueryByKeyResponse{}

	// Acquiring the semaphore here helps protect both the source and destination

I'm not completely clear on the semaphore behavior.

Why do you use a semaphore both on the ingress and internally? Why is a semaphore only on the ingress insufficient?

Also, I'm not completely clear on the consequences of using TryAcquire instead of regular Acquire. Does this mean that if the cluster is overloaded, the cancel requests get ignored?

I think that the desired behavior is this:

  • TryAcquire on the ingress from sql clients, and drop excess requests, but with a high(er) semaphore count, perhaps 256
  • Acquire (and wait if necessary) on the inter-node RPC, so that requests that have been accepted externally all get a chance to be processed (albeit, possibly, with a delay)

@rafiss (Collaborator, Author) left a comment:

good call on metrics. will add that as another commit.

i don't think we actually have any metrics now -- we just have product telemetry on the count of requests

Dismissed @knz and @otan from 6 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, and @otan)


pkg/server/tenant_status.go, line 253 at r13 (raw file):

Why do you use a semaphore both on the ingress and internally? Why is a semaphore only on the ingress insufficient?

I tried to explain that in the comment: "The destination node is protected so that if an attacker spams all the nodes in the cluster with requests that all go to the same node, this semaphore will prevent them from having too many guesses." More concretely, suppose you have a 100-node cluster. If you limit the ingress only, then each node is limited to processing X requests per second. But if an attacker crafts the CancelRequests to all target the same SQLInstance, then that one instance would have to process 100*X requests per second.

I don't think the proposal to Acquire and Wait on the inter-node RPC would address this concern.


pkg/server/serverpb/status.proto, line 649 at r2 (raw file):

Previously, knz (kena) wrote…

I don't think requesters will ever use "local" so you can keep an int here (specifically, an int32 and use gogo.casttype to roachpb.NodeID).

Done.


pkg/server/serverpb/status.proto, line 652 at r2 (raw file):

Previously, knz (kena) wrote…

see discussion at top: probably worth implementing this as uint64 and keep the specifics encapsulated where the node/secret pair type is implemented in SQL.

Done.


pkg/server/serverpb/status.proto, line 658 at r2 (raw file):

Previously, otan (Oliver Tan) wrote…

mm, up to you.

Done.


pkg/server/serverpb/status.proto, line 1214 at r2 (raw file):

Previously, knz (kena) wrote…

Yes indeed, this functionality does not warrant an HTTP endpoint.

Done.


pkg/server/serverpb/status.proto, line 853 at r10 (raw file):

Previously, knz (kena) wrote…

Would it make sense to break down the fields of the int64 into sub-fields? It may make the RPC code easier to read.

To me the term "backend keydata" is a pgwire-specific concept. Do we need to propagate it internally through the cluster?

I could even argue it does not need to propagate to the sql package. The more narrowly we can scope the uses of the magic logic to the new pgwirecancel package, the more flexibility we'll have later to change or add new ways to use cancellation.

The main reason for my concern is that the cancellation key is the only part that deserves to be (and should remain) opaque everywhere other than the code that checks it.
The sql instance ID, in contrast, should be prominent. We'll want the code that organizes RPCs (and pod-to-pod conns), which needs to be maintained over time by non-SQL experts, to have all the information it needs (the SQL instance ID for routing) and not get "polluted" by concerns the maintainer doesn't need to worry about.

Done.


pkg/sql/pgwire/server.go, line 763 at r3 (raw file):

Previously, otan (Oliver Tan) wrote…

maybe worth a comment

Done.

@knz (Contributor) left a comment:

:lgtm: with nits, you can do metrics here or in a followup PR.

Reviewed 1 of 15 files at r13.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, @otan, and @rafiss)


pkg/server/tenant_status.go, line 253 at r13 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

Why do you use a semaphore both on the ingress and internally? Why is a semaphore only on the ingress insufficient?

I tried to explain that in the comment: "The destination node is protected so that if an attacker spams all the nodes in the cluster with requests that all go to the same node, this semaphore will prevent them from having too many guesses." More concretely, suppose you have a 100-node cluster. If you limit the ingress only, then each node is limited to processing X requests per second. But if an attacker crafts the CancelRequests to all target the same SQLInstance, then that one instance would have to process 100*X requests per second.

I don't think the proposal to Acquire and Wait on the inter-node RPC would address this concern.

Can you add the example to the comment, thanks.


pkg/sql/exec_util.go, line 1884 at r13 (raw file):

	id ClusterWideID, queryCancelKey pgwirecancel.BackendKeyData, s registrySession,
) {
	r.Lock()

nit: defer unlock, in case there's an error accessing the maps


pkg/sql/exec_util.go, line 1891 at r13 (raw file):

func (r *SessionRegistry) deregister(id ClusterWideID, queryCancelKey pgwirecancel.BackendKeyData) {
	r.Lock()

ditto


pkg/sql/pgwire/server.go, line 771 at r13 (raw file):

	var resp *serverpb.CancelQueryByKeyResponse
	resp, err = s.execCfg.SQLStatusServer.CancelQueryByKey(ctx, req)
	if err == nil && len(resp.Error) > 0 {
if len(resp.Error) > 0 {
   err =  errors.CombineErrors(err, errors.Newf("...", resp.Error))
}

@rafiss rafiss requested a review from matthewtodd February 9, 2022 20:04
@rafiss (Collaborator, Author) commented Feb 9, 2022

cc @matthewtodd: no need to review, but you may be interested in the last commit of this PR, since it relates to the cancel endpoints in the status servers. See the RFC or DM me if you need more context.

@rafiss (Collaborator, Author) left a comment:

Dismissed @knz from 2 discussions.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis, @knz, @matthewtodd, and @otan)


pkg/server/tenant_status.go, line 253 at r13 (raw file):

Previously, knz (kena) wrote…

Can you add the example to the comment, thanks.

done


pkg/sql/exec_util.go, line 1884 at r13 (raw file):

Previously, knz (kena) wrote…

nit: defer unlock, in case there's an error accessing the maps

done


pkg/sql/exec_util.go, line 1891 at r13 (raw file):

Previously, knz (kena) wrote…

ditto

Done.


pkg/sql/pgwire/server.go, line 771 at r13 (raw file):

Previously, knz (kena) wrote…
if len(resp.Error) > 0 {
   err =  errors.CombineErrors(err, errors.Newf("...", resp.Error))
}

fixed

This is an ID used for pgwire query cancellation. In Postgres it
contains a process ID and a secret ID. Since we don't have process IDs,
we use a SQLInstanceID instead, which will also allow us to forward these
requests to the correct node.

To maximize the number of bits used for the secret, we first check if
the SQLInstanceID can fit into 11 bits. If not, the leading bit of the
BackendKeyData is set, and the remaining 31 bits are used for the
SQLInstanceID.

We introduce another map in the session registry that is keyed by the
BackendKeyData.

No release note since this doesn't yet cause a user-facing change -- we
previously were sending a BackendKeyData of 0 all the time.

Release note: None
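
A sketch consistent with the layout described above (the real type lives in the pgwirecancel package and may differ in details; the names here are illustrative):

package keysketch // hypothetical package; see pkg/sql/pgwirecancel for the real one

import "math/rand"

// BackendKeyData packs a SQLInstanceID and a random secret into 64 bits.
type BackendKeyData uint64

const smallMaxID = 1<<11 - 1 // largest SQLInstanceID that fits the short form

// MakeBackendKeyData builds a key. If the instance ID fits in 11 bits, the
// leading bit stays 0 and 52 bits are left for the secret; otherwise the
// leading bit is set, 31 bits hold the instance ID, and 32 bits are secret.
func MakeBackendKeyData(instanceID int32, rng *rand.Rand) BackendKeyData {
	if instanceID <= smallMaxID {
		return BackendKeyData(uint64(instanceID)<<52 | rng.Uint64()&(1<<52-1))
	}
	return BackendKeyData(1<<63 | uint64(instanceID)<<32 | uint64(rng.Uint32()))
}

// SQLInstanceID recovers the instance ID so a cancel request can be routed.
func (b BackendKeyData) SQLInstanceID() int32 {
	if b>>63 == 0 {
		return int32(b >> 52) // short form: top bit clear, 11-bit ID
	}
	return int32(b >> 32 & (1<<31 - 1)) // long form: strip the leading bit
}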
@rafiss rafiss force-pushed the pgwire-cancel branch 2 times, most recently from e249ed1 to 306bb9d Compare February 9, 2022 23:21
This will be used by the pgwire query cancellation protocol. Nothing in
this commit uses the endpoint and it is not implemented yet, so there
is no release note.

Release note: None
Release note (sql change): Added support for query cancellation via the
pgwire protocol. CockroachDB will now respond to a pgwire cancellation
by forwarding the request to the node that is running a particular
query. That node will then cancel the query that is currently running in
the session identified by the cancel request. The cancel request is made
through the pgwire protocol when initializing a new connection. The
client must first send 32 bits containing the integer 80877102, followed
immediately by the 64-bit BackendKeyData message that the server sent to
the client when the session was started. Most Postgres drivers handle this
protocol already, so there's nothing for the end-user to do apart from
calling the "cancel" function that their driver offers.

See https://www.postgresql.org/docs/13/protocol-flow.html#id-1.10.5.7.9

Release note (security update): Added support for query cancellation via
the pgwire protocol. Since this protocol is unauthenticated, there are a
few precautions included.

(1) The protocol requires that a 64-bit key is used to uniquely identify
a session. Some of these bits are used to identify the CockroachDB node
that owns the session. The rest of the bits are all random. If the node
ID is small enough, then only 12 bits are used for the ID, and the
remaining 52 bits are random. Otherwise, 32 bits are used for both the
ID and the random secret.

(2) A fixed per-node rate limit is used. There can only be at most 256
failed cancellation attempts per second. Any other cancel requests that
exceed this rate are ignored. This makes it harder for an attacker to
guess random cancellation keys. Specifically, if we assume a 32-bit
secret and 256 concurrent sessions on a node, it would take 2^16 seconds
(about 18 hours) for an attacker to be certain they have cancelled a
query.

(3) No response is returned for a cancel request. This makes it
impossible for an attacker to know if their guesses are working. There
are internal warning logs for unsuccessful attempts. Large numbers
of these messages could indicate malicious activity.
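
(For reference, the arithmetic behind the 18-hour figure in point (2): with 256 sessions on a node there are 256 valid secrets out of 2^32 possible values, so an attacker needs on the order of 2^32 / 2^8 = 2^24 guesses to hit one, and at 256 guesses per second that is 2^24 / 2^8 = 2^16 seconds, roughly 18.2 hours.)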
@rafiss rafiss requested a review from a team as a code owner February 10, 2022 04:25
@rafiss rafiss requested review from miretskiy and removed request for a team February 10, 2022 04:25
@knz (Contributor) left a comment:

Reviewed 2 of 6 files at r15, 5 of 16 files at r16.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @knz, @matthewtodd, @miretskiy, @otan, and @rafiss)


pkg/server/status.go, line 2472 at r16 (raw file):

	client, err := s.dialNode(ctx, roachpb.NodeID(req.SQLInstanceID))
	if err != nil {
		return nil, err

return nil, serverError(ctx, err)


pkg/server/tenant_status.go, line 295 at r16 (raw file):

	statusClient, err := t.dialPod(ctx, req.SQLInstanceID, instance.InstanceAddr)
	if err != nil {
		return nil, err

return nil, serverError(ctx, err)

@rafiss rafiss removed the request for review from miretskiy February 10, 2022 14:05
@rafiss (Collaborator, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @jordanlewis, @knz, @matthewtodd, and @otan)


pkg/server/status.go, line 2472 at r16 (raw file):

Previously, knz (kena) wrote…

return nil, serverError(ctx, err)

i can't find the serverError function. the other endpoints in this file aren't calling it either.


pkg/server/tenant_status.go, line 295 at r16 (raw file):

Previously, knz (kena) wrote…

return nil, serverError(ctx, err)

ditto - i can't find the serverError function. the other endpoints in this file aren't calling it either.

@rafiss (Collaborator, Author) commented Feb 10, 2022

thanks for the reviews and discussion all!

bors r=knz

@craig (bot) commented Feb 10, 2022

🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

@craig (bot) commented Feb 10, 2022

Build succeeded:
