kvserver: separate Raft RPC connection class #111262

Closed
erikgrinaker opened this issue Sep 26, 2023 · 6 comments · Fixed by #117385
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Sep 26, 2023

Currently, most RPC traffic is sent over a single TCP connection (the default class). Unrelated streams on this connection can be subject to head-of-line blocking under heavy load, often due to RPC processing latency rather than TCP processing.

We should consider splitting out a separate connection class for Raft traffic, and potentially one for snapshot traffic as well, to reduce interference with foreground RPC traffic. This might also, to a small extent, mitigate the impact of the bandwidth-delay product on high-latency links (see #111241) and of packet loss (see #110099).
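
Each connection class maps to its own gRPC (and hence TCP) connection per peer, so streams in one class can't head-of-line block another. A minimal sketch of per-class connection pooling, with hypothetical names (the actual rpc.Context plumbing is more involved):

package rpcsketch

import (
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// ConnectionClass selects which TCP connection to a peer an RPC uses.
type ConnectionClass int8

const (
	DefaultClass ConnectionClass = iota // most foreground RPC traffic
	SystemClass                         // low-latency system-range traffic
	RaftClass                           // proposed here: bulk raft traffic
)

// pool keeps one gRPC connection per (address, class) pair.
type pool struct {
	mu    sync.Mutex
	conns map[string]map[ConnectionClass]*grpc.ClientConn
}

// dial returns the connection for (addr, class), dialing it on first use.
// Each class gets its own TCP connection, so heavy raft traffic on
// RaftClass can no longer head-of-line block DefaultClass RPCs.
func (p *pool) dial(addr string, class ConnectionClass) (*grpc.ClientConn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.conns == nil {
		p.conns = make(map[string]map[ConnectionClass]*grpc.ClientConn)
	}
	if p.conns[addr] == nil {
		p.conns[addr] = make(map[ConnectionClass]*grpc.ClientConn)
	}
	if conn := p.conns[addr][class]; conn != nil {
		return conn, nil
	}
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	p.conns[addr][class] = conn
	return conn, nil
}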

Jira issue: CRDB-31842

Epic CRDB-32846

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Sep 26, 2023
@blathers-crl

blathers-crl bot commented Sep 26, 2023

cc @cockroachdb/replication

@pav-kv
Collaborator

pav-kv commented Jan 4, 2024

Currently, raft messages for a Replica are sent under SystemClass or DefaultClass, which is determined from the range's key span.

We can introduce a separate RaftClass that is independent of the range; RaftTransport would then operate only under this class instead of across all classes.

Do we want to preserve the system/default distinction at this level, though, to separate raft traffic for system ranges vs other ranges? I.e. do we want just a single RaftClass, or do we add some combinatorics and have SystemRaftClass and DefaultRaftClass?

In the latter case, do we want to replace the flat enum with a bitmask so that more combinations are possible?

// ConnectionClass is a bitmask:
//	- 1 bit for system/default distinction (or more, if we want some QoS notion)
//	- 2 bits for ConnectionStream (or more if we want to "reserve" for more types)
//	- other bits can be used for some extra sub-ConnectionStream sharding
type ConnectionClass int8

// ConnectionStream is the "logical" class for the connection.
// Can be Rangefeed, Raft, Snapshot, or "all other" RPCs.
type ConnectionStream int8

func (c ConnectionClass) IsSystem() bool {
	return int8(c) & 1 == 0
}

func (c ConnectionClass) Stream() ConnectionStream {
	return ConnectionStream((int8(c) >> 1) & 3)
}

// Alternatively:
//
// func (c ConnectionClass) QOS() ConnectionQOS {
// 	return ConnectionQOS(int8(c) & (1<<qosBits - 1))
// }
//
// func (c ConnectionClass) Stream() ConnectionStream {
// 	return ConnectionStream((int8(c) >> qosBits) & 3)
// }

... // other helpers
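
For illustration, composing and decomposing such a bitmask could look like the following; MakeConnectionClass and the stream constants are hypothetical, building directly on the sketch above:

// Hypothetical stream values occupying bits 1-2, matching Stream() above.
const (
	DefaultStream ConnectionStream = iota
	RaftStream
	SnapshotStream
	RangefeedStream
)

// MakeConnectionClass packs the system/default bit and a stream into a class.
// Bit 0 is 0 for system traffic, matching IsSystem() above.
func MakeConnectionClass(system bool, stream ConnectionStream) ConnectionClass {
	c := int8(stream) << 1 // stream occupies bits 1-2
	if !system {
		c |= 1
	}
	return ConnectionClass(c)
}

// For example, raft traffic for non-system ranges:
//
//	c := MakeConnectionClass(false /* system */, RaftStream)
//	c.IsSystem() == false && c.Stream() == RaftStream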

@erikgrinaker @nvanbenschoten

@pav-kv
Collaborator

pav-kv commented Jan 4, 2024

The current ConnectionClass type mixes 2 things together (QoS and connection sharding). I think we should decouple that either way.

@erikgrinaker
Contributor Author

We'll need to keep using SystemClass for traffic that needs low latency, i.e. the liveness ranges.

The current ConnectionClass type mixes 2 things together (QoS and connection sharding). I think we should decouple that either way.

How? I'm not aware of a QoS mechanism in gRPC, and TCP has limited support for this anyway (for a single connection). See e.g. grpc/grpc-go#1448.

@pav-kv
Collaborator

pav-kv commented Jan 5, 2024

So we have another option:

  1. Use SystemClass for raft traffic on the system ranges.
  2. Use RaftClass for raft traffic on non-system ranges (instead of DefaultClass), as sketched below.
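
A minimal sketch of that selection, assuming a flat ConnectionClass enum with a new RaftClass value (names illustrative):

type ConnectionClass int8

const (
	DefaultClass ConnectionClass = iota
	SystemClass
	RaftClass
)

// classForRaftTraffic picks the class for a range's raft messages.
// isSystemRange would be derived from the range's key span, as today.
func classForRaftTraffic(isSystemRange bool) ConnectionClass {
	if isSystemRange {
		return SystemClass // keep the low-latency path for system ranges
	}
	return RaftClass // isolate bulk raft traffic from foreground RPCs
}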

Are we happy if all the system traffic uses the same class regardless of the traffic type, i.e. if all the system ranges send raft, snapshots, rangefeeds, etc. via the same SystemClass connection?

It would be good to understand how exactly the classes differ, and whether the distinction is single- or multi-dimensional. My current impression is that it is multi-dimensional.

I don't mean a concrete QoS/priority mechanism; we don't use any today. Logically, though, the "system" class feels like "high priority", while all other classes feel like "default priority": they differ only in the type of traffic they process (e.g. they have different parameters) and provide some isolation/fairness across classes.

For example, imagine we want to separate and prioritize system rangefeeds from non-system rangefeeds. Do we introduce a SystemRangefeedClass in addition to RangefeedClass, or do we just use SystemClass? But in the latter case, how do we work around the difference in gRPC settings between RangefeedClass and SystemClass?

@erikgrinaker
Contributor Author

So we have another option:

  • Use SystemClass for raft traffic on the system ranges.
  • Use RaftClass for raft traffic on non-system ranges (instead of DefaultClass).

Yes, I think this seems reasonable. Note that Raft heartbeats are always sent across SystemClass for all ranges, via the heartbeat coalescing mechanism.
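
As a rough sketch of what that coalescing means (hypothetical types; the real implementation batches per destination store, flushes on the raft tick, and is mutex-protected):

// rangeHeartbeat is a per-range heartbeat, stripped to a few fields.
type rangeHeartbeat struct {
	RangeID uint64
	Term    uint64
	Commit  uint64
}

// coalescer buffers heartbeats per destination node ID.
type coalescer struct {
	pending map[int32][]rangeHeartbeat
}

func (c *coalescer) enqueue(nodeID int32, hb rangeHeartbeat) {
	if c.pending == nil {
		c.pending = make(map[int32][]rangeHeartbeat)
	}
	c.pending[nodeID] = append(c.pending[nodeID], hb)
}

// flush sends one batched message per node over the SystemClass
// connection, then clears the buffers. One message per node per tick is
// what keeps SystemClass low volume regardless of range count.
func (c *coalescer) flush(send func(nodeID int32, hbs []rangeHeartbeat)) {
	for nodeID, hbs := range c.pending {
		send(nodeID, hbs)
		delete(c.pending, nodeID)
	}
}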

It would be good to understand how exactly the classes differ

It's basically "anything that's important and needs low latency should go via SystemClass". This is a poor man's QoS scheme, which relies on SystemClass being low volume, and thus low latency. We likely won't gain anything by splitting it up any further, because it's already low volume. I'm not aware of any escalations that were caused by SystemClass connection contention.

If we want to do this properly, we should use QUIC and assign per-stream priorities that translate to actual QoS priorities through the networking stack. That would allow us to specify QoS priorities down to individual gRPC requests, without head-of-line blocking. But that's not practical in the near term, and until then we should just be pragmatic about this -- we won't be able to design a good QoS scheme without proper protocol support.

imagine we want to separate and prioritize system rangefeeds from non-system rangefeeds. Do we introduce a SystemRangefeedClass in addition to RangefeedClass, or do we just use SystemClass?

We should just use SystemClass.

But in the latter case, how do we work around the difference in gRPC settings between RangefeedClass and SystemClass?

These settings were a kludge due to the number of gRPC streams we needed to set up with the legacy rangefeed protocol (one per range, which can blow up to >100k streams, each with its own 2 MB buffer, which can cause OOMs). In 24.1 we unconditionally use the mux rangefeed protocol, which uses only a single stream per client/node pair. We can possibly get rid of those settings entirely in 24.1.
