kvserver: separate Raft RPC connection class #111262

Closed
erikgrinaker opened this issue Sep 26, 2023 · 6 comments · Fixed by #117385
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Sep 26, 2023

Currently, most RPC traffic is sent over a single TCP connection (the default class). Unrelated streams on this connection can be subject to head-of-line blocking under heavy load, often due to RPC processing latency rather than TCP processing.

We should consider splitting out a separate connection class for Raft traffic, and potentially one for snapshot traffic as well, to reduce interference with foreground RPC traffic. This might also, to a small extent, mitigate the impact of the bandwidth-delay product on high-latency links (see #111241) and of packet loss (see #110099).
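
Each connection class maps to its own gRPC (and hence TCP) connection per peer, so streams in one class can't head-of-line block another. A minimal sketch of per-class connection pooling, with hypothetical names (the actual rpc.Context plumbing is more involved):

package rpcsketch

import (
	"sync"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// ConnectionClass selects which TCP connection to a peer an RPC uses.
type ConnectionClass int8

const (
	DefaultClass ConnectionClass = iota // most foreground RPC traffic
	SystemClass                         // low-latency system-range traffic
	RaftClass                           // proposed here: bulk raft traffic
)

// pool keeps one gRPC connection per (address, class) pair.
type pool struct {
	mu    sync.Mutex
	conns map[string]map[ConnectionClass]*grpc.ClientConn
}

// dial returns the connection for (addr, class), dialing it on first use.
// Each class gets its own TCP connection, so heavy raft traffic on
// RaftClass can no longer head-of-line block DefaultClass RPCs.
func (p *pool) dial(addr string, class ConnectionClass) (*grpc.ClientConn, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.conns == nil {
		p.conns = make(map[string]map[ConnectionClass]*grpc.ClientConn)
	}
	if p.conns[addr] == nil {
		p.conns[addr] = make(map[ConnectionClass]*grpc.ClientConn)
	}
	if conn := p.conns[addr][class]; conn != nil {
		return conn, nil
	}
	conn, err := grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		return nil, err
	}
	p.conns[addr][class] = conn
	return conn, nil
}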

Jira issue: CRDB-31842

Epic CRDB-32846

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Sep 26, 2023
@blathers-crl

blathers-crl bot commented Sep 26, 2023

cc @cockroachdb/replication

@pav-kv
Collaborator

pav-kv commented Jan 4, 2024

Currently, raft messages for a Replica are sent under SystemClass or DefaultClass, which is determined from the range's key span.

We can introduce a separate RaftClass that is independent of the range; RaftTransport would then operate only under this class instead of across all classes.

Do we want to preserve the system/default distinction at this level, though, to separate raft traffic for system ranges vs other ranges? I.e. do we want just a single RaftClass, or do we add some combinatorics and have SystemRaftClass and DefaultRaftClass?

In the latter case, do we want to replace the flat enum with a bitmask so that more combinations are possible?

// ConnectionClass is a bitmask:
//	- 1 bit for system/default distinction (or more, if we want some QoS notion)
//	- 2 bits for ConnectionStream (or more if we want to "reserve" for more types)
//	- other bits can be used for some extra sub-ConnectionStream sharding
type ConnectionClass int8

// ConnectionStream is the "logical" class for the connection.
// Can be Rangefeed, Raft, Snapshot, or "all other" RPCs.
type ConnectionStream int8

func (c ConnectionClass) IsSystem() bool {
	return int8(c) & 1 == 0
}

func (c ConnectionClass) Stream() ConnectionStream {
	return ConnectionStream((int8(c) >> 1) & 3)
}

// Alternatively:
//
// func (c ConnectionClass) QOS() ConnectionQOS {
// 	return ConnectionQOS(int8(c) & (1<<qosBits - 1))
// }
//
// func (c ConnectionClass) Stream() ConnectionStream {
// 	return ConnectionStream((int8(c) >> qosBits) & 3)
// }

... // other helpers
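
For illustration, composing and decomposing such a bitmask could look like the following; MakeConnectionClass and the stream constants are hypothetical, building directly on the sketch above:

// Hypothetical stream values occupying bits 1-2, matching Stream() above.
const (
	DefaultStream ConnectionStream = iota
	RaftStream
	SnapshotStream
	RangefeedStream
)

// MakeConnectionClass packs the system/default bit and a stream into a class.
// Bit 0 is 0 for system traffic, matching IsSystem() above.
func MakeConnectionClass(system bool, stream ConnectionStream) ConnectionClass {
	c := int8(stream) << 1 // stream occupies bits 1-2
	if !system {
		c |= 1
	}
	return ConnectionClass(c)
}

// For example, raft traffic for non-system ranges:
//
//	c := MakeConnectionClass(false /* system */, RaftStream)
//	c.IsSystem() == false && c.Stream() == RaftStream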

@erikgrinaker @nvanbenschoten

@pav-kv
Collaborator

pav-kv commented Jan 4, 2024

The current ConnectionClass type mixes 2 things together (QoS and connection sharding). I think we should decouple that either way.

@erikgrinaker
Contributor Author

We'll need to keep using SystemClass for traffic that needs low latency, i.e. the liveness ranges.

The current ConnectionClass type mixes 2 things together (QoS and connection sharding). I think we should decouple that either way.

How? I'm not aware of a QoS mechanism in gRPC, and TCP has limited support for this anyway (for a single connection). See e.g. grpc/grpc-go#1448.

@pav-kv
Collaborator

pav-kv commented Jan 5, 2024

So we have another option:

  1. Use SystemClass for raft traffic on the system ranges.
  2. Use RaftClass for raft traffic on non-system ranges (instead of DefaultClass), as sketched below.
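
A minimal sketch of that selection, assuming a flat ConnectionClass enum with a new RaftClass value (names illustrative):

type ConnectionClass int8

const (
	DefaultClass ConnectionClass = iota
	SystemClass
	RaftClass
)

// classForRaftTraffic picks the class for a range's raft messages.
// isSystemRange would be derived from the range's key span, as today.
func classForRaftTraffic(isSystemRange bool) ConnectionClass {
	if isSystemRange {
		return SystemClass // keep the low-latency path for system ranges
	}
	return RaftClass // isolate bulk raft traffic from foreground RPCs
}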

Are we happy if all the system traffic uses the same class regardless of the traffic type, i.e. if all the system ranges send raft, snapshots, rangefeeds, etc. via the same SystemClass connection?

It would be good to understand how exactly the classes differ, and whether the distinction is single- or multi-dimensional. My current impression is that it is multi-dimensional.

I don't mean a concrete QoS/priority mechanism; we don't use any today. Logically, though, the "system" class feels like "high priority", while all other classes feel like "default priority": they differ only in the type of traffic they process (e.g. they have different parameters) and provide some isolation/fairness across classes.

For example, imagine we want to separate and prioritize system rangefeeds from non-system rangefeeds. Do we introduce a SystemRangefeedClass in addition to RangefeedClass, or do we just use SystemClass? But in the latter case, how do we work around the difference in gRPC settings between RangefeedClass and SystemClass?

@erikgrinaker
Contributor Author

So we have another option:

  • Use SystemClass for raft traffic on the system ranges.
  • Use RaftClass for raft traffic on non-system ranges (instead of DefaultClass).

Yes, I think this seems reasonable. Note that Raft heartbeats are always sent across SystemClass for all ranges, via the heartbeat coalescing mechanism.
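
As a rough sketch of what that coalescing means (hypothetical types; the real implementation batches per destination store, flushes on the raft tick, and is mutex-protected):

// rangeHeartbeat is a per-range heartbeat, stripped to a few fields.
type rangeHeartbeat struct {
	RangeID uint64
	Term    uint64
	Commit  uint64
}

// coalescer buffers heartbeats per destination node ID.
type coalescer struct {
	pending map[int32][]rangeHeartbeat
}

func (c *coalescer) enqueue(nodeID int32, hb rangeHeartbeat) {
	if c.pending == nil {
		c.pending = make(map[int32][]rangeHeartbeat)
	}
	c.pending[nodeID] = append(c.pending[nodeID], hb)
}

// flush sends one batched message per node over the SystemClass
// connection, then clears the buffers. One message per node per tick is
// what keeps SystemClass low volume regardless of range count.
func (c *coalescer) flush(send func(nodeID int32, hbs []rangeHeartbeat)) {
	for nodeID, hbs := range c.pending {
		send(nodeID, hbs)
		delete(c.pending, nodeID)
	}
}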

It would be good to understand how exactly the classes differ

It's basically "anything that's important and needs low latency should go via SystemClass". This is a poor man's QoS scheme, which relies on SystemClass being low volume, and thus low latency. We likely won't gain anything by splitting it up any further, because it's already low volume. I'm not aware of any escalations that were caused by SystemClass connection contention.

If we want to do this properly, we should use QUIC and assign per-stream priorities that translate to actual QoS priorities through the networking stack. That would allow us to specify QoS priorities down to individual gRPC requests, without head-of-line blocking. But that's not practical in the near term, and until then we should just be pragmatic about this -- we won't be able to design a good QoS scheme without proper protocol support.

imagine we want to separate and prioritize system rangefeeds from non-system rangefeeds. Do we introduce a SystemRangefeedClass in addition to RangefeedClass, or do we just use SystemClass?

We should just use SystemClass.

But in the latter case, how do we work around the difference in gRPC settings between RangefeedClass and SystemClass?

These settings were a kludge due to the number of gRPC streams we needed to set up with the legacy rangefeed protocol (one per range, which can blow up to >100k streams, each with its own 2 MB buffer, which can cause OOMs). In 24.1 we unconditionally use the mux rangefeed protocol, which uses only a single stream per client/node pair. We can possibly get rid of those settings entirely in 24.1.
