changefeed,kv: Excessive memory usage when using rangefeeds. #74219

miretskiy · 2021-12-22T20:12:16Z

This is technically a rangefeed issue as the problem clearly lies in the rangefeed subsystem.

However, it is only applicable to the changefeed system at this point (but will almost certainly impact replication streaming as well). Other users of rangefeeds use them on small system tables and are unlikely to experience this issue.

The crux of the problem is that running rangefeed against substantially sized table (thousands of ranges per node) can result in OOMs.

When establishing rangefeed streaming rpc, each range is treated as a separate stream. Each stream is configured to use
2MB initial window size (pkg/rpc/context.go). The "initial" name is a bit of a misnomer since using any window size above gRPC default of 65K turns off dynamic window sizing -- so, in effect, this is an upper bound on the amount of memory which can be used by an RPC stream (lots of historical context can be found here: #35161).

Critically though the memory used by gRPC stream is not tracked by any memory monitors.
This memory is "inflight" and the consumer of RPC stream is expected to consume this data quickly. However, no matter how fast the consumer is, it is possible that in a sufficiently sized system, just the mere fact of establishing a rangefeed could result in an OOM. For example, consider a 31 node cluster where a single node wants to talk to all other nodes, each containing 500 ranges. This would result in 30->1 fan-in scenario, where if those 500 ranges per node had enough data to send, we could wind up attempting to ingest 30GB of data very rapidly. Pre-reserving 30GB of memory on any memory monitor is excessive -- and probably incorrect.

The second issue, applicable to changefeeds (and soon to replication streaming) is the fact that these system emit data downstream to external system -- possibly over high latency links. Changefeed has an effective pushback mechanism (which uses memory monitor) to deal with slow sinks. Unfortunately, as described above, if the consumer of the RPC stream is not consuming, then there will be 2MB worth of data buffered by the stream itself. From the dist_sender_rangefeed.go:

for {
			event, err := stream.Recv()
			...
			select {
			case eventCh <- event:
			...
			}
		}

When emission to events channel blocks -- for any reason, the stream still keeps receiving the data, up to 2MB.

The above observations were verified by writing up a unit test that emitted 4MB of data into each of the 100 ranges, and capturing heap profiles.

As expected, 200 MB (2MB per stream) are allocated by http2 client transport.

Lowering initial window size to 64K+1 (plus 1 byte to ensure hard limit) results in 6MB:

A full solution to this problem is expected not to be backported. So, we'll start off with a partial solution that
will be backportable.

We need to add a client side connection class specifically for rangefeed. This connection class will use much lower initial window size. Perhaps 196K (hopefully making OOMs 10x less likely); or even 65K+1 -- a 30x reduction.

A real solution is to ensure that each client rangefeed call (which can span many ranges) establishes 1 stream per node and multiplex all of feeds on that single stream so that per-stream pushback mechanism is effective. That is, each changefeed will establish 1 stream per node instead of 1 stream per range.

Solutions that probably don't solve anything:

Let's have changefeeds disconnect when they enter pushback. This doesn't solve anything because resume w/ catchup scan could hit us with more data even more rapidly. Changefeed disconnect is probably too late anyway.
Let's have connection level limits and pushback: GRPC explicitly removed connection level limits. All pushback is per stream.

Jira issue: CRDB-11977
Epic CRDB-23738

The text was updated successfully, but these errors were encountered:

blathers-crl · 2021-12-22T20:12:18Z

cc @cockroachdb/cdc

74222: kv,rpc: Introduce dedicated rangefeed connection class. r=miretskiy a=miretskiy Rangefeeds executing against very large tables may experience OOMs under certain conditions. The underlying reason for this is that cockroach uses 2MB per gRPC stream buffer size to improve system throughput (particularly under high latency scenarios). However, because the rangefeed against a large table establishes a dedicated stream for each range, the possiblity of an OOM exist if there are many streams from which events are not consumed quickly enough. This PR does two things. First, it introduces a dedicated RPC connection class for use with rangefeeds. The rangefeed connnection class configures rangefeed streams to use less memory per stream than the default connection class. The initial window size for rangefeed connection class can be adjusted by updating `COCKROACH_RANGEFEED_RPC_INITIAL_WINDOW_SIZE` environment variable whose default is 128KB. For rangefeeds to use this new connection class, a `kv.rangefeed.use_dedicated_connection_class.enabled` setting must be turned on. Another change in this PR is that the default RPC window size can be adjusted via `COCKROACH_RPC_INITIAL_WINDOW_SIZE` environment variable. The default for this variable is kept at 2MB. Changing the values of either of those variables is an advanced operation and should not be taken lightly since changes to those variables impact all aspects of cockroach performance characteristics. Values larger than 2MB will be trimmed to 2MB. Setting either of those to 64KB or below will turn on "dynamic window" gRPC window sizes. It should be noted that this change is a partial mitigation, and not a complete fix. A full fix will require rework of rangefeed behavior, and will be done at a later time. An alternative to introduction of a dedicated connection class was to simply lower the default connection window size. However, such change would impact all RPCs in a system, and it was deemed too risky at this point. Substantial benchmarking is needed before such change. Informs #74219 Release Notes (performance improvement): Allow rangefeed streams to use separate http connection when `kv.rangefeed.use_dedicated_connection_class.enabled` setting is turned on. Using separate connection class reduces the possiblity of OOMs when running rangefeeds against very large tables. The connection window size for rangefeed can be adjusted via `COCKROACH_RANGEFEED_INITIAL_WINDOW_SIZE` environment variable, whose default is 128KB. Co-authored-by: Yevgeniy Miretskiy <[email protected]>

mwang1026 · 2022-01-11T22:17:44Z

@miretskiy @amruss removing kv since you seem to be handling this

As demonstrated in cockroachdb#74219, running rangefeed over large enough table or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow up, including up to OOMing the node. Since rangefeeds tend to run on (almost) every node, a possiblity exists that a entire cluster might get distabilized. One of the reasons for this is that the memory used by low(er) level http2 RPC transport is not currently being tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large. This PR introduce a new bi-directional streaming RPC called `RangeFeedStream`. In a uni-directonal streaming RPC, the client estabilishes connection to the KV server and requests to receive events for 1 span. A separate RPC stream is created for each span. In contrast, `RangeFeedStream` is bi-directional RPC: the client is expected to connect to the kv server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end. Note: just like in a uni-directional implementation, each call to `RangeFeedStream` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implementation as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to the `number_of_nodes`, such optimization can be added at a later time. Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such stream from the `number_of_ranges` down to `number_of_nodes` (per logical range feed). This, in turn, enables the client to reserve the worst case amount of memory before starting the rangefeed. This amount of memory is bound by the number of nodes in the cluster times per-stream maximum memory -- default is `2MB`, but this can be changed via dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow up via http2 pushback, we may want to also manage server side memory better. Memory accounting for client (and possible server) will be added in the follow on PRs. The new RPC is turned on by default since the intention is to retire the uni-directional rangefeed during later release. However, this default maybe disable (and rangefeed reverted to use old implementation) by setting `COCKROACH_ENABLE_STREAMING_RANGEFEED` environment variable to `false` and restarting the cluster. Release Notes (enteriprise change): Introduce a bi-directional implementation of range feed called `RangeFeedStream`. Range feeds are used by multiple subsystems including changefeeds. The bi-directional implementation provides a better mechanism to monitor memory used by range feeds. The rangefeed implementation can be revered back to its old implementation via restarting the clluster with the `COCKROACH_ENABLE_STREAMING_RANGEFEED` environment variable set to false.

As demonstrated in cockroachdb#74219, running rangefeed over large enough table or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow up, including up to OOMing the node. Since rangefeeds tend to run on (almost) every node, a possiblity exists that a entire cluster might get distabilized. One of the reasons for this is that the memory used by low(er) level http2 RPC transport is not currently being tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large. This PR introduce a new bi-directional streaming RPC called `MuxRangeFeed`. In a uni-directonal streaming RPC, the client estabilishes connection to the KV server and requests to receive events for 1 span. A separate RPC stream is created for each span. In contrast, `MuxRangeFeed` is bi-directional RPC: the client is expected to connect to the kv server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end. Note: just like in a uni-directional implementation, each call to `MuxRangeFeed` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implementation as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to the `number_of_nodes`, such optimization can be added at a later time. Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such stream from the `number_of_ranges` down to `number_of_nodes` (per logical range feed). This, in turn, enables the client to reserve the worst case amount of memory before starting the rangefeed. This amount of memory is bound by the number of nodes in the cluster times per-stream maximum memory -- default is `2MB`, but this can be changed via dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow up via http2 pushback, we may want to also manage server side memory better. Memory accounting for client (and possible server) will be added in the follow on PRs. The new RPC is turned on by default since the intention is to retire the uni-directional rangefeed during later release. However, this default maybe disable (and rangefeed reverted to use old implementation) by setting `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable to `false` and restarting the cluster. Release Notes (enterprise change) Rangefeeds (which are used internally e.g. by changefeeds) now use a common HTTP/2 stream per client for all range replicas on a node, instead of one per replica. This significantly reduces the amount of network buffer memory usage, which could cause nodes to run out of memory if a client was slow to consume events. The rangefeed implementation can be revered back to its old implementation via restarting the cluster with the `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable set to false.

As demonstrated in cockroachdb#74219, running rangefeed over large enough table or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow up, including up to OOMing the node. Since rangefeeds tend to run on (almost) every node, a possiblity exists that a entire cluster might get distabilized. One of the reasons for this is that the memory used by low(er) level http2 RPC transport is not currently being tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large. This PR introduce a new bi-directional streaming RPC called `MuxRangeFeed`. In a uni-directonal streaming RPC, the client estabilishes connection to the KV server and requests to receive events for 1 span. A separate RPC stream is created for each span. In contrast, `MuxRangeFeed` is bi-directional RPC: the client is expected to connect to the kv server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end. Note: just like in a uni-directional implementation, each call to `MuxRangeFeed` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implementation as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to the `number_of_nodes`, such optimization can be added at a later time. Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such stream from the `number_of_ranges` down to `number_of_nodes` (per logical range feed). This, in turn, enables the client to reserve the worst case amount of memory before starting the rangefeed. This amount of memory is bound by the number of nodes in the cluster times per-stream maximum memory -- default is `2MB`, but this can be changed via dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow up via http2 pushback, we may want to also manage server side memory better. Memory accounting for client (and possible server) will be added in the follow on PRs. The use of new RPC is contingent on the callers providing `WithMuxRangefeed` option when starting the rangefeed. However, an envornmnet variable "kill switch" `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` exists to disable the use of mux rangefeed in case of serious issues being discovered. Release note (enterprise change): Introduce a new Rangefeed RPC called `MuxRangeFeed`. Rangefeeds may now use a common HTTP/2 stream per client for all range replicas on a node, instead of one per replica. This significantly reduces the amount of network buffer memory usage, which could cause nodes to run out of memory if a client was slow to consume events. The caller may opt in to use new mechanism by specifying `WithMuxRangefeed` option when starting the rangefeed. However, a cluster wide `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable may be set to `false` to inhibit the use of this new RPC.

As demonstrated in cockroachdb#74219, running rangefeed over large enough table or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow up, including up to OOMing the node. Since rangefeeds tend to run on (almost) every node, a possiblity exists that a entire cluster might get distabilized. One of the reasons for this is that the memory used by low(er) level http2 RPC transport is not currently being tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large. This PR introduce a new bi-directional streaming RPC called `MuxRangeFeed`. In a uni-directonal streaming RPC, the client estabilishes connection to the KV server and requests to receive events for 1 span. A separate RPC stream is created for each span. In contrast, `MuxRangeFeed` is bi-directional RPC: the client is expected to connect to the kv server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end. Note: just like in a uni-directional implementation, each call to `MuxRangeFeed` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implementation as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to the `number_of_nodes`, such optimization can be added at a later time. Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such stream from the `number_of_ranges` down to `number_of_nodes` (per logical range feed). This, in turn, enables the client to reserve the worst case amount of memory before starting the rangefeed. This amount of memory is bound by the number of nodes in the cluster times per-stream maximum memory -- default is `2MB`, but this can be changed via dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow up via http2 pushback, we may want to also manage server side memory better. Memory accounting for client (and possible server) will be added in the follow on PRs. The use of new RPC is contingent on the callers providing `WithMuxRangefeed` option when starting the rangefeed. However, an envornmnet variable "kill switch" `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` exists to disable the use of mux rangefeed in case of serious issues being discovered. Release note (enterprise change): Introduce a new Rangefeed RPC called `MuxRangeFeed`. Rangefeeds now use a common HTTP/2 stream per client for all range replicas on a node, instead of one per replica. This significantly reduces the amount of network buffer memory usage, which could cause nodes to run out of memory if a client was slow to consume events. The caller may opt in to use new mechanism by specifying `WithMuxRangefeed` option when starting the rangefeed. However, a cluster wide `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable may be set to `false` to inhibit the use of this new RPC.

As demonstrated in cockroachdb#74219, running rangefeed over large enough table or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow up, including up to OOMing the node. Since rangefeeds tend to run on (almost) every node, a possiblity exists that a entire cluster might get distabilized. One of the reasons for this is that the memory used by low(er) level http2 RPC transport is not currently being tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large. This PR introduce a new bi-directional streaming RPC called `MuxRangeFeed`. In a uni-directonal streaming RPC, the client estabilishes connection to the KV server and requests to receive events for 1 span. A separate RPC stream is created for each span. In contrast, `MuxRangeFeed` is bi-directional RPC: the client is expected to connect to the kv server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end. Note: just like in a uni-directional implementation, each call to `MuxRangeFeed` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implementation as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to the `number_of_nodes`, such optimization can be added at a later time. Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such stream from the `number_of_ranges` down to `number_of_nodes` (per logical range feed). This, in turn, enables the client to reserve the worst case amount of memory before starting the rangefeed. This amount of memory is bound by the number of nodes in the cluster times per-stream maximum memory -- default is `2MB`, but this can be changed via dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow up via http2 pushback, we may want to also manage server side memory better. Memory accounting for client (and possible server) will be added in the follow on PRs. The use of new RPC is contingent on the callers providing `WithMuxRangefeed` option when starting the rangefeed. However, an envornmnet variable "kill switch" `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` exists to disable the use of mux rangefeed in case of serious issues being discovered. Release note (enterprise change): Introduce a new Rangefeed RPC called `MuxRangeFeed`. Rangefeeds now use a common HTTP/2 stream per client for all range replicas on a node, instead of one per replica. This significantly reduces the amount of network buffer memory usage, which could cause nodes to run out of memory if a client was slow to consume events. The caller may opt in to use new mechanism by specifying `WithMuxRangefeed` option when starting the rangefeed. However, a cluster wide `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable may be set to `false` to inhibit the use of this new RPC. Release justification:

As demonstrated in cockroachdb#74219, running rangefeed over large enough table or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow up, including up to OOMing the node. Since rangefeeds tend to run on (almost) every node, a possiblity exists that a entire cluster might get destabilized. One of the reasons for this is that the memory used by low(er) level http2 RPC transport is not currently being tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large. This PR introduce a new bi-directional streaming RPC called `MuxRangeFeed`. In a uni-directional streaming RPC, the client establishes connection to the KV server and requests to receive events for 1 span. A separate RPC stream is created for each span. In contrast, `MuxRangeFeed` is bi-directional RPC: the client is expected to connect to the kv server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end. Note: just like in a uni-directional implementation, each call to `MuxRangeFeed` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implementation as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to the `number_of_nodes`, such optimization can be added at a later time. Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such stream from the `number_of_ranges` down to `number_of_nodes` (per logical range feed). This, in turn, enables the client to reserve the worst case amount of memory before starting the rangefeed. This amount of memory is bound by the number of nodes in the cluster times per-stream maximum memory -- default is `2MB`, but this can be changed via dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow up via http2 pushback, we may want to also manage server side memory better. Memory accounting for client (and possible server) will be added in the follow on PRs. The use of new RPC is contingent on the callers providing `WithMuxRangefeed` option when starting the rangefeed. However, an envornmnet variable "kill switch" `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` exists to disable the use of mux rangefeed in case of serious issues being discovered. Release note (enterprise change): Introduce a new Rangefeed RPC called `MuxRangeFeed`. Rangefeeds now use a common HTTP/2 stream per client for all range replicas on a node, instead of one per replica. This significantly reduces the amount of network buffer memory usage, which could cause nodes to run out of memory if a client was slow to consume events. The caller may opt in to use new mechanism by specifying `WithMuxRangefeed` option when starting the rangefeed. However, a cluster wide `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable may be set to `false` to inhibit the use of this new RPC. Release justification:

As demonstrated in cockroachdb#74219, running rangefeed over large enough table or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow up, including up to OOMing the node. Since rangefeeds tend to run on (almost) every node, a possiblity exists that a entire cluster might get destabilized. One of the reasons for this is that the memory used by low(er) level http2 RPC transport is not currently being tracked -- nor do we have a mechanism to add such tracking. cockroachdb#74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large. This PR introduce a new bi-directional streaming RPC called `MuxRangeFeed`. In a uni-directional streaming RPC, the client establishes connection to the KV server and requests to receive events for 1 span. A separate RPC stream is created for each span. In contrast, `MuxRangeFeed` is bi-directional RPC: the client is expected to connect to the kv server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end. Note: just like in a uni-directional implementation, each call to `MuxRangeFeed` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implementation as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to the `number_of_nodes`, such optimization can be added at a later time. Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such stream from the `number_of_ranges` down to `number_of_nodes` (per logical range feed). This, in turn, enables the client to reserve the worst case amount of memory before starting the rangefeed. This amount of memory is bound by the number of nodes in the cluster times per-stream maximum memory -- default is `2MB`, but this can be changed via dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow up via http2 pushback, we may want to also manage server side memory better. Memory accounting for client (and possible server) will be added in the follow on PRs. The use of new RPC is contingent on the callers providing `WithMuxRangefeed` option when starting the rangefeed. However, an envornmnet variable "kill switch" `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` exists to disable the use of mux rangefeed in case of serious issues being discovered. Release note (enterprise change): Introduce a new Rangefeed RPC called `MuxRangeFeed`. Rangefeeds now use a common HTTP/2 stream per client for all range replicas on a node, instead of one per replica. This significantly reduces the amount of network buffer memory usage, which could cause nodes to run out of memory if a client was slow to consume events. The caller may opt in to use new mechanism by specifying `WithMuxRangefeed` option when starting the rangefeed. However, a cluster wide `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable may be set to `false` to inhibit the use of this new RPC. Release justification: Rangefeed scalability and stability improvement. Safe to merge since the functionality disabled by default.

75581: kv: Introduce MuxRangeFeed bi-directional RPC. r=miretskiy a=miretskiy As demonstrated in #74219, running rangefeed over large enough table or perhaps not consuming rangefeed events quickly enough, may cause undesirable memory blow up, including up to OOMing the node. Since rangefeeds t end to run on (almost) every node, a possibility exists that a entire cluster might get destabilized. One of the reasons for this is that the memory used by low(er) level http2 RPC transport is not currently being tracked -- nor do we have a mechanism to add such tracking. #74456 provided partial mitigation for this issue, where a single RangeFeed stream could be restricted to use less memory. Nonetheless, the possibility of excessive memory usage remains since the number of rangefeed streams in the system could be very large. This PR introduce a new bi-directional streaming RPC called `MuxRangeFeed`. In a uni-directional streaming RPC, the client establishes connection to the KV server and requests to receive events for 1 span. A separate RPC stream is created for each span. In contrast, `MuxRangeFeed` is bi-directional RPC: the client is expected to connect to the kv server and request as many spans as it wishes to receive from that server. The server multiplexes all events onto a single stream, and the client de-multiplexes those events appropriately on the receiving end. Note: just like in a uni-directional implementation, each call to `MuxRangeFeed` method (i.e. a logical rangefeed established by the client) is treated independently. That is, even though it is possible to multiplex all logical rangefeed streams onto a single bi-directional stream to a single KV node, this optimization is not currently implementation as it raises questions about scheduling and fairness. If we need to further reduce the number of bi-directional streams in the future, from `number_logical_feeds*number_of_nodes` down to the `number_of_nodes`, such optimization can be added at a later time. Multiplexing all of the spans hosted on a node onto a single bi-directional stream reduces the number of such stream from the `number_of_ranges` down to `number_of_nodes` (per logical range feed). This, in turn, enables the client to reserve the worst case amount of memory before starting the rangefeed. This amount of memory is bound by the number of nodes in the cluster times per-stream maximum memory -- default is `2MB`, but this can be changed via dedicated rangefeed connection class. The server side of the equation may also benefit from knowing how many rangefeeds are running right now. Even though the server is somewhat protected from memory blow up via http2 pushback, we may want to also manage server side memory better. Memory accounting for client (and possible server) will be added in the follow on PRs. The use of new RPC is contingent on the callers providing `WithMuxRangefeed` option when starting the rangefeed. However, an envornmnet variable "kill switch" `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` exists to disable the use of mux rangefeed in case of serious issues being discovered. Release note (enterprise change): Introduce a new Rangefeed RPC called `MuxRangeFeed`. Rangefeeds now use a common HTTP/2 stream per client for all range replicas on a node, instead of one per replica. This significantly reduces the amount of network buffer memory usage, which could cause nodes to run out of memory if a client was slow to consume events. The caller may opt in to use new mechanism by specifying `WithMuxRangefeed` option when starting the rangefeed. However, a cluster wide `COCKROACH_ENABLE_MULTIPLEXING_RANGEFEED` environment variable may be set to `false` to inhibit the use of this new RPC. Release Justification: Rangefeed scalability and stability improvement. Safe to merge since the functionality disabled by default. 84683: obsservice: ingest events r=andreimatei a=andreimatei This patch adds a couple of things: 1. an RPC endpoint to CRDB for streaming out events. This RPC service is called by the Obs Service. 2. a library in CRDB for exporting events. 3. code in the Obs Service for ingesting the events and writing them into a table in the sink cluster. The first use of the event exporting is for the system.eventlog events. All events written to that table are now also exported. Once the Obs Service takes hold in the future, I hope we can remove system.eventlog. The events are represented using OpenTelemetry protos. Unfortunately, I've had to copy over the otel protos to our tree because I couldn't figure out a vendoring solution. Problems encountered for vendoring are: 1. The repo where these protos live (https://github.com/open-telemetry/opentelemetry-proto) is not go-get-able. This means hacks are needed for our vendoring tools. 2. Some of the protos in that repo do not build with gogoproto (they only build with protoc), because they use the new-ish "optional" option on some fields. The logs protos that we use in this patch do not have this problem, but others do (so we'll need to figure something out in the future when dealing with the metrics proto). FWIW, the OpenTelemetry Collector ironically has the same problem (it uses gogoproto), and it solved it through a sed that changes all the optional fields to one-of's. 3. Even if you solve the first two problems, the next one is that we already have a dependency on these compiled protos in our tree (go.opentelemetry.io/proto/otlp). This repo contains generated code, using protoc. We need it because of our use of the otlp trace exporter. Bringing in the protos again, and building them with gogo, results in go files that want to live in the same package/have the same import path. So changing the import paths is needed. Between all of these, I've given up - at least for the moment - and decided to copy over to our tree the few protos that we actually need. I'm also changing their import paths. You'll notice that there is a script that codifies the process of transforming the needed protos from their otel upstream. Release note: None Release justification: Non-production. Co-authored-by: Yevgeniy Miretskiy <[email protected]> Co-authored-by: Andrei Matei <[email protected]>

miretskiy · 2023-01-09T15:15:05Z

Mux rangefeed merged; it is somewhat undertested, but that's not part of this issue.
Closing this issue.

miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture A-kv Anything in KV that doesn't belong in a more specific category. T-cdc labels Dec 22, 2021

blathers-crl bot added the T-kv KV Team label Dec 22, 2021

miretskiy mentioned this issue Dec 22, 2021

kv,rpc: Introduce dedicated rangefeed connection class. #74222

Merged

miretskiy mentioned this issue Jan 5, 2022

release-21.2: kv,rpc: Introduce dedicated rangefeed connection class. #74456

Merged

amruss assigned miretskiy Jan 5, 2022

exalate-issue-sync bot removed the T-kv KV Team label Jan 21, 2022

miretskiy mentioned this issue Jan 26, 2022

kv: Introduce MuxRangeFeed bi-directional RPC. #75581

Merged

miretskiy closed this as completed Jan 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

changefeed,kv: Excessive memory usage when using rangefeeds. #74219

changefeed,kv: Excessive memory usage when using rangefeeds. #74219

miretskiy commented Dec 22, 2021 •

edited by shermanCRL

Loading

blathers-crl bot commented Dec 22, 2021

mwang1026 commented Jan 11, 2022

miretskiy commented Jan 9, 2023

changefeed,kv: Excessive memory usage when using rangefeeds. #74219

changefeed,kv: Excessive memory usage when using rangefeeds. #74219

Comments

miretskiy commented Dec 22, 2021 • edited by shermanCRL Loading

blathers-crl bot commented Dec 22, 2021

mwang1026 commented Jan 11, 2022

miretskiy commented Jan 9, 2023

miretskiy commented Dec 22, 2021 •

edited by shermanCRL

Loading