
kvcoord: account for the span overhead when condensing refresh spans #84230

Merged
merged 1 commit on Aug 1, 2022

Conversation


@yuzefovich yuzefovich commented Jul 12, 2022

Previously, we would only account for the lengths of the key and the end
key of the span for the purposes of memory estimation while condensing
the refresh spans set. However, each span carries the non-trivial overhead
(48 bytes) of the roachpb.Span object itself, which we previously ignored.
As a result, the actual footprint of the refresh spans could significantly
exceed the target size, especially when the keys are small. For example,
when looking at a recently collected core dump, I saw the refresh spans
taking up about 24MB in the heap whereas the target setting is only 4MiB.
This memory is currently not tracked by the memory accounting system at
all, so such overshoots are quite bad, especially given the recent bump of
the setting from 256KiB to 4MiB.
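
For illustration, here is a minimal sketch of the adjusted spanSize estimate; the spanOverhead constant and the exact shape of the helper are assumptions for this sketch rather than necessarily the PR's exact code:

```go
package kvcoord

import (
	"unsafe"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// spanOverhead is the in-memory size of an empty roachpb.Span: two slice
// headers (Key and EndKey) of 24 bytes each on 64-bit platforms, i.e. 48 bytes.
const spanOverhead = int64(unsafe.Sizeof(roachpb.Span{}))

// spanSize estimates the memory footprint of a span. The old estimate was only
// len(Key) + len(EndKey); the overhead of the struct itself was ignored.
func spanSize(sp roachpb.Span) int64 {
	return spanOverhead + int64(len(sp.Key)) + int64(len(sp.EndKey))
}
```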

Addresses: #64906.
Addresses: #81451.

Release note (ops change): The way we track memory against
kv.transaction.max_intents_bytes and
kv.transaction.max_refresh_spans_bytes has been adjusted to be more
precise (we no longer ignore some of the overhead). As a result, the
stability of CRDB improves (we're less likely to OOM); however, this
change effectively reduces the budgets determined by those cluster
settings. In practice, this means that

  • the intents might be tracked more coarsely (due to span coalescing),
    which makes intent resolution less efficient
  • the refresh spans also become more coarse, making it more likely that
    ReadWithinUncertaintyIntervalErrors are returned to the user rather
    than being retried transparently.

@yuzefovich yuzefovich requested review from nvanbenschoten, arulajmani and a team July 12, 2022 03:31
@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich
Member Author

I'd like to see what we can do here to alleviate the impact of the memory used by the refresh spans (I'm less interested in the lock footprint, which also uses the same struct). Here are the top usages in a recent core dump I analyzed:

   count   size  bytes live%   sum% type
      55 8.4 MB 461 MB 16.80  16.80 [1048576]uint64
      15  25 MB 368 MB 13.42  30.22 [511829]github.com/cockroachdb/cockroach/pkg/roachpb.Span
   43638 6.1 kB 268 MB  9.76  39.98 [1+127?]github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecagg.anyNotNullInt64HashAgg
       9  20 MB 184 MB  6.71  46.69 [426837]github.com/cockroachdb/cockroach/pkg/roachpb.Span

The first and the third are related to query execution and are accounted for. The second one comes from the refresh spans set, and the fourth from the LeafTxnFinalState, which directly depends on the refresh spans size. As you can see, we have 368MB of unaccounted-for memory, which in turn drives large unaccounted-for allocations for LeafTxnFinalState. I'll look into addressing the latter, but I'd like to see what we can do with those 368MB.

As is, this PR effectively disables the recent bump to the kv.transaction.max_refresh_spans_bytes setting since it can shrink the memory footprint by a factor of 6 or so (depending on the key sizes). If this is acceptable, we might want to bump kv.transaction.max_intents_bytes setting to, say, 16MiB from the current 4MiB to preserve it.
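
For a rough sense of where that factor comes from (a back-of-the-envelope estimate, assuming small keys contributing about $k \approx 10$ bytes per span): previously only the key bytes were counted, while now the 48-byte struct overhead is counted too, so the actual footprint admitted by the same budget shrinks by roughly

$$\frac{48 + k}{k} \approx \frac{48 + 10}{10} \approx 5.8 \approx 6, \qquad \frac{4\ \text{MiB}}{6} \approx 700\ \text{KiB},$$

which lines up with the ~700KiB effective budget mentioned further down.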

Alternatively, we could actually perform the memory accounting for all of this memory against the root SQL monitor.

Thoughts?


@nvanbenschoten nvanbenschoten left a comment


This change makes sense to me. Thanks for finding this and writing up the patch.

If this is acceptable, we might want to bump kv.transaction.max_intents_bytes setting to, say, 16MiB from the current 4MiB to preserve it.

Do you mean kv.transaction.max_refresh_spans_bytes? If so then bumping the default to 8 MiB seems reasonable, given that we were exceeding this budget in many cases already and we're now more accurately accounting for memory.

Alternatively, we could actually perform the memory accounting for all of this memory against the root SQL monitor.

Accounting for KV client memory against a memory monitor has been a goal of KV for a while. It would open up the opportunity for these limits to be more dynamic, based on available memory. Some care would need to be given to avoiding unpredictable performance cliffs when intent/refresh spans get coalesced, but I think an upfront memory reservation which could later be expanded if memory headroom permits could help with that.
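
Purely as an illustration of the upfront-reservation idea (all names and types below are hypothetical sketches, not an existing CRDB API):

```go
package kvcoord

// txnMemBudget is a hypothetical per-transaction reservation against a shared
// pool of bytes: a fixed amount is reserved up front, and the transaction may
// ask for more later if the pool still has headroom. Simplified and not
// thread-safe; a real version would sit behind the SQL memory monitor.
type txnMemBudget struct {
	pool     *int64 // remaining bytes in the shared pool
	reserved int64  // bytes currently reserved by this transaction
}

// reserve takes n bytes from the pool, returning false if there is no headroom.
func (b *txnMemBudget) reserve(n int64) bool {
	if *b.pool < n {
		return false
	}
	*b.pool -= n
	b.reserved += n
	return true
}

// tryGrow attempts to expand the reservation. On failure the caller would fall
// back to condensing (coalescing) its spans rather than erroring out, which is
// where the performance cliff mentioned above comes in.
func (b *txnMemBudget) tryGrow(n int64) bool {
	return b.reserve(n)
}

// release returns the reservation to the pool when the transaction finishes.
func (b *txnMemBudget) release() {
	*b.pool += b.reserved
	b.reserved = 0
}
```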

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @yuzefovich)


pkg/kv/kvclient/kvcoord/condensable_span_set.go line 31 at r1 (raw file):

// overestimate.
type condensableSpanSet struct {
	s     []roachpb.Span

It feels like there is a lot that we could do with this data structure to 1) reduce the per span overhead, and 2) avoid the retention of many small keys for the duration of a transaction. For instance, we could allocate a single large block of memory and copy keys into it. We could also store key lengths inline to minimize the per-span overhead. Recognizing that many spans are actually point keys would also help.
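
To make this a bit more concrete, here is a rough sketch of the single-block idea; the names and layout are hypothetical:

```go
package kvcoord

// spanArena is a purely illustrative sketch of the idea above: keys are copied
// into one contiguous buffer and only end offsets are stored, so there is no
// separate allocation and no 48-byte roachpb.Span header per span.
type spanArena struct {
	buf  []byte   // all keys and end keys, concatenated
	offs []uint32 // two entries per span: end offset of Key, then of EndKey
}

// add copies a span's keys into the arena. A point key can be added with an
// empty endKey, which costs no extra key bytes.
func (a *spanArena) add(key, endKey []byte) {
	a.buf = append(a.buf, key...)
	a.offs = append(a.offs, uint32(len(a.buf)))
	a.buf = append(a.buf, endKey...)
	a.offs = append(a.offs, uint32(len(a.buf)))
}

// get returns the i-th span's key and end key as sub-slices of the buffer.
func (a *spanArena) get(i int) (key, endKey []byte) {
	start := uint32(0)
	if i > 0 {
		start = a.offs[2*i-1]
	}
	return a.buf[start:a.offs[2*i]], a.buf[a.offs[2*i]:a.offs[2*i+1]]
}
```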


@yuzefovich yuzefovich left a comment


Do you mean kv.transaction.max_refresh_spans_bytes?

I did mean kv.transaction.max_intents_bytes there. This condensableSpanSet struct is used by both txnPipeliner and txnSpanRefresher, so the change to how we account for each span affects both kv.transaction.max_intents_bytes and kv.transaction.max_refresh_spans_bytes.

Personally, I haven't seen issues with the former setting (since I'm focused on analytical read-only queries), so I don't have a preference for what we do with it, but in order to roughly preserve the previous behavior of max_intents_bytes, we might bump the default up to 8MiB or 16MiB.

I do care more about max_refresh_spans_bytes. The value was bumped from 256KiB to 4MiB during this release cycle, and this exposed us to a higher likelihood of OOMs when analytical queries run with some concurrency. Using the numbers from my example, this patch effectively reduces it back to maybe 700KiB. I'd be a bit worried about increasing the default value further without doing the memory accounting.

Another question we need to consider: how do we treat non-default cluster setting values on 22.1 clusters once they upgrade to 22.2? Say a user currently has max_intents_bytes set to 4MiB on 22.1: do we just do nothing (in which case the user effectively gets a smaller intents budget)? Or do we proactively bump it for them? Given that both of the settings mentioned here are public, maybe a release note covering the change would be sufficient?

My current thinking is the following:

  • include the overhead of roachpb.Span in the spanSize calculation
  • bump the default of kv.transaction.max_intents_bytes to 8MiB, but keep kv.transaction.max_refresh_spans_bytes at 4MiB
  • include a release note
  • do not do anything about non-default cluster setting values on 22.1 clusters when they upgrade to 22.2.

Thoughts?

Alternatively, we could actually perform the memory accounting for all of this memory against the root SQL monitor.

Accounting for KV client memory against a memory monitor has been a goal of KV for a while. It would open up the opportunity for these limits to be more dynamic, based on available memory. Some care would need to be given to avoiding unpredictable performance cliffs when intent/refresh spans get coalesced, but I think an upfront memory reservation which could later be expanded if memory headroom permits could help with that.

This makes sense. It seems a bit involved, so I won't take a stab at this. Is there an issue open for this already? I found #19721 and #49387, but they seem to be about other things.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/condensable_span_set.go line 31 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It feels like there is a lot that we could do with this data structure to 1) reduce the per span overhead, and 2) avoid the retention of many small keys for the duration of a transaction. For instance, we could allocate a single large block of memory and copy keys into it. We could also store key lengths inline to minimize the per-span overhead. Recognizing that many spans are actually point keys would also help.

I took the liberty of adding your comment as a TODO with your name on it; hope you don't mind :)


@yuzefovich yuzefovich left a comment


I fixed up the tests, aiming for the smallest amount of changes, and added the release note.

Also, I decided not to change the default of kv.transaction.max_intents_bytes to 8MiB, since there are other things (the in-flight write set) that aren't affected by this PR and count towards the budget.

PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


@nvanbenschoten nvanbenschoten left a comment


Another question we need to consider: how do we treat non-default cluster setting values on 22.1 clusters once they upgrade to 22.2? Say a user currently has max_intents_bytes set to 4MiB on 22.1: do we just do nothing (in which case the user effectively gets a smaller intents budget)? Or do we proactively bump it for them? Given that both of the settings mentioned here are public, maybe a release note covering the change would be sufficient?

This is a good point. For context, we don't document the effects of this setting very precisely. It was also increased recently in one of the past few releases. We're considering dropping the default back down now that ranged intent resolution is optimized, but that would need some validation: #85313.

Also, I decided not to change the default of kv.transaction.max_intents_bytes to 8MiB, since there are other things (the in-flight write set) that aren't affected by this PR and count towards the budget.

I'm +1 on not changing the default. A release note mentioning the effects of this change on intent span coalescing and refresh span validity seems sufficient.

:lgtm:

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)

@yuzefovich
Member Author

Added more words to the release note.

TFTR!

bors r+


craig bot commented Aug 1, 2022

Build failed (retrying...):


craig bot commented Aug 1, 2022

Build succeeded:
