kvcoord: account for the span overhead when condensing refresh spans #84230
Conversation
I'd like to see what we can do here to alleviate the impact of the memory used by the refresh spans (I'm less interested in the lock footprint, which also uses the same struct). Here are the top usages in a recent core dump I analyzed:
The first and the third are related to the execution and are accounted for. The second one comes from the refresh spans set, and the fourth from […]. As is, this PR effectively disables the recent bump to kv.transaction.max_refresh_spans_bytes. Alternatively, we could actually perform the memory accounting for all of this memory against the root SQL monitor. Thoughts?
This change makes sense to me. Thanks for finding this and writing up the patch.
If this is acceptable, we might want to bump the kv.transaction.max_intents_bytes setting to, say, 16MiB from the current 4MiB to preserve the effective budget.
Do you mean kv.transaction.max_refresh_spans_bytes? If so, then bumping the default to 8 MiB seems reasonable, given that we were already exceeding this budget in many cases and we're now more accurately accounting for memory.
Alternatively, we could actually perform the memory accounting for all of this memory against the root SQL monitor.
Accounting for KV client memory against a memory monitor has been a goal of KV for a while. It would open up the opportunity for these limits to be more dynamic, based on available memory. Some care would need to be given to avoiding unpredictable performance cliffs when intent/refresh spans get coalesced, but I think an upfront memory reservation which could later be expanded if memory headroom permits could help with that.
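To make the reservation idea concrete, here is a minimal, hypothetical sketch of the "upfront reservation, grow when headroom permits" pattern. The budgetMonitor and refreshSpanAccount types and the 64KiB chunk size are my own illustrative stand-ins, not CRDB's actual mon package or anything in this PR:

```go
// Hypothetical sketch of tracking refresh-span memory against a monitor with
// an upfront reservation that only grows when exhausted. Illustrative only.
package main

import (
	"errors"
	"fmt"
)

// budgetMonitor stands in for a real memory monitor.
type budgetMonitor struct {
	used, limit int64
}

func (m *budgetMonitor) grow(n int64) error {
	if m.used+n > m.limit {
		return errors.New("memory budget exceeded")
	}
	m.used += n
	return nil
}

// refreshSpanAccount reserves an upfront chunk and only asks the monitor for
// more once the reservation is exhausted, avoiding per-span round trips.
type refreshSpanAccount struct {
	mon      *budgetMonitor
	reserved int64 // bytes obtained from the monitor
	inUse    int64 // bytes actually consumed by tracked spans
}

const reservationChunk = 64 << 10 // 64KiB upfront / growth increment (illustrative)

func (a *refreshSpanAccount) trackSpan(sz int64) error {
	for a.inUse+sz > a.reserved {
		// Expand the reservation only if the monitor has headroom.
		if err := a.mon.grow(reservationChunk); err != nil {
			return err // caller would condense spans or reject the request
		}
		a.reserved += reservationChunk
	}
	a.inUse += sz
	return nil
}

func main() {
	acc := &refreshSpanAccount{mon: &budgetMonitor{limit: 128 << 10}}
	for i := 0; ; i++ {
		if err := acc.trackSpan(56); err != nil { // ~48B struct overhead + a small key
			fmt.Printf("stopped after %d spans: %v\n", i, err)
			return
		}
	}
}
```

The point of the pattern is that the per-span hot path only touches the local reservation; the shared monitor is consulted only at chunk boundaries, which keeps accounting cheap while still letting the limit follow available memory.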
Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @yuzefovich)
pkg/kv/kvclient/kvcoord/condensable_span_set.go
line 31 at r1 (raw file):
// overestimate.
type condensableSpanSet struct {
	s []roachpb.Span
It feels like there is a lot that we could do with this data structure to 1) reduce the per span overhead, and 2) avoid the retention of many small keys for the duration of a transaction. For instance, we could allocate a single large block of memory and copy keys into it. We could also store key lengths inline to minimize the per-span overhead. Recognizing that many spans are actually point keys would also help.
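As a rough illustration of the flat-buffer idea (purely a sketch under my own assumptions, not code from this PR or from condensableSpanSet), keys could be copied into a single shared arena with their lengths stored inline, and a zero-length end key could mark a point key:

```go
// Hypothetical flat layout: all keys live in one backing arena, so we avoid
// the ~48-byte roachpb.Span header and per-key slice allocations. Names are
// illustrative only.
package main

import (
	"encoding/binary"
	"fmt"
)

// flatSpanSet stores spans as [varint keyLen][key][varint endKeyLen][endKey],
// where endKeyLen == 0 marks a point key.
type flatSpanSet struct {
	buf []byte // single backing arena for all keys
	n   int    // number of spans stored
}

func (s *flatSpanSet) add(key, endKey []byte) {
	s.buf = binary.AppendUvarint(s.buf, uint64(len(key)))
	s.buf = append(s.buf, key...)
	s.buf = binary.AppendUvarint(s.buf, uint64(len(endKey)))
	s.buf = append(s.buf, endKey...)
	s.n++
}

// forEach decodes the arena back into (key, endKey) pairs.
func (s *flatSpanSet) forEach(fn func(key, endKey []byte)) {
	b := s.buf
	for len(b) > 0 {
		kl, n := binary.Uvarint(b)
		b = b[n:]
		key := b[:kl]
		b = b[kl:]
		el, n := binary.Uvarint(b)
		b = b[n:]
		endKey := b[:el]
		b = b[el:]
		fn(key, endKey)
	}
}

func main() {
	var s flatSpanSet
	s.add([]byte("/Table/56/1/42"), nil)                        // point key: 1 byte of length overhead
	s.add([]byte("/Table/56/1/100"), []byte("/Table/56/1/200")) // ranged span
	s.forEach(func(k, ek []byte) { fmt.Printf("%s -> %s\n", k, ek) })
	fmt.Printf("%d spans in %d bytes\n", s.n, len(s.buf))
}
```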
Do you mean kv.transaction.max_refresh_spans_bytes?

I did mean kv.transaction.max_intents_bytes there. This condensableSpanSet struct is used by both txnPipeliner and txnSpanRefresher, so the change to how we account for each span affects both kv.transaction.max_intents_bytes and kv.transaction.max_refresh_spans_bytes.
Personally, I haven't seen issues with the former setting (since I'm focused on the analytical read-only queries), so I don't have any preference for what we do with that setting, but in order to roughly preserve the previous behavior of max_intents_bytes we might bump the default up to 8MiB or 16MiB.
I do care more about max_refresh_spans_bytes. The value was bumped from 256KiB to 4MiB during this release cycle, and this exposed us to a higher likelihood of OOMs when analytical queries run with some concurrency. Using the numbers from my example, this patch effectively reduces it back to maybe 700KiB (with ~24MB of actual usage against a 4MiB target, the budget effectively shrinks by roughly 6x). I'd be a bit worried if we increase the default value further without doing the memory accounting.
Another question that we need to consider is: how do we treat non-default cluster setting values on 22.1 once they upgrade to 22.2? Say a user has the max_intents_bytes setting set to 4MiB currently on 22.1; do we just do nothing (and then the user will effectively get a smaller intents budget)? Or do we proactively bump it for them? Given that both of the settings mentioned here are public, maybe a release note covering the change would be sufficient?
My current thinking is the following:
- include the overhead of roachpb.Span in the spanSize calculation (see the sketch below)
- bump the default of kv.transaction.max_intents_bytes to 8MiB but keep kv.transaction.max_refresh_spans_bytes at 4MiB
- include a release note
- do not do anything about non-default cluster setting values on 22.1 clusters when they upgrade to 22.2.
Thoughts?
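For the first item, here is a minimal sketch of what including the struct overhead in the size estimate could look like; the local span mirror type and the helper names are illustrative, not necessarily the PR's actual code:

```go
// Minimal sketch of charging the fixed struct overhead in the per-span size
// estimate. Names are illustrative; the PR's actual code may differ.
package main

import (
	"fmt"
	"unsafe"
)

// span mirrors the shape of roachpb.Span: two byte-slice keys.
type span struct {
	Key, EndKey []byte
}

// spanOverhead is the size of the span struct itself (two slice headers,
// 48 bytes on 64-bit platforms), independent of the key contents.
const spanOverhead = int64(unsafe.Sizeof(span{}))

// spanSize previously counted only len(Key)+len(EndKey); now it also charges
// the struct overhead so small keys no longer fly under the budget.
func spanSize(s span) int64 {
	return spanOverhead + int64(len(s.Key)) + int64(len(s.EndKey))
}

func main() {
	s := span{Key: []byte("/Table/56/1/42")}
	fmt.Println(spanSize(s)) // 48 + 14 = 62 rather than just 14
}
```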
Alternatively, we could actually perform the memory accounting for all of this memory against the root SQL monitor.
Accounting for KV client memory against a memory monitor has been a goal of KV for a while. It would open up the opportunity for these limits to be more dynamic, based on available memory. Some care would need to be given to avoiding unpredictable performance cliffs when intent/refresh spans get coalesced, but I think an upfront memory reservation which could later be expanded if memory headroom permits could help with that.
This makes sense. It seems a bit involved, so I won't take a stab at this. Is there an issue open for this already? I found #19721 and #49387, but they seem to be about other things.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)
pkg/kv/kvclient/kvcoord/condensable_span_set.go
line 31 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It feels like there is a lot that we could do with this data structure to 1) reduce the per span overhead, and 2) avoid the retention of many small keys for the duration of a transaction. For instance, we could allocate a single large block of memory and copy keys into it. We could also store key lengths inline to minimize the per-span overhead. Recognizing that many spans are actually point keys would also help.
I took the liberty of adding your comment as a TODO with your name on it; hope you don't mind :)
I fixed up the tests with an effort to make the fewest changes, and added the release note.
Also, I decided not to change the default of kv.transaction.max_intents_bytes to 8MiB since there are other things (the in-flight write set) that aren't affected by this PR and count towards the budget; a rough schematic is below.
PTAL.
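For context on that last point, here is a rough, purely hypothetical schematic of how the max_intents_bytes budget is shared between the condensed lock spans and the in-flight write set; the names are mine, not the real txnPipeliner fields:

```go
// Hypothetical schematic of why the max_intents_bytes default wasn't bumped:
// the budget covers both the condensed lock spans and the in-flight write
// set, and only the former is affected by the new per-span overhead
// accounting. Names are illustrative only.
package main

import "fmt"

type trackedWrites struct {
	lockSpanBytes      int64 // condensed lock spans (now include the ~48B/span overhead)
	inFlightWriteBytes int64 // in-flight writes (unchanged by this PR)
}

func overBudget(w trackedWrites, maxIntentsBytes int64) bool {
	return w.lockSpanBytes+w.inFlightWriteBytes > maxIntentsBytes
}

func main() {
	w := trackedWrites{lockSpanBytes: 3 << 20, inFlightWriteBytes: 2 << 20}
	fmt.Println(overBudget(w, 4<<20)) // true: both components share the 4MiB budget
}
```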
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)
Another question that we need to consider is: how do we treat non-default cluster setting values on 22.1 once they upgrade to 22.2? Say a user has the max_intents_bytes setting set to 4MiB currently on 22.1; do we just do nothing (and then the user will effectively get a smaller intents budget)? Or do we proactively bump it for them? Given that both of the settings mentioned here are public, maybe a release note covering the change would be sufficient?
This is a good point. For context, we don't document the effects of this setting very precisely. It was also increased recently in one of the past few releases. We're considering dropping the default back down now that ranged intent resolution is optimized, but that would need some validation: #85313.
Also I decided to not change the default of kv.transaction.max_intents_bytes to 8MiB since there are other things (the in-flight write set) that aren't affected by this PR and count towards the budget.
I'm +1 on not changing the default. A release note mentioning the effects of this change on intent span coalescing and refresh span validity seems sufficient.
Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)
Added more words to the release note. TFTR!
bors r+
Build failed (retrying...)
Build succeeded
Previously, we would only account for the lengths of the key and the end key of the span for the purposes of memory estimation while condensing the refresh spans set. However, each span has the non-trivial overhead (48 bytes) of the `roachpb.Span` object itself, which we previously ignored. As a result, the actual footprint of the refresh spans could previously significantly exceed the target size, especially when the keys are small. For example, when looking at a recently collected core dump, I saw the refresh spans taking up about 24MB in the heap whereas the target setting is only 4MiB. This memory currently is not tracked against the memory accounting system at all, so such over-shots are quite bad, especially so given the recent bump of the setting from 256KiB to 4MiB.
Addresses: #64906.
Addresses: #81451.
Release note (ops change): The way we track memory against `kv.transaction.max_intents_bytes` and `kv.transaction.max_refresh_spans_bytes` has been adjusted to be more precise (we no longer ignore some of the overhead). As a result, the stability of CRDB improves (we're less likely to OOM); however, this change effectively reduces the budgets determined by those cluster settings. In practice, this means that
- the intents might be tracked more coarsely (due to span coalescing), which makes the intent resolution less efficient
- the refresh spans become more coarse too, making it more likely that `ReadWithinUncertaintyIntervalError`s are returned to the user rather than retried transparently.