
kvcoord: account for the span overhead when condensing refresh spans #84230

Merged
merged 1 commit on Aug 1, 2022

Conversation


@yuzefovich yuzefovich commented Jul 12, 2022

Previously, we would only account for the lengths of the key and the end
key of the span for the purposes of memory estimation while condensing
the refresh spans set. However, each span carries the non-trivial overhead
(48 bytes) of the roachpb.Span object itself, which we previously ignored.
As a result, the actual footprint of the refresh spans could significantly
exceed the target size, especially when the keys are small. For example,
when looking at a recently collected core dump, I saw the refresh spans
taking up about 24MB in the heap whereas the target setting is only 4MiB.
This memory is currently not tracked by the memory accounting system at
all, so such overshoots are quite bad, especially given the recent bump of
the setting from 256KiB to 4MiB.
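
For illustration, here is a minimal sketch of the adjusted spanSize estimate; the spanOverhead constant and the exact shape of the helper are assumptions for this sketch rather than necessarily the PR's exact code:

```go
package kvcoord

import (
	"unsafe"

	"github.com/cockroachdb/cockroach/pkg/roachpb"
)

// spanOverhead is the in-memory size of an empty roachpb.Span: two slice
// headers (Key and EndKey) of 24 bytes each on 64-bit platforms, i.e. 48 bytes.
const spanOverhead = int64(unsafe.Sizeof(roachpb.Span{}))

// spanSize estimates the memory footprint of a span. The old estimate was only
// len(Key) + len(EndKey); the overhead of the struct itself was ignored.
func spanSize(sp roachpb.Span) int64 {
	return spanOverhead + int64(len(sp.Key)) + int64(len(sp.EndKey))
}
```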

Addresses: #64906.
Addresses: #81451.

Release note (ops change): The way we track memory against
kv.transaction.max_intents_bytes and
kv.transaction.max_refresh_spans_bytes has been adjusted to be more
precise (we no longer ignore some of the overhead). As a result, the
stability of CRDB improves (we're less likely to OOM); however, this
change effectively reduces the budgets determined by those cluster
settings. In practice, this means that

  • the intents might be tracked more coarsely (due to span coalescing),
    which makes intent resolution less efficient
  • the refresh spans also become more coarse, making it more likely that
    ReadWithinUncertaintyIntervalErrors are returned to the user rather
    than being retried transparently.

@yuzefovich yuzefovich requested review from nvanbenschoten, arulajmani and a team July 12, 2022 03:31
@cockroach-teamcity
Member

This change is Reviewable

@yuzefovich
Member Author

I'd like to see what we can do here to alleviate the impact of the memory used by the refresh spans (I'm less interested in the lock footprint, which also uses the same struct). Here are the top usages in a recent core dump I analyzed:

   count   size  bytes live%   sum% type
      55 8.4 MB 461 MB 16.80  16.80 [1048576]uint64
      15  25 MB 368 MB 13.42  30.22 [511829]github.com/cockroachdb/cockroach/pkg/roachpb.Span
   43638 6.1 kB 268 MB  9.76  39.98 [1+127?]github.com/cockroachdb/cockroach/pkg/sql/colexec/colexecagg.anyNotNullInt64HashAgg
       9  20 MB 184 MB  6.71  46.69 [426837]github.com/cockroachdb/cockroach/pkg/roachpb.Span

The first and the third are related to query execution and are accounted for. The second one comes from the refresh spans set, and the fourth from the LeafTxnFinalState, which directly depends on the refresh spans size. As you can see, we have 368MB of unaccounted-for memory, which in turn drives large unaccounted-for allocations for LeafTxnFinalState. I'll look into addressing the latter, but I'd like to see what we can do with those 368MB.

As is, this PR effectively disables the recent bump to the kv.transaction.max_refresh_spans_bytes setting since it can shrink the memory footprint by a factor of 6 or so (depending on the key sizes). If this is acceptable, we might want to bump kv.transaction.max_intents_bytes setting to, say, 16MiB from the current 4MiB to preserve it.
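
For a rough sense of where that factor comes from (a back-of-the-envelope estimate, assuming small keys contributing about $k \approx 10$ bytes per span): previously only the key bytes were counted, while now the 48-byte struct overhead is counted too, so the actual footprint admitted by the same budget shrinks by roughly

$$\frac{48 + k}{k} \approx \frac{48 + 10}{10} \approx 5.8 \approx 6, \qquad \frac{4\ \text{MiB}}{6} \approx 700\ \text{KiB},$$

which lines up with the ~700KiB effective budget mentioned further down.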

Alternatively, we could actually perform the memory accounting for all of this memory against the root SQL monitor.

Thoughts?


@nvanbenschoten nvanbenschoten left a comment


This change makes sense to me. Thanks for finding this and writing up the patch.

If this is acceptable, we might want to bump kv.transaction.max_intents_bytes setting to, say, 16MiB from the current 4MiB to preserve it.

Do you mean kv.transaction.max_refresh_spans_bytes? If so then bumping the default to 8 MiB seems reasonable, given that we were exceeding this budget in many cases already and we're now more accurately accounting for memory.

Alternatively, we could actually perform the memory accounting for all of this memory against the root SQL monitor.

Accounting for KV client memory against a memory monitor has been a goal of KV for a while. It would open up the opportunity for these limits to be more dynamic, based on available memory. Some care would need to be given to avoiding unpredictable performance cliffs when intent/refresh spans get coalesced, but I think an upfront memory reservation which could later be expanded if memory headroom permits could help with that.
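
Purely as an illustration of the upfront-reservation idea (all names and types below are hypothetical sketches, not an existing CRDB API):

```go
package kvcoord

// txnMemBudget is a hypothetical per-transaction reservation against a shared
// pool of bytes: a fixed amount is reserved up front, and the transaction may
// ask for more later if the pool still has headroom. Simplified and not
// thread-safe; a real version would sit behind the SQL memory monitor.
type txnMemBudget struct {
	pool     *int64 // remaining bytes in the shared pool
	reserved int64  // bytes currently reserved by this transaction
}

// reserve takes n bytes from the pool, returning false if there is no headroom.
func (b *txnMemBudget) reserve(n int64) bool {
	if *b.pool < n {
		return false
	}
	*b.pool -= n
	b.reserved += n
	return true
}

// tryGrow attempts to expand the reservation. On failure the caller would fall
// back to condensing (coalescing) its spans rather than erroring out, which is
// where the performance cliff mentioned above comes in.
func (b *txnMemBudget) tryGrow(n int64) bool {
	return b.reserve(n)
}

// release returns the reservation to the pool when the transaction finishes.
func (b *txnMemBudget) release() {
	*b.pool += b.reserved
	b.reserved = 0
}
```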

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @yuzefovich)


pkg/kv/kvclient/kvcoord/condensable_span_set.go line 31 at r1 (raw file):

// overestimate.
type condensableSpanSet struct {
	s     []roachpb.Span

It feels like there is a lot that we could do with this data structure to 1) reduce the per span overhead, and 2) avoid the retention of many small keys for the duration of a transaction. For instance, we could allocate a single large block of memory and copy keys into it. We could also store key lengths inline to minimize the per-span overhead. Recognizing that many spans are actually point keys would also help.
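
To make this a bit more concrete, here is a rough sketch of the single-block idea; the names and layout are hypothetical:

```go
package kvcoord

// spanArena is a purely illustrative sketch of the idea above: keys are copied
// into one contiguous buffer and only end offsets are stored, so there is no
// separate allocation and no 48-byte roachpb.Span header per span.
type spanArena struct {
	buf  []byte   // all keys and end keys, concatenated
	offs []uint32 // two entries per span: end offset of Key, then of EndKey
}

// add copies a span's keys into the arena. A point key can be added with an
// empty endKey, which costs no extra key bytes.
func (a *spanArena) add(key, endKey []byte) {
	a.buf = append(a.buf, key...)
	a.offs = append(a.offs, uint32(len(a.buf)))
	a.buf = append(a.buf, endKey...)
	a.offs = append(a.offs, uint32(len(a.buf)))
}

// get returns the i-th span's key and end key as sub-slices of the buffer.
func (a *spanArena) get(i int) (key, endKey []byte) {
	start := uint32(0)
	if i > 0 {
		start = a.offs[2*i-1]
	}
	return a.buf[start:a.offs[2*i]], a.buf[a.offs[2*i]:a.offs[2*i+1]]
}
```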


@yuzefovich yuzefovich left a comment


Do you mean kv.transaction.max_refresh_spans_bytes?

I did mean kv.transaction.max_intents_bytes there. This condensableSpanSet struct is used by both txnPipeliner and txnSpanRefresher, so the change to how we account for each span affects both kv.transaction.max_intents_bytes and kv.transaction.max_refresh_spans_bytes.

Personally, I haven't seen issues with the former setting (since I'm focused on analytical read-only queries), so I don't have a preference for what we do with it, but in order to roughly preserve the previous behavior of max_intents_bytes, we might bump the default up to 8MiB or 16MiB.

I do care more about max_refresh_spans_bytes. The value was bumped from 256KiB to 4MiB during this release cycle, and this exposed us to a higher likelihood of OOMs when analytical queries run with some concurrency. Using the numbers from my example, this patch effectively reduces it back to maybe 700KiB. I'd be a bit worried about increasing the default value further without doing the memory accounting.

Another question we need to consider: how do we treat non-default cluster setting values on 22.1 clusters once they upgrade to 22.2? Say a user currently has max_intents_bytes set to 4MiB on 22.1: do we just do nothing (in which case the user effectively gets a smaller intents budget)? Or do we proactively bump it for them? Given that both of the settings mentioned here are public, maybe a release note covering the change would be sufficient?

My current thinking is the following:

  • include the overhead of roachpb.Span in the spanSize calculation
  • bump the default of kv.transaction.max_intents_bytes to 8MiB, but keep kv.transaction.max_refresh_spans_bytes at 4MiB
  • include a release note
  • do not do anything about non-default cluster setting values on 22.1 clusters when they upgrade to 22.2.

Thoughts?

Alternatively, we could actually perform the memory accounting for all of this memory against the root SQL monitor.

Accounting for KV client memory against a memory monitor has been a goal of KV for a while. It would open up the opportunity for these limits to be more dynamic, based on available memory. Some care would need to be given to avoiding unpredictable performance cliffs when intent/refresh spans get coalesced, but I think an upfront memory reservation which could later be expanded if memory headroom permits could help with that.

This makes sense. It seems a bit involved, so I won't take a stab at this. Is there an issue open for this already? I found #19721 and #49387, but they seem to be about other things.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


pkg/kv/kvclient/kvcoord/condensable_span_set.go line 31 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It feels like there is a lot that we could do with this data structure to 1) reduce the per span overhead, and 2) avoid the retention of many small keys for the duration of a transaction. For instance, we could allocate a single large block of memory and copy keys into it. We could also store key lengths inline to minimize the per-span overhead. Recognizing that many spans are actually point keys would also help.

I took the liberty of adding your comment as a TODO with your name on it; hope you don't mind :)


@yuzefovich yuzefovich left a comment


I fixed up the tests, aiming for the smallest amount of changes, and added the release note.

Also, I decided not to change the default of kv.transaction.max_intents_bytes to 8MiB, since there are other things (the in-flight write set) that aren't affected by this PR and count towards the budget.

PTAL.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


@nvanbenschoten nvanbenschoten left a comment


Another question we need to consider: how do we treat non-default cluster setting values on 22.1 clusters once they upgrade to 22.2? Say a user currently has max_intents_bytes set to 4MiB on 22.1: do we just do nothing (in which case the user effectively gets a smaller intents budget)? Or do we proactively bump it for them? Given that both of the settings mentioned here are public, maybe a release note covering the change would be sufficient?

This is a good point. For context, we don't document the effects of this setting very precisely. It was also increased recently in one of the past few releases. We're considering dropping the default back down now that ranged intent resolution is optimized, but that would need some validation: #85313.

Also, I decided not to change the default of kv.transaction.max_intents_bytes to 8MiB, since there are other things (the in-flight write set) that aren't affected by this PR and count towards the budget.

I'm +1 on not changing the default. A release note mentioning the effects of this change on intent span coalescing and refresh span validity seems sufficient.

:lgtm:

Reviewed 5 of 5 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)

@yuzefovich
Member Author

Added more words to the release note.

TFTR!

bors r+


craig bot commented Aug 1, 2022

Build failed (retrying...):


craig bot commented Aug 1, 2022

Build succeeded:
