kvcoord: Consider increasing max_intents_bytes default #54029
cc @tbg, this is a better writeup of what we've discussed a couple of times.
Relates to but is more pragmatic than #52768.
**Condensable Span Set Budgeting**

**Problem**

The transaction coordinator keeps track of intents in a number of `condensableSpanSet`s in order to clean up intents when the transaction is committed or aborted. For large transactions the size of this set can become considerable. To limit memory consumption, CockroachDB uses a configurable limit and starts merging spans once the limit is exceeded, trying to reduce the memory footprint.

**Naive Solution**

We could use an overall memory budget that we borrow from when growing the span set and return memory to when the transaction is finished.

**Issues**

There are a number of immediate issues that this approach could cause.

**Details**

Current spanset usage

**Changes**

This logic could be changed to track unbudgeted intents and, whenever new data is added, check whether it would run over budget. It would then condense and return the freed-up memory if possible, or register additional memory if condensing didn't free enough.

**Other uses**

The span set is also used in `txn_interceptor_span_refresher`, where it can be updated, condensed, and cleared in multiple places, but the pattern is similar: we collect unbudgeted spans as they are added and consult the budget when we can condense. We also return memory when we either clear the set explicitly or destroy the interceptor.

**Alternatively…**

We could just bump the current limit, as it is set too conservatively.
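To make the proposal above concrete, here is a minimal Go sketch of the settle-then-condense flow it describes. All names (`memBudget`, `spanSet`, `settle`) are illustrative, not CockroachDB's actual APIs; a real implementation would hook into the existing memory accounting infrastructure.

```go
package intentbudget

import (
	"errors"
	"sync"
)

// memBudget is a hypothetical shared pool that all transactions' span sets
// borrow from; it stands in for whatever accounting system is used.
type memBudget struct {
	mu    sync.Mutex
	used  int64
	limit int64
}

func (b *memBudget) grow(n int64) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used+n > b.limit {
		return errors.New("memory budget exceeded")
	}
	b.used += n
	return nil
}

func (b *memBudget) shrink(n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.used -= n
}

// spanSet sketches the proposed flow: bytes are tracked as spans are added,
// and the budget is only consulted when settling, condensing first if needed.
type spanSet struct {
	budget        *memBudget
	totalBytes    int64 // footprint of all tracked spans
	budgetedBytes int64 // portion already registered with the budget
}

// add records a span's footprint without touching the budget yet.
func (s *spanSet) add(spanBytes int64) {
	s.totalBytes += spanBytes
}

// settle registers the unbudgeted bytes with the shared budget. If that
// would overrun the pool, it condenses (merging point spans into ranges) to
// free memory and retries; only if that still fails does it report an error.
func (s *spanSet) settle() error {
	if err := s.budget.grow(s.totalBytes - s.budgetedBytes); err == nil {
		s.budgetedBytes = s.totalBytes
		return nil
	}
	s.condense() // reduces s.totalBytes
	if s.totalBytes <= s.budgetedBytes {
		// Condensing freed more than we needed; return the surplus.
		s.budget.shrink(s.budgetedBytes - s.totalBytes)
		s.budgetedBytes = s.totalBytes
		return nil
	}
	if err := s.budget.grow(s.totalBytes - s.budgetedBytes); err != nil {
		return err // still over budget after condensing
	}
	s.budgetedBytes = s.totalBytes
	return nil
}

// release returns all accounted memory when the transaction finishes.
func (s *spanSet) release() {
	s.budget.shrink(s.budgetedBytes)
	s.totalBytes, s.budgetedBytes = 0, 0
}

// condense would merge adjacent or overlapping spans into wider ranges;
// the real merging logic is elided, it only needs to lower totalBytes.
func (s *spanSet) condense() {}
```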
Yes, I think a shared memory budget is the right answer (this is what I was referring to with "If this were hooked up to proper memory accounting" in the original post).
This is the biggest potential issue in my view. As you suggest below, I think we need some minimum allowance per transaction so that smallish transactions get consistent performance. Thinking further about this, should we have multiple budgets (per tenant or per user) so that there's less risk of a noisy neighbor disrupting other tenants' or users' operations?
Ideally I think it would be nice to have something similar to how cgroups manages CPU slices, where every tenant has a guaranteed allocation but any unused slices are up for grabs, and as soon as the owner wants them back, freeloaders are slowed down. That works there because slices are ephemeral. For lazy condensing it would be trickier, as a long-running bulk transaction would hold memory uncondensed while others are already starving. I imagine it could work with multiple "subpools", where we could actually go above the global pool limit for well-behaved tenants while a misbehaving one would be capped on its next update. Or maybe have a separate "guaranteed" pool for small transactions per tenant and a shared one for larger ones. If the concept of a budget/policy that the span set interacts with is well defined, we can start with a global pool and later change it to accommodate per-tenant or per-client pools. It would be helpful to know how much transaction size imbalance we actually have within a node on multi-tenant clusters.
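A rough sketch of the "guaranteed per-tenant slice plus shared overflow" idea discussed above; `tenantBudget` and `pool` are hypothetical names, and the shared pool could be something like the `memBudget` sketched earlier in this thread.

```go
package tenantbudget

// pool is whatever shared budget the overflow draws from.
type pool interface {
	grow(n int64) error
	shrink(n int64)
}

// tenantBudget gives each tenant a guaranteed slice; anything above it is
// borrowed from the shared pool, so a misbehaving tenant can only starve
// the overflow, not other tenants' guaranteed allocations.
type tenantBudget struct {
	guaranteed int64 // bytes this tenant can always use
	used       int64
	shared     pool // overflow shared by all tenants
}

// overBy returns how far used exceeds the guaranteed slice, if at all.
func overBy(used, guaranteed int64) int64 {
	if used > guaranteed {
		return used - guaranteed
	}
	return 0
}

func (t *tenantBudget) grow(n int64) error {
	// Only the portion that pushes us past the guarantee hits the shared pool.
	borrow := overBy(t.used+n, t.guaranteed) - overBy(t.used, t.guaranteed)
	if borrow > 0 {
		if err := t.shared.grow(borrow); err != nil {
			return err
		}
	}
	t.used += n
	return nil
}

func (t *tenantBudget) shrink(n int64) {
	giveBack := overBy(t.used, t.guaranteed) - overBy(t.used-n, t.guaranteed)
	if giveBack > 0 {
		t.shared.shrink(giveBack)
	}
	t.used -= n
}
```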
There's going to be plumbing to account for memory once #66362 merges. It would be good to account in other places in KV-land. Also, do we need this if we have #66268 for 21.2?
That would allow us to use ranged resolution more aggressively without the current penalties, right? In that case increasing the limit might not make sense. Having a cap on memory usage might still be helpful. Do we have any other resources that we track or limit per tenant inside KV to shield tenants from each other?
There's another aspect to range cleanup that I missed: we add the range to the latch span, and condensed ranges won't be free until there's a latchless cleanup. So even in the presence of Pebble optimizations this still looks like an improvement.
What I was thinking of doing is maintaining a maximum per txn in addition to the global limit, but have it be much larger, maybe 100MB. Particularly since we're also in the process of introducing an option to reject txns over their limit instead of condensing the lock spans.
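A minimal sketch of that decision point, assuming a hypothetical per-transaction cap alongside an option to reject rather than condense; the names and the 100MB figure are illustrative only.

```go
package intentlimit

import "errors"

// Hypothetical per-transaction cap layered on top of the cluster-wide
// setting; the 100MB figure echoes the comment above.
const perTxnIntentBytesCap = 100 << 20

var errTxnTooLarge = errors.New("transaction exceeds its intent bytes limit")

// onNewIntentBytes sketches the decision: stay as-is while under the cap,
// condense lock spans once over it, or reject outright if the operator
// opted into hard limits instead of condensing.
func onNewIntentBytes(trackedBytes, addedBytes int64, rejectOverLimit bool) (condense bool, err error) {
	if trackedBytes+addedBytes <= perTxnIntentBytesCap {
		return false, nil
	}
	if rejectOverLimit {
		return false, errTxnTooLarge
	}
	return true, nil
}
```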
There is the rate limiting (
66859: kvcoord: increase max_intent_bytes default to 4MB r=andreimatei,erikgrinaker a=aliher1911

Previously the limit was conservatively set to 256KB. A conservative limit can cause intent cleanup batches to condense aggressively, resorting to range cleanups, which are more expensive. Increasing the default reduces contention during cleanup at the expense of increased memory for large transactions on the txn_coord_sender.

Release note (performance improvement): Increase the default value of the kv.transaction.max_intents_bytes cluster setting from 256KB to 4MB to improve performance for large transactions, at the expense of increased memory usage. Transactions above this size limit use a slower cleanup mode for commits and aborts.

#54029

Co-authored-by: Oleg Afanasyev <[email protected]>
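For operators who want to adjust the setting themselves (for example on versions before this change, or to a different value), something like the following should work, since cluster settings are changed through ordinary SQL; the connection string and the 4MiB value are illustrative.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the Postgres wire protocol
)

func main() {
	// Example connection string for a local, insecure node; adjust host,
	// certificates, and user for a real cluster.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Raise the per-transaction intent tracking budget to 4MiB, matching the
	// new default; cluster settings apply cluster-wide.
	if _, err := db.Exec(
		"SET CLUSTER SETTING kv.transaction.max_intents_bytes = 4194304",
	); err != nil {
		log.Fatal(err)
	}
}
```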
#66859 sorts this |
When a transaction's write set exceeds the `kv.transaction.max_intents_bytes` size, performance gets dramatically worse since point intent resolutions are changed to range intent resolutions. This can cause unexpectedly high contention between transactions if batched point deletes (for example) must perform intent resolution over large ranges. A motivating example is a "garbage collection" transaction that does a lot of point deletes: if it is big enough to exceed `max_intents_bytes`, it can have contention with (high-priority) read-only transactions even if they do not touch any garbage-collected keys.

The default value of 256KB was chosen without much thought IIRC; we should consider increasing it. We know of at least one customer who runs with this setting increased to 5MB. The downside to making this setting too high is that if you have a lot of concurrent transactions with large write sets, you could use a lot of unaccounted-for memory. If this were hooked up to proper memory accounting, it would be feasible to make the limit very high (perhaps limited only by the accounting system).

Similar arguments may apply to the `max_refresh_spans_bytes` setting, although this has not been observed to be as much of a problem in practice.

These settings are (deliberately) soft limits, so that we can accept arbitrarily large transactions with controlled memory usage, but the resulting discontinuity in performance is dangerous. It may be worth providing an option to turn these into hard limits, so that transactions will fail rather than have surprising performance implications.
Epic: CRDB-8282