
kvcoord: Consider increasing max_intents_bytes default #54029

Closed
bdarnell opened this issue Sep 8, 2020 · 12 comments
Labels
A-kv-transactions Relating to MVCC and the transactional model. C-performance Perf of queries or internals. Solution not expected to change functional behavior. N-followup Needs followup. O-postmortem Originated from a Postmortem action item. T-kv KV Team

Comments

@bdarnell
Contributor

bdarnell commented Sep 8, 2020

When a transaction's write set exceeds the kv.transaction.max_intents_bytes size, performance gets dramatically worse since point intent resolutions are changed to range intent resolutions. This can cause unexpectedly high contention between transactions if batched point deletes (for example) must perform intent resolution over large ranges. A motivating example is a "garbage collection" transaction that does a lot of point deletes - if it is big enough to exceed max_intents_bytes, it can have contention with (high-priority) read-only transactions even if they do not touch any garbage-collected keys.

The default value of 256KB was chosen without much thought IIRC; we should consider increasing it. We know of at least one customer who runs with this setting increased to 5MB. The downside to making this setting too high is that if you have a lot of concurrent transactions with large write sets, you could use a lot of unaccounted-for memory. If this were hooked up to proper memory accounting, it would be feasible to make the limit very high (perhaps limited only by the accounting system).

Similar arguments may apply to the max_refresh_spans_bytes setting, although this has not been observed to be as much of a problem in practice.

These settings are (deliberately) soft limits, so that we can accept arbitrarily large transactions with controlled memory usage, but the resulting discontinuity in performance is dangerous. It may be worth providing an option to turn these into hard limits, so that transactions will fail rather than have surprising performance implications.

Epic: CRDB-8282

@blathers-crl

blathers-crl bot commented Sep 8, 2020

Hi @bdarnell, please add a C-ategory label to your issue. Check out the label system docs.

While you're here, please consider adding an A- label to help keep our repository tidy.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@bdarnell bdarnell added A-kv-transactions Relating to MVCC and the transactional model. C-performance Perf of queries or internals. Solution not expected to change functional behavior. labels Sep 8, 2020
@jordanlewis
Member

cc @tbg, this is a better writeup of what we've discussed a couple of times.

@ajwerner
Contributor

ajwerner commented Sep 9, 2020

Relates to but is more pragmatic than #52768.

@aliher1911
Contributor

Condensable Span Set Budgeting

Problem

The transaction coordinator keeps track of intents in a number of condensableSpanSets so that it can clean them up when the transaction is committed or aborted. For large transactions the size of this set can become considerable. To limit the memory consumed, CockroachDB uses a configurable limit: once it is exceeded, spans are merged ("condensed") to reduce the memory footprint.
In the current implementation the limit is set quite low, to bound the damage it could do by exhausting all available memory. We could increase the limit to 2x-4x-?x, or use a shared memory budget that lets us raise the limit while keeping total memory usage constrained.
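For readers who have not looked at the spanset code, here is an illustrative Go sketch (not the actual condensableSpanSet, and deliberately cruder) of the trade-off condensing makes: collapsing many point spans into one covering range keeps memory bounded, but later intent resolution must then scan keys the transaction never wrote.

```go
// Illustrative sketch only: condense a set of point spans into a single
// covering range, trading tracking precision for memory.
package main

import (
	"fmt"
	"sort"
)

// span is a simplified key span; endKey == "" means a point span.
type span struct{ key, endKey string }

type spanSet struct {
	spans []span
	bytes int64
}

func (s *spanSet) add(sp span) {
	s.spans = append(s.spans, sp)
	s.bytes += int64(len(sp.key) + len(sp.endKey))
}

// condense collapses everything into one [min, max) range. The real
// implementation merges more carefully, but the effect is the same: fewer
// bytes tracked, coarser spans to resolve later.
func (s *spanSet) condense() {
	if len(s.spans) < 2 {
		return
	}
	keys := make([]string, 0, 2*len(s.spans))
	for _, sp := range s.spans {
		keys = append(keys, sp.key)
		if sp.endKey != "" {
			keys = append(keys, sp.endKey)
		}
	}
	sort.Strings(keys)
	// "\x00" models an exclusive end key one past the largest tracked key.
	s.spans = []span{{key: keys[0], endKey: keys[len(keys)-1] + "\x00"}}
	s.bytes = int64(len(s.spans[0].key) + len(s.spans[0].endKey))
}

func main() {
	var s spanSet
	for _, k := range []string{"a", "m", "q", "z"} {
		s.add(span{key: k})
	}
	fmt.Println("before:", len(s.spans), "point spans,", s.bytes, "bytes")
	s.condense()
	fmt.Printf("after:  %d span covering [%q, %q)\n", len(s.spans), s.spans[0].key, s.spans[0].endKey)
}
```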

Naive Solution

We could introduce an overall memory budget that spansets borrow from as they grow and return to when the transaction finishes.
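A minimal sketch of what such a shared pool could look like, assuming the naive solution above. The names (budgetPool, Grow, Release) are invented for the example and are not existing CockroachDB APIs.

```go
// Illustrative sketch only: a node-wide budget that spansets borrow from as
// they grow and release when the transaction finishes.
package main

import (
	"fmt"
	"sync"
)

type budgetPool struct {
	mu    sync.Mutex
	limit int64 // total bytes available to all spansets on the node
	used  int64
}

// Grow tries to reserve delta bytes; it reports false when the pool is
// exhausted, in which case the caller should condense instead of growing.
func (p *budgetPool) Grow(delta int64) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.used+delta > p.limit {
		return false
	}
	p.used += delta
	return true
}

// Release returns bytes to the pool, e.g. when the interceptor is closed.
func (p *budgetPool) Release(delta int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.used -= delta
}

func main() {
	pool := &budgetPool{limit: 1 << 20} // 1 MiB shared across all txns
	if pool.Grow(600 << 10) {
		fmt.Println("txn 1 reserved 600 KiB")
	}
	if !pool.Grow(600 << 10) {
		fmt.Println("txn 2 is over budget: condense its spanset instead")
	}
	pool.Release(600 << 10) // txn 1 finished
}
```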

Issues

There are a number of immediate issues this approach could cause.

What to do when the memory budget is exhausted?
The budget would act as a soft limit and would not prevent new spans from being added. We could treat it as a budget for non-condensed spansets: if you have no budget left, you must condense the spanset. We could also have a hard limit above which we always try to condense.

Would large transactions consuming too much budget unnecessarily affect smaller ones?
This could be resolved by having a minimum size threshold under which we would not try to condense spansets.

What happens if the budget setting is changed and the new value is already violated by non-condensed spansets?
Applying budget changes to existing spansets would require too much synchronization for what is essentially local data; it would be too complicated and would introduce too much contention. Instead we could just look at the currently set value whenever a spanset is ready to condense and check whether it needs condensing. Once all the old spansets are gone, usage converges to the new memory budget. A sketch of this decision logic follows below.
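A small sketch tying the answers above together; all names are hypothetical. The decision never condenses a spanset below a per-transaction minimum allowance, and it reads the budget setting's current value only at decision time, so a changed setting takes effect lazily as spansets come up for condensing.

```go
// Illustrative sketch only: when a spanset has just grown, decide whether it
// must condense, honoring a per-txn minimum allowance and the current value
// of a (mutable) budget setting.
package main

import (
	"fmt"
	"sync/atomic"
)

// currentBudgetBytes stands in for the cluster-wide budget setting, which can
// change at any time.
var currentBudgetBytes atomic.Int64

// shouldCondense is consulted after a spanset grows. poolUsedBytes is what
// all other spansets on the node currently hold.
func shouldCondense(spanSetBytes, poolUsedBytes, minTxnAllowance int64) bool {
	if spanSetBytes <= minTxnAllowance {
		return false // small transactions keep precise tracking regardless
	}
	// Read the setting only now; existing spansets are not chased down when
	// the setting changes, they simply see the new value next time.
	return poolUsedBytes+spanSetBytes > currentBudgetBytes.Load()
}

func main() {
	currentBudgetBytes.Store(4 << 20) // e.g. a 4 MiB node-wide budget
	fmt.Println(shouldCondense(64<<10, 0, 128<<10))       // false: under the allowance
	fmt.Println(shouldCondense(1<<20, 3800<<10, 128<<10)) // true: pool nearly full
}
```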

Details

Current spanset usage

  1. txn_interceptor_pipeliner iteratively updates the intent spans while processing the request and response in Send. Once the send has been processed, it checks whether the spanset needs to be condensed.
  2. It also updates the spanset on epoch bumps and savepoint rollbacks, but without checking the limit.
  3. The spanset is destroyed when the interceptor is finished, in closeLocked.

Changes

This logic could be changed to track unbudgeted intents and, whenever new data is added, check whether it would run over the budget. The spanset would then condense and return the freed-up memory if possible, or register the additional memory if condensing didn't free enough (see the sketch below).
Memory would be returned when the spanset is destroyed in closeLocked.
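A rough sketch of where those hooks could live in the interceptor's lifecycle, using hypothetical names rather than the real txn_interceptor_pipeliner: charge the budget once a Send has been processed, condense when the charge is refused, and release everything in closeLocked.

```go
// Illustrative sketch only: budget accounting hooked into a pipeliner-like
// lifecycle (grow after Send, condense when refused, release on close).
package main

import "fmt"

type memBudget interface {
	Grow(delta int64) bool
	Release(delta int64)
}

// fixedBudget is a trivial stand-in for the shared pool sketched earlier.
type fixedBudget struct{ limit, used int64 }

func (b *fixedBudget) Grow(delta int64) bool {
	if b.used+delta > b.limit {
		return false
	}
	b.used += delta
	return true
}

func (b *fixedBudget) Release(delta int64) { b.used -= delta }

type pipeliner struct {
	budget   memBudget
	reserved int64 // bytes this transaction has charged to the budget
}

// onResponse would run once a Send has been processed and the new intent
// spans are known.
func (p *pipeliner) onResponse(newSpanBytes int64) {
	if p.budget.Grow(newSpanBytes) {
		p.reserved += newSpanBytes
		return
	}
	// Over budget: a real implementation would condense, re-measure, and
	// charge only what condensing could not free.
	fmt.Println("over budget: condensing spanset instead of growing the reservation")
}

// closeLocked returns all reserved memory when the transaction finishes.
func (p *pipeliner) closeLocked() {
	p.budget.Release(p.reserved)
	p.reserved = 0
}

func main() {
	p := &pipeliner{budget: &fixedBudget{limit: 256 << 10}}
	p.onResponse(200 << 10) // fits within the budget
	p.onResponse(100 << 10) // refused: would exceed the budget
	p.closeLocked()
	fmt.Println("after closeLocked, reserved =", p.reserved)
}
```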

Other uses

The spanset is also used in txn_interceptor_span_refresher, where it can be updated, condensed, and cleared in multiple places, but the pattern is similar: we collect unbudgeted spans as they are added and consult the budget when we can condense. We also return memory when we either clear the set explicitly or destroy the interceptor.

Alternatively…

We could simply bump the current limit, since it is set too conservatively.
In that case the questions are what the new limit should be, and which users are most exposed to it so that we can ask them about their experience.

@bdarnell
Contributor Author

Yes, I think a shared memory budget is the right answer (this is what I was referring to with "If this were hooked up to proper memory accounting" in the original post).

Would large transactions consuming too much budget unnecessarily affect smaller ones

This is the biggest potential issue in my view. As you suggest below, I think we need some minimum allowance per transaction so that smallish transactions can get consistent performance.

Thinking further about this, should we have multiple budgets (per tenant or per user) so that there's less risk of a noisy neighbor disrupting other tenant/users' operations?

@aliher1911
Contributor

Ideally I think it would be nice to have something similar to how cgroups manage CPU slices: every tenant has a guaranteed allocation, but any unused slices are up for grabs, and as soon as the owner wants them back the freeloaders are slowed down. That works for CPU because slices are ephemeral. For lazy condensing it would be trickier, because a long-running bulk transaction would hold its spans uncondensed while others were already starving.

I imagine it could work with multiple "subpools", where well-behaved tenants could actually go above the global pool limit while a misbehaving one would be capped on the next update.

Or maybe have a separate "guaranteed" pool per tenant for small transactions and a shared one for larger transactions.

If the budget/policy concept that a spanset interacts with is well defined, we can start with a global pool and later change it to accommodate per-tenant/per-client pools.

It would be helpful to know how much transaction-size imbalance we actually have within a node on multi-tenant clusters.

@sumeerbhola
Collaborator

Yes, I think a shared memory budget is the right answer (this is what I was referring to with "If this were hooked up to proper memory accounting" in the original post).

There's going to be plumbing to account for memory once #66362 merges. It would be good to account in other places in KV-land.

Also, do we need this if we have #66268 for 21.2?

@aliher1911
Contributor

Also, do we need this if we have #66268 for 21.2?

That would allow us to use ranged intent resolution more aggressively without the current penalties, right? In that case increasing the limit might not make sense, though having a cap on memory usage might still be helpful.

Do we have any other resources that we track/limit per tenant inside KV to shield them from each other?

@aliher1911
Contributor

There's another aspect of range cleanup that I missed: we add the range to the latch span, and condensed ranges won't be latch-free until there's latchless cleanup. So even in the presence of the Pebble optimizations this still looks like an improvement.
Memory-budget-wise it is less clear: there is some demand for better resource accounting overall, but it will need some plumbing for client/tenant propagation.

@lunevalex lunevalex added N-followup Needs followup. O-postmortem Originated from a Postmortem action item. labels Jun 25, 2021
@andreimatei
Contributor

Would large transactions consuming too much budget unnecessarily affect smaller ones

What I was thinking of doing is maintaining a per-txn maximum in addition to the global limit, but having it be much larger - maybe 100MB. Particularly since we're also in the process of introducing an option to reject txns over their limit instead of condensing the lock spans.

@sumeerbhola
Collaborator

Do we have any other resources that we track/limit per tenant inside KV to shield them from each other?

There is the rate limiting (tenantrate and tenantcostmodel packages).
Note that the memory accounting being added is not per tenant, though we should consider extending it to be per-tenant (e.g. no tenant can consume more than the limit/4).

craig bot pushed a commit that referenced this issue Jun 29, 2021
66859: kvcoord: increase max_intent_bytes default to 4MB r=andreimatei,erikgrinaker a=aliher1911

Previously the limit was set conservatively to 256KB. Such a
conservative limit can cause intent cleanup batches to condense
aggressively, resorting to range cleanups, which are more expensive.
Increasing the default will reduce contention during cleanup at the
expense of increased memory for large transactions on the
tx_coord_sender.

Release note (performance improvement): Increase default value for
kv.transaction.max_intents_bytes cluster setting from 256KB to 4MB
to improve transaction performance for large transactions at the
expense of increased memory usage. Transactions above this size
limit use slower cleanup mode for commits and aborts.

#54029

Co-authored-by: Oleg Afanasyev <[email protected]>
@lunevalex
Collaborator

#66859 sorts this
