schema: add a knob to control the speed of schema changes #36430
Comments
#36403 is the knob -- it limits, at the receiving node, the number of expensive SSTs we handle at once. We'll be merging and backporting ASAP, and then exploring tuning it in a later backported patch if needed. There's one case the knob doesn't cover: a follower of many ranges can still get swamped with files even if each of those ranges' leaders has the knob turned down, since each leader obeys the knob individually but the follower receives files from all of them. Fortunately, just receiving and applying SSTs as a follower isn't that expensive, so if we can keep RocksDB from freaking out about the number of files, we should be good. That's where #36424 (or #34258) comes in.
36403: storage: rate-limit AddSST requests r=lucy-zhang a=lucy-zhang

We've been seeing extremely high latency for foreground traffic during bulk index backfills, because AddSST requests into non-empty ranges can be expensive, and write requests that are queued behind an AddSST request for an overlapping span can get stuck waiting for multiple seconds. This PR limits the number of concurrent AddSST requests for a single store, determined by a new cluster setting, `kv.bulk_io_write.concurrent_addsstable_requests`, to decrease the impact of index backfills on foreground writes. (It also decreases the risk of writing too many L0 files to RocksDB at once, which causes stalls.)

Fixes #36430

Release note (general change): Add a new cluster setting, `kv.bulk_io_write.concurrent_addsstable_requests`, which limits the number of SSTables that can be added concurrently during bulk operations.

36436: roachtest: handle duplicates in cdc/schemareg r=nvanbenschoten a=danhhz

There are various internal races and retries in changefeeds that can produce duplicates. This test is really only to verify that the confluent schema registry works end-to-end, so do the simplest thing and sort + unique the output.

Closes #36409

Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Daniel Harrison <[email protected]>
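For context, a cluster setting like this is adjusted from a SQL session. A minimal sketch follows; the setting name comes from the PR above, but the value is purely illustrative and the right number depends on the cluster, as discussed later in this thread:

```sql
-- Limit each store to a small number of concurrent AddSSTable requests
-- during bulk operations such as index backfills. The value 1 is only an
-- example; tune it for your hardware and workload.
SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = 1;

-- Check the current value.
SHOW CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests;
```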
I'm reopening this issue because it didn't address latency concerns from #34744
try
Still there: #34744 (comment)
36735: storage: limit number of AddSSTable requests per second r=lucy-zhang a=lucy-zhang

Add rate limiting on the number of AddSSTable requests per second to a store, to allow slowing down index backfills when they're impacting foreground traffic.

Closes #36430

Release note: None

Co-authored-by: Lucy Zhang <[email protected]>
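The per-second limit from #36735 is also exposed as a cluster setting. The PR text above doesn't name it, so the setting name and value below are assumptions for illustration only; check the documentation for your version before relying on them:

```sql
-- Assumed setting name for the per-store AddSSTable rate limit from #36735;
-- the value of 1 request per second is only an example.
SET CLUSTER SETTING kv.bulk_io_write.addsstable_max_rate = 1;
```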
I ran the same setup and created two indexes: CREATE INDEX, DROP INDEX, CREATE INDEX with the new setting. As you can see, index creation causes a spike in latency at the start, but it then levels off to a reasonable level. This issue is no longer a release blocker. The spike in latency will be addressed in a different issue.
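A rough sketch of that sequence, with a hypothetical table and index name standing in for whatever schema the test actually used:

```sql
-- Placeholder table/index names; not the ones used in the original test.
CREATE INDEX idx_kv_v ON kv (v);
DROP INDEX kv@idx_kv_v;

-- Turn the knob down before re-creating the index (example value only).
SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests = 1;
CREATE INDEX idx_kv_v ON kv (v);
```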
So which of those values are we going to use as the default? What do these rate limits do to the time taken to complete the schema changes?
Picking a good default value here is going to be hard. What this should be set to depends heavily on the cluster, the other traffic, and the data distribution being indexed.

The cluster itself is a big one: I think the reason we've seen this more in under-provisioned or nearer-the-limit clusters is available capacity -- in disk I/O and CPU -- for the compactor to keep up with the backfill.

How much compacting is required to keep up with the backfill also depends on the SSTs we add. That comes down to the specific data distribution being indexed and how it ends up being chunked: a single SST -- the unit this is limiting -- could be 100KB or it could be 16MB. Perhaps more importantly, it could span nothing -- and go straight to L6 -- or it could span lots of other backfill SSTs and recent online writes, and be forced to L0, trigger a memtable flush, and overlap everything.

Existing traffic is a factor too: the capacity it consumes matters, and its overlap with the backfill matters, both in how it affects the backfill as described above and in how much of the online traffic is affected by the backfill.

We could pick something we think is a conservative number, but if we picked something we knew was safe on some hardware/data, I could easily see that ending up being too harsh on small/cheap SSTs on different hardware. This is the same reason we haven't picked a max_write_rate default: the right value depends on many things, and absent something smart enough to figure it out, we need the operator to choose. Picking the slowest configuration we think we could ever deploy on -- which we rejected for that setting already -- is even harder here than it was when we rejected that idea the first time, since we also have to account for the unpredictability of the SSTs we need to ingest.
After talking with @lucy-zhang, I'm not sure the knob is doing what we hoped it would, as seen in #34744 (comment). I'm re-opening this issue until we can determine what exactly is going on.
I was able to replicate @vivekmenezes's success in running with
I'm removing this from SQL Schema because it seems to be in the same category of backfill-related performance concerns as, e.g., #47215 (and maybe this issue is obsolete anyway given the last few months of progress and planned work for 20.2), but if anyone disagrees then feel free to put it back.
We have marked this issue as stale because it has been inactive for
We should add a knob to control the speed of schema changes in 19.1.
We should also use this knob to slow down schema changes in 19.1, to preemptively address issues like #34744, #36385, and others.
In 19.2, we can turn the knob back up after addressing the CPU spike (#34744) and other stability concerns.
Without this knob, the only recourse for users who run into problems is to turn off this implementation entirely. Since we expect this implementation to be better than 2.1's overall, it would be a shame to have to direct users back to 2.1.
Epic CRDB-8816
Jira issue: CRDB-4499