schema changes: pre-backfill without online traffic then incremental backfill after #36850
Comments
This is currently just an initial brain dump of something I've mentioned a couple of times recently, and it seemed like it was time to put some thoughts in writing (particularly since it is potentially offered as a later stepping stone for table-rewriting jobs building on a fast backfill). Also, h/t to @andy-kimball for getting me thinking about this, with what I think might have just been an offhand comment asking why we didn't do, in broad strokes, something like this.
This all seems pretty tractable to me.
I've thought about this a little in the context of an extremely large table (terabytes, petabytes) with a high write rate. We'd likely want to do AddSSTable-only backfill passes until we get "close enough", then switch to something like WriteBatch for the backfill once we enter the final catch-up pass.
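To make the "close enough" switch concrete, here is a minimal Go sketch under the assumption that we can read the remaining delta for each pass; KV, bulkIngest, and pointWrites are invented stand-ins for the real AddSSTable and WriteBatch paths, not actual CockroachDB APIs.

```go
package main

import "fmt"

// KV is a key/value pair produced by one catch-up pass.
type KV struct {
	Key, Value []byte
}

// bulkIngest stands in for building SSTs and sending AddSSTable requests.
func bulkIngest(kvs []KV) {
	fmt.Printf("AddSSTable-style pass: %d keys\n", len(kvs))
}

// pointWrites stands in for ordinary WriteBatch-style transactional writes.
func pointWrites(kvs []KV) {
	fmt.Printf("WriteBatch-style pass: %d keys\n", len(kvs))
}

// catchUp repeatedly reads the delta accumulated since the previous pass,
// using bulk ingestion while the delta is large and switching to normal
// writes once it is "close enough" to caught up.
func catchUp(readDelta func() []KV, closeEnough int) {
	for {
		kvs := readDelta()
		if len(kvs) == 0 {
			return // fully caught up
		}
		if len(kvs) > closeEnough {
			bulkIngest(kvs)
			continue
		}
		pointWrites(kvs)
		return
	}
}

func main() {
	// Simulated deltas shrinking across passes.
	deltas := [][]KV{make([]KV, 10000), make([]KV, 400)}
	i := 0
	catchUp(func() []KV {
		if i >= len(deltas) {
			return nil
		}
		d := deltas[i]
		i++
		return d
	}, 1000)
}
```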
Agreed. CDC doesn't really seem to me like a building block for this: why buffer when we have the source of truth available?
This makes a lot of sense, especially if you can create the initial backfilled index very fast, which I'm pretty sure you can. You're right that a major savings comes from this index getting into RocksDB L6. #19749 is the issue related to time-bound scans.
Protected timestamps mitigate the GC problems outlined in the description. I look forward to understanding not how this approach improves the performance of the backfill but rather how it improves the performance of foreground workloads which either: A) are highly contended
One issue with concurrent writes to indexes is that they make AddSSTable much worse and tend to lead to backpressure. Relatively uniform writes to a table which has an index being added can bring that index build to a crawl as compaction debt builds up quickly. This is readily observed in 20.1 with both Pebble and RocksDB; maybe it is better with 20.2 Pebble. Another note is that we don't rebalance ranges based on applying SSTs, so we tend not to spread this compaction load around in large clusters. @dt, do you think that last point is worthy of its own issue?
MVCC-compatible backfills do almost exactly what's described in this issue, so I'm going to close this. Please re-open if I'm wrong.
TL;DR
Pre-backfill a new index before it gets any online traffic, then start online traffic and catch it up using a new "incremental backfill" that is cheaper, so online traffic mixes with less backfill traffic.
Background
Currently our schema change process follows the online schema change process documented in the F1 paper pretty closely: we advance through a series of states, and most importantly, we ensure all nodes have advanced to a state in which they update the new schema element during all online traffic before we start backfilling. We do this so we can be sure we correctly process every row: if every node is sure to update the new schema element for any added, changed, or removed row before the backfiller starts, we don't have to worry about the backfiller missing a row added or changed while it runs.
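As a minimal illustration of that invariant (not the actual descriptor code), the sketch below uses simplified stand-ins for the mutation states and for the "every node is write-capable" check that today's backfiller relies on.

```go
package main

import "fmt"

// schemaState is a simplified stand-in for the mutation states a new schema
// element moves through before becoming public.
type schemaState int

const (
	absent             schemaState = iota
	deleteOnly                     // nodes apply deletes to the new element
	deleteAndWriteOnly             // nodes update the new element for every online write
	public                         // the element is readable by queries
)

// safeToBackfill reports whether every node has acknowledged a state in which
// it updates the new element for all online traffic; today's backfiller
// relies on this to guarantee it cannot miss an added or changed row.
func safeToBackfill(nodeStates []schemaState) bool {
	for _, s := range nodeStates {
		if s < deleteAndWriteOnly {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(safeToBackfill([]schemaState{deleteOnly, deleteAndWriteOnly}))         // false
	fmt.Println(safeToBackfill([]schemaState{deleteAndWriteOnly, deleteAndWriteOnly})) // true
}
```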
Motivation
However, a bulk operation like a backfill is often tuned differently and behaves differently from online traffic, and mixing the two can hurt both: backfill operations often want to be larger and like to get lots of work done during a given request (to get more work done for a fixed request overhead), while online operations are small and tend to be concerned with returning quickly. Thus user-observed tail latencies may suffer when small, quick online requests are serialized with big, slow bulk operations. At the same time, having even just a few point writes from online traffic on disk in a range might mean all bulk ingestion to that range is forced to go into L0 instead of a lower level. All that data may then need to be completely rewritten by compaction, increasing the cost of the backfill significantly. Thus neither the backfill nor the online traffic is particularly happy with the other happening at the same time in the same range (TODO: measure this; see point (1) below).
We mix this traffic for the reason mentioned above: the fact that the online updates are happening while we backfill is how we ensure the backfill doesn't miss a new or changed row. However, we already have a built-in way to tell what, if anything, changed in a table since a given time, and indeed we have built incremental BACKUP on precisely this: MVCC iteration with a lower time-bound.
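Here is a toy sketch of what that MVCC primitive gives us, using invented stand-in types rather than the storage package's real ones; the bigger win in practice is that a time-bound iterator can also skip whole files whose contents fall entirely outside the bound.

```go
package main

import "fmt"

// versionedKV is a stand-in for an MVCC key/value: a key plus the timestamp
// at which that revision was written.
type versionedKV struct {
	Key       string
	Timestamp int64 // simplified wall time, standing in for an HLC timestamp
	Value     string
}

// changedSince returns only the revisions written after startTime: exactly
// the data an incremental pass needs to feed back into the backfiller.
func changedSince(kvs []versionedKV, startTime int64) []versionedKV {
	var out []versionedKV
	for _, kv := range kvs {
		if kv.Timestamp > startTime {
			out = append(out, kv)
		}
	}
	return out
}

func main() {
	kvs := []versionedKV{
		{"a", 5, "v1"},  // written before t0, already covered by the prebackfill
		{"a", 15, "v2"}, // rewritten after t0
		{"b", 20, "v1"}, // added after t0
	}
	fmt.Println(changedSince(kvs, 10)) // [{a 15 v2} {b 20 v1}]
}
```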
Proposal
We could utilize this in schema changes as well: before altering a table to send online traffic writes to our new index or column, we could instead just run the backfill of it into otherwise unused keyspace. After that "prebackfill", we could use the same mechanism as incremental BACKUP, but instead of just writing what changed to files, feed it back into our backfiller, thus making an "incremental backfill" that can catch us up.
In practice, this could then look something like this (a rough code sketch of the sequencing follows the list):
a) run prebackfill, reading at time t0, writing to its own key space.
b) advance schema states to start online traffic writing to new key space.
c) wait until schema states / leases ensure all online traffic is running updates; pick time t1
d) run incremental backfill from time t0 to t1
e) make descs public
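A hedged sketch of how steps (a) through (e) might be sequenced; every function here is a placeholder for jobs/schema-changer machinery, named only for this illustration.

```go
package main

import "fmt"

type timestamp int64

// The functions below are placeholders, not real APIs.
func prebackfill(readAt timestamp) {
	fmt.Println("(a) prebackfill into unused keyspace, reading at", readAt)
}
func advanceSchemaStates() {
	fmt.Println("(b) online traffic now writes to the new keyspace")
}
func waitForLeases() timestamp {
	fmt.Println("(c) all nodes are updating the new element; picking t1")
	return 20
}
func incrementalBackfill(from, to timestamp) {
	fmt.Printf("(d) incremental backfill of changes in (%d, %d]\n", from, to)
}
func makePublic() {
	fmt.Println("(e) new element made public")
}

func main() {
	t0 := timestamp(10)
	prebackfill(t0)
	advanceSchemaStates()
	t1 := waitForLeases()
	incrementalBackfill(t0, t1)
	makePublic()
}
```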
Rough Implementation
Given that we have most of the pieces already in place, I think this could be a pretty straightforward project. I imagine the implementation plan looks something like this:
1. Teach the backfiller to read only what changed after a given timestamp, either by adding a lower time-bound to the Scan it runs or teaching TableReader to send ExportRequests, and polishing up ExportRequest some (a rough sketch of the resulting data flow follows this list).
2. Adjust the schema-change state machine so the prebackfill happens before the new element enters delete_and_write_only.
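To make the "feed it back into our backfiller" idea from the proposal concrete, here is a rough sketch of the data flow; rowFromPrimaryKV, indexEntry, and the ingest callback are hypothetical stand-ins for the row fetcher, the index encoder, and the ingestion path.

```go
package main

import "fmt"

// kv is a simplified key/value pair as an incremental export might return it
// from the primary index.
type kv struct{ key, value string }

// row is a decoded table row with a primary key and one indexed column.
type row struct{ pk, indexed string }

// rowFromPrimaryKV stands in for the row fetcher decoding a primary-index KV.
func rowFromPrimaryKV(in kv) row { return row{pk: in.key, indexed: in.value} }

// indexEntry stands in for the index encoder producing the new index's KV.
func indexEntry(r row) kv { return kv{key: r.indexed + "/" + r.pk} }

// incrementalPass turns primary-index KVs that changed since the prebackfill
// read time into entries for the new index and hands them to the ingester.
func incrementalPass(changed []kv, ingest func([]kv)) {
	entries := make([]kv, 0, len(changed))
	for _, c := range changed {
		entries = append(entries, indexEntry(rowFromPrimaryKV(c)))
	}
	ingest(entries)
}

func main() {
	changed := []kv{{"1", "bob"}, {"2", "alice"}} // rows touched after t0
	incrementalPass(changed, func(entries []kv) {
		fmt.Println("ingesting", len(entries), "new-index entries:", entries)
	})
}
```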
More random details/thoughts:
This looks a lot like CDC if you squint, but I suspect we don't actually want to use actual CDC: what we actually want -- no changes while we're working, then all the changes -- seems like it might be the opposite of what CDC wants to give you, i.e. a nearly realtime feed of changes. We probably could subscribe to the whole table, but we'd just need to buffer everything during the potentially long and unbounded backfill. I think incremental backup is closer to what we're looking for.
If for some reason we a) see a big benefit to this and b) discover that under heavy traffic too much changes during the online part, we could potentially add more incremental passes that hopefully shrink in size as we catch up.
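A small sketch of how such repeated passes could be chained, with each pass reading only what changed after the previous pass's read timestamp; the shrinking deltas and the threshold are invented numbers for illustration.

```go
package main

import "fmt"

// fakeRowsChangedSince stands in for an incremental scan counting what
// changed since the last pass; here it is faked with shrinking deltas.
func fakeRowsChangedSince(pass int) int {
	deltas := []int{120000, 9000, 700, 40}
	if pass < len(deltas) {
		return deltas[pass]
	}
	return 0
}

func main() {
	const smallEnough = 1000 // threshold at which one final pass can finish
	t := int64(10)           // t0: read timestamp of the prebackfill
	for pass := 0; ; pass++ {
		now := t + 10 // pretend time advances between passes
		n := fakeRowsChangedSince(pass)
		fmt.Printf("pass %d: backfilling %d rows changed in (%d, %d]\n", pass+1, n, t, now)
		t = now // the next pass only needs what changed after this one's read time
		if n <= smallEnough {
			break // delta is small; do the final catch-up and flip states
		}
	}
}
```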
The reader+writer w/ incremental-reader might also be how we do "almost online" table rewrites (CREATE AS, change PK, repartition, etc.). This assumes we can run the whole change within the GC window. This is probably fine for a PoC but we shouldn't assume this long-term: if we're actively running a job that depends on MVCC history in a given window, we should prevent GC from removing those revisions.
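A hedged sketch of that last point, assuming a protected-timestamp-like facility; the Protector interface below is a stand-in shaped for this example, not the real protected timestamps API.

```go
package main

import "fmt"

// Protector is a stand-in for a protected-timestamp subsystem.
type Protector interface {
	// Protect prevents GC of MVCC revisions at or above ts until released.
	Protect(ts int64) (release func(), err error)
}

// runWithProtection protects the prebackfill's read timestamp before doing
// any work, so GC cannot remove the revisions the incremental passes depend
// on, and releases the protection when the job is done.
func runWithProtection(p Protector, t0 int64, job func() error) error {
	release, err := p.Protect(t0)
	if err != nil {
		return err
	}
	defer release()
	return job()
}

// fakeProtector is a toy in-memory implementation for the example.
type fakeProtector struct{}

func (fakeProtector) Protect(ts int64) (func(), error) {
	fmt.Println("protected MVCC history at timestamps >=", ts)
	return func() { fmt.Println("released protection for", ts) }, nil
}

func main() {
	err := runWithProtection(fakeProtector{}, 10, func() error {
		fmt.Println("prebackfill + incremental backfill running")
		return nil
	})
	if err != nil {
		fmt.Println("schema change failed:", err)
	}
}
```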
Jira issue: CRDB-4482