schema changes: pre-backfill without online traffic then incremental backfill after #36850
Comments
This is currently just an initial brain dump of something I've mentioned a couple of times recently, and it seemed like it was time to put some thoughts in writing (particularly since it is potentially offered as a later stepping stone for table-rewriting jobs building on a fast backfill). Also, h/t to @andy-kimball for getting me thinking about this, with what I think might have just been an offhand comment asking why we didn't do, in broad strokes, something like this.
This all seems pretty tractable to me.
I've thought about this a little in the context of an extremely large table (terabytes, petabytes) with a high write rate. We'd likely want to do AddSSTable-only backfill passes until we get "close enough", then switch to something like WriteBatch for the backfill once we enter the final catch-up pass.
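To make the "close enough" switch concrete, here is a minimal Go sketch under the assumption that we can read the remaining delta for each pass; KV, bulkIngest, and pointWrites are invented stand-ins for the real AddSSTable and WriteBatch paths, not actual CockroachDB APIs.

```go
package main

import "fmt"

// KV is a key/value pair produced by one catch-up pass.
type KV struct {
	Key, Value []byte
}

// bulkIngest stands in for building SSTs and sending AddSSTable requests.
func bulkIngest(kvs []KV) {
	fmt.Printf("AddSSTable-style pass: %d keys\n", len(kvs))
}

// pointWrites stands in for ordinary WriteBatch-style transactional writes.
func pointWrites(kvs []KV) {
	fmt.Printf("WriteBatch-style pass: %d keys\n", len(kvs))
}

// catchUp repeatedly reads the delta accumulated since the previous pass,
// using bulk ingestion while the delta is large and switching to normal
// writes once it is "close enough" to caught up.
func catchUp(readDelta func() []KV, closeEnough int) {
	for {
		kvs := readDelta()
		if len(kvs) == 0 {
			return // fully caught up
		}
		if len(kvs) > closeEnough {
			bulkIngest(kvs)
			continue
		}
		pointWrites(kvs)
		return
	}
}

func main() {
	// Simulated deltas shrinking across passes.
	deltas := [][]KV{make([]KV, 10000), make([]KV, 400)}
	i := 0
	catchUp(func() []KV {
		if i >= len(deltas) {
			return nil
		}
		d := deltas[i]
		i++
		return d
	}, 1000)
}
```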
Agreed. CDC doesn't really seem to me like a building block for this: why buffer when we have the source of truth available?
This makes a lot of sense, especially if you can create the initial backfilled index very fast, which I'm pretty sure you can. You're right that a major savings comes from this index getting into RocksDB L6. #19749 is the issue related to time-bound scans.
Protected timestamps mitigate the GC problems outlined in the description. I look forward to understanding not how this approach improves the performance of the backfill but rather how it improves the performance of foreground workloads which either: A) are highly contended
One issue with concurrent writes to indexes is that they make AddSSTable much worse and tend to lead to backpressure. Relatively uniform writes to a table which has an index being added can bring that index build to a crawl as compaction debt builds up quickly. This is readily observed in 20.1 with both Pebble and RocksDB; maybe it is better with 20.2 Pebble. Another note is that we don't rebalance ranges based on applying SSTs, so we tend not to spread this compaction load around in large clusters. @dt, do you think that last point is worthy of its own issue?
MVCC-compatible backfills do almost exactly what's described in this issue, so I'm going to close this. Please re-open if I'm wrong.
TL;DR
Pre-backfill a new index before it gets any online traffic, then start online traffic and catch it up using a new "incremental backfill" that is cheaper, so online traffic mixes with less backfill traffic.
Background
Currently our schema change process follows the online schema change process documented in the F1 paper pretty closely: we advance through a series of states, and most importantly, we ensure all nodes have advanced to a state in which they update the new schema element during all online traffic before we start backfilling. We do this so we can be sure we correctly process every row: if every node is sure to update the new schema element for any added, changed, or removed row before the backfiller starts, we don't have to worry about the backfiller missing a row added or changed while it runs.
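As a minimal illustration of that invariant (not the actual descriptor code), the sketch below uses simplified stand-ins for the mutation states and for the "every node is write-capable" check that today's backfiller relies on.

```go
package main

import "fmt"

// schemaState is a simplified stand-in for the mutation states a new schema
// element moves through before becoming public.
type schemaState int

const (
	absent             schemaState = iota
	deleteOnly                     // nodes apply deletes to the new element
	deleteAndWriteOnly             // nodes update the new element for every online write
	public                         // the element is readable by queries
)

// safeToBackfill reports whether every node has acknowledged a state in which
// it updates the new element for all online traffic; today's backfiller
// relies on this to guarantee it cannot miss an added or changed row.
func safeToBackfill(nodeStates []schemaState) bool {
	for _, s := range nodeStates {
		if s < deleteAndWriteOnly {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(safeToBackfill([]schemaState{deleteOnly, deleteAndWriteOnly}))         // false
	fmt.Println(safeToBackfill([]schemaState{deleteAndWriteOnly, deleteAndWriteOnly})) // true
}
```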
Motivation
However, a bulk operation like a backfill is often tuned differently and behaves differently from online traffic, and mixing the two can hurt both: backfill operations often want to be larger and like to get lots of work done during a given request (to get more work done for a fixed request overhead), while online operations are small and tend to be concerned with returning quickly. Thus user-observed tail latencies may suffer when small, quick online requests are serialized with big, slow bulk operations. At the same time, having even just a few point writes from online traffic on disk in a range might mean all bulk ingestion to that range is forced to go into L0 instead of a lower level. All that data may then need to be completely rewritten by compaction, increasing the cost of the backfill significantly. Thus neither the backfill nor the online traffic is particularly happy with the other happening at the same time in the same range (TODO: measure this; see point (1) below).
We mix this traffic for the reason mentioned above: the fact that the online updates are happening while we backfill is how we ensure the backfill doesn't miss a new or changed row. However, we already have a built-in way to tell what, if anything, changed in a table since a given time, and indeed we have built incremental BACKUP on precisely this: MVCC iteration with a lower time-bound.
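Here is a toy sketch of what that MVCC primitive gives us, using invented stand-in types rather than the storage package's real ones; the bigger win in practice is that a time-bound iterator can also skip whole files whose contents fall entirely outside the bound.

```go
package main

import "fmt"

// versionedKV is a stand-in for an MVCC key/value: a key plus the timestamp
// at which that revision was written.
type versionedKV struct {
	Key       string
	Timestamp int64 // simplified wall time, standing in for an HLC timestamp
	Value     string
}

// changedSince returns only the revisions written after startTime: exactly
// the data an incremental pass needs to feed back into the backfiller.
func changedSince(kvs []versionedKV, startTime int64) []versionedKV {
	var out []versionedKV
	for _, kv := range kvs {
		if kv.Timestamp > startTime {
			out = append(out, kv)
		}
	}
	return out
}

func main() {
	kvs := []versionedKV{
		{"a", 5, "v1"},  // written before t0, already covered by the prebackfill
		{"a", 15, "v2"}, // rewritten after t0
		{"b", 20, "v1"}, // added after t0
	}
	fmt.Println(changedSince(kvs, 10)) // [{a 15 v2} {b 20 v1}]
}
```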
Proposal
We could utilize this in schema changes as well: before altering a table to send online traffic writes to our new index or column, we could instead just run the backfill of it into otherwise unused keyspace. After that "prebackfill", we could use the same mechanism as incremental BACKUP, but instead of just writing what changed to files, feed it back into our backfiller, thus making an "incremental backfill" that can catch us up.
In practice, this could then look something like this (a rough code sketch of the sequencing follows the list):
a) run prebackfill, reading at time t0, writing to its own key space.
b) advance schema states to start online traffic writing to new key space.
c) wait until schema states / leases ensure all online traffic is running updates; pick time t1
d) run incremental backfill from time t0 to t1
e) make descs public
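A hedged sketch of how steps (a) through (e) might be sequenced; every function here is a placeholder for jobs/schema-changer machinery, named only for this illustration.

```go
package main

import "fmt"

type timestamp int64

// The functions below are placeholders, not real APIs.
func prebackfill(readAt timestamp) {
	fmt.Println("(a) prebackfill into unused keyspace, reading at", readAt)
}
func advanceSchemaStates() {
	fmt.Println("(b) online traffic now writes to the new keyspace")
}
func waitForLeases() timestamp {
	fmt.Println("(c) all nodes are updating the new element; picking t1")
	return 20
}
func incrementalBackfill(from, to timestamp) {
	fmt.Printf("(d) incremental backfill of changes in (%d, %d]\n", from, to)
}
func makePublic() {
	fmt.Println("(e) new element made public")
}

func main() {
	t0 := timestamp(10)
	prebackfill(t0)
	advanceSchemaStates()
	t1 := waitForLeases()
	incrementalBackfill(t0, t1)
	makePublic()
}
```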
Rough Implementation
Given that we have most of the pieces already in place, I think this could be a pretty straightforward project. I imagine the implementation plan looks something like this:
1. Teach the backfiller to read only what changed after a given timestamp, either by adding a lower time-bound to the Scan it runs or teaching TableReader to send ExportRequests, and polishing up ExportRequest some (a rough sketch of the resulting data flow follows this list).
2. Adjust the schema-change state machine so the prebackfill happens before the new element enters delete_and_write_only.
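To make the "feed it back into our backfiller" idea from the proposal concrete, here is a rough sketch of the data flow; rowFromPrimaryKV, indexEntry, and the ingest callback are hypothetical stand-ins for the row fetcher, the index encoder, and the ingestion path.

```go
package main

import "fmt"

// kv is a simplified key/value pair as an incremental export might return it
// from the primary index.
type kv struct{ key, value string }

// row is a decoded table row with a primary key and one indexed column.
type row struct{ pk, indexed string }

// rowFromPrimaryKV stands in for the row fetcher decoding a primary-index KV.
func rowFromPrimaryKV(in kv) row { return row{pk: in.key, indexed: in.value} }

// indexEntry stands in for the index encoder producing the new index's KV.
func indexEntry(r row) kv { return kv{key: r.indexed + "/" + r.pk} }

// incrementalPass turns primary-index KVs that changed since the prebackfill
// read time into entries for the new index and hands them to the ingester.
func incrementalPass(changed []kv, ingest func([]kv)) {
	entries := make([]kv, 0, len(changed))
	for _, c := range changed {
		entries = append(entries, indexEntry(rowFromPrimaryKV(c)))
	}
	ingest(entries)
}

func main() {
	changed := []kv{{"1", "bob"}, {"2", "alice"}} // rows touched after t0
	incrementalPass(changed, func(entries []kv) {
		fmt.Println("ingesting", len(entries), "new-index entries:", entries)
	})
}
```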
More random details/thoughts:
This looks a lot like CDC if you squint, but I suspect we don't actually want to use actual CDC: what we actually want -- no changes while we're working, then all the changes -- seems like it might be the opposite of what CDC wants to give you, i.e. a nearly realtime feed of changes. We probably could subscribe to the whole table, but we'd just need to buffer everything during the potentially long and unbounded backfill. I think incremental backup is closer to what we're looking for.
If for some reason we a) see a big benefit to this and b) discover that under heavy traffic too much changes during the online part, we could potentially add more incremental passes that hopefully shrink in size as we catch up.
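A small sketch of how such repeated passes could be chained, with each pass reading only what changed after the previous pass's read timestamp; the shrinking deltas and the threshold are invented numbers for illustration.

```go
package main

import "fmt"

// fakeRowsChangedSince stands in for an incremental scan counting what
// changed since the last pass; here it is faked with shrinking deltas.
func fakeRowsChangedSince(pass int) int {
	deltas := []int{120000, 9000, 700, 40}
	if pass < len(deltas) {
		return deltas[pass]
	}
	return 0
}

func main() {
	const smallEnough = 1000 // threshold at which one final pass can finish
	t := int64(10)           // t0: read timestamp of the prebackfill
	for pass := 0; ; pass++ {
		now := t + 10 // pretend time advances between passes
		n := fakeRowsChangedSince(pass)
		fmt.Printf("pass %d: backfilling %d rows changed in (%d, %d]\n", pass+1, n, t, now)
		t = now // the next pass only needs what changed after this one's read time
		if n <= smallEnough {
			break // delta is small; do the final catch-up and flip states
		}
	}
}
```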
The reader+writer w/ incremental-reader might also be how we do "almost online" table rewrites (CREATE AS, change PK, repartition, etc.). This assumes we can run the whole change within the GC window. This is probably fine for a PoC but we shouldn't assume this long-term: if we're actively running a job that depends on MVCC history in a given window, we should prevent GC from removing those revisions.
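A hedged sketch of that last point, assuming a protected-timestamp-like facility; the Protector interface below is a stand-in shaped for this example, not the real protected timestamps API.

```go
package main

import "fmt"

// Protector is a stand-in for a protected-timestamp subsystem.
type Protector interface {
	// Protect prevents GC of MVCC revisions at or above ts until released.
	Protect(ts int64) (release func(), err error)
}

// runWithProtection protects the prebackfill's read timestamp before doing
// any work, so GC cannot remove the revisions the incremental passes depend
// on, and releases the protection when the job is done.
func runWithProtection(p Protector, t0 int64, job func() error) error {
	release, err := p.Protect(t0)
	if err != nil {
		return err
	}
	defer release()
	return job()
}

// fakeProtector is a toy in-memory implementation for the example.
type fakeProtector struct{}

func (fakeProtector) Protect(ts int64) (func(), error) {
	fmt.Println("protected MVCC history at timestamps >=", ts)
	return func() { fmt.Println("released protection for", ts) }, nil
}

func main() {
	err := runWithProtection(fakeProtector{}, 10, func() error {
		fmt.Println("prebackfill + incremental backfill running")
		return nil
	})
	if err != nil {
		fmt.Println("schema change failed:", err)
	}
}
```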
Jira issue: CRDB-4482