sql: benchmark "offline backfills" #54955
Labels
A-schema-changes
C-investigation
Further steps needed to qualify. C-label will change.
T-disaster-recovery
See #36850. We know creating an index has a performance impact but we don't fully understand why or what will solve it. We can run some tests to figure out what is happening and what will help, and how much "offline backfills" would help or if it is something else.
Updated suggestions, Spring 2021:
One theory for what causes SQL performance degradation is that it is thundering herd of traffic to a single un-split index range that cannot handle that much traffic, as opposed to the backfill itself. We can test that theory! Create an index and immediately pause the create index job, so it doesn't backfill. Does the performance crater? If so, then it isn't the backfill and we know it is indeed about making the SQL traffic hit this new index, not how we're filling it.
If this is true, once we know it is about making the SQL traffic hit the new index and not the index backfill, we can keep narrowing down why that happens. If we can, we might then try splitting that (empty) index span by hand, if we can come up with some plausible index keys. Select some rows from the table, use those to call SPLIT AT, and see: does that make it happy again? If so, boom, we now have proof it is a) that SQL traffic load has to hit the new index span that is b) under-split for that load. Time to figure out how to split it (which may or may not be to pre-backfill it, which would cause size splitting to happen. while that might be one way to get that span split, there may be others as well). OTOH, if not-backfilling still shows import but the splitting doesn't help, that would be very revealing too. And indeed, it suggests that pre-backfill won't help at all.
Now if the performance doesn't crater just from having the new index in DELETE_AND_WRITE_ONLY, but then does when we unpause the backfill, well then, that is also enormously informative: now we know we need to look at the backfill itself - quotapool, addsstable, etc. This would be where #36850 could potentially be the only solution or at least easiest (despite its size) solution. In this case, we could then partially validate that it truly is the interaction of the backfill and the SQL traffic by having two identical tables side-by-side, and send SQL load to one -- with a paused index backfill as described above -- while backfilling the other copy, which is seeing no load. Does backfilling the unused second table affect the online table? If so, then 36850 may not have much impact, since this test is more or less modeling the same thing it will: doing the backfill to one span while sql uses a different span.
Jira issue: CRDB-3701
The text was updated successfully, but these errors were encountered: