backfill: rate limit progress updating #67523
Hey @pbardea would this issue have a bigger effect when creating a partial index? For background, I originally filed #67487 where I hit "split failed while applying backpressure" on the jobs table during an index build, and @ajwerner pointed me to this ticket and gave me a workaround. I'm building a simple partial index, like Here is what I'm setting currently to try to work around it:
This got it much further (failing at 70% instead of 5%), but the index build still fails with an error like this: What's odd is that, right around the time the index build stalls, I start getting timeouts trying to view the jobs table in the console. Presumably it's getting tons of write activity and has a lot of tombstones right then. So I was theorizing that, because it's a partial index, maybe it is hitting big swaths that don't meet the condition (the_column is null), and progress is advancing too fast at that point, overwhelming the job table. Anyway, just a guess. If you guys know of any other workaround to make this index build succeed I'd love to hear it. Thanks.
68215: backfill: reduce checkpoint interval r=adityamaru a=pbardea
Informs #67523.
This commit introduces a cluster setting as well as reduces the default checkpoint progress interval.
Backfill progress is checkpointed by writing the set of spans that are left TODO. However, this set of spans could get quite large so each update to this job record can quickly fill a range on backfills of large tables.
This change is a start to improving the situation by introducing a knob that can be tuned back for large backfills. Ideally, the schema change would rate limit itself based on the size of the progress updates.
Release note (ops change): Introduce a cluster setting, bulkio.index_backfill.checkpoint_interval to control the rate at which backfills checkpoint their progress. Useful when needed to be dialed back on backfills of large tables.
Co-authored-by: Paul Bardea <[email protected]>
Hi @dankinder -- sorry for the delay! The workarounds that you provided are a good start. A cluster setting, |
Describe the problem
Recently we witnessed the following scenario: a create index job with a 1MB payload was being updated every 10s. This led the range in the jobs table to start backpressuring writes, and thus blocking them. See #62627. (A 100 KB/s write rate to the table fills up the range pretty quickly.)
The index backfiller should be less aggressive with its job progress updating. Rather than just updating at a fixed rate (today's one update every 10s), it should backpressure based on how fast it's writing to the jobs table. This is likely exacerbated since the index backfiller maintains a list of spans that it still needs to process. As completed work comes in from all 100 nodes, it chops up the list of todo spans, so the progress can grow quite large...
Instead of updating every 10s regardless of how large the payload is, we should at least throttle writes to the jobs record to some target throughput. We may also want to consider limiting the size of todoSpans, at the cost of losing some progress, to prevent these payloads from growing too large.
gz#9008
Epic CRDB-8816
gz#9032
gz#8907
Jira issue: CRDB-8588