Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs: canceling a job fails if the job's range is backpressuring writes #62627

Closed
andreimatei opened this issue Mar 25, 2021 · 7 comments
Closed
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. N-followup Needs followup. O-postmortem Originated from a Postmortem action item. T-kv KV Team

Comments

@andreimatei
Copy link
Contributor

andreimatei commented Mar 25, 2021

In https://github.com/cockroachlabs/support/issues/900 we saw that we couldn't cancel a pathological job once its row in system.jobs had gotten large enough that the range is refusing writes (futilely waiting for a split).

One way or another, canceling needs to work. We need, I guess, to find a way of marking the respective write as immune to write backpressuring.

cc @nvanbenschoten @dt

Jira issue: CRDB-2756

@andreimatei andreimatei added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. O-postmortem Originated from a Postmortem action item. labels Mar 25, 2021
@nvanbenschoten
Copy link
Member

We could pretty easily allow any high-priority write to skip the backpressure mechanism, regardless of the range size. Maybe we expose that from KV and then hook up certain operations like CANCEL JOB to run at high-priority?

@ajwerner
Copy link
Contributor

You can already run it at high priority if you just do it in a transaction. One thing to watch out for is that if you let it get big enough, then the backpressure will totally turn off...

@ajwerner
Copy link
Contributor

#46863

@lunevalex lunevalex added the N-followup Needs followup. label Apr 26, 2021
@jlinder jlinder added the T-kv KV Team label Jun 16, 2021
@aliher1911 aliher1911 self-assigned this Oct 14, 2021
@erikgrinaker
Copy link
Contributor

After discussing this in an internal Slack thread, we've decided to put this on the backburner and see if this is still a problem with 21.2. We've implemented lots of mitigations in the jobs infra and components that use it, and we have known workarounds (e.g. increasing the range size).

@shralex
Copy link
Contributor

shralex commented Mar 7, 2023

@erikgrinaker @nvanbenschoten should we close this ? I don't remember anything like this coming up recently, do you ?

@erikgrinaker
Copy link
Contributor

I don't know if this is still a problem. I think it might be, but it's possible that we have mitigations in the jobs code. @cockroachdb/cdc owns the jobs infra now.

@knz
Copy link
Contributor

knz commented Apr 2, 2023

We have addressed this by moving the job updates to a separate job_info table and never overwriting rows, so splits are always possible.

@knz knz closed this as completed Apr 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. N-followup Needs followup. O-postmortem Originated from a Postmortem action item. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

9 participants