Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create a CANCELLING state type #7794

Merged
merged 9 commits into from
Dec 15, 2022
Merged

Conversation

bunchesofdonald
Copy link
Contributor

@bunchesofdonald bunchesofdonald commented Dec 6, 2022

The MVP for cancellation used a CANCELLED state with a name of Cancelling for the agent to decide which flows need to be cancelled. The main issue with this is that concurrency slots are immediately released when that transition happens even though the tasks/flows are still technically running. There is also some general 'ick' factor in using a state name for an important operation like cancellation.

This introduces a CANCELLING state type and sets it up so that it maintains / occupies concurrency limit slots, both for tag-based task usage as well as work queues.

Related to #7735

Checklist

  • This pull request references any related issue by including "closes <link to issue>"
    • If no issue exists and your change is not a small fix, please create an issue first.
  • This pull request includes tests or only affects documentation.
  • This pull request includes a label categorizing the change e.g. fix, feature, enhancement

@bunchesofdonald bunchesofdonald added the enhancement An improvement of an existing feature label Dec 6, 2022
@netlify
Copy link

netlify bot commented Dec 6, 2022

Deploy Preview for prefect-orion ready!

Name Link
🔨 Latest commit 880e2c7
🔍 Latest deploy log https://app.netlify.com/sites/prefect-orion/deploys/639a31b1df45c90008647019
😎 Deploy Preview https://deploy-preview-7794--prefect-orion.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

@zanieb
Copy link
Contributor

zanieb commented Dec 6, 2022

Does this feel like the correct move?

@bunchesofdonald
Copy link
Contributor Author

I think it's the correct move, I think it makes more sense than the RUNNING -> RUNNING transition we discussed.

@bunchesofdonald bunchesofdonald force-pushed the issue-7735 branch 2 times, most recently from e58361f to e62e4a0 Compare December 7, 2022 15:14
@anticorrelator
Copy link
Contributor

is there a specific interaction the cancelling state helps us spell out? I'm curious since I implemented something similar for resuming by adding a marker to a run

@bunchesofdonald
Copy link
Contributor Author

@anticorrelator I'd have to know more about the 'marker to a run' to say if that would work here. The interaction that this specifically is about is the interaction between the agent and the running flows. It needs to know which flow runs to Cancel and to also know which flow runs have already been cancelled. Currently this is accomplished by setting the flow run into a Cancelled state with a name of Cancelling, then when the cancellation is complete the state is renamed to Cancelled.

This works, but as stated above, it cause two issues:

  1. The concurrency slots are released when transitioning from RUNNING -> CANCELLED, which isn't what we want as flows in this state are still technically running.
  2. There's an 'ick' factor of using the name of a state for this kind of operation.

@anticorrelator
Copy link
Contributor

@anticorrelator I'd have to know more about the 'marker to a run' to say if that would work here. The interaction that this specifically is about is the interaction between the agent and the running flows. It needs to know which flow runs to Cancel and to also know which flow runs have already been cancelled. Currently this is accomplished by setting the flow run into a Cancelled state with a name of Cancelling, then when the cancellation is complete the state is renamed to Cancelled.

This works, but as stated above, it cause two issues:

1. The concurrency slots are released when transitioning from RUNNING -> CANCELLED, which isn't what we want as flows in this state are still technically running.

2. There's an 'ick' factor of using the name of a state for this kind of operation.

There's definitely an ick factor on doing client operations on state names. I was thinking of a mechanism that looked like attaching metadata to the Run somehow, for example run.cancelling = True and doing clientside operations that way. I don't know if it feels that much better but if we should always treat a Cancelling state as a Running state for the sake of orchestration I was hoping we could maybe discuss some alternatives briefly

@zanieb
Copy link
Contributor

zanieb commented Dec 7, 2022

I'm also a little worried about adding CANCELLING if it's always treated like RUNNING, but I think there are some important differences, like:

  • We won't allow retries from a CANCELLING.
  • We won't allow CRASHED or FAILED from CANCELLING.

I think retaining concurrency slots may be one of the few cases where they are identical.

@bunchesofdonald
Copy link
Contributor Author

@anticorrelator I'd be more that happy to discuss alternatives. In the cancellation doc we outlined a bunch of discussion we had before we implemented the current version: https://www.notion.so/prefect/Run-Cancellation-6ccd41c867aa4abf903e23db42a3f287#3536e0384bc94aebbe27b2a3cb43d543

@bunchesofdonald
Copy link
Contributor Author

@anticorrelator and I discussed this and we agreed that this is probably the best way forward given the needs of the agent.

Copy link
Contributor

@anticorrelator anticorrelator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, as mentioned async: we would benefit from an extension to PreventRedundantTransitions or a new rule that only allows Cancelling -> Cancelled (and maybe crashed/failed?) transitions

@zanieb
Copy link
Contributor

zanieb commented Dec 8, 2022

What kind of migration strategy do we have in mind for Cloud? How can we make it clear that this feature requires changes and client / server versions can't be mixed in the OSS?

@bunchesofdonald
Copy link
Contributor Author

bunchesofdonald commented Dec 8, 2022

@madkinsz

What kind of migration strategy do we have in mind for Cloud?

After discussion with Zach and Dustin it appears that we'll need to:

  • Bump the Orion API version
  • Add a compatibility check for the rule update and the new rule I'm going to create for preventing Cancelling -> Anything but Cancelled.

How can we make it clear that this feature requires changes and client / server versions can't be mixed in the OSS?

Isn't that what the bump to the Orion API version signals?

@zanieb
Copy link
Contributor

zanieb commented Dec 8, 2022

@bunchesofdonald Bumping the API version isn't particularly user facing. I honestly am often confused about when it causes errors to be raised these days though. What happens when I have a server with this code and a client with the old code calls prefect flow-run cancel? What happens when an agent is running on the old code and uses the wrong filter?

@bunchesofdonald
Copy link
Contributor Author

bunchesofdonald commented Dec 14, 2022

What happens when I have a server with this code and a client with the old code calls prefect flow-run cancel? What happens when an agent is running on the old code and uses the wrong filter?

@madkinsz After discussing this with @abrookins, we decided that feature flagging was the way to go here. This change should be good to go as it is, but I'll follow up with another PR to make the CLI/Agent check that flag and change the state it's using.

if not context.validated_state.is_running():
if context.validated_state.type not in [
states.StateType.RUNNING,
states.StateType.CANCELLING,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will need a matching update in cloud

@bunchesofdonald bunchesofdonald changed the base branch from main to feature/better-cancellation December 15, 2022 18:02
@bunchesofdonald bunchesofdonald merged commit 55b0e60 into feature/better-cancellation Dec 15, 2022
@bunchesofdonald bunchesofdonald deleted the issue-7735 branch December 15, 2022 18:03
bunchesofdonald added a commit that referenced this pull request Dec 16, 2022
zanieb pushed a commit that referenced this pull request Jan 10, 2023
zanieb pushed a commit that referenced this pull request Jan 10, 2023
zanieb pushed a commit that referenced this pull request Jan 11, 2023
zanieb pushed a commit that referenced this pull request Jan 25, 2023
anticorrelator pushed a commit that referenced this pull request Jan 25, 2023
ddelange added a commit to ddelange/prefect that referenced this pull request Jan 26, 2023
…erm-testing

* 'main' of https://github.com/prefecthq/prefect: (77 commits)
  Update roles and permissions in documentation (PrefectHQ#8263)
  Add Prefect Cloud Quickstart tutorial (PrefectHQ#8227)
  Remove needless log
  Update comment for consistency
  Reorder migrations for clarity
  Refactor cancellation cleanup service
  Uses canonical `CANCELLING` states for run cancellations (PrefectHQ#8245)
  Add cancellation cleanup service (PrefectHQ#8128)
  Improve engine shutdown handling of SIGTERM (PrefectHQ#8127)
  Create a `CANCELLING` state type (PrefectHQ#7794)
  Update KubernetesJob options (PrefectHQ#8261)
  Small work pools UI updates (PrefectHQ#8257)
  Removes migration logic (PrefectHQ#8255)
  Consolidate multi-arch docker builds (PrefectHQ#7902)
  Include nested `pydantic.BaseModel` secret fields in blocks' schema (PrefectHQ#8246)
  Improve contributing documentation with venv instructions (PrefectHQ#8247)
  Update Python tests to use a single test matrix for both databases (PrefectHQ#8171)
  Adds migration logic for work pools (PrefectHQ#8214)
  Add `project_urls` to `setup.py` (PrefectHQ#8224)
  Add `is_schedule_active` to client `Deployment` class (PrefectHQ#7430)
  ...
github-actions bot pushed a commit that referenced this pull request Jan 26, 2023
github-actions bot pushed a commit that referenced this pull request Jan 26, 2023
@zanieb zanieb mentioned this pull request Feb 3, 2023
3 tasks
zanieb pushed a commit that referenced this pull request Feb 3, 2023
zanieb pushed a commit that referenced this pull request Feb 3, 2023
zanieb pushed a commit that referenced this pull request Feb 6, 2023
zanieb pushed a commit that referenced this pull request Feb 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement An improvement of an existing feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants