Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the ability to disable rescheduling of lost tasks. Implements issue #10366 #16867

Closed
wants to merge 1 commit into from

Conversation

DominicLavery
Copy link
Contributor

This adds the ability to prevent tasks being rescheduled when their allocations are lost as suggested in #10366

Jobs with lost allocations are left in the running state and require intervention. Currently, jobs can be stopped manually via the CLI or API. e.g. nomad job stop. A future improvement would be to implement nomad alloc reschedule as suggested in the issue.

@hashicorp-cla
Copy link

hashicorp-cla commented Apr 12, 2023

CLA assistant check
All committers have signed the CLA.

@tgross
Copy link
Member

tgross commented May 18, 2023

Hi @DominicLavery! Sorry about the delayed review on this. The issue #10366 was really still a little bit in the design stage, so there's some open questions about how this should interact with the existing max_client_disconnect feature, which prevents tasks from being marked as lost but replaces them temporarily until the allocation reconnects or the max_client_disconnect window expires. It might make sense for reschedule_on_lost=false to interact with that so that we don't create those replacements either (that's certainly how I'd want to use it). I'm going to have an internal discussion with the team in the next day or so and then I'll pop back over here and provide some direction.

(In the meantime, the GitHub Actions run didn't happen here, I think because this PR came in before we fixed the "on PR" bits. You might want to rebase on main.)

@DominicLavery
Copy link
Contributor Author

DominicLavery commented May 30, 2023

Hey @tgross. Thanks for the response. Sorry to skip ahead in the process a bit, we had a bit of a pressing need for this kind of feature. I'd be happy to rework it based off of the results of your discussions. I agree that max_client_disconnect should be taken in to account here.

I've rebased (and re-enabled GH actions on the fork 😅) but I only get the one check, I'm not sure if that is expected.

@Juanadelacuesta Juanadelacuesta self-assigned this Oct 26, 2023
@Juanadelacuesta
Copy link
Member

Juanadelacuesta commented Nov 13, 2023

Hey @DominicLavery! Thank you for all the work on this issue. On reviewing the code, it all seems to be working, but on a closer inspection, there are two problems:

  1. The new option reschedule_on_lost defaults to true, causing a bit of a problem for older job definitions that now need to be updated to include the new option.
  2. Your PR successfully manages to avoid rescheduling of a lost allocation, but if any other event happens, including the lost node coming back online, a new reconciliation on the server is triggered and the lost allocation is rescheduled and starts again.

We think this is a very interesting feature and there is an internal discussion about how to move on with it.

@DominicLavery
Copy link
Contributor Author

Hey @Juanadelacuesta! Thanks for the feedback!
On point 1: rescheduling on lost is the current behaviour, so it defaults to true for that backwards compatibility. The option only needs to be added if you want to disable it. Older job definitions will work as the did previously.

On point 2: I hadn't considered that!

Happy to work on point 2, and @tgross's point about max_client_disconnect once your internal discussions are finished and your happy about the path forward

@Juanadelacuesta Juanadelacuesta removed their assignment Jan 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

4 participants