System Jobs never marked as stable and redeploys occur #9804
Comments
Thanks for reporting this. We'll see if we can make a minimal reproduction from these notes.
I was only able to partially reproduce this issue. I used the following minimal reproduction jobspec:
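For context, a minimal system jobspec of the shape being discussed might look like the sketch below; the names, image, and update stanza are assumptions, not the actual reproduction spec.

```hcl
# Illustrative minimal system job, not the exact reproduction spec.
job "repro-system" {
  datacenters = ["dc1"]
  type        = "system" # one allocation per eligible client node

  # Assumption: an update stanza like this is what draws the
  # scheduler warnings mentioned below.
  update {
    max_parallel = 1
  }

  group "app" {
    task "echo" {
      driver = "docker"

      config {
        image = "hashicorp/http-echo"
        args  = ["-text", "ok"]
      }

      resources {
        cpu    = 100
        memory = 64
      }
    }
  }
}
```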
I ran the job and got the warnings I'd expect to see for the
The allocations come up and are running, as we'd expect (detailed allocation status):
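That level of detail comes from the standard CLI check; the allocation ID here is a placeholder:

```sh
# verbose status for a single allocation (ID is hypothetical)
nomad alloc status -verbose 4aa98a82
```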
So far so good. If we look at the deployment list and job history, we indeed see that we're not marked as stable and we don't see the deployment in the list. That being said, system jobs don't actually support deployments, so this may be a combination of some bad UX and a documentation gap.
The more troubling symptom you reported was that the job was re-running. I wasn't able to reproduce this.
I tried that a few times and got the same result every time. I've tested that on both 1.0.1 and 1.0.3. In the meantime, I'll get a fix for the stability marking onto the backlog.
@tgross - thanks for looking into this. Do you think the different behavior might be explained by the fact that we are always using the HCL1 parser flag? We are also on 1.0.2 (if that helps).
That's a good idea to check! But I just checked that (both making sure I was testing against the 1.0.2 release and/or with
@tgross - I am totally at a loss now, lol. We can confirm the template is not changing between runs (it does render every time in our pipeline, but the resulting file is the same every time). Our allocs are littered with things like the below. A diff between 2 versions shows the following 4 fields as different: JobModifyIndex
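A version diff like the one described can be pulled from the CLI; "fabio" assumes the job name from this report:

```sh
# list job versions, showing the diff between each version
# and its predecessor
nomad job history -p fabio
```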
Ok, I think my next step is to try to verify that the Vault token isn't an issue here. This is beginning to remind me of #6471, which I recently spent some time on unsuccessfully reproducing; the issue there was similar in that the scheduler would make a placement even though there was no diff. I did just discover a bug in the diffs for volumes (#9973), but maybe there's a similar bug elsewhere that we're just not hitting in most cases. I see things like
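For concreteness, the kind of Vault-backed template under suspicion looks roughly like this sketch; the policy name, secret path, and image are hypothetical:

```hcl
# Illustrative only, not the reporter's spec; these stanzas sit
# inside a task block in the full jobspec.
task "fabio" {
  driver = "docker"

  config {
    image = "fabiolb/fabio"
  }

  vault {
    policies = ["fabio"] # hypothetical policy
  }

  template {
    # hypothetical KV v2 secret path
    data        = "{{ with secret \"secret/data/fabio\" }}{{ .Data.data.api_key }}{{ end }}"
    destination = "local/fabio.env"
    env         = true
  }
}
```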
Hey @tgross - that's just the Levant stuff we use for rendering. I was wondering if the Vault template was maybe a suspect... that's an interesting observation.
I did a first-pass test using our E2E environment to stand up a cluster with Vault, and checked both static (
@luckymike it looks like the changelog got accidentally updated incorrectly in #8808. The original pointed to #8559, which was fixed by #8773. That fix only adds the
(Also, just a heads up: I'm no longer a maintainer on Nomad or at HashiCorp, although I may continue to contribute now and then in my spare time... you'll want to ping other folks to nudge issues along from here on out 😀)
So - a little more data. We're seeing this behavior at times on more than just system jobs. Here's a snapshot of the zookeeper job I just happened to run. Is it possible this is related to token changes or something behind the scenes we are missing? It seems like it's something at the group level?
Edit: This did NOT happen the next time I tried to deploy the job with no changes.
Sorry to hear you are not a maintainer, @tgross. We appreciate all that you have done for the project!
Hi @BijanJohn!
Can you please verify whether you are still running into the same problem and, if so, provide us with more context?
@BijanJohn I'm gonna go ahead and close this ticket. If you are still running into difficulties, don't hesitate to reach out to the team again.
Nomad version
Nomad v1.0.1 (c9c68aa)
Operating system and Environment details
PhotonOS 3
Issue
We have noticed that system jobs are not being marked as stable (Fabio in our case) even when they are running as expected.
In addition (we suspect this is related), subsequent attempts to deploy the same job with no changes result in multiple versions in the deployment history (again with no changes).
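One way to confirm a resubmission really carries no changes is to plan it first; the file name here is illustrative:

```sh
# a no-change plan should report nothing to update before re-running
nomad job plan fabio.hcl
```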
Reproduction steps
First we create a Nomad cluster and deploy a system job (an example command is sketched below). Note we randomly generate a file name; there's nothing sensitive here.
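A minimal sketch of a deploy step of this shape (file names are illustrative):

```sh
# copy the jobspec to a randomly named file, then submit it
JOB_FILE="fabio-$(date +%s).hcl"
cp fabio.hcl "$JOB_FILE"
nomad job run "$JOB_FILE"
```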
Next, let's get the status of the system job. In this case we have 2 potential clients, 1 Windows and 1 Linux, but we don't really care if there are no nodes (in this case, for Windows).
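The status check, assuming the job name used throughout this report:

```sh
# summary and allocations for the system job
nomad job status fabio
```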
and that results in an output like we'd expect
So far, so good. Now let's get the deployment status:
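The deployment check (job name assumed as above):

```sh
# list deployments for the job; per the discussion above, system
# jobs don't create deployments, so this comes back empty
nomad job deployments fabio
```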
and we get back
Ok - that's not what we expect, but let's check out the UI. Interestingly, we see "stable" marked as false here even though everything looks healthy and happy.
Just in case, we add the Windows node and we see Fabio get scheduled there
and get back what we'd expect
BUT when we run
we see this... just like the UI shows. Not stable, but everything is healthy. Consul health checks are happy and Nomad is happily running the job.
Nomad
Consul
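For reference, per-version stability is also visible from the CLI via the history view (the exact command used isn't shown; job name assumed):

```sh
# each version in the history reports a Stable = true/false field
nomad job history fabio
```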
Now, if we run the deploy again (a few times, with ZERO changes to the job file), we start to see deployments in the UI that look like this, AND the allocations all get restarted over and over again with each deploy.
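That re-run step is just resubmitting the identical file, e.g.:

```sh
# resubmit the unchanged jobspec several times
for i in 1 2 3; do
  nomad job run "$JOB_FILE"
done
```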
Let's run
and now we see (kind of as we expect, since I'm guessing the scheduler is just trying to get this to "stable true")
and the UI shows this
Job file (if appropriate)
Here's the Fabio job file (the Windows group removed for brevity, but it's the same for both groups).