
sometimes I see -1 running number of jobs for periodic jobs in the UI #13897

Open
shantanugadgil opened this issue Jul 22, 2022 · 4 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/job-summary type/bug

Comments

@shantanugadgil
Contributor

Nomad version

Output from nomad version
Nomad v1.3.2 (bf60297)

Operating system and Environment details

Amazon Linux 2

Issue

Sometimes I notice that the UI shows a running count of -1 for a periodic job (a per-minute cron job).

[screenshot: job list UI showing a running count of -1]

Reproduction steps

Create a cron job that runs every minute:

[screenshot: the per-minute cron job definition]
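The report references a screenshot rather than a job file; a hypothetical minimal per-minute periodic job might look like the sketch below (the job name, task names, datacenter, and driver are all placeholder assumptions, not taken from the report):

```hcl
# Minimal reproduction sketch: a batch job driven by a per-minute cron schedule.
job "per-minute-cron" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "* * * * *" # fire every minute
    prohibit_overlap = true        # don't start a new run while one is active
  }

  group "example" {
    task "echo" {
      driver = "exec"
      config {
        command = "/bin/echo"
        args    = ["hello"]
      }
    }
  }
}
```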

Expected Result

The running count should be >= 0.

Actual Result

Sometimes the running count is -1.

Job file (if appropriate)

Can't add the specific file, but I will try to add a minimal version of it soon.

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

@tgross
Member

tgross commented Jul 25, 2022

Hi @shantanugadgil! This is likely another case of the counting issues we have in "job summaries". Basically, the counts of job status are tracked as a separate object in Nomad from the job itself (to reduce the volume of Raft replication required), but there's definitely a concurrency bug in the way these counts are updated. nomad system reconcile summaries will probably fix the count for you, but it's somewhat expensive to run, which is why we don't use that logic internally everywhere.

Some other issues that look related to this one: #13519 #10338 #10222 #4731. I'm going to mark this for roadmapping and we'll see about getting some folks to dig into the underlying problem.

@shantanugadgil
Contributor Author

FWIW, I have a "cleaner" job which runs every hour; it executes nomad system gc and nomad system reconcile summaries.
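A job like the one described could be sketched as follows; this is an assumption-laden example, not the commenter's actual job file (the job name, the raw_exec driver, and the datacenter are placeholders, and raw_exec must be enabled on the clients):

```hcl
# Hourly "cleaner" job: garbage-collects and reconciles job summaries.
job "cleaner" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "0 * * * *" # top of every hour
    prohibit_overlap = true
  }

  group "cleanup" {
    task "reconcile" {
      driver = "raw_exec" # assumes raw_exec is enabled and nomad is on PATH
      config {
        command = "/bin/sh"
        args    = ["-c", "nomad system gc && nomad system reconcile summaries"]
      }
    }
  }
}
```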

@shantanugadgil
Contributor Author

I haven't noticed this for quite some time now (using version 1.6.1 as of now). Was this fixed?

@wizpresso-steve-cy-fan

> I haven't noticed this for quite some time now (using version 1.6.1 as of now). Was this fixed?

It was not. This happens frequently if you have a long-running job with a constraint (for example, it occupies a specific port).

The next batch job then starts at the right interval, but can't find a node satisfying the constraint.

And when the long-running job finally stops (either naturally or through a force stop), the running count becomes -1.

nomad system reconcile summaries did work, so a nice workaround is to run it periodically, for example every hour.

Projects
Status: Needs Roadmapping

3 participants