
sometimes I see -1 running number of jobs for periodic jobs in the UI #13897

Open
shantanugadgil opened this issue Jul 22, 2022 · 4 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/job-summary type/bug

Comments

@shantanugadgil
Contributor

Nomad version

Output from nomad version
Nomad v1.3.2 (bf60297)

Operating system and Environment details

Amazon Linux 2

Issue

Sometimes I notice that the UI shows a running count of -1 for a periodic job (a per-minute cron job).

[screenshot: job list UI showing a running count of -1]

Reproduction steps

Create a cron job that runs every minute:

[screenshot: the per-minute cron job definition]
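The report references a screenshot rather than a job file; a hypothetical minimal per-minute periodic job might look like the sketch below (the job name, task names, datacenter, and driver are all placeholder assumptions, not taken from the report):

```hcl
# Minimal reproduction sketch: a batch job driven by a per-minute cron schedule.
job "per-minute-cron" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "* * * * *" # fire every minute
    prohibit_overlap = true        # don't start a new run while one is active
  }

  group "example" {
    task "echo" {
      driver = "exec"
      config {
        command = "/bin/echo"
        args    = ["hello"]
      }
    }
  }
}
```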

Expected Result

The running count should be >= 0.

Actual Result

Sometimes the running count is -1.

Job file (if appropriate)

Can't add the specific file, but I will try to add a minimal version of it soon.

Nomad Server logs (if appropriate)

N/A

Nomad Client logs (if appropriate)

N/A

@tgross
Member

tgross commented Jul 25, 2022

Hi @shantanugadgil! This is likely another case of the counting issues we have in "job summaries". Basically, the counts of job status are tracked as a separate object in Nomad from the job itself (to reduce the volume of Raft replication required), but there's definitely a concurrency bug in the way these counts are updated. nomad system reconcile summaries will probably fix the count for you, but it's somewhat expensive to run, which is why we don't use that logic internally everywhere.

Some other issues that look related to this one: #13519 #10338 #10222 #4731. I'm going to mark this for roadmapping and we'll see about getting some folks to dig into the underlying problem.

@shantanugadgil
Contributor Author

FWIW, I have a "cleaner" job which runs every hour; it executes nomad system gc and nomad system reconcile summaries.
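A job like the one described could be sketched as follows; this is an assumption-laden example, not the commenter's actual job file (the job name, the raw_exec driver, and the datacenter are placeholders, and raw_exec must be enabled on the clients):

```hcl
# Hourly "cleaner" job: garbage-collects and reconciles job summaries.
job "cleaner" {
  datacenters = ["dc1"]
  type        = "batch"

  periodic {
    cron             = "0 * * * *" # top of every hour
    prohibit_overlap = true
  }

  group "cleanup" {
    task "reconcile" {
      driver = "raw_exec" # assumes raw_exec is enabled and nomad is on PATH
      config {
        command = "/bin/sh"
        args    = ["-c", "nomad system gc && nomad system reconcile summaries"]
      }
    }
  }
}
```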

@shantanugadgil
Contributor Author

I haven't noticed this for quite some time now (using version 1.6.1 as of now). Was this fixed?

@wizpresso-steve-cy-fan

> I haven't noticed this for quite some time now (using version 1.6.1 as of now). Was this fixed?

It was not. This happens frequently if you have a long-running job with a constraint (for example, it occupies a specific port).

The next batch job then starts at the right interval, but can't find a node satisfying the constraint.

And when the long-running job finally stops (either naturally or through a force stop), the running count becomes -1.

nomad system reconcile summaries did work, so a nice workaround is to run it periodically, for example every hour.

Projects
Status: Needs Roadmapping

3 participants