Improper periodic job launch #3703

Closed

tantra35 opened this issue Dec 29, 2017 · 4 comments

Comments

@tantra35
Contributor

tantra35 commented Dec 29, 2017

Nomad version

Nomad v0.5.6

Operating system and Environment details

Ubuntu 16.04 on an AWS instance

Issue

After some time, Nomad begins launching periodic jobs improperly (in our case this has already happened twice, so there is no reason to think it won't happen again).

We have the following job definition, which is launched periodically (once a day at 00:00 UTC, i.e. 03:00 MSK) on 9 instances:

job "S3apiCacheCron"
{
    datacenters = ["aws"]
    type = "batch"
    priority = 50

    constraint
    {
        attribute = "${attr.kernel.name}"
        value = "linux"
    }

    constraint
    {
        distinct_hosts = true
    }

    constraint
    {
        attribute = "${node.class}"
        value = "s3apicache"
    }

    periodic
    {
        cron             = "@daily"
        prohibit_overlap = true
    }

    ..............................

}

And from time to time Nomad launches an additional run of this job on only one instance. This strange job looks like this:

root@ip-172-30-0-53:/home/ruslan# nomad status S3apiCacheCron/periodic-1514505600
ID            = S3apiCacheCron/periodic-1514505600
Name          = S3apiCacheCron/periodic-1514505600
Type          = batch
Priority      = 50
Datacenters   = aws
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
S3apiCacheCron  0       0         1        0       10        0

Allocations
ID        Eval ID   Node ID   Task Group      Desired  Status    Created At
1d6de603  52131d0c  ec27373a  S3apiCacheCron  run      running   12/29/17 11:04:28 MSK
06416e61  c2b1151a  ec2e17ab  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
06e8787d  c2b1151a  ec2b1c25  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
476e2217  c2b1151a  ec2781fd  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
5ab2eedd  c2b1151a  ec256857  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
6e777db2  c2b1151a  ec231d4e  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
77b849c4  c2b1151a  ec2d11e7  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
89dfd89c  c2b1151a  ec2101ba  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
a3e58d61  c2b1151a  ec23d159  S3apiCacheCron  stop     complete  12/29/17 03:00:00 MSK
ee799cf1  c2b1151a  ec27373a  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK

To illustrate the problem more clearly, here is a screenshot from our monitoring:

[Monitoring screenshot: img-2017-12-29-14-08-18]

As you can see, only one instance is launched, and this job never stops because it waits for all 9 instances to complete, which can never happen since only one was launched. There are no messages in the server logs that clarify this behavior.

We are thinking about upgrading our production to 0.7.1 (it seems that GH-3201 solves this issue, but I'm not sure), but #3604 stops us from taking this step, because we have many autoscaled jobs.
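
For anyone investigating something similar, here is a minimal sketch of how the unexpected allocation could be traced, using the eval and alloc IDs from the status output above (the hyphenated command names follow the 0.5-era CLI; newer releases also accept nomad eval status / nomad alloc status):

# Inspect the evaluation that created the unexpected allocation
nomad eval-status 52131d0c

# Inspect the unexpected allocation itself
nomad alloc-status 1d6de603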

tantra35 changed the title from "Improper periodic launch" to "Improper periodic job launch" on Dec 30, 2017
@dadgar
Contributor

dadgar commented Jan 3, 2018

@tantra35 Do you have:

  1. The server logs?
  2. The output of /v1/job/S3apiCacheCron/periodic-1514505600/allocations?

What happened that caused this allocation to be marked as stopped:

a3e58d61  c2b1151a  ec23d159  S3apiCacheCron  stop     complete  12/29/17 03:00:00 MSK

It is possible that the node it was running on was drained/lost and then there wasn't enough capacity to run the last allocation till later.
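
For reference, a minimal sketch of how the second item could be collected, assuming the default Nomad API address of http://127.0.0.1:4646 (the job ID goes into the path verbatim, as in the endpoint quoted above; python is only used here for pretty-printing):

# Fetch the allocations of the periodic child job via the HTTP API
curl -s http://127.0.0.1:4646/v1/job/S3apiCacheCron/periodic-1514505600/allocations | python -m json.tool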

@tantra35
Contributor Author

tantra35 commented Jan 3, 2018

@dadgar We didn't find anything suspicious in the server logs at that moment, and there were absolutely no hardware or network failures.

As shown in the screenshot above, the job fully completed at about 06:40 MSK; then at 11:00 MSK an absolutely wrong launch of the job began (it must not happen at that time).

To clarify the situation, let me explain: the job launches every day at 03:00 MSK and runs for about 3 hours 30 minutes (this is the first rectangle on the screenshot; it's normal behavior and we can see that it finished completely). The second rectangle must not appear, because no scheduling must be triggered for this job at that time.

@stale

stale bot commented May 10, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed May 10, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022