Improper periodic job launch #3703

Closed

tantra35 opened this issue Dec 29, 2017 · 4 comments

Comments

@tantra35
Contributor

tantra35 commented Dec 29, 2017

Nomad version

Nomad v0.5.6

Operating system and Environment details

Ubuntu 16.04 on an AWS instance

Issue

After some time, Nomad begins launching periodic jobs improperly (in our case this has already happened twice, so there is no reason to think it won't happen again).

We have the following job definition, which is launched periodically (once a day at 00:00 UTC, i.e. 03:00 MSK) on 9 instances:

job "S3apiCacheCron"
{
    datacenters = ["aws"]
    type = "batch"
    priority = 50

    constraint
    {
        attribute = "${attr.kernel.name}"
        value = "linux"
    }

    constraint
    {
        distinct_hosts = true
    }

    constraint
    {
        attribute = "${node.class}"
        value = "s3apicache"
    }

    periodic
    {
        cron             = "@daily"
        prohibit_overlap = true
    }

    ..............................

}

And from time to time Nomad launches an additional run of this job on only one instance. This strange job looks like this:

root@ip-172-30-0-53:/home/ruslan# nomad status S3apiCacheCron/periodic-1514505600
ID            = S3apiCacheCron/periodic-1514505600
Name          = S3apiCacheCron/periodic-1514505600
Type          = batch
Priority      = 50
Datacenters   = aws
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group      Queued  Starting  Running  Failed  Complete  Lost
S3apiCacheCron  0       0         1        0       10        0

Allocations
ID        Eval ID   Node ID   Task Group      Desired  Status    Created At
1d6de603  52131d0c  ec27373a  S3apiCacheCron  run      running   12/29/17 11:04:28 MSK
06416e61  c2b1151a  ec2e17ab  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
06e8787d  c2b1151a  ec2b1c25  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
476e2217  c2b1151a  ec2781fd  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
5ab2eedd  c2b1151a  ec256857  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
6e777db2  c2b1151a  ec231d4e  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
77b849c4  c2b1151a  ec2d11e7  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
89dfd89c  c2b1151a  ec2101ba  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK
a3e58d61  c2b1151a  ec23d159  S3apiCacheCron  stop     complete  12/29/17 03:00:00 MSK
ee799cf1  c2b1151a  ec27373a  S3apiCacheCron  run      complete  12/29/17 03:00:00 MSK

To illustrate the problem more clearly, here is a screenshot from our monitoring:

[Monitoring screenshot: img-2017-12-29-14-08-18]

As you can see, only one instance is launched, and this job never stops because it waits for all 9 instances to complete, which can never happen since only one was launched. There are no messages in the server logs that clarify this behavior.

We are thinking about upgrading our production to 0.7.1 (it seems that GH-3201 solves this issue, but I'm not sure), but #3604 stops us from taking this step, because we have many autoscaled jobs.
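
For anyone investigating something similar, here is a minimal sketch of how the unexpected allocation could be traced, using the eval and alloc IDs from the status output above (the hyphenated command names follow the 0.5-era CLI; newer releases also accept nomad eval status / nomad alloc status):

# Inspect the evaluation that created the unexpected allocation
nomad eval-status 52131d0c

# Inspect the unexpected allocation itself
nomad alloc-status 1d6de603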

tantra35 changed the title from "Improper periodic launch" to "Improper periodic job launch" on Dec 30, 2017
@dadgar
Contributor

dadgar commented Jan 3, 2018

@tantra35 Do you have:

  1. The server logs?
  2. The output of /v1/job/S3apiCacheCron/periodic-1514505600/allocations?

What happened that caused this allocation to be marked as stopped:

a3e58d61  c2b1151a  ec23d159  S3apiCacheCron  stop     complete  12/29/17 03:00:00 MSK

It is possible that the node it was running on was drained/lost and then there wasn't enough capacity to run the last allocation till later.
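
For reference, a minimal sketch of how the second item could be collected, assuming the default Nomad API address of http://127.0.0.1:4646 (the job ID goes into the path verbatim, as in the endpoint quoted above; python is only used here for pretty-printing):

# Fetch the allocations of the periodic child job via the HTTP API
curl -s http://127.0.0.1:4646/v1/job/S3apiCacheCron/periodic-1514505600/allocations | python -m json.tool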

@tantra35
Contributor Author

tantra35 commented Jan 3, 2018

@dadgar We didn't find anything suspicious in the server logs at that moment, and there were absolutely no hardware or network failures.

As shown in the screenshot above, the job fully completed at about 06:40 MSK; then at 11:00 MSK an absolutely wrong launch of the job began (it must not happen at that time).

To clarify the situation, let me explain: the job launches every day at 03:00 MSK and runs for about 3 hours 30 minutes (this is the first rectangle on the screenshot; it's normal behavior and we can see that it finished completely). The second rectangle must not appear, because no scheduling must be triggered for this job at that time.

@stale

stale bot commented May 10, 2019

This issue will be auto-closed because there hasn't been any activity for a few months. Feel free to open a new one if you still experience this problem 👍

@stale stale bot closed this as completed May 10, 2019
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 23, 2022