Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad do not clean old periodic tasks #10456

Open
TheSpbra1n opened this issue Apr 27, 2021 · 13 comments
Open

Nomad do not clean old periodic tasks #10456

TheSpbra1n opened this issue Apr 27, 2021 · 13 comments
Labels

Comments

@TheSpbra1n
Copy link

Nomad version

Nomad v1.0.2 (4c1d4fc)

Operating system and Environment details

Codename: xenial

Issue

I set in config: JobGCThreshold=48h

root@nomad-01:/home/user# curl -s  'http://localhost:4646/v1/agent/self' | jq . | grep 'JobGCThreshold'
      "JobGCThreshold": "48h"

But in "nomad status" I see a task that ended 6 days ago

root@nomad-01:/home/user# nomad alloc status 118cf459-8aa2-fbf2-6e61-77ed211cb531
ID                  = 118cf459-8aa2-fbf2-6e61-77ed211cb531
Eval ID             = 7ee96439
Name                = correct_suspended_tickets/periodic-1619002800.correct_suspended_tickets[0]
Node ID             = 263fdfd7
Node Name           = docker-15
Job ID              = correct_suspended_tickets/periodic-1619002800
Job Version         = 9
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 6d6h ago
Modified            = 2h42m ago

Task "correct_suspended_tickets" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  512 MiB  300 MiB

Task Events:
Started At     = 2021-04-21T11:01:42Z
Finished At    = 2021-04-21T11:01:44Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-04-27T17:35:21+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-27T17:35:18+03:00  Received    Task received by client
2021-04-26T21:20:15+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-26T21:20:12+03:00  Received    Task received by client
2021-04-26T01:04:06+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-26T01:04:04+03:00  Received    Task received by client
2021-04-24T08:20:08+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-24T08:20:00+03:00  Received    Task received by client
2021-04-22T18:38:15+03:00  Killing     Sent interrupt. Waiting 5s before force killing
2021-04-21T14:01:44+03:00  Terminated  Exit Code: 0

Nomad Client logs (if appropriate)

In the clients logs, I see that the GC was running, but the allocation is actually still present:

root@docker-15:/home/user# journalctl -u nomad --since "12:00" | grep 118cf459
Apr 27 17:35:18 docker-15 nomad[2838]:     2021-04-27T17:35:18.624+0300 [INFO]  client.gc: marking allocation for GC: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
Apr 27 17:35:21 docker-15 nomad[2838]:     2021-04-27T17:35:21.710+0300 [INFO]  client.gc: garbage collecting allocation: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531 reason="number of allocations (251) is over the limit (200)"
Apr 27 17:35:21 docker-15 nomad[2838]:     2021-04-27T17:35:21.716+0300 [INFO]  client.gc: marking allocation for GC: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
Apr 27 17:35:21 docker-15 nomad[2838]:     2021-04-27T17:35:21.721+0300 [DEBUG] client.gc: alloc garbage collected: alloc_id=118cf459-8aa2-fbf2-6e61-77ed211cb531
root@docker-15:/home/user# date
Tue Apr 27 20:41:41 MSK 2021
root@docker-15:/home/user# nomad alloc status 118cf459
ID                  = 118cf459-8aa2-fbf2-6e61-77ed211cb531
Eval ID             = 7ee96439
...
@tgross
Copy link
Member

tgross commented Apr 27, 2021

Hi @TheSpbra1n! This one is tricky!

The log message you're seeing is about GC on the client node, where we clean up the allocation directory on the client. The nomad alloc status is querying the server's view of the world. That JobGCThreshold config value controls that server state.

But what I'm seeing in your nomad alloc status output is that this periodic job is getting placed every few hours, and that it has Desired Status = run. So it looks like the server expects it to be running, which is why it's not GC'ing it. Can you share a redacted version of the job spec? That might help us narrow down the behavior you're seeing.

@tgross tgross added theme/batch Issues related to batch jobs and scheduling stage/waiting-reply labels Apr 27, 2021
@tgross tgross self-assigned this Apr 27, 2021
@TheSpbra1n
Copy link
Author

Thanks for the answer.
SPEC:

# job name
job "correct_suspended_tickets" {
  region      = "global"
  datacenters = ["dc1", "dc2"]

  type = "batch"

  periodic {
    cron      = "00,30 * * * * *"
    time_zone = "Europe/Moscow"
	prohibit_overlap = true
  }

  constraint {
    attribute = "${node.class}"
    value     = "nomad1"
  }

  meta {
    # Application version
    APP_VERSION       = "0.0.1"
    START_SCRIPT = "xxx"
  }

  group "correct_suspended_tickets" {
    count = 1

    task "correct_suspended_tickets" {
      kill_signal = "SIGTERM"
      driver = "docker"

      config {
        image        = "docker-registry/container:${NOMAD_META_APP_VERSION}"
        advertise_ipv6_address = true
        network_mode = "bridge"
        command = "${NOMAD_META_START_SCRIPT}"
      }

      resources {
        memory = 512 # MB
      }
    }
  }
}

@tgross
Copy link
Member

tgross commented Apr 28, 2021

Thanks for the jobspec @TheSpbra1n. That spec looks fine to me, which leads me back to wondering why the Desired Status = run still. The code that does job GC has some special casing around the status for periodic jobs (ref). We did fix a bug in the setting of status for periodic jobs in #10145 and I'm wondering if that could be at play.

If you do a nomad job status correct_suspended_tickets, does the Children Job Summary section have the correct counts? If not, you may have run into the bug described in #10145. That patch will ship in the upcoming Nomad 1.1.0 beta (very soon), and we're planning on backporting it to Nomad 1.0.5 as well.

@TheSpbra1n
Copy link
Author

After nomad system gc, i see:

root@nomad-01:/home/user# nomad system gc
root@nomad-01:/home/user# nomad job status correct_suspended_tickets
ID                       = correct_suspended_tickets
Name                 = correct_suspended_tickets
Submit Date      = 2021-04-05T19:17:04+03:00
Type                   = batch
Priority               = 50
Datacenters      = dc1,dc2
Namespace       = default
Status                = running
Periodic             = true
Parameterized  = false
Next Periodic Launch = 2021-04-29T20:30:00+03:00 (4m16s from now)

Children Job Summary
Pending  Running  Dead
4328     0        29485

No instances of periodic job found
root@nomad-01:/home/user#

I think that "pending" count have a strange value.

@TheSpbra1n
Copy link
Author

Will waiting 1.0.5, thanks!

@tgross tgross added stage/needs-verification Issue needs verifying it still exists and removed stage/waiting-reply labels Apr 29, 2021
@tgross
Copy link
Member

tgross commented Apr 29, 2021

Thanks for that job status output @TheSpbra1n. That definitely looks suspiciously like the #10145 bug. I'm going to mark this issue for verifying once 1.0.5 ships. Thanks again!

@tgross
Copy link
Member

tgross commented Jun 3, 2021

Hi @TheSpbra1n! Have you had a chance to see if 1.0.5+ fixes the issue for you?

@tgross
Copy link
Member

tgross commented Jul 7, 2021

Doing some issue cleanup and I'm going to close this one out. If you do get a chance to test with Nomad 1.0.5+ or later and it doesn't fix the problem, please feel free to reopen.

@TheSpbra1n
Copy link
Author

Thanks!
We have planned an update, but haven't done it yet

@TheSpbra1n
Copy link
Author

Hi @tgross! We upgraded nomad to 1.1.2

# nomad version
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)

Вut the problem remains:

# curl -s  'http://localhost:4646/v1/agent/self' | jq . | grep 'JobGCThreshold'
      "JobGCThreshold": "48h",
# nomad status correct_suspended_tickets/periodic-1627318800
ID            = correct_suspended_tickets/periodic-1627318800
Name          = correct_suspended_tickets/periodic-1627318800
Submit Date   = 2021-07-26T20:00:00+03:00
Type          = batch
Priority      = 50
Datacenters   = dc1,dc2
Namespace     = default
Status        = dead
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
correct_suspended_tickets  0       0         0        0       1         0

Allocations
ID        Node ID   Task Group                 Version  Desired  Status    Created    Modified
08a5c3d8  32e62217  correct_suspended_tickets  9        run      complete  2d22h ago  1d4h ago

Alloc status:

# nomad alloc status 08a5c3d8
ID                  = 08a5c3d8-f25d-0b56-7b67-03212862552c
Eval ID             = 438f32ce
Name                = correct_suspended_tickets/periodic-1627318800.correct_suspended_tickets[0]
Node ID             = 32e62217
Node Name           = docker-02
Job ID              = correct_suspended_tickets/periodic-1627318800
Job Version         = 9
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 2d22h ago
Modified            = 1d4h ago

Task "correct_suspended_tickets" is "dead"
Task Resources
CPU      Memory   Disk     Addresses
100 MHz  512 MiB  300 MiB

Task Events:
Started At     = 2021-07-26T17:01:16Z
Finished At    = 2021-07-26T17:01:18Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                       Type        Description
2021-07-26T20:01:18+03:00  Terminated  Exit Code: 0
2021-07-26T20:01:16+03:00  Started     Task started by client
2021-07-26T20:00:01+03:00  Driver      Downloading image
2021-07-26T20:00:01+03:00  Task Setup  Building Task Directory
2021-07-26T20:00:01+03:00  Received    Task received by client
#

Children job count:

# nomad status correct_suspended_tickets | grep -A3 'Children Job Summary'
Children Job Summary
Pending  Running  Dead
8325     -3       33830

@TheSpbra1n
Copy link
Author

#11017

@lgfa29
Copy link
Contributor

lgfa29 commented Aug 19, 2021

Re-opening since the problem persists in v1.1.2

@lgfa29 lgfa29 reopened this Aug 19, 2021
@lgfa29 lgfa29 added stage/needs-investigation and removed stage/needs-verification Issue needs verifying it still exists labels Aug 19, 2021
@tgross tgross removed their assignment Nov 3, 2021
@kemko
Copy link

kemko commented Nov 28, 2022

This issue still seems reproducible in v1.4.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: Needs Roadmapping
Development

No branches or pull requests

4 participants