slow client GC prevents new allocs from running (periodic dispatch) #19917

Open
aneustroev opened this issue Feb 8, 2024 · 13 comments · May be fixed by #25123
Assignees: mismithhisler
Labels: hcc/jira, stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/gc, theme/periodic, type/bug

Comments

@aneustroev (Contributor)

Nomad version

Nomad v1.7.4

Operating system and Environment details

Ubuntu

Issue

When I run periodic jobs, the children of the periodic job are not cleaned up by GC until the allocation count reaches gc_max_allocs, and then all new jobs get stuck in the pending state, waiting for the children to be cleaned up.

Also, I can't set gc_max_allocs to a very large value, because each alloc creates two mounts, and with more than ~30k mounts the OS becomes unstable.

Reproduction steps

1. Create a lot of high-frequency periodic jobs (>100).
2. Wait until the number of allocations exceeds gc_max_allocs.
3. Observe that all new allocations are stuck in the pending state.

Expected Result

1. GC removes old allocs based on a TTL,
or
2. GC unmounts the secrets and private mounts based on a TTL.

Actual Result

1. When gc_max_allocs is small, new allocs are created with a delay or never created at all.
2. When gc_max_allocs is large, the OS becomes unstable over time.

Job file (if appropriate)

job "hello-world" {
  # Specifies the datacenter where this job should be run
  # This can be omitted and it will default to ["*"]
  datacenters = ["*"]
  
  type = "batch"
  
  periodic {
    cron             = "*/1 * * * * *"
    prohibit_overlap = true
  }

  # A group defines a series of tasks that should be co-located
  # on the same client (host). All tasks within a group will be
  # placed on the same host.
  group "servers" {

    # Specifies the number of instances of this group that should be running.
    # Use this to scale or parallelize your job.
    # This can be omitted and it will default to 1.
    count = 1

    # Tasks are individual units of work that are run by Nomad.
    task "test" {
      # This task runs a short-lived shell command inside a Docker container
      driver = "docker"

      config {
        image   = "alpine:latest"
        command = "/bin/sh"
        # Pass the command to `sh -c` as a single string so the echo actually
        # prints the value (separate args would become positional parameters).
        args    = ["-c", "echo 123423543"]
      }


      # Specify the maximum resources required to run the task
      resources {
        cpu    = 50
        memory = 64
      }
    }
  }
}

Nomad Server logs (if appropriate)

All is good.

Nomad Client logs (if appropriate)

Also, I saw strange log lines for dead allocations:

Feb 08 05:04:44 php-2-eu1.adsrv.wtf nomad[2626928]:     2024-02-08T05:04:44.540-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=06aeb094-0272-1940-fde9-c3a841a50b00 task=php-ai-common-delayed-processing-queue-execute-rabbit-task type=Killing msg="Sent interrupt. Waiting 10s before force killing" failed=false

Why is it killing the task if the allocation is already dead?

@lgfa29 (Contributor) commented Feb 8, 2024

Hi @aneustroev 👋

I'm not sure I fully understood the problem. Is the Nomad GC preventing new allocations from being placed until it completes?

Would you be able to share the output of the status for the periodic job and the allocations when they get into this state?

Thank you!

@aneustroev (Contributor, Author)

> Is the Nomad GC preventing new allocations from being placed until it completes?

Yes.

[image attachment]

Why does Nomad take so long to remove old allocations?

@aneustroev (Contributor, Author)

Sometimes tasks hang in the queue for hours.

@aneustroev (Contributor, Author)

[image attachment: Untitled]

@lgfa29 (Contributor) commented Feb 9, 2024

Does the task have any events? And have you noticed whether this happens on a particular client, or are all allocs pending regardless of where they are scheduled?

Would you be able to share logs from the Nomad client when the problem happens? You can email them to [email protected] with the issue number in the subject.

@aneustroev (Contributor, Author) commented Feb 9, 2024

It doesn't depend on the client. When it happens, I see the following log messages:

nomad[3402844]:[INFO]  client.gc: garbage collecting allocation: alloc_id=1c459fd2-c0e8-d225-5781-96a33c10672d reason="new allocations and over max (200)"
nomad[3402844]:[INFO]  client.alloc_runner.task_runner: Task event: alloc_id=1c459fd2-c0e8-d225-5781-96a33c10672d task=<taskname> type=Killing msg="Sent interrupt. Waiting 10s before force killing" failed=false
nomad[3402844]:[INFO]  client.gc: marking allocation for GC: alloc_id=1c459fd2-c0e8-d225-5781-96a33c10672d

There are no other WARN or ERROR messages.

@aneustroev (Contributor, Author)

As long as the number of allocs stays below gc_max_allocs, everything works as expected.
Maybe the GC is too slow? Or it doesn't clean up allocs as expected.

Server GC settings:

node_gc_threshold = "24h"
job_gc_interval = "5m"
job_gc_threshold = "10m"
eval_gc_threshold = "10m"
batch_eval_gc_threshold = "10m"

Client GC settings:

gc_interval = "1m"
gc_max_allocs = 200
gc_parallel_destroys = 10
gc_inode_usage_threshold = 80
gc_disk_usage_threshold = 90
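
For reference, a minimal sketch of how these settings would sit in the Nomad agent configuration, assuming separate server and client agents (the file names are illustrative; values copied from above):

# server.hcl -- Nomad server agent
server {
  enabled                 = true
  node_gc_threshold       = "24h"
  job_gc_interval         = "5m"
  job_gc_threshold        = "10m"
  eval_gc_threshold       = "10m"
  batch_eval_gc_threshold = "10m"
}

# client.hcl -- Nomad client agent
client {
  enabled                  = true
  gc_interval              = "1m"
  gc_max_allocs            = 200
  gc_parallel_destroys     = 10
  gc_inode_usage_threshold = 80
  gc_disk_usage_threshold  = 90
}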

@lgfa29 (Contributor) commented Feb 13, 2024

I believe Nomad's client GC is non-blocking, so it shouldn't impact new allocations.

Would you be able to drop the client log level to TRACE and share the client logs with us when this problem happens again?

@aneustroev (Contributor, Author)

2024-02-19T03:16:56.261-0500 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=a21ae6da-a252-2f55-a52f-62669c90ca4e task=<taskname>
2024-02-19T03:16:56.261-0500 [TRACE] client.alloc_runner.task_runner: Kill event: alloc_id=a21ae6da-a252-2f55-a52f-62669c90ca4e task=<taskname> event_type=Killing event_reason=""
2024-02-19T03:16:56.261-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=a21ae6da-a252-2f55-a52f-62669c90ca4e task=<taskname> type=Killing msg="Sent interrupt. Waiting 10s before force killing" failed=false
2024-02-19T03:16:56.262-0500 [INFO]  client.gc: marking allocation for GC: alloc_id=a21ae6da-a252-2f55-a52f-62669c90ca4e
2024-02-19T03:16:56.264-0500 [DEBUG] client.gc: alloc garbage collected: alloc_id=a21ae6da-a252-2f55-a52f-62669c90ca4e
2024-02-19T03:16:56.264-0500 [INFO]  client.gc: garbage collecting allocation: alloc_id=9d828c6a-9df0-50ce-dce0-a6fdcc5b6018 reason="new allocations and over max (100)"
2024-02-19T03:16:56.264-0500 [DEBUG] client.alloc_runner.runner_hook.group_services: delay before killing tasks: alloc_id=9d828c6a-9df0-50ce-dce0-a6fdcc5b6018 group=<taskname> shutdown_delay=30s
2024-02-19T03:17:26.266-0500 [TRACE] client.alloc_runner.task_runner: Kill requested: alloc_id=9d828c6a-9df0-50ce-dce0-a6fdcc5b6018 task=<taskname>
2024-02-19T03:17:26.266-0500 [TRACE] client.alloc_runner.task_runner: Kill event: alloc_id=9d828c6a-9df0-50ce-dce0-a6fdcc5b6018 task=<taskname> event_type=Killing event_reason=""
2024-02-19T03:17:26.266-0500 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=9d828c6a-9df0-50ce-dce0-a6fdcc5b6018 task=<taskname> type=Killing msg="Sent interrupt. Waiting 10s before force killing" failed=false
2024-02-19T03:17:26.267-0500 [INFO]  client.gc: marking allocation for GC: alloc_id=9d828c6a-9df0-50ce-dce0-a6fdcc5b6018
2024-02-19T03:17:26.269-0500 [DEBUG] client.gc: alloc garbage collected: alloc_id=9d828c6a-9df0-50ce-dce0-a6fdcc5b6018
2024-02-19T03:17:26.269-0500 [INFO]  client.gc: garbage collecting allocation: alloc_id=27225c79-23f9-8bbd-ec89-6015c77d2bbe reason="new allocations and over max (100)"
2024-02-19T03:17:26.269-0500 [DEBUG] client.alloc_runner.runner_hook.group_services: delay before killing tasks: alloc_id=27225c79-23f9-8bbd-ec89-6015c77d2bbe group=<taskname> shutdown_delay=30s

Why does this happen?

client.alloc_runner.runner_hook.group_services: delay before killing tasks: alloc_id=27225c79-23f9-8bbd-ec89-6015c77d2bbe group=<taskname> shutdown_delay=30s

@cl-bvl commented Feb 27, 2024

Some details here #7787 (comment)

@aneustroev (Contributor, Author) commented Apr 3, 2024

Lowering shutdown_delay is a solution.
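
For anyone hitting the same problem, a minimal sketch of that workaround based on the job file above, assuming the affected groups had raised the group-level shutdown_delay to the 30s seen in the client logs; keeping it at the default "0s" (or any short value) means the client GC's group_services hook no longer pauses 30 seconds for every completed allocation it destroys:

job "hello-world" {
  datacenters = ["*"]
  type        = "batch"

  periodic {
    cron             = "*/1 * * * * *"
    prohibit_overlap = true
  }

  group "servers" {
    # The group-level shutdown_delay defaults to "0s". Leaving it there (or
    # lowering it from 30s) removes the per-allocation "delay before killing
    # tasks" pause that serializes client GC in the logs above.
    shutdown_delay = "0s"

    task "test" {
      driver = "docker"

      config {
        image   = "alpine:latest"
        command = "/bin/sh"
        args    = ["-c", "echo 123423543"]
      }
    }
  }
}

shutdown_delay mainly exists to give service deregistration time to propagate before tasks receive the kill signal, so it is rarely needed for short-lived batch jobs such as periodic dispatch children.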

@tgross changed the title from "GC for periodic tasks" to "slow client GC prevents new allocs from running (periodic dispatch)" on Jun 24, 2024
@tgross added the stage/accepted and hcc/jira labels and removed the stage/waiting-reply label on Jun 24, 2024
@tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage on Jun 24, 2024
@tgross (Member) commented Jun 24, 2024

Sorry about the delays in returning to this (and the seemingly related #7787). I'm going to mark this as accepted and for roadmapping.

@tgross (Member) commented Feb 11, 2025

@mismithhisler self-assigned this on Feb 14, 2025
@mismithhisler linked pull request #25123 on Feb 14, 2025 that will close this issue