nomad 1.3.0-rc.1 jobs hang/wont restart. cgroups v2? #12863

Closed
badalex opened this issue May 4, 2022 · 3 comments · Fixed by #12875
Labels: stage/accepted, theme/cgroups, type/bug

Comments

badalex commented May 4, 2022

Nomad version

Nomad v1.3.0-rc.1 (31b0a18)

Operating system and Environment details

Ubuntu 22.04 Jammy Jellyfish 5.15.0-27-generic #28-Ubuntu SMP Thu Apr 14 04:55:28 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Issue

Something seems to go wrong with cgroups v2 support: if I create a job that exits immediately, Nomad stops restarting it. Nomad 1.2.3 (the last version I can use because of the plugin breakage in #12071) seems to work fine, although I plan on downgrading Nomad again to double-check.
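For anyone comparing hosts, a minimal sketch of checking whether a machine is running the unified cgroups v2 hierarchy. It relies on the `cgroup.controllers` file, which exists at the root of a cgroup v2 mount. The function name `is_cgroup_v2` and the `root` parameter are illustrative, not part of Nomad.

```python
import os


def is_cgroup_v2(root="/sys/fs/cgroup"):
    """Return True if `root` looks like the root of a unified cgroup v2 hierarchy.

    The kernel exposes a `cgroup.controllers` file at the root of a cgroup v2
    mount; a v1 (or hybrid) layout has per-controller directories there instead.
    """
    return os.path.isfile(os.path.join(root, "cgroup.controllers"))


if __name__ == "__main__":
    print("cgroups v2" if is_cgroup_v2() else "not cgroups v2 (v1/hybrid or not mounted)")
```

Running this on both the Ubuntu 22.04 host and a host where restarts behave normally should show whether the cgroups mode actually differs between them.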

Given the job file included below, the Nomad UI for the allocation shows:

May 03, '22 18:24:45 -0600 | Alloc Unhealthy | Task not running by deadline
May 03, '22 18:19:45 -0600 | Restarting | Task restarting in 1.09865699s
May 03, '22 18:19:45 -0600 | Terminated | Exit Code: 0
May 03, '22 18:19:45 -0600 | Started | Task started by client
May 03, '22 18:15:37 -0600 | Restarting | Task restarting in 1.165452572s
May 03, '22 18:15:37 -0600 | Terminated | Exit Code: 0
May 03, '22 18:15:37 -0600 | Started | Task started by client
May 03, '22 18:15:36 -0600 | Task Setup | Building Task Directory
May 03, '22 18:15:36 -0600 | Received | Task received by client

It is currently 18:26 and no other restart attempts have been made. The logmon process for the alloc is still running, but there are no processes underneath it or using the allocation dir according to lsof -n +D.
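The gap in the event log above can be spotted mechanically. The following is a rough sketch, not a Nomad tool: it parses the pipe-separated UI lines (newest first, as pasted) and flags every "Restarting" event that is not followed by a "Started" event within a grace period. The helper names and the 30-second grace value are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

# Timestamp layout used by the Nomad UI lines quoted in this issue.
EVENT_FORMAT = "%b %d, '%y %H:%M:%S %z"

SAMPLE = """
May 03, '22 18:24:45 -0600 | Alloc Unhealthy | Task not running by deadline
May 03, '22 18:19:45 -0600 | Restarting | Task restarting in 1.09865699s
May 03, '22 18:19:45 -0600 | Terminated | Exit Code: 0
May 03, '22 18:19:45 -0600 | Started | Task started by client
May 03, '22 18:15:37 -0600 | Restarting | Task restarting in 1.165452572s
May 03, '22 18:15:37 -0600 | Terminated | Exit Code: 0
May 03, '22 18:15:37 -0600 | Started | Task started by client
May 03, '22 18:15:36 -0600 | Task Setup | Building Task Directory
May 03, '22 18:15:36 -0600 | Received | Task received by client
"""


def parse_events(text):
    """Parse 'time | type | description' lines (newest first) into an oldest-first list."""
    events = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        when, kind, description = (part.strip() for part in line.split("|"))
        events.append((datetime.strptime(when, EVENT_FORMAT), kind, description))
    events.reverse()  # the UI lists the newest event first
    return events


def stalled_restarts(events, grace=timedelta(seconds=30)):
    """Return timestamps of 'Restarting' events with no later 'Started' event within `grace`."""
    stalled = []
    for i, (when, kind, _) in enumerate(events):
        if kind != "Restarting":
            continue
        followed = any(
            k == "Started" and t <= when + grace for t, k, _ in events[i + 1:]
        )
        if not followed:
            stalled.append(when)
    return stalled


if __name__ == "__main__":
    for when in stalled_restarts(parse_events(SAMPLE)):
        print("restart did not happen in time:", when.isoformat())
```

On the log above this flags both restarts: the one at 18:15:37 was only followed by a start about four minutes later, and the one at 18:19:45 was never followed by a start at all.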

If I change the constraint to an Ubuntu 20.04 host, the task restarts roughly every second, as expected.

Time | Type | Description
May 03, '22 18:32:32 -0600 | Restarting | Task restarting in 1.018849262s
May 03, '22 18:32:32 -0600 | Terminated | Exit Code: 0
May 03, '22 18:32:32 -0600 | Started | Task started by client
May 03, '22 18:32:30 -0600 | Restarting | Task restarting in 1.234701267s
May 03, '22 18:32:30 -0600 | Terminated | Exit Code: 0
May 03, '22 18:32:30 -0600 | Started | Task started by client
May 03, '22 18:32:29 -0600 | Restarting | Task restarting in 1.196971407s
May 03, '22 18:32:29 -0600 | Terminated | Exit Code: 0
May 03, '22 18:32:29 -0600 | Started | Task started by client
May 03, '22 18:32:28 -0600 | Restarting | Task restarting in 1.107809535s
... many more snipped
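The "restarting in 1.0…s" to "1.2…s" values in both logs are consistent with the configured `delay = "1s"` plus random jitter; as far as I understand the Nomad restart documentation, up to 25% jitter is added to the delay. The sketch below only illustrates that arithmetic and checks the delays quoted in this issue against it. The function names are made up for the example and this is not Nomad's actual implementation.

```python
import random

# Restart delays (in seconds) copied from the event logs in this issue.
OBSERVED = [1.09865699, 1.165452572, 1.018849262, 1.234701267, 1.196971407, 1.107809535]


def jittered_delay(base_seconds, max_jitter=0.25, rng=random):
    """Return the base delay plus a random jitter of up to `max_jitter` * base.

    The 25% default is an assumption taken from the restart block documentation.
    """
    return base_seconds * (1.0 + rng.uniform(0.0, max_jitter))


def within_jitter(observed_seconds, base_seconds, max_jitter=0.25):
    """Check that an observed delay lies between the base delay and base plus maximum jitter."""
    return base_seconds <= observed_seconds <= base_seconds * (1.0 + max_jitter)


if __name__ == "__main__":
    print("sample jittered delay:", jittered_delay(1.0))
    print("all observed delays plausible:", all(within_jitter(d, 1.0) for d in OBSERVED))
```

So the scheduled delays themselves look normal on both hosts; the difference is that on the 22.04 host the restart does not actually happen once the delay elapses.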

Other errors I have seen but have not been able to reproduce reliably:

[ERROR] client.cpuset.v2: failed to set cgroup: path=/sys/fs/cgroup/nomad.slice/eb0be10c-0359-00cc-915b-c8ecae499c19.run.scope err="openat2 /sys/fs/cgroup/nomad.slice/eb0be10c-0359-00cc-915b-c8ecae499c19.run.scope/cpuset.cpus: no such file or directory"
[ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=4db29e19-c6d2-832b-b30e-a1b6f7f62d53 task=run error="failed to launch command with executor: rpc error: code = Unknown desc = failed to set v2 cgroup resources: failed to call BPF_PROG_DETACH (BPF_CGROUP_DEVICE) on old filter program: can't detach program: no such file or directory"

Also, this might be a bug in the job, but /dev/null seems to disappear. Edit: this happens sometimes, for some jobs, but not all the time; it is how I noticed that restarts were not happening. I'm still working to nail this down while debugging this issue, but it feels like it might be related. The affected jobs are raw_exec jobs that create their own restricted mount namespace, which includes a writable /dev/null. They seem to work fine on Nomad 1.2.3 on the same host.

PermissionError: [Errno 1] Operation not permitted: '/dev/null'
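To narrow down whether /dev/null is missing, replaced by something else, or present but blocked, a small diagnostic like the following could be run from inside the task. It is only a sketch: the function name and report keys are invented here, and the expected device numbers (major 1, minor 3) are the standard Linux values for /dev/null.

```python
import os
import stat


def inspect_device(path="/dev/null", expected=(1, 3)):
    """Report whether `path` exists, is the expected character device, and opens for writing."""
    report = {
        "exists": False,
        "is_char_device": False,
        "numbers_match": False,
        "writable": False,
        "error": None,
    }
    try:
        st = os.stat(path)
    except OSError as exc:
        report["error"] = str(exc)
        return report

    report["exists"] = True
    report["is_char_device"] = stat.S_ISCHR(st.st_mode)
    if report["is_char_device"]:
        numbers = (os.major(st.st_rdev), os.minor(st.st_rdev))
        report["numbers_match"] = numbers == expected

    try:
        # Open without truncating; a denied device open surfaces here as an OSError.
        fd = os.open(path, os.O_WRONLY)
        os.close(fd)
        report["writable"] = True
    except OSError as exc:
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    print(inspect_device())
```

If the node exists with the right device numbers but the open fails with "Operation not permitted", that would point at access control (for example a device filter) rather than at the mount namespace setup.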

Job file (if appropriate)

job "test-env" {
        datacenters = ["cd01"]
        type = "service"

        constraint {
                attribute = "${attr.unique.hostname}"
                value = "..."
        }

        group "group" {
                restart {
                        attempts = 5
                        mode     = "delay"
                        delay = "1s"
                        interval = "5s"
                }

                task "try" {
                        driver = "raw_exec"
                        config {
                                command = "/usr/bin/bash"
                                # see if /dev/null disappears
                                args = [ "-c", "dd if=/dev/zero of=/dev/null count=1 || (echo \"busted\"; sleep 1000)"]
                        }
                }
        }
}

shoenig (Member) commented May 4, 2022

Thanks for testing this out and reporting @badalex! Indeed I can reproduce this given your job file, and should be able to figure out what's going on from here.

shoenig self-assigned this May 4, 2022
shoenig added the theme/cgroups (cgroups issues) label May 4, 2022
shoenig added this to the 1.3.0 milestone May 4, 2022
shoenig added the stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) label May 4, 2022

badalex (Author) commented May 4, 2022

Sweet, I can confirm that PR fixes the restart issue.

github-actions bot commented Oct 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Oct 8, 2022