Nomad 1.3.0 - cgroup-parent for systemd error #13022

Closed
replay111 opened this issue May 14, 2022 · 10 comments
Labels
help-wanted We encourage community PRs for these issues! theme/cgroups cgroups issues theme/docs Documentation issues and enhancements

Comments

@replay111

Nomad version

nomad version
Nomad v1.3.0 (52e95d6)

Operating system and Environment details

cat /etc/*release
Oracle Linux Server release 8.5
NAME="Oracle Linux Server"
VERSION="8.5"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.5"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.5"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:5:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://bugzilla.oracle.com/"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.5
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.5
Red Hat Enterprise Linux release 8.5 (Ootpa)
Oracle Linux Server release 8.5

docker version
Client: Docker Engine - Community
Version: 20.10.16
API version: 1.41
Go version: go1.17.10
Git commit: aa7e414
Built: Thu May 12 09:17:20 2022
OS/Arch: linux/amd64
Context: default
Experimental: true

Server: Docker Engine - Community
Engine:
Version: 20.10.16
API version: 1.41 (minimum version 1.12)
Go version: go1.17.10
Git commit: f756502
Built: Thu May 12 09:15:41 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.4
GitCommit: 212e8b6fa2f44b9c21b2798135fc6fb7c53efc16
runc:
Version: 1.1.1
GitCommit: v1.1.1-0-g52de29d
docker-init:
Version: 0.19.0
GitCommit: de40ad0

cat /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=systemd"],
"metrics-addr" : "0.0.0.0:9323",
"insecure-registries": [
"172.30.0.0/16"
]
}

Issue

Job is failing with error:

Driver Failure 	failed to create container: API error (400): cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice"

Reproduction steps

Nomad job definition:

cat << 'EOF' > jenkins.nomad
variable "job_storage" {
  default = "/nfs_shared/jenkins"
}

job "jenkins" {
  datacenters = [ "DC1" ]
  type = "service"
  namespace = "ci-cd"

  group "jenkins" {
    count = 1
    network {
      port "http" {
        to = "8080"
      }
      port "agent" {
        to = "5000"
      }
    }
    task "prepare_dirs" {
      driver = "raw_exec"
      resources {
        memory = "10"
        cpu    = "16"
      }
      lifecycle {
        hook = "prestart"
        sidecar = false
      }
      template {
        data = <<TPLEOF
#!/bin/bash
WORKDIR="${var.job_storage}"

if [ ! -f $WORKDIR/init.groovy.d/security.groovy ]; then
mkdir -p $WORKDIR/init.groovy.d

cat << IEOF > $WORKDIR/init.groovy.d/security.groovy
import jenkins.model.*
import hudson.security.*
println "--> creating admin user"
def adminUsername = "admin"
def adminPassword = "admin"
def hudsonRealm = new HudsonPrivateSecurityRealm(false)
hudsonRealm.createAccount(adminUsername, adminPassword)
Jenkins.instance.save()
IEOF

chmod -v 777 $WORKDIR/init.groovy.d/security.groovy
fi

echo "All done!"

TPLEOF

        destination = "/local/runme.bash"
      }
      config {
        command = "/bin/bash"
        args    = ["-x","local/runme.bash"]
      }
    }

    task "jenkins" {
      driver = "docker"
      env {
        TZ = "Europe/Warsaw"
      }      
      config {
        image  = "jenkins/jenkins:lts-centos7-jdk11"
        force_pull = false
        hostname = "${NOMAD_TASK_NAME}.nomad.dom.net"
        labels {
          group = "jenkins"
        }
        mount {
          type = "bind"
          target = "/var/run/docker.sock"
          source = "/var/run/docker.sock"
          readonly = false
        }
        mount {
          type = "bind"
          target = "/var/jenkins_home"
          source = "${var.job_storage}"
          readonly = false
        }
        ports = ["http", "agent"]
      }
      resources {
        memory = 1024
        cpu    = 1024
      }
      service {
        name = "jenkins"
        tags = [ "urlprefix-${NOMAD_TASK_NAME}.local.net", "jenkins" ]
        port = "http"
        check {
          type       = "http"
          port       = "http"
          path       = "/login"
          interval   = "30s"
          timeout    = "5s"
        }
      }
    }
  }
}
EOF

export NOMAD_ADDR="http://host-ip-with-nomad-service:4646"

nomad namespace apply -description "CI/CD Tools" ci-cd
nomad run ./jenkins.nomad

Expected Result

The Jenkins container is up and running, without the error described at the top.

Actual Result

The task fails with the error already mentioned:

Driver Failure 	failed to create container: API error (400): cgroup-parent for systemd cgroup should be a valid slice named as "xxx.slice"

Job file (if appropriate)

Identical to the jenkins.nomad job definition shown under Reproduction steps above.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Nomad server and client are on the same host - test configuration

tail -f ./nomad.log 

2022-05-14T18:25:04.030+0200 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107 task=prepare_dirs @module=logmon path=/app/nomad/storage/alloc/1e68afd9-d401-2707-aacf-d69125cd8107/alloc/logs/.prepare_dirs.stdout.fifo timestamp="2022-05-14T18:25:04.029+0200"
2022-05-14T18:25:04.030+0200 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107 task=prepare_dirs path=/app/nomad/storage/alloc/1e68afd9-d401-2707-aacf-d69125cd8107/alloc/logs/.prepare_dirs.stderr.fifo @module=logmon timestamp="2022-05-14T18:25:04.030+0200"
2022-05-14T18:25:04.050+0200 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
2022-05-14T18:25:04.050+0200 [INFO]  agent: (runner) creating watcher
2022-05-14T18:25:04.050+0200 [INFO]  agent: (runner) starting
2022-05-14T18:25:04.054+0200 [INFO]  agent: (runner) rendered "(dynamic)" => "/app/nomad/storage/alloc/1e68afd9-d401-2707-aacf-d69125cd8107/prepare_dirs/local/runme.bash"
2022-05-14T18:25:04.062+0200 [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-x local/runme.bash]}"
2022-05-14T18:25:04.533+0200 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107 task=prepare_dirs reason="Restart unnecessary as task terminated successfully"
2022-05-14T18:25:04.538+0200 [INFO]  agent: (runner) stopping
2022-05-14T18:25:04.538+0200 [INFO]  agent: (runner) received finish
2022-05-14T18:25:04.556+0200 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107 task=jenkins @module=logmon path=/app/nomad/storage/alloc/1e68afd9-d401-2707-aacf-d69125cd8107/alloc/logs/.jenkins.stdout.fifo timestamp="2022-05-14T18:25:04.556+0200"
2022-05-14T18:25:04.556+0200 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107 task=jenkins @module=logmon path=/app/nomad/storage/alloc/1e68afd9-d401-2707-aacf-d69125cd8107/alloc/logs/.jenkins.stderr.fifo timestamp="2022-05-14T18:25:04.556+0200"
2022-05-14T18:25:19.348+0200 [ERROR] client.driver_mgr.docker: failed to create container: driver=docker error="API error (400): cgroup-parent for systemd cgroup should be a valid slice named as \"xxx.slice\""
2022-05-14T18:25:19.352+0200 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107 task=jenkins error="failed to create container: API error (400): cgroup-parent for systemd cgroup should be a valid slice named as \"xxx.slice\""
2022-05-14T18:25:19.352+0200 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107 task=jenkins reason="Error was unrecoverable"
2022-05-14T18:25:19.360+0200 [INFO]  client.gc: marking allocation for GC: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107
2022-05-14T18:25:23.361+0200 [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107 task=jenkins @module=logmon timestamp="2022-05-14T18:25:23.361+0200"
2022-05-14T18:25:23.361+0200 [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=1e68afd9-d401-2707-aacf-d69125cd8107 task=jenkins @module=logmon timestamp="2022-05-14T18:25:23.361+0200"


2022-05-14T18:25:49.390+0200 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361 task=prepare_dirs @module=logmon path=/app/nomad/storage/alloc/4b130c5c-9381-ff00-cf16-79ace3ac1361/alloc/logs/.prepare_dirs.stdout.fifo timestamp="2022-05-14T18:25:49.390+0200"
2022-05-14T18:25:49.391+0200 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361 task=prepare_dirs path=/app/nomad/storage/alloc/4b130c5c-9381-ff00-cf16-79ace3ac1361/alloc/logs/.prepare_dirs.stderr.fifo @module=logmon timestamp="2022-05-14T18:25:49.391+0200"
2022-05-14T18:25:49.410+0200 [INFO]  agent: (runner) creating new runner (dry: false, once: false)
2022-05-14T18:25:49.410+0200 [INFO]  agent: (runner) creating watcher
2022-05-14T18:25:49.411+0200 [INFO]  agent: (runner) starting
2022-05-14T18:25:49.414+0200 [INFO]  agent: (runner) rendered "(dynamic)" => "/app/nomad/storage/alloc/4b130c5c-9381-ff00-cf16-79ace3ac1361/prepare_dirs/local/runme.bash"
2022-05-14T18:25:49.422+0200 [INFO]  client.driver_mgr.raw_exec: starting task: driver=raw_exec driver_cfg="{Command:/bin/bash Args:[-x local/runme.bash]}"
2022-05-14T18:25:49.900+0200 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361 task=prepare_dirs reason="Restart unnecessary as task terminated successfully"
2022-05-14T18:25:49.905+0200 [INFO]  agent: (runner) stopping
2022-05-14T18:25:49.906+0200 [INFO]  agent: (runner) received finish
2022-05-14T18:25:49.922+0200 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361 task=jenkins @module=logmon path=/app/nomad/storage/alloc/4b130c5c-9381-ff00-cf16-79ace3ac1361/alloc/logs/.jenkins.stdout.fifo timestamp="2022-05-14T18:25:49.922+0200"
2022-05-14T18:25:49.922+0200 [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361 task=jenkins @module=logmon path=/app/nomad/storage/alloc/4b130c5c-9381-ff00-cf16-79ace3ac1361/alloc/logs/.jenkins.stderr.fifo timestamp="2022-05-14T18:25:49.922+0200"
2022-05-14T18:26:04.679+0200 [ERROR] client.driver_mgr.docker: failed to create container: driver=docker error="API error (400): cgroup-parent for systemd cgroup should be a valid slice named as \"xxx.slice\""
2022-05-14T18:26:04.683+0200 [ERROR] client.alloc_runner.task_runner: running driver failed: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361 task=jenkins error="failed to create container: API error (400): cgroup-parent for systemd cgroup should be a valid slice named as \"xxx.slice\""
2022-05-14T18:26:04.684+0200 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361 task=jenkins reason="Error was unrecoverable"
2022-05-14T18:26:04.691+0200 [INFO]  client.gc: marking allocation for GC: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361
2022-05-14T18:26:08.694+0200 [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361 task=jenkins @module=logmon timestamp="2022-05-14T18:26:08.693+0200"
2022-05-14T18:26:08.694+0200 [WARN]  client.alloc_runner.task_runner.task_hook.logmon.nomad: timed out waiting for read-side of process output pipe to close: alloc_id=4b130c5c-9381-ff00-cf16-79ace3ac1361 task=jenkins @module=logmon timestamp="2022-05-14T18:26:08.694+0200"

NOTE

Exactly the same configuration works without any problems on Nomad 1.2.6 (I stopped the service, removed everything, and deployed a fresh instance with 1.2.6).

@shoenig
Member

shoenig commented May 16, 2022

Hi @replay111, did you set the value "exec-opts": ["native.cgroupdriver=systemd"], in the docker config yourself?

When using the docker driver Nomad needs control over where the cgroup for the container gets created, and I suspect that is going to conflict with the systemd cgroup docker driver.
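
To see which cgroup driver a Docker daemon is actually using, the value can be read from `docker info`. The sketch below only classifies a driver name that is passed in; the function name `classify_cgroup_driver` is hypothetical, and treating `systemd` as a conflict is an assumption based solely on the behaviour reported in this thread for Nomad 1.3.x.

```shell
#!/bin/sh
# Sketch: classify the cgroup driver name that Docker reports.
# Assumption: "systemd" is flagged only because of the failure reported
# in this thread for Nomad 1.3.x; this is not an official compatibility list.
classify_cgroup_driver() {
  case "$1" in
    systemd)  echo "systemd: may conflict with Nomad 1.3.x docker tasks"; return 1 ;;
    cgroupfs) echo "cgroupfs: no conflict reported in this thread"; return 0 ;;
    *)        echo "unrecognized driver: $1"; return 2 ;;
  esac
}

# On a host with a running Docker daemon, the driver name can be read with:
#   classify_cgroup_driver "$(docker info --format '{{.CgroupDriver}}')"
```

The function takes the driver name as an argument so that it can be tried without a Docker daemon.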

@replay111
Author

replay111 commented May 16, 2022

@shoenig - yes - it is placed at the top of the issue:

cat /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=systemd"],
"metrics-addr" : "0.0.0.0:9323",
"insecure-registries": [
"172.30.0.0/16"
]
}

This line is added during the Docker installation by my Ansible module.
I remember having some problems with Docker in the past that required adding it.
This config works fine with Nomad 1.2.6.
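
When a setting like this is injected by automation, it can help to check whether a host's daemon.json still carries it. The following is a minimal sketch; `has_systemd_cgroupdriver` is a hypothetical helper name, and the check is a plain text match, not a JSON parse.

```shell
#!/bin/sh
# Sketch: report whether a Docker daemon.json pins the systemd cgroup driver.
# $1 is the path to the file, typically /etc/docker/daemon.json.
# Assumption: a simple text match is good enough; the JSON is not parsed.
has_systemd_cgroupdriver() {
  [ -f "$1" ] && grep -q 'native\.cgroupdriver=systemd' "$1"
}

# Example usage on a Docker host:
#   if has_systemd_cgroupdriver /etc/docker/daemon.json; then
#     echo "daemon.json sets native.cgroupdriver=systemd"
#   fi
```

A missing file is treated the same as a file without the setting. An absent setting does not by itself say which driver Docker ends up using, so the value reported by the daemon is still worth checking.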

@replay111
Author

Hi, I've updated my Ansible Collection so that it no longer adds the line that refers to the cgroup driver:
"exec-opts": ["native.cgroupdriver=systemd"]. I redeployed and now everything seems to be working fine.

@shoenig
Member

shoenig commented May 16, 2022

Thanks for the update @replay111. Since this is just a configuration issue I think we should update our docker driver docs, mentioning the cgroupdriver should be left unset.

shoenig added the theme/docs label and removed the type/bug, stage/waiting-reply, and stage/needs-investigation labels on May 16, 2022
@valodzka
Contributor

valodzka commented May 20, 2022

I think we should update our docker driver docs, mentioning the cgroupdriver should be left unset.

@shoenig I'm not sure how the default is determined, but at least on Debian 11 (cgroup v2 by default) with Linux 5.10.0-10-amd64 and Docker 5:20.10.16, the cgroup driver is systemd. If Nomad doesn't work with this driver, the docs should say that native.cgroupdriver=cgroupfs must be configured explicitly, not left unset:

curl --silent -XGET --unix-socket /run/docker.sock http://localhost/info | jq .CgroupDriver
"systemd"

@valodzka
Contributor

valodzka commented May 20, 2022

Found it: the default comes from Docker 20.10.0, see https://docs.docker.com/engine/release-notes/#20100
@shoenig Should the driver be changed in order to run Docker with Nomad?

cgroup2: use “systemd” cgroup driver by default when available moby/moby#40846

@shoenig
Member

shoenig commented May 20, 2022

@valodzka I'm not sure #40846 is the full story. In practice, I think leaving the driver blank allows setting a custom cgroup parent, while setting it to systemd forces the systemd hierarchy. Could be totally wrong; docker is frankly an enigma.

What Nomad really needs is to be able to just set the cgroup path the docker container should use. Unfortunately moby/moby#43363 seems to have been ignored.
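
The error reported in this issue states the constraint directly: with the systemd cgroup driver, Docker expects the cgroup parent to be a slice named like "xxx.slice". A minimal shell sketch of that naming rule follows. The function name and the sample values are hypothetical, and Docker's real validation may be stricter than this suffix check.

```shell
#!/bin/sh
# Sketch: does a cgroup parent look like a systemd slice name ("xxx.slice")?
# Assumption: only the ".slice" suffix with a non-empty prefix is checked.
is_slice_name() {
  case "$1" in
    ?*.slice) return 0 ;;
    *)        return 1 ;;
  esac
}

# Hypothetical sample values, for illustration only.
for parent in "nomad.slice" "/nomad"; do
  if is_slice_name "$parent"; then
    echo "$parent: looks like a slice name"
  else
    echo "$parent: not a slice name"
  fi
done
```

A path-style parent fails this check, which is consistent with the comment above that the systemd driver forces the systemd hierarchy.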

@jacksod1

We encountered this same issue when upgrading to 1.3.1. We are running RHEL's supported version of docker, which is configured to use the systemd cgroup driver. We are considering moving back to 1.2.8 until this issue is resolved. Will the systemd cgroup driver be supported?

@shoenig
Member

shoenig commented Aug 29, 2022

Will the systemd cgroup driver be supported?

@jacksod1 can you (or someone) describe (or link to docs) what needs to be done to support this configuration?

shoenig added the help-wanted label on Aug 29, 2022
@shoenig
Member

shoenig commented Oct 24, 2023

Should be fixed as of #18371 (Nomad 1.7)
