Nomad can no longer launch commands with raw_exec if /sys/fs/cgroup does not exist (old kernels) #8565

Closed
dposton80 opened this issue Jul 30, 2020 · 7 comments · Fixed by #9328
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/docs Documentation issues and enhancements type/bug

Comments

@dposton80

Nomad version

0.12.0

Operating system and Environment details

RHEL6, kernel version 2.6.32-754.30.2.el6.x86_64

Issue

Appreciate that older kernel versions may not be supported, in which case please close this. However it may be useful for others.

I was able to run Nomad v0.11.3 on RHEL6 by running it with a newer version of libc (which I did via patchelf --set-interpreter glibc-2.19/lib/ld-linux-x86-64.so.2 --set-rpath glibc-2.19/lib). It could launch commands successfully and generally worked well.

However, with Nomad v0.12.0 this no longer works: the raw_exec driver fails with:

    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: panic: cannot statfs cgroup root: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: : alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: goroutine 8 [running]:: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: github.com/opencontainers/runc/libcontainer/cgroups.IsCgroup2UnifiedMode.func1(): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     github.com/opencontainers/[email protected]/libcontainer/cgroups/utils.go:45 +0xbe: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: sync.(*Once).doSlow(0x54d2310, 0x3313690): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     sync/once.go:66 +0xec: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: sync.(*Once).Do(...): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     sync/once.go:57: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: github.com/opencontainers/runc/libcontainer/cgroups.IsCgroup2UnifiedMode(0x5312ca0): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     github.com/opencontainers/[email protected]/libcontainer/cgroups/utils.go:42 +0x58: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: github.com/opencontainers/runc/libcontainer/cgroups.isSubsystemAvailable(0x321931a, 0x7, 0x24): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     github.com/opencontainers/[email protected]/libcontainer/cgroups/utils.go:106 +0x26: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: github.com/opencontainers/runc/libcontainer/cgroups.FindCgroupMountpointAndRoot(0x0, 0x0, 0x321931a, 0x7, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     github.com/opencontainers/[email protected]/libcontainer/cgroups/utils.go:65 +0x75: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: github.com/hashicorp/nomad/drivers/shared/executor.getCgroupPathHelper(0x321931a, 0x7, 0xc0005304b0, 0x2b, 0x2b, 0xc000544000, 0x41, 0xc000515880): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:716 +0x55: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: github.com/hashicorp/nomad/drivers/shared/executor.configureBasicCgroups(0xc000515780, 0xc000544000, 0x0): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     github.com/hashicorp/nomad/drivers/shared/executor/executor_linux.go:700 +0xb9: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: github.com/hashicorp/nomad/drivers/shared/executor.(*UniversalExecutor).configureResourceContainer(0xc0002241a0, 0xbfd, 0x0, 0x0): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     github.com/hashicorp/nomad/drivers/shared/executor/executor_universal_linux.go:80 +0xff: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: github.com/hashicorp/nomad/drivers/shared/executor.(*UniversalExecutor).Launch(0xc0002241a0, 0xc0005080f0, 0x0, 0x0, 0x0): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     github.com/hashicorp/nomad/drivers/shared/executor/executor.go:283 +0x258: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad: github.com/hashicorp/nomad/drivers/shared/executor.(*grpcExecutorServer).Launch(0xc000206750, 0x38e7100, 0xc000506000, 0xc000508000, 0xc000206750, 0xc000506000, 0xc000214b78): alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task
    2020-07-30T11:34:29.586+0100 [DEBUG] client.driver_mgr.raw_exec.executor.nomad:     github.com/hashicorp/nomad/drivers/shared/executor/server.go:23 +0x371: alloc_id=d3cfc692-5264-7cc8-fd31-1e65072808b9 driver=raw_exec task_name=test_task

Without debug logging enabled, this appears only as a task 'Driver Failure' with the error 'failed to launch command with executor: rpc error: code = Unavailable desc = transport is closing'.

The problem seems to be at line 45 of /vendor/github.com/opencontainers/runc/libcontainer/cgroups/utils.go, in IsCgroup2UnifiedMode:

		if err := syscall.Statfs(unifiedMountpoint, &st); err != nil {
			panic("cannot statfs cgroup root")
		}

Here unifiedMountpoint is "/sys/fs/cgroup". It seems the code now panics if this path doesn't exist.

If this were just to return false instead (or the panic were avoided some other way), I think everything would work (seems the code otherwise tolerates cgroups initialization returning an error).
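As a rough illustration of the "return false instead of panicking" idea, here is a minimal sketch. It is not the libcontainer code: the function name isCgroup2Unified and its (bool, error) signature are assumptions made for this example. Only the syscall.Statfs call and the "/sys/fs/cgroup" path come from the snippet above. The magic number is the value Linux reports for a cgroup v2 filesystem.

```go
package main

import (
	"fmt"
	"syscall"
)

// cgroup2SuperMagic is the filesystem type Linux reports for cgroup v2
// (CGROUP2_SUPER_MAGIC).
const cgroup2SuperMagic = 0x63677270

// isCgroup2Unified is a hypothetical, non-panicking variant of the check.
// If the mountpoint cannot be statfs'd (for example because it does not
// exist on an old kernel), it returns false plus an error and leaves the
// decision to the caller.
func isCgroup2Unified(mountpoint string) (bool, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(mountpoint, &st); err != nil {
		return false, fmt.Errorf("cannot statfs cgroup root %q: %w", mountpoint, err)
	}
	return st.Type == cgroup2SuperMagic, nil
}

func main() {
	unified, err := isCgroup2Unified("/sys/fs/cgroup")
	if err != nil {
		// The caller can log the error and carry on without cgroups.
		fmt.Println("cgroups unavailable, continuing without them:", err)
		return
	}
	fmt.Println("cgroup v2 unified mode:", unified)
}
```

Returning an error rather than a bare false keeps the failure visible to the caller, which could also use it to report an unsupported kernel at startup.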

If it's not intended to support certain kernel versions, it might be good to have an error on startup.

@tgross
Member

tgross commented Jul 30, 2020

Hi @dposton80! The error you're seeing is bubbling up from the third-party libcontainer, which we've updated for a security issue in 0.12.0 (see #8246). It might be worth reporting this to that project to see whether they can recommend a workaround for older kernels.

> If it's not intended to support certain kernel versions, it might be good to have an error on startup.

That support depends on which task drivers you have enabled, which makes it a little tricky to state a specific version. Unfortunately, it looks like we don't even document that (or at least not anywhere I would expect to see it), so I'm going to mark this as a documentation bug at least.

@tgross tgross added theme/docs Documentation issues and enhancements type/bug labels Jul 30, 2020
@notnoop
Contributor

notnoop commented Jul 30, 2020

As a potential workaround, you can disable cgroups usage in raw_exec with the no_cgroups flag. Can you try adding the following snippet to your client config:

plugin "raw_exec" {
  config {
    no_cgroups = true
  }
}

The raw_exec driver uses cgroups to improve process tracking for metric collection and shutdown, so with this flag set you may notice some odd behavior: child processes may not be tracked, or may not be killed if they don't clean up properly.

@dposton80
Author

dposton80 commented Jul 30, 2020

Thanks for the responses. I already tried the no_cgroups option, but I'm afraid it doesn't seem to make any difference. It seems the call to configureResourceContainer in drivers/shared/executor/executor.go (line 283) should be skipped if command.BasicProcessCgroup is false?
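To make that suggestion concrete, here is a minimal sketch of the guard. It is not Nomad's executor code: the execCommand and launcher types and the configureCgroup hook are hypothetical stand-ins, and only the BasicProcessCgroup field name is taken from the source. The assumption is that BasicProcessCgroup is false when no_cgroups is set.

```go
package main

import (
	"errors"
	"fmt"
)

// execCommand is a hypothetical stand-in for the executor's command type.
// Only the BasicProcessCgroup field name comes from the discussion above.
type execCommand struct {
	BasicProcessCgroup bool
}

// launcher is a hypothetical executor. configureCgroup stands in for
// configureResourceContainer.
type launcher struct {
	configureCgroup func(pid int) error
}

// launch touches cgroups only when the command asks for them, so a host
// without /sys/fs/cgroup never reaches the cgroup code path.
func (l *launcher) launch(cmd execCommand, pid int) error {
	if !cmd.BasicProcessCgroup {
		return nil
	}
	if err := l.configureCgroup(pid); err != nil {
		return fmt.Errorf("configuring cgroup for pid %d: %w", pid, err)
	}
	return nil
}

func main() {
	l := &launcher{configureCgroup: func(int) error {
		return errors.New("cannot statfs cgroup root")
	}}
	// Cgroups disabled: the failing hook is never called.
	fmt.Println(l.launch(execCommand{BasicProcessCgroup: false}, 1234))
	// Cgroups enabled: the failure is returned as an error.
	fmt.Println(l.launch(execCommand{BasicProcessCgroup: true}, 1234))
}
```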

@notnoop
Contributor

notnoop commented Jul 30, 2020

Well, that's unfortunate. This is a bug that we should fix: no cgroup operation should occur when no_cgroups is set!

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label Aug 24, 2020
@qianglchina

Is there any progress on this? I have the same issue in the latest Nomad, 0.12.7.

notnoop pushed a commit that referenced this issue Nov 11, 2020
When raw_exec is configured with [`no_cgroups`](https://www.nomadproject.io/docs/drivers/raw_exec#no_cgroups), raw_exec shouldn't attempt to create a cgroup.

Prior to this change, we accidentally always required a freezer cgroup for stats PID tracking. We already have the proper fallback in place for metrics, so we only need to ensure that we don't create a cgroup for the task.

Fixes #8565
@notnoop
Contributor

notnoop commented Nov 11, 2020

@qianglchina Thanks for your patience. I have just merged a fix to be included in the next Nomad release.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 28, 2022