
CNI: use check command when restoring from restart #24658

Merged
tgross merged 1 commit into main from b-24292-netns-restore on Jan 7, 2025

Conversation

@tgross (Member) commented Dec 12, 2024

When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon.

When we restore an allocation, we use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0, to avoid a bug in older plugin versions that would cause the check to fail, so we consult the fingerprinted plugin version before performing the check.

If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail).
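
As a rough sketch of that flow in Go (the cniClient interface, pluginOK flag, and helper names here are illustrative stand-ins, not Nomad's actual internals):

package sketch

import (
	"context"
	"fmt"
)

// cniClient abstracts the CNI operations used during restore. This is an
// illustrative interface, not Nomad's real client.
type cniClient interface {
	Check(ctx context.Context, allocID, nsPath string) error
	Remove(ctx context.Context, allocID, nsPath string) error
	Setup(ctx context.Context, allocID, nsPath string) error
}

// restoreNetwork verifies a restored allocation's network namespace and
// rebuilds it once if the CNI check fails. pluginOK reports whether the
// fingerprinted CNI plugins are at least v1.2.0; older versions have a
// bug that makes the check fail spuriously.
func restoreNetwork(ctx context.Context, cni cniClient, pluginOK bool, allocID, nsPath string) error {
	if !pluginOK {
		return nil // can't trust the check on old plugins; keep prior behavior
	}
	if err := cni.Check(ctx, allocID, nsPath); err == nil {
		return nil // namespace matches the expected configuration
	}
	// First pass failed: destroy the namespace and recreate it once.
	if err := cni.Remove(ctx, allocID, nsPath); err != nil {
		return fmt.Errorf("failed to destroy network namespace: %w", err)
	}
	if err := cni.Setup(ctx, allocID, nsPath); err != nil {
		// Fail the restore so the allocation is recreated rather than
		// left running with silently broken networking.
		return fmt.Errorf("failed to recreate network namespace: %w", err)
	}
	return nil
}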

This should fix the gap left by #24650 for Docker task drivers and any other drivers with the MustInitiateNetwork capability.

Fixes: #24292
Ref: #24650
Ref: https://hashicorp.atlassian.net/browse/NET-11869

Testing & Reproduction steps

Run a cluster on a set of VMs, with at least one client. This can't be a single server+client node because we need to reboot the hosts. You should probably set server.heartbeat_grace = "5m" to give yourself time to work.
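
In the server agent configuration, that setting looks like:

server {
  heartbeat_grace = "5m"
}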

  • Run a Docker task with network.mode = "bridge". Wait for it to be healthy.
  • Restart the agent (see the example commands after this list).
  • Make sure the alloc is untouched and networking works.
  • Reboot the client host.
  • Make sure the alloc is restored, the tasks are restarted, and networking works.
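
On a systemd-based host (an assumption; adjust for your init system and unit name), the restart and reboot steps look like:

# restart the Nomad client agent in place
sudo systemctl restart nomad

# later, reboot the whole client host
sudo reboot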

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@tgross changed the title from "cni: use check command when restoring from restart" to "CNI: use check command when restoring from restart" on Dec 12, 2024
@tgross added the theme/networking, type/bug, backport/ent/1.7.x+ent, backport/ent/1.8.x+ent, and backport/1.9.x labels on Dec 12, 2024
@tgross added this to the 1.9.x milestone on Dec 12, 2024
@tgross force-pushed the b-24292-netns-restore branch from 98343be to 66dcfcb on December 12, 2024 20:52
@tgross requested review from jrasell and shoenig on December 12, 2024 20:52
@tgross force-pushed the b-24292-netns-restore branch from 66dcfcb to 9daf267 on December 12, 2024 21:12
@apollo13 (Contributor)

Your reproduction steps say:

Run a Docker/Podman task with network.mode = "bridge". Wait for it to be healthy.

Is the podman driver actually affected by this? I thought that one would not use a pause container and would always let Nomad configure the network namespaces on its own.

@tgross (Member, Author) commented Dec 12, 2024

Is the podman driver actually affected by this? I thought that one would not use a pause container and would always let Nomad configure the network namespaces on its own.

You know what, I assumed that it used pause containers too (I've hardly ever run it myself), but that's not the case: https://github.com/hashicorp/nomad-driver-podman/blob/main/driver.go#L87. So yeah, just Docker here. Will edit the repro steps.

@apollo13 (Contributor) commented Dec 12, 2024

You know what, I assumed that it used pause containers too

Luckily no, podman run supports --network=ns:/path/for/ns/to/join in addition to the half-baked --network=container:id that docker has (and obviously via the API as well).
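
For illustration, joining a pre-created namespace with podman looks roughly like this (the namespace path is a made-up example):

podman run --network=ns:/var/run/netns/example docker.io/library/nginx:latest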

@jrasell (Member) left a comment

I tested this using an AWS environment with 1 server and 1 client instance. The client instance has CNI v1.3.0 plugins installed.

JobSpec:

job "example" {

  group "nginx" {
    network {
      mode = "bridge"
      port "http" {
        to = 80
      }
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Running through the steps using a build from main (f4529485563924462dbdccdd1b4cacbd9d68616e), when I rebooted the instance the allocation was restarted, but networking was broken and I was unable to perform a simple curl to NGINX.

I then tested this patch (9daf267e33bd502d4a386dc1dd6d4a728894c194), performing the same steps as detailed in the PR and in the previous test. When the instance was rebooted, the allocation failed and a new allocation was scheduled in its place.

Failed allocation events:

2024-12-13T08:59:18Z  Terminated     Exit Code: 0
2024-12-13T08:59:18Z  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: Interface name nomad not found
2024-12-13T08:57:08Z  Started        Task started by client
2024-12-13T08:57:08Z  Driver         Downloading image
2024-12-13T08:57:07Z  Task Setup     Building Task Directory
2024-12-13T08:57:07Z  Received       Task received by client

Client logs around the time the allocation was marked failed:

2024-12-13T08:59:18.385Z [ERROR] allocrunner/alloc_runner.go:403: client.alloc_runner: prerun failed: alloc_id=8f3b3d5a-906a-7f53-8d72-e9c782bb4dee error="pre-run hook \"network\" failed: failed to configure networking for alloc: Interface name nomad not found"
2024-12-13T08:59:18.385Z [INFO]  taskrunner/task_runner.go:1468: client.alloc_runner.task_runner: Task event: alloc_id=8f3b3d5a-906a-7f53-8d72-e9c782bb4dee task=nginx type="Setup Failure" msg="failed to setup alloc: pre-run hook \"network\" failed: failed to configure networking for alloc: Interface name nomad not found" failed=true
2024-12-13T08:59:18.387Z [INFO]  docker/handle.go:229: client.driver_mgr.docker: stopped container: container_id=9d6cc3a56a51596885f36fec7c4d74ea1002e8727c1fc213851ac42fed53e8cb driver=docker

@tgross (Member, Author) commented Dec 13, 2024

I've updated this PR with a retry: if the check fails the first time, we attempt to recreate the network namespace from scratch (trying this once). This should reduce the cases where, after a reboot, the allocation cannot be restored and is failed. I've also added tests covering that logic.

@tgross marked this pull request as ready for review on December 13, 2024 18:42
@tgross requested review from a team as code owners on December 13, 2024 18:42
@tgross force-pushed the b-24292-netns-restore branch from 04a444a to 137e7d6 on December 13, 2024 18:53
@aimeeu previously approved these changes on Dec 18, 2024
@pbortnick removed their request for review on January 6, 2025 13:32
@jrasell previously approved these changes on Jan 7, 2025
@jrasell (Member) left a comment

LGTM! I left two very minor and ignorable comments.

I ran the same test outlined in my previous review; when the client instance rebooted and the Nomad process started, the allocation was restarted and the task's API was still available.

Task events:

Recent Events:
Time                  Type        Description
2025-01-07T09:12:56Z  Started     Task started by client
2025-01-07T09:12:55Z  Driver      Downloading image
2025-01-07T09:12:40Z  Restarting  Task restarting in 15.75160817s
2025-01-07T09:12:40Z  Terminated  Exit Code: 0
2025-01-07T09:11:12Z  Started     Task started by client
2025-01-07T09:11:07Z  Driver      Downloading image
2025-01-07T09:11:07Z  Task Setup  Building Task Directory
2025-01-07T09:11:06Z  Received    Task received by client

Networking confirmation:

$ curl 10.0.1.193:29665
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

@tgross merged commit 08a6f87 into main on Jan 7, 2025
32 checks passed
@tgross deleted the b-24292-netns-restore branch on January 7, 2025 14:38
@ahjohannessen

@tgross Any ETA on Nomad 1.9.5? Bridge networking seems to be unstable when nodes restart or when parts of jobs are updated. The iptables rules do not seem to be updated, because I see issues with containers not being able to communicate with IPs on the subnet the host is on. When I stop a job and start it again, network communication works again.

Seeing stuff in the logs like:

Jan 07 14:59:31 f04 kernel: veth50336319 (unregistering): left promiscuous mode
Jan 07 14:59:31 f04 audit: ANOM_PROMISCUOUS dev=veth50336319 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
Jan 07 14:59:35 f04 nomad[1415]:  client.alloc_runner.task_runner: Task event: alloc_id=f98147e9-0f1f-79dc-223c-2336ebe30a48 task=connect-proxy-count-api type=Received msg="Task received by client" failed=false
Jan 07 14:59:35 f04 nomad[1415]:     2025-01-07T14:59:35.113Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f98147e9-0f1f-79dc-223c-2336ebe30a48 task=connect-proxy-count-api type=Received msg="Task received by client" failed=false
Jan 07 14:59:35 f04 audit: BPF prog-id=117 op=LOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=118 op=LOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=118 op=UNLOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=119 op=LOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=119 op=UNLOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=120 op=LOAD
Jan 07 14:59:35 f04 audit: ANOM_PROMISCUOUS dev=vethf2889ee2 prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
Jan 07 14:59:35 f04 kernel: vethf2889ee2: entered promiscuous mode

This indicates that something is wrong or out of sync with the CNI network setup.

This is on Fedora CoreOS 41.20241215.3.0

@jrasell (Member) commented Jan 9, 2025

Hi @ahjohannessen; v1.9.5 is expected to be released within the next 1-2 weeks.

@ahjohannessen commented Jan 9, 2025

@jrasell OK, thanks for the heads up. Is there a CI build with this change that one can try out?

@ahjohannessen

@jrasell I think I found it: https://github.com/hashicorp/nomad/actions/runs/12679755285
