
CNI: use check command when restoring from restart #24658

Merged
tgross merged 1 commit into main from b-24292-netns-restore on Jan 7, 2025

Conversation

@tgross (Member) commented Dec 12, 2024

When the Nomad client restarts and restores allocations, the network namespace for an allocation may exist but no longer be correctly configured. For example, if the host is rebooted and the task was a Docker task using a pause container, the network namespace may be recreated by the docker daemon.

When we restore an allocation, we use the CNI "check" command to verify that any existing network namespace matches the expected configuration. This requires CNI plugins of at least version 1.2.0, to avoid a bug in older plugin versions that would cause the check to fail, so we consult the fingerprinted plugin version before performing the check.

If the check fails, destroy the network namespace and try to recreate it from scratch once. If that fails in the second pass, fail the restore so that the allocation can be recreated (rather than silently having networking fail).
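
As a rough sketch of that flow in Go (the cniClient interface, pluginOK flag, and helper names here are illustrative stand-ins, not Nomad's actual internals):

package sketch

import (
	"context"
	"fmt"
)

// cniClient abstracts the CNI operations used during restore. This is an
// illustrative interface, not Nomad's real client.
type cniClient interface {
	Check(ctx context.Context, allocID, nsPath string) error
	Remove(ctx context.Context, allocID, nsPath string) error
	Setup(ctx context.Context, allocID, nsPath string) error
}

// restoreNetwork verifies a restored allocation's network namespace and
// rebuilds it once if the CNI check fails. pluginOK reports whether the
// fingerprinted CNI plugins are at least v1.2.0; older versions have a
// bug that makes the check fail spuriously.
func restoreNetwork(ctx context.Context, cni cniClient, pluginOK bool, allocID, nsPath string) error {
	if !pluginOK {
		return nil // can't trust the check on old plugins; keep prior behavior
	}
	if err := cni.Check(ctx, allocID, nsPath); err == nil {
		return nil // namespace matches the expected configuration
	}
	// First pass failed: destroy the namespace and recreate it once.
	if err := cni.Remove(ctx, allocID, nsPath); err != nil {
		return fmt.Errorf("failed to destroy network namespace: %w", err)
	}
	if err := cni.Setup(ctx, allocID, nsPath); err != nil {
		// Fail the restore so the allocation is recreated rather than
		// left running with silently broken networking.
		return fmt.Errorf("failed to recreate network namespace: %w", err)
	}
	return nil
}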

This should fix the gap left by #24650 for Docker task drivers and any other drivers with the MustInitiateNetwork capability.

Fixes: #24292
Ref: #24650
Ref: https://hashicorp.atlassian.net/browse/NET-11869

Testing & Reproduction steps

Run a cluster on a set of VMs, with at least one client. This can't be a single server+client node because we need to reboot the hosts. You should probably set server.heartbeat_grace = "5m" to give yourself time to work.
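
In the server agent configuration, that setting looks like:

server {
  heartbeat_grace = "5m"
}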

  • Run a Docker task with network.mode = "bridge". Wait for it to be healthy.
  • Restart the agent (see the example commands after this list).
  • Make sure the alloc is untouched and networking works.
  • Reboot the client host.
  • Make sure the alloc is restored, the tasks are restarted, and networking works.
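
On a systemd-based host (an assumption; adjust for your init system and unit name), the restart and reboot steps look like:

# restart the Nomad client agent in place
sudo systemctl restart nomad

# later, reboot the whole client host
sudo reboot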

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.

@tgross changed the title from "cni: use check command when restoring from restart" to "CNI: use check command when restoring from restart" on Dec 12, 2024
@tgross added the theme/networking, type/bug, backport/ent/1.7.x+ent, backport/ent/1.8.x+ent, and backport/1.9.x labels on Dec 12, 2024
@tgross added this to the 1.9.x milestone on Dec 12, 2024
@tgross force-pushed the b-24292-netns-restore branch from 98343be to 66dcfcb on December 12, 2024 20:52
@tgross requested review from jrasell and shoenig on December 12, 2024 20:52
@tgross force-pushed the b-24292-netns-restore branch from 66dcfcb to 9daf267 on December 12, 2024 21:12
@apollo13 (Contributor)

Your reproduction steps say:

Run a Docker/Podman task with network.mode = "bridge". Wait for it to be healthy.

Is the podman driver actually affected by this? I thought that one would not use a pause container and would always let Nomad configure the network namespaces on its own.

@tgross (Member, Author) commented Dec 12, 2024

Is the podman driver actually affected by this? I thought that one would not use a pause container and would always let Nomad configure the network namespaces on its own.

You know what, I assumed that it used pause containers too (I've hardly ever run it myself), but that's not the case: https://github.com/hashicorp/nomad-driver-podman/blob/main/driver.go#L87. So yeah, just Docker here. Will edit the repro steps.

@apollo13 (Contributor) commented Dec 12, 2024

You know what, I assumed that it used pause containers too

Luckily no, podman run supports --network=ns:/path/for/ns/to/join in addition to the half-baked --network=container:id that docker has (and obviously via the API as well).
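
For illustration, joining a pre-created namespace with podman looks roughly like this (the namespace path is a made-up example):

podman run --network=ns:/var/run/netns/example docker.io/library/nginx:latest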

@jrasell (Member) left a comment

I tested this using an AWS environment with 1 server and 1 client instance. The client instance has CNI v1.3.0 plugins installed.

JobSpec:

job "example" {

  group "nginx" {
    network {
      mode = "bridge"
      port "http" {
        to = 80
      }
    }

    task "nginx" {
      driver = "docker"

      config {
        image = "nginx:latest"
        ports = ["http"]
      }

      resources {
        cpu    = 500
        memory = 256
      }
    }
  }
}

Running through the steps using a build from main (f4529485563924462dbdccdd1b4cacbd9d68616e), when I rebooted the instance the allocation was restarted, but networking was broken and I was unable to perform a simple curl to NGINX.

I then tested this patch (9daf267e33bd502d4a386dc1dd6d4a728894c194), performing the same steps as detailed in the PR and in the previous test. When the instance was rebooted, the allocation failed and a new allocation was scheduled in its place.

Failed allocation events:

2024-12-13T08:59:18Z  Terminated     Exit Code: 0
2024-12-13T08:59:18Z  Setup Failure  failed to setup alloc: pre-run hook "network" failed: failed to configure networking for alloc: Interface name nomad not found
2024-12-13T08:57:08Z  Started        Task started by client
2024-12-13T08:57:08Z  Driver         Downloading image
2024-12-13T08:57:07Z  Task Setup     Building Task Directory
2024-12-13T08:57:07Z  Received       Task received by client

Client logs around the time the allocation was marked failed:

2024-12-13T08:59:18.385Z [ERROR] allocrunner/alloc_runner.go:403: client.alloc_runner: prerun failed: alloc_id=8f3b3d5a-906a-7f53-8d72-e9c782bb4dee error="pre-run hook \"network\" failed: failed to configure networking for alloc: Interface name nomad not found"
2024-12-13T08:59:18.385Z [INFO]  taskrunner/task_runner.go:1468: client.alloc_runner.task_runner: Task event: alloc_id=8f3b3d5a-906a-7f53-8d72-e9c782bb4dee task=nginx type="Setup Failure" msg="failed to setup alloc: pre-run hook \"network\" failed: failed to configure networking for alloc: Interface name nomad not found" failed=true
2024-12-13T08:59:18.387Z [INFO]  docker/handle.go:229: client.driver_mgr.docker: stopped container: container_id=9d6cc3a56a51596885f36fec7c4d74ea1002e8727c1fc213851ac42fed53e8cb driver=docker

@tgross (Member, Author) commented Dec 13, 2024

I've updated this PR with a retry: if the check fails the first time, we attempt to recreate the network namespace from scratch (trying this once). This should reduce the cases where, after a reboot, the allocation cannot be restored and is failed. I've also added tests covering that logic.

@tgross marked this pull request as ready for review on December 13, 2024 18:42
@tgross requested review from a team as code owners on December 13, 2024 18:42
@tgross force-pushed the b-24292-netns-restore branch from 04a444a to 137e7d6 on December 13, 2024 18:53
@aimeeu previously approved these changes on Dec 18, 2024
@pbortnick removed their request for review on January 6, 2025 13:32
@jrasell previously approved these changes on Jan 7, 2025
@jrasell (Member) left a comment

LGTM! I left two very minor and ignorable comments.

I ran the same test outlined in my previous review; when the client instance rebooted and the Nomad process started, the allocation was restarted and the task's API was still available.

Task events:

Recent Events:
Time                  Type        Description
2025-01-07T09:12:56Z  Started     Task started by client
2025-01-07T09:12:55Z  Driver      Downloading image
2025-01-07T09:12:40Z  Restarting  Task restarting in 15.75160817s
2025-01-07T09:12:40Z  Terminated  Exit Code: 0
2025-01-07T09:11:12Z  Started     Task started by client
2025-01-07T09:11:07Z  Driver      Downloading image
2025-01-07T09:11:07Z  Task Setup  Building Task Directory
2025-01-07T09:11:06Z  Received    Task received by client

Networking confirmation:

$ curl 10.0.1.193:29665
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
html { color-scheme: light dark; }
body { width: 35em; margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif; }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

@tgross merged commit 08a6f87 into main on Jan 7, 2025
32 checks passed
@tgross deleted the b-24292-netns-restore branch on January 7, 2025 14:38
@ahjohannessen

@tgross Any ETA on Nomad 1.9.5? Bridge networking seems to be unstable when nodes restart or when parts of jobs are updated. The iptables rules do not seem to be updated, because I see issues with containers not being able to communicate with IPs on the subnet the host is on. When I stop a job and start it again, network communication works again.

Seeing stuff in the logs like:

Jan 07 14:59:31 f04 kernel: veth50336319 (unregistering): left promiscuous mode
Jan 07 14:59:31 f04 audit: ANOM_PROMISCUOUS dev=veth50336319 prom=0 old_prom=256 auid=4294967295 uid=0 gid=0 ses=4294967295
Jan 07 14:59:35 f04 nomad[1415]:  client.alloc_runner.task_runner: Task event: alloc_id=f98147e9-0f1f-79dc-223c-2336ebe30a48 task=connect-proxy-count-api type=Received msg="Task received by client" failed=false
Jan 07 14:59:35 f04 nomad[1415]:     2025-01-07T14:59:35.113Z [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=f98147e9-0f1f-79dc-223c-2336ebe30a48 task=connect-proxy-count-api type=Received msg="Task received by client" failed=false
Jan 07 14:59:35 f04 audit: BPF prog-id=117 op=LOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=118 op=LOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=118 op=UNLOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=119 op=LOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=119 op=UNLOAD
Jan 07 14:59:35 f04 audit: BPF prog-id=120 op=LOAD
Jan 07 14:59:35 f04 audit: ANOM_PROMISCUOUS dev=vethf2889ee2 prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
Jan 07 14:59:35 f04 kernel: vethf2889ee2: entered promiscuous mode

This indicates that something is wrong or out of sync with the CNI network setup.

This is on Fedora CoreOS 41.20241215.3.0

@jrasell (Member) commented Jan 9, 2025

Hi @ahjohannessen; v1.9.5 is expected to be released within the next 1-2 weeks.

@ahjohannessen commented Jan 9, 2025

@jrasell OK, thanks for the heads up. Is there a CI build with this change that one can try out?

@ahjohannessen

@jrasell I think I found it: https://github.com/hashicorp/nomad/actions/runs/12679755285
