v0.10.4 Unable to deploy job due to "network: no networks available" #7232

Closed
guce opened this issue Feb 27, 2020 · 13 comments · Fixed by #7509
@guce

guce commented Feb 27, 2020

Nomad version

Nomad v0.10.4 (f750636)

Operating system and Environment details

ubuntu 18.04 virtualbox

Issue

Unable to deploy jobs after upgrading to v0.10.4.
v0.10.3 does not have this problem.

ID            = ac98899c-ea2a-033d-0e42-aae6a11c27ec
Name          = client1
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 3h11m40s
Host Volumes  = <none>
Driver Status = docker,exec

Node Events
Time                  Subsystem  Message
2020-02-27T03:51:58Z  Cluster    Node registered

Allocated Resources
CPU         Memory       Disk
0/2494 MHz  0 B/1.9 GiB  0 B/14 GiB

Allocation Resource Utilization
CPU         Memory
0/2494 MHz  0 B/1.9 GiB

Host Resource Utilization
CPU         Memory           Disk
0/2494 MHz  358 MiB/1.9 GiB  4.6 GiB/20 GiB

Allocations
No allocations placed
ID            = example
Name          = example
Submit Date   = 2020-02-27T03:53:39Z
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
cache       1       0         0        0       0         0

Placement Failure
Task Group "cache":
  * Resources exhausted on 1 nodes
  * Dimension "network: no networks available" exhausted on 1 nodes

Latest Deployment
ID          = be8b274d
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
cache       1        0       0        0          N/A

Allocations
No allocations placed

Reproduction steps

start server & client
run job
server.hcl
client.hcl

Job file

https://gist.github.com/guce/1e79f61fd895722c86e968f7acc069a5

Nomad Client logs

    2020-02-27T03:51:58.504Z [DEBUG] client.fingerprint_mgr.network: link speed detected: interface=enp0s3 mbits=1000
    2020-02-27T03:51:58.504Z [DEBUG] client.fingerprint_mgr.network: detected interface IP: interface=enp0s3 IP=10.0.0.32

https://gist.github.com/guce/e528b16b6efe4446287cb74af85e6b3c

Nomad Server logs

https://gist.github.com/guce/ceacfaaebd65cc616d8ed249edc16e94

@leptonyu

Try setting network_interface in the client config to specify the network interface.

@guce
Author

guce commented Mar 2, 2020

try network_interface in client to specify network interface.

Thanks a lot.
I added the network_interface configuration and it works.

What puzzles me:

  1. v0.10.3 does not require this config.
  2. v0.10.4 detects the network card and bandwidth without this config, yet it still cannot deploy a job.

@joshuaclausen

I get the same problem with v0.10.4, but it's not there with v0.10.3.

Windows Server 2016
Vultr-hosted VM

notnoop self-assigned this Mar 21, 2020
@notnoop
Contributor

notnoop commented Mar 21, 2020

Thank you for the report. We are digging into this now, but I'm afraid I'm unable to reproduce it. Would you mind providing the nomad node status --json output for the node on both versions, 0.10.3 and 0.10.4?

@joshuaclausen

I got this on Windows VMs in Vultr, in case that's useful. I'll try to get a "nomad node status --json" on a host there that's demonstrating the issue. I'll have to spin up a new machine and deploy to it first though.

@rf-guo

rf-guo commented Mar 25, 2020

I get the same problem with v0.10.4/v0.10.5

[root@nomad-cluster1 grf]# ./nomad job run httpecho.nomad 
==> Monitoring evaluation "ac2dd6af"
    Evaluation triggered by job "httpecho"
    Evaluation within deployment: "802670ac"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "ac2dd6af" finished with status "complete" but failed to place all allocations:
    Task Group "web" (failed to place 1 allocation):
      * Resources exhausted on 3 nodes
      * Dimension "network: no networks available" exhausted on 3 nodes
    Evaluation "9f5a8901" waiting for additional capacity to place remainder

I also added the network_interface configuration, but it does not work:

client {
    enabled = true
    servers = ["10.0.0.16:4647", "10.0.0.23:4647", "10.0.0.24:4647"]
    network_interface = "eth0"
    network_speed = 1000
}

@notnoop
Contributor

notnoop commented Mar 25, 2020

@ultra-baba Thanks. If you can, would you mind providing the nomad node status --json <node_id> output from before and after the upgrade?

@mediarl

mediarl commented Mar 26, 2020

@notnoop Same problem for me: everything works fine with Nomad v0.10.3, but I get the network error with v0.10.4 and v0.10.5.
I use a VM hosted on Hetzner Cloud.

Here is the nomad node status --json <node_id> output for v0.10.3:

{
    "Attributes": {
        "driver.docker.version": "18.09.1",
        "cpu.numcores": "2",
        "os.name": "centos",
        "os.signals": "SIGTRAP,SIGUSR1,SIGHUP,SIGKILL,SIGPROF,SIGSYS,SIGTTIN,SIGXCPU,SIGIO,SIGPIPE,SIGSTOP,SIGWINCH,SIGQUIT,SIGTERM,SIGILL,SIGINT,SIGIOT,SIGURG,SIGUSR2,SIGXFSZ,SIGBUS,SIGCONT,SIGTSTP,SIGFPE,SIGSEGV,SIGTTOU,SIGABRT,SIGALRM,SIGCHLD",
        "driver.docker": "1",
        "nomad.revision": "65af1b9ecff5b55a1dd6e10b8c3224f896d6c9fa",
        "cpu.totalcompute": "4198",
        "unique.network.ip-address": "95.217.132.254",
        "kernel.version": "4.18.0-147.5.1.el8_1.x86_64",
        "unique.storage.volume": "/dev/sda1",
        "driver.docker.os_type": "linux",
        "driver.docker.runtimes": "runc",
        "unique.storage.bytestotal": "80509460480",
        "cpu.arch": "amd64",
        "driver.docker.privileged.enabled": "true",
        "cpu.modelname": "Intel Xeon Processor (Skylake, IBRS)",
        "unique.cgroup.mountpoint": "/sys/fs/cgroup",
        "cpu.frequency": "2099",
        "memory.totalbytes": "8160858112",
        "kernel.name": "linux",
        "unique.hostname": "client1",
        "driver.docker.volumes.enabled": "true",
        "os.version": "8.1.1911",
        "driver.exec": "1",
        "nomad.version": "0.10.3",
        "nomad.advertise.address": "172.17.0.1:4646",
        "unique.storage.bytesfree": "75511934976"
    },
    "CreateIndex": 17,
    "Datacenter": "dc1",
    "Drain": false,
    "DrainStrategy": null,
    "Drivers": {
        "rkt": {
            "Attributes": null,
            "Detected": false,
            "HealthDescription": "Failed to execute rkt version: exec: \"rkt\": executable file not found in $PATH",
            "Healthy": false,
            "UpdateTime": "2020-03-26T12:16:06.188087266+01:00"
        },
        "exec": {
            "Attributes": {
                "driver.exec": "true"
            },
            "Detected": true,
            "HealthDescription": "Healthy",
            "Healthy": true,
            "UpdateTime": "2020-03-26T12:16:06.188366663+01:00"
        },
        "java": {
            "Attributes": null,
            "Detected": false,
            "HealthDescription": "",
            "Healthy": false,
            "UpdateTime": "2020-03-26T12:16:06.188487223+01:00"
        },
        "qemu": {
            "Attributes": null,
            "Detected": false,
            "HealthDescription": "",
            "Healthy": false,
            "UpdateTime": "2020-03-26T12:16:06.188529483+01:00"
        },
        "raw_exec": {
            "Attributes": null,
            "Detected": false,
            "HealthDescription": "disabled",
            "Healthy": false,
            "UpdateTime": "2020-03-26T12:16:06.189299655+01:00"
        },
        "docker": {
            "Attributes": {
                "driver.docker.runtimes": "runc",
                "driver.docker.os_type": "linux",
                "driver.docker": "true",
                "driver.docker.version": "18.09.1",
                "driver.docker.privileged.enabled": "true",
                "driver.docker.volumes.enabled": "true"
            },
            "Detected": true,
            "HealthDescription": "Healthy",
            "Healthy": true,
            "UpdateTime": "2020-03-26T12:16:06.22697967+01:00"
        }
    },
    "Events": [
        {
            "CreateIndex": 0,
            "Details": null,
            "Message": "Node registered",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T11:57:03+01:00"
        },
        {
            "CreateIndex": 22,
            "Details": null,
            "Message": "Node heartbeat missed",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T11:58:38.587549994+01:00"
        },
        {
            "CreateIndex": 23,
            "Details": null,
            "Message": "Node re-registered",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T11:59:04+01:00"
        },
        {
            "CreateIndex": 41,
            "Details": null,
            "Message": "Node heartbeat missed",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:11:06.443818944+01:00"
        },
        {
            "CreateIndex": 44,
            "Details": null,
            "Message": "Node re-registered",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:12:35+01:00"
        },
        {
            "CreateIndex": 46,
            "Details": null,
            "Message": "Node heartbeat missed",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:13:11.693922489+01:00"
        },
        {
            "CreateIndex": 47,
            "Details": null,
            "Message": "Node re-registered",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:13:22+01:00"
        }
    ],
    "HTTPAddr": "172.17.0.1:4646",
    "HostVolumes": null,
    "ID": "71586c26-fdb9-6a35-38c0-4269092888f3",
    "Links": null,
    "Meta": {
        "connect.log_level": "info",
        "connect.sidecar_image": "envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09"
    },
    "ModifyIndex": 54,
    "Name": "client1",
    "NodeClass": "",
    "NodeResources": {
        "Cpu": {
            "CpuShares": 4198
        },
        "Devices": null,
        "Disk": {
            "DiskMB": 72013
        },
        "Memory": {
            "MemoryMB": 7782
        },
        "Networks": [
            {
                "CIDR": "95.217.132.254/32",
                "Device": "eth0",
                "DynamicPorts": null,
                "IP": "95.217.132.254",
                "MBits": 1000,
                "Mode": "",
                "ReservedPorts": null
            },
            {
                "CIDR": "2a01:4f9:c010:6cc7::1/128",
                "Device": "eth0",
                "DynamicPorts": null,
                "IP": "2a01:4f9:c010:6cc7::1",
                "MBits": 1000,
                "Mode": "",
                "ReservedPorts": null
            }
        ]
    },
    "Reserved": {
        "CPU": 0,
        "Devices": null,
        "DiskMB": 0,
        "IOPS": 0,
        "MemoryMB": 0,
        "Networks": null
    },
    "ReservedResources": {
        "Cpu": {
            "CpuShares": 0
        },
        "Disk": {
            "DiskMB": 0
        },
        "Memory": {
            "MemoryMB": 0
        },
        "Networks": {
            "ReservedHostPorts": ""
        }
    },
    "Resources": {
        "CPU": 4198,
        "Devices": null,
        "DiskMB": 72013,
        "IOPS": 0,
        "MemoryMB": 7782,
        "Networks": [
            {
                "CIDR": "95.217.132.254/32",
                "Device": "eth0",
                "DynamicPorts": null,
                "IP": "95.217.132.254",
                "MBits": 1000,
                "Mode": "",
                "ReservedPorts": null
            },
            {
                "CIDR": "2a01:4f9:c010:6cc7::1/128",
                "Device": "eth0",
                "DynamicPorts": null,
                "IP": "2a01:4f9:c010:6cc7::1",
                "MBits": 1000,
                "Mode": "",
                "ReservedPorts": null
            }
        ]
    },
    "SchedulingEligibility": "eligible",
    "Status": "ready",
    "StatusDescription": "",
    "StatusUpdatedAt": 1585221372,
    "TLSEnabled": false
}

Here is the nomad node status --json <node_id> output for v0.10.5:

{
    "Attributes": {
        "cpu.frequency": "2099",
        "driver.docker.version": "18.09.1",
        "kernel.version": "4.18.0-147.5.1.el8_1.x86_64",
        "os.signals": "SIGFPE,SIGINT,SIGIO,SIGIOT,SIGURG,SIGWINCH,SIGKILL,SIGPROF,SIGTTIN,SIGUSR1,SIGXCPU,SIGPIPE,SIGSTOP,SIGTERM,SIGABRT,SIGALRM,SIGBUS,SIGSEGV,SIGTSTP,SIGCHLD,SIGHUP,SIGXFSZ,SIGILL,SIGQUIT,SIGSYS,SIGTRAP,SIGUSR2,SIGCONT,SIGTTOU",
        "driver.docker.volumes.enabled": "true",
        "unique.cgroup.mountpoint": "/sys/fs/cgroup",
        "memory.totalbytes": "8160858112",
        "nomad.revision": "4eb2ca3f8d8786c897bb47878f1c12577011ddd3",
        "nomad.advertise.address": "172.17.0.1:4646",
        "driver.docker": "1",
        "kernel.name": "linux",
        "unique.platform.aws.instance-id": "5120622",
        "nomad.version": "0.10.5",
        "unique.hostname": "client1",
        "unique.platform.aws.public-ipv4": "95.217.132.254",
        "unique.platform.aws.hostname": "client1",
        "driver.docker.privileged.enabled": "true",
        "os.version": "8.1.1911",
        "driver.docker.runtimes": "runc",
        "unique.network.ip-address": "95.217.132.254",
        "unique.storage.bytesfree": "75469914112",
        "os.name": "centos",
        "cpu.arch": "amd64",
        "cpu.numcores": "2",
        "driver.docker.os_type": "linux",
        "cpu.modelname": "Intel Xeon Processor (Skylake, IBRS)",
        "unique.storage.volume": "/dev/sda1",
        "unique.storage.bytestotal": "80509460480",
        "cpu.totalcompute": "4198",
        "driver.exec": "1"
    },
    "CreateIndex": 17,
    "Datacenter": "dc1",
    "Drain": false,
    "DrainStrategy": null,
    "Drivers": {
        "qemu": {
            "Attributes": null,
            "Detected": false,
            "HealthDescription": "",
            "Healthy": false,
            "UpdateTime": "2020-03-26T12:17:57.21459538+01:00"
        },
        "java": {
            "Attributes": null,
            "Detected": false,
            "HealthDescription": "",
            "Healthy": false,
            "UpdateTime": "2020-03-26T12:17:57.214848632+01:00"
        },
        "exec": {
            "Attributes": {
                "driver.exec": "true"
            },
            "Detected": true,
            "HealthDescription": "Healthy",
            "Healthy": true,
            "UpdateTime": "2020-03-26T12:17:57.215006072+01:00"
        },
        "docker": {
            "Attributes": {
                "driver.docker.privileged.enabled": "true",
                "driver.docker.volumes.enabled": "true",
                "driver.docker.runtimes": "runc",
                "driver.docker.os_type": "linux",
                "driver.docker": "true",
                "driver.docker.version": "18.09.1"
            },
            "Detected": true,
            "HealthDescription": "Healthy",
            "Healthy": true,
            "UpdateTime": "2020-03-26T12:17:57.244374627+01:00"
        },
        "rkt": {
            "Attributes": null,
            "Detected": false,
            "HealthDescription": "Failed to execute rkt version: exec: \"rkt\": executable file not found in $PATH",
            "Healthy": false,
            "UpdateTime": "2020-03-26T12:17:57.214154456+01:00"
        },
        "raw_exec": {
            "Attributes": null,
            "Detected": false,
            "HealthDescription": "disabled",
            "Healthy": false,
            "UpdateTime": "2020-03-26T12:17:57.214486592+01:00"
        }
    },
    "Events": [
        {
            "CreateIndex": 0,
            "Details": null,
            "Message": "Node registered",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T11:57:03+01:00"
        },
        {
            "CreateIndex": 22,
            "Details": null,
            "Message": "Node heartbeat missed",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T11:58:38.587549994+01:00"
        },
        {
            "CreateIndex": 23,
            "Details": null,
            "Message": "Node re-registered",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T11:59:04+01:00"
        },
        {
            "CreateIndex": 41,
            "Details": null,
            "Message": "Node heartbeat missed",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:11:06.443818944+01:00"
        },
        {
            "CreateIndex": 44,
            "Details": null,
            "Message": "Node re-registered",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:12:35+01:00"
        },
        {
            "CreateIndex": 46,
            "Details": null,
            "Message": "Node heartbeat missed",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:13:11.693922489+01:00"
        },
        {
            "CreateIndex": 47,
            "Details": null,
            "Message": "Node re-registered",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:13:22+01:00"
        },
        {
            "CreateIndex": 57,
            "Details": null,
            "Message": "Node heartbeat missed",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:17:56.610701455+01:00"
        },
        {
            "CreateIndex": 58,
            "Details": null,
            "Message": "Node re-registered",
            "Subsystem": "Cluster",
            "Timestamp": "2020-03-26T12:17:57+01:00"
        }
    ],
    "HTTPAddr": "172.17.0.1:4646",
    "HostVolumes": null,
    "ID": "71586c26-fdb9-6a35-38c0-4269092888f3",
    "Links": {
        "aws.ec2": ".5120622"
    },
    "Meta": {
        "connect.sidecar_image": "envoyproxy/envoy:v1.11.2@sha256:a7769160c9c1a55bb8d07a3b71ce5d64f72b1f665f10d81aa1581bc3cf850d09",
        "connect.log_level": "info"
    },
    "ModifyIndex": 59,
    "Name": "client1",
    "NodeClass": "",
    "NodeResources": {
        "Cpu": {
            "CpuShares": 4198
        },
        "Devices": null,
        "Disk": {
            "DiskMB": 71973
        },
        "Memory": {
            "MemoryMB": 7782
        },
        "Networks": [
            {
                "CIDR": "",
                "Device": "eth0",
                "DynamicPorts": null,
                "IP": "",
                "MBits": 1000,
                "Mode": "",
                "ReservedPorts": null
            }
        ]
    },
    "Reserved": {
        "CPU": 0,
        "Devices": null,
        "DiskMB": 0,
        "IOPS": 0,
        "MemoryMB": 0,
        "Networks": null
    },
    "ReservedResources": {
        "Cpu": {
            "CpuShares": 0
        },
        "Disk": {
            "DiskMB": 0
        },
        "Memory": {
            "MemoryMB": 0
        },
        "Networks": {
            "ReservedHostPorts": ""
        }
    },
    "Resources": {
        "CPU": 4198,
        "Devices": null,
        "DiskMB": 71973,
        "IOPS": 0,
        "MemoryMB": 7782,
        "Networks": [
            {
                "CIDR": "95.217.132.254/32",
                "Device": "eth0",
                "DynamicPorts": null,
                "IP": "95.217.132.254",
                "MBits": 1000,
                "Mode": "",
                "ReservedPorts": null
            },
            {
                "CIDR": "2a01:4f9:c010:6cc7::1/128",
                "Device": "eth0",
                "DynamicPorts": null,
                "IP": "2a01:4f9:c010:6cc7::1",
                "MBits": 1000,
                "Mode": "",
                "ReservedPorts": null
            }
        ]
    },
    "SchedulingEligibility": "eligible",
    "Status": "ready",
    "StatusDescription": "",
    "StatusUpdatedAt": 1585221484,
    "TLSEnabled": false
}

@notnoop
Contributor

notnoop commented Mar 26, 2020

@mediarl Thanks for the output, that was very helpful! The immediate cause seems to be a bad entry in the node status; we'll investigate how it got into that state.

Note how .Resources.Networks differs from .NodeResources.Networks:

$ jq '.Resources.Networks' /tmp/nomad-node-status-0.10.5
[
  {
    "CIDR": "95.217.132.254/32",
    "Device": "eth0",
    "DynamicPorts": null,
    "IP": "95.217.132.254",
    "MBits": 1000,
    "Mode": "",
    "ReservedPorts": null
  },
  {
    "CIDR": "2a01:4f9:c010:6cc7::1/128",
    "Device": "eth0",
    "DynamicPorts": null,
    "IP": "2a01:4f9:c010:6cc7::1",
    "MBits": 1000,
    "Mode": "",
    "ReservedPorts": null
  }
]
$ jq '.NodeResources.Networks' /tmp/nomad-node-status-0.10.5
[
  {
    "CIDR": "",
    "Device": "eth0",
    "DynamicPorts": null,
    "IP": "",
    "MBits": 1000,
    "Mode": "",
    "ReservedPorts": null
  }
]
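
The same comparison can be scripted when checking several nodes. This is a minimal Python sketch, assuming the node status JSON has the shape shown in this thread; the helper names empty_networks and check_node are hypothetical and not part of Nomad. It flags any network entry whose IP or CIDR is empty, which is the pattern visible in .NodeResources.Networks above.

```python
import json


def empty_networks(node, section):
    """Return the network entries under node[section] that lack an IP or CIDR."""
    networks = (node.get(section) or {}).get("Networks") or []
    return [n for n in networks if not n.get("IP") or not n.get("CIDR")]


def check_node(node):
    """Map each resource section to the devices whose entries have no address."""
    report = {}
    for section in ("Resources", "NodeResources"):
        bad = empty_networks(node, section)
        if bad:
            report[section] = [n.get("Device", "") for n in bad]
    return report


# Trimmed sample mirroring the v0.10.5 node status posted in this thread.
sample = json.loads("""
{
  "Resources": {"Networks": [
    {"CIDR": "95.217.132.254/32", "Device": "eth0", "IP": "95.217.132.254"},
    {"CIDR": "2a01:4f9:c010:6cc7::1/128", "Device": "eth0", "IP": "2a01:4f9:c010:6cc7::1"}
  ]},
  "NodeResources": {"Networks": [
    {"CIDR": "", "Device": "eth0", "IP": ""}
  ]}
}
""")

print(check_node(sample))  # {'NodeResources': ['eth0']}
```

To run it against a real node, replace the inline sample with json.load() on a saved nomad node status --json <node_id> dump.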

@mediarl

mediarl commented Mar 26, 2020

@notnoop You're welcome!
If you want to reproduce the error reliably, just provision nodes on https://hetzner.com/cloud. I tried several operating systems and they all failed.
The only workaround for now is to use v0.10.3.

@notnoop
Contributor

notnoop commented Mar 26, 2020

@mediarl Thank you, that is very good information. I found the cause! A potential workaround is to set the AWS_ENV_URL environment variable to a bad value (e.g. AWS_ENV_URL=http://0.0.0.0/).

#6779 caused the regression. It looks like Hetzner attempts to mimic the AWS API, but its implementation is incomplete, so we ended up with empty values for the local IP address and CIDR info!

# curl http://169.254.169.254/latest/meta-data/instance-id; echo
5121933
# curl http://169.254.169.254/latest/meta-data/ami-id; echo
ami-id not found
# curl http://169.254.169.254/latest/meta-data/public-ipv4; echo
95.217.165.252
# curl http://169.254.169.254/latest/meta-data/local-ipv4; echo

#

For comparison, on EC2, the values are:

$ curl http://169.254.169.254/latest/meta-data/instance-id; echo
i-00e6d1c6998cb30ab
$ curl http://169.254.169.254/latest/meta-data/ami-id; echo
ami-0b8ec8fe3e479e979
$ curl http://169.254.169.254/latest/meta-data/public-ipv4; echo
18.206.88.121
$ curl http://169.254.169.254/latest/meta-data/local-ipv4; echo
172.31.83.172

The regression came from us accidentally switching the EC2 check from ami-id to instance-id.
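
To make the difference between the two probes concrete, here is a small Python sketch. It is not Nomad's fingerprinting code. The dictionaries reuse the curl results above, and the function names detected_as_ec2 and network_from_metadata are hypothetical. Two assumptions are made for illustration: the "ami-id not found" reply is modelled as a failed lookup (None), and the CIDR is derived as a /32 from local-ipv4.

```python
# Metadata values as returned by the curl commands in this thread.
HETZNER = {
    "instance-id": "5121933",
    "ami-id": None,  # assumption: "ami-id not found" is treated as a failed lookup
    "public-ipv4": "95.217.165.252",
    "local-ipv4": "",  # the endpoint answered with an empty body
}
EC2 = {
    "instance-id": "i-00e6d1c6998cb30ab",
    "ami-id": "ami-0b8ec8fe3e479e979",
    "public-ipv4": "18.206.88.121",
    "local-ipv4": "172.31.83.172",
}


def detected_as_ec2(metadata, probe_key):
    """Treat the host as EC2 when the probed key yields a non-empty value."""
    return bool(metadata.get(probe_key))


def network_from_metadata(metadata):
    """Build a network entry from local-ipv4 (illustrative /32 CIDR)."""
    ip = metadata.get("local-ipv4") or ""
    return {"IP": ip, "CIDR": f"{ip}/32" if ip else ""}


for name, metadata in (("hetzner", HETZNER), ("ec2", EC2)):
    for probe in ("instance-id", "ami-id"):
        print(name, probe, detected_as_ec2(metadata, probe))

# With the instance-id probe the Hetzner host passes the check,
# and the entry built from its metadata has no address.
print(network_from_metadata(HETZNER))
```

Under these assumptions only the instance-id probe classifies the Hetzner host as EC2, and the resulting entry has an empty IP and CIDR, matching the bad .NodeResources.Networks entry seen earlier.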

@joshuaclausen

joshuaclausen commented Mar 26, 2020 via email

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Nov 11, 2022