
Problem with nomad job deployment (raw_exec mode, v1.0.1) #9700

Closed
ivanprostran opened this issue Dec 19, 2020 · 18 comments


@ivanprostran

ivanprostran commented Dec 19, 2020

Nomad version

nomad-sdk version 0.11.3.0
Server(agent) version: Nomad v1.0.1 (c9c68aa)

[root@blade1 ~]# nomad node-status -verbose
ID                                    DC   Name                Class   Address         Version  Drain  Eligibility  Status
5838e8b0-ebd3-5c47-a949-df3d601e0da1  dc1  blade1.lab.bulb.hr  <none>  192.168.112.31  1.0.1    false  eligible     ready
[root@blade1 ~]# nomad node-status -verbose 5838e8b0-ebd3-5c47-a949-df3d601e0da1
ID              = 5838e8b0-ebd3-5c47-a949-df3d601e0da1
Name            = blade1.lab.bulb.hr
Class           = <none>
DC              = dc1
Drain           = false
Eligibility     = eligible
Status          = ready
CSI Controllers = <none>
CSI Drivers     = <none>
Uptime          = 1516h1m31s

Drivers
Driver    Detected  Healthy  Message                             Time
docker    false     false    Failed to connect to docker daemon  2020-12-18T14:37:09+01:00
exec      false     false    Driver must run as root             2020-12-18T14:37:09+01:00
java      false     false    Driver must run as root             2020-12-18T14:37:09+01:00
qemu      false     false    <none>                              2020-12-18T14:37:09+01:00
raw_exec  true      true     Healthy                             2020-12-18T14:37:09+01:00

Node Events
Time                       Subsystem  Message          Details
2020-12-18T14:37:09+01:00  Cluster    Node registered  <none>

Allocated Resources
CPU          Memory      Disk
0/18000 MHz  0 B/53 GiB  0 B/70 GiB

Allocation Resource Utilization
CPU          Memory
0/18000 MHz  0 B/53 GiB

Host Resource Utilization
CPU            Memory         Disk
499/20000 MHz  33 GiB/63 GiB  (/dev/mapper/vg00-root)

Allocations
No allocations placed

Attributes
consul.datacenter         = dacs
consul.revision           = 1e03567d3
consul.server             = true
consul.version            = 1.8.5
cpu.arch                  = amd64
driver.raw_exec           = 1
kernel.name               = linux
kernel.version            = 3.10.0-693.21.1.el7.x86_64
memory.totalbytes         = 67374776320
nomad.advertise.address   = 192.168.112.31:5656
nomad.revision            = c9c68aa55a7275f22d2338f2df53e67ebfcb9238
nomad.version             = 1.0.1
os.name                   = centos
os.signals                = SIGTTIN,SIGUSR2,SIGXCPU,SIGBUS,SIGILL,SIGQUIT,SIGCHLD,SIGIOT,SIGKILL,SIGINT,SIGSTOP,SIGSYS,SIGTTOU,SIGFPE,SIGSEGV,SIGTSTP,SIGURG,SIGWINCH,SIGCONT,SIGIO,SIGTRAP,SIGXFSZ,SIGHUP,SIGPIPE,SIGTERM,SIGPROF,SIGABRT,SIGALRM,SIGUSR1
os.version                = 7.4.1708
unique.cgroup.mountpoint  = /sys/fs/cgroup/systemd
unique.consul.name        = grabber1
unique.hostname           = blade1.lab.bulb.hr
unique.network.ip-address = 192.168.112.31
unique.storage.bytesfree  = 74604830720
unique.storage.bytestotal = 126698909696
unique.storage.volume     = /dev/mapper/vg00-root

Meta
connect.gateway_image     = envoyproxy/envoy:v${NOMAD_envoy_version}
connect.log_level         = info
connect.proxy_concurrency = 1
connect.sidecar_image     = envoyproxy/envoy:v${NOMAD_envoy_version}

Operating system and Environment details

/etc/redhat-release
CentOS Linux release 7.4.1708 (Core)
Linux blade1.lab.bulb.hr 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Issue

A recent update from Nomad v0.9.6 to Nomad v1.0.1 breaks job deployment. Unfortunately I couldn't get any usable info from the Nomad agent about the "pending or dead" status. I also checked the trace monitor in the web UI, but without success.

Please could you give some advice on how to get the reject/pending reason from the agent?

I use the "raw_exec" driver (non-privileged user, "driver.raw_exec.enable" = "1"). For deployment I use nomad-sdk (version 0.11.3.0).

Reproduction steps

Job deployment via the nomad-sdk API fails 100% of the time (pending or dead status, checked from the web UI or command line).

Job file

You can find the job definition (from Nomad's point of view) here:

https://pastebin.com/ZXiaM9RW

Job status:

[root@blade1 ~]# nomad status
ID                                     Type     Priority  Status   Submit Date
lightningCollector-lightningCollector  service  50        pending  2020-12-18T15:06:09+01:00


[root@blade1 ~]# nomad status lightningCollector-lightningCollector
ID            = lightningCollector-lightningCollector
Name          = lightningCollector-lightningCollector
Submit Date   = 2020-12-18T15:06:09+01:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group                               Queued  Starting  Running  Failed  Complete  Lost
lightningCollector-lightningCollector-0  0       0         0        0       0         0

Allocations
No allocations placed

Nomad Client logs (if appropriate)

Nomad Server logs (if appropriate)

Unfortunately I couldn't get any usable logs.

Thank you for your effort and time.

Regards,
Ivan

@tgross
Member

tgross commented Jan 4, 2021

Please could you give some advice on how to get the reject/pending reason from the agent?

This information should be in the Nomad server logs. Can you maybe explain why you weren't able to get any usable logs? You should be able to fetch them via journalctl -u nomad (filtering for the appropriate timeframe).
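For example, to pull just the relevant window (a minimal sketch; substitute your actual unit name and times):

journalctl -u nomad --since "2020-12-18 14:30" --until "2020-12-18 15:30"
journalctl -u nomad -f    # or follow the log live while re-submitting the job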

@tgross
Member

tgross commented Feb 12, 2021

I'm going to close this issue as it's been a while since we heard from you. But if you do get those logs, please feel free to re-open!

@tgross tgross closed this as completed Feb 12, 2021
@ivanprostran
Author

Hi Tim,

Sorry for the late response.

The issue is still present.
I've updated to the latest version: Nomad v1.0.4 (9294f35)
I will try to describe what's going on here.
I am using the official Java SDK for job deployment (version 0.12.0-SNAPSHOT).
Initially the job is in the "pending" state, and all I can do is stop the job and start it again from the UI.

Perhaps I am missing some deployment flag in the SDK API, or something like that?

This information should be in the Nomad server logs. Can you maybe explain why you weren't able to get any usable logs? You should be able to fetch them via journalctl -u nomad (filtering for the appropriate timeframe).

I checked the log. Nothing (empty)

How do I debug this situation?

Any help is appreciated!

Thank you,
Ivan Prostran

Added UI screenshot and job definition:

[UI screenshot]

Job definition (from Nomad's point of view) is as follows:

{
  "Stop": false,
  "Region": "grabber",
  "Namespace": "default",
  "ID": "PCMMManager-PCMMManager",
  "ParentID": "f17fcef0-af35-44ea-aa71-2e40d8e78cce",
  "Name": "PCMMManager-PCMMManager",
  "Type": "service",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1",
    "dc2",
    "dc3"
  ],
  "Constraints": [
    {
      "LTarget": "",
      "RTarget": "true",
      "Operand": "distinct_hosts"
    }
  ],
  "Affinities": null,
  "Spreads": [
    {
      "Attribute": "${node.datacenter}",
      "Weight": 100,
      "SpreadTarget": [
        {
          "Value": "dc1",
          "Percent": 33
        },
        {
          "Value": "dc2",
          "Percent": 33
        },
        {
          "Value": "dc3",
          "Percent": 33
        }
      ]
    }
  ],
  "TaskGroups": [
    {
      "Name": "PCMMManager-PCMMManager-0",
      "Count": 1,
      "Update": {
        "Stagger": 30000000000,
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000,
        "ProgressDeadline": 600000000000,
        "AutoRevert": false,
        "AutoPromote": false,
        "Canary": 0
      },
      "Migrate": {
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000
      },
      "Constraints": [
        {
          "LTarget": "",
          "RTarget": "true",
          "Operand": "distinct_hosts"
        }
      ],
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 960,
        "Interval": 14400000000000,
        "Delay": 15000000000,
        "Mode": "delay"
      },
      "Tasks": [
        {
          "Name": "PCMMManager-PCMMManager-0",
          "Driver": "raw_exec",
          "User": "",
          "Config": {
            "command": "/usr/bin/java",
            "args": [
              "-XX:+UseG1GC",
              "-XX:MaxGCPauseMillis=1000",
              "-XX:G1HeapRegionSize=1",
              "-Xms64m",
              "-Xmx768m",
              "-jar",
              "local/pcmm-manager-0.1.0.jar"
            ]
          },
          "Env": {
            "INSTANCE_INDEX": "0",
            "SPRING_APPLICATION_INDEX": "0",
            "SPRING_APPLICATION_JSON": "{\"spring.metrics.export.triggers.application.includes\":\"integration**\",\"spring.application.memory\":\"768\",\"spring.cloud.stream.metrics.key\":\"PCMMManager.PCMMManager.${spring.cloud.application.guid}\",\"spring.cloud.dataflow.stream.app.label\":\"PCMMManager\",\"spring.cloud.stream.bindings.input.group\":\"PCMMManager\",\"spring.application.count\":\"1\",\"spring.cloud.stream.metrics.properties\":\"spring.application.name,spring.application.index,spring.cloud.application.*,spring.cloud.dataflow.*\",\"spring.application.name\":\"PCMMManager\",\"server.port\":\"0\",\"spring.cloud.stream.bindings.output.destination\":\"OUTPUT\",\"spring.cloud.dataflow.stream.name\":\"PCMMManager\",\"spring.cloud.dataflow.stream.app.type\":\"processor\",\"spring.cloud.stream.bindings.input.destination\":\"INPUT\"}",
            "SPRING_CLOUD_APPLICATION_GROUP": "PCMMManager"
          },
          "Services": null,
          "Vault": null,
          "Templates": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 100,
            "MemoryMB": 768,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 960,
            "Interval": 14400000000000,
            "Delay": 15000000000,
            "Mode": "delay"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": {
            "uniqueId": "f17fcef0-af35-44ea-aa71-2e40d8e78cce"
          },
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 2,
            "MaxFileSizeMB": 10
          },
          "Artifacts": [
            {
              "GetterSource": "http://maven01:9393/resources/maven/hr.bulb.ai/pcmm-manager-0.1.0.jar",
              "GetterOptions": {
                "checksum": "md5:66d93219b0167d543bb3e421fbd2ee35"
              },
              "GetterHeaders": null,
              "GetterMode": "any",
              "RelativeDest": "local"
            }
          ],
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": {
        "uniqueId": "f17fcef0-af35-44ea-aa71-2e40d8e78cce"
      },
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": null,
      "Services": null,
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 30000000000,
    "MaxParallel": 1,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "Payload": null,
  "Meta": {
    "uniqueId": "f17fcef0-af35-44ea-aa71-2e40d8e78cce"
  },
  "ConsulToken": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "pending",
  "StatusDescription": "",
  "Stable": false,
  "Version": 2,
  "SubmitTime": 1616078357502375400,
  "CreateIndex": 9945,
  "ModifyIndex": 10022,
  "JobModifyIndex": 10022
}

@tgross
Member

tgross commented Mar 18, 2021

I checked the log. Nothing (empty)

Are you redirecting the logs somewhere then? Can you share your systemd unit file and server/client config?

@ivanprostran
Author

ivanprostran commented Mar 19, 2021

Hi Tim,

I didn't mean "there is no log at all", sorry for that (my mistake).
I mean there is no usable log entry related to this particular "buggy" deployment.
That is reproducible 100/100.

The configuration (server, client)
Consul configuration is excluded (3 nodes in cluster)

consul members
Node      Address            Status  Type    Build  Protocol  DC    Segment
grabber1  10.10.20.13:8301  alive   server  1.9.4  2         dacs  <all>
grabber2  10.10.20.14:8301  alive   server  1.9.4  2         dacs  <all>
grabber3  10.10.20.15:8301  alive   server  1.9.4  2         dacs  <all>


nomad node status
ID        DC   Name      Class   Drain  Eligibility  Status
1ec59497  dc1  grabber1  <none>  false  eligible     ready
30464a41  dc3  grabber2  <none>  false  eligible     ready
8f3ada18  dc2  grabber3  <none>  false  eligible     ready

Client (agent -client)

file: /etc/nomad-client/nomad-client.hcl

data_dir = "/opt/nomad-client/data"

client {
  enabled           = true
  cpu_total_compute = 20000

  reserved {
    cpu    = 2000
    memory = 5000
    disk   = 1
  }

  network_speed = 10000
  servers       = ["127.0.0.1:4647"]

  options {
    "driver.raw_exec.enable" = "1"
  }
}

ports {
    http = 5656
}

systemd: /usr/lib/systemd/system/nomad-client.service

[Unit]
Description=Nomad Client
Documentation=https://www.nomadproject.io/
Wants=network-online.target
After=network-online.target

[Service]
PermissionsStartOnly=true
Restart=always
WorkingDirectory=/opt/nomad-client
User=nomad
Group=nomad
EnvironmentFile=/etc/sysconfig/nomad-client
StandardOutput=syslog
StandardError=syslog
SyslogIdentifier=nomad-client

ExecStart=/opt/nomad-client/bin/nomad $CMD_OPTS
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=SIGINT

[Install]
WantedBy=multi-user.target

EnvironmentFile (/etc/sysconfig/nomad-client): each node uses a different "dc".

node1:

CMD_OPTS="agent -client -config /etc/nomad-client -region grabber -dc=dc1"

node2:

CMD_OPTS="agent -client -config /etc/nomad-client -region grabber -dc=dc2"

node3:

CMD_OPTS="agent -client -config /etc/nomad-client -region grabber -dc=dc3"

Nomad Server (agent -server)

file: /etc/nomad-server/nomad-server.hcl

data_dir = "/opt/nomad-server/data"

server {
  enabled          = true
  bootstrap_expect = 2

}

Clean start (data/alloc deleted)

nomad-client log (same output on all 3 cluster nodes):

Mar 19 12:13:47 grabber1 nomad-client[99809]: ==> Loaded configuration from /etc/nomad-client/nomad-client.hcl
Mar 19 12:13:47 grabber1 nomad-client[99809]: ==> Starting Nomad agent...
Mar 19 12:13:55 grabber1 nomad-client[99809]: ==> Nomad agent configuration:
Mar 19 12:13:55 grabber1 nomad-client[99809]:       Advertise Addrs: HTTP: 10.10.20.13:5656
Mar 19 12:13:55 grabber1 nomad-client[99809]:            Bind Addrs: HTTP: 0.0.0.0:5656
Mar 19 12:13:55 grabber1 nomad-client[99809]:                Client: true
Mar 19 12:13:55 grabber1 nomad-client[99809]:             Log Level: INFO
Mar 19 12:13:55 grabber1 nomad-client[99809]:                Region: grabber (DC: dc1)
Mar 19 12:13:55 grabber1 nomad-client[99809]:                Server: false
Mar 19 12:13:55 grabber1 nomad-client[99809]:               Version: 1.0.4
Mar 19 12:13:55 grabber1 nomad-client[99809]: ==> Nomad agent started! Log data will stream in below:
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.372+0100 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_dir=/opt/nomad-client/data/plugins
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.373+0100 [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.373+0100 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.373+0100 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.373+0100 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.373+0100 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.373+0100 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.373+0100 [INFO]  client: using state directory: state_dir=/opt/nomad-client/data/client
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.383+0100 [INFO]  client: using alloc directory: alloc_dir=/opt/nomad-client/data/alloc
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.399+0100 [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.401+0100 [INFO]  client.fingerprint_mgr.consul: consul agent is available
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:47.407+0100 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:55.426+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:55.426+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:55.426+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=device
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:55.427+0100 [INFO]  client: started client: node_id=183f311e-5494-d894-f495-9dabafded5be
Mar 19 12:13:55 grabber1 nomad-client[99809]:    2021-03-19T12:13:55.456+0100 [INFO]  client: node registration complete
Mar 19 12:14:01 grabber1 nomad-client[99809]:    2021-03-19T12:14:01.252+0100 [INFO]  client: node registration complete
nomad status
No running jobs

Microservice deployment:

admin:4646/ui/jobs

[UI screenshot]


nomad status
ID                       Type     Priority  Status   Submit Date
PCMMManager-PCMMManager  service  50        pending  2021-03-19T12:47:35+01:00


nomad status PCMMManager-PCMMManager
ID            = PCMMManager-PCMMManager
Name          = PCMMManager-PCMMManager
Submit Date   = 2021-03-19T12:47:35+01:00
Type          = service
Priority      = 50
Datacenters   = dc1,dc2,dc3
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
PCMMManager-PCMMManager-0  0       0         0        0       0         0

Allocations
No allocations placed

nomad-client log (after deployment)

(unchanged from the startup log above: the last entry is still the 12:14:01 node registration, and nothing was logged for the 12:47 deployment)

Monitor (on node1)

http://localhost:4646/ui/clients/183f311e-5494-d894-f495-9dabafded5be/monitor?level=trace

Please check it here: https://pastebin.com/xRCdEphS
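The same trace stream can also be pulled with the CLI instead of the browser (a sketch, reusing the node ID from the URL above):

nomad monitor -log-level=trace -node-id=183f311e-5494-d894-f495-9dabafded5be
nomad monitor -log-level=trace -server-id=leader    # same, but streaming the leading server's logs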

PCMMManager job definition from the admin UI

Please check it here: https://pastebin.com/Kq0zXAwi

From Wireshark (request/response)

REQUEST

PUT /v1/jobs HTTP/1.1
X-Nomad-Token: 0f4d6e46-6154-d674-d8a1-5eca06cc61f6
Content-Length: 3345
Content-Type: application/json; charset=UTF-8
Host: localhost:4646
Connection: Keep-Alive
User-Agent: Apache-HttpClient/4.5.6 (Java/1.8.0_282)
Accept-Encoding: gzip,deflate

Check the body: https://pastebin.com/Fx9L9eyD

RESPONSE

HTTP/1.1 200 OK
Content-Encoding: gzip
Content-Type: application/json
Vary: Accept-Encoding
X-Nomad-Index: 48
Date: Fri, 19 Mar 2021 11:47:35 GMT
Content-Length: 145
{
	"EvalID": "ef9be1b4-c2fd-3e1e-73a1-df8f5aeb6f99",
	"EvalCreateIndex": 48,
	"JobModifyIndex": 48,
	"Warnings": "",
	"Index": 48,
	"LastContact": 0,
	"KnownLeader": false
}

Thank you for your time and effort.

Regards,
Ivan Prostran

@tgross
Member

tgross commented Mar 19, 2021

I didn't mean "there is no log at all", sorry for that (my mistake).
I mean there is no usable log entry related to this particular "buggy" deployment.

The logs I was looking for were the server logs. You've posted one of those in the pastebin, the relevant section of which is here:

server log
2021-03-19T12:47:34.125+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:50084
2021-03-19T12:47:35.858+0100 [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2021-03-19T12:47:35.858+0100 [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2021-03-19T12:47:35.858+0100 [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2021-03-19T12:47:35.858+0100 [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2021-03-19T12:47:35.858+0100 [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
2021-03-19T12:47:35.858+0100 [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
2021-03-19T12:47:35.858+0100 [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
2021-03-19T12:47:35.858+0100 [DEBUG] http: request complete: method=PUT path=/v1/validate/job duration=1.44896ms
2021-03-19T12:47:35.859+0100 [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2021-03-19T12:47:35.859+0100 [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2021-03-19T12:47:35.859+0100 [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2021-03-19T12:47:35.859+0100 [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2021-03-19T12:47:35.859+0100 [TRACE] nomad.job: job validate results: validator=connect warnings=[] error=<nil>
2021-03-19T12:47:35.859+0100 [TRACE] nomad.job: job validate results: validator=expose-check warnings=[] error=<nil>
2021-03-19T12:47:35.859+0100 [TRACE] nomad.job: job validate results: validator=validate warnings=[] error=<nil>
2021-03-19T12:47:35.878+0100 [DEBUG] http: request complete: method=PUT path=/v1/jobs duration=19.292161ms
2021-03-19T12:47:35.878+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?index=1 duration=44.766116799s
2021-03-19T12:47:35.878+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?index=1 duration=4m22.337685831s
2021-03-19T12:47:35.878+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?index=1 duration=6.175009781s
2021-03-19T12:47:35.971+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=PCMMAudit-PCMMAudit duration=139.275µs
2021-03-19T12:47:35.972+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMAudit-PCMMAudit/evaluations duration=134.772µs
2021-03-19T12:47:35.974+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMAudit-PCMMAudit/allocations duration=1.415115ms
2021-03-19T12:47:35.974+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=PCMMManager-PCMMManager duration=160.018µs
2021-03-19T12:47:35.974+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager/evaluations duration=123.242µs
2021-03-19T12:47:35.976+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager/allocations duration=988.557µs
2021-03-19T12:47:44.078+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=144.456µs
2021-03-19T12:47:44.126+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:50098
2021-03-19T12:47:51.608+0100 [DEBUG] nomad: memberlist: Initiating push/pull sync with: grabber03.grabber 10.10.20.15:4648
2021-03-19T12:47:54.079+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=214.156µs
2021-03-19T12:47:54.126+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:50110
2021-03-19T12:48:04.080+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=226.004µs
2021-03-19T12:48:04.126+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:50120
2021-03-19T12:48:14.081+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=215.32µs
2021-03-19T12:48:14.127+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:50130
2021-03-19T12:48:20.973+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?index=11622 duration=5m10.984159083s
2021-03-19T12:48:24.082+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=159.943µs
2021-03-19T12:48:24.127+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:50138
2021-03-19T12:48:34.083+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=315.114µs
2021-03-19T12:48:34.127+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:50146
2021-03-19T12:48:44.084+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=250.355µs
2021-03-19T12:48:44.128+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:50156
2021-03-19T12:48:47.006+0100 [DEBUG] worker: dequeued evaluation: eval_id=4c19301f-687a-1a1c-3dd2-58bdf2838057
2021-03-19T12:48:47.006+0100 [DEBUG] core.sched: eval GC scanning before cutoff index: index=0 eval_gc_threshold=1h0m0s
2021-03-19T12:48:47.006+0100 [DEBUG] worker: ack evaluation: eval_id=4c19301f-687a-1a1c-3dd2-58bdf2838057
2021-03-19T12:48:47.006+0100 [DEBUG] worker: dequeued evaluation: eval_id=ee1a4c02-addc-c56e-c442-2ba76548ba3b
2021-03-19T12:48:47.006+0100 [TRACE] core.sched: garbage collecting unclaimed CSI volume claims: eval.JobID=csi-volume-claim-gc
2021-03-19T12:48:47.006+0100 [DEBUG] core.sched: CSI volume claim GC scanning before cutoff index: index=4 csi_volume_claim_gc_threshold=5m0s
2021-03-19T12:48:47.007+0100 [DEBUG] worker: ack evaluation: eval_id=ee1a4c02-addc-c56e-c442-2ba76548ba3b

I don't see the plan created by the evaluation here. nomad eval status for the eval ID should return some information about what happened with the evaluation.
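For example, with the eval ID returned at job registration (a sketch):

nomad eval status -verbose <eval-id>
nomad job status -evals PCMMManager-PCMMManager    # also lists the job's evaluations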

Also, HCL is a much nicer way to read the jobspec, but something that jumps out at me from the JSON is that you've got count = 1 with a 33% spread across each DC. You probably want to leave off the third target (see the example in the docs, and the sketch below), and having a count of 1 with a spread is a little weird anyway. Is that intentional?
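For illustration, a sketch of that change against the JSON jobspec above: leave dc3 off, and the remaining percentage implicitly falls to the unlisted target:

"Spreads": [
  {
    "Attribute": "${node.datacenter}",
    "Weight": 100,
    "SpreadTarget": [
      { "Value": "dc1", "Percent": 33 },
      { "Value": "dc2", "Percent": 33 }
    ]
  }
]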

@ivanprostran
Author

Please keep in mind that everything works as expected in this version:

nomad version
Nomad v0.9.0 (18dd59056ee1d7b2df51256fe900a98460d3d6b9)

The datacenter option is something I added yesterday (this behavior was detected before that).

Usually when some constraint prevents a deployment, it is visible and straightforward via 'nomad status jobname', along with the pending status (e.g. memory exhausted).

I don't see the plan created by the evaluation here. nomad eval status for the eval ID should return some information about what happened with the evaluation.

Response from the server:

	"EvalID": "ef9be1b4-c2fd-3e1e-73a1-df8f5aeb6f99",
	"EvalCreateIndex": 48,
	"JobModifyIndex": 48,
	"Warnings": "",
	"Index": 48,
	"LastContact": 0,
	"KnownLeader": false

There is no evaluation, as far as I can see:

nomad eval status -monitor ef9be1b4
No evaluation(s) with prefix or id "ef9be1b4" found

 nomad eval status -monitor ef9be1b4-c2fd-3e1e-73a1-df8f5aeb6f99
No evaluation(s) with prefix or id "ef9be1b4-c2fd-3e1e-73a1-df8f5aeb6f99" found

In my opinion the most important thing is to get some usable information from the backend.
What do I need to do to get some info and find out what might have gone wrong?

Thank you for your time and effort.

Regards,
Ivan

@tgross
Member

tgross commented Mar 19, 2021

The datacenter option is something I added yesterday (this behavior was detected before that).

It would help if you provided the jobspec that you initially had problems with, rather than one where you can't isolate the problem to the upgrade.

There is no evaluation as I can see:

That's from an eval you just ran, and not one that's been GC'd? If so, that's strange... do you have logs from the rest of the servers? The logs you have here are only for one of the servers, and it doesn't look like it's the one where the plan was applied (I'd expect to see the number of allocations that were placed).
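To find which server is currently the leader (and therefore where the plan-apply log lines would be), something like this should work (a sketch):

nomad server members
nomad operator raft list-peers    # the leader shows as "leader" in the State column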

@ivanprostran
Author

No additional logs (that's all I can get).

OK, I removed the spread stanza completely from the job (builder).

Created a new job: PCMMManagerNoSpreadStanza
(deployment via nomad-sdk)

nomad status PCMMManagerNoSpreadStanza-PCMMManager
ID            = PCMMManagerNoSpreadStanza-PCMMManager
Name          = PCMMManagerNoSpreadStanza-PCMMManager
Submit Date   = 2021-03-19T15:46:31+01:00
Type          = service
Priority      = 50
Datacenters   = dc1,dc2,dc3
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group                               Queued  Starting  Running  Failed  Complete  Lost
PCMMManagerNoSpreadStanza-PCMMManager-0  0       0         0        0       0         0

Allocations
No allocations placed

Node01 - trace
 2021-03-19T15:46:18.131+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager/deployment?index=268 duration=5m14.43017974s
2021-03-19T15:46:18.763+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager/evaluations?index=291 duration=5m15.096552657s
2021-03-19T15:46:19.596+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager?index=300 duration=5m0.703788795s
2021-03-19T15:46:19.785+0100 [DEBUG] http: request complete: method=GET path=/v1/client/stats?node_id=11459763-1a32-a1de-6141-e8bbad9f2061 duration=2.115091ms
2021-03-19T15:46:19.814+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/monitor?client_id=11459763-1a32-a1de-6141-e8bbad9f2061&follow=true&log_level=debug&offset=50000&origin=end duration=3m24.347588185s
2021-03-19T15:46:19.828+0100 [DEBUG] http: request complete: method=GET path=/v1/client/allocation/713ee672-9119-16d7-3568-9169374acb5f/stats duration=1.554687ms
2021-03-19T15:46:19.893+0100 [DEBUG] http: request complete: method=GET path=/v1/client/allocation/ba5e3361-0c5a-927a-538b-e8de8c397e0a/stats duration=1.762755ms
2021-03-19T15:46:19.905+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager/evaluations?index=291 duration=5m0.549051642s
2021-03-19T15:46:22.789+0100 [DEBUG] http: request complete: method=GET path=/v1/client/stats?node_id=183f311e-5494-d894-f495-9dabafded5be duration=784.147µs
2021-03-19T15:46:22.806+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/monitor?client_id=183f311e-5494-d894-f495-9dabafded5be&follow=true&log_level=debug&offset=50000&origin=end duration=3m24.474681542s
2021-03-19T15:46:22.858+0100 [DEBUG] http: request complete: method=GET path=/v1/client/allocation/a11f2f7f-6915-e242-b8c5-072f76795359/stats duration=674.091µs
2021-03-19T15:46:24.443+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34428
2021-03-19T15:46:24.631+0100 [DEBUG] nomad: memberlist: Stream connection from=10.10.20.14:57524
2021-03-19T15:46:25.122+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=149.359µs
2021-03-19T15:46:30.019+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager/evaluations?index=291 duration=5m16.504643644s
2021-03-19T15:46:31.543+0100 [DEBUG] http: request complete: method=PUT path=/v1/validate/job duration=424.032µs
2021-03-19T15:46:31.553+0100 [DEBUG] http: request complete: method=PUT path=/v1/jobs duration=8.45694ms
2021-03-19T15:46:31.615+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager/allocations?index=284 duration=5m12.807695721s
2021-03-19T15:46:31.620+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=PCMMWSServer01-PCMMWSServer duration=173.456µs
2021-03-19T15:46:31.621+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMWSServer01-PCMMWSServer/evaluations duration=137.151µs
2021-03-19T15:46:31.621+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMWSServer01-PCMMWSServer/allocations duration=178.11µs
2021-03-19T15:46:31.622+0100 [DEBUG] http: request complete: method=GET path=/v1/allocation/a11f2f7f-6915-e242-b8c5-072f76795359 duration=315.061µs
2021-03-19T15:46:31.624+0100 [DEBUG] http: request complete: method=GET path=/v1/allocation/a11f2f7f-6915-e242-b8c5-072f76795359 duration=267.61µs
2021-03-19T15:46:31.625+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=PCMMWSServer02-PCMMWSServer duration=149.776µs
2021-03-19T15:46:31.626+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMWSServer02-PCMMWSServer/evaluations duration=263.598µs
2021-03-19T15:46:31.627+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMWSServer02-PCMMWSServer/allocations duration=133.792µs
2021-03-19T15:46:31.628+0100 [DEBUG] http: request complete: method=GET path=/v1/allocation/ba5e3361-0c5a-927a-538b-e8de8c397e0a duration=226.982µs
2021-03-19T15:46:31.629+0100 [DEBUG] http: request complete: method=GET path=/v1/allocation/ba5e3361-0c5a-927a-538b-e8de8c397e0a duration=209.776µs
2021-03-19T15:46:31.630+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=PCMMAudit-PCMMAudit duration=111.983µs
2021-03-19T15:46:31.631+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMAudit-PCMMAudit/evaluations duration=124.877µs
2021-03-19T15:46:31.631+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMAudit-PCMMAudit/allocations duration=169.58µs
2021-03-19T15:46:31.632+0100 [DEBUG] http: request complete: method=GET path=/v1/allocation/713ee672-9119-16d7-3568-9169374acb5f duration=251.087µs
2021-03-19T15:46:31.634+0100 [DEBUG] http: request complete: method=GET path=/v1/allocation/713ee672-9119-16d7-3568-9169374acb5f duration=263.1µs
2021-03-19T15:46:31.635+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=PCMMWSServer03-PCMMWSServer duration=128.459µs
2021-03-19T15:46:31.635+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMWSServer03-PCMMWSServer/evaluations duration=112.056µs
2021-03-19T15:46:31.636+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMWSServer03-PCMMWSServer/allocations duration=151.926µs
2021-03-19T15:46:31.637+0100 [DEBUG] http: request complete: method=GET path=/v1/allocation/0917099e-2c9f-947c-4dc0-31372d293474 duration=287.667µs
2021-03-19T15:46:31.639+0100 [DEBUG] http: request complete: method=GET path=/v1/allocation/0917099e-2c9f-947c-4dc0-31372d293474 duration=235.828µs
2021-03-19T15:46:31.640+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=PCMMManagerNoSpreadStanza-PCMMManager duration=149.017µs
2021-03-19T15:46:31.640+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/evaluations duration=96.462µs
2021-03-19T15:46:31.641+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/allocations duration=104.301µs
2021-03-19T15:46:34.443+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34436
2021-03-19T15:46:35.123+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=149.547µs
2021-03-19T15:46:44.444+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34448
2021-03-19T15:46:45.124+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=147.614µs
2021-03-19T15:46:47.288+0100 [DEBUG] http: request complete: method=GET path=/v1/evaluations?prefix=9cd7ebc7-5e47-73a1-6cb1-daca156824d2 duration=179.073µs
2021-03-19T15:46:47.602+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager/allocations?index=284 duration=5m0.132882224s
2021-03-19T15:46:51.817+0100 [DEBUG] nomad: memberlist: Initiating push/pull sync with: grabber02.grabber 10.10.20.14:4648
2021-03-19T15:46:54.444+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34462
2021-03-19T15:46:55.125+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=155.216µs
2021-03-19T15:46:59.505+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager/summary?index=284 duration=5m12.069446853s
2021-03-19T15:47:03.468+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManager-PCMMManager?index=300 duration=5m16.033698766s
2021-03-19T15:47:04.444+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34470
2021-03-19T15:47:05.126+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=157.203µs
2021-03-19T15:47:05.647+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs duration=261.398µs
2021-03-19T15:47:10.454+0100 [DEBUG] http: request complete: method=PUT path=/v1/search duration=269.1µs
2021-03-19T15:47:10.456+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=PCMMManagerNoSpreadStanza-PCMMManager duration=163.93µs
2021-03-19T15:47:10.456+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager?namespace=default duration=190.61µs
2021-03-19T15:47:10.458+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/allocations?all=false&namespace=default duration=120.233µs
2021-03-19T15:47:10.458+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/evaluations?namespace=default duration=102.132µs
2021-03-19T15:47:10.459+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/deployment?namespace=default duration=97.599µs
2021-03-19T15:47:10.459+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/summary?namespace=default duration=127.924µs
2021-03-19T15:47:14.445+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34496
2021-03-19T15:47:15.127+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=152.508µs
2021-03-19T15:47:24.445+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34504
2021-03-19T15:47:25.127+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=172.308µs 

Node02 - trace
(identical to the Node01 trace above)

Node03 - trace
(identical to the Node01 trace above)
2021-03-19T15:47:10.456+0100 [DEBUG] http: request complete: method=GET path=/v1/jobs?prefix=PCMMManagerNoSpreadStanza-PCMMManager duration=163.93µs
2021-03-19T15:47:10.456+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager?namespace=default duration=190.61µs
2021-03-19T15:47:10.458+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/allocations?all=false&namespace=default duration=120.233µs
2021-03-19T15:47:10.458+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/evaluations?namespace=default duration=102.132µs
2021-03-19T15:47:10.459+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/deployment?namespace=default duration=97.599µs
2021-03-19T15:47:10.459+0100 [DEBUG] http: request complete: method=GET path=/v1/job/PCMMManagerNoSpreadStanza-PCMMManager/summary?namespace=default duration=127.924µs
2021-03-19T15:47:14.445+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34496
2021-03-19T15:47:15.127+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=152.508µs
2021-03-19T15:47:24.445+0100 [DEBUG] nomad: memberlist: Stream connection from=127.0.0.1:34504
2021-03-19T15:47:25.127+0100 [DEBUG] http: request complete: method=GET path=/v1/agent/health?type=server duration=172.308µs

Job definition
{
  "Stop": false,
  "Region": "grabber",
  "Namespace": "default",
  "ID": "PCMMManagerNoSpreadStanza-PCMMManager",
  "ParentID": "6cef33c5-41a6-4a1d-ba9e-9e3a2b57d0fb",
  "Name": "PCMMManagerNoSpreadStanza-PCMMManager",
  "Type": "service",
  "Priority": 50,
  "AllAtOnce": false,
  "Datacenters": [
    "dc1",
    "dc2",
    "dc3"
  ],
  "Constraints": [
    {
      "LTarget": "",
      "RTarget": "true",
      "Operand": "distinct_hosts"
    }
  ],
  "Affinities": null,
  "Spreads": null,
  "TaskGroups": [
    {
      "Name": "PCMMManagerNoSpreadStanza-PCMMManager-0",
      "Count": 1,
      "Update": {
        "Stagger": 30000000000,
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000,
        "ProgressDeadline": 600000000000,
        "AutoRevert": false,
        "AutoPromote": false,
        "Canary": 0
      },
      "Migrate": {
        "MaxParallel": 1,
        "HealthCheck": "checks",
        "MinHealthyTime": 10000000000,
        "HealthyDeadline": 300000000000
      },
      "Constraints": [
        {
          "LTarget": "",
          "RTarget": "true",
          "Operand": "distinct_hosts"
        }
      ],
      "Scaling": null,
      "RestartPolicy": {
        "Attempts": 960,
        "Interval": 14400000000000,
        "Delay": 15000000000,
        "Mode": "delay"
      },
      "Tasks": [
        {
          "Name": "PCMMManagerNoSpreadStanza-PCMMManager-0",
          "Driver": "raw_exec",
          "User": "",
          "Config": {
            "args": [
              "-XX:+UseG1GC",
              "-XX:MaxGCPauseMillis=1000",
              "-XX:G1HeapRegionSize=1",
              "-Xms64m",
              "-Xmx768m",
              "-jar",
              "local/pcmm-manager-0.1.0.jar"
            ],
            "command": "/usr/bin/java"
          },
          "Env": {
            "SPRING_APPLICATION_JSON": "{\"spring.metrics.export.triggers.application.includes\":\"integration**\",\"spring.application.memory\":\"768\",\"spring.cloud.stream.metrics.key\":\"PCMMManagerNoSpreadStanza.PCMMManager.${spring.cloud.application.guid}\",\"spring.cloud.dataflow.stream.app.label\":\"PCMMManager\",\"spring.cloud.stream.bindings.input.group\":\"PCMMManagerNoSpreadStanza\",\"spring.application.count\":\"1\",\"spring.cloud.stream.metrics.properties\":\"spring.application.name,spring.application.index,spring.cloud.application.*,spring.cloud.dataflow.*\",\"spring.application.name\":\"PCMMManager\",\"server.port\":\"0\",\"spring.cloud.stream.bindings.output.destination\":\"OUTPUT\",\"spring.cloud.dataflow.stream.name\":\"PCMMManagerNoSpreadStanza\",\"spring.cloud.dataflow.stream.app.type\":\"processor\",\"spring.cloud.stream.bindings.input.destination\":\"INPUT\"}",
            "SPRING_CLOUD_APPLICATION_GROUP": "PCMMManagerNoSpreadStanza",
            "INSTANCE_INDEX": "0",
            "SPRING_APPLICATION_INDEX": "0"
          },
          "Services": null,
          "Vault": null,
          "Templates": null,
          "Constraints": null,
          "Affinities": null,
          "Resources": {
            "CPU": 100,
            "MemoryMB": 768,
            "DiskMB": 0,
            "IOPS": 0,
            "Networks": null,
            "Devices": null
          },
          "RestartPolicy": {
            "Attempts": 960,
            "Interval": 14400000000000,
            "Delay": 15000000000,
            "Mode": "delay"
          },
          "DispatchPayload": null,
          "Lifecycle": null,
          "Meta": {
            "uniqueId": "6cef33c5-41a6-4a1d-ba9e-9e3a2b57d0fb"
          },
          "KillTimeout": 5000000000,
          "LogConfig": {
            "MaxFiles": 2,
            "MaxFileSizeMB": 10
          },
          "Artifacts": [
            {
              "GetterSource": "http://maven01:9393/resources/maven/hr.bulb.ai/pcmm-manager-0.1.0.jar",
              "GetterOptions": {
                "checksum": "md5:ced83cd1b2fb581c073764df50deb493"
              },
              "GetterHeaders": null,
              "GetterMode": "any",
              "RelativeDest": "local"
            }
          ],
          "Leader": false,
          "ShutdownDelay": 0,
          "VolumeMounts": null,
          "ScalingPolicies": null,
          "KillSignal": "",
          "Kind": "",
          "CSIPluginConfig": null
        }
      ],
      "EphemeralDisk": {
        "Sticky": false,
        "SizeMB": 300,
        "Migrate": false
      },
      "Meta": {
        "uniqueId": "6cef33c5-41a6-4a1d-ba9e-9e3a2b57d0fb"
      },
      "ReschedulePolicy": {
        "Attempts": 0,
        "Interval": 0,
        "Delay": 30000000000,
        "DelayFunction": "exponential",
        "MaxDelay": 3600000000000,
        "Unlimited": true
      },
      "Affinities": null,
      "Spreads": null,
      "Networks": null,
      "Services": null,
      "Volumes": null,
      "ShutdownDelay": null,
      "StopAfterClientDisconnect": null
    }
  ],
  "Update": {
    "Stagger": 30000000000,
    "MaxParallel": 1,
    "HealthCheck": "",
    "MinHealthyTime": 0,
    "HealthyDeadline": 0,
    "ProgressDeadline": 0,
    "AutoRevert": false,
    "AutoPromote": false,
    "Canary": 0
  },
  "Multiregion": null,
  "Periodic": null,
  "ParameterizedJob": null,
  "Dispatched": false,
  "Payload": null,
  "Meta": {
    "uniqueId": "6cef33c5-41a6-4a1d-ba9e-9e3a2b57d0fb"
  },
  "ConsulToken": "",
  "VaultToken": "",
  "VaultNamespace": "",
  "NomadTokenID": "",
  "Status": "pending",
  "StatusDescription": "",
  "Stable": false,
  "Version": 0,
  "SubmitTime": 1616165191544841200,
  "CreateIndex": 306,
  "ModifyIndex": 306,
  "JobModifyIndex": 306
}

From the Java deployer (via nomad-sdk):

jobValidateResponseServerResponse: 

        {"DriverConfigValidated":true,"ValidationErrors":null,"Error":"","Warnings":""}

EvaluationResponse: 

        deploymentID: PCMMManagerNoSpreadStanza-PCMMManager, 
        jobID:PCMMManagerNoSpreadStanza-PCMMManager, 
        evalID: 9cd7ebc7-5e47-73a1-6cb1-daca156824d2
        Http Response: {}

I checked the evaluation ID a few seconds later, immediately after the deployment:

nomad eval status 9cd7ebc7-5e47-73a1-6cb1-daca156824d2
No evaluation(s) with prefix or id "9cd7ebc7-5e47-73a1-6cb1-daca156824d2" found

> That's from an eval you just ran, and not one that's been GC'd? If so, that's strange...

Exactly, it has not been GC'd, no doubt about it.
It was never created in the first place.
Perhaps this could be the root cause?
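
For reference, the evaluation can also be queried over the HTTP API directly against a server, which rules out any CLI-side prefix matching (the server address and default HTTP port here are taken from the agent output later in this thread; a 404 should mean the evaluation was never persisted):

curl -s http://10.10.20.13:4646/v1/evaluation/9cd7ebc7-5e47-73a1-6cb1-daca156824d2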

So again, the same behavior and result:

Status = pending
Reason not available.

Code snippet from the Java client using nomad-sdk:

final Job job = JobBuilder.build();

final ServerResponse<JobValidateResponse> jobValidateResponseServerResponse =
        nomadApiClient.getJobsApi().validate(job);

LOGGER.info("jobValidateResponseServerResponse: {}", jobValidateResponseServerResponse.getValue().toString());

final EvaluationResponse response = nomadApiClient.getJobsApi().register(job);

// Note: four "{}" placeholders but only three arguments, which is why the
// output above ends with a literal "Http Response: {}".
LOGGER.info("EvaluationResponse: deploymentID: {}, jobID: {}, evalID: {}, Http Response: {}",
        nomadContext.getDeploymentId(), job.getId(), response.getValue());
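
As a cross-check from the same client, the returned eval ID can be looked up immediately after register(). A minimal sketch, assuming the SDK exposes the lookup as getEvaluationsApi().info(...) (the exact method names are an assumption based on the nomad-sdk docs):

final String evalId = response.getValue();

try {
    // If the server never persisted the evaluation, this lookup should fail
    // with a "not found" error rather than return an Evaluation.
    final ServerResponse<Evaluation> evalInfo = nomadApiClient.getEvaluationsApi().info(evalId);
    LOGGER.info("eval {} status: {}", evalId, evalInfo.getValue().getStatus());
} catch (IOException | NomadException e) {
    LOGGER.error("lookup of eval {} failed: {}", evalId, e.getMessage());
}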

Thank you for your time and effort!

Regards,
Ivan Prostran

@tgross
Member

tgross commented Mar 19, 2021

@ivanprostran I think you might have copied the wrong logs... all 3 nodes are showing identical logs. Also, it looks like you're running the server and client together in the same agent? It shouldn't matter for purposes of this issue, but just to be clear that's not a recommended production configuration.

@ivanprostran
Author

> @ivanprostran I think you might have copied the wrong logs... all 3 nodes are showing identical logs. Also, it looks like you're running the server and client together in the same agent? It shouldn't matter for purposes of this issue, but just to be clear that's not a recommended production configuration.

I opened the logs from the GUI (3 tabs, 3 clients, DEBUG level):

[screenshots: the three GUI log-monitor tabs, one per client]

The setup is:

3-node cluster:

  1. node (consul, nomad-client, nomad-server)
  2. node (consul, nomad-client, nomad-server)
  3. node (consul, nomad-client, nomad-server)

Regards,
Ivan

@ivanprostran
Author

Here is the process list (same output on node2 and node3):

node 1:
consul     99773       1  0 12:13 ?        00:02:28 /opt/nomad-consul/bin/consul agent -server -bootstrap-expect=2 -ui -node=grabber1 -bind=0.0.0.0 -data-dir=/opt/nomad-consul/data -config-dir=/etc/nomad-consul 
nomad      99789       1  1 12:13 ?        00:07:00 /opt/nomad-server/bin/nomad agent -server -config /etc/nomad-server/ -region grabber
nomad      99809       1  0 12:13 ?        00:01:00 /opt/nomad-client/bin/nomad agent -client -config /etc/nomad-client -region grabber -dc=dc1
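
For completeness, a split server/client layout like this corresponds to agent configs roughly along the following lines. This is a minimal sketch reconstructed from the process list and agent output, not the actual files from this cluster; the ports, paths, and bootstrap_expect value are assumptions:

# /etc/nomad-server/nomad-server.hcl (sketch)
data_dir = "/opt/nomad-server/data"

server {
  enabled          = true
  bootstrap_expect = 3   # assumption; must match the real server count
}

# /etc/nomad-client/nomad-client.hcl (sketch)
data_dir = "/opt/nomad-client/data"

ports {
  http = 5656   # matches the client's advertised HTTP address above
}

client {
  enabled = true
  servers = ["127.0.0.1:4647"]   # assumption: local server RPC port
}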

@tgross
Member

tgross commented Mar 19, 2021

Please pull the logs from the journal.

@ivanprostran
Author

ivanprostran commented Mar 19, 2021

Clean start (alloc/data empty)

node01

journalctl -u nomad-server
Mar 19 19:06:18 grabber01 nomad-server[105297]: ==> Loaded configuration from /etc/nomad-server/nomad-server.hcl
Mar 19 19:06:18 grabber01 nomad-server[105297]: ==> Starting Nomad agent...
Mar 19 19:06:18 grabber01 nomad-server[105297]: ==> Nomad agent configuration:
Mar 19 19:06:18 grabber01 nomad-server[105297]:        Advertise Addrs: HTTP: 10.10.20.13:4646; RPC: 10.10.20.13:4647; Serf: 10.10.20.13:4648
Mar 19 19:06:18 grabber01 nomad-server[105297]:             Bind Addrs: HTTP: 0.0.0.0:4646; RPC: 0.0.0.0:4647; Serf: 0.0.0.0:4648
Mar 19 19:06:18 grabber01 nomad-server[105297]:                 Client: false
Mar 19 19:06:18 grabber01 nomad-server[105297]:              Log Level: INFO
Mar 19 19:06:18 grabber01 nomad-server[105297]:                 Region: grabber (DC: dc1)
Mar 19 19:06:18 grabber01 nomad-server[105297]:                 Server: true
Mar 19 19:06:18 grabber01 nomad-server[105297]:                Version: 1.0.4
Mar 19 19:06:18 grabber01 nomad-server[105297]: ==> Nomad agent started! Log data will stream in below:
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.510+0100 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_di>
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.511+0100 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.511+0100 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.511+0100 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.511+0100 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.511+0100 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.511+0100 [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.525+0100 [INFO]  nomad.raft: initial configuration: index=0 servers=[]
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.525+0100 [INFO]  nomad.raft: entering follower state: follower="Node at 10.10.20.13:4647 [Follower]" lea>
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.525+0100 [INFO]  nomad: serf: EventMemberJoin: grabber01.grabber 10.10.20.13
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.525+0100 [INFO]  nomad: starting scheduling worker(s): num_workers=8 schedulers=[service, batch, system, >
Mar 19 19:06:18 grabber01 nomad-server[105297]:     2021-03-19T19:06:18.525+0100 [INFO]  nomad: adding server: server="grabber01.grabber (Addr: 10.10.20.13:4647) (DC: d>
Mar 19 19:06:19 grabber01 nomad-server[105297]:     2021-03-19T19:06:19.677+0100 [WARN]  nomad.raft: no known peers, aborting election
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.530+0100 [INFO]  nomad: serf: EventMemberJoin: grabber02.grabber 10.10.20.14
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.530+0100 [INFO]  nomad: serf: EventMemberJoin: grabber03.grabber 10.10.20.15
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.530+0100 [INFO]  nomad: adding server: server="grabber02.grabber (Addr: 10.10.20.14:4647) (DC: d>
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.531+0100 [INFO]  nomad: successfully contacted Nomad servers: num_servers=2
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.532+0100 [INFO]  nomad: found expected number of peers, attempting to bootstrap cluster...: peers=10.160.>
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.543+0100 [INFO]  nomad: adding server: server="grabber03.grabber (Addr: 10.10.20.15:4647) (DC: d>
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.983+0100 [WARN]  nomad.raft: heartbeat timeout reached, starting election: last-leader=
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.983+0100 [INFO]  nomad.raft: entering candidate state: node="Node at 10.10.20.13:4647 [Candidate]" term=2
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.997+0100 [INFO]  nomad.raft: election won: tally=2
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.997+0100 [INFO]  nomad.raft: entering leader state: leader="Node at 10.10.20.13:4647 [Leader]"
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.997+0100 [INFO]  nomad.raft: added peer, starting replication: peer=10.10.20.14:4647
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.997+0100 [INFO]  nomad.raft: added peer, starting replication: peer=10.10.20.15:4647
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.997+0100 [INFO]  nomad: cluster leadership acquired
Mar 19 19:06:21 grabber01 nomad-server[105297]:     2021-03-19T19:06:21.998+0100 [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.10.20.15:4647 10.10.20.15:4647}"
Mar 19 19:06:22 grabber01 nomad-server[105297]:     2021-03-19T19:06:22.001+0100 [INFO]  nomad.raft: pipelining replication: peer="{Voter 10.10.20.14:4647 10.10.20.14:4647}"
Mar 19 19:06:22 grabber01 nomad-server[105297]:     2021-03-19T19:06:22.029+0100 [INFO]  nomad.core: established cluster id: cluster_id=486a1600-9738-4f1d-298b-7b3fa7117c6e crea>

journalctl -u nomad-client
Mar 19 19:06:22 grabber01 nomad-client[105314]: ==> Loaded configuration from /etc/nomad-client/nomad-client.hcl
Mar 19 19:06:22 grabber01 nomad-client[105314]: ==> Starting Nomad agent...
Mar 19 19:06:30 grabber01 nomad-client[105314]: ==> Nomad agent configuration:
Mar 19 19:06:30 grabber01 nomad-client[105314]:        Advertise Addrs: HTTP: 10.10.20.13:5656
Mar 19 19:06:30 grabber01 nomad-client[105314]:             Bind Addrs: HTTP: 0.0.0.0:5656
Mar 19 19:06:30 grabber01 nomad-client[105314]:                 Client: true
Mar 19 19:06:30 grabber01 nomad-client[105314]:              Log Level: INFO
Mar 19 19:06:30 grabber01 nomad-client[105314]:                 Region: grabber (DC: dc1)
Mar 19 19:06:30 grabber01 nomad-client[105314]:                 Server: false
Mar 19 19:06:30 grabber01 nomad-client[105314]:                Version: 1.0.4
Mar 19 19:06:30 grabber01 nomad-client[105314]: ==> Nomad agent started! Log data will stream in below:
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.484+0100 [WARN]  agent.plugin_loader: skipping external plugins since plugin_dir doesn't exist: plugin_di>
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.486+0100 [INFO]  agent: detected plugin: name=nvidia-gpu type=device plugin_version=0.1.0
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.486+0100 [INFO]  agent: detected plugin: name=raw_exec type=driver plugin_version=0.1.0
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.486+0100 [INFO]  agent: detected plugin: name=exec type=driver plugin_version=0.1.0
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.486+0100 [INFO]  agent: detected plugin: name=qemu type=driver plugin_version=0.1.0
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.486+0100 [INFO]  agent: detected plugin: name=java type=driver plugin_version=0.1.0
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.486+0100 [INFO]  agent: detected plugin: name=docker type=driver plugin_version=0.1.0
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.486+0100 [INFO]  client: using state directory: state_dir=/opt/nomad-client/data/client
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.494+0100 [INFO]  client: using alloc directory: alloc_dir=/opt/nomad-client/data/alloc
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.529+0100 [INFO]  client.fingerprint_mgr.cgroup: cgroups are available
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.531+0100 [INFO]  client.fingerprint_mgr.consul: consul agent is available
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:22.537+0100 [WARN]  client.fingerprint_mgr.network: unable to parse speed: path=/usr/sbin/ethtool device=lo
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:30.545+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=csi
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:30.545+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=driver
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:30.545+0100 [INFO]  client.plugin: starting plugin manager: plugin-type=device
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:30.546+0100 [INFO]  client: started client: node_id=2280799a-6d46-f758-537a-4b3d7730a651
Mar 19 19:06:30 grabber01 nomad-client[105314]:     2021-03-19T19:06:30.583+0100 [INFO]  client: node registration complete
Mar 19 19:06:38 grabber01 nomad-client[105314]:     2021-03-19T19:06:38.949+0100 [INFO]  client: node registration complete
Waiting for data... (interrupt to abort)

After job deployment (no change in the logs!)

nomad status
ID                       Type     Priority  Status   Submit Date
PCMMManager-PCMMManager  service  50        pending  2021-03-19T19:13:34+01:00
nomad status PCMMManager-PCMMManager
ID            = PCMMManager-PCMMManager
Name          = PCMMManager-PCMMManager
Submit Date   = 2021-03-19T19:13:34+01:00
Type          = service
Priority      = 50
Datacenters   = dc1,dc2,dc3
Namespace     = default
Status        = pending
Periodic      = false
Parameterized = false

Summary
Task Group                 Queued  Starting  Running  Failed  Complete  Lost
PCMMManager-PCMMManager-0  0       0         0        0       0         0

Allocations
No allocations placed
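
Since the job sits in pending with no evaluations at all, it may also be worth confirming that the servers agree on a leader; both of these are stock CLI commands:

nomad server members
nomad operator raft list-peers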

@tgross
Member

tgross commented Mar 22, 2021

@ivanprostran we need the log info from all the servers in the cluster, and we need them at the debug level so we can see what's going on. A nomad operator debug bundle might be the best way to make sure we're getting everything we need, but make sure you send that to the debug mailing list rather than posting it here unless this is a throwaway cluster.
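
For reference, such a bundle can be captured with something along these lines (flag names per the 1.0-era CLI, so double-check against your version; adjust the duration as needed):

nomad operator debug -duration=2m -log-level=debug -server-id=all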

@ivanprostran
Author

Hi Tim,

This is a testing environment for the moment.
The same behavior can easily be reproduced in a single-node environment.
I will prepare a single-node environment and send all the relevant information as you suggested.

Thank you for your time and effort.

Regards,
Ivan

@ivanprostran
Author

The problem is solved:

#10422

Thank you for your time and effort.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 19, 2022