Panic on job plan with Nomad 1.6.0 #18007

Closed
gmichalec-pandora opened this issue Jul 19, 2023 · 9 comments · Fixed by #18017 or #18021

Labels: stage/accepted (Confirmed, and intend to work on. No timeline commitment though.), theme/crash, theme/node-pools (Issues related to node pools), type/bug

Comments

@gmichalec-pandora

Nomad version

1.6.0

Operating system and Environment details

Debian Buster

Issue

When running nomad job plan, we get the following error message from the CLI:

$ ~/bin/nomad-1.6.0  job plan -hcl1 /tmp/20230719144447/nomad.hcl.final
Error during plan: Put "http://allyrion:4646/v1/job/savagecloud-monitoring-integration/plan?namespace=savagecloud-monitoring&region=integration": EOF

When we inspect the server logs, we see this:

2023-07-19T15:01:28.391-0700 [TRACE] nomad.job: job mutate results: mutator=canonicalize warnings=[] error=<nil>
2023-07-19T15:01:28.391-0700 [TRACE] nomad.job: job mutate results: mutator=connect warnings=[] error=<nil>
2023-07-19T15:01:28.391-0700 [TRACE] nomad.job: job mutate results: mutator=expose-check warnings=[] error=<nil>
2023-07-19T15:01:28.391-0700 [TRACE] nomad.job: job mutate results: mutator=constraints warnings=[] error=<nil>
2023-07-19T15:01:28.391-0700 [DEBUG] http: request complete: method=PUT path="/v1/job/savagecloud-monitoring-integration/plan?namespace=savagecloud-monitoring&region=integration" duration=6.007368ms
2023-07-19T15:01:28.392-0700 [ERROR] http: http: panic serving 172.17.225.60:33158: runtime error: invalid memory address or nil pointer dereference
goroutine 916881 [running]:
net/http.(*conn).serve.func1()
	net/http/server.go:1854 +0xbf
panic({0x2bdf8a0, 0x50386e0})
	runtime/panic.go:890 +0x263
github.com/hashicorp/nomad/nomad.jobNodePoolMutatingHook.Mutate({0x0?}, 0xc00384e000)
	github.com/hashicorp/nomad/nomad/job_endpoint_hook_node_pool_ent.go:86 +0x5d
github.com/hashicorp/nomad/nomad.(*Job).admissionMutators(0xc000a9eaa0, 0x29e4960?)
	github.com/hashicorp/nomad/nomad/job_endpoint_hooks.go:99 +0x122
github.com/hashicorp/nomad/nomad.(*Job).admissionControllers(0xc000aba2c0?, 0x31ced0f?)
	github.com/hashicorp/nomad/nomad/job_endpoint_hooks.go:81 +0x27
github.com/hashicorp/nomad/nomad.(*Job).Plan(0xc000a9eaa0, 0xc00792f740, 0xc002b5ab60)
	github.com/hashicorp/nomad/nomad/job_endpoint.go:1747 +0x196
reflect.Value.call({0xc000194e40?, 0xc0009193f8?, 0x7fef9c699c40?}, {0x31cfab3, 0x4}, {0xc003663378, 0x3, 0x0?})
	reflect/value.go:586 +0xb0b
reflect.Value.Call({0xc000194e40?, 0xc0009193f8?, 0x30925a0?}, {0xc003663378?, 0xc0036633c8?, 0xa594b8?})
	reflect/value.go:370 +0xbc
net/rpc.(*service).call(0xc0003f39c0, 0x2f32440?, 0xc003663450?, 0x0, 0xc0005b3280, 0x40?, {0x30925a0?, 0xc00792f740?, 0x0?}, {0x2d55360, ...}, ...)
	net/rpc/server.go:382 +0x226
net/rpc.(*Server).ServeRequest(0x70?, {0x38eb430, 0xc0032ef640})
	net/rpc/server.go:503 +0x18c
github.com/hashicorp/nomad/nomad.(*Server).RPC(0xc000aba2c0, {0x31e3f05, 0x8}, {0x30925a0?, 0xc00792f680}, {0x2d55360?, 0xc002b5aaf0})
	github.com/hashicorp/nomad/nomad/server.go:1925 +0xf3
github.com/hashicorp/nomad/command/agent.(*Agent).RPC(0xc000ee72d0?, {0x31e3f05?, 0xc0005aa500?}, {0x30925a0?, 0xc00792f680?}, {0x2d55360?, 0xc002b5aaf0?})
	github.com/hashicorp/nomad/command/agent/agent.go:1216 +0x45
github.com/hashicorp/nomad/command/agent.(*HTTPServer).jobPlan(0xc000ee72d0, {0x38e98d0, 0xc00792f2c0}, 0xc0005aa500, {0xc005075f1c, 0x22})
	github.com/hashicorp/nomad/command/agent/job_endpoint.go:181 +0x31e
github.com/hashicorp/nomad/command/agent.(*HTTPServer).JobSpecificRequest(0x412ac7?, {0x38e98d0, 0xc00792f2c0}, 0xc0005aa500)
	github.com/hashicorp/nomad/command/agent/job_endpoint.go:82 +0x41b
github.com/hashicorp/nomad/command/agent.(*HTTPServer).auditHandler.func1({0x38e98d0?, 0xc00792f2c0}, 0xc0005aa400)
	github.com/hashicorp/nomad/command/agent/http_ent.go:239 +0x1fa
github.com/hashicorp/nomad/command/agent.(*HTTPServer).wrap.func1({0x38e98d0, 0xc00792f2c0}, 0xc0005aa400)
	github.com/hashicorp/nomad/command/agent/http.go:710 +0x17b
net/http.HandlerFunc.ServeHTTP(0xc001b93dd0?, {0x38e98d0?, 0xc00792f2c0?}, 0x510e87?)
	net/http/server.go:2122 +0x2f
net/http.(*ServeMux).ServeHTTP(0x0?, {0x38e98d0, 0xc00792f2c0}, 0xc0005aa400)
	net/http/server.go:2500 +0x149
github.com/gorilla/handlers.CompressHandlerLevel.func1({0x38e7a10?, 0xc0017602a0}, 0xc0005aa400)
	github.com/gorilla/[email protected]/compress.go:141 +0x6d0
net/http.HandlerFunc.ServeHTTP(0xc005075f44?, {0x38e7a10?, 0xc0017602a0?}, 0x74cc14?)
	net/http/server.go:2122 +0x2f
net/http.serverHandler.ServeHTTP({0x38db890?}, {0x38e7a10, 0xc0017602a0}, 0xc0005aa400)
	net/http/server.go:2936 +0x316
net/http.(*conn).serve(0xc0036f3dd0, {0x38ea358, 0xc00140d260})
	net/http/server.go:1995 +0x612
created by net/http.(*Server).Serve
	net/http/server.go:3089 +0x5ed
@gmichalec-pandora
Author

HCL file (JSON format):

{
  "job": {
    "savagecloud-monitoring-integration": {
      "constraint": {
        "attribute": "${attr.kernel.name}",
        "value": "linux"
      },
      "datacenters": [
        "integration"
      ],
      "group": [
        {
          "savagecloud-monitoring-group": {
            "count": 2,
            "restart": {
              "attempts": 2,
              "delay": "25s",
              "interval": "2m",
              "mode": "fail"
            },
            "task": [
              {
                "savagecloud-monitoring": {
                  "config": {
                    "image": "harbor-registry.savagebeast.com/savagecloud-monitoring/savagecloud-monitoring:master-latest",
                    "port_map": {
                      "web": 3000
                    }
                  },
                  "driver": "docker",
                  "env": {
                    "COLO": "integration"
                  },
                  "resources": {
                    "cpu": 300,
                    "memory": 100,
                    "network": {
                      "mbits": 1,
                      "port": {
                        "web": {}
                      }
                    }
                  },
                  "service": [
                    {
                      "canary_tags": [
                        "project--savagecloud-monitoring",
                        "docker_tag--master-latest",
                        "nomad_job--${NOMAD_JOB_NAME}",
                        "nomad-alloc--${NOMAD_ALLOC_ID}",
                        "nomad_task--${NOMAD_JOB_NAME}-${NOMAD_GROUP_NAME}-${NOMAD_TASK_NAME}",
                        "canary--true"
                      ],
                      "check": {
                        "interval": "10s",
                        "name": "${NOMAD_TASK_NAME} https",
                        "path": "/",
                        "protocol": "https",
                        "timeout": "2s",
                        "type": "http"
                      },
                      "name": "${NOMAD_TASK_NAME}",
                      "port": "web",
                      "tags": [
                        "env--integration",
                        "integration",
                        "urlprefix-savagecloud-monitoring.docker.integration.savagebeast.com/",
                        "probe-check-https--savagecloud-monitoring.docker.integration.savagebeast.com",
                        "project--savagecloud-monitoring",
                        "docker_tag--master-latest",
                        "nomad_job--${NOMAD_JOB_NAME}",
                        "nomad-alloc--${NOMAD_ALLOC_ID}",
                        "nomad_task--${NOMAD_JOB_NAME}-${NOMAD_GROUP_NAME}-${NOMAD_TASK_NAME}"
                      ]
                    }
                  ],
                  "template": [
                    {
                      "data": "{{ with secret \"pki/issue/savagecloud-monitoring\" \"ttl=1h\" \"common_name=savagecloud-monitoring.service.consul\" \"alt_names=savagecloud-monitoring.docker.integration.savagebeast.com\" (env \"attr.unique.network.ip-address\" | printf  \"ip_sans=%s\") }}{{ .Data.certificate }}\n{{ range .Data.ca_chain }}{{ . }}\n{{ end }}{{ end }}",
                      "destination": "/secrets/service.consul.crt",
                      "splay": "1h"
                    },
                    {
                      "data": "{{ with secret \"pki/issue/savagecloud-monitoring\" \"ttl=1h\" \"common_name=savagecloud-monitoring.service.consul\" \"alt_names=savagecloud-monitoring.docker.integration.savagebeast.com\" (env \"attr.unique.network.ip-address\" | printf  \"ip_sans=%s\") }}{{ .Data.private_key }}{{ end }}",
                      "destination": "/secrets/service.consul.key",
                      "splay": "1h"
                    }
                  ],
                  "vault": {
                    "policies": [
                      "savagecloud-monitoring"
                    ]
                  }
                }
              }
            ]
          }
        }
      ],
      "meta": {
        "BRANCH": "master",
        "DEPLOY_MODE": "ci",
        "DOCKER_TAG": "master-latest",
        "GIT_REPO": "ssh://[email protected]:2222/sysad/savagecloud-monitoring.git",
        "LEVANT_FILE": "integration.yml",
        "NOMAD_FILE": "nomad.hcl",
        "PROJECT": "savagecloud-monitoring"
      },
      "namespace": "savagecloud-monitoring",
      "region": "integration",
      "type": "service",
      "update": {
        "auto_revert": false,
        "healthy_deadline": "30s",
        "max_parallel": 1,
        "min_healthy_time": "2s",
        "stagger": "30s"
      }
    }
  }
}

@lgfa29
Contributor

lgfa29 commented Jul 19, 2023

Hi @gmichalec-pandora 👋

Thanks for the report and apologies for the bug.

I have not been able to reproduce this yet. Would you be able to check a few things for me:

  1. Are all servers upgraded to Nomad 1.6.0?
  2. What's the output of nomad namespace inspect savagecloud-monitoring?

@lgfa29 lgfa29 self-assigned this Jul 19, 2023
@lgfa29 lgfa29 added the theme/node-pools label Jul 19, 2023
@gmichalec-pandora
Author

All of the servers in this region have been upgraded to 1.6.0. We have servers in other regions running various older Nomad versions (we keep running into issues that prevent us from doing a fleet-wide upgrade).

$ nomad server members | grep alive | awk '{print $7}' | sort | uniq -c
      3 0.12.5+ent
      6 1.4.2+ent
      3 1.4.8+ent
      3 1.5.3+ent
      3 1.5.6+ent
      3 1.6.0+ent

FWIW, we also get an error in the web UI in the 1.6.0 region when we try to select any other region: rpc error: rpc: can't find service NodePool.List

Here's the command output:

nomad namespace inspect savagecloud-monitoring 
{
    "Capabilities": null,
    "CreateIndex": 7030606,
    "Description": "Project: savagecloud-monitoring, Team: sysad Slack: #sysad",
    "Meta": null,
    "ModifyIndex": 7030606,
    "Name": "savagecloud-monitoring",
    "Quota": ""
}

Thanks for looking into this

@lgfa29
Contributor

lgfa29 commented Jul 20, 2023

Thanks for the extra info @gmichalec-pandora.

I tried to reproduce the panic in a multi-region environment, but still no luck. Could you describe the steps you took to reach this state?

More specifically:

  1. Is the 1.6.0+ent region new or was it upgraded from a previous version?
  2. Which Nomad version is the authoritative region running?
  3. Which Nomad version received the job plan request?

@gmichalec-pandora
Author

  • the 1.6.0 region was upgraded from 1.5.6
  • the authoritative region (called 'global') is running 1.4.8+ent
  • the plan is being submitted to the 1.6.0 region ("integration")

@gmichalec-pandora
Author

The 1.6.0 leader logs also show lots of these:

2023-07-20T11:59:15.945-0700 [ERROR] nomad: failed to fetch node pools from authoritative region: error="rpc error: rpc: can't find service NodePool.List"
2023-07-20T11:59:15.946-0700 [ERROR] nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: rpc: can't find method ACL.ListAuthMethods"
2023-07-20T11:59:15.953-0700 [ERROR] nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: rpc: can't find method ACL.ListBindingRules"
2023-07-20T11:59:46.014-0700 [ERROR] nomad: failed to fetch node pools from authoritative region: error="rpc error: rpc: can't find service NodePool.List"
2023-07-20T11:59:46.015-0700 [ERROR] nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: rpc: can't find method ACL.ListAuthMethods"
2023-07-20T11:59:46.022-0700 [ERROR] nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: rpc: can't find method ACL.ListBindingRules"
2023-07-20T12:00:16.083-0700 [ERROR] nomad: failed to fetch ACL auth-methods from authoritative region: error="rpc error: rpc: can't find method ACL.ListAuthMethods"
2023-07-20T12:00:16.084-0700 [ERROR] nomad: failed to fetch node pools from authoritative region: error="rpc error: rpc: can't find service NodePool.List"
2023-07-20T12:00:16.090-0700 [ERROR] nomad: failed to fetch ACL binding rules from authoritative region: error="rpc error: rpc: can't find method ACL.ListBindingRules"

@lgfa29
Contributor

lgfa29 commented Jul 20, 2023

Thanks! Those are somewhat expected because the newer versions of Nomad are trying to retrieve data that doesn't exist in older versions of Nomad (such as ACL auth methods and node pools).

We were able to identify the issue. We have some logic to upgrade namespaces on update here:

for _, ns := range namespaces {
	// Handle upgrade path.
	ns.Canonicalize()
	if err := s.upsertNamespaceImpl(index, txn, ns); err != nil {
		return err
	}
}

But this only gets applied to recent transactions. Older Raft logs are compacted and restored from the snapshot, and we're missing a similar handling here:

// NamespaceRestore is used to restore a namespace
func (r *StateRestore) NamespaceRestore(ns *structs.Namespace) error {
	if err := r.txn.Insert(TableNamespaces, ns); err != nil {
		return fmt.Errorf("namespace insert failed: %v", err)
	}
	return nil
}
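
In essence, the restore path needs the same canonicalization as the upsert path. A minimal sketch of that shape (a sketch only, assuming the restore path just needs the matching Canonicalize call; the actual change in #18017 may differ):

// NamespaceRestore is used to restore a namespace
func (r *StateRestore) NamespaceRestore(ns *structs.Namespace) error {
	// Handle upgrade path: namespaces restored from snapshots written by
	// older versions may predate node pools, so canonicalize before insert.
	// (Assumed fix, mirroring the upsert loop above.)
	ns.Canonicalize()

	if err := r.txn.Insert(TableNamespaces, ns); err != nil {
		return fmt.Errorf("namespace insert failed: %v", err)
	}
	return nil
}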

We should have a fix for it shortly. Thanks again for the detailed report.

@lgfa29
Contributor

lgfa29 commented Jul 24, 2023

Just noting here that the fix has been released in Nomad 1.6.1.

Thank you again for the report!

@gmichalec-pandora
Author

Thank you! We upgraded to 1.6.1 and everything looks good now!
