Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nomad server panic with SIGSEGV signal when using CSI volumes. #11174

Closed
CarbonCollins opened this issue Sep 13, 2021 · 5 comments · Fixed by #11185
Closed

Nomad server panic with SIGSEGV signal when using CSI volumes. #11174

CarbonCollins opened this issue Sep 13, 2021 · 5 comments · Fixed by #11185

Comments

@CarbonCollins
Copy link
Contributor

CarbonCollins commented Sep 13, 2021

Nomad version

1.1.4

Operating system and Environment details

OS: Rasbian 10 (buster)
Arch: armv7l
Kernel: 5.10.60-v7l+

Running on RasPi 4 with SSD boot drive.

Issue

I’m having trouble with my 3 Nomad servers where each of them is refusing to run for more than a few seconds before going into a panic. I recently started doing some testing with CSI volumes which as far as I can tell seems to be what’s causing the panic as its the last error in the logs before the SIGSEGV signal is thrown.

My clients nodes seem to be fine (minus the fact that there are no active servers right now).

The 3 servers boot and seem to fail with very similar errors (see server logs below)

Reproduction steps

Unfortunately I do not know exactly how to reproduce as I did not notice the issue was occurring until I was failing to submit jobs. I had recently restarted the machines (not all at once but they would have been in succession through an ansible playbook). Once I had noticed the error it occurs every time shortly after the server starts.

Expected Result

Nomad server to start without exiting with SIGSEGV panic

Actual Result

Nomad server runs for a few seconds while it initialises before panicking with a runtime error and then exiting with the SIGSEGV signal.

Job file (if appropriate)

CSI controller job:

job "storage-plugin-axion-controller" {
  type = "service"
  region = "global"
  datacenters = ["proxima"]

  priority = 80

  group "controller" {
    task "controller" {
      driver = "docker"

      config {
        image = "[[ .democraticImage ]]"

        args = [
          "--csi-mode=controller",
          "--csi-version=1.2.0",
          "--csi-name=org.democratic-csi.smb-client",
          "--driver-config-file=${NOMAD_TASK_DIR}/driver-config-file.yaml",
          "--server-socket=/csi-data/csi.sock"
        ]
      }

      vault {
        policies = ["cluster-storage"]
      }

      csi_plugin {
        id = "axion-proxima"
        type = "controller"
        mount_dir = "/csi-data"
      }

      template {
        data = <<EOH
{{ with secret "cluster-storage/proxima" }}
driver: smb-client
instance_id:
smb:
  shareHost: {{ index .Data.data "AXION_HOST" }}
  shareBasePath: "{{ index .Data.data "AXION_SHARE" }}"
  controllerBasePath: "/storage"
  dirPermissionsMode: "0777"
  dirPermissionsUser: 1005
  dirPermissionsGroup: 1005
{{ end }}
        EOH

        destination = "${NOMAD_TASK_DIR}/driver-config-file.yaml"
      }

      resources {
        cpu = 30
        memory = 50
      }
    }
  }
}

CSI node job

job "storage-plugin-axion-node" {
  type = "system"
  region = "global"
  datacenters = ["proxima"]

  priority = 79

  group "node" {
    task "node" {
      driver = "docker"

      config {
        image = "[[ .democraticImage ]]"

        args = [
          "--csi-mode=node",
          "--csi-version=1.2.0",
          "--csi-name=org.democratic-csi.smb-client",
          "--driver-config-file=${NOMAD_TASK_DIR}/driver-config-file.yaml",
          "--server-socket=/csi-data/csi.sock"
        ]

        privileged = true
      }

      vault {
        policies = ["cluster-storage"]
      }

      csi_plugin {
        id = "axion-proxima"
        type = "node"
        mount_dir = "/csi-data"
      }

      template {
        data = <<EOH
{{ with secret "cluster-storage/proxima" }}
driver: smb-client
instance_id:
smb:
  shareHost: {{ index .Data.data "AXION_HOST" }}
  shareBasePath: "{{ index .Data.data "AXION_SHARE" }}"
  controllerBasePath: "/storage"
  dirPermissionsMode: "0777"
  dirPermissionsUser: 1005
  dirPermissionsGroup: 1005
{{ end }}
        EOH

        destination = "${NOMAD_TASK_DIR}/driver-config-file.yaml"
      }

      resources {
        cpu = 30
        memory = 50
      }
    }
  }
}

These jobs are slight variations from the democratic-csi docs page: https://github.com/democratic-csi/democratic-csi/blob/master/docs/nomad.md

Nomad Server logs (if appropriate)

Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.387+0200 [INFO]  nomad: adding server: server="proxima-b.global (Addr: 192.168.20.98:4647) (DC: proxima)"
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.390+0200 [INFO]  nomad: serf: EventMemberJoin: proxima-f.global 192.168.20.94
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.391+0200 [INFO]  nomad: serf: EventMemberJoin: proxima-e.global 192.168.20.95
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.391+0200 [INFO]  nomad: serf: Re-joined to previously known node: proxima-e.global: 192.168.20.95:4648
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.391+0200 [INFO]  nomad: adding server: server="proxima-f.global (Addr: 192.168.20.94:4647) (DC: proxima)"
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.392+0200 [INFO]  nomad: adding server: server="proxima-e.global (Addr: 192.168.20.95:4647) (DC: proxima)"
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.593+0200 [WARN]  nomad.raft: failed to get previous log: previous-index=484732 last-index=484718 error="log not found"
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.774+0200 [WARN]  nomad.raft: rejecting vote request since we have a leader: from=192.168.20.95:4647 leader=192.168.20.94:4647
Sep 11 14:52:18 proxima-b nomad[547]:     2021-09-11T14:52:18.845+0200 [WARN]  nomad.raft: failed to get previous log: previous-index=484734 last-index=484718 error="log not found"
Sep 11 14:52:19 proxima-b nomad[547]:     2021-09-11T14:52:19.136+0200 [ERROR] nomad.fsm: CSIVolumeClaim failed: error=unschedulable
Sep 11 14:52:19 proxima-b nomad[547]: panic: runtime error: invalid memory address or nil pointer dereference
Sep 11 14:52:19 proxima-b nomad[547]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x14 pc=0x702fa8]
Sep 11 14:52:19 proxima-b nomad[547]: goroutine 12 [running]:
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/go-immutable-radix.(*Iterator).Next(0x440f490, 0x0, 0x0, 0x0, 0x0, 0x47cfb64, 0x0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/[email protected]/iter.go:178 +0x9c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/go-memdb.(*radixIterator).Next(0x467fd30, 0x46cd040, 0x42ab1a0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/[email protected]/txn.go:895 +0x20
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad/state.upsertNodeCSIPlugins(0x4612180, 0x44a8fd0, 0x7639f, 0x0, 0x440f3c0, 0x0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/state/state_store.go:1251 +0x27c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad/state.upsertNodeTxn(0x4612180, 0x7639f, 0x0, 0x44a8fd0, 0x4612180, 0x4068a80)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/state/state_store.go:856 +0x500
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad/state.(*StateStore).UpsertNode(0x40c5d70, 0x1500, 0x7639f, 0x0, 0x44a8fd0, 0x0, 0x0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/state/state_store.go:804 +0x9c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad.(*nomadFSM).applyUpsertNode(0x3ca5c00, 0x76300, 0x4458801, 0x1557, 0x1557, 0x7639f, 0x0, 0x0, 0x0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/fsm.go:352 +0x140
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/nomad/nomad.(*nomadFSM).Apply(0x3ca5c00, 0x43e1710, 0x6ba71202, 0x3)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/nomad/nomad/fsm.go:211 +0x190
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/raft.(*Raft).runFSM.func1(0x41b3dd0)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/[email protected]/fsm.go:90 +0x204
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/raft.(*Raft).runFSM.func2(0x468a600, 0x40, 0x40)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/[email protected]/fsm.go:113 +0x5c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/raft.(*Raft).runFSM(0x4110800)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/[email protected]/fsm.go:219 +0x27c
Sep 11 14:52:19 proxima-b nomad[547]: github.com/hashicorp/raft.(*raftState).goFunc.func1(0x4110800, 0x3cdf060)
Sep 11 14:52:19 proxima-b nomad[547]:         github.com/hashicorp/[email protected]/state.go:146 +0x50

Nomad Server config

All servers are Debian 10 armv7l

datacenter = "proxima"
name = "proxima-b"
region = "global"

data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"

server {
  enabled = true

  server_join {
    retry_join = ["192.168.20.95", "192.168.20.94"]
  }
}

I have excluded the vault, consul, advertise, and ports stanzas from the above config.

Nomad Client logs (if appropriate)

no logs for client

Nomad Client config

The clients have a range of different architectures (amd64, armv7l, and aarch64) and OS's (debian 10, Manjaro, Manjaro ARM, and macOS Big Sur)

datacenter = "proxima"
name = "proxima-d"
region = "global"

data_dir = "/opt/nomad/data"
plugin_dir = "/opt/nomad/plugins"

client {
  enabled = true

  chroot_env {
    "/bin" = "/bin"
    "/etc" = "/etc"
    "/lib" = "/lib"
    "/lib32" = "/lib32"
    "/lib64" = "/lib64"
    "/run/resolvconf" = "/run/resolvconf"
    "/sbin" = "/sbin"
    "/usr" = "/usr"
    "/opt/nomad/plugins" = "/opt/nomad/plugins"
    "/opt/nomad/bin" = "/opt/nomad/bin"
  }

  meta = {
    connect.gateway_image = "envoyproxy/envoy:v1.18.3"
    connect.sidecar_image = "envoyproxy/envoy:v1.18.3"
  }
}

plugin "usb" {
  config {
    enabled = true
    included_vendor_ids = [0x1CF1]
  }
}

plugin "docker" {
  config {
    allow_privileged = true
  }
}


I have excluded the vault, consul, advertise, ports, host volumes, and network stanzas from the above config.

Other

I originally raised a discussion on the forum so for cross linking purposes: https://discuss.hashicorp.com/t/nomad-servers-crash-within-a-few-seconds-of-starting-with-sigsegv-panic/29397

@lgfa29
Copy link
Contributor

lgfa29 commented Sep 14, 2021

Hi @CarbonCollins, thanks for the report.

I tried to reproduce this issue, but no luck so far. It seems like your server data may may be corrupted somehow?

[WARN]  nomad.raft: failed to get previous log: previous-index=484734 last-index=484718 error="log not found"

To give more details, the panic happens when Nomad tries to remove CSI plugin information from its internal database. In order to do this, it iterates over the list of registered plugins, and removes the ones that don't have a node or controller associated with it.

The problem is that, altering the table in the same transaction as an iteration is happening can cause the iterator state to become invalid. Consul actually had a similar error reported a while back (Nomad and Consul use the same database internally) and provided a fix.

Would you be able to build and test Nomad from this branch that contains an updated version of go-memdb?

Also, if you don't have any sensitive information if your cluster, would you mind sending us one of your server's data_dir to [email protected]? I will help us understand how your cluster ended up in this state.

Thank you!

@lgfa29
Copy link
Contributor

lgfa29 commented Sep 15, 2021

Ops, GithHub closed this by accident.

@lgfa29 lgfa29 reopened this Sep 15, 2021
@CarbonCollins
Copy link
Contributor Author

I unfortunately had to reset the servers so I no longer have the data_dir to be able to send.. Thinking back I should have made a copy of it before resetting 😅

I can still try with the test build on the new cluster if that will still be of any help?

Just as some observations with the new setup though I noticed that one of the CSI node containers was marked as unhealthy with the following error:

error: failed to create csi client: failed to open grpc connection to addr: /var/lib/nomad/client/csi/node/axion-proxima/csi.sock, err: context deadline exceeded

After this any volumes that were in use on that node still had an allocation marked even though the job using them has stopped.
I now have 1 allocation shown on the summary page:
image
and when I start the job that used the volume again I get two allocations, one complete, and one running:
image
however beforehand if the job was stopped the allocation would drop down to 0 again.

This also occurred before I started having issues with the servers before the SIGSEGV issue popped up.

Not sure if this allocation count is intentional or if this information is of use in anyway but i thought I would include it due to the events being similar to before.

Just to re-ask would testing the updated version of go-memdb still be of any help?

@lgfa29
Copy link
Contributor

lgfa29 commented Sep 15, 2021

I unfortunately had to reset the servers so I no longer have the data_dir to be able to send.. Thinking back I should have made a copy of it before resetting 😅

No worries. Priority number 1 is getting things running 😄

Just as some observations with the new setup though I noticed that one of the CSI node containers was marked as unhealthy with the following error:

error: failed to create csi client: failed to open grpc connection to addr: /var/lib/nomad/client/csi/node/axion-proxima/csi.sock, err: context deadline exceeded

After this any volumes that were in use on that node still had an allocation marked even though the job using them has stopped.
I now have 1 allocation shown on the summary page:

however beforehand if the job was stopped the allocation would drop down to 0 again.

This also occurred before I started having issues with the servers before the SIGSEGV issue popped up.

Not sure if this allocation count is intentional or if this information is of use in anyway but i thought I would include it due to the events being similar to before.

Hum...that's interesting. I wonder if this could be the cause for things like #10927, where Nomad thinks that the volume is still being used. A plugin allocation count is also used to determine if the plugin is still being used, so a miscount there could also cause your initial problem, where the plugin gets GC while the list of plugins is iterated on.

Just to re-ask would testing the updated version of go-memdb still be of any help?

I think at this point it won't make a difference. I wanted to see if it would help prevent the panic, so we're good on this. Thanks.

I will close this issue for now, since you were able to recreate your cluster, but feel free to reach out if you ever hit that panic again 🙂

@lgfa29 lgfa29 closed this as completed Sep 15, 2021
@github-actions
Copy link

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 16, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants