
Nomad goes to corrupt state after restart #4748

Closed

LuciferInLove opened this issue Oct 4, 2018 · 4 comments

Comments


LuciferInLove commented Oct 4, 2018

Hello. I'm testing Nomad in our environment and have several jobs like this one:

job "stolon-sentinel" {
  region = "eu"
  datacenters = ["dev"]
  type = "service"

  constraint {
    attribute = "${meta.host_group}"
    value     = "storage"
  }

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "stolon-sentinel" {
    count = 2

    restart {
      attempts = 2
      interval = "10m"
      delay = "15s"
      mode = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "stolon-sentinel" {
      driver = "docker"
      config {
        image = "stolon-sentinel:latest"
        command = "/bin/bash"
        args = ["-c", "stolon-sentinel --cluster-name dev --store-backend consul --store-endpoints http://${attr.unique.network.ip-address}:8500"]
        interactive = true
        logging {
          type = "json-file"
          config {
            max-size = "25m"
            max-file = "4"
          }
        }
      }

      logs {
        max_files     = 4
        max_file_size = 25
      }

      resources {
        memory = 20
      }

      kill_timeout = "20s"
    }
  }
}

Sometimes when I restart the Nomad client via systemctl, I get a message like this:

Oct  4 10:36:04 nomad: 2018/10/04 10:36:04.175211 [ERR] client: failed to restore state for alloc "91a42207-f278-16ac-4f57-2fe857f67979": failed to read allocation state: failed to read alloc runner alloc state: no data at key alloc
Oct  4 10:36:04 nomad: 2018/10/04 10:36:04.175226 [ERR] client: failed to restore state: 1 error(s) occurred:
Oct  4 10:36:04 nomad: * failed to read allocation state: failed to read alloc runner alloc state: no data at key alloc
Oct  4 10:36:04 nomad: 2018/10/04 10:36:04.175230 [ERR] client: Nomad is unable to start due to corrupt state. The safest way to proceed is to manually stop running task processes and remove Nomad's state ("/var/lib/nomad/client") and alloc ("/var/lib/nomad/alloc") directories before restarting. Lost allocations will be rescheduled.
Oct  4 10:36:04 nomad: 2018/10/04 10:36:04.175233 [ERR] client: Corrupt state is often caused by a bug. Please report as much information as possible to https://github.com/hashicorp/nomad/issues

I don't delete any files manually before the restart. After unmounting the secrets mounts and deleting those directories, Nomad starts correctly. Why does this happen?
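For reference, the recovery the error message suggests (and that I perform by hand) can be sketched as a script. This is a hedged sketch, not an official procedure: the dry-run guard, the NOMAD_DATA_DIR variable, and the grep-based unmount step are my own assumptions; the paths match the data_dir from the agent config below. With DRY_RUN=1 (the default) it only prints the commands.

```shell
#!/usr/bin/env bash
set -u

# Assumption: matches data_dir = "/var/lib/nomad" in the agent config.
NOMAD_DATA_DIR="${NOMAD_DATA_DIR:-/var/lib/nomad}"

# Dry-run guard: print the command instead of executing it unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

recover() {
  run systemctl stop nomad
  # Stop any task processes still running first (Docker containers in this setup).
  # Unmount leftover secrets tmpfs mounts under the alloc dir before deleting it.
  run sh -c "mount | grep \"$NOMAD_DATA_DIR/alloc\" | awk '{print \$3}' | xargs -r umount"
  # Remove the state and alloc directories named in the error message.
  run rm -rf "$NOMAD_DATA_DIR/client" "$NOMAD_DATA_DIR/alloc"
  run systemctl start nomad
}

recover
```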

Agent config:

data_dir   = "/var/lib/nomad"
datacenter = "dev"
region     = "eu"
log_level  = "INFO"
bind_addr  = "172.31.1.112"

advertise {
  http = "172.31.1.112"
  rpc  = "172.31.1.112"
  serf = "172.31.1.112"
}

server {
  enabled          = true
  bootstrap_expect = 2
  raft_protocol    = 3
  server_join {
    retry_join     = ["provider=aws tag_key=Type tag_value=dev region=eu-west-1"]
    retry_max      = 3
    retry_interval = "5s"
  }
}

client {
  enabled = true
  server_join {
    retry_join     = ["provider=aws tag_key=Type tag_value=dev region=eu-west-1"]
    retry_max      = 3
    retry_interval = "5s"
  }
  options = {
    "driver.whitelist"         = "docker"
    "driver.blacklist"         = "qemu,exec,java,lxc,rkt,raw_exec"
    "driver.docker"            = "1"
    "docker.auth.config"       = "/etc/docker/config.json"
  }
  meta {
    "host_group" = "storage"
  }
}

consul {
  address = "172.31.1.112:8500"
}

Nomad agent runs in client and server mode.

Nomad version

0.8.6

Operating system and Environment details

Amazon Linux 2 4.14.70-72.55.amzn2.x86_64

dadgar (Contributor) commented Oct 4, 2018

Closing in favor of #2560

Hopefully some work in 0.9.0 will be able to close this out.

pashinin commented Apr 18, 2019

I have this exact problem with Nomad 0.9 (and had it with 0.8.6).
It happens on every restart on Arch Linux and sometimes on Debian 9.
It looks like Nomad does not stop everything it should and ends up with corrupted state files.
On each restart I have to delete the "alloc" and "client" folders and then restart Nomad.
Really annoying. Please reopen.

schmichael (Member) commented Apr 18, 2019

@pashinin The code involved changed substantially between 0.8 and 0.9. Could you please open a new issue for the problem in 0.9, and we'll look into it ASAP! If there's information (job files, logs, etc.) you don't feel comfortable sharing publicly, you can submit it via email to [email protected]. Please still open a new issue so we can track its resolution publicly.

github-actions (bot) commented
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators Nov 24, 2022

4 participants