
Nomad goes to corrupt state after restart #4748

Closed

LuciferInLove opened this issue Oct 4, 2018 · 4 comments

Comments


LuciferInLove commented Oct 4, 2018

Hello. I'm testing Nomad in our environment and have several jobs like this one:

job "stolon-sentinel" {
  region = "eu"
  datacenters = ["dev"]
  type = "service"

  constraint {
    attribute = "${meta.host_group}"
    value     = "storage"
  }

  update {
    max_parallel = 1
    min_healthy_time = "10s"
    healthy_deadline = "3m"
    progress_deadline = "10m"
    auto_revert = false
    canary = 0
  }

  migrate {
    max_parallel = 1
    health_check = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  group "stolon-sentinel" {
    count = 2

    restart {
      attempts = 2
      interval = "10m"
      delay = "15s"
      mode = "fail"
    }

    ephemeral_disk {
      size = 300
    }

    task "stolon-sentinel" {
      driver = "docker"
      config {
        image = "stolon-sentinel:latest"
        command = "/bin/bash"
        args = ["-c", "stolon-sentinel --cluster-name dev --store-backend consul --store-endpoints http://${attr.unique.network.ip-address}:8500"]
        interactive = true
        logging {
          type = "json-file"
          config {
            max-size = "25m"
            max-file = "4"
          }
        }
      }

      logs {
        max_files     = 4
        max_file_size = 25
      }

      resources {
        memory = 20
      }

      kill_timeout = "20s"
    }
  }
}

Sometimes when I restart the Nomad client via systemctl, I get a message like this:

Oct  4 10:36:04 nomad: 2018/10/04 10:36:04.175211 [ERR] client: failed to restore state for alloc "91a42207-f278-16ac-4f57-2fe857f67979": failed to read allocation state: failed to read alloc runner alloc state: no data at key alloc
Oct  4 10:36:04 nomad: 2018/10/04 10:36:04.175226 [ERR] client: failed to restore state: 1 error(s) occurred:
Oct  4 10:36:04 nomad: * failed to read allocation state: failed to read alloc runner alloc state: no data at key alloc
Oct  4 10:36:04 nomad: 2018/10/04 10:36:04.175230 [ERR] client: Nomad is unable to start due to corrupt state. The safest way to proceed is to manually stop running task processes and remove Nomad's state ("/var/lib/nomad/client") and alloc ("/var/lib/nomad/alloc") directories before restarting. Lost allocations will be rescheduled.
Oct  4 10:36:04 nomad: 2018/10/04 10:36:04.175233 [ERR] client: Corrupt state is often caused by a bug. Please report as much information as possible to https://github.com/hashicorp/nomad/issues

I don't delete any files manually before the restart. After unmounting the secrets mounts and deleting those directories, Nomad starts correctly. Why does this happen?
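For reference, the recovery the error message suggests (and that I perform by hand) can be sketched as a script. This is a hedged sketch, not an official procedure: the dry-run guard, the NOMAD_DATA_DIR variable, and the grep-based unmount step are my own assumptions; the paths match the data_dir from the agent config below. With DRY_RUN=1 (the default) it only prints the commands.

```shell
#!/usr/bin/env bash
set -u

# Assumption: matches data_dir = "/var/lib/nomad" in the agent config.
NOMAD_DATA_DIR="${NOMAD_DATA_DIR:-/var/lib/nomad}"

# Dry-run guard: print the command instead of executing it unless DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

recover() {
  run systemctl stop nomad
  # Stop any task processes still running first (Docker containers in this setup).
  # Unmount leftover secrets tmpfs mounts under the alloc dir before deleting it.
  run sh -c "mount | grep \"$NOMAD_DATA_DIR/alloc\" | awk '{print \$3}' | xargs -r umount"
  # Remove the state and alloc directories named in the error message.
  run rm -rf "$NOMAD_DATA_DIR/client" "$NOMAD_DATA_DIR/alloc"
  run systemctl start nomad
}

recover
```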

Agent config:

data_dir   = "/var/lib/nomad"
datacenter = "dev"
region     = "eu"
log_level  = "INFO"
bind_addr  = "172.31.1.112"

advertise {
  http = "172.31.1.112"
  rpc  = "172.31.1.112"
  serf = "172.31.1.112"
}

server {
  enabled          = true
  bootstrap_expect = 2
  raft_protocol    = 3
  server_join {
    retry_join     = ["provider=aws tag_key=Type tag_value=dev region=eu-west-1"]
    retry_max      = 3
    retry_interval = "5s"
  }
}

client {
  enabled = true
  server_join {
    retry_join     = ["provider=aws tag_key=Type tag_value=dev region=eu-west-1"]
    retry_max      = 3
    retry_interval = "5s"
  }
  options = {
    "driver.whitelist"         = "docker"
    "driver.blacklist"         = "qemu,exec,java,lxc,rkt,raw_exec"
    "driver.docker"            = "1"
    "docker.auth.config"       = "/etc/docker/config.json"
  }
  meta {
    "host_group" = "storage"
  }
}

consul {
  address = "172.31.1.112:8500"
}

Nomad agent runs in client and server mode.

Nomad version

0.8.6

Operating system and Environment details

Amazon Linux 2 4.14.70-72.55.amzn2.x86_64

dadgar (Contributor) commented Oct 4, 2018

Closing in favor of #2560

Hopefully some work in 0.9.0 will be able to close this out.

pashinin commented Apr 18, 2019

I have this exact problem with Nomad 0.9 (and had it with 0.8.6).
It happens on every restart on Arch Linux and sometimes on Debian 9.
It looks like Nomad does not stop everything it should and ends up with corrupted state files.
On each restart I have to delete the "alloc" and "client" folders and then restart Nomad.
Really annoying. Please reopen.

schmichael (Member) commented Apr 18, 2019

@pashinin The code involved changed substantially between 0.8 and 0.9. Could you please open a new issue for the problem in 0.9, and we'll look into it ASAP! If there's information (job files, logs, etc.) you don't feel comfortable sharing publicly, you can submit it via email to [email protected]. Please still open a new issue so we can track its resolution publicly.

github-actions (bot) commented
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked this issue as resolved and limited conversation to collaborators Nov 24, 2022

4 participants