Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] gracefully handle corrupt startup state #1367

Closed
camerondavison opened this issue Jun 28, 2016 · 20 comments
Closed

[feature] gracefully handle corrupt startup state #1367

camerondavison opened this issue Jun 28, 2016 · 20 comments

Comments

@camerondavison
Copy link
Contributor

As is mentioned in #1170 , if any of the state files are corrupted when a nomad client start up then it will never be able to start.

I think that on start up it should be considered (best effort) to try and recover the state of the machine but in the event that state files are corrupted maybe it would be best to make sure and shutdown any tasks that are still running, and delete all of the allocation state information on disk. This way it will still start up, but let the scheduler figure out a new home for the task.

@ghost
Copy link

ghost commented Jun 28, 2016

This would be neat! I experience this fairly occasionally when docker hangs and I have to force reboot an instance.

I've built a temporary work around where I've written health checks for backends that checks the state of docker and nomad, and, if either fail, let Google Compute Engine automatically destroy and reprovision the instance.

@camerondavison
Copy link
Contributor Author

I would be interested to know if you continue seeing this after 0.4.0 comes
out because of #1357

On Tue, Jun 28, 2016, 4:26 PM George Robinson [email protected]
wrote:

This would be neat! I experience this fairly occasionally when docker
hangs and I have to force reboot an instance.

I've built a temporary work around where I've written health checks for
backends that checks the state of docker and nomad are healthy, and, if
either fail, let Google Compute Engine automatically destroy and
reprovision the instance.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#1367 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AAKGBYVKtnIC8qndmvslKZB4sbbsJ60Kks5qQZGZgaJpZM4JAh0G
.

@flegler-zz
Copy link

+1 for this feature.

Occasionally I see this issue with Nomad 0.4.0

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input

It turns out that some state.json file are empty

# /...path.../nomad version
Nomad v0.4.0
# find /...nomad_directory.../ -name state.json | xargs ls -la
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/state.json
-rw------- 1 root root 2760 Aug  8 12:01 /nomad_directory/client/client/alloc/5612e3fc-078d-0a08-049a-b8169d50920c/task-a519f3bf481a344e286bbdab7d3cfe5f/state.json
-rw------- 1 root root    0 Aug 11 13:13 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/state.json
-rw------- 1 root root    0 Aug  8 19:52 /nomad_directory/client/client/alloc/643ace46-34e0-7861-c26e-43cf5c82a6c1/task-e71f891226c0be65a6a1d38f954cc4b8/state.json

Deleting /nomad_directory/ helped to overcome the issue.

Not sure what caused this. It might have been the shutdown of the nodes. There was no shortage of disk space.

@capone212
Copy link
Contributor

+1
Experience the same problem

@camerondavison
Copy link
Contributor Author

camerondavison commented Aug 17, 2016

@flegler is that usually after a hard server crash?

@ghost
Copy link

ghost commented Aug 17, 2016

@a86c6f7964 I also see with kill -9. It should be simple to do atomic writes - write to a tmp file then do a rename operation - since renaming a file on Linux (and most fs implementations) is an atomic operation.

@camerondavison
Copy link
Contributor Author

Ya that is what #1357 was supposed to do. It just doesn't do an fsync which is why I was asking if it was on server crash. Makes me think we missed something

@flegler-zz
Copy link

@a86c6f7964 I am not sure, but it is possible that the nodes were shut down without ending Nomad first.

@flegler-zz
Copy link

Another issue today on 0.4.0 when the client did not want to start. This time not with JSON files. Anybody knows what could be wrong here?

# /path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 2 error(s) occurred:

* alloc_runner snapshot includes a nil allocation
* alloc_runner snapshot includes a nil context

Deleting the Nomad data directory helped again. I kept a copy in case you need some information from it.

@flegler-zz
Copy link

flegler-zz commented Aug 19, 2016

Another variation today with task runner:

/path/nomad agent -config /path/config.json
    Loaded configuration from /path/config.json
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

* 1 error(s) occurred:

* task runner snapshot include nil Task

This is happening in a cloud environment where services are usually not shut down before nodes/machines are shut down.

@dadgar dadgar added this to the v0.5.0 milestone Aug 24, 2016
@dadgar
Copy link
Contributor

dadgar commented Aug 25, 2016

@flegler Yes it would be awesome if you could post those up

@schmichael
Copy link
Member

For now my initial work on this issue I'm going to do the following:

  • Handle restoring nil contexts properly as it's not an error restore a nil context
  • Add some sanity checks around state saving/restoring

If we determine there are unavoidable environmental factors that could lead to state files being truncated/corrupted while at rest (e.g. unclean shutdowns and restarts), then we have a plan on how to gracefully deal with them. However, we'd like to ensure there's no bugs with persisting state before we focus on defensive mitigations that could cover up solvable bugs.

schmichael added a commit that referenced this issue Sep 2, 2016
Prevent interleaving state syncs as it could conceivably lead to
empty state files as per #1367
@dadgar dadgar removed this from the v0.5.0 milestone Nov 9, 2016
@dadgar
Copy link
Contributor

dadgar commented Nov 9, 2016

Dropped the milestone but keeping this open in case anyone hits any issue we can track it. 0.5.0 includes some potential cases of corruption and adds sanity checks. Hopefully those were all the issues!

@yangzhares
Copy link

hey guys,

i also got this issue in v0.4.1, task state.json is empty.
-rw-------. 1 root root 0 Nov 28 06:47 /var/lib/nomad/client/alloc/5074d3ec-e4b8-9477-46a0-fa187300f84c/task-86a1b907d54bf7010394bf316e183e67/state.json

systemd start log:
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal systemd[1]: Started Nomad.
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal systemd[1]: Starting Nomad...
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal nomad[3624]: Loaded configuration from /etc/nomad/client.hcl
Dec 01 02:08:40 hf1-nomad-agent-01.novalocal nomad[3624]: ==> Starting Nomad agent...
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal nomad[3624]: ==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal nomad[3624]: * failed to decode state: unexpected end of JSON input
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal systemd[1]: nomad.service: main process exited, code=exited, status=1/FAILURE
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal systemd[1]: Unit nomad.service entered failed state.
Dec 01 02:08:44 hf1-nomad-agent-01.novalocal systemd[1]: nomad.service failed.

@schmichael
Copy link
Member

@yangzhares Any chance you can upgrade to 0.5.0? That's where some fixes and sanity checks were introduced to address this.

@pgporada
Copy link
Contributor

pgporada commented Dec 16, 2016

@schmichael I am encountering something related to this on Nomad v0.5.1

Nomad starts just fine.

[root@centos-7-cl1 ~]# systemctl status nomad
● nomad.service - Nomad Agent
   Loaded: loaded (/usr/lib/systemd/system/nomad.service; enabled; vendor preset: disabled)
   Active: active (running) since Fri 2016-12-16 20:54:42 UTC; 1s ago
 Main PID: 10955 (sh)
   Memory: 2.4M
   CGroup: /system.slice/nomad.service
           ├─10955 /bin/sh -c /sbin/nomad agent -config /etc/nomad  >> /var/log/nomad/nomad.log 2>&1
           └─10956 /sbin/nomad agent -config /etc/nomad

Dec 16 20:54:42 centos-7-cl1 systemd[1]: nomad.service holdoff time over, scheduling restart.
Dec 16 20:54:42 centos-7-cl1 systemd[1]: Started Nomad Agent.
Dec 16 20:54:42 centos-7-cl1 systemd[1]: Starting Nomad Agent...

I can schedule jobs onto my clients, however, these messages still appear on the clients.

$ cat /var/log/nomad/nomad.log
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input
    Loaded configuration from /etc/nomad/client.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 2 error(s) occurred:

* failed to decode state: unexpected end of JSON input
* failed to decode state: unexpected end of JSON input
    Loaded configuration from /etc/nomad/client.hcl
==> Starting Nomad agent...

If I run these commands, I can stop the errors from appearing.

systemctl stop nomad
cd /opt/nomad/client/alloc/
rm -rf *
cd /opt/nomad/alloc/
rm -rf *
systemctl restart nomad

It would be really beneficial if the logs had a timestamp.

@jmvacdmon
Copy link
Contributor

jmvacdmon commented May 2, 2017

Same in nomad 0.5.4

Loaded configuration from /etc/nomad.d/nomad.hcl, /etc/nomad.d/nomad_client.hcl
==> Starting Nomad agent...
==> Error starting agent: client setup failed: failed to restore state: 1 error(s) occurred:

  • 2 error(s) occurred:
  • alloc_runner snapshot includes a nil allocation
  • alloc_runner snapshot includes a nil alloc dir

rm -R /opt/nomad/client/alloc/ && systemctl start nomad

@schmichael
Copy link
Member

I'm closing this as 0.6 substantially changes the state persistence code. Nomad now uses a single boltdb file instead of a tree of JSON files. This made atomic state changes much easier and hopefully fixed corrupt state issues (along with some other minor improvements to persistence code in 0.6).

I've also added more verbose logging when state issues are encountered in #2763

Please don't hesitate to open new issues if state restore bugs continue in 0.6!

(I opted against trying to write more sophisticated cleanup/recovery code at this time as it seems more likely to cause more problems than it fixes until we know the exact form new persistence issues take.)

@schmichael schmichael added this to the v0.6.0 milestone Jul 3, 2017
@artkrz
Copy link

artkrz commented Jan 4, 2018

Similar problems on:

$ nomad  -version
Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)

Here's a little script that helps restore the node quickly.

Please think twice before you run this on your box! Especially on production nodes.

#!/bin/bash

mount_points=`mount | grep /var/nomad/alloc | awk {'print $3'}`

echo "Stopping Nomad..."
systemctl stop nomad

echo "Kill processes using /var/nomad/alloc/..."
for pid in `lsof | grep '/var/nomad/alloc' | awk {'print $2'} | sort | uniq`
do
    kill -TERM $pid
done

nomad_procs=`pgrep -fc /usr/local/bin/nomad`
if [[ "$nomad_procs" == "0" ]]
then
    for mount_point in $mount_points
    do
        echo "Unmounting $mount_point"
        umount $mount_point
    done
fi

echo "Removing /var/nomad/alloc and /var/nomad/client..."
rm -r /var/nomad/alloc
rm -r /var/nomad/client

echo "Starting Nomad"
systemctl start nomad

@github-actions
Copy link

github-actions bot commented Dec 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

9 participants