[feature] gracefully handle corrupt startup state #1367
Comments
This would be neat! I experience this fairly often when Docker hangs and I have to force-reboot an instance. As a temporary workaround, I've written health checks for backends that check the state of Docker and Nomad and, if either fails, let Google Compute Engine automatically destroy and reprovision the instance. |
I would be interested to know if you continue seeing this after 0.4.0 comes out.
|
+1 for this feature. Occasionally I see this issue with Nomad 0.4.0
It turns out that some state.json files are empty.
Deleting /nomad_directory/ helped to overcome the issue. Not sure what caused this. It might have been the shutdown of the nodes. There was no shortage of disk space. |
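Before deleting the whole data directory, it may help to confirm which state files are actually empty. A minimal sketch, assuming the pre-0.6 layout described in this thread (a tree of state.json files); the function name `find_empty_state` and the directory you pass in are hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: list zero-byte state.json files under a data directory.
# The directory argument is an assumption; pass your own Nomad data_dir.
find_empty_state() {
  local data_dir="$1"
  # -size 0 matches regular files that contain no data at all.
  find "$data_dir" -type f -name 'state.json' -size 0
}
```

Calling `find_empty_state /nomad_directory` prints one path per empty file and nothing if none are found. It only detects empty files, not truncated or otherwise invalid JSON.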
+1 |
@flegler is that usually after a hard server crash? |
@a86c6f7964 I also see with |
Yeah, that is what #1357 was supposed to do. It just doesn't do an fsync, which is why I was asking whether it happened on a server crash. Makes me think we missed something. |
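For readers unfamiliar with why the missing fsync matters: a common pattern is to write to a temporary file, flush it to disk, and only then rename it over the target. The sketch below illustrates that general pattern in shell; it is not Nomad's code and not what #1357 does. It assumes a dd that supports `conv=fsync` (GNU coreutils does), and `atomic_write` is a hypothetical name.

```shell
#!/bin/bash
# Sketch: write stdin to a file atomically, flushing data to disk before the rename.
# Assumes dd supports conv=fsync (GNU coreutils) and that the temp file
# lives on the same filesystem as the target.
atomic_write() {
  local target="$1"
  local tmp
  tmp=$(mktemp "${target}.tmp.XXXXXX") || return 1
  # conv=fsync makes dd physically write the output file before it exits.
  if dd of="$tmp" conv=fsync 2>/dev/null; then
    # A rename within one filesystem is atomic: readers see the old or the
    # new content, never a half-written file.
    mv -f "$tmp" "$target"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Usage would look like `printf '%s' "$json" | atomic_write /path/to/state.json`. Note that mktemp creates the file with restrictive permissions, and that full durability across a power loss generally also requires syncing the containing directory, which this sketch leaves out.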
@a86c6f7964 I am not sure, but it is possible that the nodes were shut down without ending Nomad first. |
Another issue today on 0.4.0: the client would not start. This time it was not related to JSON files. Does anybody know what could be wrong here?
Deleting the Nomad data directory helped again. I kept a copy in case you need some information from it. |
Another variation today with task runner:
This is happening in a cloud environment where services are usually not shut down before nodes/machines are shut down. |
@flegler Yes it would be awesome if you could post those up |
For my initial work on this issue, I'm going to do the following:
If we determine there are unavoidable environmental factors that could lead to state files being truncated or corrupted while at rest (e.g. unclean shutdowns and restarts), then we have a plan for dealing with them gracefully. However, we'd like to ensure there are no bugs in persisting state before we focus on defensive mitigations that could cover up solvable bugs. |
Prevent interleaving state syncs as it could conceivably lead to empty state files as per #1367
Dropped the milestone but keeping this open so we can track it if anyone hits the issue again. 0.5.0 fixes some potential causes of corruption and adds sanity checks. Hopefully those were all the issues! |
Hey guys, I also hit this issue in v0.4.1: the task state.json is empty.
|
@yangzhares Any chance you can upgrade to 0.5.0? That's where some fixes and sanity checks were introduced to address this. |
@schmichael I am encountering something related to this on Nomad starts just fine.
I can schedule jobs onto my clients, however, these messages still appear on the clients.
If I run these commands, I can stop the errors from appearing.
It would be really beneficial if the logs had a timestamp. |
Same in Nomad 0.5.4:
Loaded configuration from /etc/nomad.d/nomad.hcl, /etc/nomad.d/nomad_client.hcl
rm -R /opt/nomad/client/alloc/ && systemctl start nomad |
I'm closing this as 0.6 substantially changes the state persistence code. Nomad now uses a single boltdb file instead of a tree of JSON files. This made atomic state changes much easier and hopefully fixed the corrupt state issues (along with some other minor improvements to the persistence code in 0.6). I've also added more verbose logging for when state issues are encountered in #2763. Please don't hesitate to open new issues if state restore bugs continue in 0.6! (I opted against writing more sophisticated cleanup/recovery code at this time, as it seems likely to cause more problems than it fixes until we know the exact form new persistence issues take.) |
Similar problems on:
Here's a little script that helps restore the node quickly. Please think twice before you run this on your box! Especially on production nodes.
#!/bin/bash
# Record the allocation mounts before stopping the agent.
mount_points=$(mount | grep /var/nomad/alloc | awk '{print $3}')
echo "Stopping Nomad..."
systemctl stop nomad
echo "Killing processes using /var/nomad/alloc/..."
for pid in $(lsof | grep '/var/nomad/alloc' | awk '{print $2}' | sort -u)
do
kill -TERM "$pid"
done
# Only unmount once no Nomad process is left running.
nomad_procs=$(pgrep -fc /usr/local/bin/nomad)
if [[ "$nomad_procs" == "0" ]]
then
for mount_point in $mount_points
do
echo "Unmounting $mount_point"
umount "$mount_point"
done
fi
echo "Removing /var/nomad/alloc and /var/nomad/client..."
rm -r /var/nomad/alloc
rm -r /var/nomad/client
echo "Starting Nomad"
systemctl start nomad
|
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. |
As mentioned in #1170, if any of the state files are corrupted when a Nomad client starts up, it will never be able to start.
I think that on startup the client should make a best-effort attempt to recover the state of the machine. If the state files are corrupted, though, it may be best to shut down any tasks that are still running and delete all of the allocation state information on disk. That way the client will still start up, and the scheduler can figure out a new home for the tasks.
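As a rough illustration of that fallback, here is a sketch of a wrapper that could run before the client starts. It is not part of Nomad: the function name `quarantine_if_corrupt` and both directory arguments are hypothetical, it only treats zero-byte state.json files as corruption, and it moves the state aside instead of deleting it. It does not stop tasks that are still running, which the proposal above would also require.

```shell
#!/bin/bash
# Sketch of a pre-start check (hypothetical, not Nomad behavior):
# if any state.json is empty, move the allocation state out of the way
# so the client can start clean. Run only while the agent is stopped.
quarantine_if_corrupt() {
  local alloc_dir="$1" backup_dir="$2"
  # An empty state.json is the only form of corruption this sketch detects.
  if [ -n "$(find "$alloc_dir" -type f -name 'state.json' -size 0 2>/dev/null)" ]; then
    mkdir -p "$backup_dir" || return 1
    # Keep a timestamped copy for later debugging instead of deleting it.
    mv "$alloc_dir" "$backup_dir/alloc.$(date +%s)" || return 1
    echo "quarantined"
  else
    echo "clean"
  fi
}
```

Keeping the moved directory around preserves the evidence the maintainers asked for earlier in the thread, at the cost of some disk space.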