Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix corrupt state after restart (missing key, bucket) #4807

Closed
wants to merge 15 commits into from

Conversation

jozef-slezak
Copy link

Hello, I have tested recent changes in 0.9.0-dev and I was able again to reproduce (just by restarting a cluster) the corrupted state issue #2560, therefore, I am providing the fix.

Let's explain the fix:

  1. original implementation fails too fast is a key or bucket does not exists
  2. this pull requests contains defensive implementation/fix, basically it fallbacks to initial TaskState and LocalState if the problem occurs
  3. I can imagine that somebody would argue that key/bucket should not be missing at the first place but we need reliable implementation and avoid workarounds based on deletion of local state file (see [feature] gracefully handle corrupt startup state #1367)

Please, do not close this pull request. This is a serious problem for us and it blocks some deployment of Nomad. If something in the pull request needs to be improved then I am eager to help with that.

Jozef

@schmichael
Copy link
Member

@jozef-slezak These fixes are great! Unfortunately they're all covered in another PR: #4803

In fact your checklist is surprisingly similar to one I was working off of internally!

  1. original implementation fails too fast is a key or bucket does not exists

👍 https://github.com/hashicorp/nomad/pull/4803/files#diff-64749d8d8c42d5b993eff64d7f522d00R164

  1. this pull requests contains defensive implementation/fix, basically it fallbacks to initial TaskState and LocalState if the problem occurs

👍 https://github.com/hashicorp/nomad/pull/4803/files#diff-44115f774c7a50972c7d0c97054ea0a4R197 (and they're only overwritten if the loaded values are non-nil)

  1. I can imagine that somebody would argue that key/bucket should not be missing at the first place but we need reliable implementation and avoid workarounds based on deletion of local state file (see [feature] gracefully handle corrupt startup state #1367)

👍 Agreed! The approach in #4803 (with further fixes coming in the b-restore-rerun work-in-progress branch) does not fail on missing data but reverts to an initial state. Right now we still fail on corrupt data (unmarshal/decode errors) with the assumption that something has gone very wrong. We may take a more defensive approach there as well based on further testing.

Please, do not close this pull request. This is a serious problem for us and it blocks some deployment of Nomad. If something in the pull request needs to be improved then I am eager to help with that.

I'm sorry you're blocked by state corruption issues! I was personally hoping to have the code ready for beta testing by HashiConf (next week!), but unfortunately I was unable to make that deadline.

I assure you this is my highest priority after HashiConf, and we will do everything we can to ensure 0.9 will not be released with state corruption errors.

Since the code is still under very active development (which will be paused next week during HashiConf), it's hard to coordinate testing and changes. Feel free to review or test #4803. I'm afraid until TaskRunner.Restore/Driver.RecoverTask is fully implemented, it will be difficult to fully test state handling. That work has been started in b-restore-rerun but won't be finished until after HashiConf.

Sorry we're not moving faster and thank you for your continued engagement! We'll get there!

@schmichael schmichael closed this Oct 19, 2018
@schmichael
Copy link
Member

If you are inclined, we could get you some test binaries to help us test the restore code path once development continues after HashiConf. I'll start trying to attach linux_amd64 binaries to PRs starting with the upcoming Restore implementation PR. Until Restore is implemented there isn't much to test.

@jozef-slezak
Copy link
Author

Great, please let me know when I can test linux_amd64 binaries attach to PRs (please build it also with GUI).
Thank you

@hashicorp hashicorp deleted a comment from shantanugadgil Oct 22, 2018
@shantanugadgil
Copy link
Contributor

hi @schmichael apologies to sound like a looping record ... but how easy (possible?) would it be merge this changes into the 0.8 branch?

I personally face this corruption issue often, though I have attributed it to me rebooting the machine abruptly. (I always do a sync and reboot, never a "reboot -f", yet this occurs) or if the undelying VM host reboots.

Impacted machines for me is the "on premise" QA cluster so manageable so far.

Along with the actual fixes, can we also have a deterministic way to check that the local nomad state is corrupted?
The reason I mention is this because I have a workaround (cron job) for now which checks the output of "systemctl status nomad" thrice.
If there is error in the output, it stops Nomad, Consul, Docker services, wipes out the Nomad's datadir and reboots the machine.
(It works, but I feel it is very hacky).

@schmichael
Copy link
Member

how easy (possible?) would it be merge this changes into the 0.8 branch?

Impossible I'm afraid. The fixes are part of a large refactor of the client code. Backporting them would effectively be the same as upgrading your clients to 0.9 but leaving your servers at 0.8.

Along with the actual fixes, can we also have a deterministic way to check that the local nomad state is corrupted?

We don't currently have plans for this as we're confident we can avoid corrupt state in all cases except OS/hardware failures. Even then our recovery should be more robust than it is today.

The reason I mention is this because I have a workaround (cron job) for now which checks the output of "systemctl status nomad" thrice.

Hm, I'm curious what this does. Corrupt state should only break Nomad on reboot or restart of the agent.

@jozef-slezak
Copy link
Author

Hello @schmichael , please do you have test binaries that we could test? I would like to test.

If you are inclined, we could get you some test binaries to help us test the restore code path once
development continues after HashiConf. I'll start trying to attach linux_amd64 binaries to PRs starting
with the upcoming Restore implementation PR. Until Restore is implemented there isn't much to test.

Are you still planning to release 0.9 in November? (referring Hashiconf info)

@schmichael
Copy link
Member

Are you still planning to release 0.9 in November? (referring Hashiconf info)

Sadly no. We still have a lot of polishing and testing to do. Sorry for the delays, but the last thing we want to do is ship before it's ready.

I don't think a test binary at this stage would be useful. Maybe in another week or two.

@jozef-slezak
Copy link
Author

I understand. Please let's give a try in a wee or two. I can retest our scenario.

I saw branch 0.8.7. Does it make sence to PR fix in 0.8.7? Ary you planning to release 0.8.7 with fixes because of 0.9 delay?

@jozef-slezak
Copy link
Author

@schmichael can we test and tweak v0.8.7-rc1 regarding this bug?

@schmichael
Copy link
Member

v0.8.7 does not contain any state corruption related handling. As I mentioned above all of our state corruption fixes are coming in 0.9. Sorry for the delay!

@jozef-slezak
Copy link
Author

jozef-slezak commented Jan 19, 2019

Sadly no. We still have a lot of polishing and testing to do. Sorry for the delays, but the last thing
we want to do is ship before it's ready.
I don't think a test binary at this stage would be useful. Maybe in another week or two.

Hello, does it make sence to retest this scenario? Are test binaries in this stage useful now?

@schmichael
Copy link
Member

Sorry I'm a bit late on the update, but yes! The beta releases should have fixed state corruption issues: https://releases.hashicorp.com/nomad/0.9.0-beta3/

@jozef-slezak
Copy link
Author

@schmichael, thank you for your update (glad to hear that). We are going to retest the binary.

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants