Nomad fails to restart on windows nodes with "funlock error: The segment is already unlocked" error #10086
Comments
Hi @gulavanir, thanks for reporting. Looks like this is probably a very old bug in the version of boltdb we use (v1.3.1 from June 2017). When the Nomad agent stops it calls funlock to release the lock on the state database. I'm not sure what the implications of migrating to the maintained etcd-io/bbolt fork would be.
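For context, a minimal standalone sketch (not Nomad's actual shutdown code) of the boltdb v1.3.1 open/close cycle around a state database; Close is where boltdb releases the file lock and where a funlock error would be reported:

```go
package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	// Open takes an exclusive lock on the database file.
	db, err := bolt.Open("state.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Close releases that lock; in boltdb v1.3.1 an unlock failure is
	// reported here as a "funlock error: ..." message.
	if err := db.Close(); err != nil {
		log.Printf("close: %v", err)
	}
}
```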
Thanks for looking into this @shoenig!! For some context, about 75% of the nodes in a cluster (for us) are actually Windows nodes, so we expect to see this more and more as we roll things out. If there's any information you'd like us to gather that would help in any way, please let us know what is most useful. Thanks!
We ran into the same issue on one of our clusters and managed to run 'nomad operator debug -duration=5m -interval=30s -server-id=all -node-id=all'. Attaching the debug archive here: nomad-debug-2021-03-09-014419Z.tar.gz. Thanks!
Hey @idrennanvmware, which version of Windows are you deploying on? That might help us narrow down the behavior.
Hey @tgross, we are using "Windows Server Core 2019 LTS".
@raunak2004 @gulavanir Sorry for the long delay. I'm investigating this issue further now. The following will be immensely helpful for debugging the issue for the next time it happens:
Thank you so much for your patience.
As per Seth's original comment, it does seem like this has been fixed upstream; however, both Nomad and Consul (as of 1.1.0) are still on the unmaintained original boltdb package. We've found no evidence that switching could cause issues, but we've also not been able to test extensively.
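For reference, a minimal sketch of what switching to the maintained fork would look like at a call site, assuming the drop-in-compatible go.etcd.io/bbolt package; only the import path changes:

```go
package main

import (
	"log"

	// was: "github.com/boltdb/bolt"
	bolt "go.etcd.io/bbolt"
)

func main() {
	// The bbolt fork keeps the same API, so call sites stay unchanged.
	db, err := bolt.Open("state.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```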
Want to clarify the etcd-io/bbolt issue, as I don't think it applies here. The issue etcd-io/bbolt#60, fixed in etcd-io/bbolt#64, was a regression in etcd-io/bbolt introduced in https://github.com/etcd-io/bbolt/pull/35/files#diff-1efe626cba25e2ad261690300496ba6e83057a4d24a338d5559e6f01f2fb770bR62-R72 . One of the dangers of an actively developed project! The original boltdb library still uses the older locking code, so that regression doesn't apply to it.

Also, I suspect the failure might be a red herring or a secondary factor. By trying to start the Nomad agent manually, I hope all Nomad output will be visible and pinpoint the problem/solution.
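A rough sketch of starting the agent manually in a foreground console so its output is visible; the service name and config path below are assumptions, so point them at whatever your installation actually uses:

```powershell
# Stop the Nomad Windows service first (service name is an assumption).
Stop-Service nomad

# Run the agent in the foreground so all startup output lands in this console.
nomad.exe agent -config="C:\ProgramData\nomad\config"
```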
I dug into this further, and I think I have a working theory. I believe there are at least three contributing factors, and I was able to reproduce some of the symptoms, though not a full reproduction of the problem.

Theory

My current theory is that this is a metastable failure: Nomad crashed for unexpected reasons (e.g. memory overload, segmentation fault, etc.) but without releasing some OS locks, and then Nomad fails to restart until the OS file locks are released or the files are deleted.

Details

The easy one is the mystery of the missing log line indicating why the agent fails to start. Until #11353, we don't capture that error in the log files/syslog. That explains why your log captures are missing this line.

The second issue is Windows file locking behavior: boltdb, our storage engine, takes an exclusive OS-level lock on the state file while it is open. If the Nomad process restarts and tries to grab the lock, it may fail to do so until the OS releases the file lock. Deleting the lock file manually would help, I suspect.

Thirdly, when ...

Ways to validate

The following info can validate whether the theory is likely:
We still need to identify the cause of the initial Nomad crash, though we must ensure that file locks are always freed so Nomad can recover well. I will provide more documentation/details tomorrow, hopefully with reproduction steps.
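To illustrate the lock-contention side of this theory, here is a standalone sketch using the boltdb API (not Nomad's actual startup code); when another process still holds the lock, Open waits only up to the configured timeout and then fails:

```go
package main

import (
	"fmt"
	"time"

	"github.com/boltdb/bolt"
)

func main() {
	// If another process still holds the exclusive lock on state.db, Open
	// waits up to Timeout and then returns an error instead of blocking forever.
	_, err := bolt.Open("state.db", 0600, &bolt.Options{Timeout: 5 * time.Second})
	if err != nil {
		fmt.Println("could not open state database:", err)
	}
}
```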
Following up to report my reproduction. I have confirmed the lock contention and the missing log line described above. On a Windows machine, I started two agents in separate terminals. I started a server/client agent with the following command:
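A plausible form of that command (the exact flags here are an assumption; the data directory matches the path in the error message quoted below):

```powershell
nomad.exe agent -server -client -bootstrap-expect=1 -data-dir="C:\Users\Mahmood\AppData\Local\Temp\data"
```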
Then, on a second PowerShell terminal, I started a client using the same data-dir; this should fail because the first agent holds the lock on the data-dir:
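Again, the exact invocation is an assumption, but it would be something along these lines:

```powershell
nomad.exe agent -client -data-dir="C:\Users\Mahmood\AppData\Local\Temp\data"
```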
Note that the actual error above is "failed to open state database: timed out while opening database, is another Nomad process accessing data_dir C:\Users\Mahmood\AppData\Local\Temp\data\client?", but it gets skipped from the log files because of the #11353 issue. Additionally, I've tried to trigger the crashed-without-freeing-lock issue independently, without luck: I ran the following program in a loop while it simulates crashing, but the flock call never failed:
main.go

```go
package main

import (
	"log"
	"os"

	"github.com/boltdb/bolt"
)

func main() {
	// Sketch of the described test (the original body is assumed): open the
	// state database, which takes the exclusive file lock, then exit
	// abruptly to simulate a crash without releasing it.
	if _, err := bolt.Open("state.db", 0600, nil); err != nil {
		log.Fatalf("open (flock) failed: %v", err)
	}
	os.Exit(1)
}
```

When this issue happens again, can you also inspect processes and see if there is a rogue Nomad agent that still holds the file lock on the data dir?
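One way to do that check; Get-Process is built in, while handle.exe is a Sysinternals tool that has to be downloaded separately:

```powershell
# Look for any Nomad processes that are still alive.
Get-Process -Name nomad -ErrorAction SilentlyContinue

# Sysinternals handle.exe (separate download) lists which process holds a
# handle on a file whose name matches the given fragment.
.\handle64.exe state.db
```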
Nomad version
1.0.1
Operating system and Environment details
Windows
Issue
Nomad is unable to restart with the following error: "funlock error: The segment is already unlocked"
Reproduction steps
There isn't a specific scenario that leads to this state, but we have seen this happen on multiple long-running clusters, specifically on the Windows nodes. The Nomad service on Windows nodes goes into a loop of trying to restart and failing with the same error.
Removing the client state db file and restarting Nomad solves the issue, but that needs manual intervention and the issue can recur.
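A sketch of that manual workaround on a Windows node; the service name and data directory are assumptions, so adjust them to your installation:

```powershell
# Stop the Nomad service, remove the client state database, then restart.
Stop-Service nomad
Remove-Item "C:\nomad\data\client\state.db"
Start-Service nomad
```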