Nomad failing to create templating sandbox on Windows; appears to leak DACL entries on nomad.exe #20585
Comments
On the face of it, this appears to be a fairly major bug on Windows. That is:
The implication would be that once a certain number of jobs have been run, that's it: Nomad will no longer work and will be unable to launch new jobs. I will experiment with vanilla Nomad to verify I can create a simple reproduction scenario.
Reproduction:
Note that the problem only occurs (obviously) if there's a Template in the nomad task. Golang driver program:
I suspect that what is actually run isn't important. I also suspect the fact that I use raw-exec here isn't especially important, but I haven't tried it with anything else.
Just hit this in production, so glad to see it being worked on. If more logs/examples are needed I can help with that.
Hey @tomqwpl, we just merged 2 changes that will remedy this problem. Nomad 1.8.2 will no longer sandbox template rendering on Windows, and to address the security aspect (which is only relevant for running Docker with Process Isolation as |
Nomad version
Nomad v1.7.5
BuildDate 2024-02-13T15:10:13Z
Revision 5f5d464
Operating system and Environment details
Windows 11
Issue
Repeated errors of
Reproduction steps
Hard to give specific reproduction steps at this time. I have one instance of Nomad running on a laptop, this isn't a "production" environment.
This occurred when I was performing some load testing on a solution that utilises Nomad. I submitted 1000 jobs in quick succession, all of which were fairly small jobs. Many succeeded, but very many failed with this error. To me it feels like a load issue.
Note that I've deliberately configured the jobs not to allow restarts (I don't want them to be rerun just because they return a non-zero exit code).
Possible reproduction would be to submit a very large number of jobs, each of which has default resource requirements, and each of which is a "raw-exec" doing something simple like "dir". Nomad will attempt to schedule maybe 100 at once (32GB memory, default resource requirements are 300MB).
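The reproduction described above could be sketched roughly as follows. This is a hypothetical generator, not the job actually used in the report: the job names, the `loadtest-jobs` directory, the template contents, and the `cmd.exe /c dir` command are all assumptions chosen for illustration. It writes one small batch job per file, each with restarts disabled and a template block (the template being what triggers the sandbox).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// jobSpec returns a small Nomad batch job in HCL. The layout is an
// assumption made for illustration; the original job file is not shown
// in the report.
func jobSpec(name string) string {
	return fmt.Sprintf(`job %q {
  type = "batch"

  group "g" {
    # Do not restart or reschedule: a non-zero exit code just fails the job.
    restart {
      attempts = 0
      mode     = "fail"
    }
    reschedule {
      attempts  = 0
      unlimited = false
    }

    task "dir" {
      driver = "raw_exec"

      config {
        command = "cmd.exe"
        args    = ["/c", "dir"]
      }

      # A template block is needed to exercise template rendering.
      template {
        data        = "generated for %s"
        destination = "local/hello.txt"
      }
    }
  }
}
`, name, name)
}

// writeJobs writes n job files into dir and returns their paths.
func writeJobs(dir string, n int) ([]string, error) {
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	paths := make([]string, 0, n)
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("loadtest-%04d", i)
		p := filepath.Join(dir, name+".nomad.hcl")
		if err := os.WriteFile(p, []byte(jobSpec(name)), 0o644); err != nil {
			return nil, err
		}
		paths = append(paths, p)
	}
	return paths, nil
}

func main() {
	paths, err := writeJobs("loadtest-jobs", 1000)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Printf("wrote %d job files\n", len(paths))
}
```

Each generated file can then be submitted with `nomad job run -detach <file>` in a loop. Whether this exact job shape triggers the failure is untested here.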
Actual Result
It looks like Nomad isn't cleaning up the access control lists properly, possibly always, possibly only in error circumstances. Having run the experiment above, the only explanation I could think of was that the ACL had got too big. So I used PowerShell to run
Get-Acl nomad.exe | Format-List
There were many hundreds of entries in the list. So it would appear that the "The parameter is incorrect" error above means "enough already, that DACL is way too big". Nomad appears not to always correctly undo the ACL modifications it makes, so the ACL grows until it can grow no longer. Incidentally, I tried looking at the access control list using Windows Explorer (Properties, Security tab); Windows Explorer crashed because the ACL was too big for it to display.
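As a rough check on the "ACL grows until it can grow no longer" theory, the arithmetic can be sketched as below. A Windows ACL records its size in a 16-bit field, so it is capped at about 64 KB, and each access-allowed entry takes 8 fixed bytes plus the length of its SID. The issue does not say which kind of SID Nomad adds per job, so the sub-authority count here is a parameter and the values in `main` are only illustrative assumptions.

```go
package main

import "fmt"

const (
	maxACLBytes    = 65535 // AclSize is a 16-bit field, so an ACL cannot be larger
	aclHeaderBytes = 8     // ACL header: revision, padding, size, ACE count, padding
	aceFixedBytes  = 8     // ACE header (4 bytes) plus access mask (4 bytes)
	sidFixedBytes  = 8     // SID: revision, sub-authority count, 6-byte authority
)

// sidBytes is the binary length of a SID with the given number of
// 32-bit sub-authorities.
func sidBytes(subAuthorities int) int {
	return sidFixedBytes + 4*subAuthorities
}

// aceBytes is the size of one access-allowed ACE carrying such a SID.
func aceBytes(subAuthorities int) int {
	return aceFixedBytes + sidBytes(subAuthorities)
}

// maxACEs estimates how many such ACEs fit in a single ACL.
func maxACEs(subAuthorities int) int {
	return (maxACLBytes - aclHeaderBytes) / aceBytes(subAuthorities)
}

func main() {
	// Sub-authority counts are assumptions; the SIDs Nomad adds are not
	// identified in this issue.
	for _, n := range []int{1, 5, 8, 15} {
		fmt.Printf("sub-authorities=%2d  ACE=%2d bytes  max ACEs=%d\n",
			n, aceBytes(n), maxACEs(n))
	}
}
```

With five sub-authorities the estimate is 1820 entries, and it drops as SIDs get longer, which is at least consistent with the failure appearing once the list holds many hundreds of leaked entries.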