Batch jobs scheduled multiple times when a node goes down, regardless of whether it was drained or just stopped #1050
Comments
There was already a small discussion of this here: https://groups.google.com/forum/#!searchin/nomad-tool/multiple/nomad-tool/97IHLdr4FX8/c1ujs8DXBgAJ
@g0t4 Could you please share the Nomad server logs of this happening and the job file? How many clients do you have running? I was not able to reproduce using the instructions. I used two clients running a Docker container in batch mode, killing one of the clients.
I was able to reproduce with a generic batch job composed of sleeps; the job file is attached. Here's a video explaining the steps to reproduce: https://youtu.be/Pm0nQtqQlWk. Hope this helps; let me know if you want anything else. The video includes a dump of the server logs from that example, in case you want to see those without trying the repro on your end.
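(The attached job file isn't visible in this thread. As a rough stand-in, here is a minimal sketch of a sleep-based batch job with each task defined separately and no count set; the job, group, and task names, driver, and sleep duration are all hypothetical.)

```sh
# Minimal sketch of a "generic batch job composed of sleeps", with each task
# defined separately and no count set; names and values are hypothetical.
cat > sleepers.nomad <<'EOF'
job "sleepers" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work-1" {
    task "sleep-1" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args    = ["300"]
      }
    }
  }

  group "work-2" {
    task "sleep-2" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args    = ["300"]
      }
    }
  }

  # ...repeat for as many tasks as the job needs (the issue mentions 50)
}
EOF
nomad run sleepers.nomad
```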
@g0t4 just watched the video. That does look like a nasty bug! I will pull down that file tomorrow and try to do the repro. I think one of the differences when I was trying to reproduce it earlier today was that I had a task group with one task and had the count set to ~20, whereas you have each task separately defined.
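(For contrast, a sketch of the non-reproducing variant described above: a single task in one group with count set to ~20, run as a Docker container. Again, the names and image are hypothetical.)

```sh
# The variant that did NOT reproduce: one task scheduled ~20 times via count.
cat > counted.nomad <<'EOF'
job "counted" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    count = 20

    task "sleep" {
      driver = "docker"
      config {
        image   = "alpine"
        command = "sleep"
        args    = ["300"]
      }
    }
  }
}
EOF
nomad run counted.nomad
```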
Glad the video helped! I figured that was easier to explain what was going on. Let me know if I can help with anything else; I'd be happy to test things.
Hey Wes, do you want to try this branch and make sure the fix works? Thanks!
Will do, thanks!
Works with my mock job file that I sent you; if I get a chance this week I'll test it further.
Thanks a bunch for fixing this so quickly!
@g0t4 thanks for testing that! I would just wait for 0.3.2-RC, which should be out soon!
Fantastic, looking forward to it!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.3.1
Operating system and Environment details
CentOS 7, kernel 3.10.0-327.10.1.el7.x86_64
Docker version 1.10.3, build 20f81dd
Issue
I have batch jobs with 50 tasks. If I take a node down while it is processing work and then bring a new node up, the work from the node that went down tends to be run multiple times, up to 50 times, seemingly endlessly. I have to stop the job to get it to stop scheduling new allocations; even though the previous runs complete successfully, that doesn't stop Nomad from continuing to schedule new ones.
I ran into this issue both by just shutting a node down and by draining it, so it seems to be a problem in either case.
Is something going wrong with evaluations if a node goes down?
Shouldn't we be able to lose a node and have the processing eventually move to another machine?
I don't have count set on the tasks, so I don't know why Nomad would run multiple occurrences.
Reproduction steps
Launch a batch job with work running on multiple nodes. Take one node down and bring up a new node in its place. (Bringing up a new node might not be necessary; that's just something auto scaling does in my situation.) A sketch of the commands involved is below.
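(A hedged sketch of the repro commands, assuming the hypothetical sleepers.nomad file from earlier in the thread; the node ID is a placeholder, and in Nomad 0.3.x draining is done with `nomad node-drain`.)

```sh
nomad run sleepers.nomad            # launch the batch job across the cluster
nomad node-status                   # find the ID of a node currently running work
nomad node-drain -enable <node-id>  # drain it (or simply shut the node down)
nomad status sleepers               # watch: duplicate allocations keep appearing
nomad stop sleepers                 # stopping the job is the only way to halt them
```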