Batch jobs scheduled multiple times when a node goes down, regardless of whether it was drained or just stopped #1050
Comments
There was already a small discussion of this here: https://groups.google.com/forum/#!searchin/nomad-tool/multiple/nomad-tool/97IHLdr4FX8/c1ujs8DXBgAJ
@g0t4 Could you please share the Nomad server logs of this happening and the job file? How many clients do you have running? I was not able to reproduce using the instructions. I used two clients running a Docker container in batch mode, killing one of the clients.
I was able to reproduce with a generic batch job composed of sleeps; the job file is attached. Here's a video explaining the steps to reproduce: https://youtu.be/Pm0nQtqQlWk. Hope this helps; let me know if you want anything else. The video includes a dump of the server logs from that example, in case you want to see those without trying the repro on your end.
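(The attached job file isn't visible in this thread. As a rough stand-in, here is a minimal sketch of a sleep-based batch job with each task defined separately and no count set; the job, group, and task names, driver, and sleep duration are all hypothetical.)

```sh
# Minimal sketch of a "generic batch job composed of sleeps", with each task
# defined separately and no count set; names and values are hypothetical.
cat > sleepers.nomad <<'EOF'
job "sleepers" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work-1" {
    task "sleep-1" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args    = ["300"]
      }
    }
  }

  group "work-2" {
    task "sleep-2" {
      driver = "exec"
      config {
        command = "/bin/sleep"
        args    = ["300"]
      }
    }
  }

  # ...repeat for as many tasks as the job needs (the issue mentions 50)
}
EOF
nomad run sleepers.nomad
```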
@g0t4 just watched the video. That does look like a nasty bug! I will pull down that file tomorrow and try to do the repro. I think one of the differences when I was trying to reproduce it earlier today was that I had a task group with one task and had the count set to ~20, whereas you have each task separately defined.
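(For contrast, a sketch of the non-reproducing variant described above: a single task in one group with count set to ~20, run as a Docker container. Again, the names and image are hypothetical.)

```sh
# The variant that did NOT reproduce: one task scheduled ~20 times via count.
cat > counted.nomad <<'EOF'
job "counted" {
  datacenters = ["dc1"]
  type        = "batch"

  group "work" {
    count = 20

    task "sleep" {
      driver = "docker"
      config {
        image   = "alpine"
        command = "sleep"
        args    = ["300"]
      }
    }
  }
}
EOF
nomad run counted.nomad
```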
Glad the video helped! I figured that was easier to explain what was going on. Let me know if I can help with anything else; I'd be happy to test things.
Hey Wes, do you want to try this branch and make sure the fix works? Thanks!
Will do, thanks!
Works with my mock job file that I sent you; if I get a chance this week I'll test it further.
Thanks a bunch for fixing this so quickly!
@g0t4 thanks for testing that! I would just wait for 0.3.2-RC, which should be out soon!
Fantastic, looking forward to it!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
Nomad v0.3.1
Operating system and Environment details
CentOS 7, kernel 3.10.0-327.10.1.el7.x86_64
Docker version 1.10.3, build 20f81dd
Issue
I have batch jobs with 50 tasks. If I take a node down while it is processing work and then bring a new node up, the work from the node that went down tends to be run multiple times, up to 50 times, seemingly endlessly. I have to stop the job to get it to stop scheduling new allocations; even though the previous runs complete successfully, that doesn't stop Nomad from continuing to schedule new ones.
I ran into this issue both by just shutting a node down and by draining it, so it seems to be a problem in either case.
Is something going wrong with evaluations if a node goes down?
Shouldn't we be able to lose a node and have the processing eventually move to another machine?
I don't have count set on the tasks, so I don't know why Nomad would run multiple occurrences.
Reproduction steps
Launch a batch job with work running on multiple nodes. Take one node down and bring up a new node in its place. (Bringing up a new node might not be necessary; that's just something auto scaling does in my situation.) A sketch of the commands involved is below.
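(A hedged sketch of the repro commands, assuming the hypothetical sleepers.nomad file from earlier in the thread; the node ID is a placeholder, and in Nomad 0.3.x draining is done with `nomad node-drain`.)

```sh
nomad run sleepers.nomad            # launch the batch job across the cluster
nomad node-status                   # find the ID of a node currently running work
nomad node-drain -enable <node-id>  # drain it (or simply shut the node down)
nomad status sleepers               # watch: duplicate allocations keep appearing
nomad stop sleepers                 # stopping the job is the only way to halt them
```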