raw_exec job started 2.5 hours after being scheduled on Windows #2563
This is how the allocation looks after the job suddenly started:
Can you give the output of:
@capone212 That should fix it! Master should be bug-free if you want to build and roll out master!
@dadgar thanks a lot for your quick fix!!!
Hi @dadgar, I have just noticed that unfortunately the fix does not help. Good news though: I can now reproduce it easily in my environment.
At that point, in roughly 30-50% of cases, we will have a task stuck in the pending state.
From logs:
From allocations:
Please find the full log and full allocations attached. @dadgar, if you need any help please let me know. As I said, I can reproduce it and am eager to test patches.
And yes, after several hours, the job finally started!
The essence of what is going on:
The allocation started only after GC of the previous allocation.
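The observation above can be modeled with a small sketch. This is not Nomad's actual code; the predicates below are hypothetical, and only illustrate the reported behavior: the replacement allocation should unblock as soon as the previous allocation reaches a terminal status, not only once the previous allocation has been garbage-collected.

```python
from dataclasses import dataclass

@dataclass
class Alloc:
    id: str
    status: str = "pending"   # pending | running | complete | failed | lost

def is_terminal(alloc: Alloc) -> bool:
    return alloc.status in ("complete", "failed", "lost")

# Buggy predicate (matching the report): the replacement alloc only
# unblocks once the previous alloc has been GC'd (i.e. is gone entirely).
def can_start_buggy(prev_alloc):
    return prev_alloc is None   # None == already garbage-collected

# Expected predicate: unblock as soon as the previous alloc is terminal.
def can_start_fixed(prev_alloc):
    return prev_alloc is None or is_terminal(prev_alloc)

prev = Alloc("2748409b", status="complete")
assert not can_start_buggy(prev)   # stuck until GC runs, hours later
assert can_start_fixed(prev)       # would start immediately
```

With the buggy check, a completed-but-not-yet-collected allocation keeps its replacement blocked until the periodic GC happens to run, which matches the multi-hour delay seen here.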
Hmm, thanks for the great repro steps! Will get this fixed by 0.6.0 and will update the issue when we have a patch for you to test!
Sorry for the delay, I've been trying to reproduce your bug to make sure it's fixed before we release 0.6. Unfortunately I've been unable to reproduce it. We've made a lot of changes to our state file handling (related to the …). Any chance you'd be able to test using these binaries? linux_amd64.zip If you still see the …
Hi @schmichael
There is no "failed to restore state for alloc" in the log file.
Plus some logging improvements that may help with #2563
Thanks for testing. I'm having a hard time reproducing this. Could you provide more details on how to reproduce it? It appears you're running (at least) a 3-node cluster with the node in question running as a client+server. While that's not a recommended configuration, it shouldn't cause bugs. I'm also not sure what exactly happened in that last log file you posted, as there's no long delay like in your initial report. I found and fixed a small race condition in somewhat related code, but I don't think it could cause the original behavior you described. I'm planning to do a larger refactor of the alloc blocking logic, but it's unlikely to make it into 0.6. Here's a build from the #2780 branch if you want to test:
@schmichael I have tried the last binary you provided, but was able to reproduce the same problem with it. I spent my weekend understanding what is going on, and I think I found the bug. Looking forward to your feedback on the proposed solution.
#2563 fixed pending state for allocations with terminal status
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.5.4
Operating system and Environment details
5 windows boxes = 3 servers + 2 clients
Issue
After a nomad client restart, some of the scheduled jobs on that client were in the pending state for more than two hours, without any attempt to start the jobs. This is a severe issue for us and we would like to have it fixed ASAP.
Reproduction steps
I don't have exact steps. We know that this sometimes occurs after a client restart, and we have definitely faced this problem more than once.
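When the problem does occur, the stuck state can be inspected with the standard Nomad CLI and HTTP API. A sketch, assuming a local agent on the default port and using the job and allocation IDs from this report:

```shell
# Show the job's allocations and their client status (pending/running/...)
nomad status ngprunner.Node001

# Inspect the stuck allocation in detail (task states and recent events)
nomad alloc-status 2748409b

# Trigger a server-side garbage collection; per this report, the blocked
# allocation only started once the previous allocation was GC'd
curl -X PUT http://127.0.0.1:4646/v1/system/gc
```

Forcing GC this way can help distinguish "blocked until previous alloc is collected" from other causes of a pending allocation.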
Nomad Client logs (if appropriate)
The full nomad client logs are here:
https://gist.github.com/capone212/94f47912f1f9e700195d9b45e846b7e3
One of the problem jobs is "ngprunner.Node001".
While the problem was reproducing, I made the following observations:
You can see that allocation 2748409b is in the pending state, but there was no attempt to start or restart that allocation on the host machine.
This is an extract from nomad.log about the job:
Allocation 2748409b was in the "blocked queue", and after more than two hours the allocation suddenly started.