
raw_exec job started 2.5 hours after being scheduled on Windows #2563

Closed
capone212 opened this issue Apr 14, 2017 · 18 comments


capone212 commented Apr 14, 2017

Nomad version

0.5.4

Operating system and Environment details

5 Windows boxes = 3 servers + 2 clients

Issue

After a Nomad client restart, some of the jobs scheduled on that client remained in the pending state for more than two hours, without any attempt to start them. This is a severe issue for us and we would like to have it fixed ASAP.

Reproduction steps

I don't have exact steps. We know that this sometimes occurs after a client restart, and we have definitely faced this problem more than once.

Nomad Client logs (if appropriate)

The full Nomad client logs are here:
https://gist.github.com/capone212/94f47912f1f9e700195d9b45e846b7e3

One of the problem jobs is "ngprunner.Node001".
While the problem was reproducing I made the following observations:

>nomad status ngprunner.Node001
ID            = ngprunner.Node001
Name          = ngprunner.Node001
Type          = service
Priority      = 50
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
ngp         0       1         0        0       0         1
postgres    0       0         1        0       0         1

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status   Created At
2748409b  209b0121  34ce800b  ngp         run      pending  04/14/17 11:25:09 EAT
94ec5139  209b0121  34ce800b  postgres    run      running  04/14/17 11:25:09 EAT
3d170428  0224aabc  34ce800b  postgres    stop     lost     04/14/17 10:39:24 EAT
b0741602  0224aabc  34ce800b  ngp         stop     lost     04/14/17 10:39:24 EAT



>nomad alloc-status -verbose 2748409b

ID                  = 2748409b-ad2c-4008-9dde-72a38a01480e
Eval ID             = 209b0121-81ec-ebf0-6616-a8a403136673
Name                = ngprunner.Node001.ngp[0]
Node ID             = 34ce800b-5d57-1b17-de70-c494fadbdeb8
Job ID              = ngprunner.Node001
Client Status       = pending
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 04/14/17 11:25:09 EAT
Evaluated Nodes     = 4
Filtered Nodes      = 3
Exhausted Nodes     = 0
Allocation Time     = 0s
Failures            = 0

Placement Metrics
  * Constraint "${node.unique.name} = QA-T21" filtered 3 nodes
  * Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.binpack" = 4.575736
  * Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.job-anti-affinity" = -10.000000

You can see that allocation 2748409b is in the pending state, but there was no attempt to start or restart that allocation on the host machine.

This is an extract from nomad.log about the job:

    2017/04/14 10:39:24.687255 [DEBUG] client: starting task runners for alloc 'b0741602-7127-f66e-5aaf-d4d1df1397be'
    2017/04/14 10:39:24.687255 [DEBUG] client: starting task context for 'ngprunner' (alloc 'b0741602-7127-f66e-5aaf-d4d1df1397be')
    2017/04/14 10:39:24 [DEBUG] plugin: starting plugin: C:\Program Files\RaftLauncher\current\nomad.exe []string{"C:\\Program Files\\RaftLauncher\\current\\nomad.exe", "executor", "{\"LogFile\":\"C:\\\\ProgramData\\\\AxxonSoft\\\\AxxonNext\\\\RaFT\\\\cluster\\\\nomad\\\\data\\\\alloc\\\\b0741602-7127-f66e-5aaf-d4d1df1397be\\\\ngprunner\\\\executor.out\",\"LogLevel\":\"DEBUG\"}"}
    2017/04/14 11:22:52.439567 [DEBUG] client: alloc "b0741602-7127-f66e-5aaf-d4d1df1397be" in terminal status, waiting for destroy
    2017/04/14 11:25:00.300414 [ERR] client: failed to open handle to task 'ngprunner' for alloc 'b0741602-7127-f66e-5aaf-d4d1df1397be': error connecting to plugin: error creating rpc client for executor plugin: OpenProcess: The parameter is incorrect.
    2017/04/14 11:25:00.300414 [DEBUG] client: alloc "b0741602-7127-f66e-5aaf-d4d1df1397be" in terminal status, waiting for destroy
    2017/04/14 11:25:09.286548 [DEBUG] client: added alloc "2748409b-ad2c-4008-9dde-72a38a01480e" to blocked queue for previous allocation "b0741602-7127-f66e-5aaf-d4d1df1397be"

Allocation 2748409b was in the "blocked queue", and after more than two hours the allocation suddenly started:

    2017/04/14 13:40:09.807652 [DEBUG] client: starting task runners for alloc '2748409b-ad2c-4008-9dde-72a38a01480e'
    2017/04/14 13:40:09.807652 [DEBUG] client: starting task context for 'ngprunner' (alloc '2748409b-ad2c-4008-9dde-72a38a01480e')
capone212 changed the title from "nomad client restart: raw_exec job started after 2.5 hours after scheduled on windows" to "raw_exec job started after 2.5 hours after scheduled on windows" Apr 14, 2017

capone212 commented Apr 14, 2017

This is how the allocation looks after the job suddenly started:

ID                  = 2748409b-ad2c-4008-9dde-72a38a01480e
Eval ID             = 209b0121-81ec-ebf0-6616-a8a403136673
Name                = ngprunner.Node001.ngp[0]
Node ID             = 34ce800b-5d57-1b17-de70-c494fadbdeb8
Job ID              = ngprunner.Node001
Client Status       = running
Client Description  = <none>
Desired Status      = run
Desired Description = <none>
Created At          = 04/14/17 11:25:09 EAT
Evaluated Nodes     = 4
Filtered Nodes      = 3
Exhausted Nodes     = 0
Allocation Time     = 0s
Failures            = 0


Task "ngprunner" is "running"
Task Resources
CPU        Memory   Disk  IOPS  Addresses
10000 MHz  2.9 GiB  0 B   0     NGP_BASE: 10.0.15.1:1000

Recent Events:
Time                   Type      Description
04/14/17 13:40:11 EAT  Started   Task started by client
04/14/17 13:40:09 EAT  Received  Task received by client

Placement Metrics
  * Constraint "${node.unique.name} = QA-T21" filtered 3 nodes
  * Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.binpack" = 4.575736
  * Score "34ce800b-5d57-1b17-de70-c494fadbdeb8.job-anti-affinity" = -10.000000


dadgar commented Apr 14, 2017

Can you give the output of: curl http://127.0.0.1:4646/v1/node/<full node-id>/allocations
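
If it's easier to script, roughly the same query via the Go API client might look like the sketch below (a sketch only, assuming the official github.com/hashicorp/nomad/api package; the printed fields are just illustrative):

    // Rough Go-API equivalent of the curl call above.
    package main

    import (
        "fmt"
        "log"

        api "github.com/hashicorp/nomad/api"
    )

    func main() {
        // DefaultConfig picks up NOMAD_ADDR or falls back to http://127.0.0.1:4646.
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Full node ID taken from the alloc-status output above.
        nodeID := "34ce800b-5d57-1b17-de70-c494fadbdeb8"
        allocs, _, err := client.Nodes().Allocations(nodeID, nil)
        if err != nil {
            log.Fatal(err)
        }
        for _, alloc := range allocs {
            fmt.Printf("%s  desired=%s  client=%s\n", alloc.ID, alloc.DesiredStatus, alloc.ClientStatus)
        }
    }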


dadgar commented Apr 14, 2017

@capone212 That should fix it! Master should be bug free if you want to build and roll out master!

@capone212

@dadgar thanks a lot for your quick fix!!!


capone212 commented Jun 22, 2017

Hi @dadgar, I have just noticed that unfortunately the fix does not help. The good news though is that I can now reproduce it easily in my environment.
With 3 nodes in server+client mode I do the following:

  1. Create a job pinned to one specific server: job.Constrain(apiNomad.NewConstraint("${node.unique.name}", "=", agentId)) (sketched below)
  2. Run the job
  3. Power off that specific server
  4. Wait 1 minute, then update the job
  5. Bring the server back up

At that point, in roughly 30-50% of cases, we end up with a task stuck in the pending state.
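
For reference, step 1 via the Go API looks roughly like the sketch below (a minimal illustration; the job ID, priority, datacenter and the omitted task group are placeholders, not my real job spec):

    // Minimal sketch of step 1, assuming the official Go client
    // (github.com/hashicorp/nomad/api); the real job also defines a raw_exec task.
    package main

    import (
        "log"

        api "github.com/hashicorp/nomad/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        agentId := "SRV1" // ${node.unique.name} of the target client

        job := api.NewServiceJob("pgrunner."+agentId, "pgrunner."+agentId, "global", 60)
        job.AddDatacenter("dc1")
        // Pin the job to one specific node (step 1).
        job.Constrain(api.NewConstraint("${node.unique.name}", "=", agentId))
        // ... task group with the raw_exec task omitted for brevity ...

        if _, _, err := client.Jobs().Register(job, nil); err != nil {
            log.Fatal(err)
        }
    }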
The following is debug data from my last test with your fix applied.

>nomad status  pgrunner.cxRPaR97LE.SRV1
ID            = pgrunner.cxRPaR97LE.SRV1
Name          = pgrunner.cxRPaR97LE.SRV1
Type          = service
Priority      = 60
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
postgres    0       1         1        0       4         0

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status    Created At
0eac3064  75bbb534  f303c409  postgres    run      pending   06/22/17 19:13:06 BOT
19fb4725  ea67e6e3  f303c409  postgres    stop     complete  06/22/17 19:09:08 BOT
75443349  4bf104fd  f303c409  postgres    stop     complete  06/22/17 19:08:02 BOT
05569c70  08f92ee3  f303c409  postgres    stop     running   06/22/17 19:05:07 BOT
d28605a7  a8e408f1  f303c409  postgres    stop     complete  06/22/17 19:02:19 BOT
a0284ddf  2992e4a9  f303c409  postgres    stop     complete  06/22/17 18:58:46 BOT

From logs:

    2017/06/22 19:13:06.348691 [DEBUG] client: added alloc "0eac3064-a10f-1eb0-f7e2-1c540dd29b97" to blocked queue for previous allocation "19fb4725-34fa-5e15-91b3-c5ce6899e860"

From the allocations endpoint:

1	
ID	"0eac3064-a10f-1eb0-f7e2-1c540dd29b97"
EvalID	"75bbb534-6f17-9edf-e314-9aed533ffdbe"
Name	"pgrunner.cxRPaR97LE.SRV1.postgres[0]"
NodeID	"f303c409-ed57-4f62-2b2b-f88a45f9540f"
JobID	"pgrunner.cxRPaR97LE.SRV1"
TaskGroup	"postgres"
DesiredStatus	"run"
DesiredDescription	""
ClientStatus	"pending"
ClientDescription	""
TaskStates	
CreateIndex	2041
ModifyIndex	2041
CreateTime	1498173186329835000
2	
ID	"19fb4725-34fa-5e15-91b3-c5ce6899e860"
EvalID	"ea67e6e3-6280-0358-0368-519e03ee1740"
Name	"pgrunner.cxRPaR97LE.SRV1.postgres[0]"
NodeID	"f303c409-ed57-4f62-2b2b-f88a45f9540f"
JobID	"pgrunner.cxRPaR97LE.SRV1"
TaskGroup	"postgres"
DesiredStatus	"stop"
DesiredDescription	"alloc is lost since its node is down"
ClientStatus	"complete"
ClientDescription	""
TaskStates	
pgrunner	
State	"dead"
Failed	false
StartedAt	"2017-06-22T23:09:09.5177181Z"
FinishedAt	"2017-06-22T23:12:57.7315015Z"
Events	
0	
Type	"Received"
Time	1498172948476707300
FailsTask	false
RestartReason	""
SetupError	""
DriverError	""
ExitCode	0
Signal	0
Message	""
KillTimeout	0
KillError	""
KillReason	""
StartDelay	0
DownloadError	""
ValidationError	""
DiskLimit	0
FailedSibling	""
VaultError	""
TaskSignalReason	""
TaskSignal	""
DriverMessage	""
1	
Type	"Task Setup"
Time	1498172948476707300
FailsTask	false
RestartReason	""
SetupError	""
DriverError	""
ExitCode	0
Signal	0
Message	"Building Task Directory"
KillTimeout	0
KillError	""
KillReason	""
StartDelay	0
DownloadError	""
ValidationError	""
DiskLimit	0
FailedSibling	""
VaultError	""
TaskSignalReason	""
TaskSignal	""
DriverMessage	""
2	
Type	"Started"
Time	1498172949517718000
FailsTask	false
RestartReason	""
SetupError	""
DriverError	""
ExitCode	0
Signal	0
Message	""
KillTimeout	0
KillError	""
KillReason	""
StartDelay	0
DownloadError	""
ValidationError	""
DiskLimit	0
FailedSibling	""
VaultError	""
TaskSignalReason	""
TaskSignal	""
DriverMessage	""
3	
Type	"Started"
Time	1498173175443420200
FailsTask	false
RestartReason	""
SetupError	""
DriverError	""
ExitCode	0
Signal	0
Message	""
KillTimeout	0
KillError	""
KillReason	""
StartDelay	0
DownloadError	""
ValidationError	""
DiskLimit	0
FailedSibling	""
VaultError	""
TaskSignalReason	""
TaskSignal	""
DriverMessage	""
4	
Type	"Killing"
Time	1498173175443420200
FailsTask	false
RestartReason	""
SetupError	""
DriverError	""
ExitCode	0
Signal	0
Message	""
KillTimeout	5000000000
KillError	""
KillReason	""
StartDelay	0
DownloadError	""
ValidationError	""
DiskLimit	0
FailedSibling	""
VaultError	""
TaskSignalReason	""
TaskSignal	""
DriverMessage	""
5	
Type	"Killed"
Time	1498173177579159800
FailsTask	false
RestartReason	""
SetupError	""
DriverError	""
ExitCode	0
Signal	0
Message	""
KillTimeout	0
KillError	""
KillReason	""
StartDelay	0
DownloadError	""
ValidationError	""
DiskLimit	0
FailedSibling	""
VaultError	""
TaskSignalReason	""
TaskSignal	""
DriverMessage	""
CreateIndex	1989
ModifyIndex	2034
CreateTime	1498172948382785500

@capone212

Please find full log and full allocations attached:

allocations.txt

nomad.log.txt

@dadgar, if you need any help please let me know. As I said, I can reproduce it and am eager to test patches.

@capone212

And yes, after several hours, the job finally started!

ID            = pgrunner.cxRPaR97LE.SRV1
Name          = pgrunner.cxRPaR97LE.SRV1
Type          = service
Priority      = 60
Datacenters   = dc1
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
postgres    0       0         2        0       4         0

Allocations
ID        Eval ID   Node ID   Task Group  Desired  Status   Created At
0eac3064  75bbb534  f303c409  postgres    run      running  06/22/17 19:13:06 BOT


capone212 commented Jun 23, 2017

The essence of what is going on:

[ERR] client: cleanup allocation faf96d53-2900-acd2-61b7-dbe621244981==> Nomad agent configuration:
    2017/06/22 19:12:54.403378 [ERR] client: failed to find task dir metadata for alloc "faf96d53-2900-acd2-61b7-dbe621244981" task "pgrunner"
    2017/06/22 19:12:54.403378 [ERR] client: failed to restore state for alloc faf96d53-2900-acd2-61b7-dbe621244981: failed to find task dir metadata for alloc "faf96d53-2900-acd2-61b7-dbe621244981" task "pgrunner"
    2017/06/22 19:12:56.662167 [DEBUG] client: alloc "faf96d53-2900-acd2-61b7-dbe621244981" in terminal status, waiting for destroy
    2017/06/22 19:12:56.662167 [DEBUG] client: added alloc "c75d7468-0055-73ca-1411-bfb932f9de7b" to blocked queue for previous allocation "faf96d53-2900-acd2-61b7-dbe621244981"
    2017/06/22 20:40:34.379971 [INFO] client: marking allocation faf96d53-2900-acd2-61b7-dbe621244981 for GC
    2017/06/22 20:40:34.379971 [INFO] client: garbage collecting allocation "faf96d53-2900-acd2-61b7-dbe621244981"
    2017/06/22 20:40:34.380948 [DEBUG] client: terminating runner for alloc 'faf96d53-2900-acd2-61b7-dbe621244981'
    2017/06/22 20:40:34.380948 [DEBUG] client: garbage collected "faf96d53-2900-acd2-61b7-dbe621244981"

    2017/06/22 20:40:34.380948 [DEBUG] client: starting task runners for alloc '0eac3064-a10f-1eb0-f7e2-1c540dd29b97'
    2017/06/22 20:40:34.380948 [DEBUG] client: starting task context for 'pgrunner' (alloc '0eac3064-a10f-1eb0-f7e2-1c540dd29b97')

@capone212

The allocation started only after GC of the previous allocation.
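
My understanding of the ordering, as a conceptual sketch only (these are not Nomad's real types): the replacement allocation blocks on the previous allocation being destroyed, and when state restore fails nothing destroys the previous allocation until the periodic GC runs, hence the multi-hour delay.

    // Conceptual illustration of the suspected ordering problem; types and
    // names are made up for the sketch, not taken from the Nomad client code.
    package main

    import (
        "fmt"
        "time"
    )

    type allocRunner struct {
        destroyedCh chan struct{} // closed when the alloc is destroyed (stop or GC)
    }

    // blockForPrevious parks the replacement alloc until the previous alloc is
    // destroyed, then starts it.
    func blockForPrevious(prev *allocRunner, startNew func()) {
        go func() {
            <-prev.destroyedCh // with a failed restore, only the periodic GC ever closes this
            startNew()
        }()
    }

    func main() {
        prev := &allocRunner{destroyedCh: make(chan struct{})}
        blockForPrevious(prev, func() { fmt.Println("starting replacement alloc") })

        // Simulate the GC finally destroying the previous alloc much later.
        time.Sleep(100 * time.Millisecond)
        close(prev.destroyedCh)
        time.Sleep(100 * time.Millisecond)
    }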


dadgar commented Jun 23, 2017

Hmm, thanks for the great repro steps! Will get this fixed by 0.6.0 and will update the issue when we have a patch for you to test!

@dadgar dadgar reopened this Jun 23, 2017
@schmichael schmichael added this to the v0.6.0 milestone Jul 3, 2017
@schmichael

Sorry for the delay, I've been trying to reproduce your bug to make sure it's fixed before we release 0.6. Unfortunately I've been unable to reproduce it. We've made a lot of changes to our state file handling (related to the "failed to restore state for alloc" log messages), so I'm hoping your issue has been fixed!

Any chance you'd be able to test using these binaries?

linux_amd64.zip
windows_amd64.zip

If you still see the "failed to restore state for alloc" log message I would love to get a copy of your <state_dir>/state.db file. Please do not upload it to GitHub as it contains full copies of your jobs and allocations which may contain private data. Feel free to email me directly: [email protected]

@capone212

Hi @schmichael
I am going to test the binaries you provided.


capone212 commented Jul 5, 2017

Hi @schmichael
Unfortunately I have the same issue with the provided binaries:
allocation.txt
nomad.log.txt

@capone212

There is no "failed to restore state for alloc" in the log file.
If you need any additional files let me know.

schmichael added a commit that referenced this issue Jul 5, 2017
Plus some logging improvements that may help with #2563
@schmichael

Thanks for testing. I'm having a hard time reproducing this. Could you provide more details on how to reproduce it?

It appears you're running (at least) a 3 node cluster with the node in question running as a client+server. While that's not a recommended configuration it shouldn't cause bugs.

I'm also not sure what exactly happened in that last log file you posted as there's no long delay like in your initial report.

I found and fixed a small race condition in somewhat related code, but I don't think it could cause the original behavior you described. I'm planning to do a larger refactor of the alloc blocking logic but it's unlikely to make it into 0.6.

Here's a build from the #2780 branch if you want to test:
windows_amd64.zip


capone212 commented Jul 9, 2017

@schmichael I have tried the last binary you provided, but I was still able to reproduce the same problem with it.

I invested my weekend in understanding what is going on, and I think I found the bug. Looking forward to your feedback on the proposed solution.

@schmichael schmichael removed this from the v0.6.0 milestone Jul 31, 2017
schmichael added a commit that referenced this issue Aug 4, 2017
#2563 fixed pending state for allocations with terminal status
@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 10, 2022