
Under high load, repeated use of allocation name #3593

Closed
dukeland9 opened this issue Nov 28, 2017 · 27 comments

@dukeland9

Nomad version

0.7.0

Operating system and Environment details

Ubuntu 14.04 (some of the client machines are Ubuntu 16.04)

Issue

I'm using Nomad to schedule distributed batch jobs. The cluster has ~30 machines with about 80 cores.
If I divide the job into about 100 tasks, the system works fine. However, if the number of tasks is increased to around 200, some of the tasks (allocations) are never started.

For example, a job I started was divided into 256 tasks (allocs), but tasks (allocs) No. 251-255 were never scheduled (see below). The job did report 256 successfully completed tasks, but my batch job depends on the alloc index (0-255) to assign each worker the right data slice to process, so having some tasks never start is not acceptable.

[Screenshot: job status showing allocs 251-255 never started]

The jobs are of raw_exec type.
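
For context, each worker picks its data slice from the NOMAD_ALLOC_INDEX value that Nomad injects into the task environment. A minimal sketch of that pattern (hypothetical worker code, not the actual script):

import os

# Nomad exposes NOMAD_ALLOC_INDEX to every task; TOTAL_SHARDS is assumed to match the job's count.
ALLOC_INDEX = int(os.environ["NOMAD_ALLOC_INDEX"])   # 0 .. count-1
TOTAL_SHARDS = 256

def my_slice(record_ids):
    """Return the records this worker is responsible for (round-robin sharding)."""
    return [r for i, r in enumerate(record_ids) if i % TOTAL_SHARDS == ALLOC_INDEX]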

Reproduction steps

The issue can always be reproduced when the number of allocations is increased to 200 or more. About 6-8 allocs will never be started.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Job file (if appropriate)

@chelseakomlo
Contributor

Hi, thanks for reporting this issue. Would you mind posting logs (server and/or client) from when this issue occurs?

@dukeland9
Author

@chelseakomlo Thanks for replying!

I suspect the server/client logs won't help much. Here is part of the log the server printed while executing my job:

2017/11/29 10:14:34.301944 [INFO] client: task "extract_vgg_feature" for alloc "a841ffb0-76a7-0141-ebd4-811077bb52ed" completed successfully
2017/11/29 10:14:34.301979 [INFO] client: Not restarting task: extract_vgg_feature for alloc: a841ffb0-76a7-0141-ebd4-811077bb52ed 
2017/11/29 10:14:34.302229 [INFO] client: marking allocation a841ffb0-76a7-0141-ebd4-811077bb52ed for GC
2017/11/29 10:14:35.722916 [WARN] client: failed to broadcast update to allocation "49ecbe4d-54a3-6104-4b0e-675603c01efc"
2017/11/29 10:14:35.726904 [INFO] client: garbage collecting allocation d2742c89-ba05-8c67-1e04-fe3f2e89fc02 due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:35.726994 [INFO] client: marking allocation d2742c89-ba05-8c67-1e04-fe3f2e89fc02 for GC
2017/11/29 10:14:37.227270 [WARN] client: failed to broadcast update to allocation "d2742c89-ba05-8c67-1e04-fe3f2e89fc02"
2017/11/29 10:14:37.236402 [INFO] client: garbage collecting allocation bf78484b-2937-36d5-e47b-09ad3927fdaf due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:37.236494 [INFO] client: marking allocation bf78484b-2937-36d5-e47b-09ad3927fdaf for GC
2017/11/29 10:14:38.736764 [WARN] client: failed to broadcast update to allocation "bf78484b-2937-36d5-e47b-09ad3927fdaf"
2017/11/29 10:14:38.740071 [INFO] client: garbage collecting allocation 5a4a6f37-2e01-263f-39f3-d7bd69580dab due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:38.740395 [INFO] client: garbage collecting allocation ef2f2d55-35f5-965a-6d81-413b37eb803d due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:38.740684 [INFO] client: garbage collecting allocation 9dde0884-d610-fb93-62eb-077aa58eb69e due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:38.740773 [INFO] client: marking allocation 9dde0884-d610-fb93-62eb-077aa58eb69e for GC
2017/11/29 10:14:40.241062 [WARN] client: failed to broadcast update to allocation "9dde0884-d610-fb93-62eb-077aa58eb69e"
2017/11/29 10:14:40.244489 [INFO] client: garbage collecting allocation 49ecbe4d-54a3-6104-4b0e-675603c01efc due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:40.244801 [INFO] client: garbage collecting allocation a841ffb0-76a7-0141-ebd4-811077bb52ed due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:40.244890 [INFO] client: marking allocation a841ffb0-76a7-0141-ebd4-811077bb52ed for GC
2017/11/29 10:14:41.745167 [WARN] client: failed to broadcast update to allocation "a841ffb0-76a7-0141-ebd4-811077bb52ed"
2017/11/29 10:14:41.748710 [INFO] client: garbage collecting allocation d2742c89-ba05-8c67-1e04-fe3f2e89fc02 due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:41.749013 [INFO] client: garbage collecting allocation bf78484b-2937-36d5-e47b-09ad3927fdaf due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:41.749302 [INFO] client: garbage collecting allocation 9dde0884-d610-fb93-62eb-077aa58eb69e due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:41.749576 [INFO] client: garbage collecting allocation a841ffb0-76a7-0141-ebd4-811077bb52ed due to disk usage of 85 is over gc threshold of 80
2017/11/29 10:14:41.749855 [WARN] client: garbage collection due to disk usage of 85 is over gc threshold of 80 skipped because no terminal allocations
2017/11/29 10:15:31.214337 [WARN] client: garbage collection due to disk usage of 85 is over gc threshold of 80 skipped because no terminal allocations
2017/11/29 10:16:31.214330 [WARN] client: garbage collection due to disk usage of 85 is over gc threshold of 80 skipped because no terminal allocations

But I noticed another issue that is very likely related to this one: #3595

In my case, I observed lost allocations, too. I started my Nomad job with a Python program (using python-nomad); here is the status it printed while polling the job summary every 5 seconds:

Workers: running 0, starting 0, queuing 0, lost 0, failed 0, completed 0
Workers: running 54, starting 0, queuing 202, lost 0, failed 0, completed 0
Workers: running 54, starting 0, queuing 202, lost 0, failed 0, completed 0
Workers: running 54, starting 0, queuing 202, lost 0, failed 0, completed 0
Workers: running 54, starting 0, queuing 202, lost 0, failed 0, completed 0
Workers: running 54, starting 0, queuing 202, lost 0, failed 0, completed 0
Workers: running 54, starting 0, queuing 202, lost 0, failed 0, completed 0
Workers: running 54, starting 0, queuing 202, lost 0, failed 0, completed 0
Workers: running 54, starting 0, queuing 201, lost 0, failed 0, completed 1
Workers: running 54, starting 0, queuing 198, lost 0, failed 0, completed 4
Workers: running 54, starting 0, queuing 195, lost 0, failed 0, completed 7
Workers: running 53, starting 1, queuing 181, lost 0, failed 0, completed 21
Workers: running 48, starting 1, queuing 176, lost 5, failed 0, completed 31 <--- notice here
Workers: running 49, starting 0, queuing 167, lost 5, failed 0, completed 40 <--- notice here
Workers: running 49, starting 5, queuing 162, lost 5, failed 0, completed 40 <--- notice here
Workers: running 54, starting 5, queuing 159, lost 0, failed 0, completed 43
Workers: running 53, starting 1, queuing 157, lost 0, failed 0, completed 50
Workers: running 54, starting 0, queuing 156, lost 0, failed 0, completed 51
Workers: running 54, starting 0, queuing 150, lost 0, failed 0, completed 57
Workers: running 54, starting 0, queuing 147, lost 0, failed 0, completed 60
Workers: running 54, starting 0, queuing 142, lost 0, failed 0, completed 65
Workers: running 54, starting 0, queuing 134, lost 0, failed 0, completed 73
Workers: running 53, starting 1, queuing 129, lost 0, failed 0, completed 78
Workers: running 54, starting 0, queuing 124, lost 0, failed 0, completed 83
Workers: running 53, starting 1, queuing 119, lost 0, failed 0, completed 88
Workers: running 54, starting 0, queuing 111, lost 0, failed 0, completed 96
Workers: running 54, starting 0, queuing 108, lost 0, failed 0, completed 99
Workers: running 54, starting 0, queuing 108, lost 0, failed 0, completed 99
Workers: running 53, starting 1, queuing 106, lost 0, failed 0, completed 101
Workers: running 53, starting 1, queuing 101, lost 0, failed 0, completed 106
Workers: running 54, starting 0, queuing 92, lost 0, failed 0, completed 115
Workers: running 54, starting 0, queuing 91, lost 0, failed 0, completed 116
Workers: running 54, starting 0, queuing 85, lost 0, failed 0, completed 122
Workers: running 53, starting 0, queuing 81, lost 0, failed 0, completed 127
Workers: running 54, starting 0, queuing 75, lost 0, failed 0, completed 132
Workers: running 54, starting 0, queuing 73, lost 0, failed 0, completed 134
Workers: running 53, starting 1, queuing 65, lost 0, failed 0, completed 142
Workers: running 53, starting 1, queuing 57, lost 0, failed 0, completed 150
Workers: running 54, starting 0, queuing 52, lost 0, failed 0, completed 155
Workers: running 53, starting 1, queuing 46, lost 0, failed 0, completed 161
Workers: running 54, starting 0, queuing 42, lost 0, failed 0, completed 165
Workers: running 53, starting 1, queuing 40, lost 0, failed 0, completed 167
Workers: running 54, starting 0, queuing 38, lost 0, failed 0, completed 169
Workers: running 53, starting 1, queuing 33, lost 0, failed 0, completed 174
Workers: running 53, starting 0, queuing 28, lost 0, failed 0, completed 180
Workers: running 54, starting 0, queuing 24, lost 0, failed 0, completed 183
Workers: running 54, starting 0, queuing 24, lost 0, failed 0, completed 183
Workers: running 54, starting 0, queuing 18, lost 0, failed 0, completed 189
Workers: running 53, starting 1, queuing 10, lost 0, failed 0, completed 197
Workers: running 47, starting 0, queuing 0, lost 0, failed 0, completed 209
Workers: running 41, starting 0, queuing 0, lost 0, failed 0, completed 215
Workers: running 40, starting 0, queuing 0, lost 0, failed 0, completed 216
Workers: running 34, starting 0, queuing 0, lost 0, failed 0, completed 222
Workers: running 25, starting 0, queuing 0, lost 0, failed 0, completed 231
Workers: running 22, starting 0, queuing 0, lost 0, failed 0, completed 234
Workers: running 17, starting 0, queuing 0, lost 0, failed 0, completed 239
Workers: running 11, starting 0, queuing 0, lost 0, failed 0, completed 245
Workers: running 8, starting 0, queuing 0, lost 0, failed 0, completed 248
Workers: running 3, starting 0, queuing 0, lost 0, failed 0, completed 253
Workers: running 3, starting 0, queuing 0, lost 0, failed 0, completed 253
Workers: running 3, starting 0, queuing 0, lost 0, failed 0, completed 253
Workers: running 3, starting 0, queuing 0, lost 0, failed 0, completed 253
Workers: running 3, starting 0, queuing 0, lost 0, failed 0, completed 253
Workers: running 3, starting 0, queuing 0, lost 0, failed 0, completed 253
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 255
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 255
...

In the end, the number of missing allocs (never scheduled) was also 5. I don't think that is a coincidence.
I suspect it is the same issue as #3595.
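
For reference, the polling loop is essentially the following minimal sketch against Nomad's HTTP job summary endpoint (the real program uses python-nomad; the address and job ID below are placeholders):

import time
import requests

NOMAD_ADDR = "http://127.0.0.1:4646"   # placeholder
JOB_ID = "my-batch-job"                # placeholder

while True:
    summary = requests.get(f"{NOMAD_ADDR}/v1/job/{JOB_ID}/summary").json()["Summary"]
    # Sum the per-task-group counters returned by /v1/job/<id>/summary.
    totals = {k: sum(tg[k] for tg in summary.values())
              for k in ("Running", "Starting", "Queued", "Lost", "Failed", "Complete")}
    print("Workers: running {Running}, starting {Starting}, queuing {Queued}, "
          "lost {Lost}, failed {Failed}, completed {Complete}".format(**totals))
    time.sleep(5)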

@dukeland9
Author

@chelseakomlo Also, my job is CPU-heavy; the client machines run at around 100% CPU utilization while executing it. It's similar to the situation in #3595.

@dukeland9
Author

Any updates on this issue?

@chelseakomlo
Contributor

Hi @dukeland9, thanks for the further log messages. What is the output of nomad node-status and nomad node-status -self when this occurs? Is it similar to what is described in #3595?

@dukeland9
Author

@chelseakomlo I tried nomad node-status and nomad node-status -self, but they just print the list of clients and information about the local machine, respectively. I don't think they are relevant, and they may not be appropriate to post here.

It's also hard to capture the logs at the moment an allocation is lost, since the "lost" status seems transient. As the status polling output above shows, the lost allocations quickly disappeared (I don't understand why), so it's hard to locate the "crime scene". Also, all lost or rescheduled tasks eventually turned to completed status; could that be part of the problem?

I'm wondering whether a similar problem was reported before version 0.7.0. If not, rolling back to 0.6.3 could be a quick workaround for me.

@dukeland9
Author

I found this: #3289 (comment)
After setting heartbeat_grace to 30s, it seems that no allocation is lost unexpectedly anymore. I can stop manually rerunning the problematic jobs for now; it's a workable workaround for me.
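
Concretely, that is the heartbeat_grace option in the server stanza of the Nomad agent configuration; a minimal sketch, not my exact config file:

# Nomad server agent config (HCL); heartbeat_grace defaults to 10s.
server {
  enabled         = true
  heartbeat_grace = "30s"
}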

But I think:

  1. Allocations that are never scheduled are definitely a bug, so please fix it. I love this product.
  2. Maybe the robustness of the server/client agent communication can be improved when CPU utilization is very high.

@dadgar
Contributor

dadgar commented Dec 7, 2017

@dukeland9 Can you output the results of curl http://127.0.0.1:4646/v1/job/<bad_job>/allocations? I would suggest running the app under exec so that you get some resource isolation as well but that is tangential.
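
For example, a quick sketch to spot reused allocation names in that output (the address and job name below are placeholders):

import collections
import requests

NOMAD_ADDR = "http://127.0.0.1:4646"   # placeholder
JOB_ID = "bad_job"                     # placeholder: replace with the affected job

allocs = requests.get(f"{NOMAD_ADDR}/v1/job/{JOB_ID}/allocations").json()
# Each allocation stub has a Name like "job.group[index]" and a unique ID.
counts = collections.Counter(a["Name"] for a in allocs)
for name, n in sorted(counts.items()):
    if n > 1:
        print(f"{name} appears {n} times")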

@dukeland9
Author

@dadgar I reproduced the bug again by setting the allocation count to 500.
My status polling output looked like this:

Workers: running 0, starting 0, queuing 0, lost 0, failed 0, completed 0
Workers: running 60, starting 0, queuing 440, lost 0, failed 0, completed 0
Workers: running 60, starting 0, queuing 440, lost 0, failed 0, completed 0
Workers: running 60, starting 0, queuing 440, lost 0, failed 0, completed 0
Unable to track job status: no response. <--- Mark I: The server went down temporarily.
Unable to track job status: no response.
Unable to track job status: no response.
Unable to track job status: no response.
Unable to track job status: no response.
Unable to track job status: no response.
Unable to track job status: no response.
Unable to track job status: no response.
Unable to track job status: no response.
Workers: running 53, starting 7, queuing 383, lost 3, failed 0, completed 58 <--- Mark II: Some of the clients went down temporarily
Workers: running 51, starting 10, queuing 377, lost 3, failed 0, completed 63
Workers: running 54, starting 7, queuing 377, lost 3, failed 0, completed 63
Workers: running 57, starting 4, queuing 376, lost 3, failed 0, completed 64
Workers: running 56, starting 5, queuing 366, lost 3, failed 0, completed 74
Workers: running 57, starting 3, queuing 363, lost 3, failed 0, completed 78
Workers: running 57, starting 3, queuing 357, lost 3, failed 0, completed 84
Workers: running 57, starting 3, queuing 357, lost 3, failed 0, completed 84
Workers: running 57, starting 0, queuing 360, lost 6, failed 0, completed 84
Workers: running 57, starting 0, queuing 358, lost 6, failed 0, completed 86
Workers: running 57, starting 0, queuing 346, lost 6, failed 0, completed 98
Workers: running 57, starting 0, queuing 336, lost 6, failed 0, completed 108
Workers: running 57, starting 0, queuing 330, lost 6, failed 0, completed 114
Workers: running 55, starting 2, queuing 322, lost 6, failed 0, completed 122
Workers: running 54, starting 3, queuing 317, lost 6, failed 0, completed 127
Workers: running 57, starting 0, queuing 312, lost 6, failed 0, completed 132
Workers: running 57, starting 0, queuing 308, lost 6, failed 0, completed 136
Workers: running 57, starting 0, queuing 306, lost 6, failed 0, completed 138
Workers: running 57, starting 0, queuing 305, lost 6, failed 0, completed 139
Workers: running 57, starting 0, queuing 302, lost 6, failed 0, completed 142
Workers: running 54, starting 3, queuing 292, lost 6, failed 0, completed 152
Workers: running 58, starting 0, queuing 286, lost 5, failed 0, completed 158
Workers: running 57, starting 0, queuing 284, lost 5, failed 0, completed 161
Workers: running 57, starting 0, queuing 280, lost 3, failed 0, completed 167
Workers: running 56, starting 4, queuing 261, lost 3, failed 0, completed 180
Workers: running 57, starting 3, queuing 255, lost 3, failed 0, completed 186
Workers: running 57, starting 3, queuing 252, lost 3, failed 0, completed 189
Workers: running 57, starting 3, queuing 245, lost 3, failed 0, completed 196
Workers: running 59, starting 0, queuing 242, lost 0, failed 0, completed 203 <--- Mark III: All failed clients/allocations recovered?
Workers: running 58, starting 2, queuing 234, lost 0, failed 0, completed 210
Workers: running 60, starting 0, queuing 226, lost 0, failed 0, completed 218
Workers: running 59, starting 0, queuing 221, lost 0, failed 0, completed 224
Workers: running 60, starting 0, queuing 218, lost 0, failed 0, completed 226
Workers: running 57, starting 3, queuing 213, lost 0, failed 0, completed 231
Workers: running 57, starting 3, queuing 208, lost 0, failed 0, completed 236
Workers: running 57, starting 3, queuing 201, lost 0, failed 0, completed 243
Workers: running 54, starting 6, queuing 194, lost 0, failed 0, completed 250
Workers: running 59, starting 1, queuing 191, lost 0, failed 0, completed 253
Workers: running 58, starting 2, queuing 187, lost 0, failed 0, completed 257
Workers: running 58, starting 2, queuing 181, lost 0, failed 0, completed 263
Workers: running 58, starting 2, queuing 177, lost 0, failed 0, completed 267
Workers: running 54, starting 6, queuing 168, lost 0, failed 0, completed 276
Workers: running 55, starting 5, queuing 159, lost 0, failed 0, completed 285
Workers: running 55, starting 4, queuing 154, lost 0, failed 0, completed 291
Workers: running 55, starting 5, queuing 147, lost 0, failed 0, completed 297
Workers: running 55, starting 5, queuing 143, lost 0, failed 0, completed 301
Workers: running 57, starting 3, queuing 141, lost 0, failed 0, completed 303
Workers: running 57, starting 3, queuing 138, lost 0, failed 0, completed 306
Workers: running 60, starting 0, queuing 131, lost 0, failed 0, completed 313
Workers: running 59, starting 1, queuing 125, lost 0, failed 0, completed 319
Workers: running 57, starting 3, queuing 118, lost 0, failed 0, completed 326
Workers: running 58, starting 2, queuing 109, lost 0, failed 0, completed 335
Workers: running 58, starting 2, queuing 99, lost 0, failed 0, completed 345
Workers: running 58, starting 2, queuing 94, lost 0, failed 0, completed 350
Workers: running 58, starting 1, queuing 90, lost 0, failed 0, completed 355
Workers: running 59, starting 1, queuing 85, lost 0, failed 0, completed 359
Workers: running 60, starting 0, queuing 83, lost 0, failed 0, completed 361
Workers: running 60, starting 0, queuing 80, lost 0, failed 0, completed 364
Workers: running 60, starting 0, queuing 73, lost 0, failed 0, completed 371
Workers: running 60, starting 0, queuing 65, lost 0, failed 0, completed 379
Workers: running 60, starting 0, queuing 57, lost 0, failed 0, completed 387
Workers: running 60, starting 0, queuing 55, lost 0, failed 0, completed 389
Workers: running 57, starting 2, queuing 44, lost 0, failed 0, completed 401
Workers: running 58, starting 2, queuing 34, lost 0, failed 0, completed 410
Workers: running 59, starting 1, queuing 24, lost 0, failed 0, completed 420
Workers: running 59, starting 1, queuing 19, lost 0, failed 0, completed 425
Workers: running 59, starting 1, queuing 12, lost 0, failed 0, completed 432
Workers: running 58, starting 2, queuing 8, lost 0, failed 0, completed 436
Workers: running 58, starting 1, queuing 0, lost 0, failed 0, completed 442
Workers: running 57, starting 0, queuing 0, lost 0, failed 0, completed 444
Workers: running 55, starting 0, queuing 0, lost 0, failed 0, completed 446
Workers: running 48, starting 0, queuing 0, lost 0, failed 0, completed 453
Workers: running 46, starting 0, queuing 0, lost 0, failed 0, completed 455
Workers: running 38, starting 0, queuing 0, lost 0, failed 0, completed 463
Workers: running 29, starting 0, queuing 0, lost 0, failed 0, completed 472
Workers: running 15, starting 0, queuing 0, lost 0, failed 0, completed 486
Workers: running 8, starting 0, queuing 0, lost 0, failed 0, completed 493
Workers: running 3, starting 0, queuing 0, lost 0, failed 0, completed 498
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 500 <--- Mark IV: allocation information was saved, and job was manually killed.

The "story" went like this:
1: The job started to run and perhaps due to the high cpu load, the server (the one I was polling. I had three servers) and some of the clients stopped responding for a moment (as Mark I and Mark II showing)
2: Lost allocations magically recovered? (as Mark III showing)
3: Nomad thought all the allocations were completed (except allocation 0 -- I use allocation 0 as a cleanup worker), but in fact Allocation 493 - 499 were never scheduled (as below). I saved the allocation information and killed the job (as Mark IV showing). The allocation information is in the attachment.
[Screenshot: allocation list showing allocs 493-499 never scheduled]

allocations.txt

@dukeland9
Author

Another, simpler story: this time I increased the CPU resource requirement a bit, so the parallelism went down a bit. No server or client stopped responding, even temporarily. However, allocs 498 and 499 were never scheduled.

The status polling output:

Workers: running 0, starting 0, queuing 0, lost 0, failed 0, completed 0
Workers: running 41, starting 7, queuing 452, lost 0, failed 0, completed 0
Workers: running 44, starting 4, queuing 452, lost 0, failed 0, completed 0
Workers: running 44, starting 4, queuing 452, lost 0, failed 0, completed 0
Workers: running 48, starting 0, queuing 452, lost 0, failed 0, completed 0
Workers: running 48, starting 0, queuing 452, lost 0, failed 0, completed 0
Workers: running 48, starting 0, queuing 452, lost 0, failed 0, completed 0
Workers: running 46, starting 2, queuing 446, lost 0, failed 0, completed 6
Workers: running 47, starting 0, queuing 440, lost 0, failed 0, completed 13
Workers: running 44, starting 4, queuing 432, lost 0, failed 0, completed 20
Workers: running 38, starting 10, queuing 420, lost 0, failed 0, completed 32
Workers: running 45, starting 3, queuing 414, lost 0, failed 0, completed 38
Workers: running 46, starting 2, queuing 413, lost 0, failed 0, completed 39
Workers: running 44, starting 4, queuing 408, lost 0, failed 0, completed 44
Workers: running 44, starting 4, queuing 405, lost 0, failed 0, completed 47
Workers: running 44, starting 4, queuing 399, lost 0, failed 0, completed 53
Workers: running 43, starting 5, queuing 393, lost 0, failed 0, completed 59
Workers: running 43, starting 5, queuing 388, lost 0, failed 0, completed 64
Workers: running 44, starting 4, queuing 386, lost 0, failed 0, completed 66
Workers: running 40, starting 8, queuing 376, lost 0, failed 0, completed 76
Workers: running 41, starting 7, queuing 373, lost 0, failed 0, completed 79
Workers: running 42, starting 6, queuing 366, lost 0, failed 0, completed 86
Workers: running 41, starting 7, queuing 357, lost 0, failed 0, completed 95
Workers: running 45, starting 3, queuing 353, lost 0, failed 0, completed 99
Workers: running 44, starting 4, queuing 352, lost 0, failed 0, completed 100
Workers: running 44, starting 4, queuing 347, lost 0, failed 0, completed 105
Workers: running 48, starting 0, queuing 345, lost 0, failed 0, completed 107
Workers: running 46, starting 2, queuing 341, lost 0, failed 0, completed 111
Workers: running 47, starting 1, queuing 335, lost 0, failed 0, completed 117
Workers: running 43, starting 5, queuing 320, lost 0, failed 0, completed 132
Workers: running 43, starting 5, queuing 318, lost 0, failed 0, completed 134
Workers: running 48, starting 0, queuing 318, lost 0, failed 0, completed 134
Workers: running 46, starting 2, queuing 310, lost 0, failed 0, completed 142
Workers: running 48, starting 0, queuing 306, lost 0, failed 0, completed 146
Workers: running 44, starting 4, queuing 298, lost 0, failed 0, completed 154
Workers: running 41, starting 7, queuing 288, lost 0, failed 0, completed 164
Workers: running 42, starting 6, queuing 282, lost 0, failed 0, completed 170
Workers: running 38, starting 10, queuing 277, lost 0, failed 0, completed 175
Workers: running 41, starting 7, queuing 273, lost 0, failed 0, completed 179
Workers: running 42, starting 6, queuing 270, lost 0, failed 0, completed 182
Workers: running 42, starting 6, queuing 266, lost 0, failed 0, completed 186
Workers: running 44, starting 4, queuing 263, lost 0, failed 0, completed 189
Workers: running 44, starting 4, queuing 257, lost 0, failed 0, completed 195
Workers: running 44, starting 4, queuing 251, lost 0, failed 0, completed 201
Workers: running 48, starting 0, queuing 244, lost 0, failed 0, completed 208
Workers: running 48, starting 0, queuing 237, lost 0, failed 0, completed 215
Workers: running 44, starting 4, queuing 229, lost 0, failed 0, completed 223
Workers: running 48, starting 0, queuing 225, lost 0, failed 0, completed 227
Workers: running 48, starting 0, queuing 223, lost 0, failed 0, completed 229
Workers: running 47, starting 1, queuing 216, lost 0, failed 0, completed 236
Workers: running 45, starting 3, queuing 212, lost 0, failed 0, completed 240
Workers: running 44, starting 4, queuing 208, lost 0, failed 0, completed 244
Workers: running 45, starting 3, queuing 202, lost 0, failed 0, completed 250
Workers: running 44, starting 4, queuing 194, lost 0, failed 0, completed 258
Workers: running 45, starting 3, queuing 189, lost 0, failed 0, completed 263
Workers: running 41, starting 6, queuing 185, lost 0, failed 0, completed 268
Workers: running 44, starting 4, queuing 175, lost 0, failed 0, completed 277
Workers: running 42, starting 6, queuing 171, lost 0, failed 0, completed 281
Workers: running 42, starting 6, queuing 169, lost 0, failed 0, completed 283
Workers: running 41, starting 7, queuing 162, lost 0, failed 0, completed 290
Workers: running 40, starting 8, queuing 155, lost 0, failed 0, completed 297
Workers: running 45, starting 3, queuing 146, lost 0, failed 0, completed 306
Workers: running 47, starting 1, queuing 141, lost 0, failed 0, completed 311
Workers: running 47, starting 1, queuing 135, lost 0, failed 0, completed 317
Workers: running 44, starting 4, queuing 132, lost 0, failed 0, completed 320
Workers: running 47, starting 1, queuing 131, lost 0, failed 0, completed 321
Workers: running 46, starting 2, queuing 128, lost 0, failed 0, completed 324
Workers: running 46, starting 2, queuing 125, lost 0, failed 0, completed 327
Workers: running 39, starting 9, queuing 114, lost 0, failed 0, completed 338
Workers: running 38, starting 10, queuing 101, lost 0, failed 0, completed 351
Workers: running 41, starting 7, queuing 94, lost 0, failed 0, completed 358
Workers: running 44, starting 4, queuing 91, lost 0, failed 0, completed 361
Workers: running 42, starting 6, queuing 86, lost 0, failed 0, completed 366
Workers: running 39, starting 9, queuing 79, lost 0, failed 0, completed 373
Workers: running 46, starting 2, queuing 77, lost 0, failed 0, completed 375
Workers: running 47, starting 1, queuing 75, lost 0, failed 0, completed 377
Workers: running 47, starting 1, queuing 72, lost 0, failed 0, completed 380
Workers: running 47, starting 1, queuing 69, lost 0, failed 0, completed 383
Workers: running 37, starting 10, queuing 58, lost 0, failed 0, completed 395
Workers: running 41, starting 7, queuing 47, lost 0, failed 0, completed 405
Workers: running 44, starting 3, queuing 43, lost 0, failed 0, completed 410
Workers: running 39, starting 7, queuing 37, lost 0, failed 0, completed 417
Workers: running 47, starting 1, queuing 30, lost 0, failed 0, completed 422
Workers: running 47, starting 1, queuing 26, lost 0, failed 0, completed 426
Workers: running 47, starting 1, queuing 24, lost 0, failed 0, completed 428
Workers: running 48, starting 0, queuing 19, lost 0, failed 0, completed 433
Workers: running 47, starting 1, queuing 16, lost 0, failed 0, completed 436
Workers: running 44, starting 4, queuing 11, lost 0, failed 0, completed 441
Workers: running 44, starting 4, queuing 1, lost 0, failed 0, completed 451
Workers: running 42, starting 0, queuing 0, lost 0, failed 0, completed 458
Workers: running 33, starting 0, queuing 0, lost 0, failed 0, completed 467
Workers: running 31, starting 0, queuing 0, lost 0, failed 0, completed 469
Workers: running 28, starting 0, queuing 0, lost 0, failed 0, completed 472
Workers: running 20, starting 0, queuing 0, lost 0, failed 0, completed 480
Workers: running 17, starting 0, queuing 0, lost 0, failed 0, completed 483
Workers: running 13, starting 0, queuing 0, lost 0, failed 0, completed 487
Workers: running 6, starting 0, queuing 0, lost 0, failed 0, completed 494
Workers: running 4, starting 0, queuing 0, lost 0, failed 0, completed 496
Workers: running 4, starting 0, queuing 0, lost 0, failed 0, completed 496
Workers: running 4, starting 0, queuing 0, lost 0, failed 0, completed 496
Workers: running 2, starting 0, queuing 0, lost 0, failed 0, completed 498
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499
Workers: running 1, starting 0, queuing 0, lost 0, failed 0, completed 499

The allocation information is in the attachment.
allocations2.txt

@dadgar
Contributor

dadgar commented Dec 8, 2017

@dukeland9 I looked through the allocations you gave (thanks!), and I think there may be some confusion. All the allocations did get scheduled, but some of the names were reused. The scheduler does not guarantee that the name will always be unique, only the ID. However, the scheduler does make a best effort to keep names unique.

I would be curious to see your server logs.

@dadgar changed the title from "Allocations were not appropriately started when # of allocs > ~200" to "Under high load, repeated use of allocation name" on Dec 8, 2017
@dukeland9
Author

@dadgar It's great that we've finally found out what the problem is.

There are at least three identifiers associated with an alloc: NOMAD_ALLOC_ID, NOMAD_ALLOC_NAME, and NOMAD_ALLOC_INDEX. I'm not sure whether the NAME should be reused, but the INDEX, like the ID, should not be, in my opinion.

One thing I like about Nomad is that the alloc index is passed to each instance, so instances can be "heterogeneous". I think it's an important feature. In my case, each shard worker uses the alloc index to identify the part of the work it is responsible for. If the index can be reused, my system does not work.

Logically, there should never be two allocations with the same index.
So could you guarantee a unique index for each alloc?

@dadgar
Contributor

dadgar commented Dec 11, 2017

@dukeland9 That is a guarantee we used to make, but it placed constraints on the features the scheduler could implement, so we had to remove it. The scheduler makes a best effort to keep indexes unique, but it is not a guarantee. If possible, I would like to see your logs, because it seems like the scheduler should have been able to keep the alloc indexes correct.

@dukeland9
Author

@dadgar Reproduced it again. Allocs 498 and 499 were never scheduled, once again.
The alloc information and server log during the execution are in the attachment.

BTW, what was the last version that kept the guarantee of unique alloc indexes? I may want to use that version, because my system depends on the assumption that alloc indices are unique.

allocations3.txt
server_output.txt

@dadgar
Contributor

dadgar commented Dec 15, 2017

@dukeland9 That log output is from the client; can you share the server logs?

@dukeland9
Author

@dadgar I used the log from one of the servers (it's also a client), so maybe I grabbed the wrong log. How do I get the server log?

@dadgar
Contributor

dadgar commented Dec 18, 2017

@dukeland9 We are fairly sure we have the fix with #3669. Would you be willing to test a build and see if it resolves the issue?

@dukeland9
Author

@dadgar Great!! I'm happy to try that! Would you please provide a binary in the releases? I'm not very familiar with Go, so I'm afraid I might get something wrong if I build it myself.

@dadgar
Contributor

dadgar commented Dec 19, 2017

@dukeland9 Here is a Linux AMD64 build! Thanks for testing!

nomad.zip

@dukeland9
Author

@dadgar I restarted the servers with the binary you provided and tested two large jobs (with 500 and 1000 allocs, respectively), and the problem did not happen! It looks like the problem is fixed. Thanks a lot! I really appreciate it!

@preetapan
Contributor

@dukeland9 Thanks a lot for trying out this binary and verifying!

@dadgar
Contributor

dadgar commented Dec 19, 2017

@dukeland9 Yep thanks for testing! Will be fixed in 0.7.1

@dadgar dadgar closed this as completed Dec 19, 2017
@dukeland9
Author

@dadgar I'm sorry, the problem happened again :(
The allocation information is attached.
allocations_0.7.1.txt

@dukeland9
Author

Would the fix fail if server leadership shifts to another node?

@dadgar
Contributor

dadgar commented Dec 20, 2017

@dukeland9 What was the count in that job? It looks like node d1a6fe56-af17-eb7f-9923-99f4a6eee60d was marked as down momentarily, and Nomad therefore replaced the allocations on it. All 3 allocs that got duplicates were on that node. This seems like expected behavior and is different from what we were seeing before.

@dukeland9 What machine types are you running on? What does their CPU/network usage look like when you are running this test?

@dukeland9
Author

@dadgar Thanks for further looking into this issue.

There are about 30 machines with nearly 80 cores in my Nomad cluster. The OSes are Ubuntu 14.04 or 16.04. The job I was running was very CPU intensive: 100% CPU utilization while running.

I previously used three machines as both servers and clients. I have now made them servers only, so that the servers are not resource-exhausted while jobs are running. It looks fine now.

I expect that each alloc (identified by its index) is successfully scheduled exactly once under all circumstances:

  • No alloc (identified by its index) is ever left unscheduled.
  • No alloc (identified by its index) is successfully scheduled two or more times in one job.

@github-actions

github-actions bot commented Dec 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022