[feature request or bug] Guarantee unique allocation index #3698
Comments
@dukeland9 we are back after the break and looking at this again. We'd like to confirm one thing - in your job specification, did you ask for 100 allocs? It would be helpful if you posted the job specification here.
@dukeland9 After some more investigation, we found that there are two different things happening here:

1. Reusing the same alloc index on a lost allocation that was replaced - that is expected behavior. Nomad uses indexes from 0 to desired_count - 1. When one of those allocations needs to be replaced, like 19, 25, and 36 in your example (because the nodes they were on lost their connection), the scheduler reuses that alloc index to create the replacement.
2. Creating the right number of allocations - we did find a bug in how we count whether a batch job was successfully allocated. This resulted in the scheduler not creating allocations with indexes 97, 98, and 99 to reach the desired total count of 100. The bug was that it incorrectly counted the replaced allocations (19, 25, 36) against the total number of desired running allocations (100).

We have a fix for this and will comment shortly with a binary to test it. Also noting that this bug is a rare edge case: it only happens when a large enough batch is requested in a CPU-contended environment and allocations are lost before the entire initial set of placements has been made. Thanks once again for stress testing this in your environment.
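For context, a minimal sketch of the kind of batch job specification being discussed, where the group's `count` is the desired total and each allocation should get a distinct index from 0 to count - 1. The job, group, and task names and the exec driver below are illustrative assumptions, not taken from the original report:

```hcl
# Hypothetical batch job asking for 100 allocations; alloc indexes should
# run from 0 to 99 with no duplicates.
job "batch-example" {
  datacenters = ["dc1"]
  type        = "batch"

  group "workers" {
    count = 100  # the desired_count referenced in the comment above

    task "work" {
      driver = "exec"

      config {
        command = "/bin/sh"
        # NOMAD_ALLOC_INDEX is the per-allocation index discussed in this issue.
        args    = ["-c", "echo processing shard ${NOMAD_ALLOC_INDEX}"]
      }
    }
  }
}
```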
@preetapan Thank you for investigating this issue!
@dadgar I tried to run your binary, but it failed with "error while loading shared libraries: liblxc.so.1: cannot open shared object file: No such file or directory". Could you confirm that the binary was compiled correctly? I don't think we had an lxc dependency before.
@dukeland9 can you try this binary? I tried building one for you on my Linux box and verified that it does not depend on liblxc.so (I added some output from ldd below that shows this).
@preetapan I can't access your file from Amazon S3 because it is blocked in China. Would you please provide one on GitHub like dadgar did? Thanks a lot. BTW, I only have to update the binaries on the servers, right?
@preetapan Never mind. I managed to build one myself. I replaced the binary on the servers and tested running two jobs in a node-draining situation. The system seemed to work correctly.
The system has been running as expected for several days. Thanks to @preetapan and @dadgar for fixing this!
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
This is a feature request or bug report following up on this issue: #3593
Since that issue, duplicate allocation indices still appear from time to time.
For example, in the following job:
https://github.com/hashicorp/nomad/files/1574186/allocations_0.7.1.txt
Allocations 36, 19, and 25 were scheduled twice.
If we can't guarantee the uniqueness of the alloc index, why bother offering such variable interpolation on https://www.nomadproject.io/docs/runtime/interpolation.html? It's very misleading and useless.
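For illustration, a hedged fragment of a job's group stanza showing the kind of usage that depends on the alloc index being unique; the env variable name and command are hypothetical:

```hcl
group "shards" {
  count = 100

  task "worker" {
    driver = "exec"

    # The task assumes NOMAD_ALLOC_INDEX is unique within the group, e.g. to
    # pick which partition of the input to process. A duplicated index means
    # two allocations process the same partition while another is skipped.
    env {
      SHARD_ID = "${NOMAD_ALLOC_INDEX}"
    }

    config {
      command = "/usr/local/bin/process-shard"
    }
  }
}
```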