Under high load, repeated use of allocation name #3593
Hi, thanks for reporting this issue. Would you mind posting logs (server and/or client) from when this issue occurs?
@chelseakomlo Thanks for replying! I'm not sure the server/client logs will help much. Here is part of the log the server printed while executing my job:
But I noticed another issue which is very likely related to this one: #3595. In my case, I observed allocations being lost, too. I start my Nomad job from a Python program (using python-nomad), and here is the status my program printed (polling the job summary every 5 seconds):
In the end, the number of missing allocs (never scheduled) was also 5. I don't think that was a coincidence.
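For context, a minimal sketch of a polling loop like the one described above, using Nomad's HTTP job-summary endpoint directly (the reporter's actual script uses python-nomad, which wraps the same API). The job name and agent address are illustrative assumptions:

```python
import time
import requests

NOMAD_ADDR = "http://127.0.0.1:4646"   # assumed local agent address
JOB_ID = "shard-job"                   # hypothetical job name

while True:
    # GET /v1/job/:job_id/summary returns per-task-group allocation counts.
    summary = requests.get(f"{NOMAD_ADDR}/v1/job/{JOB_ID}/summary").json()
    for group, counts in summary.get("Summary", {}).items():
        print(group,
              "queued:", counts.get("Queued", 0),
              "running:", counts.get("Running", 0),
              "complete:", counts.get("Complete", 0),
              "lost:", counts.get("Lost", 0))
    time.sleep(5)
```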
@chelseakomlo Also, my job is CPU-heavy. The client machines reach around 100% CPU utilization while executing it, much like the situation in #3595.
Any updates on this issue?
Hi @dukeland9, thanks for the further log messages. What is the output of nomad node-status?
@chelseakomlo I tried nomad node-status and nomad node-status -self, but they just print the list of clients and the information about the machine, respectively. I don't think those are relevant, so it may not be appropriate to post them here. It's also hard to catch the logs from when an allocation is lost, since the "lost" status seems transient. As seen in the status-polling log I provided above, the lost allocations quickly disappeared (which I don't understand), so it's hard to locate the "crime scene". Also, it seems that all lost or rescheduled tasks ended up in the completed status; would that be a problem? I'm wondering whether a similar problem was reported before version 0.7.0. If not, I can roll back to 0.6.3 as a quick workaround.
I found this: #3289 (comment). But I think:
@dukeland9 Can you output the results of
@dadgar I reproduced the bug again by setting the allocation number to 500.
The "story" went like this: |
Another, simpler story: this time I increased the CPU resource requirement a bit, so the parallelism went down a bit. No server or client became temporarily unresponsive. However, allocs 498 and 499 were never scheduled. The status-polling output:
The allocation information is in the attachment.
@dukeland9 I looked through the allocations you gave (thanks!). I think there may be some confusion. All the allocations did get scheduled, but some of the names were reused. The scheduler does not guarantee that the name will always be unique, only the ID; however, it does make a best effort to keep names unique. I would be curious to see your server logs.
@dadgar It's great that we finally found what the problem is. I think there are at least three things associated with an alloc: NOMAD_ALLOC_ID, NOMAD_ALLOC_NAME, and NOMAD_ALLOC_INDEX. I'm not sure whether the NAME should be reused, but the INDEX, like the ID, should not be, in my opinion. One thing I like about Nomad is that it passes the alloc index to each instance, so instances can be "heterogeneous". I think that's an important feature. In my case, each shard worker uses the alloc index to identify the part of the work it is responsible for. If the index can be reused, my system does not work. Logically, there should never be two allocations with the same index.
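A minimal sketch of the sharding pattern described above: each task instance reads NOMAD_ALLOC_INDEX to decide which slice of the input it owns. The shard count and data-splitting logic are illustrative assumptions, not the reporter's actual code:

```python
import os

TOTAL_SHARDS = 500                           # assumed to match the job's count
index = int(os.environ["NOMAD_ALLOC_INDEX"])  # injected by Nomad into each task

work_items = list(range(100_000))            # placeholder for the real dataset
my_items = work_items[index::TOTAL_SHARDS]   # simple round-robin slice by index

print(f"alloc {index}/{TOTAL_SHARDS} processing {len(my_items)} items")
```

Under this scheme, a reused index means two allocations process the same slice while some other slice is never processed at all, which is exactly the failure mode being reported.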
@dukeland9 That was a guarantee we used to make, but it placed constraints on the features the scheduler could implement, so we had to remove it. The scheduler makes a best effort to keep indexes unique, but it is not a guarantee. If possible I would like to see your logs, because it seems like it should have been able to keep the alloc index correct.
@dadgar Reproduced it again. Allocs 498 and 499 were never scheduled once more. BTW, what is the last version that kept the guarantee of unique alloc indexes? I may want to use that version, because my system depends on the assumption that alloc indices are unique.
@dukeland9 The log output is from the client; can you share the server logs?
@dadgar I used the log of one of the servers (it's also a client), so maybe I got the wrong log. How do I get the server log?
@dukeland9 We are fairly sure we have the fix with #3669. Would you be willing to test a build and see if it resolves the issue?
@dadgar Great!! I'm happy to try that! Would you please provide a binary in the releases? I'm not very familiar with Go, so I'm afraid I might get something wrong if I build it myself.
@dukeland9 Here is a Linux AMD64 build! Thanks for testing!
@dadgar I restarted the servers with the binary you provided and tested two large jobs (with 500 and 1000 allocs respectively), and the problem did not happen! I guess we have fixed it! Thanks a lot, I really appreciate it!
@dukeland9 Thanks a lot for trying out this binary and verifying!
@dukeland9 Yep, thanks for testing! Will be fixed in 0.7.1.
@dadgar I'm sorry, the problem happened again :(
Would the solution fail if the server leadership shifts?
@dukeland9 What was the count in that job? It looks like node

@dukeland9 What are the machine types you are running on? What does their CPU/network usage look like when you are running this test?
@dadgar Thanks for looking further into this issue. There are about 30 machines with nearly 80 cores in my Nomad cluster. The OSes were Ubuntu 14.04 or 16.04. The job I was running was very CPU intensive, at 100% CPU utilization while running. I had three machines acting as both servers and clients; now I let them be servers only, so the servers won't be resource-exhausted while the jobs are running, and it looks fine so far. I expect that each alloc (identified by its index) is successfully scheduled exactly once under all circumstances:
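A rough sketch of one way to check the expectation above: list the job's allocations via the HTTP API and flag reused or missing indices. The job name, agent address, and expected count are illustrative assumptions; note that rescheduled allocations legitimately reuse a name, so a count greater than one is not always the bug described here:

```python
from collections import Counter
import requests

NOMAD_ADDR = "http://127.0.0.1:4646"
JOB_ID = "shard-job"
EXPECTED_COUNT = 500

# GET /v1/job/:job_id/allocations returns stubs whose Name looks like "job.group[index]".
allocs = requests.get(f"{NOMAD_ADDR}/v1/job/{JOB_ID}/allocations").json()
indices = Counter(a["Name"].rsplit("[", 1)[1].rstrip("]") for a in allocs)

reused = {i: n for i, n in indices.items() if n > 1}
missing = set(map(str, range(EXPECTED_COUNT))) - set(indices)
print("reused indices:", reused or "none")
print("missing indices:", sorted(missing, key=int) or "none")
```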
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
Nomad version
0.7.0
Operating system and Environment details
Ubuntu 14.04 (some of the client machines are Ubuntu 16.04)
Issue
I'm using nomad to schedule distributed batch jobs. The cluster has ~30 machines with about 80 cores.
If I divide the job into about 100 tasks, the system works fine. However, if the number of tasks is increased to around 200, some of the tasks (allocations) are never started.
For example, one job I started was divided into 256 tasks (allocs); however, tasks (allocs) No. 251-255 were never scheduled (as shown below). On the other hand, the job did report 256 successfully completed tasks, but my batch job depends on the alloc number (0-255) to assign the right data slice to process, so tasks that never start are not acceptable.
The jobs are of raw_exec type.
Reproduction steps
The issue can always be reproduced when the number of allocations is increased to 200 or more. About 6-8 allocs will never be started.
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Job file (if appropriate)