Runner died during the build #118
Comments
No idea what's causing this, but I've been trying to track it via #94 as well. I've seen it intermittently. One earlier thought was that when we redeploy ARC runners, the old ones might get killed even while they are still in use. However, I think this report debunks that theory: while we deploy to Canary often, Prod and Vanguard haven't been deployed since last week, and the jobs listed here ran on Prod/Vanguard. Not sure if this is interesting or not, but I noticed all of those failed jobs seem to have run in Vanguard. @jeanschmidt any ideas?
I was discussing the failures with @zxiiro. They are all from the Vanguard clusters and happened very close together in time, which points to a Vanguard cluster dying, scaling down forcefully, or hitting some other failure. No official release was initiated this time, but it could have been a deployment that was started locally and then canceled around that time. We also can't rule out that a normal deployment can trigger this. We will keep an eye on the eviction pattern; please keep us aware of issues as they happen so we can debug and fix them as soon as possible. In the meantime we will be improving our monitoring to proactively detect more failure modes.
After investigating, it seems we make heavy use of spot instances, which by definition can be reclaimed at any time (that is why they are sold at a discount). Given the long-lived nature of our jobs, it seems reasonable to migrate to on-demand instances only and pay the extra price for the guarantee.
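For reference, a minimal sketch of what such a change could look like if the nodes are provisioned by Karpenter: restricting a NodePool to on-demand capacity via the `karpenter.sh/capacity-type` requirement. The NodePool name and everything else here is illustrative only (the real config also needs a `nodeClassRef`, resource limits, etc.), not the actual cluster setup discussed in this issue.

```yaml
# Hypothetical Karpenter NodePool sketch: only request on-demand capacity
# so runner nodes cannot be reclaimed the way spot instances can.
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gh-runner-nodes        # hypothetical name
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]   # omit "spot" entirely
      # nodeClassRef, instance-type requirements, taints, etc. omitted
```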
This latest example is from production, so I guess we can drop the Vanguard hypothesis.
This signature is quite different, but the signal feels suspicious, as if the container was killed: https://github.com/pytorch/pytorch/actions/runs/8727615485/job/23945549849
We got our quota grant. With it, I deployed the instance request type changes to production. Hopefully this will handle the situation and solve the problem we're facing. Let's continue to keep an eye on it.
I saw this one in one of my jobs yesterday too. https://github.com/pytorch/pytorch-canary/actions/runs/8710434864/job/23934297083?pr=208 |
Forwarding recent issues here:
Those failures look to me like they are still victims of the issue that got fixed in pytorch/test-infra#5102, though I can't rule out other issues causing similar problems. Thanks for looking into it and reporting the problems you are seeing, but I believe we benefit more from having a single open issue tracking this.
@zxiiro found another potential solution: it is possible to keep Karpenter from evicting pods by annotating them as long-running and non-evictable (see the sketch below). We're experimenting with that as well at the moment.
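A minimal sketch of that annotation, assuming a recent Karpenter release: pods carrying `karpenter.sh/do-not-disrupt: "true"` are excluded from voluntary disruption such as consolidation (older Karpenter releases used `karpenter.sh/do-not-evict` instead). In ARC this would go on the runner pod template; the pod name and image below are purely illustrative, not the actual runner spec used here.

```yaml
# Hypothetical runner pod sketch: the annotation tells Karpenter not to
# voluntarily disrupt the node while this pod is running.
apiVersion: v1
kind: Pod
metadata:
  name: example-runner                      # hypothetical
  annotations:
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest   # illustrative image/tag
```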
@zxiiro are we confident we can close this issue? I haven't seen canceled jobs since we migrated to on-demand and added the non-evict annotations for Karpenter.
I'm not sure I want to say it's resolved, since my understanding is that the scale of testing is not what it was before, so we may simply not be hitting it for that reason. Still, I think it's reasonable to close this and reopen if necessary, since so far the issue seems to have gone away.
Noticed a few strange examples while looking at the HUD:
Any idea what might have caused it? Any logs?