
Runner died during the build #118

Open · malfet opened this issue Apr 11, 2024 · 13 comments

Labels: workstream/linux-cpu (Get CPU jobs working on linux)

Comments

zxiiro (Collaborator) commented Apr 11, 2024

No idea what's causing this, but I've been trying to track it via #94 as well. It seems to happen intermittently.

One earlier theory was that redeploying the ARC runners kills the old runners even while they are still in use. I think this report debunks that theory, though: we deploy to Canary often, but Prod and Vanguard haven't been deployed since last week, and the jobs listed here ran on Prod/Vanguard.

Not sure if this is interesting or not, but I noticed all of those failed jobs seem to have run on Vanguard.

@jeanschmidt any ideas?

malfet (Author) commented Apr 16, 2024

jeanschmidt (Contributor) commented:

I was discussing the failures with @zxiiro. They all come from the Vanguard cluster and occurred very close together in time.

This points to the Vanguard cluster dying, scaling down forcefully, or hitting some other failure.

No official release was initiated at that time, but it could have been a deployment that was started locally and then canceled.

We also can't rule out that this is triggered by a normal deployment.

We will keep an eye on the eviction pattern; please keep us aware of issues as they happen so we can debug and fix them as soon as possible.

In the meantime, we will be improving our monitoring to proactively detect more failure modes.

jeanschmidt (Contributor) commented:

After investigating, it seems we make heavy use of spot instances, which by definition can be reclaimed at any time (that is why they are sold at a discount).

Given the long-lived nature of our jobs, it seems reasonable to migrate to on-demand instances only and pay the extra price for the guarantee.
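For illustration only, a minimal sketch of how a Karpenter node pool could be restricted to on-demand capacity, assuming the cluster uses Karpenter's v1beta1 API; the pool name and layout are hypothetical, not the actual production config:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: arc-runners            # hypothetical pool name
spec:
  template:
    spec:
      requirements:
        # Only provision on-demand capacity so runner nodes are not
        # reclaimed mid-job the way spot instances can be.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```

The trade-off is cost: on-demand capacity removes the reclamation risk but gives up the spot discount.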

jeanschmidt (Contributor) commented:

This latest example is from production, so I guess we can drop the Vanguard hypothesis.

malfet (Author) commented Apr 17, 2024

This signature is quite different, but the signal feels suspicious, as if the container was killed: https://github.com/pytorch/pytorch/actions/runs/8727615485/job/23945549849

jeanschmidt (Contributor) commented:

We got our quota grant. With it, I deployed the instance request type changes to production.

Hopefully this handles the situation and solves the problem we're facing. Let's continue to keep an eye on it.

zxiiro (Collaborator) commented Apr 18, 2024

> This signature is quite different, but the signal feels suspicious, as if the container was killed: https://github.com/pytorch/pytorch/actions/runs/8727615485/job/23945549849

I saw this one in one of my jobs yesterday too.

https://github.com/pytorch/pytorch-canary/actions/runs/8710434864/job/23934297083?pr=208

jeanschmidt (Contributor) commented:

Forwarding recent issues here:

pytorch/pytorch/actions/runs/8743512800/job/23994213718
pytorch/pytorch/actions/runs/8743512800/job/23994215119
pytorch/pytorch/actions/runs/8743512800/job/23994213896
pytorch/pytorch/actions/runs/8743512800/job/23994214290
pytorch/pytorch/actions/runs/8738177261/job/23991638097
This last one looks like the symptoms of #118 and #94, since the logs abruptly stop.

For those failures, it looks to me like they are still victims of the issue that was fixed in pytorch/test-infra#5102, though I cannot rule out other issues that could cause similar problems.

Thanks for looking into this and reporting the problems you are seeing, but I believe we benefit more from having a single open issue tracking it.

jeanschmidt (Contributor) commented:

@zxiiro found another potential solution: it is possible to prevent Karpenter from evicting pods by annotating them as long-running jobs that must not be evicted.

We're experimenting with that as well at the moment.
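For reference, a minimal sketch of the kind of pod annotation involved. This assumes the runner pods are managed by ARC; the pod and image names are illustrative, and depending on the Karpenter release the annotation is karpenter.sh/do-not-evict (older alpha API) or karpenter.sh/do-not-disrupt (v1beta1 and later):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: arc-runner-example                 # illustrative name
  annotations:
    # Ask Karpenter not to voluntarily disrupt (consolidate/expire) the
    # node while this long-running pod is present. Older Karpenter
    # releases use karpenter.sh/do-not-evict: "true" instead.
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest   # illustrative image
```

In practice the annotation would be set on the runner scale set's pod template rather than on a hand-written Pod; the standalone manifest above only shows where the annotation goes.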

ZainRizvi added the workstream/linux-cpu (Get CPU jobs working on linux) label on Apr 30, 2024
ZainRizvi added this to the ARC Runner Reliability milestone on Apr 30, 2024
jeanschmidt (Contributor) commented:

@zxiiro, are we confident we can close this issue? I haven't seen canceled jobs since we migrated to on-demand instances and added the non-evict annotations for Karpenter.

zxiiro (Collaborator) commented May 15, 2024

> @zxiiro, are we confident we can close this issue? I haven't seen canceled jobs since we migrated to on-demand instances and added the non-evict annotations for Karpenter.

I'm hesitant to say it's resolved, since my understanding is that we are not testing at the same scale as before, so we may simply not be hitting it. That said, I think it's reasonable to close this and reopen if necessary, since the issue does seem to have gone away so far.
