
Runner died during the build #118

Open · malfet opened this issue Apr 11, 2024 · 13 comments

Labels: workstream/linux-cpu (Get CPU jobs working on linux)

Comments

zxiiro (Collaborator) commented Apr 11, 2024

No idea what's causing this, but I've been trying to track it via #94 as well. It seems to happen intermittently.

One earlier theory was that redeploying the ARC runners kills the old runners even while they are still in use. I think this report debunks that theory, though: we deploy to Canary often, but Prod and Vanguard haven't been deployed since last week, and the jobs listed here ran on Prod/Vanguard.

Not sure if this is interesting or not, but I noticed all of those failed jobs seem to have run on Vanguard.

@jeanschmidt any ideas?

malfet (Author) commented Apr 16, 2024

jeanschmidt (Contributor) commented:

I was discussing the failures with @zxiiro. They all come from the Vanguard cluster and occurred very close together in time.

This points to the Vanguard cluster dying, scaling down forcefully, or hitting some other failure.

No official release was initiated at that time, but it could have been a deployment that was started locally and then canceled.

We also can't rule out that this is triggered by a normal deployment.

We will keep an eye on the eviction pattern; please keep us aware of issues as they happen so we can debug and fix them as soon as possible.

In the meantime, we will be improving our monitoring to proactively detect more failure modes.

jeanschmidt (Contributor) commented:

After investigating, it seems we make heavy use of spot instances, which by definition can be reclaimed at any time (that is why they are sold at a discount).

Given the long-lived nature of our jobs, it seems reasonable to migrate to on-demand instances only and pay the extra price for the guarantee.
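For illustration only, a minimal sketch of how a Karpenter node pool could be restricted to on-demand capacity, assuming the cluster uses Karpenter's v1beta1 API; the pool name and layout are hypothetical, not the actual production config:

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: arc-runners            # hypothetical pool name
spec:
  template:
    spec:
      requirements:
        # Only provision on-demand capacity so runner nodes are not
        # reclaimed mid-job the way spot instances can be.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```

The trade-off is cost: on-demand capacity removes the reclamation risk but gives up the spot discount.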

jeanschmidt (Contributor) commented:

This latest example is from production, so I guess we can drop the Vanguard hypothesis.

malfet (Author) commented Apr 17, 2024

This signature is quite different, but the signal feels suspicious, as if the container was killed: https://github.com/pytorch/pytorch/actions/runs/8727615485/job/23945549849

jeanschmidt (Contributor) commented:

We got our quota grant. With it, I deployed the instance request type changes to production.

Hopefully this handles the situation and solves the problem we're facing. Let's continue to keep an eye on it.

zxiiro (Collaborator) commented Apr 18, 2024

> This signature is quite different, but the signal feels suspicious, as if the container was killed: https://github.com/pytorch/pytorch/actions/runs/8727615485/job/23945549849

I saw this one in one of my jobs yesterday too.

https://github.com/pytorch/pytorch-canary/actions/runs/8710434864/job/23934297083?pr=208

jeanschmidt (Contributor) commented:

Forwarding recent issues here:

pytorch/pytorch/actions/runs/8743512800/job/23994213718
pytorch/pytorch/actions/runs/8743512800/job/23994215119
pytorch/pytorch/actions/runs/8743512800/job/23994213896
pytorch/pytorch/actions/runs/8743512800/job/23994214290
pytorch/pytorch/actions/runs/8738177261/job/23991638097
This last one looks like the symptoms of #118 and #94, since the logs abruptly stop.

For those failures, it looks to me like they are still victims of the issue that was fixed in pytorch/test-infra#5102, though I cannot rule out other issues that could cause similar problems.

Thanks for looking into this and reporting the problems you are seeing, but I believe we benefit more from having a single open issue tracking it.

jeanschmidt (Contributor) commented:

@zxiiro found another potential solution: it is possible to prevent Karpenter from evicting pods by annotating them as long-running jobs that must not be evicted.

We're experimenting with that as well at the moment.
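For reference, a minimal sketch of the kind of pod annotation involved. This assumes the runner pods are managed by ARC; the pod and image names are illustrative, and depending on the Karpenter release the annotation is karpenter.sh/do-not-evict (older alpha API) or karpenter.sh/do-not-disrupt (v1beta1 and later):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: arc-runner-example                 # illustrative name
  annotations:
    # Ask Karpenter not to voluntarily disrupt (consolidate/expire) the
    # node while this long-running pod is present. Older Karpenter
    # releases use karpenter.sh/do-not-evict: "true" instead.
    karpenter.sh/do-not-disrupt: "true"
spec:
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest   # illustrative image
```

In practice the annotation would be set on the runner scale set's pod template rather than on a hand-written Pod; the standalone manifest above only shows where the annotation goes.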

ZainRizvi added the workstream/linux-cpu (Get CPU jobs working on linux) label on Apr 30, 2024
ZainRizvi added this to the ARC Runner Reliability milestone on Apr 30, 2024
jeanschmidt (Contributor) commented:

@zxiiro, are we confident we can close this issue? I haven't seen canceled jobs since we migrated to on-demand instances and added the non-evict annotations for Karpenter.

zxiiro (Collaborator) commented May 15, 2024

> @zxiiro, are we confident we can close this issue? I haven't seen canceled jobs since we migrated to on-demand instances and added the non-evict annotations for Karpenter.

I'm hesitant to say it's resolved, since my understanding is that we are not testing at the same scale as before, so we may simply not be hitting it. That said, I think it's reasonable to close this and reopen if necessary, since the issue does seem to have gone away so far.
