
Re-enable generic-worker for CPU tasks #561

Merged (1 commit, May 9, 2024)

Conversation

@bhearsum bhearsum commented May 3, 2024

Analysis done in #538 (comment) shows that the problems are intermittent, and largely related to spot terminations. We're seeing the latter in the existing workers anyways, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.

@bhearsum bhearsum requested a review from a team as a code owner May 3, 2024 17:59
@bhearsum bhearsum requested a review from ahal May 3, 2024 17:59

bhearsum commented May 6, 2024

This was previously reviewed by RelEng; doesn't need review by them again IMO. @eu9ene and/or @gregtatum it's up to you if/when we re-land this as far as I'm concerned.

@bhearsum bhearsum requested review from gregtatum and eu9ene and removed request for a team and ahal May 6, 2024 13:03

eu9ene commented May 6, 2024

The two issues I reported to you were OOMs, and the third one was due to preemptions, I think, because the task finished successfully after a couple of restarts.

The risk of landing this PR is running into the "cache something" issue again, and we want to stabilize the pipeline now. It would definitely be nice to enable monitoring for CPU machines to look at resource utilization. Should we try to reproduce the caching issues again before landing this?

bhearsum commented May 6, 2024

> The two issues I reported to you were OOMs, and the third one was due to preemptions, I think, because the task finished successfully after a couple of restarts.
>
> The risk of landing this PR is running into the "cache something" issue again, and we want to stabilize the pipeline now. It would definitely be nice to enable monitoring for CPU machines to look at resource utilization. Should we try to reproduce the caching issues again before landing this?

To be honest, I don't think I'll have time in the next few weeks to debug that, as both bringing the Snakepit machines online and getting spot instances turned back on are higher priority.

However, I don't think what we saw in the run linked from #538 is representative of what we'll see. It's definitely possible that we'll get spot killed and get those strange cache failures, but those do fix themselves in the next run by clearing out the cache. What happened in that run was that we had:

  • spot kill
  • cache issue
  • spot kill
  • cache issue
  • etc.

So - we do lose a little bit of time when we have the cache issue, but it's not quite as serious as it first seemed IMO.

In any case - I'm happy to live with this status quo for now, or try to move this forward again - totally your / @gregtatum's call.

eu9ene commented May 7, 2024

Let's hold for now and wait until we're in a more stable state with the training recipe and infra. Then we can give it another try. We can try training something from this branch and check that the cache issues don't impact training. It's important to merge this soon though because we want to monitor CPU resources to tune machine configs.
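As a rough illustration of the kind of per-machine resource sampling meant here (generic-worker and the cloud provider have their own telemetry; this is only a hypothetical standalone sketch using psutil):

```python
# Hypothetical sketch only: periodically sample CPU, memory, and disk
# utilization so the numbers can be used to tune CPU machine configs.
import time

import psutil


def sample(interval_s: float = 30.0) -> None:
    """Print utilization figures every `interval_s` seconds until interrupted."""
    while True:
        cpu = psutil.cpu_percent(interval=1)  # percent, averaged over a 1 s window
        mem = psutil.virtual_memory()         # system-wide memory stats
        disk = psutil.disk_usage("/")         # root filesystem usage
        print(f"cpu={cpu:.0f}% mem={mem.percent:.0f}% disk={disk.percent:.0f}%")
        time.sleep(interval_s)


if __name__ == "__main__":
    sample()
```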

@bhearsum bhearsum marked this pull request as draft May 8, 2024 13:27
@bhearsum bhearsum requested review from a team and ahal and removed request for gregtatum, eu9ene, a team and ahal May 8, 2024 13:27

bhearsum commented May 8, 2024

@eu9ene - Here's a Task that demonstrates successfully recovering from the cache issue: https://firefox-ci-tc.services.mozilla.com/tasks/WM8ahtXGQSCuvlljOU3O0Q/runs/0

Obviously this is very annoying, and if you don't want to land until we get to the bottom of it, that's fine, but hopefully this gives some confidence that it won't burn a tremendous amount of time. (We don't even download upstream artifacts before the cache fails the job, so these failures typically only cost us a few minutes.)
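For anyone who wants to quantify how much time these spot-kill/cache-failure loops actually cost, the run history of a task can be pulled from the Taskcluster Queue API. A minimal sketch, assuming the `taskcluster` Python client and using the task linked above; the field names follow the Queue status response as I understand it:

```python
# Sketch: tally how each run of a task ended, to see how much churn spot
# kills (typically "worker-shutdown") and the cache failures (plain "failed"
# runs) actually cause, and how long each run took.
from collections import Counter
from datetime import datetime

import taskcluster

queue = taskcluster.Queue({"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})


def summarize_runs(task_id: str) -> None:
    runs = queue.status(task_id)["status"]["runs"]
    reasons = Counter(run.get("reasonResolved", run["state"]) for run in runs)
    print(f"{task_id}: {len(runs)} runs, resolutions: {dict(reasons)}")
    for run in runs:
        if "started" in run and "resolved" in run:
            started = datetime.fromisoformat(run["started"].replace("Z", "+00:00"))
            resolved = datetime.fromisoformat(run["resolved"].replace("Z", "+00:00"))
            minutes = (resolved - started).total_seconds() / 60
            reason = run.get("reasonResolved", run["state"])
            print(f"  run {run['runId']}: {reason} after {minutes:.1f} min")


summarize_runs("WM8ahtXGQSCuvlljOU3O0Q")
```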

eu9ene commented May 8, 2024

> @eu9ene - Here's a Task that demonstrates successfully recovering from the cache issue: https://firefox-ci-tc.services.mozilla.com/tasks/WM8ahtXGQSCuvlljOU3O0Q/runs/0
>
> Obviously this is very annoying, and if you don't want to land until we get to the bottom of it, that's fine, but hopefully this gives some confidence that it won't burn a tremendous amount of time. (We don't even download upstream artifacts before the cache fails the job, so these failures typically only cost us a few minutes.)

@bhearsum OK, if the risk is minimal and we just lose some retry attempts with this (should we increase them, then?), then I guess being able to monitor CPU machines outweighs the extra restarts on pre-emption.
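On the retry question: as far as I know, the budget in play is the `retries` field of the Taskcluster task definition, which only covers automatic re-runs after infrastructure exceptions such as spot terminations and is capped at 5. A hedged sketch for checking it on the task linked above, again assuming the `taskcluster` Python client:

```python
# Sketch only: read a task definition and report its automatic-retry budget.
import taskcluster

queue = taskcluster.Queue({"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})
task = queue.task("WM8ahtXGQSCuvlljOU3O0Q")  # the task linked above
print("retries budget:", task.get("retries"))
```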

bhearsum commented May 8, 2024

> > @eu9ene - Here's a Task that demonstrates successfully recovering from the cache issue: https://firefox-ci-tc.services.mozilla.com/tasks/WM8ahtXGQSCuvlljOU3O0Q/runs/0
> >
> > Obviously this is very annoying, and if you don't want to land until we get to the bottom of it, that's fine, but hopefully this gives some confidence that it won't burn a tremendous amount of time. (We don't even download upstream artifacts before the cache fails the job, so these failures typically only cost us a few minutes.)
>
> @bhearsum OK, if the risk is minimal and we just lose some retry attempts with this (should we increase them, then?), then I guess being able to monitor CPU machines outweighs the extra restarts on pre-emption.

We can always back out again if it's problematic!

@bhearsum bhearsum marked this pull request as ready for review May 9, 2024 00:12