
Re-enable generic-worker for CPU tasks #561

Merged (1 commit, May 9, 2024)

Conversation

@bhearsum bhearsum commented May 3, 2024

Analysis done in #538 (comment) shows that the problems are intermittent, and largely related to spot terminations. We're seeing the latter in the existing workers anyways, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.

@bhearsum bhearsum requested a review from a team as a code owner May 3, 2024 17:59
@bhearsum bhearsum requested a review from ahal May 3, 2024 17:59

bhearsum commented May 6, 2024

This was previously reviewed by RelEng; doesn't need review by them again IMO. @eu9ene and/or @gregtatum it's up to you if/when we re-land this as far as I'm concerned.

@bhearsum bhearsum requested review from gregtatum and eu9ene and removed request for a team and ahal May 6, 2024 13:03

eu9ene commented May 6, 2024

The two issues I reported to you were OOMs, and the third one was due to preemptions, I think, because the task finished successfully after a couple of restarts.

The risk of landing this PR is running into the "cache something" issue again, and we want to stabilize the pipeline now. It would definitely be nice to enable monitoring for CPU machines to look at resource utilization. Should we try to reproduce the caching issues again before landing this?

bhearsum commented May 6, 2024

> The two issues I reported to you were OOMs, and the third one was due to preemptions, I think, because the task finished successfully after a couple of restarts.
>
> The risk of landing this PR is running into the "cache something" issue again, and we want to stabilize the pipeline now. It would definitely be nice to enable monitoring for CPU machines to look at resource utilization. Should we try to reproduce the caching issues again before landing this?

To be honest, I don't think I'll have time in the next few weeks to debug that, as both bringing the Snakepit machines online and getting spot instances turned back on are higher priority.

However, I don't think what we saw in the run linked from #538 is representative of what we'll see. It's definitely possible that we'll get spot killed and get those strange cache failures, but those do fix themselves in the next run by clearing out the cache. What happened in that run was that we had:

  • spot kill
  • cache issue
  • spot kill
  • cache issue
  • etc.

So - we do lose a little bit of time when we have the cache issue, but it's not quite as serious as it first seemed IMO.

In any case - I'm happy to live with this status quo for now, or try to move this forward again - totally your / @gregtatum's call.

eu9ene commented May 7, 2024

Let's hold for now and wait until we're in a more stable state with the training recipe and infra. Then we can give it another try. We can try training something from this branch and check that the cache issues don't impact training. It's important to merge this soon though because we want to monitor CPU resources to tune machine configs.
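As a rough illustration of the kind of per-machine resource sampling meant here (generic-worker and the cloud provider have their own telemetry; this is only a hypothetical standalone sketch using psutil):

```python
# Hypothetical sketch only: periodically sample CPU, memory, and disk
# utilization so the numbers can be used to tune CPU machine configs.
import time

import psutil


def sample(interval_s: float = 30.0) -> None:
    """Print utilization figures every `interval_s` seconds until interrupted."""
    while True:
        cpu = psutil.cpu_percent(interval=1)  # percent, averaged over a 1 s window
        mem = psutil.virtual_memory()         # system-wide memory stats
        disk = psutil.disk_usage("/")         # root filesystem usage
        print(f"cpu={cpu:.0f}% mem={mem.percent:.0f}% disk={disk.percent:.0f}%")
        time.sleep(interval_s)


if __name__ == "__main__":
    sample()
```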

@bhearsum bhearsum marked this pull request as draft May 8, 2024 13:27
@bhearsum bhearsum requested review from a team and ahal and removed request for gregtatum, eu9ene, a team and ahal May 8, 2024 13:27

bhearsum commented May 8, 2024

@eu9ene - Here's a Task that demonstrates successfully recovering from the cache issue: https://firefox-ci-tc.services.mozilla.com/tasks/WM8ahtXGQSCuvlljOU3O0Q/runs/0

Obviously this is very annoying, and if you don't want to land until we get to the bottom of it, that's fine, but hopefully this gives some confidence that it won't burn a tremendous amount of time. (We don't even download upstream artifacts before the cache fails the job, so these failures typically only cost us a few minutes.)
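For anyone who wants to quantify how much time these spot-kill/cache-failure loops actually cost, the run history of a task can be pulled from the Taskcluster Queue API. A minimal sketch, assuming the `taskcluster` Python client and using the task linked above; the field names follow the Queue status response as I understand it:

```python
# Sketch: tally how each run of a task ended, to see how much churn spot
# kills (typically "worker-shutdown") and the cache failures (plain "failed"
# runs) actually cause, and how long each run took.
from collections import Counter
from datetime import datetime

import taskcluster

queue = taskcluster.Queue({"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})


def summarize_runs(task_id: str) -> None:
    runs = queue.status(task_id)["status"]["runs"]
    reasons = Counter(run.get("reasonResolved", run["state"]) for run in runs)
    print(f"{task_id}: {len(runs)} runs, resolutions: {dict(reasons)}")
    for run in runs:
        if "started" in run and "resolved" in run:
            started = datetime.fromisoformat(run["started"].replace("Z", "+00:00"))
            resolved = datetime.fromisoformat(run["resolved"].replace("Z", "+00:00"))
            minutes = (resolved - started).total_seconds() / 60
            reason = run.get("reasonResolved", run["state"])
            print(f"  run {run['runId']}: {reason} after {minutes:.1f} min")


summarize_runs("WM8ahtXGQSCuvlljOU3O0Q")
```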

eu9ene commented May 8, 2024

> @eu9ene - Here's a Task that demonstrates successfully recovering from the cache issue: https://firefox-ci-tc.services.mozilla.com/tasks/WM8ahtXGQSCuvlljOU3O0Q/runs/0
>
> Obviously this is very annoying, and if you don't want to land until we get to the bottom of it, that's fine, but hopefully this gives some confidence that it won't burn a tremendous amount of time. (We don't even download upstream artifacts before the cache fails the job, so these failures typically only cost us a few minutes.)

@bhearsum OK, if the risk is minimal and we just lose some retry attempts with this (should we increase them, then?), then I guess being able to monitor CPU machines outweighs the extra restarts on pre-emption.
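On the retry question: as far as I know, the budget in play is the `retries` field of the Taskcluster task definition, which only covers automatic re-runs after infrastructure exceptions such as spot terminations and is capped at 5. A hedged sketch for checking it on the task linked above, again assuming the `taskcluster` Python client:

```python
# Sketch only: read a task definition and report its automatic-retry budget.
import taskcluster

queue = taskcluster.Queue({"rootUrl": "https://firefox-ci-tc.services.mozilla.com"})
task = queue.task("WM8ahtXGQSCuvlljOU3O0Q")  # the task linked above
print("retries budget:", task.get("retries"))
```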

bhearsum commented May 8, 2024

> > @eu9ene - Here's a Task that demonstrates successfully recovering from the cache issue: https://firefox-ci-tc.services.mozilla.com/tasks/WM8ahtXGQSCuvlljOU3O0Q/runs/0
> >
> > Obviously this is very annoying, and if you don't want to land until we get to the bottom of it, that's fine, but hopefully this gives some confidence that it won't burn a tremendous amount of time. (We don't even download upstream artifacts before the cache fails the job, so these failures typically only cost us a few minutes.)
>
> @bhearsum OK, if the risk is minimal and we just lose some retry attempts with this (should we increase them, then?), then I guess being able to monitor CPU machines outweighs the extra restarts on pre-emption.

We can always back out again if it's problematic!

@bhearsum bhearsum marked this pull request as ready for review May 9, 2024 00:12