Re-enable generic-worker for CPU tasks #561
Conversation
This was previously reviewed by RelEng; it doesn't need review by them again IMO. @eu9ene and/or @gregtatum, it's up to you if/when we re-land this as far as I'm concerned.
Two of the issues I reported to you were OOMs, and the third was, I think, due to preemptions, because the task finished successfully after a couple of restarts. The risk of landing this PR is running into the “cache something” issue again, and we want to stabilize the pipeline now. It would definitely be nice to enable monitoring for CPU machines so we can look at resource utilization. Should we try to reproduce the caching issues again before landing this?
To be honest, I don't think I'll have time in the next few weeks to debug that, as both bringing the Snakepit machines online and getting spot instances turned back on are higher priority. However, I don't think what we saw in the run linked from #538 is representative of what we'll see. It's definitely possible that we'll get spot killed and hit those strange cache failures, but those do fix themselves in the next run by clearing out the cache. What happened in that run was that we had:
So - we do lose a little bit of time when we hit the cache issue, but it's not quite as serious as it first seemed IMO. In any case - I'm happy to live with the status quo for now, or to try to move this forward again - totally your / @gregtatum's call.
Let's hold for now and wait until we're in a more stable state with the training recipe and infra. Then we can give it another try. We can try training something from this branch and check that the cache issues don't impact training. It's important to merge this soon though because we want to monitor CPU resources to tune machine configs.
@eu9ene - Here's a Task that demonstrates successfully recovering from the cache issue: https://firefox-ci-tc.services.mozilla.com/tasks/WM8ahtXGQSCuvlljOU3O0Q/runs/0
Obviously this is very annoying, and if you don't want to land until we get to the bottom of it, that's fine, but hopefully this gives some confidence that it won't burn a tremendous amount of time. (We don't even download upstream artifacts before the cache fails the job, so these failures typically only cost us a few minutes.)
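For anyone who wants to sanity-check the "only costs a few minutes" claim on other tasks, the run history is available from the Queue API. A rough sketch with the Taskcluster Python client (the task ID below is just the one from the link above):

```python
# Rough sketch: print how long each run of a task took and why it ended,
# to see how much time is actually lost to cache failures / preemptions.
from datetime import datetime

import taskcluster  # pip install taskcluster

ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"
TASK_ID = "WM8ahtXGQSCuvlljOU3O0Q"  # task linked above


def parse(ts: str) -> datetime:
    # Taskcluster timestamps look like 2024-01-01T12:34:56.789Z
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


queue = taskcluster.Queue({"rootUrl": ROOT_URL})
for run in queue.status(TASK_ID)["status"]["runs"]:
    started, resolved = run.get("started"), run.get("resolved")
    duration = parse(resolved) - parse(started) if started and resolved else "n/a"
    print(run["runId"], run["state"], run.get("reasonResolved"), duration)
```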
@bhearsum ok, if the risk is minimal and we just lose some retry attempts with this (should we increase them then?), then I guess being able to monitor CPU machines outweighs the extra restarts on preemption.
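On the "should we increase them" question: the knob would be the Queue-level `retries` field on the task definition (default 5), which controls how many times a task is re-run after infrastructure exceptions like worker-shutdown (spot termination) or claim-expired. A hypothetical sketch only - this PR doesn't change it, and where it would be wired into our taskgraph config is an assumption:

```python
# Hypothetical sketch, not part of this PR: bump the Queue-level retry budget
# on a Taskcluster task definition.  `retries` is what the Queue uses to decide
# how many times to re-run a task after infrastructure failures such as
# worker-shutdown (spot termination); the default is 5.
def bump_retries(task_def: dict, retries: int = 10) -> dict:
    patched = dict(task_def)
    patched["retries"] = retries
    return patched
```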
We can always back out again if it's problematic!
Analysis done in #538 (comment) shows that the problems are intermittent and largely related to spot terminations. We're already seeing the latter with the existing workers anyway, so unless we find more serious issues with generic-worker for CPU tasks, we may as well go ahead with this.