start_stage often reruns almost all "evaluate" tasks #728

Open
Tracked by #311
eu9ene opened this issue Jul 9, 2024 · 4 comments
Labels
cost & perf Speeding up and lowering cost for the pipeline taskcluster Issues related to the Taskcluster implementation of the training pipeline

Comments

@eu9ene
Collaborator

eu9ene commented Jul 9, 2024

I ran this one to do the export task, but since the "evaluate" tasks are not sequential, they are rerun each time I use start_stage, which wastes GPU resources.

https://firefox-ci-tc.services.mozilla.com/tasks/aAroslw9Ru-c5cam6SvLBg

@eu9ene eu9ene added the taskcluster Issues related to the Taskcluster implementation of the training pipeline label Jul 9, 2024
@eu9ene eu9ene added the cost & perf Speeding up and lowering cost for the pipeline label Jul 11, 2024
@bhearsum
Collaborator

bhearsum commented Jul 29, 2024

As you've already identified, the problem here is that the evaluate tasks that are being rerun are not in any of the previous groups provided, i.e., this is working as designed at the moment. The simple "fix" for this is to provide additional previous groups that contain the evaluate tasks.
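
To illustrate the behaviour described above, here is a minimal toy model in Python. It is not the actual taskgraph or pipeline code: the task names, the `GRAPH` layout, and the `tasks_to_run` helper are all hypothetical, and the reuse rule is reduced to "anything found in a previous group is not rerun".

```python
def transitive_deps(graph, target):
    """Return the target plus everything it depends on, directly or indirectly."""
    seen = set()
    stack = [target]
    while stack:
        task = stack.pop()
        if task in seen:
            continue
        seen.add(task)
        stack.extend(graph.get(task, []))
    return seen


def tasks_to_run(graph, target, previous_groups):
    """Tasks needed for the target that no previous group already provides."""
    needed = transitive_deps(graph, target)
    already_done = set().union(*previous_groups)
    return needed - already_done


# Hypothetical, heavily simplified dependency graph (task -> its dependencies).
GRAPH = {
    "all": ["export", "evaluate-teacher", "evaluate-student"],
    "export": ["train-student"],
    "evaluate-student": ["train-student"],
    "evaluate-teacher": ["train-teacher"],
    "train-student": ["train-teacher"],
    "train-teacher": [],
}

training_group = {"train-teacher", "train-student"}
evaluation_group = {"evaluate-teacher", "evaluate-student"}

# Only the training group is supplied: the evaluate tasks get scheduled again.
rerun_without_eval_group = tasks_to_run(GRAPH, "all", [training_group])

# The group holding the evaluate tasks is supplied too: only new work remains.
rerun_with_eval_group = tasks_to_run(GRAPH, "all", [training_group, evaluation_group])
```

In this sketch, `rerun_without_eval_group` still contains both evaluate tasks, while `rerun_with_eval_group` contains only `all` and `export`.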

Of course, that's annoying, and not ideal. This is another case we should consider when we discuss #719.

If we'd like to do something in the meantime, we could remove the evaluate tasks as dependencies of the all task, which is how they get pulled in. Doing so would mean that nothing depends on them, and they would have to be targeted explicitly through a target-stage. This might be fine if you're already targeting things like evaluate-teacher for the first phase of training. We might be able to make this less annoying by making target-stage into target-stages.
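
A rough sketch of that idea, again as a toy model rather than real pipeline code. The graph, the task names, and the notion of passing a list of targets (standing in for a possible target-stages option, which does not exist today) are assumptions made for illustration.

```python
def closure(graph, targets):
    """Return every target plus all of its direct and indirect dependencies."""
    seen = set()
    stack = list(targets)
    while stack:
        task = stack.pop()
        if task not in seen:
            seen.add(task)
            stack.extend(graph.get(task, []))
    return seen


# Hypothetical graph in which "all" no longer depends on the evaluate tasks.
GRAPH_WITHOUT_EVAL_ON_ALL = {
    "all": ["export"],
    "export": ["train-student"],
    "evaluate-student": ["train-student"],
    "evaluate-teacher": ["train-teacher"],
    "train-student": ["train-teacher"],
    "train-teacher": [],
}

# Targeting "all" alone no longer pulls in any evaluate task.
default_run = closure(GRAPH_WITHOUT_EVAL_ON_ALL, ["all"])

# With several targets, the evaluate tasks are requested explicitly.
explicit_run = closure(
    GRAPH_WITHOUT_EVAL_ON_ALL,
    ["all", "evaluate-teacher", "evaluate-student"],
)
```

The trade-off shown here is that evaluation becomes opt-in: forgetting to list an evaluate target means it silently never runs.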

@bhearsum
Collaborator

@eu9ene - Any thoughts on the two options above?

@eu9ene
Collaborator Author

eu9ene commented Aug 19, 2024

Another option is to run them right after training and make the next task depend on them. It makes sense because this way we don't continue the pipeline until we have good evaluation results. There was even a suggestion to add a sanity check (see #78). They would then no longer be dependencies of the "all" task.

@bhearsum
Collaborator

bhearsum commented Sep 3, 2024

Sure, if you want to always run them when you run a training, adjusting the stage entries such as https://github.com/mozilla/firefox-translations-training/blob/f7247a60a095015d39fb73065830eb9e980147e2/taskcluster/kinds/evaluate/kind.yml#L118 to be the same as their associated training step would be a good idea. Doing that would mean they always end up in the same task groups, which would let you remove them from the all and all-pr dependencies with no real downside, and should fix this.
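
As a sketch of why matching the stage would help, the toy model below assumes that targeting a stage selects every task labelled with that stage plus its dependencies. That selection rule, the `TASKS` table, and both helper functions are simplifications invented for this example, not the contents of kind.yml or the real taskgraph logic.

```python
# Hypothetical task table: each task has a stage label and a dependency list.
TASKS = {
    "train-teacher": {"stage": "train-teacher", "deps": []},
    "evaluate-teacher": {"stage": "evaluate-teacher", "deps": ["train-teacher"]},
}


def tasks_for_stage(tasks, target_stage):
    """Select tasks labelled with target_stage plus everything they depend on."""
    selected = set()
    stack = [name for name, task in tasks.items() if task["stage"] == target_stage]
    while stack:
        name = stack.pop()
        if name not in selected:
            selected.add(name)
            stack.extend(tasks[name]["deps"])
    return selected


def with_stage(tasks, name, stage):
    """Return a copy of the task table with one task moved to another stage."""
    updated = {key: dict(value) for key, value in tasks.items()}
    updated[name]["stage"] = stage
    return updated


# Separate stages: a train-teacher run does not include the evaluation.
before = tasks_for_stage(TASKS, "train-teacher")

# Same stage as the training step: both tasks land in the same run.
ALIGNED = with_stage(TASKS, "evaluate-teacher", "train-teacher")
after = tasks_for_stage(ALIGNED, "train-teacher")
```

Under these assumptions, `before` holds only the training task and `after` holds both, so a later run that reuses the training group would find the evaluation there as well.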
