start_stage often reruns almost all "evaluate" tasks #728

eu9ene · 2024-07-09T18:44:42Z

I ran the task linked below to do the export task. Because the "evaluate" tasks are not sequential, they are rerun each time I use start_stage, which wastes GPU resources.

https://firefox-ci-tc.services.mozilla.com/tasks/aAroslw9Ru-c5cam6SvLBg

bhearsum · 2024-07-29T11:25:32Z

As you've already identified, the problem here is that the evaluate tasks being rerun are not in any of the previous groups provided, i.e. this is working as designed at the moment. The simple "fix" is to provide additional previous groups that contain the evaluate tasks.
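To make the behaviour described above concrete, here is a minimal toy model. It is not the pipeline's actual code: the function name, task labels, and group IDs are all hypothetical. It only assumes that a task can be reused when its label appears in one of the provided previous groups.

```python
# Toy model only: names and data are hypothetical, not the pipeline's real code.

def split_reuse_and_rerun(required_labels, previous_groups):
    """Partition required task labels into reusable and to-be-rerun lists.

    previous_groups maps a task group ID to the set of task labels it contains.
    """
    seen = set()
    for labels in previous_groups.values():
        seen |= labels
    reuse = sorted(label for label in required_labels if label in seen)
    rerun = sorted(label for label in required_labels if label not in seen)
    return reuse, rerun


required = {"train-teacher", "evaluate-teacher", "export"}

# Only the group that ran training is provided, so the evaluate task is
# not found anywhere and gets scheduled again alongside export.
previous_groups = {"group-A": {"train-teacher"}}
reuse, rerun = split_reuse_and_rerun(required, previous_groups)
print(reuse)
print(rerun)

# Providing an additional group that contains the evaluate task avoids the rerun.
previous_groups["group-B"] = {"evaluate-teacher"}
print(split_reuse_and_rerun(required, previous_groups)[1])
```

In this sketch, adding the second group is the "simple fix": the evaluate task moves from the rerun list to the reuse list.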

Of course, that's annoying, and not ideal. This is another case we should consider when we discuss #719.

If we'd like to do something in the meantime, we could remove the evaluate tasks from the dependencies of all, which is how they get pulled in. Doing so would mean that nothing depends on them, and they would have to be targeted explicitly through a target-stage. This might be fine if you're already targeting things like evaluate-teacher for the first phase of training. We might be able to make this less annoying by turning target-stage into target-stages.
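A rough sketch of what this option could look like, under stated assumptions: the task graph below is invented, and the semantics of a plural target-stages (select every task in any listed stage, plus its transitive dependencies) is a guess at the proposal, not an existing feature.

```python
# Hypothetical task graph: label -> (stage, dependencies).
# The evaluate task is deliberately NOT a dependency of "all" here.
TASKS = {
    "train-teacher": ("train-teacher", []),
    "evaluate-teacher": ("evaluate-teacher", ["train-teacher"]),
    "export": ("export", ["train-teacher"]),
    "all": ("all", ["export"]),
}


def select(target_stages):
    """Return labels whose stage is targeted, plus their transitive dependencies."""
    selected = set()
    stack = [label for label, (stage, _) in TASKS.items() if stage in target_stages]
    while stack:
        label = stack.pop()
        if label not in selected:
            selected.add(label)
            stack.extend(TASKS[label][1])
    return sorted(selected)


# Targeting only "all" no longer pulls in the evaluate task.
print(select({"all"}))

# With several target stages, evaluation can be requested explicitly in one run.
print(select({"all", "evaluate-teacher"}))
```

The trade-off in this sketch is that evaluation never runs unless someone asks for it, which is why accepting more than one target stage would make the approach less awkward.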

bhearsum · 2024-08-12T16:54:26Z

@eu9ene - Any thoughts on the two options above?

eu9ene · 2024-08-19T17:27:52Z

Another option is to run them right after training and make the next task depend on them. This makes sense because we would not continue the pipeline until we have good evaluation results. There was even a suggestion to add a sanity check (see #78). They would then no longer be dependencies of the "all" task.

bhearsum · 2024-09-03T13:56:16Z

Sure, if you want to always run them whenever you run training, it would be a good idea to adjust the stage entries such as https://github.com/mozilla/firefox-translations-training/blob/f7247a60a095015d39fb73065830eb9e980147e2/taskcluster/kinds/evaluate/kind.yml#L118 to be the same as their associated training step. Doing that would mean they always end up in the same task groups as the training tasks, which would let you remove them from the all and all-pr dependencies with no real downside, and should fix this.
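The effect of aligning the stage entries can be illustrated with a small toy model. It assumes, for illustration only, that each task carries a single stage value and that running a stage schedules every task tagged with it. The stage name train-teacher and both mappings are hypothetical.

```python
# Toy model only: stage names and mappings are hypothetical.

def tasks_for_stage(task_stages, stage):
    """Return the labels of all tasks tagged with the given stage."""
    return sorted(label for label, s in task_stages.items() if s == stage)


# Before: the evaluate task has its own stage, so running the training
# stage does not schedule it in the same task group.
before = {"train-teacher": "train-teacher", "evaluate-teacher": "evaluate-teacher"}

# After: the evaluate task shares the stage of its associated training step.
after = {"train-teacher": "train-teacher", "evaluate-teacher": "train-teacher"}

print(tasks_for_stage(before, "train-teacher"))
print(tasks_for_stage(after, "train-teacher"))
```

Once evaluation always lands in the same group as its training task, that group can be passed as a previous group later and both tasks are found together.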

eu9ene added the taskcluster Issues related to the Taskcluster implementation of the training pipeline label Jul 9, 2024

eu9ene mentioned this issue Jul 9, 2024

[meta] Make the pipeline reliable enough to train many languages #311

Open

eu9ene added the cost & perf Speeding up and lowering cost for the pipeline label Jul 11, 2024
