
Investigate removing teacher ensemble training #778

Closed
Tracked by #453 ...
gregtatum opened this issue Jul 30, 2024 · 2 comments
Labels: `cost & perf` (Speeding up and lowering cost for the pipeline), `experiment` (A training experiment with hypothesis and results)

Comments


gregtatum commented Jul 30, 2024

Training a second teacher for the ensemble improves quality only slightly. It may be more cost-efficient to take the small quality hit and remove it.

| COMET Change | Average Type |
| --- | --- |
| +0.15 | Mean |
| +0.14 | Median |

Spreadsheet

For instance, if we spent 1000 GPU hours synthesizing student data with two teachers, removing the ensemble could drop that to 500 GPU hours. Likewise, if we spent 100 GPU hours training teachers, that would drop to 50 GPU hours. We would also eliminate the gap in training time where we train one teacher, determine its quality, and then have to train a second teacher before moving on to the student step.
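As a rough sketch of the savings, using the illustrative figures above (the assumption that both teacher training and synthesis scale linearly with the number of teachers is mine, not a measurement):

```python
# Hedged sketch: GPU-hour cost of the teacher stage with and without the ensemble.
# The 100 h (teacher training) and 1000 h (synthesis) figures are the
# illustrative numbers from the issue, assumed to cover a 2-teacher ensemble.

def pipeline_gpu_hours(num_teachers, ensemble_train_hours=100, ensemble_synth_hours=1000):
    """Estimate GPU hours, assuming training and synthesis cost scale
    linearly with the number of teachers."""
    per_teacher_train = ensemble_train_hours / 2  # baseline hours were for 2 teachers
    per_teacher_synth = ensemble_synth_hours / 2
    return num_teachers * (per_teacher_train + per_teacher_synth)

ensemble = pipeline_gpu_hours(2)  # 1100.0
single = pipeline_gpu_hours(1)    # 550.0
print(f"savings: {ensemble - single:.0f} GPU hours")  # savings: 550 GPU hours
```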

It would be worth testing this with student training to see whether the distillation quality gap takes an unexpected hit.

gregtatum added the `cost & perf` label Jul 30, 2024
gregtatum added the `experiment` label Oct 30, 2024
gregtatum self-assigned this Oct 30, 2024

gregtatum commented Dec 2, 2024

The results are in #931. Removing the ensemble cost −0.25 COMET, against a ±0.12 standard deviation. CTranslate2 was moderately worse at −0.32.

@gregtatum

I think the general consensus here is that we can take the quality hit on removing the ensemble, especially with the gains in model quality from larger student models. CTranslate2 and Marian single models are equivalent since the difference is within the standard deviation.
