Investigate removing teacher ensemble training #778
Labels
cost & perf
Speeding up and lowering cost for the pipeline
experiment
A training experiment with hypothesis and results
Training a second teacher improves performance only slightly. It may be more cost-efficient to accept the small quality hit and remove it.
Spreadsheet
For instance, if we spent 1000 GPU hours synthesizing student data with two teachers, dropping to one teacher would cut that to 500 GPU hours. Likewise, if we spent 100 GPU hours training teachers, that would drop to 50 GPU hours. We would also eliminate the serial gap in training time where we train one teacher, evaluate its quality, and then have to train a second teacher before moving on to the student step.
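The arithmetic above can be sketched as a quick cost model. The function name, parameter defaults, and the linear-scaling assumption are illustrative only, not taken from the pipeline code:

```python
# Hypothetical back-of-envelope cost model for the teacher-count decision.
# Assumes both synthesis and teacher-training cost scale linearly with the
# number of teachers (an assumption, not a measured property of the pipeline).

def pipeline_gpu_hours(n_teachers: int,
                       synth_hours_per_teacher: float = 500.0,
                       train_hours_per_teacher: float = 50.0) -> float:
    """Total GPU hours for teacher training plus student-data synthesis."""
    return n_teachers * (synth_hours_per_teacher + train_hours_per_teacher)

two_teachers = pipeline_gpu_hours(2)  # 1000 synth + 100 train = 1100.0
one_teacher = pipeline_gpu_hours(1)   # 500 synth + 50 train = 550.0
print(f"savings: {two_teachers - one_teacher} GPU hours")  # savings: 550.0 GPU hours
```

This ignores the serial-scheduling benefit (no wait between training teacher 1 and teacher 2), which would shorten wall-clock time on top of the GPU-hour savings.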
It would be worth testing this on student training to see whether the distillation quality gap takes an unexpected hit.