Experiment with distillation data inference #931

Closed
gregtatum opened this issue Nov 15, 2024 · 2 comments

@gregtatum
Member

gregtatum commented Nov 15, 2024

I'm going to do several experiments around distillation data decoding, and it will be easier to write up the results here as they are all related to the same part of the pipeline, translate-mono-src and translate-corpus.

My plan is to test on da-en, since that model produced good results and should make any quality drop easy to spot.

The data for this experiment is available in this spreadsheet. I measured both the GPU utilization and the rate at which data was written to the target file, in bytes/sec. Each run operated on the same subset of the data. I measured 15 minutes of translation across 5 batches and summarized the results to get a sample of how fast the translations were happening.
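As an illustration of the bytes/sec measurement, here is a minimal sketch that polls the size of an output file and averages the growth over the sampling window. The function names, the polling interval, and the idea of polling with `os.path.getsize` are assumptions made for this example; the issue does not say how the numbers were actually collected.

```python
import os
import time


def bytes_per_second(samples):
    """Average write rate from a list of (timestamp_seconds, size_bytes) samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, s0), (t1, s1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("timestamps must increase")
    return (s1 - s0) / (t1 - t0)


def sample_file_size(path, duration_s, interval_s=10.0):
    """Poll the size of `path` until `duration_s` seconds have elapsed.

    Hypothetical helper: `path` would be the translation target file.
    """
    samples = []
    deadline = time.monotonic() + duration_s
    while True:
        samples.append((time.monotonic(), os.path.getsize(path)))
        if time.monotonic() >= deadline:
            break
        time.sleep(interval_s)
    return samples
```

With these helpers, `bytes_per_second(sample_file_size(path, 15 * 60))` would give one 15-minute throughput sample for a running translation task.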

| decoder | precision | teachers | maxi-batch-words | maxi-batch | gpu utilization (%) | bytes/sec | vs 500 | vs Marian Best |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| marian | float32 | 2 | 500 | 1,000 | 64.7 | 156,118 | | |
| marian | float16 | 2 | 500 | 1,000 | 57.5 | 183,627 | 118% | |
| marian | float16 | 2 | 4,000 | 1,000 | 61.5 | 329,703 | 211% | |
| marian | float16 | 2 | 5,000 | 1,000 | 57.4 | 338,798 | 217% | |
| marian | float16 | 2 | 5,000 | 10,000 | 57.2 | 348,805 | 223% | |
| marian | float16 | 1 | 5,000 | 10,000 | 55.9 | 601,635 | 385% | |
| marian | float16 | 2 | 8,000 | 1,000 | - | Out of Memory | - | - |
| ctranslate2 | float16 | 1 | 5,000 | - | 97.2 | 1,187,192 | 760% | 197% |
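The relative columns can be reproduced from the bytes/sec values. The sketch below assumes that "vs 500" means throughput relative to the first row (float32, 500) and that "vs Marian Best" means throughput relative to the fastest Marian row; both readings are inferred from the numbers rather than stated in the issue, and the variable names are made up for the example.

```python
# bytes/sec values copied from the table above
BASELINE_500 = 156_118    # marian, float32, 2 teachers, 500
MARIAN_BEST = 601_635     # marian, float16, 1 teacher, 5,000
CTRANSLATE2 = 1_187_192   # ctranslate2, float16, 1 teacher, 5,000


def relative_percent(value, reference):
    """Throughput of `value` as a whole-number percentage of `reference`."""
    return round(100 * value / reference)


print(f"marian best vs 500: {relative_percent(MARIAN_BEST, BASELINE_500)}%")
print(f"ctranslate2 vs 500: {relative_percent(CTRANSLATE2, BASELINE_500)}%")
print(f"ctranslate2 vs marian best: {relative_percent(CTRANSLATE2, MARIAN_BEST)}%")
```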

I then ran the experiments on the decoder/ensemble configurations.

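| inference | teacher ensemble | student comet | comet vs baseline | gpu hours | gpu hours vs baseline | wall time | wall time vs baseline |
| --- | --- | --- | --- | --- | --- | --- | --- |
| marian (bad batching) | 2 | - | - | 597 hours | 100% | 78.0 hours | 100% |
| marian | 2 | 88.67 | - | 288 hours | 48% | 42.3 hours | 54% |
| marian | 1 | 88.42 | -0.25 | 147 hours | 27% | 6.4 hours | 8% |
| ctranslate2 | 1 | 88.35 | -0.32 | 69 hours | 12% | 2.7 hours | 3% |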

Wall time here refers to the time all of the parallelized translate-* tasks took, from the start of the first one to the finish of the last.
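To make the difference between the two time columns concrete, here is a small sketch that computes both from a list of task start and end times. It assumes GPU hours are the summed task durations multiplied by the GPUs per task, which the issue does not spell out, and the task times in the usage note are invented for illustration.

```python
from datetime import datetime, timedelta


def wall_time(tasks):
    """Elapsed time from the earliest task start to the latest task finish."""
    start = min(task_start for task_start, _ in tasks)
    end = max(task_end for _, task_end in tasks)
    return end - start


def gpu_hours(tasks, gpus_per_task=1):
    """Sum of task durations in hours, scaled by GPUs per task (an assumption)."""
    total = sum((end - start for start, end in tasks), timedelta())
    return gpus_per_task * total.total_seconds() / 3600
```

For two overlapping tasks, one running from hour 0 to hour 2 and another from hour 1 to hour 4, `wall_time` is 4 hours while `gpu_hours` is 5. This is why more parallelism can cut wall time sharply without reducing GPU hours.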

@gregtatum gregtatum added the experiment label (A training experiment with hypothesis and results) Nov 15, 2024
@gregtatum gregtatum self-assigned this Nov 15, 2024
@gregtatum gregtatum changed the title from "Experiment with distillation data decoding" to "Experiment with distillation data inference" Nov 15, 2024
@gregtatum
Member Author

gregtatum commented Nov 19, 2024

I think I got an OOM with maxi-batch-words: 5000 mini-batch-words: 5000

https://firefox-ci-tc.services.mozilla.com/tasks/Z2rfI9lLSNWKnUoQ-7FFLw

@eu9ene
Collaborator

eu9ene commented Nov 19, 2024

Impressive results with CTranslate2! I like the ~100% GPU utilization and the order-of-magnitude speedup.

maxi-batch-words is likely mini-batch-words
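For context on why a word-based batch limit affects memory, here is a rough sketch of packing sentences into mini-batches under a word budget. It is not Marian's implementation; the function name and the "longest sentence times batch size" padding model are assumptions chosen to show the general idea.

```python
def pack_by_word_budget(sentences, word_budget):
    """Greedily group token lists so each padded batch stays within word_budget.

    Padded size is modeled as (longest sentence length) * (sentences in batch).
    A sentence longer than the budget still gets a batch of its own.
    """
    batches = []
    current = []
    # Sorting by length keeps similar-length sentences together, reducing padding.
    for sentence in sorted(sentences, key=len):
        # Ascending order means the incoming sentence is the longest so far.
        padded_size = len(sentence) * (len(current) + 1)
        if current and padded_size > word_budget:
            batches.append(current)
            current = []
        current.append(sentence)
    if current:
        batches.append(current)
    return batches
```

A larger word budget means fewer, larger batches, which tends to raise throughput until a single batch no longer fits in GPU memory.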
