Improve GPU utilization for "translate" tasks #785

Open · Tracked by #453
eu9ene opened this issue Jul 31, 2024 · 6 comments
Labels: cost & perf (Speeding up and lowering cost for the pipeline)

Comments

@eu9ene (Collaborator) commented Jul 31, 2024

Currently, GPU utilization for the "translate" tasks is ~70%. We could try using a bigger batch size, but the effect also depends on the language.

[Screenshot (2024-07-31): GCP console showing GPU utilization for a translate-mono task]
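For reference, the batch that Marian uses at decoding time is controlled by marian-decoder flags. Below is a minimal sketch of invoking the decoder with larger batch settings; the model and vocab paths and the specific values are placeholders rather than the pipeline's actual configuration.

```python
# Minimal sketch (not the pipeline's real invocation): run marian-decoder
# with larger batch settings. Paths and values are placeholders.
import subprocess

cmd = [
    "marian-decoder",
    "--models", "teacher.npz",             # hypothetical model path
    "--vocabs", "vocab.spm", "vocab.spm",  # hypothetical shared vocab
    "--beam-size", "4",
    "--mini-batch", "64",        # sentences decoded per batch
    "--maxi-batch", "1000",      # sentences pre-loaded for length sorting
    "--maxi-batch-sort", "src",  # sort by source length to reduce padding
    "--workspace", "12000",      # MB of GPU memory reserved by Marian
    "--devices", "0",
]
with open("mono.src.txt") as src, open("mono.translated.txt", "w") as out:
    subprocess.run(cmd, stdin=src, stdout=out, check=True)
```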
@eu9ene added the cost & perf label on Jul 31, 2024
@eu9ene (Collaborator, Author) commented Jul 31, 2024

It appears to be even lower for translate-corpus: GCP console

[Screenshot (2024-07-31): GCP console showing GPU utilization for a translate-corpus task]

@gregtatum FYI

@gregtatum (Member) commented

Is it possible to dynamically determine this value? Like run N translations, measure and adjust?
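(A rough sketch of that idea: translate a small sample at a few candidate batch sizes, measure throughput, and keep the fastest setting for the full run. `translate_sample` below is a hypothetical helper that would wrap the actual decoder call.)

```python
# Rough sketch of the measure-and-adjust idea. translate_sample() is a
# hypothetical helper that decodes `sample_lines` with the given mini-batch size.
import time

def pick_mini_batch(sample_lines, translate_sample, candidates=(16, 32, 64, 128)):
    best_size, best_rate = None, 0.0
    for size in candidates:
        start = time.monotonic()
        translate_sample(sample_lines, mini_batch=size)
        rate = len(sample_lines) / (time.monotonic() - start)  # sentences/sec
        if rate > best_rate:
            best_size, best_rate = size, rate
    return best_size
```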

@ZJaume (Collaborator) commented Oct 22, 2024

I've also noticed this, and it has always been the case. I think the bottleneck is decoding: generating n-best lists with beam size 8 seems to use the GPU much less than decoding without n-best and with a beam size of around 4-6.

This won't increase GPU utilization, but I've been using --fp16 during inference and training without any significant quality drop. I haven't compared n-best generation, though.
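(For concreteness, the flags in question would look roughly like this; whether they are safe for the teachers still needs the quality comparison mentioned above.)

```python
# Sketch: marian-decoder flags discussed above. --fp16 enables half-precision
# inference; a smaller beam without n-best output appears to keep the GPU busier.
extra_args = [
    "--fp16",
    "--beam-size", "4",
    # "--n-best",  # n-best output with beam 8 is where utilization drops
]
```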

@ZJaume (Collaborator) commented Oct 22, 2024

Another alternative would be to compare with CTranslate2, which has faster inference than Marian.
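A minimal sketch of what the CTranslate2 path could look like, assuming the Marian teacher has already been converted to a CTranslate2 model directory (the paths, tokenization, and beam size below are placeholders):

```python
# Minimal CTranslate2 sketch: translate a batch of SentencePiece-tokenized lines.
# "ct2_model_dir" and "vocab.spm" are hypothetical paths.
import ctranslate2
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="vocab.spm")
translator = ctranslate2.Translator("ct2_model_dir", device="cuda")

sentences = ["This is a test.", "GPU utilization should be higher."]
tokens = [sp.encode(s, out_type=str) for s in sentences]

results = translator.translate_batch(tokens, beam_size=4, max_batch_size=64)
outputs = [sp.decode(r.hypotheses[0]) for r in results]
print(outputs)
```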

@eu9ene (Collaborator, Author) commented Oct 22, 2024

Related to #165

@gregtatum (Member) commented

Training uses dynamic batch sizes: it changes the batch size over time to find the best value, so there isn't really a need to adjust it. It starts somewhat inefficient but quickly dials in the number to be as efficient as it can.

Translate tasks, however, do not use dynamic batch sizes. I played with them in #931 and, by adjusting the batching behavior, got them optimized to be about as efficient as training. I think this ~70% is just the cap on Marian's ability to utilize the GPUs. CTranslate2 was able to reach ~96% utilization and was much faster at the same beam size.

It will take a bit more time to get COMET scores for CTranslate2 so we can cross-compare. CTranslate2 doesn't support ensemble decoding, so we'll have to compare against Marian decoding with a single teacher.
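For the cross-comparison, here is a rough sketch of scoring both systems' outputs with COMET (assuming the unbabel-comet package; the model name and input handling are placeholders to check against the pipeline's eval step):

```python
# Rough sketch: compare COMET system scores of Marian vs. CTranslate2 outputs.
# Assumes the unbabel-comet package; inputs are lists of sentence strings.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))

def comet_score(src_lines, mt_lines, ref_lines):
    data = [
        {"src": s, "mt": m, "ref": r}
        for s, m, r in zip(src_lines, mt_lines, ref_lines)
    ]
    return model.predict(data, batch_size=32, gpus=1).system_score

# comet_score(src, marian_out, ref) vs. comet_score(src, ct2_out, ref)
```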
