Reduce GPU memory for Whisper models converted to ONNX #17378
Conversation
Use a hash table to speed this up, as in https://github.com/huggingface/optimum/blob/7450ca30e295abc9e20d56d0aa741402322def0f/optimum/onnx/transformations_utils.py#L31-L54? In that implementation, they skip initializers with dimension 1 and data type int32 or int64, as well as scalars with dimension 0. That filters out the small initializers and might also speed things up a little.
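For reference, here is a minimal sketch of that filtering idea. It is an illustration, not the optimum code itself; the function names `is_small_initializer` and `build_initializer_hash_table` are hypothetical. The point is to skip 0-d scalars and 1-element int32/int64 tensors before hashing, so the hash table only covers initializers large enough to be worth deduplicating.

```python
import hashlib
from onnx import GraphProto, TensorProto, numpy_helper

def is_small_initializer(init: TensorProto) -> bool:
    """Heuristic from the comment above: 0-d scalars and 1-element
    int32/int64 tensors are too small to be worth deduplicating."""
    if len(init.dims) == 0:
        return True
    if list(init.dims) == [1] and init.data_type in (TensorProto.INT32, TensorProto.INT64):
        return True
    return False

def build_initializer_hash_table(graph: GraphProto) -> dict:
    """Map a content-based key to the names of initializers with that
    content, skipping the small initializers filtered out above."""
    table = {}
    for init in graph.initializer:
        if is_small_initializer(init):
            continue
        digest = hashlib.sha256(numpy_helper.to_array(init).tobytes()).hexdigest()
        # Include dtype and shape in the key so identical bytes with
        # different types or shapes do not collide.
        key = (init.data_type, tuple(init.dims), digest)
        table.setdefault(key, []).append(init.name)
    return table
```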
Description
This PR changes the Whisper export scripts to further optimize the removal of duplicate initializers from two subgraphs.
The current greedy approach is faster by a large factor, but it misses some duplicate initializers. This results not only in a slightly larger Whisper model, but also in a model that uses more GPU memory.
The approach in this PR uses data hashes and caches to keep the export fast while no longer relying on the greedy approach.

Co-authored-by: Peter McAughan <[email protected]>
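To make the hash-and-cache idea concrete, the sketch below matches initializers across two subgraphs by content hash in a single pass. It is a minimal illustration under assumed names (`tensor_key`, `find_shared_initializers`), not the actual export-script API.

```python
import hashlib
from onnx import ModelProto, numpy_helper

def tensor_key(init) -> tuple:
    """Content-based cache key: dtype, shape, and a hash of the raw data."""
    digest = hashlib.sha256(numpy_helper.to_array(init).tobytes()).hexdigest()
    return (init.data_type, tuple(init.dims), digest)

def find_shared_initializers(model_a: ModelProto, model_b: ModelProto) -> list:
    """Pair up initializers with identical contents in the two subgraphs.

    Building a hash table over model_a and probing it with model_b is a
    single pass over each graph, versus the pairwise comparisons of a
    greedy scan, and it cannot miss a duplicate whose bytes match."""
    cache = {tensor_key(init): init for init in model_a.graph.initializer}
    shared = []
    for init_b in model_b.graph.initializer:
        init_a = cache.get(tensor_key(init_b))
        if init_a is not None:
            shared.append((init_a.name, init_b.name))
    return shared
```

Each matched pair can then be replaced by a single shared initializer that both subgraphs reference, which is what shrinks the exported model and its GPU memory footprint.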