
Reduce GPU memory for Whisper models converted to ONNX #17378

Merged 3 commits into main on Sep 5, 2023

Conversation

petermcaughan (Contributor)

Description

This PR changes the Whisper export scripts to further optimize the process of removing duplicate initializers from two subgraphs.

The current greedy approach is faster by a large factor, but it misses some duplicate initializers. This not only yields a slightly larger Whisper model, but also a model that uses more GPU memory.

The approach in this PR uses data hashes and caching to keep the export fast while no longer relying on a greedy approach.
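For intuition, here is a minimal sketch of hash-based deduplication across subgraphs. The function name and the way graphs are passed in are illustrative assumptions, not the export script's actual API:

```python
# Minimal sketch: bucket initializers by a hash of their decoded contents.
# find_duplicate_initializers and its signature are illustrative only.
from collections import defaultdict

from onnx import numpy_helper


def find_duplicate_initializers(graphs):
    """Group initializers that hold identical data across subgraphs."""
    buckets = defaultdict(list)
    for graph_idx, graph in enumerate(graphs):
        for init in graph.initializer:
            # Decoding to numpy normalizes how the payload is stored
            # (raw_data vs. typed fields); external data would need the
            # model's base directory passed to to_array.
            array = numpy_helper.to_array(init)
            key = (str(array.dtype), array.shape, hash(array.tobytes()))
            buckets[key].append((graph_idx, init.name))
    # Only buckets with more than one entry are duplicate candidates;
    # a byte-for-byte check within each bucket would make the match exact.
    return [names for names in buckets.values() if len(names) > 1]
```

Hashing each tensor once and caching the result keeps the pass close to linear in the number of initializers, instead of comparing every pair.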

tianleiwu (Contributor) commented Sep 1, 2023

Use a hash table to speed this up, like in https://github.com/huggingface/optimum/blob/7450ca30e295abc9e20d56d0aa741402322def0f/optimum/onnx/transformations_utils.py#L31-L54?

In that implementation, they ignore tensors with dimension 1 and data type int32 or int64, as well as scalars with dimension 0. That filters out small initializers and might also help speed things up a little.
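For illustration, that filter could be written as a small predicate (a sketch of the idea, not the optimum code itself):

```python
import onnx


def is_small_initializer(init: onnx.TensorProto) -> bool:
    """Skip tensors that are cheap to leave duplicated."""
    if len(init.dims) == 0:  # scalar
        return True
    # 1-D int32/int64 tensors are typically tiny shape or index constants.
    return len(init.dims) == 1 and init.data_type in (
        onnx.TensorProto.INT32,
        onnx.TensorProto.INT64,
    )
```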

Regarding some duplicate initializers not being caught and removed: that is likely caused by float16 tensor data being stored in the int32_data field, or by some tensors not being loaded from the external data file.
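To illustrate the first failure mode: the ONNX spec keeps float16 payloads in the int32_data field when they are not serialized as raw bytes, so two identical tensors can look different at the protobuf level. A sketch:

```python
import numpy as np
import onnx
from onnx import numpy_helper

values = np.ones(4, dtype=np.float16)

# Tensor 1: payload serialized into raw_data (what from_array produces).
t1 = numpy_helper.from_array(values, name="t1")

# Tensor 2: the same payload, but stored in int32_data, where the ONNX
# spec puts float16 values that are not in raw_data.
t2 = onnx.TensorProto(
    name="t2",
    data_type=onnx.TensorProto.FLOAT16,
    dims=values.shape,
    int32_data=values.view(np.uint16).astype(np.int32).tolist(),
)

# A field-level comparison sees two different tensors...
assert t1.raw_data and not t2.raw_data
# ...but decoding both to numpy shows the data is identical.
assert np.array_equal(numpy_helper.to_array(t1), numpy_helper.to_array(t2))
```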

tianleiwu previously approved these changes Sep 1, 2023
@petermcaughan petermcaughan merged commit fa28359 into main Sep 5, 2023
@petermcaughan petermcaughan deleted the petermca/whisper-gpu-memory branch September 5, 2023 23:24
tianleiwu pushed a commit that referenced this pull request Oct 31, 2023

kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024