-
Can multiple GPUs be used to load the model? For example, I have 2 GPUs (2x 8 GB); with one of them I can load the medium model, but the large model crashes due to lack of memory. The model uses only 1 GPU, so is it possible to set up whisper (one model per recording) to use multiple GPUs?
-
It's possible to load the encoder on one GPU and the decoder on the other, with a bit of a hack. First, please update the package so it has the latest commit (I made a minor modification for this):

pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

And then something like this is possible:

import whisper

# Load the weights on the CPU first, then move the encoder and decoder
# to separate GPUs.
model = whisper.load_model("large", device="cpu")
model.encoder.to("cuda:0")
model.decoder.to("cuda:1")

# Move the decoder's inputs to cuda:1 before its forward pass, and move
# its outputs back to cuda:0 afterwards.
model.decoder.register_forward_pre_hook(lambda _, inputs: tuple([inputs[0].to("cuda:1"), inputs[1].to("cuda:1")] + list(inputs[2:])))
model.decoder.register_forward_hook(lambda _, inputs, outputs: outputs.to("cuda:0"))

model.transcribe("jfk.flac")

The code above uses forward hooks to move tensors between the two GPUs around the decoder's forward pass. On my 2-GPU machine, the VRAM usage after executing the snippet above is:

[screenshot of VRAM usage on both GPUs]
-
Did you see a large difference in transcription speed after this?
-
Nope, have not tried that.
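For anyone who does want to measure the speed difference, here is a minimal timing sketch (assuming the model object from the snippet above and a local jfk.flac; time is from the standard library):

import time

# Time one transcription run; comparing against the same call with a
# single-GPU model load gives a rough speed difference.
start = time.perf_counter()
result = model.transcribe("jfk.flac")
print(f"transcription took {time.perf_counter() - start:.1f} s")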