
Problem with CUDA #168

Closed
mmol67 opened this issue Jan 6, 2024 · 12 comments · Fixed by #196
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@mmol67

mmol67 commented Jan 6, 2024

Hello. I was having a problem with the 239-character limit in Spanish (I've read an issue and a discussion about this in French), so I updated epub2tts from 2.2.14 to 2.3.4 by reinstalling from GitHub.

Now I'm getting a CUDA-related error when trying an epub conversion.

Using GPU
VRAM: 8506114048
Loading model: /home/ubuntu/.local/share/tts/tts_models--multilingual--multi-dataset--xtts_v2
 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.
 > Using model: xtts
[2024-01-06 17:27:14,703] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-06 17:27:15,505] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.12.6, git-hash=unknown, git-branch=unknown
[2024-01-06 17:27:15,507] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter replace_method is deprecated. This parameter is no longer needed, please remove from your call to DeepSpeed-inference
[2024-01-06 17:27:15,507] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2024-01-06 17:27:15,507] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/epub2tts", line 8, in <module>
    sys.exit(main())
  File "/home/ubuntu/.local/lib/python3.10/site-packages/epub2tts.py", line 724, in main
    mybook.read_book(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/epub2tts.py", line 379, in read_book
    self.model.load_checkpoint(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/TTS/tts/models/xtts.py", line 783, in load_checkpoint
    self.gpt.init_gpt_for_inference(kv_cache=self.args.kv_cache, use_deepspeed=use_deepspeed)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/TTS/tts/layers/xtts/gpt.py", line 224, in init_gpt_for_inference
    self.ds_engine = deepspeed.init_inference(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 158, in __init__
    self._apply_injection_policy(config)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 418, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 342, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 586, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 646, in _replace_module
    _, layer_id = _replace_module(child,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 622, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 298, in replace_fn
    new_module = replace_with_policy(child,
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 247, in replace_with_policy   
    _container.create_module()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/module_inject/containers/gpt2.py", line 20, in create_module
    self.module = DeepSpeedGPTInference(_config, mp_group=self.mp_group)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_gpt.py", line 20, in __init__  
    super().__init__(config, mp_group, quantize_scales, quantize_groups, merge_count, mlp_extra_grouping)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 58, in __init__
    inference_module = builder.load()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 458, in load
    return self.jit_load(verbose)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 461, in jit_load
    if not self.is_compatible(verbose):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/ops/op_builder/transformer_inference.py", line 29, in is_compatible  
    sys_cuda_major, _ = installed_cuda_version()
  File "/home/ubuntu/.local/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 50, in installed_cuda_version
    raise MissingCUDAException("CUDA_HOME does not exist, unable to compile CUDA op(s)")
deepspeed.ops.op_builder.builder.MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)

No other changes or updates have been made, either on the host or in the container (this is running in an LXC Ubuntu container).

Any easy fix?

And I want to mention that with the last version, using DeepSpeed and my modest GPU (GTX 1070), the conversion speed ratio is slightly under 1! Amazing!

@aedocw aedocw self-assigned this Jan 6, 2024
@aedocw
Owner

aedocw commented Jan 6, 2024

Could you try pip show deepspeed and share the version you're using? I'm on 0.12.6 and have had no problems. I saw discussion of setting CUDA_HOME here, but since I haven't hit any issues I didn't follow those steps, so I can't say whether they're useful. I would start by updating deepspeed if it's not at 0.12.6, and if that doesn't work, figure out what CUDA_HOME should be and make sure that environment variable is exported.

Please update this with what you find in case others hit the same thing; there may also be other folks who have seen this and can offer guidance.
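For anyone landing here, the "figure out what CUDA_HOME should be" step can be sketched roughly as follows. The paths below are common Ubuntu defaults, not a given on every system; adapt them to your install:

```shell
# Sketch: find a CUDA toolkit and export CUDA_HOME so DeepSpeed can
# JIT-compile its inference ops. Paths are typical Ubuntu defaults.
if command -v nvcc >/dev/null 2>&1; then
    # Derive CUDA_HOME from wherever nvcc lives (strip the /bin/nvcc suffix).
    CUDA_HOME="$(dirname "$(dirname "$(command -v nvcc)")")"
elif [ -d /usr/local/cuda ]; then
    # Symlink typically created by the NVIDIA toolkit installer.
    CUDA_HOME=/usr/local/cuda
fi

if [ -n "${CUDA_HOME:-}" ]; then
    export CUDA_HOME
    echo "CUDA_HOME=$CUDA_HOME"
else
    echo "no CUDA toolkit found; DeepSpeed will raise MissingCUDAException"
fi
```

If this prints a path but the error persists, make sure the export happens in the same shell session (or shell profile) that launches epub2tts.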

@mmol67
Author

mmol67 commented Jan 6, 2024

Hello. Output of pip show deepspeed:

Name: deepspeed
Version: 0.12.6
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: [email protected]
License: Apache Software License 2.0
Location: /home/ubuntu/.local/lib/python3.10/site-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, pynvml, torch, tqdm
Required-by: epub2tts

This is really strange. Yesterday's version was working with DeepSpeed; now it gives this error. If I try export CUDA_HOME=/usr/local/cuda, then I get

FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda/bin/nvcc'

So it seems that now I need to install the CUDA toolkit?

@aedocw
Owner

aedocw commented Jan 6, 2024

Maybe? I had CUDA toolkit installed from other stuff, and I don't have a clean environment to test from unfortunately.

I can't think of anything that changed yesterday that would have triggered this. The only big change was to epub2tts itself, which switched the Coqui TTS method it uses when you pick one of the studio voices (basically using the same streaming method it was already using for XTTS).

I wonder if pip install . --upgrade pulled in something new from TTS or one of its requirements? Sorry I can't be of more help with this. If I get a chance I'll see about spinning up a VM on my GPU machine to see what happens when I start clean.

@danielw97

Hi,
I've been working with this over the last few days after acquiring a new system with a better GPU, and getting set up on WSL as my main environment for running this.
I ended up using miniconda to set up my Python environment for epub2tts: even though I had the CUDA toolkit installed, I was getting errors with DeepSpeed, although everything else seemed to work.
The nice thing about using Anaconda to set up the CUDA and torch environment is that it seems to handle the library linking and dependency issues and lets everything run nicely.
Not sure if this is any help, but I'm happy to provide more detail if that would help get you set up.

@aedocw aedocw added the documentation Improvements or additions to documentation label Jan 8, 2024
@aedocw
Owner

aedocw commented Jan 8, 2024

In case I didn't mention this anywhere else (now that I think about it, I probably did not).

I added a --no-deepspeed flag, which disables DeepSpeed even if it finds the package installed in the environment. Could you give that a try and see whether you're able to use the GPU without DeepSpeed? It will help with troubleshooting this. Ultimately we'll figure it out so you get back to what was working (GPU + DeepSpeed).
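A hypothetical invocation of the flag (the book filename is a placeholder; only the flag name comes from this thread), guarded so the sketch degrades gracefully on machines where epub2tts isn't installed:

```shell
# Run the conversion on GPU but skip DeepSpeed's JIT-compiled inference ops,
# which is the part that fails when CUDA_HOME/nvcc is missing.
if command -v epub2tts >/dev/null 2>&1; then
    epub2tts mybook.epub --no-deepspeed
else
    echo "epub2tts not on PATH; the intended command is:"
    echo "  epub2tts mybook.epub --no-deepspeed"
fi
```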

@Nikanoru

Nikanoru commented Jan 8, 2024

> In case I didn't mention this anywhere else (now that I think about it, I probably did not).
>
> I added a flag "--no-deepspeed", which disables use of deepspeed even if it finds that package is installed in the environment. Could you give that a try and see if you're able to use GPU just without deepspeed? It will help with troubleshooting this. Ultimately we'll figure it out so you get back to what was working (gpu + deepspeed).

Since the other user did not respond to your "try with --no-deepspeed" suggestion, I just tried it, and it does indeed seem to work. It's at 25% currently; I will update once the process finishes and I've listened to the file.

Edit: The process finished successfully and the audio file sounds very nice :) Thank you!

@mmol67
Author

mmol67 commented Jan 9, 2024

> In case I didn't mention this anywhere else (now that I think about it, I probably did not).
>
> I added a flag "--no-deepspeed", which disables use of deepspeed even if it finds that package is installed in the environment. Could you give that a try and see if you're able to use GPU just without deepspeed? It will help with troubleshooting this. Ultimately we'll figure it out so you get back to what was working (gpu + deepspeed).

Thank you very much. As @Nikanoru said, the flag works. When I have some spare time, I will try it in a new container and report back.

@mmol67
Author

mmol67 commented Jan 10, 2024

Hello again. I set up a new LXC (LXD) container, gave it GPU permissions and so on, installed all the dependencies, and then installed the latest version of epub2tts.

Again, the CUDA message. Using --no-deepspeed worked.

Then I installed the CUDA toolkit inside the container. It worked. The process took 248 min plus multiplexing and so on, and the generated audio runs approximately 376 min.

So my solution was to install the CUDA toolkit in order to use DeepSpeed.

Thank you very much.

@Nikanoru

Nikanoru commented Jan 12, 2024

> Hello again. I set up a new LXC(LXD) container, giving GPU permissions etc., installed all dependencies, and then installing epub2tts last version.
>
> Again, the CUDA message. Using --no-deepspeed worked.
>
> Then I installed CUDA toolkit inside the container. It worked. The process took 248 min + multiplex and so on, and the duration of the generated audio is 376 min aprox.
>
> So my solution was to install CUDA toolkit to use DeepSpeed.
>
> Thank you very much.

Hey thank you for your message.

Can you go into more detail about "installed CUDA toolkit inside the container"? I am very new to Ubuntu/Linux and I had the same issue as you.

@mmol67
Author

mmol67 commented Jan 12, 2024

> Can you go into more detail about "installed CUDA toolkit inside the container"? I am very new to Ubuntu/Linux and I had the same issue as you.

I went to https://developer.nvidia.com/cuda-downloads?target_os=Linux and made my choice (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network), then followed the instructions:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-3

Restart the container (or the computer) and it works.

I think you need the NVIDIA proprietary drivers installed. In my case they are installed on the host, and I configured my container to allow it to use them and the GPU.
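A post-install sanity check along these lines may help. The /usr/local/cuda path is an assumption based on the default Ubuntu packaging; adjust it to match your install:

```shell
# After installing cuda-toolkit, confirm nvcc is reachable and export
# CUDA_HOME so DeepSpeed's op builder can find it.
CUDA_HOME="${CUDA_HOME:-/usr/local/cuda}"
export CUDA_HOME
if [ -x "$CUDA_HOME/bin/nvcc" ]; then
    "$CUDA_HOME/bin/nvcc" --version   # prints the installed CUDA release
elif command -v nvcc >/dev/null 2>&1; then
    nvcc --version
else
    echo "nvcc not found; DeepSpeed will still raise MissingCUDAException"
fi
```

If nvcc prints a version here but epub2tts still fails, check that the export is visible to the shell that actually runs epub2tts.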

@Nikanoru

Nikanoru commented Jan 12, 2024

> > Can you go into more detail about "installed CUDA toolkit inside the container"? I am very new to Ubuntu/Linux and I had the same issue as you.
>
> I went to https://developer.nvidia.com/cuda-downloads?target_os=Linux and make my choice (https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network)
> then followed instructions
>
> wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
> sudo dpkg -i cuda-keyring_1.1-1_all.deb
> sudo apt-get update
> sudo apt-get -y install cuda-toolkit-12-3
>
> restart the container (or the computer) and it is working.
>
> I think you need NVIDIA proprietary drivers installed. In my case I have them installed in the host and set my container to allow it to use them and the GPU.

Thank you so much!
Your instructions work :) I didn't even have to reboot my system.
DeepSpeed cut my time down from 5:09 to 2:00! Nice :)

@aedocw aedocw linked a pull request Jan 15, 2024 that will close this issue
@aedocw
Owner

aedocw commented Jan 15, 2024

Documentation update should cover this now.

@aedocw aedocw closed this as completed Jan 15, 2024