Inference in Colab fails #13

Open
PrithivirajDamodaran opened this issue Aug 27, 2021 · 8 comments

@PrithivirajDamodaran

PrithivirajDamodaran commented Aug 27, 2021

See the message below (I added some print statements to debug and removed clear_output). Please advise:

chosen_model: https://www.dropbox.com/s/8mmgnromwoilpfm/16L_64HD_8H_512I_128T_cc12m_cc3m_3E.pt?dl=1
folder_ /content/outputs/Cucumber_on_a_brown_wooden_chair/
Traceback (most recent call last):
  File "/content/dalle-pytorch-pretrained/DALLE-pytorch/generate.py", line 18, in <module>
    from dalle_pytorch import DiscreteVAE, OpenAIDiscreteVAE, VQGanVAE, DALLE
  File "/content/dalle-pytorch-pretrained/DALLE-pytorch/dalle_pytorch/__init__.py", line 1, in <module>
    from dalle_pytorch.dalle_pytorch import DALLE, CLIP, DiscreteVAE
  File "/content/dalle-pytorch-pretrained/DALLE-pytorch/dalle_pytorch/dalle_pytorch.py", line 11, in <module>
    from dalle_pytorch.vae import OpenAIDiscreteVAE, VQGanVAE
  File "/content/dalle-pytorch-pretrained/DALLE-pytorch/dalle_pytorch/vae.py", line 14, in <module>
    from taming.models.vqgan import VQModel, GumbelVQ
ImportError: cannot import name 'GumbelVQ' from 'taming.models.vqgan' (/usr/local/lib/python3.7/dist-packages/taming/models/vqgan.py)
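
For reference, a minimal check (assuming the same Colab environment as above) of whether the installed taming-transformers build defines GumbelVQ at all:

import taming.models.vqgan as vqgan

# The traceback shows taming.models.vqgan itself imports fine, so inspect the
# module: False here means the installed taming-transformers release predates
# the GumbelVQ class.
print(hasattr(vqgan, "GumbelVQ"))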

@johnpaulbin
Contributor

Fixed-- retry.

@PrithivirajDamodaran
Author

[Screenshot: 2021-09-14 at 10:01:28 AM]

Now I'm getting a different issue: file/image not found. Under '/content/dalle-pytorch-pretrained/' there is no folder by the name DALLE-pytorch.

@Vbansal21

Vbansal21 commented Oct 4, 2021

> [Screenshot: 2021-09-14 at 10:01:28 AM]
>
> Now I'm getting a different issue: file/image not found. Under '/content/dalle-pytorch-pretrained/' there is no folder by the name DALLE-pytorch.

The issue is in the !wget command in the 2nd code block ("2 Install required dependencies."), line 36:

!wget "https://github.com/lucidrains/DALLE-pytorch/archive/refs/tags/0.14.3.zip" -O /content/

The /content/ at the end is a directory; change it to /content/0.14.3.zip. That solves the above issue.

After that there are new issues:

  File "/content/dalle-pytorch-pretrained/DALLE-pytorch/dalle_pytorch/attention.py", line 362, in forward
    out = self.attn_fn(q, k, v, attn_mask = attn_mask, key_padding_mask = key_pad_mask)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/ops/sparse_attention/sparse_self_attention.py", line 126, in forward
    assert query.dtype == torch.half, "sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support"
AssertionError: sparse attention only supports training in fp16 currently, please file a github issue if you need fp32 support
Finished generating images, attempting to display results...

Using deepspeed instead (replacing python with deepspeed and adding the --deepspeed and --fp16 args), by changing code block 4 ("4 Try out the model."), lines 25 to 28, to:

25|if chosen_model not in allow:
26|  !deepspeed /content/dalle-pytorch-pretrained/DALLE-pytorch/generate.py --dalle_path=$checkpoint_path --taming --text="$text" --num_images=$num_images --batch_size=$batch_size --outputs_dir="$_folder" --deepspeed --fp16; wait;
27|else:
28|  !deepspeed /content/dalle-pytorch-pretrained/DALLE-pytorch/generate.py --dalle_path=$checkpoint_path --taming --text="$text" --num_images=$num_images --batch_size=$batch_size --outputs_dir="$_folder" --bpe_path variety.bpe --deepspeed --fp16; wait;

doesn't help:

generate.py: error: unrecognized arguments: --local_rank=0 --deepspeed --fp16

The problem is that a --local_rank=0 arg is passed in somewhere.

Edit/Update: after solving the local_rank issue and the fp16 attention issue (a rough fix to generate.py, adding a dummy local_rank argument to the parser, and in attention.py manually converting to fp16 and back to the original x.dtype; a sketch of that kind of patch follows the error below), a new issue arises:
ImportError: cannot import name 'MatMul' from 'deepspeed.ops.sparse_attention' (/usr/local/lib/python3.7/dist-packages/deepspeed/ops/sparse_attention/__init__.py)
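
For reference, the rough local_rank / fp16 patch mentioned above looks something like the sketch below; the names are illustrative (sparse_attn_fp16 is not in the actual code), and only q, k, v, attn_fn and the mask keywords come from the traceback:

import argparse

# 1) generate.py sketch: accept the --local_rank flag that the deepspeed launcher
#    injects, so argparse no longer fails with "unrecognized arguments".
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0,
                    help='dummy argument, only present so the deepspeed launcher can pass it')

# 2) attention.py sketch: cast the sparse-attention inputs to fp16 for the kernel,
#    then cast the output back to the caller's original dtype. attn_fn stands in
#    for the module called as self.attn_fn in attention.py.
def sparse_attn_fp16(attn_fn, q, k, v, attn_mask=None, key_pad_mask=None):
    orig_dtype = q.dtype
    out = attn_fn(q.half(), k.half(), v.half(),
                  attn_mask=attn_mask, key_padding_mask=key_pad_mask)
    return out.to(orig_dtype)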

@PrithivirajDamodaran
Author

> [Screenshot: 2021-09-14 at 10:01:28 AM]
>
> Now I'm getting a different issue: file/image not found. Under '/content/dalle-pytorch-pretrained/' there is no folder by the name DALLE-pytorch.

@johnpaulbin Sorry, not trying to be a pest, but please advise on this issue.

@jonathanfrawley

jonathanfrawley commented Oct 25, 2021

Hi, I think this issue is due to this line in the second code cell in the Colab notebook, which fails (silently, as for some reason the output is cleared later in the cell!):

!wget "https://github.com/lucidrains/DALLE-pytorch/archive/refs/tags/0.14.3.zip" -O /content/

It seems you need to specify the full path to the output file for wget, like this:

!wget "https://github.com/lucidrains/DALLE-pytorch/archive/refs/tags/0.14.3.zip" -O /content/0.14.3.zip

I don't know who has access to that notebook, but if it could be updated that would be great.

@Vbansal21

Vbansal21 commented Oct 25, 2021

> Hi, I think this issue is due to this line in the second code cell in the Colab notebook, which fails (silently, as for some reason the output is cleared later in the cell!):
>
> !wget "https://github.com/lucidrains/DALLE-pytorch/archive/refs/tags/0.14.3.zip" -O /content/
>
> It seems you need to specify the full path to the output file for wget, like this:
>
> !wget "https://github.com/lucidrains/DALLE-pytorch/archive/refs/tags/0.14.3.zip" -O /content/0.14.3.zip
>
> I don't know who has access to that notebook, but if it could be updated that would be great.

Sure, that will solve this issue, but there are new issues after that; I've mentioned those above. It is a MatMul import error from deepspeed.

@hamdjalil

> See the message below (I added some print statements to debug and removed clear_output). Please advise:
>
> [...] ImportError: cannot import name 'GumbelVQ' from 'taming.models.vqgan' (/usr/local/lib/python3.7/dist-packages/taming/models/vqgan.py)

@johnpaulbin @PrithivirajDamodaran can you please share how this issue was resolved? I'm facing it on my GPU.

@Cyberes

Cyberes commented Dec 25, 2021

On the DeepSpeed Sparse Attention doc page, there's this:

> Note: Currently, DeepSpeed Sparse Attention can be used only on NVIDIA V100 or A100 GPUs using Torch >= 1.6 and CUDA 10.1, 10.2, 11.0, or 11.1.

I have access to a V100 and ran the notebook on it (after adding the fixes), but I encountered the same issue. I ran it on CUDA 11.4 rather than 11.1, so that may be an issue. Colab is on 11.1.

Is it possible to run the notebook on an older version of DeepSpeed?
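
For what it's worth, pinning an older DeepSpeed release in the install cell would be one way to test that; this is only a sketch, and the placeholder version would need to be swapped for whichever release still exports MatMul from deepspeed.ops.sparse_attention:

# Sketch for the install cell: pin an older DeepSpeed, then restart the runtime.
# 0.3.16 is a placeholder version; confirm which release still provides
# deepspeed.ops.sparse_attention.MatMul before relying on it.
!pip install "deepspeed==0.3.16" --force-reinstall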
