
[BUG] Very strange error while running LLaMa-2 with DeepSpeed-Chat #4229

Open

vinod-sarvam opened this issue Aug 29, 2023 · 22 comments
Labels: bug (Something isn't working), deepspeed-chat (Related to DeepSpeed-Chat)

@vinod-sarvam commented Aug 29, 2023

I am trying to run DeepSpeed-Chat almost out of the box, albeit with some custom data and minor modifications to the code for logging, data loading, and model checkpointing. From the model's perspective, however, we are using DeepSpeed as-is. I am now hitting strange errors in a run that previously worked well.
The error is excerpted below (I am only pasting the parts I think are relevant):

Loading checkpoint shards: 50%|█████ | 1/2 [00:02<00:02, 2.30s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.37s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.51s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embeding dimension will be 32008. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide:https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
[2023-08-28 17:41:40,445] [INFO] [partition_parameters.py:332:__exit__] finished initializing model - num_params = 292, num_elems = 6.87B
/.local/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")


File "/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 306, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
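For context on what the assertion checks: under ZeRO-3 every parameter is kept partitioned across ranks and must be gathered (status AVAILABLE) before a forward pass can use it, while the summary above shows the parameter stuck at NOT_AVAILABLE with a zero-element shard. Below is a minimal sketch of inspecting that state with deepspeed.zero.GatheredParameters; the `model_engine` name is an assumption for illustration, not something from the script below.

```python
# Minimal sketch, not from DeepSpeed-Chat: inspect a ZeRO-3 partitioned
# parameter and temporarily gather it. `model_engine` is assumed to be the
# engine returned by deepspeed.initialize(...) with zero_optimization.stage=3.
import deepspeed
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus

emb_weight = model_engine.module.get_input_embeddings().weight
# Outside a gather, only a local shard exists and the status is NOT_AVAILABLE.
print(emb_weight.ds_status, emb_weight.ds_shape, emb_weight.shape)

with deepspeed.zero.GatheredParameters(emb_weight):
    # Inside the context the full tensor is materialized on this rank,
    # which is the state fetch_sub_module asserts on.
    assert emb_weight.ds_status == ZeroParamStatus.AVAILABLE
    print(emb_weight.shape)  # full (vocab_size, hidden_size)
```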

The DS parameters are as follows:

```bash
ZERO_STAGE=3
deepspeed main.py \
   --sft_only_data_path local/jsonfile \
   --data_split 10,0,0 \
   --model_name_or_path meta-llama/Llama-2-7b-hf \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --gradient_checkpointing \
   --max_seq_len 512 \
   --learning_rate 4e-4 \
   --weight_decay 0. \
   --num_train_epochs 4 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 100 \
   --print_loss \
   --project 'deepspeed-eval' \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
```
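For reference, the script itself only passes --zero_stage 3 and --deepspeed, and DeepSpeed-Chat assembles the actual runtime config internally. A rough sketch of the kind of ZeRO Stage-3 config this corresponds to is shown below; the batch sizes mirror the flags above, while the remaining values are assumptions rather than what the script generates.

```python
# Illustrative sketch of a ZeRO Stage-3 DeepSpeed config as a Python dict.
# This is NOT the config DeepSpeed-Chat generates; only the batch-size and
# accumulation values come from the command line above.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,                        # partition params, gradients, and optimizer states
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_param_persistence_threshold": 1e4,
        "stage3_prefetch_bucket_size": 3e7,
    },
}

# Typically passed to deepspeed.initialize(model=model, config=ds_config, ...)
```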

The versions of the essential libraries are as follows:

CUDA driver version - 530.30.02
CUDA / nvcc version - 12.1
deepspeed version - 0.10.1
transformers version - 4.29.2
torch version - 2.1.0.dev20230828+cu121
python version - 3.8
datasets version - 2.12.0

Please let us know how to proceed with this strange error.

@vinod-sarvam added the bug (Something isn't working) and deepspeed-chat (Related to DeepSpeed-Chat) labels on Aug 29, 2023
@vinod-sarvam (Author)

Please find the ds_report output below:

[2023-08-29 02:06:58,321] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
[WARNING] using untested triton version (2.1.0+e6216047b8), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/.local/lib/python3.8/site-packages/torch']
torch version .................... 2.1.0.dev20230828+cu121
deepspeed install path ........... ['/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.10.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
shared memory (/dev/shm) size .... 433.06 GB

@GasolSun36

It seems we are facing the same error; see this issue.

@awan-10 self-assigned this on Aug 29, 2023
@awan-10 (Contributor) commented Aug 29, 2023

@vinod-sarvam and @GasolSun36 -- thank you for reporting the error. I have responded in the other thread as well.

We have two scripts for llama2 models that we have tested quite extensively, and they have worked without errors for us. Can you please try either of the scripts in this folder first?

https://github.com/microsoft/DeepSpeedExamples/tree/master/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/llama2

Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)

@awan-10 (Contributor) commented Aug 29, 2023

Also, kindly update your transformers package to 4.31.0.

@vinod-sarvam (Author)

Hi Awan,

We are currently focusing only on SFT (Step 1), so I am not sure the RLHF scripts are relevant. However, if your suggestion is that running them will help narrow down the error / bug for our SFT case, I am happy to do that and report back.

Thanks,
Vinod

@vinod-sarvam (Author)

Also, similar to the other thread, ZeRO Stage-2 seems to be operational. Only ZeRO Stage-3 seems to be failing.

@sagar-deepscribe

I have been facing the same error:

  • Somehow, when I run it on my local GPU setup it runs fine, though it is hard to say why. [The local setup is single-instance, multi-GPU.]
  • I built a Docker image and tried to run it on a cloud cluster, where I get the same error. [This setup is multi-instance, multi-GPU.]

@skbacker commented Sep 1, 2023

I got the same error, but after downgrading transformers (4.32.0 -> 4.31.0) I could run it :)
pip install --upgrade transformers==4.31.0

@vinod-sarvam (Author)

I tried this and it does not work for me.

@iamsile commented Sep 1, 2023

I tried this and it does not work for me.

It does not work for me either.

@AndyW-llm

I got a similar warning when running bash /training_scripts/opt/single_gpu/run_1.3b from DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning.

You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embeding dimension will be....

The warning disappeared after downgrading transformers with pip install --upgrade transformers==4.31.0
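For reference, the parameter the warning mentions can also be passed directly to resize_token_embeddings. The sketch below shows plain transformers usage, not the DeepSpeed-Chat code path, and it is not established in this thread that this also avoids the ZeRO-3 assertion above.

```python
# Minimal sketch: resize embeddings to a multiple of 8 so the Tensor Core
# warning goes away. Plain transformers usage, outside ZeRO-3.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

tokenizer.add_special_tokens({"pad_token": "[PAD]"})
# Rounding the new vocab size up to a multiple of 8 keeps the embedding
# dimensions Tensor Core friendly, which is what the warning is about.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
```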

@GasolSun36

Update: downgrading transformers 4.32.0 -> 4.31.0 works for me.
For reference:
deepspeed==0.10.1
pytorch==1.13.1
cuda==11.6

@gkcng commented Sep 4, 2023

Ditto.

Transformers 4.32.0->4.31.0
torch 1.13.1+cu117
deepspeed 0.10.2

@iamsile commented Sep 4, 2023

I reinstalled transformers==4.31.0, but step 3 still fails like before.

@vinod-sarvam (Author)

However, deepspeed==0.10.1 does not ship with fused_adam, and installing it with DS_BUILD_FUSED_ADAM=1 pip install deepspeed does not work either.

@awan-10 (Contributor) commented Sep 8, 2023

Hi all, thank you for trying out various things. We are aware of this new problem with ZeRO Stage 3 and HF transformers > 4.31.0. I am closing this issue since the downgrade resolves the problem. Please reopen it if you hit a new error, and I strongly encourage anyone who still sees errors despite downgrading to transformers 4.31.0 to open a new issue.

@awan-10 closed this as completed on Sep 8, 2023
@shyoulala

I reinstalled transformers==4.31.0, but step 3 still fails like before.

@awan-10 reopened this on Sep 22, 2023
@iyume commented Sep 24, 2023

I'm running Stable Diffusion training with diffusers and received this warning message.

However, conda list | grep deepspeed shows no deepspeed dependency, so perhaps this is not a deepspeed issue.

@andrew-zm-ml

I, too, am seeing this issue even with transformers==4.31.0.

@Wang-Xiaodong1899

Same issue.

Phuong1908 added a commit to Phuong1908/acsqg that referenced this issue Sep 28, 2023
[c] Downgrade transformer to 4.31.0 to avoid unexpected warning

deepspeedai/DeepSpeed#4229
@kimlee1874

I have been facing the same error:

  • Somehow, when I run it on my local GPU setup it runs fine, though it is hard to say why. [The local setup is single-instance, multi-GPU.]
  • I built a Docker image and tried to run it on a cloud cluster, where I get the same error. [This setup is multi-instance, multi-GPU.]

Similarly, it runs normally on a single node with multiple GPUs, but the same error occurs with multiple nodes and multiple GPUs. Is there any solution?

@amoyplane

Has this issue been solved?
I am hitting this problem in 2025, but I can't downgrade transformers because I need its newer features.

CUDA Version: 12.4
torch==2.5.1
deepspeed==0.16.2
transformers==4.46.1

Is there any solution?
Thank you.
