[BUG] Very strange error while running LLaMa-2 with DeepSpeed-Chat #4229
Comments
PFA ds_report:
[2023-08-29 02:06:58,321] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
Seems we face the same error; see this issue.
@vinod-sarvam and @GasolSun36 -- thank you for reporting the error. I have responded in the other thread as well. We have two scripts for llama2 models that we have tested quite extensively and they have worked without errors for us. Can you please try either of the scripts in this folder first? Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)
Also, kindly update your transformers package to 4.31.0.
Hi Awan, we are currently focusing only on SFT (Step 1), so I am not sure if RLHF is relevant. However, if your suggestion is that running this will help unravel the error/bug in our SFT case, I am happy to do that and report back. Thanks.
Also, similar to the other thread, ZeRO Stage-2 seems to be operational; only ZeRO Stage-3 seems to be failing.
I have been facing the same error:
I got the same error, but after downgrading transformers (4.32.0 -> 4.31.0) I could run it :)
I tried and this does not work for me.
I tried and it does not work for me either.
Got a similar warning when running.
The warning disappeared after downgrading transformers.
update:
Ditto. transformers 4.32.0 -> 4.31.0
I had re-installed transformers==4.31.0, but step 3 still failed like before.
deepspeed==0.10.1, however, does not have fused_adam and does not work with DS_BUILD_FUSED_ADAM=1 pip install deepspeed.
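(As a way to narrow that down, here is a small sketch, not code from this thread, with the device assumption mine: constructing FusedAdam forces DeepSpeed to load or JIT-build the fused Adam extension, so a missing or broken fused_adam op should surface immediately.)
'''
# Sketch (not from this thread): constructing FusedAdam loads or JIT-builds the
# fused Adam extension, so a broken/missing fused_adam op should fail right here.
import torch
from deepspeed.ops.adam import FusedAdam

if torch.cuda.is_available():
    params = [torch.nn.Parameter(torch.zeros(8, device="cuda"))]
    opt = FusedAdam(params, lr=1e-3)  # triggers the fused_adam op build/load
    print("fused_adam available:", type(opt).__name__)
else:
    print("CUDA not available; the fused_adam op cannot be built on this machine")
'''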
Hi All, thank you for trying out various things. We are aware of this new problem with ZeRO Stage 3 and HF transformers > 4.31.0. Closing this issue as the downgrade resolves the problem; please reopen if you hit a new error. We also strongly encourage anyone still seeing errors despite downgrading to transformers 4.31.0 to open a new issue.
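For anyone scripting around this, a minimal version guard (a sketch, not part of DeepSpeed-Chat) reflecting the conclusion above -- ZeRO Stage-3 only ran for these setups with transformers at or below 4.31.0 -- could look like:
'''
# Sketch: fail fast if the installed transformers version is newer than the one
# this thread found to work with ZeRO Stage-3 (4.31.0).
from packaging import version
import transformers

if version.parse(transformers.__version__) > version.parse("4.31.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} installed; this thread reports "
        "ZeRO Stage-3 failures above 4.31.0 -- try: pip install transformers==4.31.0"
    )
'''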
I had re-installed transformers==4.31.0, but step 3 still failed like before.
I'm running Stable Diffusion training with … but there is no …
I, too, am seeing this issue even with …
Same issue.
Downgrade transformers to 4.31.0 to avoid unexpected warning deepspeedai/DeepSpeed#4229
Similarly, it runs normally on a single node with multiple GPUs, but the same error occurs with multiple nodes and multiple GPUs. Is there any solution?
Has this issue been solved? CUDA version: 12.4. Is there any solution?
I am trying to run DeepSpeed-Chat almost out of the box, albeit with some custom data and minor modifications to the code for logging, data loading, and model checkpointing. From the model's perspective, however, we are using DeepSpeed as-is. I am facing strange errors in a run that was previously working well.
The error is catalogued below (I am only pasting parts that I think are relevant):
Loading checkpoint shards: 50%|█████ | 1/2 [00:02<00:02, 2.30s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.37s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.51s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32008. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
[2023-08-28 17:41:40,445] [INFO] [partition_parameters.py:332:exit] finished initializing model - num_params = 292, num_elems = 6.87B
/.local/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
File "/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 306, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
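For context on that assertion: parameter id 292 is reported with numel 0 and ds_shape (0, 4096), i.e. an empty ZeRO Stage-3 shard. A small debugging sketch (not part of the original script; it assumes model is the ZeRO-3 partitioned HF model) that lists any such empty parameters after loading and resizing:
'''
# Debugging sketch (not from the original script): list parameters whose
# ZeRO Stage-3 shard is empty, which is what the assertion above reports for id 292.
def find_empty_zero3_params(model):
    for name, param in model.named_parameters():
        if hasattr(param, "ds_id") and getattr(param, "ds_numel", 0) == 0:
            # ds_summary() produces the same dict shown in the AssertionError
            print(name, param.ds_summary())
'''
The warning above also points at passing pad_to_multiple_of when the embedding is resized; whether that interacts with the assertion is not clear from the log alone.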
The DS parameters are as follows:
'''
ZERO_STAGE=3
deepspeed main.py \
   --sft_only_data_path local/jsonfile \
   --data_split 10,0,0 \
   --model_name_or_path meta-llama/Llama-2-7b-hf \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --gradient_checkpointing \
   --max_seq_len 512 \
   --learning_rate 4e-4 \
   --weight_decay 0. \
   --num_train_epochs 4 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 100 \
   --print_loss \
   --project 'deepspeed-eval' \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
'''
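For readers unfamiliar with DeepSpeed-Chat: the script does not pass a JSON config; the --zero_stage flag is turned into a DeepSpeed config internally. A rough, illustrative sketch of the kind of ZeRO Stage-3 config this corresponds to (values are placeholders, not copied from the repo):
'''
# Illustrative ZeRO Stage-3 config sketch (values are placeholders):
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # matches --per_device_train_batch_size
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,                              # --zero_stage 3
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_param_persistence_threshold": 10000,
    },
}
'''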
The versions of the essential libraries are as follows:
CUDA driver version - 530.30.02
CUDA / nvcc version - 12.1
deepspeed version - 0.10.1
transformers version - 4.29.2
torch version - 2.1.0.dev20230828+cu121
python version - 3.8
datasets version - 2.12.0
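One way to capture the same report programmatically (illustrative; it simply prints whatever is installed):
'''
# Print the versions of the libraries listed above.
import sys
import datasets
import deepspeed
import torch
import transformers

print("python       ", sys.version.split()[0])
print("torch        ", torch.__version__, "| CUDA", torch.version.cuda)
print("transformers ", transformers.__version__)
print("deepspeed    ", deepspeed.__version__)
print("datasets     ", datasets.__version__)
'''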
Please let us know how to proceed with this strange error.