[BUG] Very strange error while running LLaMa-2 with DeepSpeed-Chat #4229
Comments
PFA ds_report:
[2023-08-29 02:06:58,321] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at
Seems we face the same error; see this issue.
@vinod-sarvam and @GasolSun36 -- thank you for reporting the error. I have responded in the other thread as well. We have two scripts for llama2 models that we have tested quite extensively and they have worked without errors for us. Can you please try either of the scripts in this folder first? Please let me know if these work on your system or not. We will work closely with you to resolve these issues :)
Also, kindly update your transformers package to 4.31.0.
Hi Awan, we are currently focusing only on SFT (Step 1), so I am not sure if RLHF is relevant. However, if your suggestion is that running this will help unravel the error/bug in our SFT case, I am happy to do that and report back. Thanks.
Also, similar to the other thread, ZeRO Stage-2 seems to be operational; only ZeRO Stage-3 seems to be failing.
I have been facing the same error:
I got the same error, but after downgrading transformers (4.32.0 -> 4.31.0) I could run it :)
I tried and this does not work for me.
I tried and it does not work for me either.
Got a similar warning when running.
The warning disappeared after downgrading transformers.
update:
Ditto. transformers 4.32.0 -> 4.31.0
I had re-installed transformers==4.31.0, but step 3 still failed like before.
deepspeed==0.10.1, however, does not have fused_adam and does not work with DS_BUILD_FUSED_ADAM=1 pip install deepspeed.
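(As a way to narrow that down, here is a small sketch, not code from this thread, with the device assumption mine: constructing FusedAdam forces DeepSpeed to load or JIT-build the fused Adam extension, so a missing or broken fused_adam op should surface immediately.)
'''
# Sketch (not from this thread): constructing FusedAdam loads or JIT-builds the
# fused Adam extension, so a broken/missing fused_adam op should fail right here.
import torch
from deepspeed.ops.adam import FusedAdam

if torch.cuda.is_available():
    params = [torch.nn.Parameter(torch.zeros(8, device="cuda"))]
    opt = FusedAdam(params, lr=1e-3)  # triggers the fused_adam op build/load
    print("fused_adam available:", type(opt).__name__)
else:
    print("CUDA not available; the fused_adam op cannot be built on this machine")
'''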
Hi All, thank you for trying out various things. We are aware of this new problem with ZeRO Stage 3 and HF transformers > 4.31.0. Closing this issue as the downgrade resolves the problem; please reopen if you hit a new error. We also strongly encourage anyone still seeing errors despite downgrading to transformers 4.31.0 to open a new issue.
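For anyone scripting around this, a minimal version guard (a sketch, not part of DeepSpeed-Chat) reflecting the conclusion above -- ZeRO Stage-3 only ran for these setups with transformers at or below 4.31.0 -- could look like:
'''
# Sketch: fail fast if the installed transformers version is newer than the one
# this thread found to work with ZeRO Stage-3 (4.31.0).
from packaging import version
import transformers

if version.parse(transformers.__version__) > version.parse("4.31.0"):
    raise RuntimeError(
        f"transformers {transformers.__version__} installed; this thread reports "
        "ZeRO Stage-3 failures above 4.31.0 -- try: pip install transformers==4.31.0"
    )
'''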
I had re-installed transformers==4.31.0, but step 3 still failed like before.
I'm running Stable Diffusion training with … but there is no …
I, too, am seeing this issue even with …
Same issue.
Downgrade transformers to 4.31.0 to avoid unexpected warning deepspeedai/DeepSpeed#4229
Similarly, it runs normally on a single node with multiple GPUs, but the same error occurs with multiple nodes and multiple GPUs. Is there any solution?
Has this issue been solved? CUDA version: 12.4. Is there any solution?
I am trying to run DeepSpeed-Chat almost out of the box, albeit with some custom data and minor modifications to the code for logging, data loading, and model checkpointing. From the model's perspective, however, we are using DeepSpeed as-is. I am facing strange errors in a run that was previously working well.
The error is catalogued below (I am only pasting parts that I think are relevant):
Loading checkpoint shards: 50%|█████ | 1/2 [00:02<00:02, 2.30s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.37s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00, 1.51s/it]
You are resizing the embedding layer without providing a pad_to_multiple_of parameter. This means that the new embedding dimension will be 32008. This might induce some performance reduction as Tensor Cores will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
[2023-08-28 17:41:40,445] [INFO] [partition_parameters.py:332:exit] finished initializing model - num_params = 292, num_elems = 6.87B
/.local/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
File "/.local/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/.local/lib/python3.8/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 306, in fetch_sub_module
assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()
AssertionError: {'id': 292, 'status': 'NOT_AVAILABLE', 'numel': 0, 'ds_numel': 0, 'shape': (0,), 'ds_shape': (0, 4096), 'requires_grad': True, 'grad_shape': None, 'persist': True, 'active_sub_modules': {453}, 'ds_tensor.shape': torch.Size([0])}
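For context on that assertion: parameter id 292 is reported with numel 0 and ds_shape (0, 4096), i.e. an empty ZeRO Stage-3 shard. A small debugging sketch (not part of the original script; it assumes model is the ZeRO-3 partitioned HF model) that lists any such empty parameters after loading and resizing:
'''
# Debugging sketch (not from the original script): list parameters whose
# ZeRO Stage-3 shard is empty, which is what the assertion above reports for id 292.
def find_empty_zero3_params(model):
    for name, param in model.named_parameters():
        if hasattr(param, "ds_id") and getattr(param, "ds_numel", 0) == 0:
            # ds_summary() produces the same dict shown in the AssertionError
            print(name, param.ds_summary())
'''
The warning above also points at passing pad_to_multiple_of when the embedding is resized; whether that interacts with the assertion is not clear from the log alone.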
The DS parameters are as follows:
'''
ZERO_STAGE=3
deepspeed main.py \
   --sft_only_data_path local/jsonfile \
   --data_split 10,0,0 \
   --model_name_or_path meta-llama/Llama-2-7b-hf \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --gradient_checkpointing \
   --max_seq_len 512 \
   --learning_rate 4e-4 \
   --weight_decay 0. \
   --num_train_epochs 4 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 100 \
   --print_loss \
   --project 'deepspeed-eval' \
   --seed 1234 \
   --zero_stage $ZERO_STAGE \
   --deepspeed \
   --output_dir $OUTPUT \
   &> $OUTPUT/training.log
'''
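For readers unfamiliar with DeepSpeed-Chat: the script does not pass a JSON config; the --zero_stage flag is turned into a DeepSpeed config internally. A rough, illustrative sketch of the kind of ZeRO Stage-3 config this corresponds to (values are placeholders, not copied from the repo):
'''
# Illustrative ZeRO Stage-3 config sketch (values are placeholders):
ds_config = {
    "train_micro_batch_size_per_gpu": 8,   # matches --per_device_train_batch_size
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 3,                              # --zero_stage 3
        "overlap_comm": True,
        "contiguous_gradients": True,
        "stage3_param_persistence_threshold": 10000,
    },
}
'''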
The versions of the essential libraries are as follows:
CUDA driver version - 530.30.02
CUDA / nvcc version - 12.1
deepspeed version - 0.10.1
transformers version - 4.29.2
torch version - 2.1.0.dev20230828+cu121
python version - 3.8
datasets version - 2.12.0
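One way to capture the same report programmatically (illustrative; it simply prints whatever is installed):
'''
# Print the versions of the libraries listed above.
import sys
import datasets
import deepspeed
import torch
import transformers

print("python       ", sys.version.split()[0])
print("torch        ", torch.__version__, "| CUDA", torch.version.cuda)
print("transformers ", transformers.__version__)
print("deepspeed    ", deepspeed.__version__)
print("datasets     ", datasets.__version__)
'''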
Please let us know how to proceed with this strange error.