Cuda Memory Error #49

Open
Ashbajawed opened this issue Sep 24, 2024 · 3 comments

Comments

@Ashbajawed

I have been using 4 GPUs (on an ml.p4d.24xlarge / ml.p4de.24xlarge AWS instance), but I am still getting this error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 161.62 MiB is free. Including non-PyTorch memory, this process has 78.98 GiB memory in use. Of the allocated memory 75.25 GiB is allocated by PyTorch, and 2.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

torchrun \
    --nnode 1 \
    --nproc_per_node 4 \
    --node_rank 0 \
    --master_addr "localhost" \
    --master_port 12345 \
    speechgpt/src/train/ma_pretrain.py \
    --bf16 True \
    --block_size 1024 \
    --model_name_or_path "${METAROOT}" \
    --train_file ${DATAROOT}/train.txt \
    --validation_file ${DATAROOT}/dev.txt \
    --do_train \
    --do_eval \
    --output_dir "${OUTROOT}" \
    --preprocessing_num_workers 100 \
    --overwrite_output_dir \
    --per_device_eval_batch_size 2 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --num_train_epochs 3 \
    --log_level debug \
    --logging_steps 1 \
    --save_steps 300 \
    --cache_dir ${CACHEROOT} \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
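For reference, the max_split_size_mb setting that the traceback mentions is passed through the PYTORCH_CUDA_ALLOC_CONF environment variable before launching torchrun; a minimal sketch (the 128 MiB value here is only an illustrative starting point, not something verified for this model):

# Optional allocator tuning suggested by the OOM message; adjust or drop the value as needed.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
# Then rerun the torchrun command above unchanged.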

@anliyuan

Did you solve this? I'm facing the same problem.

@Ashbajawed
Author

> Did you solve this? I'm facing the same problem.

Unfortunately, no.

@LiuMY13

LiuMY13 commented Nov 13, 2024

Reduce --preprocessing_num_workers from 100 to 4 and --gradient_accumulation_steps from 16 to 1. With that I can just barely get it to run, but the training results are poor because the batch size ends up too small. I am using 4 A100 80G GPUs.
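Applied to the launch command from the original post, that reduction would look roughly like the sketch below. Only --preprocessing_num_workers and --gradient_accumulation_steps are changed; every other flag is kept as posted, and this is just a sketch of the suggestion above, not a configuration verified against this repo:

# Sketch only: same command as the original post, with the two flags above reduced.
torchrun --nnode 1 --nproc_per_node 4 --node_rank 0 --master_addr "localhost" --master_port 12345 \
    speechgpt/src/train/ma_pretrain.py \
    --bf16 True --block_size 1024 --model_name_or_path "${METAROOT}" \
    --train_file ${DATAROOT}/train.txt --validation_file ${DATAROOT}/dev.txt \
    --do_train --do_eval --output_dir "${OUTROOT}" --overwrite_output_dir \
    --preprocessing_num_workers 4 \
    --per_device_eval_batch_size 2 --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --num_train_epochs 3 --log_level debug --logging_steps 1 --save_steps 300 \
    --cache_dir ${CACHEROOT} \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'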
