I have been using 4 GPUs (from ml.p4d.24xlarge / ml.p4de.24xlarge) on an AWS server, but I still get this error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 79.15 GiB of which 161.62 MiB is free. Including non-PyTorch memory, this process has 78.98 GiB memory in use. Of the allocated memory 75.25 GiB is allocated by PyTorch, and 2.84 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
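For reference, the allocator knob the error message points at is set through the PYTORCH_CUDA_ALLOC_CONF environment variable before launching. A minimal sketch, assuming the standard PyTorch allocator settings apply here; the 128 MiB value is only an illustrative starting point, not something from this report:

```bash
# Illustrative only: cap the allocator's split size to reduce fragmentation,
# as the error message suggests. 128 MiB is an arbitrary starting value to tune.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```

Note that this only mitigates fragmentation of the reserved-but-unallocated 2.84 GiB; it does not shrink the 75.25 GiB that PyTorch has actually allocated.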
This is the command I am running:
torchrun \
    --nnode 1 \
    --nproc_per_node 4 \
    --node_rank 0 \
    --master_addr "localhost" \
    --master_port 12345 \
    speechgpt/src/train/ma_pretrain.py \
    --bf16 True \
    --block_size 1024 \
    --model_name_or_path "${METAROOT}" \
    --train_file ${DATAROOT}/train.txt \
    --validation_file ${DATAROOT}/dev.txt \
    --do_train \
    --do_eval \
    --output_dir "${OUTROOT}" \
    --preprocessing_num_workers 100 \
    --overwrite_output_dir \
    --per_device_eval_batch_size 2 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --num_train_epochs 3 \
    --log_level debug \
    --logging_steps 1 \
    --save_steps 300 \
    --cache_dir ${CACHEROOT} \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
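A sketch of memory-oriented variations of the same launch, not a verified fix. It assumes ma_pretrain.py forwards these flags to transformers' TrainingArguments (which the existing --fsdp and --per_device_* flags suggest); the --gradient_checkpointing flag and the "offload" FSDP option come from transformers, not from the SpeechGPT script itself.

```bash
# Changes vs. the command above (effective batch size stays 1 * 32 * 4 = 128):
#   - micro-batch 2 -> 1, gradient accumulation 16 -> 32
#   - --gradient_checkpointing True trades recompute for activation memory
#   - "offload" added to --fsdp moves params/optimizer state to CPU
torchrun \
    --nnode 1 \
    --nproc_per_node 4 \
    --node_rank 0 \
    --master_addr "localhost" \
    --master_port 12345 \
    speechgpt/src/train/ma_pretrain.py \
    --bf16 True \
    --block_size 1024 \
    --model_name_or_path "${METAROOT}" \
    --train_file ${DATAROOT}/train.txt \
    --validation_file ${DATAROOT}/dev.txt \
    --do_train \
    --do_eval \
    --output_dir "${OUTROOT}" \
    --preprocessing_num_workers 100 \
    --overwrite_output_dir \
    --per_device_eval_batch_size 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --gradient_checkpointing True \
    --num_train_epochs 3 \
    --log_level debug \
    --logging_steps 1 \
    --save_steps 300 \
    --cache_dir ${CACHEROOT} \
    --fsdp "full_shard auto_wrap offload" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
```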