This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

multi-node training out of memory during ViT training #818

Open
frankxyy opened this issue Dec 18, 2022 · 2 comments

Comments

@frankxyy (Contributor)

I use 2 nodes with 4 GPUs per node. The same training batch size works on a single node, but when I apply it to multi-node training, this error is printed:

[Two screenshots of the out-of-memory error were attached; the traceback text is not recoverable here.]

@zhisbug (Member) commented Dec 19, 2022

@frankxyy are you saying the exact same training setting works on 1x4 GPUs but not on 2x4 GPUs?

@frankxyy (Contributor, Author) commented Dec 19, 2022

@frankxyy are you saying the exact same training setting works on 1x4 GPUs but not on 2x4 GPUs?

@zhisbug yes

The same command, shown below:

NCCL_SOCKET_IFNAME=bond0 CUDA_VISIBLE_DEVICES=4,5,6,7  python -m pdb run_image_classification.py \
    --output_dir ./vit-base-patch16-imagenette \
    --train_dir="/home/xuyangyang/imagenette2/train" \
    --validation_dir="/home/xuyangyang/imagenette2/val" \
    --num_train_epochs 50 \
    --num_micro_batches 2 \
    --learning_rate 1e-3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 2 \
    --overwrite_output_dir \
    --preprocessing_num_workers 2 \
    --num_of_nodes 1

@merrymercy changed the title from "multi-node training out of memory" to "multi-node training out of memory during ViT training" on Jan 3, 2023