This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

multi-node training out of memory during ViT training #818

Open
frankxyy opened this issue Dec 18, 2022 · 2 comments

Comments

@frankxyy (Contributor)

I use 2 nodes with 4 GPUs per node. The same training batch size works on a single node, but when I apply it to multi-node training, this error is printed:

[Two screenshots of the out-of-memory error were attached; the traceback text is not recoverable here.]

@zhisbug (Member) commented Dec 19, 2022

@frankxyy are you saying the exact same training setting works on 1x4 GPUs but not on 2x4 GPUs?

@frankxyy (Contributor, Author) commented Dec 19, 2022

@frankxyy are you saying the exact same training setting works on 1x4 GPUs but not on 2x4 GPUs?

@zhisbug yes

The same command, shown below:

NCCL_SOCKET_IFNAME=bond0 CUDA_VISIBLE_DEVICES=4,5,6,7  python -m pdb run_image_classification.py \
    --output_dir ./vit-base-patch16-imagenette \
    --train_dir="/home/xuyangyang/imagenette2/train" \
    --validation_dir="/home/xuyangyang/imagenette2/val" \
    --num_train_epochs 50 \
    --num_micro_batches 2 \
    --learning_rate 1e-3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 2 \
    --overwrite_output_dir \
    --preprocessing_num_workers 2 \
    --num_of_nodes 1

@merrymercy changed the title from "multi-node training out of memory" to "multi-node training out of memory during ViT training" on Jan 3, 2023