
Issue about: torch.distributed.init_process_group #13

Open
Zhouziyi828 opened this issue Aug 4, 2023 · 3 comments

Comments

@Zhouziyi828

When I tried to use DistributedDataParallel to run the project, I set up the code like this:

    # pick a free port and publish the rendezvous address/port for env:// initialization
    free_p = find_free_port()
    os.environ['MASTER_ADDR'] = '172.17.0.6'
    os.environ['MASTER_PORT'] = str(free_p)
    num_gpus = torch.cuda.device_count()
    try:
        torch.distributed.init_process_group(backend='nccl', init_method='env://',
                                             rank=0, world_size=num_gpus)
    except Exception as e:
        print("Failed to initialize process group:", str(e))

But the program hangs, with no output and no further progress. Do you know the reason for this?
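
For reference, here is a minimal sketch of how this kind of launch is usually structured, assuming the hang is because `init_process_group` with `init_method='env://'` blocks until all `world_size` ranks have joined the rendezvous, while only a single process (rank 0) is ever started. The address, port, and worker body below are placeholders, not the project's actual code:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # each spawned process joins the same rendezvous with its own rank
        dist.init_process_group(backend='nccl', init_method='env://',
                                rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        # ... build the model, wrap it in DistributedDataParallel, train ...
        dist.destroy_process_group()

    if __name__ == '__main__':
        world_size = torch.cuda.device_count()
        os.environ['MASTER_ADDR'] = '127.0.0.1'  # placeholder, set to your master host
        os.environ['MASTER_PORT'] = '29500'      # placeholder, any free port works
        # launch one process per GPU; starting only rank 0 would block in init_process_group
        mp.spawn(worker, args=(world_size,), nprocs=world_size)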

@Chenrj233
Owner

Sorry for the late reply. Actually, we haven't tried DistributedDataParallel training before. The project is implemented with reference to transfer-learning-conv-ai. We'll work on fixing this error as soon as we can.

@Zhouziyi828
Author

So may I ask how many GPUs you used for training? How much memory does each one have? What is the batch_size? How long did you train?
I used two A40s with 48 GB of video memory each and a batch_size of 32. The training time was very long, averaging 50 minutes per epoch.

@Chenrj233
Owner

Our experiment was carried out on a single 24 GB RTX 3090, with a batch size of 2 and about 5 hours per epoch.
