
Issue about: torch.distributed.init_process_group #13

Open
Zhouziyi828 opened this issue Aug 4, 2023 · 3 comments

Comments

@Zhouziyi828

When I tried to use DistributedDataParallel to run the project, I set up the code like this:

    # pick a free port and publish the rendezvous address/port for env:// initialization
    free_p = find_free_port()
    os.environ['MASTER_ADDR'] = '172.17.0.6'
    os.environ['MASTER_PORT'] = str(free_p)
    num_gpus = torch.cuda.device_count()
    try:
        torch.distributed.init_process_group(backend='nccl', init_method='env://',
                                             rank=0, world_size=num_gpus)
    except Exception as e:
        print("Failed to initialize process group:", str(e))

But the program hangs, with no output and no further progress. Do you know the reason for this?
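
For reference, here is a minimal sketch of how this kind of launch is usually structured, assuming the hang is because `init_process_group` with `init_method='env://'` blocks until all `world_size` ranks have joined the rendezvous, while only a single process (rank 0) is ever started. The address, port, and worker body below are placeholders, not the project's actual code:

    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def worker(rank, world_size):
        # each spawned process joins the same rendezvous with its own rank
        dist.init_process_group(backend='nccl', init_method='env://',
                                rank=rank, world_size=world_size)
        torch.cuda.set_device(rank)
        # ... build the model, wrap it in DistributedDataParallel, train ...
        dist.destroy_process_group()

    if __name__ == '__main__':
        world_size = torch.cuda.device_count()
        os.environ['MASTER_ADDR'] = '127.0.0.1'  # placeholder, set to your master host
        os.environ['MASTER_PORT'] = '29500'      # placeholder, any free port works
        # launch one process per GPU; starting only rank 0 would block in init_process_group
        mp.spawn(worker, args=(world_size,), nprocs=world_size)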

@Chenrj233
Owner

Sorry for the late reply. Actually, we haven't tried DistributedDataParallel training before. The project is implemented with reference to transfer-learning-conv-ai. We'll work on fixing this error as soon as we can.

@Zhouziyi828
Author

So may I ask how many GPUs you used for training? How much memory does each one have? What is the batch_size? How long did you train?
I used two A40s with 48 GB of video memory each and a batch_size of 32. The training time was very long, averaging 50 minutes per epoch.

@Chenrj233
Owner

Our experiment was carried out on a single 24 GB RTX 3090, with a batch size of 2 and about 5 hours per epoch.
