
How can I change the master_port when using deepspeed for multi-GPU on a single node (i.e., localhost)? #936

Open
lovedoubledan opened this issue Nov 20, 2024 · 4 comments

Comments

@lovedoubledan

When I use the default command, it uses 29500 as the master_port.
However, the master_port seems unchangeable, even when I pass "--master_port 29501" or set it with "deepspeed.init_distributed(dist_backend='nccl', distributed_port=config.master_port)".

error message:

```
[W1120 21:36:50.764587163 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 127.0.0.1:29500 - retrying (try=3, timeout=1800000ms, delay=1496ms): Connection reset by peer
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:667 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc06bba0446 in /data/wujiahao/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/lib/libc10.so)
...
```

@lovedoubledan
Author

I found that master_port is hard-coded in constants.py as 29500 and cannot be changed through any interface.
I hope this bug can be fixed.

@tjruwase
Contributor

@lovedoubledan, can you share your full command line and the example code?

@lovedoubledan
Author

> @lovedoubledan, can you share your full command line and the example code?
my command line is:

```bash
deepspeed --include localhost:4,7 train_stage2.py \
    --config_file config/gptir3_notokenloss_plus.yaml \
    --deepspeed --deepspeed_config config/deepspeed_config/gptir.json --master_port 20815
```

and my code is like:

```python
import argparse

import deepspeed

parser = argparse.ArgumentParser()
# Input parameters
parser.add_argument('--config_file', type=str, default="config/gptir3_notokenloss_plus.yaml")
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local_rank for distributed training on gpus")
parser.add_argument("--master_port", type=int, default=20815)
parser = deepspeed.add_config_arguments(parser)
# parser.add_argument('--deepspeed_config', type=str, default="config/deepspeed_config/gptir.json")
config = parser.parse_args()
...
model_engine, optimizer, _, _ = deepspeed.initialize(args=config,
                                                     model=net,
                                                     model_parameters=net.configure_parameters(),
                                                     distributed_port=config.master_port)
```

@loadams
Contributor

loadams commented Nov 25, 2024

Hi @lovedoubledan - can you share a repro code snippet with us? Also, do you see any warnings printed about the port? And could you try setting the master port in the ds_config as well, to see if that works?
