How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost #936

lovedoubledan · 2024-11-20T13:41:55Z

When I use default command, it seems to use 29500 as master_port.
However, the master_port seems unchangable,even when I use "--master_port 29501" or change it using "deepspeed.init_distributed(dist_backend='nccl', distributed_port=config.master_port)"

error message:
[W1120 21:36:50.764587163 TCPStore.cpp:358] [c10d] TCP client failed to connect/validate to host 127.0.0.1:29500 - retrying (try=3, timeout=1800000ms, delay=1496ms): Connection reset by peer
Exception raised from recvBytes at ../torch/csrc/distributed/c10d/Utils.hpp:667 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc06bba0446 in /data/wujiahao/anaconda3/envs/gpt/lib/python3.10/site-packages/torch/lib/libc10.so)
...

lovedoubledan · 2024-11-20T14:00:11Z

I find master_port is specified in constants.py as 29500 and can not be changed by any surface.
I hope this bug can be fixed.

tjruwase · 2024-11-20T19:32:05Z

@lovedoubledan, can you share your full command-line to show the example code?

lovedoubledan · 2024-11-21T11:45:50Z

@lovedoubledan, can you share your full command-line to show the example code?
my command line is :
deepspeed --include localhost:4,7 train_stage2.py
--config_file config/gptir3_notokenloss_plus.yaml
--deepspeed --deepspeed_config config/deepspeed_config/gptir.json --master_port 20815

and my code is like:
parser = argparse.ArgumentParser()
# Input Parameters
parser.add_argument('--config_file', type=str, default="config/gptir3_notokenloss_plus.yaml")
parser.add_argument("--local_rank",
type=int,
default=-1,
help="local_rank for distributed training on gpus")
parser.add_argument("--master_port",
type=int,
default=20815)
parser = deepspeed.add_config_arguments(parser)
# parser.add_argument('--deepspeed_config', type=str, default="config/deepspeed_config/gptir.json")
config = parser.parse_args()
...
model_engine, optimizer, _, _ = deepspeed.initialize(args=config,
model=net,
model_parameters=net.configure_parameters(),
distributed_port=config.master_port)

loadams · 2024-11-25T18:29:15Z

Hi @lovedoubledan - can you share a repro code snippet with us? Also do you see any warnings printed about the port? And could you try setting the master port in the ds_config as well to see if that works?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost #936

How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost #936

lovedoubledan commented Nov 20, 2024

lovedoubledan commented Nov 20, 2024

tjruwase commented Nov 20, 2024

lovedoubledan commented Nov 21, 2024

loadams commented Nov 25, 2024

How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost #936

How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost #936

Comments

lovedoubledan commented Nov 20, 2024

lovedoubledan commented Nov 20, 2024

tjruwase commented Nov 20, 2024

lovedoubledan commented Nov 21, 2024

loadams commented Nov 25, 2024