added port argument for ssh #4117

Merged: 18 commits, Aug 29, 2023

Hiromasa-H
Contributor

This is the PR for issue #4116.

@@ -564,6 +566,7 @@ def main(args=None):
     else:
         cmd = runner.get_cmd(env, active_resources)
 
+    env["PDSH_SSH_ARGS_APPEND"] += f" -p {args.port} "
Collaborator

I agree with this suggestion from @loadams. The DeepSpeed launcher supports different multi-node backends; pdsh is the default and most common, but it would be good to move the PDSH_SSH_ARGS_APPEND variable into the PDSHRunner class inside multinode_runner.py.
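
A minimal sketch of what that move could look like, assuming a simplified stand-in for the runner. The class name `PDSHRunnerSketch`, its constructor, and the `prepare_env` method are hypothetical and are not DeepSpeed's actual API; only the `PDSH_SSH_ARGS_APPEND` variable comes from the thread. The point is that the pdsh-specific variable is touched only inside the pdsh runner, so other backends never see it.

```python
class PDSHRunnerSketch:
    """Hypothetical stand-in for a pdsh-based runner (not DeepSpeed's real class)."""

    def __init__(self, ssh_port=None):
        # None means "the user did not ask for a specific port".
        self.ssh_port = ssh_port

    def prepare_env(self, environment):
        # Work on a copy so the caller's environment is left untouched.
        env = dict(environment)
        env.setdefault("PDSH_SSH_ARGS_APPEND", "")
        if self.ssh_port is not None:
            # pdsh forwards these extra arguments to each ssh invocation.
            env["PDSH_SSH_ARGS_APPEND"] += f" -p {self.ssh_port}"
        return env
```

With this shape, the launcher's `main` would only pass the parsed port to the runner and would not need to know which environment variable pdsh reads.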

Contributor Author

Thank you for your feedback. I've moved this part of the code into the PDSHRunner class as specified, and made the necessary changes.

@loadams loadams requested a review from mrwyattii August 9, 2023 16:48
@@ -432,7 +434,7 @@ def main(args=None):
     if multi_node_exec and not args.no_ssh_check:
         first_host = list(active_resources.keys())[0]
         try:
-            subprocess.check_call(f'ssh -o PasswordAuthentication=no {first_host} hostname',
+            subprocess.check_call(f'ssh -o PasswordAuthentication=no -p {args.port} {first_host} hostname',
Collaborator

I think we only want to pass -p {args.port} iff --ssh_port is set. Many users set up an ssh config that specifies the port and other connection parameters per host. If we always pass -p 22 as the default, it will override the config and break any host entry whose port is not 22.
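
The behaviour requested here can be sketched as follows. This is an illustration only: the helper name `build_ssh_check_cmd` is hypothetical, and the command string mirrors the one in the diff above rather than the code that was finally merged.

```python
def build_ssh_check_cmd(first_host, ssh_port=None):
    """Return the passwordless-ssh check command as a shell string."""
    ssh_check_cmd = "ssh -o PasswordAuthentication=no "
    if ssh_port is not None:
        # Only override the port when the user asked for one explicitly;
        # otherwise ssh uses the user's ssh config or its built-in default.
        ssh_check_cmd += f"-p {ssh_port} "
    ssh_check_cmd += f"{first_host} hostname"
    return ssh_check_cmd
```

When `ssh_port` is left unset, the command contains no `-p` flag at all, so a per-host `Port` entry in the user's ssh config still applies.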

Contributor Author

Thank you for your feedback. I've added an if statement so that -p is only added when args.ssh_port is specified.

Hiromasa-H
Contributor Author

@microsoft-github-policy-service agree

@loadams loadams linked an issue Aug 11, 2023 that may be closed by this pull request
Collaborator

loadams commented Aug 16, 2023

Hi @Hiromasa-H - FYI, you'll have to update test_pdsh_runner as well since it needs the changes from here.

@Hiromasa-H Hiromasa-H requested a review from tjruwase as a code owner August 23, 2023 15:55
@@ -193,6 +193,8 @@ def parse_args(args=None):
                         "numbers and range. i.e. 1,3-5,7 => [1,3,4,5,7]. When not "
                         "specified, all cores on system would be used rank binding")
 
+    parser.add_argument("--ssh_port", type=int, default=22, help="SSH port to use for remote connections")
Collaborator

I think the later `if args.ssh_port is not None` checks will always pass, since the default here is 22, so -p would still be added even when the user never sets --ssh_port. Can we set the default to None?
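
A small sketch of the difference the default makes. The helper names `make_parser` and `port_flag_is_added` are hypothetical and exist only for this illustration; the `--ssh_port` argument definition is taken from the diff above.

```python
import argparse


def make_parser(default_port):
    """Build a parser with the --ssh_port argument and a configurable default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--ssh_port", type=int, default=default_port,
                        help="SSH port to use for remote connections")
    return parser


def port_flag_is_added(args):
    # Mirrors the later guard: -p is appended only when a port is present.
    return args.ssh_port is not None


# With default=22 the guard passes even though the user passed nothing.
always_added = port_flag_is_added(make_parser(22).parse_args([]))

# With default=None the guard passes only when --ssh_port is given.
added_when_unset = port_flag_is_added(make_parser(None).parse_args([]))
added_when_set = port_flag_is_added(make_parser(None).parse_args(["--ssh_port", "2222"]))
```

Using `None` as the default lets the code distinguish "not specified" from "explicitly set to 22", which is what the guard relies on.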

@Hiromasa-H Hiromasa-H requested review from jeffra and loadams August 27, 2023 07:25
Collaborator

loadams commented Aug 28, 2023

Hi @Hiromasa-H - apologies for the slow review, wanted to test one thing on our side first, but will review this today. Thanks for your patience.

Contributor

mrwyattii left a comment

Let's avoid copying code with subprocess.check_call (see my suggestion).

Otherwise, LGTM
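
The suggestion referred to above is not shown in this thread. One way to avoid duplicating the call, sketched under that caveat, is to build the argument list first and invoke `subprocess.check_call` from a single place. The function name `check_passwordless_ssh` and its `run` parameter are hypothetical; `run` exists only so the sketch can be exercised without a real ssh connection.

```python
import subprocess


def check_passwordless_ssh(first_host, ssh_port=None, run=subprocess.check_call):
    """Run the ssh connectivity check with exactly one call site."""
    safe_ssh_cmd = ["ssh", "-o", "PasswordAuthentication=no"]
    if ssh_port is not None:
        # The optional port changes the arguments, not the call itself.
        safe_ssh_cmd += ["-p", str(ssh_port)]
    safe_ssh_cmd += [first_host, "hostname"]
    # Single invocation regardless of whether a port was supplied.
    run(safe_ssh_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return safe_ssh_cmd
```

Because the branch only edits the argument list, any later change to how the check is executed needs to be made once.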

Contributor Author

Hiromasa-H commented Aug 29, 2023

@loadams Sorry, I did not intend to rush you. Thank you for taking your time with a thorough review.

@mrwyattii mrwyattii enabled auto-merge August 29, 2023 16:43
@mrwyattii mrwyattii added this pull request to the merge queue Aug 29, 2023
Merged via the queue into deepspeedai:master with commit 8145b5e Aug 29, 2023
Development

Successfully merging this pull request may close these issues.

[REQUEST] add ssh port argument for multi-node training/inference
4 participants