Fix SSH port by default in multinode container #214

Merged on Jul 8, 2024 (16 commits)
Changes from 9 commits
23 changes: 4 additions & 19 deletions pytorch/Dockerfile
@@ -108,7 +108,7 @@ RUN mkdir -p /var/run/sshd && \

ARG PYTHON_VERSION

COPY generate_ssh_keys.sh .
COPY multinode/generate_ssh_keys.sh /generate_ssh_keys.sh

# modify generate_ssh_keys to be a helper script
# print how to use helper script on bash startup
@@ -117,24 +117,9 @@ RUN echo "source /usr/local/lib/python${PYTHON_VERSION}/dist-packages/oneccl_bin
cat '/generate_ssh_keys.sh' >> ~/.startup && \
rm -rf /generate_ssh_keys.sh

# hadolint global ignore=SC3037
RUN echo -e "#!/bin/bash \n\
set -e \n\
set -a \n\
source ~/.startup \n\
set +a \n\
eval \"\$@\"" >> /usr/local/bin/dockerd-entrypoint.sh && \
chmod +x /usr/local/bin/dockerd-entrypoint.sh

RUN echo 'HostKey /etc/ssh/ssh_host_dsa_key' > /var/run/sshd_config && \
echo 'HostKey /etc/ssh/ssh_host_rsa_key' > /var/run/sshd_config && \
echo 'HostKey /etc/ssh/ssh_host_ecdsa_key' > /var/run/sshd_config && \
echo 'HostKey /etc/ssh/ssh_host_ed25519_key' > /var/run/sshd_config && \
echo 'AuthorizedKeysFile /etc/ssh/authorized_keys' > /var/run/sshd_config && \
echo '## Enable DEBUG log. You can ignore this but this may help you debug any issue while enabling SSHD for the first time' > /var/run/sshd_config && \
echo 'LogLevel DEBUG3' > /var/run/sshd_config && \
echo 'UsePAM yes' > /var/run/sshd_config && \
echo 'Subsystem sftp /usr/lib/openssh/sftp-server' > /var/run/sshd_config
COPY multinode/dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh
COPY multinode/sshd_config /etc/ssh/sshd_config
COPY multinode/ssh_config /etc/ssh/ssh_config

RUN mkdir -p /licensing
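
A note on the `echo ... > /var/run/sshd_config` chain in this hunk, as a minimal sketch rather than code from the repository: in a POSIX shell, `>` truncates the target each time it runs, so a chain of `>` redirections leaves only the last line in the file, while `>>` appends. The helper names and temporary paths below are hypothetical.

```bash
#!/bin/sh
# Hypothetical helpers (not from the repository) contrasting '>' and '>>'.

# Writes two lines with '>' each time: the second write truncates the first.
write_with_truncate() {
  echo 'HostKey /etc/ssh/ssh_host_rsa_key' > "$1"
  echo 'UsePAM yes' > "$1"
}

# Writes the first line with '>' and appends the second with '>>'.
write_with_append() {
  echo 'HostKey /etc/ssh/ssh_host_rsa_key' > "$1"
  echo 'UsePAM yes' >> "$1"
}

demo_dir="$(mktemp -d)"
write_with_truncate "$demo_dir/truncated"
write_with_append "$demo_dir/appended"

# Report how many lines each file ended up with.
echo "truncate: $(wc -l < "$demo_dir/truncated" | tr -d ' ') line(s)"
echo "append:   $(wc -l < "$demo_dir/appended" | tr -d ' ') line(s)"
rm -rf "$demo_dir"
```

Copying a prebuilt `sshd_config` into the image, as the `COPY` lines in this hunk do, sidesteps the redirection question altogether.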

100 changes: 67 additions & 33 deletions pytorch/README.md
@@ -114,12 +114,8 @@ The images below additionally include [Intel® oneAPI Collective Communications
| `2.0.0-pip-multinode` | [v2.0.0] | [v2.0.0+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.1.1] | [v0.1.0] |

> **Note:** Passwordless SSH connection is also enabled in the image.
> The container does not contain the SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/root/.ssh/id_rsa.pub`.
> Users also need to append the content of `id_rsa.pub` to `/etc/ssh/authorized_keys` in the SSH server container.
> Because the SSH key is not owned by the default user account in Docker, also run `chmod 644 id_rsa.pub; chmod 644 id_rsa` to grant read access to the default user account.
> Users can also run `/usr/bin/ssh-keygen -t rsa -b 4096 -N '' -f ~/mnt/ssh_key/id_rsa` to generate a new SSH key inside the container.
> Users need to mount a config file that lists all hostnames at `/root/.ssh/config` on the SSH client container.
> Once all files are added
> The container does not contain the SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/etc/ssh/authorized_keys`.
> Because the SSH keys are not owned by the default user account in Docker, also run `chmod 600 authorized_keys; chmod 600 id_rsa` to restrict read and write access to the file owner.
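
As a minimal sketch of the permission requirement in the note above, the check below tests whether a file's mode is exactly `600`. `has_mode_600` is a hypothetical helper, not part of the image, and the temporary file only stands in for `id_rsa` or `authorized_keys`.

```bash
#!/bin/sh
# Hypothetical helper: succeeds only when the file's permission bits are exactly 600.
# 'find -perm 600' (no leading '-' or '/') matches an exact mode.
has_mode_600() {
  [ -n "$(find "$1" -prune -perm 600 -print 2>/dev/null)" ]
}

key_file="$(mktemp)"   # stand-in for a mounted key file

chmod 644 "$key_file"
if has_mode_600 "$key_file"; then echo "after chmod 644: mode is 600"; else echo "after chmod 644: mode is not 600"; fi

chmod 600 "$key_file"
if has_mode_600 "$key_file"; then echo "after chmod 600: mode is 600"; else echo "after chmod 600: mode is not 600"; fi

rm -f "$key_file"
```

Running a check like this on the host before `docker run` can catch a key that OpenSSH would otherwise reject for having permissions that are too open.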

#### Setup and Run IPEX Multi-Node Container

@@ -131,8 +127,7 @@ SSH Server (Worker)

SSH Client (Launcher)

1. *Config File with Host IPs* : `/root/.ssh/config`
2. *Private User Key* : `/root/.ssh/id_rsa`
1. *Private User Key* : `/root/.ssh/id_rsa`

To add these files correctly please follow the steps described below.

@@ -146,47 +141,33 @@ To add these files correctly please follow the steps described below.
cat id_rsa.pub >> authorized_keys
```

2. Add hosts to config

The launcher container needs a config file with all hostnames and ports specified. An example config file is provided below.
2. Configure the permissions and ownership for all of the files you have created so far.

```bash
touch config
chmod 600 id_rsa config authorized_keys
chown root:root id_rsa.pub id_rsa config authorized_keys
```

3. Set up the hostfile. The hostfile is needed to run torch distributed with the `ipexrun` utility. If you are not using `ipexrun`, you can skip this step.

```txt
Host host1
HostName <Hostname of host1>
IdentitiesOnly yes
Port <SSH Port>
Host host2
HostName <Hostname of host2>
IdentitiesOnly yes
Port <SSH Port>
<Host 1 IP/Hostname>
<Host 2 IP/Hostname>
...
```

3. Configure the permissions and ownership for all of the files you have created so far.

```bash
chmod 600 id_rsa.pub id_rsa config authorized_keys
chown root:root id_rsa.pub id_rsa config authorized_keys
```

4. Now start the workers and execute DDP on the launcher.

1. Worker run command:

```bash
export SSH_PORT=<SSH Port>
docker run -it --rm \
--net=host \
-v $PWD/authorized_keys:/root/.ssh/authorized_keys \
-v $PWD/authorized_keys:/etc/ssh/authorized_keys \
-v $PWD/tests:/workspace/tests \
-w /workspace \
-e SSH_PORT=${SSH_PORT} \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c '/usr/sbin/sshd -D -p ${SSH_PORT} -f /var/run/sshd_config'
bash -c '/usr/sbin/sshd -D'
```

2. Launcher run command:
@@ -195,12 +176,65 @@ To add these files correctly please follow the steps described below.
docker run -it --rm \
--net=host \
-v $PWD/id_rsa:/root/.ssh/id_rsa \
-v $PWD/config:/root/.ssh/config \
-v $PWD/tests:/workspace/tests \
-v $PWD/hostfile:/workspace/hostfile \
-w /workspace \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c 'ipexrun cpu --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port 3022 /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
```

5. Start the SSH server with a custom port.
To start the SSH server on a port of your choice, use the commands described below.

1. Worker command:

```bash
export SSH_PORT=<User SSH Port>
docker run -it --rm \
--net=host \
-v $PWD/authorized_keys:/etc/ssh/authorized_keys \
-v $PWD/tests:/workspace/tests \
-e SSH_PORT=${SSH_PORT} \
-w /workspace \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c '/usr/sbin/sshd -D -p ${SSH_PORT}'
```

2. Add hosts to config. (**Note:** This is an optional step)

Users can optionally mount a custom client config file that lists the hosts and ports where the SSH server is running inside the container. An example config file is provided below. Mount this file in the launcher container at `/etc/ssh/ssh_config`.

```bash
touch config
```

```txt
Host host1
HostName <Hostname of host1>
IdentitiesOnly yes
IdentityFile ~/.ssh/id_rsa
Port <SSH Port>
Host host2
HostName <Hostname of host2>
IdentitiesOnly yes
IdentityFile ~/.ssh/id_rsa
Port <SSH Port>
...
```

3. Launcher run command:

```bash
docker run -it --rm \
--net=host \
-v $PWD/id_rsa:/root/.ssh/id_rsa \
-v $PWD/config:/etc/ssh/ssh_config \
-v $PWD/hostfile:/workspace/hostfile \
-v $PWD/tests:/workspace/tests \
-e SSH_PORT=${SSH_PORT} \
-w /workspace \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c 'ipexrun cpu /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
bash -c 'ipexrun cpu --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port ${SSH_PORT} /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
```
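
The optional client config described in step 5 could be generated rather than written by hand. A minimal sketch, assuming the launcher keeps the private key at `/root/.ssh/id_rsa` as in the `docker run` commands above; `render_ssh_config`, the `hostN` aliases, and the example hostnames are hypothetical.

```bash
#!/bin/sh
# Hypothetical helper: prints one "Host" block per hostname.
# Usage: render_ssh_config <port> <hostname>...
# The body runs in a subshell so its variables do not leak into the caller.
render_ssh_config() (
  port="$1"
  shift
  index=1
  for hostname in "$@"; do
    printf 'Host host%d\n' "$index"
    printf '    HostName %s\n' "$hostname"
    printf '    IdentitiesOnly yes\n'
    printf '    IdentityFile /root/.ssh/id_rsa\n'
    printf '    Port %s\n' "$port"
    index=$((index + 1))
  done
)

# Example with placeholder hostnames and an arbitrary port.
render_ssh_config 2222 worker-a.example worker-b.example
```

Redirecting the output to a file named `config` would produce something that can be mounted at `/etc/ssh/ssh_config`, as in the launcher command.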

> [!NOTE]
21 changes: 21 additions & 0 deletions pytorch/multinode/dockerd-entrypoint.sh
@@ -0,0 +1,21 @@
#!/bin/bash
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

set -e
set -a
# shellcheck disable=SC1091
source "$HOME/.startup"
set +a
"$@"
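
The `set -a` / `set +a` pair around the `source` line is what makes variables defined in `~/.startup` visible to the command started by `"$@"`. A minimal sketch of that behaviour, assuming a throwaway startup file; `run_with_startup`, `run_without_export`, and `DEMO_SETTING` are hypothetical names, and `.` is used as the POSIX spelling of `source`.

```bash
#!/bin/sh
# Same pattern as the entrypoint: auto-export everything the file defines,
# then run the remaining arguments as a command.
# Usage: run_with_startup <startup-file> <command> [args...]
run_with_startup() (
  startup_file="$1"
  shift
  set -a
  . "$startup_file"
  set +a
  "$@"
)

# Contrast: sourcing without 'set -a' leaves plain assignments unexported.
run_without_export() (
  startup_file="$1"
  shift
  . "$startup_file"
  "$@"
)

demo_startup="$(mktemp)"   # hypothetical stand-in for ~/.startup
echo 'DEMO_SETTING=from-startup' > "$demo_startup"

# Each child shell reports whether it inherited DEMO_SETTING.
run_with_startup "$demo_startup" sh -c 'echo "with set -a:    ${DEMO_SETTING:-unset}"'
run_without_export "$demo_startup" sh -c 'echo "without set -a: ${DEMO_SETTING:-unset}"'
rm -f "$demo_startup"
```

Only the first call passes the variable to the child process, which is why the entrypoint can rely on plain assignments in the startup file.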
File renamed without changes.
3 changes: 3 additions & 0 deletions pytorch/multinode/ssh_config
@@ -0,0 +1,3 @@
Host *
Port 3022
IdentityFile ~/.ssh/id_rsa
10 changes: 10 additions & 0 deletions pytorch/multinode/sshd_config
@@ -0,0 +1,10 @@
HostKey /etc/ssh/ssh_host_dsa_key
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_ecdsa_key
HostKey /etc/ssh/ssh_host_ed25519_key
AuthorizedKeysFile /etc/ssh/authorized_keys
## Enable DEBUG log. You can ignore this but this may help you debug any issue while enabling SSHD for the first time
LogLevel DEBUG3
Port 3022
UsePAM yes
Subsystem sftp /usr/lib/openssh/sftp-server
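
To confirm which port a given `sshd_config` declares before starting the workers, a small check like the following may help. This is a sketch under assumptions: `sshd_port_from_config` is a hypothetical helper, it only looks at the first `Port` directive, and it falls back to 22, the OpenSSH default, when none is present.

```bash
#!/bin/sh
# Hypothetical helper: prints the value of the first Port directive in an
# sshd_config-style file, or 22 when the file has no Port line.
# Keywords are matched case-insensitively; commented lines such as "#Port 22"
# are skipped because their first field is not exactly "port".
sshd_port_from_config() {
  awk 'tolower($1) == "port" { print $2; found = 1; exit }
       END { if (!found) print 22 }' "$1"
}

demo_config="$(mktemp)"
# A few directives copied from the sshd_config above.
printf '%s\n' \
  'AuthorizedKeysFile /etc/ssh/authorized_keys' \
  'LogLevel DEBUG3' \
  'Port 3022' \
  'UsePAM yes' > "$demo_config"

echo "sshd will listen on port $(sshd_port_from_config "$demo_config")"
rm -f "$demo_config"
```

Passing `-p` on the `sshd` command line, as the custom-port worker command does, overrides whatever the file declares, so the check only reflects the default behaviour.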