This adds logic to pass through the file descriptors when using openmpi #95

Merged
2 commits merged into main on Nov 9, 2023

Conversation

@scanon (Member) commented on Nov 1, 2023

PMI relies on passing through open file descriptors. Podman supports this, but there are some extra steps needed to make it work.

PMI2 needs to pass through an open file descriptor. There is a way to do this with podman, but the file descriptors need to be consecutive. This fix dups the file descriptor to fd 3 (the first one after the defaults of stdin, stdout, and stderr). It then sets PMI_FD to point to the duped fd (3).
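
A minimal sketch of the approach, assuming a Python wrapper like podman_hpc.py (the helper name and argument handling below are illustrative, not the exact code merged here):

import os

def add_pmi_passthrough(podman_run_args):
    """Illustrative helper: expose the PMI2 socket inside the container.

    podman's --preserve-fds N passes fds 3 through N+2 into the container,
    so the PMI descriptor must sit at fd 3, the first slot after
    stdin/stdout/stderr.
    """
    pmi_fd = os.environ.get("PMI_FD")
    if pmi_fd is None:
        return podman_run_args             # not launched under PMI2; nothing to do
    os.dup2(int(pmi_fd), 3)                # dup2 leaves the new fd inheritable
    os.environ["PMI_FD"] = "3"             # point the in-container PMI client at fd 3
    return podman_run_args + ["--preserve-fds", "1"]

With that in place, podman run receives --preserve-fds 1, hands fd 3 to the container process, and the PMI2 client inside the container connects through it.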
@lastephey (Collaborator) left a comment


Thanks, Shane. The overall setup seems ok. It's good to have something working in the short term even if it's a little hacky. I'll copy it over to muller to test.

I wonder if we should eventually fold this into some kind of --pmi module, which could set shared-run and the PMI_FD variable internally (a rough sketch of the idea is below). Users might then stack it with an --openmpi module, although I admit I haven't thought this all the way through yet.

For now, if we leave it as an environment variable, it would be good to document it in the README.
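
Such a module might bundle something like the following. The dictionary schema here is invented purely for illustration and is not podman-hpc's actual module configuration format; the ENABLE_OPENMPI_PMI2 name is taken from the test output later in this thread:

# Hypothetical "--pmi" module definition (illustrative schema only).
PMI_MODULE = {
    "cli_arg": "pmi",                      # would expose a --pmi flag on podman-hpc run
    "env": {"ENABLE_OPENMPI_PMI2": "1"},   # turn on the fd pass-through from this PR
    "shared_run": True,                    # imply shared-run, as suggested above
}
# The PMI_FD dup and rewrite would still happen in podman_hpc.py at launch time.

Users could then stack --pmi with --openmpi instead of exporting the variable by hand.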

@lastephey (Collaborator) commented
Tested on muller with an openmpi helper module. I'll open a separate MR for the helper module.

stephey@nid001005:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi -v $(pwd):/work -w /work registry.nersc.gov/library/nersc/mpi4py:3.1.3-openmpi python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001005.
Hello, World! I am process 1 of 2 on nid001005.

@lastephey merged commit 3d66960 into main on Nov 9, 2023
@lastephey (Collaborator) commented on Nov 10, 2023

Edit: found out this was wrong, see the next comment.

Update: this seems ok on 1 node but fails on 2 nodes. In my test it's because the file descriptor number differs between the two ranks, as shown below.

stephey@nid001003:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 -v $(pwd):/work -w /work registry.nersc.gov/library/nersc/mpi4py:3.1.3-openmpi ./print.sh
PMI_FD:
3
PMI_SIZE=2
PMI_FD=3
ENABLE_OPENMPI_PMI2=1
PMI_SHARED_SECRET=12576424787332504083
PMI_RANK=0
PMI_JOBID=488177.2
contents of /proc/self/fd
0
1
2
255
3
PMI_FD:
12
PMI_SIZE=2
PMI_FD=12
ENABLE_OPENMPI_PMI2=1
PMI_SHARED_SECRET=12576424787332504083
PMI_RANK=1
PMI_JOBID=488177.2
contents of /proc/self/fd
0
1
2
255
stephey@nid001003:/mscratch/sd/s/stephey/openmpi> 
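
print.sh itself isn't included in this thread; reconstructed from the output above, a Python equivalent of the checks it appears to perform would be roughly:

import os
import socket

# Illustrative stand-in for print.sh: show which descriptor the rank was told
# to use and which descriptors actually exist inside the container.
print(socket.gethostname())
print("PMI_FD:")
print(os.environ.get("PMI_FD"))
for key, value in sorted(os.environ.items()):
    if key.startswith("PMI") or key == "ENABLE_OPENMPI_PMI2":
        print(f"{key}={value}")
print("contents of /proc/self/fd")
for fd in sorted(os.listdir("/proc/self/fd"), key=int):
    print(fd)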

@lastephey (Collaborator) commented

False alarm, I hadn't updated podman_hpc.py on the second node of my reservation. Sorry about that.

@lastephey (Collaborator) commented

stephey@nid001003:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 -v $(pwd):/work -w /work registry.nersc.gov/library/nersc/mpi4py:3.1.3-openmpi ./print.sh
nid001005
PMI_FD:
3
PMI_SIZE=2
PMI_FD=3
ENABLE_OPENMPI_PMI2=1
PMI_SHARED_SECRET=12576424787332504083
PMI_RANK=1
PMI_JOBID=488177.14
contents of /proc/self/fd
0
1
2
255
3
nid001003
PMI_FD:
3
PMI_SIZE=2
PMI_FD=3
ENABLE_OPENMPI_PMI2=1
PMI_SHARED_SECRET=12576424787332504083
PMI_RANK=0
PMI_JOBID=488177.14
contents of /proc/self/fd
0
1
2
255
3
stephey@nid001003:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 -v $(pwd):/work -w /work registry.nersc.gov/library/nersc/mpi4py:3.1.3-openmpi python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001003.
Hello, World! I am process 1 of 2 on nid001005.
stephey@nid001003:/mscratch/sd/s/stephey/openmpi>
