WeeklyTelcon_20210622
- Brendan Cunningham (Cornelis Networks)
- David Bernholdt (ORNL)
- Edgar Gabriel (UH)
- Geoffrey Paulsen (IBM)
- Harumi Kuno (HPE)
- Hessam Mirsadeghi (NVIDIA)
- Howard Pritchard (LANL)
- Jeff Squyres (Cisco)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Michael Heinz (Cornelis Networks)
- Naughton III, Thomas (ORNL)
- Sam Gutierrez (LANL)
- Tomislav Janjusic (NVIDIA)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (NVIDIA)
- Aurelien Bouteiller (UTK)
- Austen Lauria (IBM)
- Brandon Yates (Intel)
- Brian Barrett (AWS)
- Charles Shereda (LLNL)
- Christoph Niethammer (HLRS)
- Erik Zeiske (HPE)
- Geoffroy Vallee (ARM)
- George Bosilca (UTK)
- Joseph Schuchart (HLRS)
- Joshua Ladd (NVIDIA)
- Marisa Roman (Cornelis Networks)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Raghu Raja
- Ralph Castain (Intel)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- Xin Zhao (NVIDIA)
- v4.0.6 shipped last week. Looking good.
- Mpool PR, waiting for review and to go into master first.
- 8919: NVIDIA cannot link. Some users may have already hit this.
- Tomislav will try to find someone to look at it.
- Planning on late August for accumulated bugfixes.
-
PMIX / PRRTE plan to release in next few weeks
- wil
-
Need to do a v5.0 rc as soon as PRRTE v2 ships.
- Need feedback if we've missed an important one.
-
PMIx Tools support is still not functional. Opened tickets in PRRTE.
- Not a common case for most users.
- This also impacts the MPIR shim.
- PRRTE v2 will probably ship with broken tool support.
-
Is the driving force for PRRTE v2.0 OMPI?
- So we'd be indirectly/directly responsible for PRRTE shipping with broken tool support?
- Ralph would like to retire, and really wants to finish PRRTE v2.0 before he retires.
- Or just fix it in PRRTE v2.0?
- Is broken tool support a blocker for PRRTE v2.0?
- Don't ship OMPI v5.0 with broken Tools support.
-
Are there any objections to delaying
- Either we resource this
-
https://github.com/openpmix/pmix-tests/issues/88#issuecomment-861006665
- Current state of PMIx tool support.
- We'd like to get Tool support in CI, but need it to be working to enable the CI.
-
https://github.com/openpmix/prrte/issues/978#issuecomment-856205950
- Blocking issue for Open-MPI
- Brian
-
PR 9014 - new blocker.
- Fix should just be a couple of lines of code, but it is hard to decide what we want.
- Ralph, Jeff and Brian started talking.
- Simplest solution was to have our own
-
Need people working on v5.0 stuff.
-
Need some configury changes in before we RC.
-
Issue 8850, 8990 and more
-
Brian will file 3-ish issues
- One is configure pmix
-
Dynamic Windows fix in for UCX.
-
Any update on debugger support?
-
Need some documentation that Open MPI v5.0 supports PMIx based debuggers, and that if
-
MPIR Shim - pushed up fixes, and enabled CI.
- Could add it to some more CI, to ensure that PMIx doesn't break
- IBM is working on some CI testing with MPIR (typically very brittle)
- Need some guidance on pmix version.
- Right now, probably not a big deal, but perhaps in 2 years, when we have 3 release branches with different pmix versions, it might make sense to do open-mpi CI testing.
- Shouldn't be too much work to do.
-
UCC coll component is being updated to be set as the default when UCX is selected. PR 8969
- Intent is that this will eventually replace hcoll.
- Solid progress happening on Read the Docs.
- These docs would be on the readthedocs.io site, or on our site?
- Haven't thought either way yet.
- No strong opinion yet.
-
Issue 8884 - ROMIO detects CUDA differently.
- Gilles proposed a quick fix for now.
-
Now released.
-
Virtual Face to face.
-
Persistent Collectives
- So nice to get MPIX_ rename into v5.0
- Don't think this was planned for v5.0
- Don't know if anyone asked them this. - Might not matter to them
- Virtual face to face -
-
a bunch of stuff in pipeline. Then details.
-
Plan to open Sessions pull request.
- Big, almost all in OMPI.
- Some of it is more impacted by clang-format changes.
- New functions.
- Considerably more functions can be called before MPI_Init/Finalize
- Don't want to do sessions in v5.0
- Hessam Mirsadeghi is interested in trying MPI_Sessions.
- Interested in a timeline of a release that will contain MPI_Sessions.
- Sessions working group meets every Monday at noon Central Time.
- https://github.com/mpiwg-sessions/sessions-issues/wiki
- Several of the tools tests are busted on master.
- Sessions branch fixes some of these.
- Initialize tools after finalize MPI
-
We don't KNOW that OMPI v6.0 won't be an ABI break
-
Would be NICE to get MPIX symbols into a separate library.
- What's left in MPIX after persistent collectives?
- Short Float
- Pcall_req - persistent collective
- Affinity
- If they're NOT built by default, it's not too high of a priority.
- Should just be some code-shuffling.
- On the surface shouldn't be too much.
- If they use wrapper compilers, or official mechanism
- Top level library, since app -> MPI and app -> MPIX lib.
- libmpi_x library can then be versioned differently.
-
Don't change to build MPIX by default.
-
Open an issue to track all of our MPI 4.0 items
- MPI Forum will want, certainly before supercomputing.
-
Do we want an MPI 4.0 design meeting in place of a Tuesday meeting?
- In person meeting is off the table for many of us. We might want an out of sequence meeting.
- Let's doodle something a couple of weeks out.
- Doodle and send it out
- trivial wiki page in style of other in person wiki.
- A lot of failures in Finalize in Cisco
- A lot of segfaults in UCX 1sided in IBM
- Howard Pritchard: Does someone at NVIDIA have a good set of tests for GPU?
- Can ask around.
- The only tests are OSC tests.
- ECP - worried we're going to get so far behind MPICH because all 3 major exascale systems are using essentially the same technology and their vendors use MPICH. They're racing ahead with integrating GPU offloaded code with MPICH. Just a heads up.
- A thread on the GPU can trigger something to happen in MPI.
- CUDA_Async Not sure of
- Jeff did some work on Cisco MTT.
- There are a bunch of one-sided issues across nodes.
- Austen and Jeff looking into.
- Narrowed it down to strange results from MPI_Comm_split
- Local Peers value appears to be set wrong under PRRTE
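
For reference when triaging the MPI_Comm_split results noted above: per the MPI standard, ranks that pass the same color land in the same new communicator, ordered by key, with ties broken by rank in the old communicator. A minimal pure-Python sketch of that expected grouping follows. It does not call MPI; `expected_split` and the sample colors and keys are hypothetical, chosen only for illustration, and `None` stands in for `MPI_UNDEFINED`.

```python
from collections import defaultdict


def expected_split(colors, keys):
    """Model the grouping MPI_Comm_split is specified to produce.

    colors[i] and keys[i] are the arguments old rank i would pass.
    A color of None models MPI_UNDEFINED (rank gets no new communicator).
    Returns {color: [old ranks listed in new-rank order]}.
    """
    groups = defaultdict(list)
    for old_rank, color in enumerate(colors):
        if color is None:
            continue  # this rank would receive MPI_COMM_NULL
        groups[color].append(old_rank)
    # New ranks are ordered by key; ties are broken by old rank.
    return {
        color: sorted(members, key=lambda r: (keys[r], r))
        for color, members in groups.items()
    }


if __name__ == "__main__":
    # Hypothetical 6-rank job: two colors, one rank opting out.
    colors = [0, 1, 0, 1, None, 0]
    keys = [5, 0, 1, 0, 0, 1]
    for color, members in sorted(expected_split(colors, keys).items()):
        print(color, members)
```

Comparing a failing test's reported membership against this kind of model may help separate a wrong split from a wrong input, such as a bad local peers value.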
- Joseph sees this when hwloc is installed in the installation path, which leads to warnings if using another hwloc.
- We changed how all of this worked a few weeks ago.
- We shouldn't be installing one unless we can't find an external one.
- Problem is if you link the application to a different hwloc, it now complains.
- This has always been true, we just warn now. Don't do this.
- Austen filed a couple of issues from MTT.
- No discussion
- No update
- No discussion.