WeeklyTelcon_20210202
- Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres (Cisco)
- Howard Pritchard (LANL)
- Ralph Castain (Intel)
- Geoffrey Paulsen (IBM)
- Austen Lauria (IBM)
- Joseph Schuchart
- Hessam Mirsadeghi (UCX/nVidia)
- Edgar Gabriel (UH)
- Brendan Cunningham (Cornelis Networks)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Naughton III, Thomas (ORNL)
- Raghu Raja (AWS)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- George Bosilca (UTK)
- Aurelien Bouteiller (UTK)
- Christoph Niethammer (HLRS)
- Harumi Kuno (HPE)
- Brian Barrett (AWS)
- David Bernholdt (ORNL)
- Joshua Ladd (nVidia/Mellanox)
- Michael Heinz (Cornelis Networks)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia/Mellanox)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Tomislav Janjusic
- Xin Zhao (nVidia/Mellanox)
- v4.0 release, would like to take this ROMIO one-off fix instead of
- https://github.com/open-mpi/ompi/pull/8370 - Fixes HDF5 on LUSTRE
- Proposing take this one-off for v4.0.6, as a whole new ROMIO is a big change.
- Waiting on v4.0.6rc2 until we get an answer.
- Everyone seems okay with taking this into release branch, and waiting for ROMIO update on master.
- Merged
- Schedule - If we could get something for Issue 8321, we can do an RC soon.
- Jeff pushed a few commits to PR 8376
- Pro - if using Intel compiler, it'll . nice runtime option along with configure option.
- Nice
- Also added a read-only "supported" value; what you actually use may be different.
- Also used Enum flags to see/set what is there.
- George is against the complete disabling of AVX, since it disables things overly aggressively.
- Consistency is good, and George wants to be consistent, so he is against these commits.
- This commit prevents all users from
- problem is only seen on certain processors. Can't reproduce in other family of processors.
- And compiler versions seem to matter.
- Hoping then people can white-list certain processors as
- v4.1 is the first series this went into, so perhaps we need more time to bake.
- Historically if it's broken in one case, we generally just put a warning out in that case.
- v4.1.0 is out there already, and could put in a bigger hammer option in v4.2.0
- Jeff will amend the 2 commits he added, and then remove one of the restrictions.
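
The PR 8376 discussion pairs a read-only "supported" set with a user-settable set, both expressed as enum flags. A minimal Python sketch of that idea follows. The names `AvxFlags` and `effective_flags` are hypothetical and are not Open MPI's actual MCA parameter names or values:

```python
from enum import Flag, auto


class AvxFlags(Flag):
    """Hypothetical instruction-set flags; names are illustrative only."""
    NONE = 0
    SSE = auto()
    AVX = auto()
    AVX2 = auto()
    AVX512 = auto()


def effective_flags(supported: AvxFlags, requested: AvxFlags) -> AvxFlags:
    """Return the flags actually used.

    A request can only narrow what was detected as supported; it can
    never enable something the hardware or compiler lacks.
    """
    return supported & requested


# "supported" is read-only (what was detected); "requested" is what the user sets.
supported = AvxFlags.SSE | AvxFlags.AVX | AvxFlags.AVX2 | AvxFlags.AVX512
requested = AvxFlags.SSE | AvxFlags.AVX | AvxFlags.AVX2  # turn off AVX512 only
active = effective_flags(supported, requested)
```

With a mask like this, a user on an affected processor could drop one level (here AVX512) without disabling every vectorized path, which is closer to the narrower approach George argued for.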
- PR 8435 - Moved a feature from Tuned to base, and use it in libnbc.
- George will write up how to use this, and Jeff will get it into the docs.
- Will do a v4.1.1 RC
- Issue 8334 - a performance regression with AVX512 on Skylake. Still digging into it.
- Blocker for v4.1 release. Performance regression.
- Unclear if the scope is isolated to LAMMPS or affects all Allreduce.
- VASP - with lots of allreduce, didn't see much perf difference with AVX on/off.
- Weird that it wouldn't help, but AVX only helps large vector reductions.
- Horovod saw about 20% improvement in the original UTK paper at EuroMPI.
- Issue 8410 - Build Failure on Apple Silicon.
- Do we just need a new updated string, or is that just one of the issues?
- Code changes we need in v4.1.
- Will have exact same problem in PMIx and PRRTE
- Performance with Atomic FIFO is another issue, might not need to backport to v4.1
- Closed
- Issue 8367 - will take to UCX community
- Not yet brought up to UCX community. Josh will take it up.
- Issue 8379 - UCT appears to be default and not UCX
- Jeff repinged for request
- Does UCT BTL even get built?
- Still in discussion in Issue 8102.
- Common misconception that people can install over an existing install.
- Might be an older mca component from
- We had a PR to have a unique signature for each build.
- If we had this, we could embed this signature in the modules themselves; we'd then avoid this issue at runtime by only opening an mca component if it comes from the same build.
- We currently have something for mca VERSION, but we never update the mca version.
- So maybe we want to add OMPI version into this mca version check.
- But this might not be enough, as recompiles might have different configure.
- We need something to identify the configure itself.
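
The per-build signature idea above can be sketched as follows. This is only an illustration under stated assumptions: the function names `build_signature` and `may_open_component` are hypothetical, and hashing the version plus the configure arguments is one possible way to "identify the configure itself", not what the referenced PR does:

```python
import hashlib
from typing import Sequence


def build_signature(version: str, configure_args: Sequence[str]) -> str:
    """Hash the version string together with the configure command line.

    Hypothetical scheme: two builds of the same version with different
    configure arguments get different signatures.
    """
    h = hashlib.sha256()
    h.update(version.encode())
    for arg in configure_args:
        h.update(b"\0")  # separator so ["ab", "c"] differs from ["a", "bc"]
        h.update(arg.encode())
    return h.hexdigest()


def may_open_component(core_signature: str, component_signature: str) -> bool:
    """Only open a component that was produced by the same build."""
    return core_signature == component_signature
```

A stale component left behind by installing over an existing install would carry a different signature and could be skipped at open time instead of failing later at runtime.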
- 8431 - git commit checks as action.
- hwloc - are we tracking the usage of the hwloc topology loads?
- George wants to take a stab at it. Using it in HAN and Treematch
- MTT is showing that the master branch is pretty good. We don't need to wait for PRRTE to be complete to branch v5.0.x in OMPI
- Raghu added an entry for libfabric.
- One-sided tests are still busted. Do we keep running these if they're failing?
- Nathan is actively working on, so hopeful we'll get this.
- Adding ULFM tests to new public repo
- Are we Feature Complete?
- PRRTE should be ready end of Q1.
- Based on v5.0 tracker, there is a bunch of stuff not in.
- GPU Direct support for OFI MTL
- AWS working on now. Need to rebase, and upstream.
- OFI BTL changes need to get upstreamed.
- Weeks for MTL
- Edgar atomicity issue for OMPIO. Not sure if it's a full feature, but need to have on radar.
- ETA: a few days after Edgar finds time. 2-3 weeks.
- Any other big features?
- Branch Date will discuss next week.
- How to implement so that ./configure --help presents all configure options to users?
- Brian sent a good summary to the devel mailing list, presenting 3 different options.
- Document (possibly at the end of the default --help output, definitely in the README) that you need to run --help=recursive to see the options for PMIx and PRRTE
- Nice, but very complex. Also GIANT output and not super friendly.
- This won't show any hwloc help, probably because it's not a git submodule.
- This would need to be fixed, especially if we go this direction.
- This way will STILL warn at high level that it doesn't understand an argument, and then it'll be picked up at a lower level.
- Add “dummy” help options for the parameters from PMIx and PRRTE we think are worth exporting. This is likely prime for bit rot
- This is frowned upon... We've been hit by bit-rot quite a bit.
- Josh’s script to create a dummy help option for each argument in PMIx and PRRTE not in the top level configure.
- email states incorrect PR. Correct PR is: PR 8409
- Sometimes there are options we want to pass to one subcomponent but not the other.
- We have --with-feature-X, but by keeping these separate they won't be "mixed", as they might mean different things.
- In this way, configure options that ompi configure doesn't recognize (3rd party args) won't warn.
- Probably don't want to give users TOO much control of subcomponents.
- If users want more control they can use External. This allows us to not need to get too complex in configure.
- gcc also has similar problem, and they're okay without prefixing.
- Don't want to sever connectivity to embedded packages.
- Arguments against #3.
- You would not have visibility of which subcomponent a particular option belongs to.
- But users shouldn't have to worry about where a configure flag is implemented.
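
The third option (a dummy help entry for each PMIx/PRRTE argument not in the top-level configure) comes down to a set difference over parsed help output. A minimal Python sketch, assuming help text in the usual Autoconf layout; the function names and sample option names are hypothetical, and this is not the script from PR 8409:

```python
import re

# Matches an option name at the start of a help line, e.g.
# "  --with-foo=DIR   description" -> "--with-foo"
OPTION_RE = re.compile(
    r"^\s*(--(?:with|without|enable|disable)-[A-Za-z0-9_-]+)",
    re.MULTILINE,
)


def parse_options(help_text: str) -> set:
    """Collect the option names that appear in configure --help output."""
    return set(OPTION_RE.findall(help_text))


def missing_options(top_level_help: str, subproject_help: str) -> list:
    """Options the subproject accepts that the top level does not list."""
    return sorted(parse_options(subproject_help) - parse_options(top_level_help))


# Made-up help text for illustration only.
top_help = "  --with-hwloc=DIR   hwloc location\n  --enable-debug   debugging\n"
sub_help = "  --enable-debug   debugging\n  --with-feature-X=DIR   a feature\n"
print(missing_options(top_help, sub_help))
```

Each name returned would get a dummy entry in the top-level help. Generating the list on every build, rather than maintaining it by hand, is one way to limit the bit-rot concern raised for the hand-written variant.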
- Process returns wrong result unless pml is ^ucx.
- Should we release a v4.0.6 with a PR that would disable building against UCX older than 1.9 (current UCX)?
- This is blocking v4.0.6
- This would be a drastic change to deny all UCX before current 1.9
- Hessam / Yossi are looking into this.
- Please get back this week.
- We'd like to ship v4.0.6 soon, and getting a more specific fix would be better than the big hammer.
- Would be good to do both configure time and runtime
- Assume this affects v4.1 as well.
- Should be straight forward to chase down, and
- Possibly an issue with collective and UCX in this runmode.
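
The "big hammer" discussed above is a minimum-version gate that could be applied both at configure time and at runtime. A minimal Python sketch of the comparison, assuming purely numeric dotted version strings; `parse_version` and `ucx_allowed` are hypothetical names, not Open MPI or UCX APIs:

```python
MIN_UCX = (1, 9)  # minimum version named in the discussion


def parse_version(text: str) -> tuple:
    """Turn "1.9.0" into (1, 9, 0). Assumes numeric components only."""
    return tuple(int(part) for part in text.split("."))


def ucx_allowed(version: str, minimum: tuple = MIN_UCX) -> bool:
    """True if the given UCX version is at least the minimum."""
    return parse_version(version) >= minimum
```

Comparing integer tuples rather than strings matters here: as strings, "1.10.0" sorts before "1.9.0", which would wrongly reject a newer release.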
- PR 8406 - Technically not needed.
- This PR is redundant with prior fix already in.
- Already in v4.1
- Jeff can setup so we have single point of contact in github, that many members of organizations can watch
- Don't go crazy to start, just setup a few
- PR 8329 - convert README, HACKING, and possibly manpages to reStructuredText.
- Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
- Has a build from this PR, so we can see what it looks like.
- Have a look. It's a different approach to have one document that's the whole thing.
- FAQ, README, HACKING.
- Do people even use manpages anymore? Do we need/want them in our tarballs?
- Useful for tools, offline deployments.
- not really for APIs.
- 2/2 Update Going well.
- Going slowly through the FAQ, validating and refreshing the content.
- Aimed at v5.0
- Probably will rearrange this. No longer need a FAQ; now that we're going to have Docs, will rearrange it into sections.
- May want to archive existing content.
- What do we want to do about ROMIO in general.
- OMPIO is the default everywhere.
- Gilles is saying the changes we made are integration changes.
- There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
- We may be able to work with upstream to make a clear API between the two.
- As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
- Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
- Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
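
The CI-bot idea above amounts to matching a PR's changed paths against a list of protected patterns. A minimal Python sketch follows. The patterns and the function name `flag_protected` are hypothetical and do not reflect the actual Open MPI tree layout or any existing bot:

```python
from fnmatch import fnmatch

# Hypothetical patterns for imported upstream code that should not be edited in place.
PROTECTED_PATTERNS = ["3rd-party/*", "*/treematch/*"]


def flag_protected(changed_paths, patterns=PROTECTED_PATTERNS):
    """Return the changed paths that fall under a protected pattern.

    fnmatch's "*" also matches "/", so "3rd-party/*" covers nested files.
    """
    return [
        path
        for path in changed_paths
        if any(fnmatch(path, pattern) for pattern in patterns)
    ]
```

A bot could run this over the file list of each PR and add a label or comment when the result is non-empty, leaving the decision to a human reviewer.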
How's the state of https://github.com/open-mpi/ompi-tests-public/
- Putting new tests there
- Very little there so far, but working on adding some more.
- Should have some new Sessions tests
- What's the general state? Any known issues?
- AWS would like to get.
- Josh Ladd - Will take internally to see what they have to say.
- From nVidia/Mellanox, Cuda Support is through UCX, SM Cuda isn't tested that much.
- Hessam Mirsadeghi - All Cuda awareness through UCX
- May ask George Bosilca about this.
- Don't want to remove a BTL if someone is interested in it.
- UCX also supports TCP via CUDA
- PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on
- Update 11/17/2020
- UTK is interested in this BTL, and maybe others.
- Still gap in the MTL use-case.
- nVidia is not maintaining SMCuda anymore. All CUDA support will be through UCX
- What's the state of the shared memory in the BTL?
- This is the really old generation Shared Memory. Older than Vader.
- Was told after a certain point, no more development in SM Cuda.
- One option might be to
- Another option might be to bring that SM in SMCuda to Vader(now SM)
- Discussion on:
- Didn't get to this week. :(
- Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
- One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
- Non-Homogenous clusters (GPUs on some nodes, and non-GPUs on some other)
- ECP Community days ( March 30-April 1st )
- David Bernholdt and/or George Bosilca
- Each day 90 minute time slots.
- Get proposal in by this Friday.