
WeeklyTelcon_20201215

Geoffrey Paulsen edited this page Jan 19, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • NOT-YET-UPDATED

Web-Ex

  • link has changed for 2021. Please see email from Jeff Squyres to [email protected] on 12/15/2020 for the new link

4.0.x

  • v4.0.6rc1 - built, please test.

v4.1

  • SLURM fix
  • Talk about Edgar
  • Shooting for a release THIS week (12/15)
  • Want SLURM and possible OMPIO default change?

Open-MPI v5.0

What's the state of ULFM (PR 7740) for v5.0?

  • Does the community want this ULFM PR 7740 for OMPI v5.0? If so, we need a PRRTE v3.0
  • On or off by default?
  • In PRRTE, we think it can be disabled by default?
    • PRRTE had a bunch of issues turning this off.
    • Is the bar for bringing it into master that, if it's off, it's REALLY off?
  • runtime or configure time enablement?
  • Some folks want one release of PRRTE without this, but others think it's production ready.
  • LARGE PR, quite disruptive. Would want it in as soon as possible, so we can shake out bugs.
  • Might want some Test cases for this as well. Different application
    • Think they have some tests in the other ULFM branch, not sure about this branch.

Jeff Squyres wants the v5.0 RMs to generate a list of versions it'll support, to document.

  • Still need to coordinate on this. He'd like this, this week.

  • PMIx v4.0 working on Tools, hopefully done soon.

    • PMIx go through python bindings.
    • a new Shmem component to replace
    • Still working on.
  • Dave Wooten pushed up some PRRTE patches, and making some progress there.

    • Slow but steady progress.
    • Once tool work is more stabilized on PMIx v4.0, will add some tool tests to CI.
    • Probably won't start until first of the year.
  • How are the submodule references updated on Open-MPI master?

    • Josh was still looking to see about adding some cross checking CI
    • When making a PRRTE PR, one could add a comment to the PR and it'll trigger Open-MPI CI with that PR.

This is the last Tuesday call of December.

  • New web-ex for January

New items

SLURM 20.11

  • Slurm is now always using a Cgroup, and always setting default number of cores in cgroup to 1.
  • So when using mpirun with orted/prrted in slurm, orted/prrted can't
  • Ralph working on PR from user comment (PR 8288)
  • Issue, and possibly in README (will catch a lot of people)
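
To make the symptom concrete, here is a minimal sketch of how a process could check how many cores it is actually allowed to use. It assumes the cgroup limit shows up in the process's CPU affinity mask on Linux, which these notes do not state. The function names and the `min_expected` threshold are hypothetical and are not part of Open MPI, PRRTE, or Slurm.

```python
import os


def usable_cpus():
    """Return the set of CPU ids this process is allowed to run on."""
    # os.sched_getaffinity exists on Linux; elsewhere fall back to the
    # total CPU count, which ignores any confinement.
    if hasattr(os, "sched_getaffinity"):
        return set(os.sched_getaffinity(0))
    return set(range(os.cpu_count() or 1))


def looks_confined(min_expected=2):
    """Hypothetical check: True if fewer CPUs are usable than expected."""
    return len(usable_cpus()) < min_expected


if __name__ == "__main__":
    cpus = usable_cpus()
    print(f"usable CPUs: {len(cpus)} {sorted(cpus)}")
    if looks_confined():
        print("warning: this process appears to be confined to a single core")
```

Running this inside and outside a Slurm allocation would show whether a launched daemon sees only one core, under the assumption above.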

ROMIO issue on Lustre

  • Took latest ROMIO from and it failed on both
  • But then he took LAST week's 3.4 BETA ROMIO and it passed. But it's a little too new.
    • He gave a bit more info about the stuff he integrates, and stuff he moves forward.
        1. ROMIO modernization (don't use MPI1 based things)
        2. ROMIO integration items.
    • We're hesitant to put this into 4.1.0 because it's NOT yet released by MPICH
    • Hesitant to even update ROMIO in v4.0.6 since it's a big change.
    • If we delay and pickup newer ROMIO in the next minor, would there be backwards compatibility issues?
      • Need to ask about compatibility between ROMIO 3.2.2 and 3.4
        • If fully compatible, then only one ROMIO
    • We could ship multiple ROMIOs, but that has a lot of problems.

Edgar hunted down performance issue of OMPIO

  • Just got resources to test, and root caused the issue in OMPIO
  • So, given some more time Edgar will get a fix, and OMPIO can be default

ROMIO Long Term (12/8)

  • What do we want to do about ROMIO in general.
    • There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
    • Long Term we need to figure out what to do about this.
    • We may be able to work with upstream to make a clear API between the two.
  • Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
  • Putting new tests there

  • Very little there so far, but working on adding some more.

  • Should have some new Sessions tests

  • What's going to be the state of the SM Cuda BTL and CUDA support in v5.0?

    • What's the general state? Any known issues?
    • AWS would like to get.
    • Josh Ladd - Will take internally to see what they have to say.
    • From nVidia/Mellanox, Cuda Support is through UCX, SM Cuda isn't tested that much.
    • Hessam Mirsadeghi - All Cuda awareness through UCX
    • May ask George Bosilca about this.
    • Don't want to remove a BTL if someone is interested in it.
    • UCX also supports TCP via CUDA
    • PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on
  • Update 11/17/2020

    • UTK is interested in this BTL, and maybe others.
    • Still gap in the MTL use-case.
    • nVidia is not maintaining SMCuda anymore. All CUDA support will be through UCX
    • What's the state of the shared memory in the BTL?
      • This is the really old generation Shared Memory. Older than Vader.
    • Was told after a certain point, no more development in SM Cuda.
    • One option might be to
    • Another option might be to bring that SM in SMCuda to Vader(now SM)
  • Restructure Tech Doc (more features than Markdown, including cross-references)

    • Jeff had a first stab at this, but take a look. Sent it out to devel-list.
    • All work for master / v5.0
      • Might just be useful to do README for v4.1.? (don't block v4.1.0 for this)
    • Sphinx is a tool to generate docs from reStructuredText.
      • can handle current markdown manpages together with new docs.
    • readthedocs.io encourages "restructured text" format over markdown.
      • They also support a hybrid for projects that have both.
    • Thomas Naughton has done the restructured text, and it allows
    • LICENSE question - what license would the docs be available under? Open-MPI BSD license, or
  • Ralph tried Instant-On at scale:

    • 10,000 nodes x 32PPN
    • Ralph verified Open-MPI could do all of that in < 5 seconds, Instant-On.
    • Through MPI_Init() (if using Instant-On)
    • TCP and Slingshot (OFI provider private now)
    • PRRTE with PMIx v4.0 support
    • SLURM has some of the integration, but hasn't taken this patch yet.
  • Discussion on:

    • Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
    • One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
    • Talking about amending to request MCAs to know if it should be slurped in.
      • (if the component hard links or dlopens their libraries)
    • Roadrunner experiments... The Bottleneck in launching was I/O in loading all the .sos
      • spindle, and burst buffer reduce this, but still
    • Still going through function pointers, no additional inlining.
      • can do this today.
    • Still different than STATIC (sharing this image across process), just not calling dlopen that many times.
    • New proposal is to have a 3rd option where component decides its default is to be slurped into libmpi
      • It's nice to have fabric providers not bring their dependencies into libmpi so that the main libmpi can be run on nodes that may not have the providers' dependencies installed.
    • Low priority thing anyway, if we get it in for v5.0 it'd be nice, but not critical.
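
The three-option proposal discussed above can be sketched as a small decision function. This is only an illustration of the idea as recorded in these notes. The policy names, the `build_into_libmpi` helper, and the example components are all hypothetical and do not correspond to actual Open MPI configure options or component names.

```python
from enum import Enum


class Policy(Enum):
    # Hypothetical names for the three options discussed in the notes.
    DSO = "dso"                              # every component is dlopen'ed
    STATIC = "static"                        # every component slurped into libmpi
    COMPONENT_DEFAULT = "component-default"  # each component decides


def build_into_libmpi(policy, component_prefers_static):
    """Return True if a component should be slurped into libmpi."""
    if policy is Policy.DSO:
        return False
    if policy is Policy.STATIC:
        return True
    # Third option: honor the component's own default.
    return component_prefers_static


if __name__ == "__main__":
    # Hypothetical components: one hard-links an external fabric library
    # (so it prefers to stay a DSO), one has no external dependencies.
    components = {"fabric_provider": False, "self_contained": True}
    for name, prefers_static in components.items():
        slurped = build_into_libmpi(Policy.COMPONENT_DEFAULT, prefers_static)
        print(f"{name}: {'in libmpi' if slurped else 'separate DSO'}")
```

Under the third option, a component with external dependencies would stay a separate DSO, so libmpi itself could still run on nodes that lack those dependencies.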

Video Presentation

  • George and Jeff are leading
  • No new updates this week (see last week)