
WeeklyTelcon_20191203

Jeff Squyres edited this page Dec 3, 2019 · 2 revisions

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres (Cisco)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • Ralph Castain (Intel)
  • William Zhang (AWS)
  • Akshay Venkatesh (NVIDIA)
  • David Bernholdt (ORNL)
  • George Bosilca (UTK)
  • Joshua Ladd (Mellanox)

Convenient list for copy-n-paste

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (Mellanox)
  • Austen Lauria (IBM)
  • Brandon Yates (Intel)
  • Brendan Cunningham (Intel)
  • Brian Barrett (AWS)
  • Charles Shereda (LLNL)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Erik Zeiske (HPE)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Geoffrey Paulsen (IBM)
  • George Bosilca (UTK)
  • Jeff Squyres (Cisco)
  • Josh Hursey (IBM)
  • Joshua Ladd (Mellanox)
  • Mark Allen (IBM)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Intel)
  • mohan (AWS)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Ralph Castain (Intel)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)
  • Xin Zhao (Mellanox)

Agenda

Just released v3.0.5.

Bar is now a bit higher to accept PRs into v3.0.x. We should be targeting master/v4.0.x these days.

Just released v3.1.5.

Bar is now a bit higher to accept PRs into v3.1.x. We should be targeting master/v4.0.x these days.

PRs:

  • Not a ton happening because of past release, SC, and Thanksgiving.
  • 7117: IPv6. Waiting on reply.
  • 7151: small enough enhancement that it was OK to take into v4.0.x.
  • But adding JSM launch support seems like a large enough feature that it should wait for vNEXT.
  • "Pepto bismal" label (target v4.1.x): these PRs are parked right now -- they are new enhancements/features proposed for the v4.0.x branch. Hopefully they will never need to be applied to v4.0.x -- but if v5.0.x is delayed, we may need to have a conversation.
  • Target late Jan for v4.0.3 -- some fixes that have been found post v4.0.2.

Open question: what do we want to do about COMM_SPAWN problems in v4.0.x? There's a bunch of them (~10 or so) from the mailing list and the issue tracker. E.g., #6962, #7094, #6902, ...

A bunch are hostfile issues with spawn (e.g., "too many resources for the slots you have"), similar info key issues, etc.

Discussion of difficulty testing for COMM_SPAWN. Ralph suggests that -- in the Python MTT -- they spin up PRRTE and then do all their tests under PRRTE (including COMM_SPAWN tests).

Other opens:

  • PR 7174: OFI MTL issue: need AWS to think about this and make sure it's ok.
  • SC meeting PRRTE vs. ORTE:
    • Came to the conclusion that removing ORTE and replacing it with PRRTE would be a good thing. Let's move ahead with it.
    • Ralph/Gilles started #7202:
      • makes PMIx 1st-class citizen,
        • PMIx symbols are exported / available for users to call in their application.
        • NOTE: This is not a regression -- even with v3.x/v4.x, if you try to run an app with a different version of external PMIx than OMPI is compiled with, kaboom.
      • removes ORTE,
      • puts in an embedded PRRTE (i.e., replaces mpirun/mpiexec).
      • Removed all PMI-1 and PMI-2 support.
      • Did leave PMIx framework in OPAL -- it's now static (selecting internal vs. external).
      • Aiming for end of Dec / Jan-ish before it's done.
      • Several of these items are NEWS-worthy.
  • There's currently a problem with this branch and the UCX PML: https://github.com/open-mpi/ompi/issues/6982. Ralph mentions that this will be a problem when we bring in PMIx as a 1st-class citizen.
  • Git submodule?
  • Github bots (lockbot, etc.)?
    • Have no information because Brian is the one driving these things.
  • Need some review on the reachable stuff: PRs 7167 and 7134.
  • Want to have a custom collectives tuning file for EFA. What's the best way to do that?
    • Custom file for coll/tuned.
    • Just ship it in install etc dir and load it in default MCA param file.
    • How to detect EFA NIC and use that config file automatically?
    • George wants to think about this.
    • George, William, and Jeff discussed this a bit. William will investigate something along the lines of:
      • Change the default value of the MCA param of the coll/tuned decision filename to be some sentinel value (e.g., empty string).
      • Move the reading of that MCA param to a later point in time (compared to when it is called today)
      • If the value is the sentinel value, call some hooks to other routines to see if they want to supply a decision filename (e.g., if an EFA hook has been registered, it can look to see if an EFA NIC is available, and if so, return an EFA decision filename). A default hook should probably also be available that is always called last / lowest priority / whatever, and that supplies the default decision file if no other file was provided by any other hook.
      • Need to think about how to register hooks (e.g., this only applies to coll/tuned -- so it may not be suitable for coll/base...?).

Face to face

  • It's official! Portland, Oregon, Feb 17, 2020.
    • Safe to begin booking travel now.
  • Please register on Wiki page, since Jeff has to register you.
  • Date looks good: Feb 17th, right before the MPI Forum.
    • 2pm Monday, and maybe most of Tuesday.
    • Cisco has a Portland facility and is happy to host.
    • But willing to step aside if others want to host.
    • About a 20-30 min drive from the MPI Forum; will probably need a car.

MTT Dev status:


Exceptional topics


Back to 2019 WeeklyTelcon-2019
