WeeklyTelcon_20200811

Geoffrey Paulsen edited this page Jan 19, 2021 · 1 revision

Open MPI Weekly Telecon ---

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • NOT-YET-UPDATED

New

HWLOC initialization thing. (Issue #7937)

  • trivial to fix in master.
  • Once Brian gets his configure stuff in.
  • May need someone else to finish.
  • Should be able to call PMIx Init, and ____ init; don't need opal init at beginning of MPI_Init.
    • This won't work going back into releases.
    • buried in mca system.
    • need
  • What to do about fixing release branches.
  • Can't give local topology without ___
  • Don't run it at scale.
  • The portable way to get it, is hwloc.
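Issue #7937 comes down to processes laying out the same shared-memory region with different cache line sizes. The sketch below only illustrates that failure mode. The `slot_offset` helper and the header-plus-slots layout are hypothetical and are not taken from the SM BTL, and the 64-byte value is assumed purely for illustration. It shows that a creator assuming a 128-byte line and an attacher using another value compute different offsets for the same slot.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of line (line must be a power of two). */
static size_t align_up(size_t offset, size_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0);
    return (offset + line - 1) & ~(line - 1);
}

/* Hypothetical layout: a header followed by slots, each starting on its
 * own cache line. Returns the byte offset of slot `index`. */
static size_t slot_offset(size_t header_bytes, size_t slot_bytes,
                          size_t index, size_t line)
{
    size_t first = align_up(header_bytes, line);
    size_t stride = align_up(slot_bytes, line);
    return first + index * stride;
}

/* Returns 1 when creator and attacher agree on where slot `index` lives. */
static int layouts_agree(size_t header_bytes, size_t slot_bytes, size_t index,
                         size_t creator_line, size_t attacher_line)
{
    return slot_offset(header_bytes, slot_bytes, index, creator_line) ==
           slot_offset(header_bytes, slot_bytes, index, attacher_line);
}
```

With a 40-byte header and 48-byte slots, `layouts_agree(40, 48, 1, 128, 64)` is 0 while `layouts_agree(40, 48, 1, 128, 128)` is 1, which is why every user of the region needs the same cache line value before the region is created.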

George has some pt2pt

  • Summary: We committed some code
    • Race condition that we have always won (because it happens at finalize and we haven't cared), but it now shows up in ULFM (and possibly Sessions).
    • We switched the configury logic so we always prefer external libevent (above a certain level of external libevent).
      • Most OSes are above that level, so almost always prefer external libevent.
      • If we get the fix into our internal libevent,
        • Concern is that unless we or users explicitly request internal libevent, we'll almost never get this fix.
      • One solution would be
    • Can't think of another solution.
    • Packagers don't like to use our internal component
    • Only thing we can think of is if you want ULFM, you can't use external libevent.
  • Progress of getting PR accepted upstream?
    • Yes, prepared an upstream libevent PR.
      • They want a non-open-mpi reproducer.
      • Have ideas on how to create this reproducer, but not sure if it's very easy.
      • Original code writer added some protection, but has since retired. This PR removes this protection.
        • Actually "we" added this race condition protection in libevent. It delays removal of file descriptor until too late.
          • The fix validates the FD before handling. Sounds right to all.
    • Not started yet. Creating
    • May be a way to code around this on ULFM, but not really sure, because things get into a bad state, and only way might be to ruin our performance.
  • If we protect this with configure (when building ULFM and have to use internal libevent).
    • It means we move to submodules for libevent, we'd have to "mirror" libevent ourselves
  • Only master / v5.0
    • If we have TCP it could happen, but we disable errors in Finalize so don't hit this issue.
  • libevent patch to this OLD internal libevent (libevent2022, i.e. 2.0.22)
    • It's possible that the problem goes away in newer libevent. But updating libevent was a major hassle.
    • George checked whether the code is gone or has been modified in newer libevent.
      • Code is still there in latest libevent (so still need fix).
    • updating libevent would be a much better solution.
  • If upgrading to new libevent is answer.
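The fix discussed above is described as validating the FD before handling it. The following is a minimal sketch of that idea using plain POSIX calls. It is not the libevent patch, and `fd_is_open`, `dispatch_if_valid` and the demo function are hypothetical names introduced here. A pending event for a descriptor that has since been closed is skipped instead of being handed to its handler.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

typedef void (*fd_handler)(int fd, void *arg);

/* A descriptor is stale when the kernel reports EBADF for it. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Run the handler only if fd is still valid.
 * Returns 1 if the handler ran, 0 if the event was skipped. */
static int dispatch_if_valid(int fd, fd_handler handler, void *arg)
{
    assert(handler != NULL);
    if (!fd_is_open(fd)) {
        return 0;
    }
    handler(fd, arg);
    return 1;
}

/* Test handler: counts how often it was invoked. */
static void count_call(int fd, void *arg)
{
    (void)fd;
    ++*(int *)arg;
}

/* Demo: dispatch once while the pipe end is open, once after closing it.
 * Returns the number of handler invocations, or -1 if pipe() failed. */
static int demo_calls_after_close(void)
{
    int fds[2];
    int calls = 0;
    if (pipe(fds) != 0) {
        return -1;
    }
    dispatch_if_valid(fds[0], count_call, &calls); /* fd open: handler runs */
    close(fds[0]);
    dispatch_if_valid(fds[0], count_call, &calls); /* fd closed: skipped */
    close(fds[1]);
    return calls;
}
```

Note that a closed descriptor number can be reused by a later `open()` or `socket()`, so a check like this narrows the window but does not by itself prove the event still belongs to the same file.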

Annual review of OMPI

  • Jeff will send out the once-a-year request to make sure that those who have commit access still should.
    • Have not reviewed yet:
      • Amazon, Fujitsu, Google, HPE, Los Alamos, nVidia/Mellanox, IBM
    • Need to update the spreadsheet saying "looked at".

Face to face

  • August 10th, 11th, Monday and Tuesday that week.
  • List of Topics to discuss, and presenters.
    • On the wiki, start filling in.
  • Need to figure out snacks.

Open MPI Weekly Telecon ---

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Call in user - Thomas

not there today (I keep this for easy cut-n-paste for future notes)

  • Jeff Squyres (Cisco)
  • Artem Polyakov (nVidia/Mellanox)
  • Aurelien Bouteiller (UTK)
  • Austen Lauria (IBM)
  • Barrett, Brian (AWS)
  • Brendan Cunningham (Intel)
  • Christoph Niethammer (HLRS)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • George Bosilca (UTK)
  • Howard Pritchard (LANL)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Joshua Ladd (nVidia/Mellanox)
  • Matthew Dosanjh (Sandia)
  • Noah Evans (Sandia)
  • Ralph Castain (Intel)
  • Naughton III, Thomas (ORNL)
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic
  • William Zhang (AWS)
  • Akshay Venkatesh (NVIDIA)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • David Bernhold (ORNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • Harumi Kuno (HPE)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Michael Heinz (Intel)
  • Nathan Hjelm (Google)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia/Mellanox)
  • mohan (AWS)

New

  • Obtaining cache line size from hwloc topo info.
    • trivial to fix in master.
    • Once Brian gets his configure stuff in.
    • May need someone else to finish.
    • Should be able to call PMIx Init, and ____ init; don't need opal init at beginning of MPI_Init.
      • This won't work going back into releases.
      • buried in mca system.
      • need
    • What to do about fixing release branches.
    • Can't give local topology without ___
    • Don't run it at scale.
    • The portable way to get it, is hwloc.
  • Summary: We committed some code

    • Race condition we always win (because it happens at finalize and haven't cared), but now in ULFM (and possibly Sessions)
    • Aurelien Bouteiller posted a nice summary of the situation and we discussed mitigation
      • Doesn't really affect Linux, just macOS
    • Would like a user visible message if we know we can run, rather than crash.
  • George isn't here today.

    • Picking it up too late
    • PMIx may or may not have this ordering issue.
      • PMIx doesn't depend on hwloc (and not using that)
  • Should we upgrade our internal libevent to latest 2.1.12?

    • Reasons for or against?
    • Maybe hold off until we get configure code to change it to a submodule.
  • If we make libevent a submodule pointer, then we wouldn't be able to fix problems even if we have bigger problems than this.

    • For OMPI v5.0, the earliest version of libevent we're going to support out of the box is 2.0.21 (RHEL7)
      • Issue 7666
    • Logic: if the version installed on the system is older, we'll use our bundled copy
  • There is a hypothetical risk that we can't ship patches, and we

    • MAC configury work
  • ULFM configury work is independent of libevent configury work.

  • Do we still merge 7940 to revert it since submodule will replace it completely?

    • Might be nice for git
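The libevent selection rule discussed above (prefer an external libevent unless it is older than 2.0.21) can be sketched as a small decision function. This is an illustration only, not the configure logic itself: `pack_version`, `choose_libevent` and the `need_patched` flag are hypothetical. The flag models the idea floated in these notes that a ULFM build might have to use the internal copy; nothing here says that was adopted.

```c
#include <assert.h>

enum libevent_choice { USE_EXTERNAL, USE_BUNDLED };

/* Pack major.minor.patch so versions compare as plain integers. */
static unsigned long pack_version(unsigned major, unsigned minor, unsigned patch)
{
    assert(minor < 256 && patch < 256);
    return ((unsigned long)major << 16) | ((unsigned long)minor << 8) | patch;
}

/*
 * have_external: non-zero if a system libevent was found.
 * external:      its packed version (ignored when have_external is 0).
 * need_patched:  non-zero if the build needs a fix that only the bundled
 *                copy carries (assumption, see the text above).
 */
static enum libevent_choice choose_libevent(int have_external,
                                            unsigned long external,
                                            int need_patched)
{
    const unsigned long minimum = pack_version(2, 0, 21);

    if (!have_external || need_patched || external < minimum) {
        return USE_BUNDLED;
    }
    return USE_EXTERNAL;
}
```

For example, a system 2.1.12 is chosen as external, a system 2.0.20 falls back to the bundled copy, and setting `need_patched` forces the bundled copy regardless of the system version.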

EFA

  • AWS backend uses verbs interface in OFI.
    • If OFI BTL is there, it initializes first.
    • If an EFA device is there, initializing the OFI BTL before the openib BTL won't cause issues.
      • If EFA device isn't there, then openib BTL
      • But this means mucking around with base initialization code.
    • Calling ibv_fork_safe() by default.

Face to face

  • August 10th, 11th, Monday and Tuesday that week.

  • List of Topics to discuss, and presenters.

    • On the wiki, start filling in.
  • Many companies are not allowing a face to face travel until 2021 due to COVID19.

    • Instead, let's do a series of virtual face-to-face meetings?
  • Yes this summer to discuss for v5.0

    • Maybe we can do it by topic?
    • Maybe not 4 or 8 hour things.
  • Different topics on different days.

  • Do a doodle poll of least-bad days in late July/August.

    • August 10th-14th - 3 hour block of time 8-11 Pacific time.
    • Jeff will do another doodle for days of the week (vote for 2)
  • Start a list of topics.

MPI Forum was last week.

  • Sessions is now in.
  • Partitioned communication was voted in.

Thread local storage issue

  • OpalTSDCreate - takes a thread storage local key that would be tracked locally in opal.
    • But when we go to delete, it's not being deleted.
    • But want flexibility to destroy on our own or explicitly
    • George thinks the mode we have today, since tracking all keys to be released by main thread.
    • George thinks Artem's approach is the correct approach.
  • Would have to change the way that keys are USED, and different components are using it in a different way.
  • Something similar should be done in different places.
  • If you do it just for UCX, then others can see how you did it and check for their code.
  • So we think current PR is good, but it leaves old API and new API.
    • But it might be better to remove OLD way and make broken components do SOMETHING to update their code.
    • Should be easy for components to add explicit cleanup calls
  • Master branch only.
  • Opened a new Pull Request yesterday that addresses the problem as discussed last week.
  • Tracking of TLS in common code.
    • Have low-level thread-specific keys (very simple, based on the thread implementation)
    • Tracked keys: probably what you want to use if you want to ensure all TLS is accounted for and released at destruction of the key.
    • Tommy changed all of the places in OMPI where those keys are used. Just use a tracked key instead of a regular key.
    • Changed set_specific and get_specific to just set and get.
    • Please review and give suggestions.
  • Does it even make sense to do TLS in OPAL at all?
    • May indicate that we have an abstraction wrong somewhere.
    • If MPI depends on this in OPAL, then it depends on them in PMIx and other layers?
    • Not sure if there is a problem, but at a high level, sounds problematic.
  • Baking in pthread assumptions in general is not a good idea.
    • That's what this PR does: it abstracts pthread semantics.
  • May be some confusion, no problem with porting this API anywhere.
    • Issue raised before is that if you're relying on a certain type of thread in MPI layer.
    • But we don't, because there's a framework.
    • But Application is linked against PMIx and libevent and to use other threading models is dangerous.
      • To make this work, you have to make changes to event polling, etc.
  • Not saying we shouldn't take these patches, these make things better.
    • But we do have a problem that other thread components aren't going to "just work", because PMIx and libevent, which use pthreads, conflict with other threading models.
      • argobots actually uses pthreads, not sure about qthreads.
      • Working on a way to configure libevent to make this combo work.
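The "tracked key" idea above (every value stored through a key is recorded so that all of it can be released when the key is destroyed) can be sketched on top of pthreads as follows. This is a minimal sketch under stated assumptions, not the OPAL API or the PR's code: all `tracked_key_*` names are hypothetical, each thread is assumed to set its value at most once, and the caller must make sure no other thread still uses the key when it is destroyed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef void (*tsd_destructor)(void *value);

struct tracked_value { void *value; struct tracked_value *next; };

struct tracked_key {
    pthread_key_t key;
    tsd_destructor destructor;
    pthread_mutex_t lock;          /* protects the values list */
    struct tracked_value *values;  /* every value set through this key */
};

static int tracked_key_create(struct tracked_key *tk, tsd_destructor destructor)
{
    tk->destructor = destructor;
    tk->values = NULL;
    if (pthread_mutex_init(&tk->lock, NULL) != 0) return -1;
    /* No pthread destructor: release happens in tracked_key_destroy. */
    if (pthread_key_create(&tk->key, NULL) != 0) {
        pthread_mutex_destroy(&tk->lock);
        return -1;
    }
    return 0;
}

/* Store a value for the calling thread and remember it for later release. */
static int tracked_key_set(struct tracked_key *tk, void *value)
{
    struct tracked_value *node;
    assert(tk != NULL);
    node = malloc(sizeof(*node));
    if (node == NULL) return -1;
    if (pthread_setspecific(tk->key, value) != 0) { free(node); return -1; }
    node->value = value;
    pthread_mutex_lock(&tk->lock);
    node->next = tk->values;
    tk->values = node;
    pthread_mutex_unlock(&tk->lock);
    return 0;
}

static void *tracked_key_get(struct tracked_key *tk)
{
    return pthread_getspecific(tk->key);
}

/* Release every tracked value, then delete the key.
 * Returns the number of tracked entries that were processed. */
static int tracked_key_destroy(struct tracked_key *tk)
{
    int released = 0;
    struct tracked_value *node = tk->values;
    while (node != NULL) {
        struct tracked_value *next = node->next;
        if (tk->destructor != NULL && node->value != NULL) {
            tk->destructor(node->value);
        }
        free(node);
        node = next;
        ++released;
    }
    tk->values = NULL;
    pthread_key_delete(tk->key);
    pthread_mutex_destroy(&tk->lock);
    return released;
}

static void *worker(void *arg)
{
    tracked_key_set((struct tracked_key *)arg, malloc(16));
    return NULL;
}

/* Demo: two worker threads and the main thread each set a value; the
 * main thread then releases all of them by destroying the key. */
static int demo_release_count(void)
{
    struct tracked_key tk;
    pthread_t threads[2];
    int started = 0;
    if (tracked_key_create(&tk, free) != 0) return -1;
    for (int i = 0; i < 2; ++i) {
        if (pthread_create(&threads[started], NULL, worker, &tk) == 0) ++started;
    }
    for (int i = 0; i < started; ++i) pthread_join(threads[i], NULL);
    tracked_key_set(&tk, malloc(16));
    return tracked_key_destroy(&tk);
}
```

The design choice shown here is that cleanup is explicit and centralized: values set by threads that have already exited are still released by whichever thread destroys the key, which is the behaviour a plain key without tracking does not give you.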

C11 atomic usage is a mess

  • Last week:
    • George needs some input on PR
    • We don't need _atomic_ in most cases just need volatile
    • patch linked to the issue PR7914
    • We're not breaking things, we just get a lot of valid complaints from the Intel compiler.
      • STDOUT of make is ~16 MB due to all the Intel compiler warnings without this fix
  • There is a PR pending

Open Source Parent organization

  • Since Open-MPI is a registered non-profit.
  • If we log volunteer time we can
    • Software in the Public Interest (Parent non-profit)
  • A week or two

Release Branches

Review v4.0.x Milestones v4.0.5

  • Blocked on a PR from George Issue 7937
  • 7968 is marked as a blocker, but this is more of a UCX issue, than OMPI issue.

Review v4.1.x Milestones v4.1.0

  • A couple of pending issues:

    • OFI issue Amazon is working on.
  • Need Review on PR7991

  • Need some more cycles on HAN and Adapt in master, before pull it into v4.1

    • AWS will run tests before next week.
  • Schedule: Want to release end-of-July

    • A minimum of a week, need changes from George on collective components
    • Posted a v4.1.0 rc1 to go through mechanisms to ensure we can release.
  • A number of PRs for v4.1 have not yet gone into master.

  • PRs against v4.1.x need reviews (and need corresponding PRs to go into master)

    • A UCX init PR out for 4 weeks, still need a review
  • Release Engineers: Brian (AWS) Jeff Squyres (Cisco)

  • The fact that we removed pt2pt in OSC is causing one-sided issues.

    • Nathan agreed to take a look.
  • George found an SM BTL issue at Init on master. Jeff filed Issue 7937

    • Summary: Cacheline size is set very late, after modex, which affects everything that uses the cacheline size before modex.
    • This is a correctness issue (not optimization) - George on today's call
      • At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
        • Affects all the way back to v2.x
    • Backing file for shared memory is allocated by one proc, and used by many. It is created before MODEX (using 128), but later, when other procs try to use it, they use a different cacheline size.
    • Looking at the code, we do this other places as well, but not as dramatic.
    • May need an alternative for master (after Brian finishes configury, to bring hwloc out of frameworks).
    • How do we fix this?
      • Can we just get the cacheline size before we get the rest of topology information? Brice said no.
      • Only solution we can see is creating an opal function to do this.
        • Would work, but might be slightly less optimal. If every rank does this in parallel, does this cause issues?
        • George can look for it, but can't do it before end of week.
        • Probably easiest to look in dstore in PMIx that's pretty clean (mentioned in ticket)
    • Who can do this work?
      • Showing itself in CUDA issue.
        • Tomislav Janjusic (nVidia) will ask some of his colleagues.
    • Because we align some structs based on that, but
      • It would be associated with getting the topology (but not retrieved until after the modex)
      • Only cuda btl calls the function directly, everyone else extracts from PMIx.
        • What we ought to do: there is no harm in getting the topology earlier, we just need to ensure PMIx is initialized.
        • On v4.1, we don't get the topology before someone requests it much later.
          • Must also affect v4.0.x
      • George put a fix into master, but changing it to load the topology as soon as PMIx is initialized would be much better.
        • Con is that if we're not in a PMIx environment to share this pointer, then every process will go do this discovery, even if they don't need it later.
        • Problem is that the process that creates the backing file, creates it very early.
    • Someone should review all the branches to see whether we get the topology before anything uses the cacheline size.
    • George saw it in SM BTL structures. Deadlock.
    • This isn't tested by our CI infrastructure.
  • Still want:

    • George's Collectives
      • George is still working on master version of coll
      • Next thing he's working on today.
    • Will probably need to do something to CI to enable these for testing.
      • CI not really executing
      • IBM will do some testing of this.
      • Will need some docs on how users to select this.
    • Tunings for tuned coll
    • AVX
      • Went in this morning.
    • UCX PRs awaiting review.
  • Past: We've come to consensus for a v4.1.0 release

    • Need include/exclude selection, worried about consistent selection.
    • A lot of PRs outstanding, but can't merge until
      • Patch for OFI stuff messed up v4.1.x branch.
      • Howard has a fix PR, Jeff is looking at.
    • Howard changed new OFI BTL parameters to be consistent with MTL
    • Not breaking ABI or backwards compatibility.
    • v4.1.x branch, branched from v4.0.4 tag.
    • NOT touching runtime!!!
    • Not going to be pulling in a new PMIx version.
  • All MTT is online on v4.1.x branch

  • Not compiling under SLURM EFA test. (OFI BTL issue)

Review v5.0.0 Milestones v5.0.0

  • No update this week other than master discussion.

  • Need to put OSC pt2pt

    • OSC RDMA requires a single BTL that can contact every single process.
      • This didn't use to be the case. (Comment in the code)
  • We can't use the OSC pt2pt.

    • It is not thread safe. Doesn't conform to MPI4 standard. Not safe.
    • This is just a testing fallacy. Could add tests to show this, but we'd still be in the same boat.
    • Either product A or B is broken and we need to fix it.
  • RDMA Onesided should fall back to "my atomics" because TCP will never have rdma atomics.

    • The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
  • Jeff will close the PR, and

  • Jeff will Nathan will fetching, get, compare and swap.

  • Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller.

  • Does UCX support iWarp?

    • Does libFabric support iWarp via verbs provider?
    • https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
    • Brian thinks that libFabric
    • OFI can support iWarp, just need to specify the provider in the include list.
    • This person who's asking is a partner not a customer
  • PMIx

    • Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
    • Sessions needs something from PMIx v4
    • ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
    • PPN scaling issue - simple algorithmic issue in this function
      • PMIx talked about it. Artem might know someone who might be interested in working on it.
      • Algorithm behind one of the interfaces doesn't scale well.
      • Not a regression. Above ~ 4K nodes, becomes quadratic.
  • PRRTE

    • Nothing's happening there.

master

  • Mostly discussed above.

ompi-tests-public

  • We now have a new publicly visible test repo, for new tests
    • Haven't tried to do two checkouts (of both public and private test repos) in one MTT run yet.
    • Should probably update instructions on how to set up MTT
    • Can add new PR based tests if we want. We'll need to add new infrastructure.

Super Computing Birds-of-a-feather

  • George and Jeff will help plan and come to community.
    • Done / Submitted.
    • Probably won't hear back until Sept.
  • Probably after super computing.

Infrastructure

  • scale-testing, PRs have to opt-into it.

Review Master Pull Requests

CI status


Dependencies

PMIx Update

ORTE/PRRTE

MTT


Back to 2020 WeeklyTelcon-2020

Annual review of OMPI

  • Jeff will send out the once-a-year request to make sure that those who have commit access still should.
    • Have not reviewed yet:
      • Amazon, Bull, Google, Los Alamos, nVidia/Mellanox
    • Need to update the spreadsheet saying "looked at".

Face to face

  • August 10th, 11th, Monday and Tuesday that week.
  • Put stuff on the agenda wiki (URL HERE)
  • List of Topics to discuss, and presenters.
    • On the wiki, start filling in.

Super Computing Birds-of-a-feather

  • George and Jeff will help plan and come to community.
    • Done / Submitted.
  • May not have Super Computing conference at ALL this year.
  • Many other projects are doing a virtual state of the union type meeting to try to cover what they'd usually do in a Birds of a feather meeting.
  • Then this works pretty well, and do this a couple of times a year.
  • Not constrained to Super Computing
  • Almost certain that it will be virtual
    • Not sure the cost.
    • Ralph and Jeff have been doing ABCs of Open MPI - SO many people. Done 2 of 3 sessions (each went 1.5 hours, lots of questions)
      • Slides and Youtube are on website, and will send link to userlist.
      • Part 3 is August 5th
    • Also want an in-depth walkthrough of PMIx initialization / wireup

Discuss Open-MPI binding when direct-launched

  • Schizo SLURM binding detection - Might not need a solution on v4.0.x
  • PRs have gone into v4.0.x and v4.1.x

Release Branches

Review v4.0.x Milestones v4.0.5

  • Discussing CUDA init in UCX PML PR 7898
    • Looks like a bugfix, so should be okay to put into a release branch.
    • Is there a better place to initialize the CUDA hooks?
    • If we request a BTL or PML to be loaded, if configured with cuda
    • CUDA library is loaded by BTL that requires it.
    • Some questions about possibly making it more generic for all PMLs that use CUDA.
      • Don't want to load CUDA if only using TCP or shared memory
    • We'll take this PR once it passes CI and is reviewed.
  • v4.0.5 schedule: End of July
    • Will create RC1 today after PR7898 goes in.
    • Two potential drivers for a quick v4.0.5 turn-around.
      • OSC RDMA Bug - May drive a v4.0.5 release.
      • Program Aborts on detach.

Review v4.1.x Milestones v4.1.0

  • Schedule: Want to release end-of-July

    • A minimum of a week, need changes from George on collective components
  • Posted a v4.1.0 rc1 to go through mechanisms to ensure we can release.

  • Release Engineers: Brian (AWS) Jeff Squyres (Cisco)

  • Jeff is reviewing Collective components

    • Joseph also reviewing.
  • George found an SM BTL issue at Init on master. Jeff filed Issue 7937

    • Summary: Cacheline size is set very late after modex, everything that uses cacheline before modex.
    • This is a correctness issue (not optimization) - George on today's call
      • At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
        • Affects all the way back to v2.x
    • Backing file for shared memory is allocated by one proc, and used by many. Creates before MODEX (uses 128), but later when users try to use, they use a different 128 cacheline.
    • Looking at the code, we do this other places as well, but not as dramatic.
    • May need an alternative for master (after Brian finishes configury, to bring hwloc out of frameworks).
    • How do we fix this?
      • Can we just get the cacheline size before we get the rest of topology information? Brice said no.
      • Only solution we can see is creating an opal function to do this.
        • Would work, but might be slightly less optimal. If every rank does this in parallel, does this cause issues?
        • George can look for it, but can't do it before end of week.
        • Probably easiest to look in dstore in PMIx that's pretty clean (mentioned in ticket)
    • Who can do this work?
      • Showing itself in CUDA issue.
        • Tomislav Janjusic (nVidia) will ask some of his colleges.
    • Because we align some structs based on that, but
      • It would be associated with getting the topology (but not retreived until after the modex)
      • Only cuda btl calls the function directly, everyone else extracts from PMIx.
        • What we ought to do, no harm in getting topology earlier, just need to ensure PMIx is intialized.
        • On v4.1, we don't get the topology before someone requests it much later.
          • Must also affect v4.0.x
      • George put a fix into master, but making a better change to load it as soon as PMIx is intialized, would be much better.
        • Con is that if we're not in a PMIx environment to share this pointer, then every process will go do this discovery, even if they don't need it later.
        • Problem is that the process that creates the backing file, creates it very early.
    • Someone should review all the branches to Look to see if we got topology before someone uses the cacheline size.
    • George saw it in SM BTL structures. Deadlock.
    • This isn't tested by our CI infrastructure.
  • Still want:

    • George's Collectives
      • George is still working on master version of coll
      • Next thing he's working on today.
    • Will probably need to do something to CI to enable these for testing.
      • CI not really executing
      • IBM will do some testing of this.
      • Will need some docs on how users to select this.
    • Tunings for tuned coll
    • AVX
      • Went in this morning.
    • UCX PRs awaiting review.
  • Past: We've come to consensus for a v4.1.0 release

    • Need include/exclude selection, worried about consistent selection.
    • Alot of PRs outstanding, but can't merge until
      • Patch for OFI stuff messed up v4.1.x branch.
      • Howard has a fix PR, Jeff is looking at.
    • Howard changed new OFI BTL parameters to be consistent with MTL
    • Not breaking ABI or backwards compatibility.
    • v4.1.x branch, branched from v4.0.4 tag.
    • NOT touching runtime!!!
    • Not going to be pulling in a new PMIx version.
  • All MTT is online on v4.1.x branch

  • Not compiling under SLURM EFA test. (OFI BTL issue)

Review v5.0.0 Milestones v5.0.0

  • No update this week other than master discussion.

  • Need to put OSC pt2pt

    • OS RDMA requires a single BTL that can contact every single process.
      • This didn't use to be the case. (Comment in the code)
  • We can't use the OSC pt2pt.

    • It is not thread safe. Doesn't conform to MPI4 standard. Not safe.
    • This is just a testing falicy. Could add tests to show this, but still at same boat.
    • Either product A or B is broken and we need to fix it.
  • RDMA Onesided should fall back to "my atomics" because TCP will never have rdma atomics.

    • The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
  • Jeff will close the PR, and

  • Jeff will Nathan will fetching, get, compare and swap.

  • Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller.

  • Does UCX support iWarp?

    • Does libFabric support iWarp via verbs provider?
    • https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
    • Brian thinks that libFabric
    • OFI can support iWarp, just need to specify the provider in the include list.
    • The person asking is a partner, not a customer.
  • PMIX

    • Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
    • Sessions needs something from PMIx v4
    • ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
    • PPN scaling issue - simple algorithmic issue in this function
      • PMIX talked about it. Artem might know someone who might be interested in working on it.
      • Algorithm behind one of the interfaces doesn't scale well.
      • Not a regression. Above ~ 4K nodes, becomes quadratic.
  • PRRTE

    • Nothing's happening there.

master

  • Mostly discussed above.

Infrastructure

  • scale-testing, PRs have to opt-into it.

Review Master Master Pull Requests

CI status


Dependencies

PMIx Update

ORTE/PRRTE

MTT


Back to 2020 WeeklyTelcon-2020

  • Many companies are not allowing a face to face travel until 2021 due to COVID19.
    • Instead, let's do a series of virtual face-to-face meetings?
  • Yes this summer to discuss for v5.0
    • Maybe we can do it by topic?
    • Maybe not 4 or 8 hour things.
  • Different topics on different days.
  • Do a Doodle poll of least-worst days in late July/August.
    • August 10th-14th - 3 hour block of time 8-11 Pacific time.
    • Jeff will do another doodle for days of the week (vote for 2)
  • Start a list of topics.

MPI Forum was last week.

  • Sessions is now in.
  • Partitioned communication voted in.

Thread local storage issue

  • OpalTSDCreate - takes a thread-local storage key that would be tracked locally in OPAL.
    • But when we go to delete, it's not being deleted.
    • But want flexibility to destroy on our own or explicitly
    • George thinks the mode we have today, since tracking all keys to be released by main thread.
    • George thinks Artem's approach is the correct approach.
  • Would have to change the way that keys are USED, and different components are using it in a different way.
  • Something similar should be done in different places.
  • If you do it just for UCX, then others can see how you did it and check for their code.
  • So we think current PR is good, but it leaves old API and new API.
    • But it might be better to remove OLD way and make broken components do SOMETHING to update their code.
    • Should be easy for components to add explicit cleanup calls
  • Master branch only.
  • Opened a new Pull Request yesterday that addresses the problem as discussed last week.
  • Tracking of TLS in common code.
    • Have low-level thread-specific keys (very simple, based on thread implementation)
    • Tracked key, probably what you want to use if you want to ensure all TLS is accounted for and released at destruction of the key.
    • Tommy changed all of the places in OMPI where those keys are used. Just use tracked key instead of regular key.
    • Changed set_specific and get_specific to just set and get.
    • Please review and give suggestions.
  • Does it even make sense to do TLS in OPAL at all?
    • May indicate that we have an abstraction wrong somewhere.
    • If MPI depends on this in OPAL, then it depends on them in PMIx and other layers?
    • Not sure if there is a problem, but at a high level, sounds problematic.
  • Baking in pthread assumptions in general is not a good idea.
    • What this PR does is abstract pthread semantics.
  • May be some confusion, no problem with porting this API anywhere.
    • Issue raised before is that if you're relying on a certain type of thread in MPI layer.
    • But we don't, because there's a framework.
    • But the application is linked against PMIx and libevent, and using other threading models is dangerous.
      • To make this work, you have to make changes to event polling, etc.
  • Not saying we shouldn't take these patches, these make things better.
    • But we do have a problem: other thread components aren't going to "just work", because PMIx and libevent, which use pthreads, conflict with other threading models.
      • argobots actually uses pthreads, not sure about qthreads.
      • Working on a way to configure libevent to make this combo work.
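
The tracked-key idea discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the PR: the names `tracked_key_t`, `tracked_key_create`, `tracked_key_set`, `tracked_key_get`, `tracked_key_delete` and the fixed `MAX_TRACKED` capacity are all hypothetical. It wraps a plain pthread key and remembers every value stored through it, so that everything can be released when the key is destroyed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MAX_TRACKED 16 /* hypothetical fixed capacity, for the sketch only */

/* Hypothetical "tracked key": a pthread key plus a record of stored values. */
typedef struct {
    pthread_key_t key;
    void (*destructor)(void *);
    void *values[MAX_TRACKED];
    int count;
    pthread_mutex_t lock;
} tracked_key_t;

int tracked_key_create(tracked_key_t *tk, void (*destructor)(void *))
{
    assert(tk != NULL);
    tk->destructor = destructor;
    tk->count = 0;
    pthread_mutex_init(&tk->lock, NULL);
    /* No pthread-level destructor: cleanup is driven by tracked_key_delete. */
    return pthread_key_create(&tk->key, NULL);
}

/* Store a value for the calling thread and remember it for later release. */
int tracked_key_set(tracked_key_t *tk, void *value)
{
    int rc = -1;
    int known = (value == NULL); /* NULL is never tracked */
    assert(tk != NULL);
    pthread_mutex_lock(&tk->lock);
    for (int i = 0; i < tk->count; i++) {
        if (tk->values[i] == value) {
            known = 1; /* avoid releasing the same pointer twice */
        }
    }
    if (known || tk->count < MAX_TRACKED) {
        if (!known) {
            tk->values[tk->count++] = value;
        }
        rc = pthread_setspecific(tk->key, value);
    }
    pthread_mutex_unlock(&tk->lock);
    return rc;
}

void *tracked_key_get(tracked_key_t *tk)
{
    assert(tk != NULL);
    return pthread_getspecific(tk->key);
}

/* Release every tracked value, then delete the key.
 * Returns the number of values passed to the destructor. */
int tracked_key_delete(tracked_key_t *tk)
{
    int released = 0;
    assert(tk != NULL);
    pthread_mutex_lock(&tk->lock);
    for (int i = 0; i < tk->count; i++) {
        if (tk->destructor != NULL) {
            tk->destructor(tk->values[i]);
            released++;
        }
    }
    tk->count = 0;
    pthread_mutex_unlock(&tk->lock);
    pthread_mutex_destroy(&tk->lock);
    pthread_key_delete(tk->key);
    return released;
}
```

A caller would create the key once, have each thread call `tracked_key_set` with its own allocation, and call `tracked_key_delete` from whichever thread tears the component down. The sketch hard-codes pthreads, which is exactly the assumption the discussion above wants hidden behind a threading framework.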

C11 atomic usage is a mess

  • Last week:
    • George needs some input on PR
    • We don't need _atomic_ in most cases just need volatile
    • patch linked to the issue PR7914
    • We're not breaking things, we just get a lot of valid complaints from the Intel compiler.
      • STDOUT of make is ~16 MB due to all the Intel compiler warnings without this fix
  • There is a PR pending

Discuss Open-MPI binding when direct-launched

  • Schizo SLURM binding detection - Might not need a solution on v4.0.x
  • PRs have gone into v4.0.x and v4.1.x

Open Source Parent organization

  • Since Open-MPI is a registered non-profit.
  • If we log volunteer time we can
    • Software in the Public Interest (Parent non-profit)
  • A week or two

Release Branches

Review v4.0.x Milestones v4.0.5

  • Discussing CUDA init in UCX PML PR 7898
    • Looks like a bugfix, so should be okay to put into a release branch.
    • Is there a better place to initialize the CUDA hooks?
    • If we request a BTL or PML to be loaded, if configured with cuda
    • CUDA library is loaded by BTL that requires it.
    • Some questions about possibly making it more generic for all PMLs that use CUDA.
      • Don't want to load CUDA if using only TCP or shared memory
    • We'll take this PR once it passes CI and is reviewed.
  • v4.0.5 schedule: End of July
    • Will create RC1 today after PR7898 goes in.
    • Two potential drivers for a quick v4.0.5 turn-around.
      • OSC RDMA Bug - May drive a v4.0.5 release.
      • Program Aborts on detach.

Review v4.1.x Milestones v4.1.0

  • Schedule: Want to release end-of-July

    • A minimum of a week, need changes from George on collective components
  • Posted a v4.1.0 rc1 to go through mechanisms to ensure we can release.

  • Release Engineers: Brian (AWS) Jeff Squyres (Cisco)

  • Jeff is reviewing Collective components

    • Yoseph also reviewing.
  • George found an SM BTL issue at Init on master. Jeff filed Issue 7937

    • Summary: Cacheline size is set very late, after modex, but some code uses the cacheline size before modex.
    • This is a correctness issue (not optimization) - George on today's call
      • At one point we switched to only pulling in topology when we had to, that probably introduced this issue.
        • Affects all the way back to v2.x
    • Backing file for shared memory is allocated by one proc, and used by many. It is created before MODEX (using 128), but later, when users try to use it, they use a different cacheline size.
    • Looking at the code, we do this other places as well, but not as dramatic.
    • May need an alternative for master (after Brian finishes configury, to bring hwloc out of frameworks).
    • How do we fix this?
      • Can we just get the cacheline size before we get the rest of topology information? Brice said no.
      • Only solution we can see is creating an opal function to do this.
        • Would work, but might be slightly less optimal. If every rank does this in parallel, does this cause issues?
        • George can look for it, but can't do it before end of week.
        • Probably easiest to look in dstore in PMIx that's pretty clean (mentioned in ticket)
    • Who can do this work?
      • Showing itself in CUDA issue.
        • Tomislav Janjusic (NVIDIA) will ask some of his colleagues.
    • Because we align some structs based on that, but
      • It would be associated with getting the topology (but not retrieved until after the modex)
      • Only cuda btl calls the function directly, everyone else extracts from PMIx.
        • What we ought to do: no harm in getting topology earlier, just need to ensure PMIx is initialized.
        • On v4.1, we don't get the topology before someone requests it much later.
          • Must also affect v4.0.x
      • George put a fix into master, but a change that loads it as soon as PMIx is initialized would be much better.
        • Con is that if we're not in a PMIx environment to share this pointer, then every process will go do this discovery, even if they don't need it later.
        • Problem is that the process that creates the backing file, creates it very early.
    • Someone should review all the branches to see whether we get the topology before anything uses the cacheline size.
    • George saw it in SM BTL structures. Deadlock.
    • This isn't tested by our CI infrastructure.
  • Still want:

    • George's Collectives
      • George is still working on master version of coll
      • Next thing he's working on today.
    • Will probably need to do something to CI to enable these for testing.
      • CI not really executing
      • IBM will do some testing of this.
      • Will need some docs on how users select this.
    • Tunings for tuned coll
    • AVX
      • Went in this morning.
    • UCX PRs awaiting review.
  • Past: We've come to consensus for a v4.1.0 release

    • Need include/exclude selection, worried about consistent selection.
    • A lot of PRs outstanding, but can't merge until
      • Patch for OFI stuff messed up v4.1.x branch.
      • Howard has a fix PR, Jeff is looking at.
    • Howard changed new OFI BTL parameters to be consistent with MTL
    • Not breaking ABI or backwards compatibility.
    • v4.1.x branch, branched from v4.0.4 tag.
    • NOT touching runtime!!!
    • Not going to be pulling in a new PMIx version.
  • All MTT is online on v4.1.x branch

  • Not compiling under SLURM EFA test. (OFI BTL issue)
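
The cacheline mismatch described under Issue 7937 above can be illustrated with a small sketch. It is not the SM BTL code: the layout (a header followed by cacheline-padded slots), the function names `align_up` and `slot_offset`, and the sizes used in the usage note are hypothetical. It only shows why a creator that assumes one cacheline size and a user that assumes another end up at different offsets in the same backing file.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (align must be a power of two). */
size_t align_up(size_t offset, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Hypothetical shared-segment layout: a header, then fixed-size slots,
 * each starting on a cacheline boundary. Returns the byte offset of
 * slot `index` for a given assumed cacheline size. */
size_t slot_offset(size_t header_bytes, size_t slot_bytes,
                   size_t cacheline, size_t index)
{
    size_t first = align_up(header_bytes, cacheline);
    size_t stride = align_up(slot_bytes, cacheline);
    return first + index * stride;
}
```

With a 40-byte header and 72-byte slots (made-up numbers), a process assuming a 128-byte cacheline computes slot 1 at offset 256, while a process assuming 64 bytes computes it at offset 192. Both touch the same file but disagree about where the structures live, which is why this is a correctness problem and not just a performance one.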

Review v5.0.0 Milestones v5.0.0

  • No update this week other than master discussion.

  • Need to put OSC pt2pt

    • OSC RDMA requires a single BTL that can contact every single process.
      • This didn't use to be the case. (Comment in the code)
  • We can't use the OSC pt2pt.

    • It is not thread safe. Doesn't conform to MPI4 standard. Not safe.
    • This is just a testing fallacy. Could add tests to show this, but we'd still be in the same boat.
    • Either product A or B is broken and we need to fix it.
  • RDMA Onesided should fall back to "my atomics" because TCP will never have rdma atomics.

    • The idea was to put the atomics into the BTL base, which could do all of the one-sided atomics under the covers.
  • Jeff will close the PR, and

  • Jeff will Nathan will fetching, get, compare and swap.

  • Two new PRs for MPI4.0 Error handling - new PRs from Aurelien Bouteiller.

  • Does UCX support iWarp?

    • Does libFabric support iWarp via verbs provider?
    • https://github.com/openucx/ucx/issues/2507 suggests it doesn't.
    • Brian thinks that libFabric
    • OFI can support iWarp, just need to specify the provider in the include list.
    • The person asking is a partner, not a customer.
  • PMIX

    • Working on PMIx v4.0.0 which is what Open MPI v5.0 will use.
    • Sessions needs something from PMIx v4
    • ULFM - not sure if it needs PMIx, think it needs PRRTE changes.
    • PPN scaling issue - simple algorithmic issue in this function
      • PMIX talked about it. Artem might know someone who might be interested in working on it.
      • Algorithm behind one of the interfaces doesn't scale well.
      • Not a regression. Above ~ 4K nodes, becomes quadratic.
  • PRRTE

    • Nothing's happening there.

master

  • Mostly discussed above.

Super Computing Birds-of-a-feather

  • George and Jeff will help plan and come to community.
    • Done / Submitted.
  • May not have Super Computing conference at ALL this year.
  • Many other projects are doing a virtual state of the union type meeting to try to cover what they'd usually do in a Birds of a feather meeting.
  • If this works pretty well, we could do it a couple of times a year.
  • Not constrained to Super Computing
  • Almost certain that it will be virtual
    • Not sure the cost.
    • Ralph and Jeff have been doing ABCs of Open MPI - SO many people. Done 2 of 3 sessions (each went 1.5 hours, lots of questions)
      • Slides and YouTube videos are on the website, and will send link to the users list.
      • Part 3 is August 5th
    • Also want an in-depth walkthrough of PMIx initialization / wireup

Infrastructure

  • scale-testing, PRs have to opt-into it.

Review Master Master Pull Requests

CI status


Dependencies

PMIx Update

ORTE/PRRTE

MTT


Back to 2020 WeeklyTelcon-2020
