WeeklyTelcon_20151215

Geoff Paulsen edited this page Dec 15, 2015 · 5 revisions

Open MPI Weekly Telcon Minutes 12/15/2015


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Edgar Gabriel
  • Geoffroy Vallee
  • Howard
  • Joshua Ladd
  • Nathan Hjelm
  • Ralph
  • Ryan Grant
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
  • 1.10.2 - still 3 PRs waiting to go into 1.10.2
    • Ibarrier thrown to Nathan. Found through MPICH test suite.
    • Reported by Jim Sharp; Ralph cleaned it up, put it into 1.10, and handed it to Jeff S.
    • Integer Overflow - Thrown to George. Ralph will ping him.
      • In coll/allreduce - from Jeff Hammond's big MPI thing.
      • Recasts to size_t to do math, and then recasts down to int.
      • Nathan - should double-check the math, since it might still overflow.
      • Should evaluate these codepaths a bit better.
    • PR on master, but tagged with 1.10.2
      • Jeff S will look at today, and may then be able to PR to 1.10.2
    • Nathan has one more: unmemmap a pointer belonging to OSHMEM.
      • One-line change, and he will bring it over soon.
    • Subarray 1191 on master. Jeff hasn't been following.
      • Need to fork off to George.
    • Edgar - Email about ROMIO / Lustre issue
      • Issue is fixed in OMPIO Master, but not on 1.10. 1.10 OMPIO is vastly out of sync with Master.
      • QUESTION: should we update OMPIO on 1.10?
        • Some changes in the Framework stuff, but if we pull it over it will drag a lot of other items.
        • DECISION: Let's NOT update OMPIO on 1.10.x for now, encourage people to
  • After these are done will roll an RC later this week.

Review 2.0.x

  • Wiki: https://github.com/open-mpi/ompi/wiki/Releasev20

  • Blocker Issues: https://github.com/open-mpi/ompi/issues?utf8=%E2%9C%93&q=is%3Aopen+milestone%3Av2.0.0+label%3Ablocker

    • PMIx is Howard's #1 blocker right now. We need to decide what we want to do.
      • Putting off supporting external PMIx in the release candidate.
        • Distros won't pick it up if we don't support it in 2.0.0
      • What can we do with PMIx for 2.0 RC?
        • Ralph - It's relatively clean. Ralph will pull Master 1.1.2 over to OMPI 2.0 branch later today.
    • News and shlib version stuff.
      • Howard will do News, and share with others to review.
    • Addprocs == 0 discovery.
      • Running out of resources in a different way.
      • Only happens with per-peer queues in openib.
        • Thought we'd gotten rid of those years ago, no performance advantage.
      • Not really a blocker, then; the blocker would be ensuring that we got rid of non-SRQ mode.
        • Nathan will review old email and code.
        • Anywhere we have free_list_wait, we get into infinite loops.
    • Debugger attachment issue broken on master and 2.x branch
      • Processes don't progress until debugger progresses, which can't attach until proctable has been created.
      • Some debuggers provide a flag that they turn on when they're attached.
      • So we added code so that rank 0 won't progress until it gets an RML message from mpirun saying that the debugger has attached to mpirun.
        • With PMIx we removed that RML message, which we only need for this.
      • Ralph is proposing we use PMIx error handling to get this message.
        • Problem is that the PMIx error handling code is in PMIx 1.2.0, but Master / 2.0 is currently at 1.1.2.
      • This is an issue tagged somewhere else. Might be on the Totalview side.
      • QUESTION: SHOULD we update PMIx on master / 2.0 branch to be PMIx 1.2.0?
      • Jeff's been using DDT, but we missed that we broke totalview.
        • Totalview still uses MPIR_being_debugged.
        • There are two different ways to tell the MPI implementation that we're being debugged.
      • Ralph will try to setup a time on Doodle to further discuss.
  • RFC: remove embedded libevent and hwloc

    • PMIx also links against libevent, so when it's internal to OMPI, some symbols come from internal and some come from external linkage.
    • When we build OMPI AND PMIx both against external libevent and hwloc, then everything works.
      • Nathan: Do we rename EVERY name in INTERNAL? Ralph: YES, renamed EVERYTHING and it still doesn't work.
    • Packagers keep asking to take out embedded versions.
      • If we do that, we need to tighten up our libevent and hwloc version testing.
      • Can we test if libevent has threading on? We do that NOW with external libevent usage.
    • Howard likes this in general, but sees how sysadmins or Mac users won't like having to download other packages.
    • PROPOSAL - in configure if you say with-external-pmix then you have to have the other two external.
      • Ralph will add this to 2.0 branch with configure logic
  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v2.0.0

    • Ryan and Todd fixing some portals disconnect between 2.0 and master.
    • some I collectives for Nathan. - DONE
    • Ralph did you try Group changes to see if it fixes your MTTs? PR #829 on release.

RFCs

  • RFC: remove embedded libevent and hwloc (see above under 2.0.x section)

Development Tips

  • Protip - You can add "Fixes: ISSUE#" to PR, then when PR is merged, it will close the issue.
    • Yes, it works across GitHub repos.
  • Howard, When creating a downstream PR, they add a

Review Master?

MTT status:

  • Jenkins runs are being done at NERSC. That means we're down to one slave node at LANL, so we are turning the Jenkins test off tomorrow.
  • Used to always have 2 systems, so it was balancing between Hopper and Edison. Edison is down for 4 weeks.
  • Ralph now has a static IP port on firewall.
    • Howard is specifically using Cray software stack.
  • Travis testing https://docs.travis-ci.com/user/pull-requests
    • Jeff checked it out, created a PR.
    • Travis seems easier than Jenkins. Upgrading Jenkins is harder, and the GitHub plugin for Jenkins is not great.
      • Ralph concerned that it came out of nowhere.
    • Very simple, just Make tests.
      • Intel played with it but walked away, because it would kick back and say that something was wrong, and without access to the system you just play whack-a-mole and eat up time trying to get a commit out.
    • Howard - The Travis that Jeff set up is very open; anyone can look at the logs.
      • Want to look at entire log file. Need to go to right side to open it up.
    • Travis is taking a long time today, because of server maintenance.
    • Ryan has a VM identical to Travis machine, so he can rapidly compile it and send it up to Travis.
      • So if we do this, then Travis doesn't do anything useful.
    • The whack-a-mole problem is just the nature of the way we do testing, since we don't all have access to the machine.
    • The drawback to Travis is that we don't have an owner.
    • Howard - Travis is MUCH faster if we use containers. We need to look into that.
      • Ryan didn't see much time difference.
    • Mac OS X - Still need some coverage here.
    • We'll try using Travis for a while, and see "how bad is it".
      • Bad failures aren't useful.
      • If it continues to take over an hour just to play whack-a-mole, that will be very frustrating.
      • We're using legacy mode, and over time this might be an issue.

Status Updates:

  • Didn't get to these this week.

NO MEETING NEXT WEEK (12/22/2015)


Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, HLRS, IBM

Back to 2015 WeeklyTelcon-2015