Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix so that MPMD tests run #2452

Merged
merged 9 commits into from
Sep 29, 2020
Merged

Fix so that MPMD tests run #2452

merged 9 commits into from
Sep 29, 2020

Conversation

eisenhauer
Copy link
Member

PR #2223 introduced the CMake variable ADIOS2_RUN_MPMD_TESTS, but with an inconsistent name. This PR changes it consistently to ADIOS2_RUN_MPI_MPMD_TESTS.

Closes issue #2451

@eisenhauer
Copy link
Member Author

@pnorbert and @JasonRuonanWang . So, nothing is easy. This restores tests that got turned off inadvertently in May. But the world has changed since then, and turning those tests back on introduces failures with MSMPI (so far) in SSC and burst buffer tests. Can you guys please propose fixes? Perhaps just push to this PR?

@JasonRuonanWang
Copy link
Member

I am working on it. This is really weird. I can understand the Fortran part, as I don't have any Fortran tests. But for the C++ part, I don't think the SSC tests are very different from the common staging tests, but they are still failing...

@eisenhauer
Copy link
Member Author

eisenhauer commented Sep 4, 2020 via email

@JasonRuonanWang
Copy link
Member

Generally the FORTRAN bit is testing to see that you do the adios array dimension reordering between c/c++ and FORTRAN. (Odd that my autocorrect insists on capitalizing FORTRAN...)

Yes, that's what I figured too, so I said I can kind of understand that part. But there is also some non-Fortran stuff failing too, which is probably more problematic...

@JasonRuonanWang
Copy link
Member

@eisenhauer The current fortran reader only reads one step and then calls close directly, whereas the C++ writer writes multiple steps. This cannot work with any MPI based engines as the writer's and reader's behaviors do not match. I have added step status checking in the fortran reader and it solved some of the failed tests. I am still trying to fix others.

@eisenhauer
Copy link
Member Author

Hmm. Yeah, I probably knew that about the fortran reader. I think that reader prog has been like that since it was created, so the behavior isn't new. Did this somehow work before these tests were disabled?
Also, it's probably best to separate this "early close" behavior from a "does reading from fortran" work test, but what would be the restriction on the reader behavior then? Do they have to read all the events the writer supplies, regardless? Any way to stop?

@JasonRuonanWang
Copy link
Member

Hmm. Yeah, I probably knew that about the fortran reader. I think that reader prog has been like that since it was created, so the behavior isn't new. Did this somehow work before these tests were disabled?
Also, it's probably best to separate this "early close" behavior from a "does reading from fortran" work test, but what would be the restriction on the reader behavior then? Do they have to read all the events the writer supplies, regardless? Any way to stop?

I am not sure actually why it worked before. But it shouldn't anyway. In any engine that uses MPI, it will have to issue a collective call (or a matched pair of MPI_send and MPI_recv) across the writers and the readers at every step. So if their behaviors do not match then it won't work by definition. Checking the step status on the reader side is the only way to match them. If somehow it worked before then I would consider it a bug.

I understand your point but I have several reasons to have this common fortran reader having step status checking.

First, the common C++ reader has step status checking. If the common fortran reader does not have it then the 1x1 test and 1x1.Fortran test will do completely different things, which in my opinion they shouldn't. If there is a need to keep this early stop reader feature, then we probably should create it for both C++ and fortran, for example 1x1.EarlyStopReader and 1x1.Fortran.EarlyStepReader. In this case, different engines can choose what they want explicitly, rather than having this mismatched behaviors across languages.

Second, I would consider a reader code that has step status checking the most standard way of using ADIOS2, especially for staging engines where the reader does not have any idea how many steps there is, until BeginStep returns EndOfStream. I would be surprised if many staging codes need to stop the reader at a different point than EndOfStream.

@eisenhauer
Copy link
Member Author

I agree completely with upgrading the fortran reader. It exists as it does because it was thrown together so we had some check to see if the dimension reordering was happening. It got committed in that way and hasn't been looked at much since.
WRT the staging codes that might need to stop the stream cleanly before EOS, I guess the obvious one is visualization, but that's our connect-and-disconnect at will scenario. I guess we can argue that that won't ever be possible with MPI staging engines. In any case, separating the testing of the two features (Fortran and pre-EOS close()) makes a ton of sense.

@mw70
Copy link

mw70 commented Sep 9, 2020

An additional "early exit" scenario that might be worth explicitly trying to test for in MPMD mode is clean shutdown in coupled codes, since that will involve both reader and writer functions. It would be good to know that a clean exit is possible in that scenario regardless of ordering of the closes, even if one of the codes went on to do some mop-up file writing, etc. at the end.

@JasonRuonanWang
Copy link
Member

@eisenhauer I have fixed all SSC related problems. Is anyone working on the other bugs?

@JasonRuonanWang
Copy link
Member

@pnorbert Have you looked at these errors in Utils.IOTest.BurstBuffer.1to1.BP4.Stream.Run? Any suggestions?

@eisenhauer
Copy link
Member Author

FYI Norbert, the failures are all windows and all the same:
ADIOS ERROR: couldn't read from file pipe2_write.bp\md.idx, in call to fstream read
: iostream stream error

…ata repetitively, and passed the staging-common tests
@JasonRuonanWang
Copy link
Member

@eisenhauer There comes another failed test Engine.Staging.TestThreads.Basic.SST.BP.Serial on Mac. I didn't touch it, but it started failing somehow...

@eisenhauer eisenhauer merged commit 994ae04 into ornladios:master Sep 29, 2020
@eisenhauer eisenhauer deleted the MPMDFix branch September 29, 2020 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants