
Enable serialization and shared memory #410

Merged
merged 56 commits into from
Sep 6, 2024
Conversation

Yurlungur
Collaborator

@Yurlungur Yurlungur commented Aug 24, 2024

PR Summary

This PR enables both serialization and shared memory for all equations of state. This is intended for use with MPI Window shared memory on, e.g., Rocinante and other memory-limited environments. Note there is still an outstanding issue with the EOSPAC backend (see #408), which I will need to resolve in a future MR. I'm still thinking about how best to resolve that particular issue, and insight is welcome.

Another outstanding issue is that the CI does not currently test against EOSPAC with shared memory enabled. When possible, we should update the CI and thread this through the spackage. @rbberger, maybe you can handle that in a separate PR at some point? See #409.

When shared memory is not enabled in EOSPAC, the machinery all still works. An EOSPAC EOS will report that it cannot utilize any shared memory, but the serialization and de-serialization routines will still work. The CI on Darwin will test this and I've also tested it by hand.

In the meantime, I plan to test the shared memory case by hand after I get a more up-to-date eospac environment set up through spack.

PR Checklist

  • Adds a test for any bugs fixed. Adds tests for new features.
  • Format your changes by using the make format command after configuring with cmake.
  • Document any new features, update documentation for changes made.
  • Make sure the copyright notice on any files you modified is up to date.
  • After creating a pull request, note it in the CHANGELOG.md file.
  • LANL employees: make sure tests pass both on the github CI and on the Darwin CI

If preparing for a new release, in addition please check the following:

  • Update the version in cmake.
  • Move the changes in the CHANGELOG.md file under a new header for the new release, and reset the categories.
  • Ensure that any when='@main' dependencies are updated to the release version in the package.py

@Yurlungur
Collaborator Author

Thanks for the feedback @jhp-lanl ! I think it should all be addressed now... modulo (a) CI issues and (b) the EOSPAC question we discussed. I will resolve the latter in a separate MR after doing some testing in a self-contained reproducer.

> I'm guessing you're also missing an ifdef in the serialization tests which is causing the minimal tests to fail.

Nope, it was actually unsanitized input in test_sg_eos after MR #373. Not sure why it only showed up now; it should have shown up in earlier MRs.

@jhp-lanl
Collaborator

> Nope, it was actually unsanitized input in test_sg_eos after MR #373. Not sure why it only showed up now. Should have shown up in earlier MRs.

wow! Thanks for the find and fix!

// extract each individual EOS and do something with it
std::vector<EOS> eos_host_vec = deserializer.eos_objects;
// get on device if you want
for (auto &eos : eos_host_vec) {
Collaborator

This could be a can of worms at some point when we want to have 1 copy per GPU rather than 1 copy per node. Some host codes may want to hand us GPU memory so we may need to have a facility to handle that case at some point.

Collaborator Author

What this will currently produce is 1 copy per MPI rank on device and 1 copy per node on host, because GetOnDevice will do its own allocation. The case where we want device shared memory is one I'm aware of but haven't thought deeply about. My suggestion is we punt on this but agree there's an engineering problem to solve here.

@@ -32,11 +32,21 @@ constexpr unsigned long all_values = (1 << 7) - 1;
 } // namespace thermalqs

 constexpr size_t MAX_NUM_LAMBDAS = 3;
-enum class DataStatus { Deallocated = 0, OnDevice = 1, OnHost = 2 };
+enum class DataStatus { Deallocated = 0, OnDevice = 1, OnHost = 2, UnManaged = 3 };
Collaborator

Do we have sufficient coverage in our testing to make sure we are catching the potential issues? I recall this took up many of @Yurlungur's cycles to get this to work correctly back in the day.

Collaborator Author

I believe so. I think our tests should catch this if there's an issue, but I can't promise that. The UnManaged tag just makes Finalize a no-op.

}

rhoT_orig.Finalize();
rhoSie_orig.Finalize();
Collaborator

Maybe I knew this once and forgot, but I'm realizing that this will likely require substantial modifications in the Fortran interface to account for the new ways of initialization and finalization.

Collaborator Author

We will need to expose a serialization/deserialization interface on the Fortran side, yes. That will need to be on the tasking for next FY.

Collaborator

Good point! I'll start thinking about this when the work packages are more finalized.

@Yurlungur
Collaborator Author

Tests pass on re-git (thanks @rbberger for fixing the CI!) and integration testing passed in riot downstream (thanks @AstroBarker!). So I'm calling this good and pulling the trigger.

@Yurlungur Yurlungur merged commit 54c4081 into main Sep 6, 2024
5 checks passed
@Yurlungur Yurlungur deleted the jmm/serialization branch September 6, 2024 16:35
@Yurlungur Yurlungur mentioned this pull request Sep 11, 2024
9 tasks
Labels: enhancement (New feature or request), interface (Library interface work)