-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adapt libfabric dataplane of SST to Cray CXI provider #3672
Conversation
Thanks Franz, happy to help look at this too. Might not have time until later this week, but I'll be watching in case you get more info before then. |
Thanks Greg. I am also willing to put time into this, but this concerns several systems that I don't really know about, so I'm rather limited in what I can do, and I'm currently somewhat stuck on this issue of "loading zero data". The output of the writer is the following:
The reader:
|
Are you just concerned about the zero bytes in the "summary info" at the end? If so, that's more of an informational thing and it appears that it wasn't actually implemented in the original RDMA data plane, so that value never gets updated and you can ignore it. (I can sort actually implementing it, but perhaps not until Friday because I have ORNL visitors here today and tomorrow.). So the question is whether or not you're getting the right data in the read buffer. |
It's not only that, no. In the call to static int WaitForAnyPull(CP_Services Svcs, Rdma_RS_Stream Stream)
{
FabricState Fabric = Stream->Fabric;
RdmaCompletionHandle Handle_t;
struct fi_cq_data_entry CQEntry = {0};
ssize_t rc;
rc = fi_cq_sread(Fabric->cq_signal, (void *)(&CQEntry), 1, NULL, -1);
if (rc < 1)
{
struct fi_cq_err_entry error;
fi_cq_readerr(Fabric->cq_signal, &error, 0);
Svcs->verbose(Stream->CP_Stream, DPCriticalVerbose,
"failure while waiting for completions WaitForAnyPull "
"(%d (%s - %s)).\n",
rc, fi_strerror(error.err),
fi_cq_strerror(Fabric->cq_signal, error.err,
error.err_data, NULL, error.len));
return 0;
}
else
{
Svcs->verbose(
Stream->CP_Stream, DPTraceVerbose,
"got completion for request with handle %p (flags %li).\n",
CQEntry.op_context, CQEntry.flags);
Handle_t = (RdmaCompletionHandle)CQEntry.op_context;
Handle_t->Pending--;
Stream->PendingReads--;
// TODO: maybe reuse this memory registration
if (Fabric->local_mr_req)
{
fi_close((struct fid *)Handle_t->LocalMR);
}
}
return 1;
} Latching onto this with GDB:
|
Ah, OK, puzzling. I have limited time today and tomorrow, but I'll look when I can and let you know if I have some kind of insight... |
OK, I'm trying to run this, but I'm getting somewhat different output on the reader (sst2.txt): That there's a failure in WaitForAnyPull is consistent with not getting any data, but I haven't sorted out a possible cause yet. Also trying to sort FI_LOG_LEVEL to get internal info from libfabric. I thought I had output from that, but then I switched to starting from your script and I don't seem to be getting anything. Will keep poking. |
Thank you for trying this out
I did get this PERM_VIOLATION error in two different situations:
My guess: That this runs without an apparent error at a low vector length for me is probably a fluke?
In my tests, the FI_LOG_LEVEL=Debug output was pretty much useless except when connecting to the network. Maybe there is some secret Cray debugging flag that I don't know, but their implementation seems to mostly ignore the FI_LOG_LEVEL.. |
What just came to my mind: The MPI dataplane only works when launching at least two processes each for writer and reader. I will try out next week if this might make a difference here. Also, I generally played a lot with |
Using two processes per writer and per reader did not make a difference. |
Well, worth a shot. Unfortunately I'm on travel this week and not likely to get back to this until I return. |
I have isolated a minimal example of libfabric as it is used in SST here: https://codebase.helmholtz.cloud/poesch58/libfabric-minimal-example |
Alright, the minimal example does reproduce the exact same issue that I see in this PR. I have two tags in the Git repo:
There is no direct failure in the CXI run, no error message, there is even a completion even from |
I think I have good news: I received feedback from HPE on the minimal reproducer and it seems like it was not too far from a working implementation. A first attempt shows this result from the minimal example on Crusher:
(The length of received data is apparently just not reported in this call) |
Well, at least somewhat promising. Next steps? I can poke at this at some point, but things are getting complex with next week being a holiday week. |
205c9f8
to
14ffd0d
Compare
I tried to integrate the minimal example thing back into ADIOS2 just now and had the first functioning libfabric-based SST run on Frontier (I used Frontier since the Crusher queue is currently full). So it seems like this is the right approach and the fix suggested by Cray/HPE was the right one. I pushed my changes (and also rebased this PR onto the
This is fine. I will continue working on this PR and on the points mentioned above, but this is not super urgent for me and the code is currently far from being mergeable anyway. |
@franzpoeschel thanks for your contribution, I will review this next week (running out of cycles for adios2 this week). In the meanwhile please fix the git conflicts. |
This PR is not yet really in a reviewable state. EDIT: I just saw that something I did triggered a review request somehow. No idea how that happened, you can ignore this for now. |
@franzpoeschel Happy to lend a hand with integration too (after the US holiday). |
I have pushed a commit that ideally adds those fixes for speculative preload mode as well. |
While the Preload modes had confirmed positive effects on the sockets-based data plane (big improvements on simple bandwidth tests), I was never as clear that they would be a win in an RDMA situation. But as long as the RDMA parts work, we can work on Preload at our leisure... |
I was just wondering since the RDMA/libfabric implementation has a rather complex preload logic which I was not actually able to activate. |
I successfully ran a 128-node setup with this using PIConGPU on Frontier, so this is beginning to look like a breakthrough. Do you know if there is any logic in SST to evenly distribute the network cards found on a node among jobs? On Frontier, there are generally four per node (cxi0, cxi1, cxi2, cxi3) and it would probably be good to use them evenly. |
So far as I know, no, there is no SST logic to select network interfaces. We have relied upon the ability to set the FI_INTERFACE variable when necessary, but that is only sufficient to specify a functional interface, not to distribute load between multiple. So if that's necessary, we'll have to add something. |
Ok, that's what I suspected. Choices that I see for this:
I think that the more complex options can be done in a later PR while this PR should focus on getting CXI runnable. |
I successfully ran a benchmark at 4096 nodes now. The performance of libfabric-based SST seems to be slightly better than that of MPI-based SST. What I'm measuring is the "perceived" throughput, i.e. the throughput based on the time from load request to load completion on the reader site. This figure is skewed by communication overhead. These figures are from a single benchmark each for now, but the results are roughly what I expected. |
Fantastic! I was paging through the changes to rdma_dp and it shouldn't be hard to integrate in such a way that we support the prior providers as well. Quick question, I notice that you're using manual progress rather than auto (which I recall was an earlier problem with cxi). Do you have to have an provision to check the completion queue in the background so that transfers continue to progress even if ADIOS is busy doing something other than rdma calls? |
CXI only supports manual progress, specifying automatic progress will not let you select the CXI provider. From how I understand the SST implementation in ADIOS2, automatic progress is not needed anyway: On the writer side, SST runs in its own thread and uses blocking I/O within that thread; on the reader side, loading data is written in blocking way as well (ref.
The biggest challenge is probably to not accidentally break libfabric for other systems because some configuration detail changed. These things are hard to test automatically. |
e9fd5f9
to
580cf23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now enabled automatically via CMake by checking if the CXI header can be included.
I've done a small test on my local computer, this now works with the sockets provider as well as with the CXI provider.
Remaining todo:
- Rebase back onto master
- Test on Summit, maybe do another somewhat larger test on Frontier if this is still working properly (though I don't see why not)
- Remove changes from SST example again?
Before I rebase this back onto master, this diff is more precise than the one shown in the PR: release_29...franzpoeschel:ADIOS2:libfabric-cray
Todo after this PR:
- Some form of polling for better supporting
FI_PROGRESS_MANUAL
- Maybe add an environment variable that lets you select the provider directly. Testing SST against the sockets provider was a bit annoying, since the
FABRIC_IFACE=eno1
was matched by multiple providers and the tcp provider (selected by the dataplane) did not work. I can do that in a new PR, it should be a simple enough change.
6518dad
to
7428848
Compare
Looks like clang-format wants to reorder your includes: #ifdef SST_HAVE_CRAY_CXI If stdbool.h is required for rdma/fi_cxi_ext.h (and so must come first), you can either move it before the #ifdef (shouldn't hurt for other includes), or put a blank line between the two (so clang-format won't mess with it). |
It is required for that header, yes. I'll add a little comment in between, that usually keeps clang-format from touching the include order. |
OK, @vicentebolea might have opinions about the CMake modes, but if they work, they're a start. @ax3l , you've tested on Frontier and Perlmutter? @pnorbert, before merging we probably need to test there everywhere we can that uses RDMA to make sure we don't have a regression. I can do summit, but I don't have as much access to other platforms as you probably do... |
There does seem to be a regression that affects Summit. I'll try to figure out where it goes wrong. |
Fixed now and tested on Summit. |
@franzpoeschel If you can fix the formatting and rebase this on current master, we can probably merge this. |
FI_MR_BASIC is special and mode bits cannot be compared against it.
5b0e1ec
to
6474b6a
Compare
done |
* master: Update readme for heat transfer example with new location and build instructions Ignore tests with defects for now Adapt libfabric dataplane of SST to Cray CXI provider (ornladios#3672) ci: fix path to lsan suppressions, fix broken gh status post Use adios2_mode_readRandomAccess in matlab open to make it work for BP5 (ornladios#3956) Add Global Array Capabilities and Limitations Add Section for Anatomy of an ADIOS Program Enable Shell-Check for gh-actions scripts Enable Shell-Check for circle CI scripts Enable Shell-Check for tau contract scripts Enable Shell-Check for scorpio contract scripts Enable Shell-Check for lammps contract scripts Delete VTK code in examples Fix MATLAB bindings for MacOS (ornladios#3950) Set the compiler for the Kokkos DataMan example to what is used to build Kokkos Fix the HIP architecture CMAKE variable (ornladios#3931) perfstubs 2023-11-27 (845d0702) (ornladios#3944) Revert "Only rank 0 should print the initialization message in perfstub"
* master: Update readme for heat transfer example with new location and build instructions Ignore tests with defects for now Adapt libfabric dataplane of SST to Cray CXI provider (ornladios#3672) ci: fix path to lsan suppressions, fix broken gh status post Use adios2_mode_readRandomAccess in matlab open to make it work for BP5 (ornladios#3956) Add Global Array Capabilities and Limitations Add Section for Anatomy of an ADIOS Program Enable Shell-Check for gh-actions scripts Enable Shell-Check for circle CI scripts Enable Shell-Check for tau contract scripts Enable Shell-Check for scorpio contract scripts Enable Shell-Check for lammps contract scripts Delete VTK code in examples Fix MATLAB bindings for MacOS (ornladios#3950) Set the compiler for the Kokkos DataMan example to what is used to build Kokkos Fix the HIP architecture CMAKE variable (ornladios#3931) perfstubs 2023-11-27 (845d0702) (ornladios#3944) Revert "Only rank 0 should print the initialization message in perfstub" CI Contract: Build examples with external ADIOS Example using DataMan with Kokkos buffers Propagating the GPU logic inside the DataMan engine ci: Use mpich built with ch3:sock:tp for faster tests ReadMe.md: Mention 2.9.2 release Cleanup server output a bit (ornladios#3914) ci: set openmpi and openmp params Example using Kokkos buffers with SST Changes to MallocV to take into consideration the memory space of a variable Change install directory of Gray scott files again ci,crusher: increase supported num branches ci: add shellcheck coverage to source and testing Change install directory of Gray scott files Only rank 0 should print the initialization message in perfstub Defining and computing derived variables (ornladios#3816) Add Remote "-status" command to see if a server is running and where (ornladios#3911) examples,hip: use find_package(hip) once in proj Add Steps Tutorial Add Operators Tutorial Add Attributes Tutorial Add Variables Tutorial Add Hello World Tutorial Add Tutorials' Download and Build section Add Tutorials' Overview section Improve bpStepsWriteRead* examples Rename bpSZ to bpOperatorSZWriter Convert bpAttributeWriter to bpAttributeWriteRead Improve bpWriter/bpReader examples Close file after reading for hello-world.py Fix names of functions in engine Fix formatting warnings Add dataspaces.rst in the list of engines Add query.rst cmake: find threads package first docs: update new_release.md Bump version to v2.9.2 ci: update number of task for mpich build clang-format: Correct format to old style Merge pull request ornladios#3878 from anagainaru/test-null-blocks Merge pull request ornladios#3588 from vicentebolea/fix-mpi-dp bp5: make RecMap an static anon namespaced var Replace LookupWriterRec's linear search on RecList with an unordered_map. For 250k variables, time goes from 21sec to ~1sec in WSL. The order of entries in RecList was not necessary for the serializer to work correctly. (ornladios#3877) Fix data length calculation for hash (ornladios#3875) Merge pull request ornladios#3823 from eisenhauer/SstMemSel gha,ci: update checkout to v4 Blosc2 USE ON: Fix Module Fallback cmake: correct prefer_shared_blosc behavior cmake: correct info.h installation path ci: disable MGARD static build operators: fix module library ci: add downloads readthedocs cmake: Add Blosc2 2.10.1 compatibility. Fix destdir install test (ornladios#3850) cmake: update minimum cmake to 3.12 (ornladios#3849) MPI: add timeout for conf test for MPI_DP (ornladios#3848) MPI_DP: do not call MPI_Init (ornladios#3847) install: export adios2 device variables (ornladios#3819) Merge pull request ornladios#3799 from vicentebolea/support-new-yaml-cpp Merge pull request ornladios#3737 from vicentebolea/fix-evpath-plugins-path Partial FFS Upstream, only changes to type_id bpls -l with scalar string variable: print the value (since min/max is empty). This changes the code for all types using Engine.Get() to get the value now. Set AWS version requirement to 1.10.15 and also turn it OFF by default as it is not a stable feature of ADIOS just yet. Fix local values block reading docs,ci: backport fixes for readthedocs
* master: Have HDF5 write raise error if operator(s) requested (ornladios#3951) fix for ASAN issue related to JoinedDimArray handling in BP5 deserializer (ornladios#3963) New operator MDR, for refactoring floating point arrays using MGARD's new MDR extension. (ornladios#3826) restricted http transport from windows builds. XMLConfigTest: Add RemoveIO test adios2::core::ADIOS: Initialize new IO objects with config file removed unsused variable Update readme for heat transfer example with new location and build instructions Ignore tests with defects for now Adapt libfabric dataplane of SST to Cray CXI provider (ornladios#3672) ci: fix path to lsan suppressions, fix broken gh status post Use adios2_mode_readRandomAccess in matlab open to make it work for BP5 (ornladios#3956) Add Global Array Capabilities and Limitations Add Section for Anatomy of an ADIOS Program Enable Shell-Check for gh-actions scripts Enable Shell-Check for circle CI scripts Enable Shell-Check for tau contract scripts Enable Shell-Check for scorpio contract scripts Enable Shell-Check for lammps contract scripts Delete VTK code in examples Fix MATLAB bindings for MacOS (ornladios#3950) Set the compiler for the Kokkos DataMan example to what is used to build Kokkos Fix the HIP architecture CMAKE variable (ornladios#3931) perfstubs 2023-11-27 (845d0702) (ornladios#3944) Revert "Only rank 0 should print the initialization message in perfstub" Formatting Formatting Revision Added buffered data receive in the client side. A socket version of HTTP connector. Proxy server host is hardwired to "localhost" and port to 9999 Remote bpls: bpls -E bp4 -T "Library=HTTP" /remote_path/myVector_cpp.bp -d bpInts
This seems to hang on full-scale Frontier (half-scale works fine). My current suspicion is that this might have to do with the manual data progress; maybe #3964 can help there. |
This PR is a first step towards trying to make use of libfabric+CXI on Frontier for SST streaming.
It successfully connects to the provider in
init_fabric()
in my tests, and the hello_sst examples finishes without error – but for some reason without loading any data. I will provide a detailed run log later (I lost my earlier logs somehow and a new job is currently enqueued), but essentiallyfi_cq_sread
returns one finished task from the queue that reports a successful remote read of zero bytes.When changing the data size in the helloSst examples, different errors occur (permission violations or (if I remember correctly) error code 90 - message too long) depending on the chosen data size.
As you can see in the diff, the configuration of libfabric is somewhat different from the one used in SST so far.
The most important differences are:
fi_mr_reg()
, but maybe this flag has more implications.The old implementation uses FI_MR_BASIC which implies FI_MR_LOCAL. I just noticed that I did not set this flag for some reason. I don't remember if this is necessary or if I just forgot this. I will need to try this out.Setting FI_MR_LOCAL does not make a difference, I pushed a commit containing it.fi_cq_sread()
into a blocking/synchronous operation (it is called without a timeout in ADIOS2).Note that I did not test the latest two commits (cleanup and documentation) yet, but their changes are not major.Most things in this PR are currently hardcoded specifically for the requirements of the CXI provider.
Once I get a job on the system again, I will try to be a bit more specific in some things and answer some questions above, and provide a logfile.
cc @pnorbert @eisenhauer
I used this submission script in my tests:
Current diff: release_29...franzpoeschel:ADIOS2:libfabric-cray