nc_put_vars_double (or float) fails in parallel #448
Comments
I think I reported this bug back in 2012; please see the first message and the follow-up. (I had to use the Wayback Machine to dig up the corresponding ticket (NCF-152)...) A fix would have to use the method described in the HDF5 FAQ.
@gsjaardema is this issue still active or should it be closed? If it's active, what should we do to fix it?
@edwardhartnett After looking at the code in Note that my original bug report mentioned above includes a minimal example you can use to check this yourself and to create an automatic test. To fix it you would need to add a block of code to
@edwardhartnett Hmm. I may have written the comment above too soon. Sorry. Let me run that minimal example myself -- I'll report when I actually feel like I have something to say.
@edwardhartnett All right. I re-built NetCDF 4.8.1 (with HDF5 1.12.0) with debugging symbols, built my minimal example (see the link above; it needs one more line ( One of the two processes was waiting in an I was wrong about the way to fix this... but I wasn't too far off: you do need to create "empty" write requests to use with One fix I can imagine would alter this loop:
I realize that this happens at the dispatch level (i.e. this code does not know that we're writing to an HDF5 file), so it may be necessary to alter code for other backends to make sure they can handle "empty" requests. I explained what is going on in my follow-up e-mail from 10 years ago (some code locations changed and
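The "empty write request" idea from the comment above can be sketched without MPI. The following is a minimal model in plain C, not netCDF or HDF5 code: the names max_calls and empty_requests_needed are hypothetical, and the per-rank call counts are passed in as an array, whereas a real parallel program would presumably obtain the maximum with a reduction such as MPI_Allreduce using MPI_MAX. Each rank pads its own writes with zero-element requests until it has made as many collective calls as the busiest rank.

```c
#include <assert.h>
#include <stddef.h>

/* Largest number of write calls that any rank will make.
 * Assumption: in a real MPI program this value would come from a
 * reduction over all ranks; here it is computed from a plain array. */
size_t max_calls(size_t nranks, const size_t *calls_per_rank)
{
    assert(nranks > 0);
    size_t largest = calls_per_rank[0];
    for (size_t r = 1; r < nranks; r++) {
        if (calls_per_rank[r] > largest)
            largest = calls_per_rank[r];
    }
    return largest;
}

/* Number of zero-element ("empty") collective writes a rank has to
 * add so that its total number of calls equals target_calls. */
size_t empty_requests_needed(size_t own_calls, size_t target_calls)
{
    assert(own_calls <= target_calls);
    return target_calls - own_calls;
}
```

For example, with three ranks making 6, 0 and 4 writes, the target is 6 calls, so the ranks would add 0, 6 and 2 empty requests respectively. On the HDF5 side, an empty request is usually expressed by selecting nothing in the dataspaces (H5Sselect_none) before calling H5Dwrite; whether the other backends accept such a request is the open question raised above.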
I feel like an idiot (and I may need more coffee). I finally realized that I keep talking about an issue with Sorry about all this noise.
OK, note that the varm functions are deprecated. Basically they are so complicated that no one here even understands what they do or how they are supposed to work. ;-)
Summary of Issue
NOTE: my dvarput.c is modified from 4.5.1-devel as described in #447 -- the early return if nels==0 has been removed.
If nc_put_vars_double is called in parallel with stride != 1, some processors have data to output and some do not, and netcdf-4 (HDF5-based) output is being used in collective mode, then the code will hang, because only the processors with data to output call down into the H5Dwrite function. H5Dwrite assumes that all processors will call it whether they have data or not, and it uses PMPI_Allreduce further down the call stack.

The issue arises in NCDEFAULT_put_vars. If the stride is 1, everything works, since all processors call NC_put_vars at line 246 of dvarput.c (4.5.1-devel).

However, if the stride is not 1, the code falls through to the odometer code below that. All processors call odom_init, but the body of the while loop is then executed only by the processors that have data.

If netcdf-4 (HDF5-based) collective output is being done, the code will then hang below H5Dwrite because the HDF5 library calls PMPI_Allreduce.

I don't have a suggested fix for this issue. I tried rewriting my code to use nc_put_vara_double instead, but that is not easily done for this particular call.

This does work if I use pnetcdf non-collective output, and probably also with netcdf-4 non-collective output.
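
To make the mismatch concrete, here is a small model in plain C. It is a sketch under stated assumptions, not the library's code: it assumes the odometer loop issues one single-element write per selected element, so that a rank's number of collective calls equals the product of its per-dimension counts. The function names odometer_write_calls and collective_calls_match are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of single-element writes an odometer-style loop would issue
 * on one rank: one call per selected element, i.e. the product of the
 * per-dimension counts. A zero count in any dimension gives zero calls. */
size_t odometer_write_calls(size_t ndims, const size_t *count)
{
    size_t calls = 1;
    for (size_t d = 0; d < ndims; d++)
        calls *= count[d];
    return calls;
}

/* Returns 1 if every rank would issue the same number of collective
 * calls, 0 otherwise. A 0 result models the situation in which some
 * ranks wait forever inside a collective that the others never enter. */
int collective_calls_match(size_t nranks, const size_t *calls_per_rank)
{
    assert(nranks > 0);
    for (size_t r = 1; r < nranks; r++) {
        if (calls_per_rank[r] != calls_per_rank[0])
            return 0;
    }
    return 1;
}
```

With counts {2, 3} on one rank and {0, 3} on another, the model gives 6 calls versus 0, so collective_calls_match reports a mismatch, which corresponds to the hang described above.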